Who links to Wikipedia?

AllWiki.png
Here are the top external sites that link to Wikipedia, based on overall link volume for all Wikipedia languages and all top-level domains. Graph by Gianluca Demartini, CC-BY-SA 4.0.
To learn more about who links to Wikipedia, our research team at the University of Sheffield analyzed the structure of links that point to Wikipedia pages from external websites, looking specifically at which top-level domains dominate the link volume for each Wikipedia language. Here’s what we found.

Key findings

  • The English Wikipedia receives the most links from external websites, followed by the Spanish, Indonesian, and German Wikipedias.
  • Websites under blogspot.com contribute many links to Wikipedia. For the Spanish, Indonesian, and Portuguese Wikipedias they contribute more than all the other .com websites.
  • The main top-level domains that link to the English Wikipedia are: .com, .org, .blogspot.com, .net, .edu, and .co.uk.
  • Out of the 36 million links to Wikipedia, 18 million links are from .com domains to the English Wikipedia.
  • Most links to the German and French Wikipedias come from .de and .fr domains respectively.

 

Data

In early 2015, the website WikiReverse.org published raw data about all 36 million hyperlinks from external websites to Wikipedia pages. The data was extracted from Common Crawl, a large crawl of the Web run by the Common Crawl Foundation in July 2014 (read more about the extraction process on the WikiReverse website).
We analyzed the WikiReverse dataset to understand the linking structure from the external Web to Wikipedia. We aggregated this data by the top-level domain (e.g., .com, .net) of the websites linking to Wikipedia, using a manually curated list of top-level domains. We then visualized the aggregated data as tree maps to show how the link volume is distributed across domains and Wikipedia languages.
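The aggregation step described above could be sketched roughly as follows. This is a minimal illustration and not the code from the notebook: the suffix list, the function names, and the assumed input shape (pairs of source URL and Wikipedia URL) are all hypothetical, and the real WikiReverse data layout may differ.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical stand-in for the manually curated list used in the post.
CURATED_SUFFIXES = {"blogspot.com", "co.uk", "com", "org", "net", "edu", "de", "fr"}


def match_suffix(host, suffixes=CURATED_SUFFIXES):
    """Return the longest curated suffix that the host name ends with."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate first, so "blogspot.com" wins over "com".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return "." + candidate
    return "other"


def wiki_language(wikipedia_url):
    """Take the first host label of a Wikipedia URL as the language code."""
    host = urlsplit(wikipedia_url).hostname or ""
    return host.split(".")[0]


def count_links(pairs):
    """Count links per (Wikipedia language, source suffix) pair."""
    counts = Counter()
    for source_url, wikipedia_url in pairs:
        source_host = urlsplit(source_url).hostname or ""
        counts[(wiki_language(wikipedia_url), match_suffix(source_host))] += 1
    return counts


# Made-up example links, for illustration only.
links = [
    ("http://a.blogspot.com/p", "https://es.wikipedia.org/wiki/Madrid"),
    ("http://news.example.com/x", "https://en.wikipedia.org/wiki/Web"),
    ("http://www.example.co.uk/", "https://en.wikipedia.org/wiki/Web"),
    ("http://b.example.com/", "https://en.wikipedia.org/wiki/Link"),
]
counts = count_links(links)
```

Matching the longest suffix first is what lets an entry such as blogspot.com be counted separately from the rest of .com.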
The result for all Wikipedia languages and all data is shown in the graph above. The first line of text in each box indicates the Wikipedia language. The second line indicates the top-level domain. The size of each box is proportional to the number of links from websites with a given top-level domain to a given Wikipedia language. An interactive plot with the link volume for all Wikipedias is also available.
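To make the box sizing concrete, here is a small sketch of how link counts translate into relative box areas. The function name and the two-entry input are illustrative assumptions; only the 18 million and 36 million figures come from the key findings.

```python
def box_shares(counts):
    """Map each (language, domain) pair to its fraction of the total link volume."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# 18 million of the 36 million links go from .com domains to the English
# Wikipedia; everything else is lumped into one placeholder entry here.
shares = box_shares({
    ("en", ".com"): 18_000_000,
    ("all other", "all other"): 18_000_000,
})
```

Under these numbers the English/.com box alone would cover half of the tree map area.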
Because the English Wikipedia dominates the visualization, here is the same visualization restricted to the non-English Wikipedia languages:
NonEngWiki.png
Here are the top ‘non-English’ external sites that link to Wikipedia, excluding link volume for English-language top-level domains. Graph by Gianluca Demartini, CC-BY-SA 4.0.
An interactive plot with the link volume for non-English Wikipedias is also available.
The code used to process the raw data from WikiReverse can be found in this IPython notebook.

Limitations

The original data used for this analysis is Common Crawl, which is an incomplete crawl of the Web. This means that all observations hold only for the data within Common Crawl and may not be representative of the entire Web.
Gianluca Demartini, Lecturer in Data Science at the University of Sheffield. gianlucademartini.net

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

9 Comments

“blogspot.com” is not a TLD, clearly. Did you include all second-level domains above a certain frequency? Are you normalising all blogspot domains (blogspot.de, blogspot.co.uk etc. etc.) to “.com”? Did I understand correctly that you are counting only the number of domains and not of links? If so, why? This says little on how many or how important links they bring. For instance, if blog1.blogspot.com used a different URL structure, blogspot.com/blog1/, it would be only one domain linking us, but same result for traffic. The method skews the data in favour of the second level domains which use many subdomains to…

Thanks for the comments. Some clarifications follow. I decided not to split domains on “.” but rather to use this list https://publicsuffix.org/list/effective_tld_names.dat as under the blogspot.com domain totally different websites exist. In the list there are 43 different blogspot.* which are counted separately. As stated, there is a bias on how the crawl has worked. I count each link from each website to any Wikipedia page so there should be no bias on the number of subdomains. The linking data comes from wikireverse.org. This analysis gives no information about registered domains as it only looks at those with links to…

I am really surprised by the small rate of .eu domain. Is it really so unused?

According to general CommonCrawl statistics, .eu is not in the top-10 most used top-level domains (see https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit ). That report also refers to http://w3techs.com/technologies/overview/top_level_domain/all which claims that .eu is used by 0.5% of all websites.

.eu however is important for some language editions, for instance the rank for it.wikipedia is com, it, org, eu, net

wikipedia its great, it was better to put very few ads

blogspot.com is not a TLD is subdomain 😀
