We kicked off this series by exploring the foundational principles of knowledge on Wikipedia: its dynamic, factual, verifiable, collectively significant, and neutral content, all presented in a way that’s digital, meaningful, accessible, and usable by machines. There’s a lot of valuable knowledge still waiting to be added to Wikipedia, and on the flip side, some content may have found its way in that doesn’t fully belong. These gaps and excesses in knowledge, which we call knowledge discrepancies, are the subject of this blog post.
Structure of Wikipedia’s knowledge
Wikipedia has over 300 language editions. While English Wikipedia has around 7 million articles, some language editions have just a few thousand. Each language edition serves a unique audience, and its articles reflect the informational priorities and the cultural and linguistic nuances of the people who use that language. The same topic can differ significantly in emphasis, depth, or even framing across languages, depending on the perspectives of the users shaping it. For example, the article on a religious figure like Jesus can look remarkably different in the English, Hebrew, Arabic, and Malayalam editions of Wikipedia, each shaped by its own cultural lens.
Within each language version, knowledge is primarily packaged into self-contained articles. Each article aims to be a one-stop source that gives readers a comprehensive understanding of a specific topic, so they can grasp all the relevant information without needing to leave the page. That brings us to the question: what elements constitute a Wikipedia article?
Wikipedia’s articles are written in wikitext, a custom markup language, and enriched by several digital components:
- Multimedia elements: Directly embedded images, videos, and audio clips enhance engagement beyond the ways traditional encyclopedias can.
- Structured data: Elements like infoboxes and property values pulled from Wikidata provide standardized, machine-readable information, making classification, comparison, and data extraction more efficient. Wikipedia also has templates, which are pre-designed, reusable pieces of code used for purposes such as navigation, citation, and providing maintenance information.
- Categories: Another organizational structure is the thematic, hierarchical tagging system known as the category tree. Tagging articles with both broad and specific topics lets users navigate the content, discover related articles, and understand where an article fits within the broader context of related topics. The category tree of most Wikipedia articles can be complex, with each category branching in different directions. The categories related to the Roman Empire are one example: a path of nested categories leads from the top-level category for all articles down to the Roman Empire.
- Internal links: Arguably the most defining structural feature, internal links transform Wikipedia from a collection of isolated articles into a vast, interconnected web of knowledge.
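To make these components concrete, here is a minimal Python sketch that pulls internal links, categories, media files, and template names out of a small wikitext snippet using regular expressions. The sample text and the function name `summarize_wikitext` are invented for illustration, and the patterns are a deliberate simplification: real wikitext allows nested templates and many edge cases that a dedicated parser would handle better.

```python
import re

# Hypothetical wikitext snippet, invented purely for illustration.
SAMPLE = """{{Infobox country|name=Roman Empire}}
The '''Roman Empire''' was ruled from [[Rome]] and later [[Constantinople|Byzantium]].
[[File:Colosseum.jpg|thumb|The Colosseum]]
{{citation needed}}
[[Category:Roman Empire]]
[[Category:Former empires]]
"""

# [[Target]] or [[Target|label]]; captures only the target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# {{Name}} or {{Name|params}}; captures only the name (no nested templates).
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\|[^{}]*)?\}\}")


def summarize_wikitext(text):
    """Group the double-bracket targets and template names found in text."""
    links, categories, media = [], [], []
    for target in LINK_RE.findall(text):
        target = target.strip()
        if target.startswith("Category:"):
            categories.append(target[len("Category:"):])
        elif target.startswith("File:"):
            media.append(target[len("File:"):])
        else:
            links.append(target)
    return {
        "links": links,
        "categories": categories,
        "media": media,
        "templates": TEMPLATE_RE.findall(text),
    }


summary = summarize_wikitext(SAMPLE)
```

In this sketch, categories and media files are separated from ordinary internal links only by their `Category:` and `File:` prefixes, which is an assumption that holds for the English-language namespace names used in the sample.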
Emergence of Knowledge Discrepancies
While Wikipedia has a robust structure, the content presented within this structural framework may contain knowledge discrepancies. A knowledge discrepancy can manifest as the absence of content on a collectively significant topic or the presence of unwanted knowledge that does not belong on Wikipedia. For clarity, we propose classifying knowledge discrepancies within articles into four categories:
- Missing content: Refers to areas where information should be present but is entirely absent. This includes missing content within articles, like a missing section on a major historical event within a country’s history article, as well as the absence of articles on significant topics. Missing content can also include missing multimedia like a key image, or structured data like birth dates on a biography article. Missing content on Wikipedia has been widely studied and is straightforward to understand. We recommend going through the “content” section of the knowledge gaps taxonomy to get a broader perspective about the classification of missing content on Wikipedia.
- Poorly structured content: Refers to information that is present but lacks proper organization, context, or accuracy, making it difficult for readers to comprehend. This could include excess or improper content that negatively affects the meaningfulness of the article. Unlike missing content, poorly structured content is often not strictly considered a knowledge discrepancy and is therefore less well studied. As we continue, we will offer a framework for classifying poorly structured content that appears within Wikipedia.
- Surplus content: Some areas on Wikipedia have a knowledge surplus, with content that is not relevant for Wikipedia. These include non-notable content, inappropriate material such as hate speech or vandalism, and content that involves copyrighted material not suitable for Wikipedia’s open-licensed framework.
- Cataloging mismatches: Beyond the content itself, the organizational backbone of Wikipedia can also be a source of discrepancies. These mismatches involve structural elements like missing or incorrectly applied categories, templates, or other structured data. Such categorical imperfections impede the discoverability of information for both humans and machines. They create “blind spots” that hinder machines from fully understanding, indexing, and leveraging Wikipedia’s content.
The rest of this blog post focuses on deconstructing poorly structured content. As mentioned earlier, missing content has already been well studied, and surplus content and cataloging mismatches are, for now, outside the scope of our interest.
Deconstructing Poorly Structured Content
Knowledge discrepancies, ranging from a typo to the absence of an entire article to a surplus of vandalized content, leave significant holes or vulnerabilities in the knowledge available to readers. Identifying poorly structured content becomes easier when seen through the lens of the kind of knowledge Wikipedia contains.
As we’ve established, Wikipedia contains dynamically updated(1), factual(2), verifiable(3), and collectively significant knowledge(4), presented neutrally(5) and organized in a digital format(6) that is meaningful(7) and accessible(8) to humans and usable for machines(9). Therefore, knowledge discrepancies can also include content that falls short of these ideals. Here’s a framework for classifying these types of discrepancies:
- Outdated: An article becomes outdated when it fails to incorporate new discoveries, significant events, or other changes in understanding that have occurred since its last update. This leaves readers with incomplete or even misleading information. For example, an article on a political issue not updated after major elections or policy changes, or a scientific article missing the recent findings in a rapidly advancing field, would both be considered outdated.
- Inaccurate: This could range from simple typos or incorrect dates to misquoted sources, flawed calculations, or fundamentally incorrect statements that misrepresent established facts. Such inaccuracies erode trust and directly contradict Wikipedia’s commitment to providing reliable information. For instance, an article might include incorrectly translated material from a foreign language source or present a historical fact with the wrong date or context.
- Unverified: Wikipedia’s core principle of verifiability demands that every claim be attributable to a reliable, published source. When content appears without any citations or relies on sources that don’t meet Wikipedia’s standards for reliability (like personal blogs, social media, or opinion pieces masquerading as fact), it falls into this category. This gap means readers cannot check the information themselves, making it untrustworthy. An example would be an article related to a disease condition claiming benefits of alternative medicine without credible citations, or one with dead external links that can no longer be checked.
- Lacking Notability: Wikipedia is an encyclopedia, not a directory or a platform for every piece of information. The “collective significance” principle means a topic must have received significant coverage in reliable, independent sources to warrant a standalone article. An article about a local celebrity who, while well-known in their immediate community, has not received significant coverage in national or international media, would be considered non-notable.
- Biased: Wikipedia’s principle of neutrality requires all significant viewpoints to be represented fairly and in proportion to their prominence in reliable sources. Bias can manifest through selective sourcing, emotionally charged or loaded language, or the outright omission of counterarguments, leading to a skewed understanding for the reader. For instance, a biography that gives excessive and undue weight to the personal life of a public figure would be unbalanced.
- Un-wikified: Content that exists on Wikipedia but lacks proper internal linking, adherence to formatting conventions, or other basic “wikification” elements. This makes the content less integrated into Wikipedia’s broader knowledge web and harder to navigate, and it fails to fully leverage the digital nature of the platform. For example, a page about a historical event may lack internal links to related individuals, locations, or other relevant articles, forcing readers to manually search for context. When the problem extends to broader styling and markup, it can make for a jarring visual experience that is difficult to read.
- Disorganized: Articles that are poorly structured, unfocused, excessively lengthy, or presented in a confusing narrative, despite containing factual information. This category also includes articles with conflicting information, logical inconsistencies, or references that are difficult to match to specific citations. These issues undermine the overall meaningfulness of the article, as readers struggle to comprehend the topic efficiently and accurately. For instance, a long medical article might read like a disjointed academic essay, lacking clear headings and sections and following a structure that is difficult to navigate. Similarly, a political article might include self-contradictory statements about a party’s stance.
- Inaccessible: Content presenting barriers to understanding, particularly for readers with disabilities or those unfamiliar with specialized terminology. This includes accessibility challenges (e.g., color-coded graphs without alternative text for colorblind individuals) as well as the use of highly technical jargon without sufficient explanation, rendering content opaque to non-experts. Wikipedia aims for universal readability, and content that requires specialized knowledge just to understand basic concepts falls short of this goal. For example, an article about a mathematical theorem that is filled with advanced technical terms may be less accessible for readers who are unfamiliar with the subject.
- Unfriendly to machines: Wikipedia’s digital nature means it’s not just for human eyes; it’s also a foundational dataset for artificial intelligence, search engines, and data analysis tools. Content that lacks infoboxes, structured data, categories, and internal links impedes a machine’s ability to understand contextual information. Wikipedia’s potential as a backbone for machine learning, advanced research tools, and countless other digital applications relies on structured, clean data. For instance, the availability of birth dates, occupations, and nationality in an infobox may enable an AI assistant to quickly answer factual questions. Or, an entire topic that isn’t adequately categorized prevents a machine from easily identifying and grouping related articles together when requested.
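The framework above pairs each of the nine principles with the discrepancy that arises when content falls short of it. As a rough illustration, the Python sketch below records that pairing in an enum and adds a few crude pattern checks for three of the categories. The names `Discrepancy` and `flag_discrepancies` are hypothetical, and the checks are assumptions made for this example, not established detection rules: a missing `<ref>` tag is only a weak proxy for unverified content, and categories such as outdated, inaccurate, or biased cannot be detected by pattern matching at all.

```python
import re
from enum import Enum


class Discrepancy(Enum):
    """Each discrepancy type, valued by the principle it falls short of."""
    OUTDATED = "dynamically updated"
    INACCURATE = "factual"
    UNVERIFIED = "verifiable"
    LACKING_NOTABILITY = "collectively significant"
    BIASED = "neutral"
    UN_WIKIFIED = "digital"
    DISORGANIZED = "meaningful"
    INACCESSIBLE = "accessible"
    UNFRIENDLY_TO_MACHINES = "usable by machines"


def flag_discrepancies(wikitext):
    """Return the subset of discrepancies suggested by simple surface checks."""
    flags = set()

    # Assumption: no <ref> tag at all hints at unverified content.
    if not re.search(r"<ref[\s>]", wikitext):
        flags.add(Discrepancy.UNVERIFIED)

    # Assumption: no internal links (ignoring categories and files)
    # hints at un-wikified content.
    if not re.search(r"\[\[(?!Category:|File:)[^\[\]]+\]\]", wikitext):
        flags.add(Discrepancy.UN_WIKIFIED)

    # Assumption: lacking either an infobox or a category hints at
    # content that is harder for machines to use.
    has_infobox = re.search(r"\{\{\s*Infobox", wikitext, re.IGNORECASE)
    has_category = "[[Category:" in wikitext
    if not (has_infobox and has_category):
        flags.add(Discrepancy.UNFRIENDLY_TO_MACHINES)

    return flags


# Two made-up snippets: a bare stub and a minimally structured article.
stub = "Jane Doe is a painter born in 1950."
structured = (
    "{{Infobox person|name=Jane Doe}} Jane Doe is a [[painter]]."
    "<ref>Example source</ref> [[Category:Painters]]"
)
stub_flags = flag_discrepancies(stub)
structured_flags = flag_discrepancies(structured)
```

Checks like these can at most surface candidates for human review; an article that passes all three may still be outdated, biased, or disorganized.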
Wrapping Up
In this blog post, we have navigated the architecture of Wikipedia’s knowledge, including the structuring of Wikipedia’s articles. We introduced a framework for mapping knowledge discrepancies, with key categories for classifying poorly structured content on Wikipedia.
We believe that understanding these knowledge gaps is the first step towards addressing them. In our next blog post, we will shift our focus to action: how to prioritize and flag these knowledge gaps to effectively improve the quality of content on Wikipedia. Stay tuned to learn how to start tackling these vital issues!


