Mining meaning from Wikipedia
Introduction
Wikipedia requires little introduction or explanation. As everyone knows, it was launched in 2001 with the goal of building free encyclopedias in all languages. Today it is easily the largest and most widely used encyclopedia in existence. Wikipedia has become something of a phenomenon among computer scientists as well as the general public. It represents a vast investment of freely given manual effort and judgment, and the last few years have seen a multitude of papers that apply it to a host of different problems. This paper provides the first comprehensive summary of this research (up to mid-2008), which we collect under the deliberately vague umbrella of mining meaning from Wikipedia. By meaning, we encompass everything from concepts, topics, and descriptions to facts, semantic relations, and ways of organizing information. Mining involves both gathering meaning into machine-readable structures (such as ontologies), and using it in areas like information retrieval and natural language processing.
Traditional approaches to mining meaning fall into two broad camps. On one side are carefully hand-crafted resources, such as thesauri and ontologies. These resources are generally of high quality, but by necessity are restricted in size and coverage. They rely on the input of experts, who cannot hope to keep abreast of the incalculable tide of new discoveries and topics that arise constantly. Even the most extensive manually created resource—the Cyc ontology, whose hundreds of contributors have toiled for 20 years—has limited size and patchy coverage (Sowa, 2004). The other extreme is to sacrifice quality for quantity and obtain knowledge by performing large-scale analysis of unstructured text. However, human language is rife with inconsistency, and our intuitive understanding of it cannot be entirely replicated in rules or trends, no matter how much data they are based upon. Approaches based on statistical inference might emulate human intelligence for particular purposes and in specific situations, but cracks appear when generalizing or moving into new domains and tasks.
Wikipedia provides a middle ground between these two camps—quality and quantity—by offering a rare mix of scale and structure. With two million articles and thousands of contributors, it dwarfs any other manually created resource by an order of magnitude in the number of concepts covered, has far greater potential for growth, and offers a wealth of further useful structural features. It contains around 18 GB of text, and its extensive network of links, categories, and infoboxes provides a variety of explicitly defined semantics that other corpora lack. One must, however, keep Wikipedia in perspective. It does not always engender the same level of trust or expectations of quality as traditional resources, because its contributors are largely unknown and unqualified. It is also far smaller and less representative of all human language use than the web as a whole. Nevertheless, Wikipedia has received enthusiastic attention as a promising natural language and informational resource of unexpected quality and utility. Here we focus on research that makes use of Wikipedia, and as far as possible leave aside its controversial nature.
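To make the notion of "explicitly defined semantics" concrete, the sketch below shows how attribute–value pairs can be pulled out of an infobox in raw wiki markup. This is a minimal illustration, not a production parser; the example markup and field names are hypothetical, and real infoboxes vary widely in template and formatting.

```python
import re

# Hypothetical snippet of wiki markup containing an infobox;
# the field names here are illustrative, not a fixed Wikipedia schema.
wikitext = """{{Infobox company
| name = Apple Inc.
| founded = 1976
| headquarters = Cupertino, California
}}"""

def parse_infobox(text: str) -> dict:
    """Collect '| key = value' lines from an infobox into a dict."""
    fields = {}
    for key, value in re.findall(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$",
                                 text, re.MULTILINE):
        fields[key] = value
    return fields

print(parse_infobox(wikitext))
# → {'name': 'Apple Inc.', 'founded': '1976',
#    'headquarters': 'Cupertino, California'}
```

Even this naive extraction yields structured attribute–value data of a kind that plain-text corpora simply do not provide, which is why infoboxes figure so prominently in the mining research surveyed below.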
This paper is structured as follows. In Section 2 we describe Wikipedia's creation process and structure, and how it is viewed by computer scientists as anything from a corpus, taxonomy, thesaurus, or hierarchy of knowledge topics to a full-blown ontology. The next three sections describe different research applications. Section 3 explains how it is being drawn upon for natural language processing: understanding written text. In Section 4 we describe its applications for information retrieval: searching through documents, organizing them, and answering questions. Section 5 focuses on information extraction and ontology building—mining text for topics, relations and facts—and asks whether this adds up to Tim Berners-Lee's vision of the Semantic Web. Section 6 documents the people and research groups involved, and the resources they have produced, with URLs. Section 7 gives a brief overall summary.
Section snippets
Wikipedia: a resource for mining meaning
Wikipedia, one of the most visited sites on the web, outstrips all other encyclopedias in size and coverage. Its English-language articles alone are 10 times the size of the Encyclopedia Britannica, its nearest rival. But material in English constitutes only a quarter of Wikipedia; it has articles in 250 other languages as well. Co-founder Jimmy Wales is on record as saying that he aspires to distribute a free encyclopedia to every person on the planet, in their own language.
Solving natural language-processing tasks
Natural language-processing applications fall into two major groups: (i) those using symbolic methods, where the system utilizes a manually encoded repository of human language, and (ii) those using statistical methods, which infer properties of language by processing large text corpora. The problem with the former is a dearth of high-quality knowledge bases. Even the lexical database WordNet, which, as the largest of its kind, receives substantial attention (Fellbaum, 1998), has been criticized for low
Information retrieval
Wikipedia is already one of the most popular web sites for locating information. Here we ask how it can be used to make information easier to obtain from elsewhere—how to apply it to organize and locate other resources.
Given its applications for natural language processing (Section 3), it is not surprising to see Wikipedia leveraged to gain a deeper understanding of both queries and documents, and improve how they are matched to each other. Section 4.1 describes how it has been used to expand
Information extraction and ontology building
Whereas information retrieval aims to answer specific questions, information extraction seeks to deduce meaningful structures from unstructured data such as natural language text—though in practice the dividing line between the fields is not sharp. These structures are usually represented as relations. For example, from:
Apple Inc.'s world corporate headquarters are located in the middle of Silicon Valley, at 1 Infinite Loop, Cupertino, California.
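A pattern-based sketch can illustrate how a relation might be deduced from such a sentence. The function name, the `headquartered-in` relation label, and the lexical pattern below are all illustrative assumptions, not a method from the literature surveyed here; real information-extraction systems use far more robust machinery than a single regular expression.

```python
import re

# An illustrative lexico-syntactic pattern: it looks for
# "<Entity>'s ... headquarters are located in ..., <City>, <Region>."
# and emits a (subject, relation, object) triple.
PATTERN = re.compile(
    r"(?P<subject>[A-Z][\w.]*(?: [A-Z][\w.]*)*)'s "
    r".*?headquarters are located in .*?, "
    r"(?P<object>[A-Z][\w ]+, [A-Z][\w ]+)\.$"
)

def extract_headquarters(sentence: str):
    """Return a (subject, 'headquartered-in', object) triple, or None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("subject"), "headquartered-in", match.group("object"))

sentence = ("Apple Inc.'s world corporate headquarters are located in the "
            "middle of Silicon Valley, at 1 Infinite Loop, Cupertino, California.")
print(extract_headquarters(sentence))
# → ('Apple Inc.', 'headquartered-in', 'Cupertino, California')
```

The output triple is exactly the kind of structure—subject, relation, object—that information-extraction systems accumulate at scale, and that ontology-building efforts then organize.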
People, places and resources
Wikipedia began with the goal of distributing a free encyclopedia to every person on the planet, in their own language. It is reassuring to see the research community that benefits so much from Wikipedia maintaining the same international perspective. The research described in this survey is scattered across the globe. Fig. 14 shows prominent countries and institutions at the time of writing (mid-2008, as stated at the outset).
The US and Germany are the largest contributors. In the US, research is
Summary
A whole host of researchers have been quick to grasp the potential of Wikipedia as a resource for mining meaning: the literature is large and growing rapidly. We began this article by describing Wikipedia's creation process and structure (Section 2). The unique open editing philosophy, which accounts for its success, is subversive. Although regarded as suspect by the academic establishment, it is a remarkable concrete realization of the American pragmatist philosopher Peirce's proposal that
Acknowledgements
We warmly thank Evgeniy Gabrilovich, Rada Mihalcea, Dan Weld, Sören Auer, Fabian Suchanek and the YAGO team for their valuable comments on a draft of this paper. We are also grateful to Enrico Motta and Susan Wiedenbeck for guiding us in the right direction. Medelyan is supported by a scholarship from Google, Milne by the New Zealand Tertiary Education Commission.
References (148)
- et al. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems (1998)
- et al. Semantic Wikipedia. Journal of Web Semantics (2007)
- et al. Context-based spelling correction. Information Processing and Management (1991)
- et al. Fact Discovery in Wikipedia
- et al. Finding similar sentences across multiple languages in Wikipedia
- Multilingual Agricultural Thesaurus (1995)
- et al. Using Wikipedia at the TREC QA Track
- HARD track overview in TREC 2005: high accuracy retrieval from documents
- Auer, S., Bizer, C., Lehmann, J., Kobilarov, G., Cyganiak, R., Ives, Z., 2007. DBpedia: a nucleus for a web of open...
- et al. What have Innsbruck and Leipzig in common? Extracting semantics from Wiki content
- Boosting inductive transfer for text classification using Wikipedia
- Open information extraction from the Web
- Extracting named entities and relating them over time based on Wikipedia. Informatica
- Network analysis for Wikipedia
- The semantic web. Scientific American
- The mathematics of statistical machine translation: parameter estimation. Computational Linguistics
- Using encyclopedic knowledge for named entity disambiguation
- A bag-of-words based ranking method for the Wikipedia question answering task. Evaluation of Multilingual and Multi-modal Information Retrieval
- N-Gram-based text categorization
- The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering
- Towards large-scale, open-domain and ontology-based named entity classification
- Wikipedia risks. Communications of the ACM
- The Wikipedia XML corpus. SIGIR Forum
- Extracting trust from domain analysis: a case study on the Wikipedia Project. Autonomous and Trusted Computing
- Green's Functions with Applications
- Inductive learning algorithms and representations for text categorization
- Introduction to the special issue on evaluating word sense disambiguation systems. Journal of Natural Language Engineering
- Collaborative authoring on the Web: a genre analysis of online encyclopedias
- An approach for extracting bilingual terminology from Wikipedia
- Placing search in context: the concept revisited. ACM Transactions on Information Systems
- Internet encyclopaedias go head to head. Nature
- An analysis of topical coverage of Wikipedia. Journal of Computer-Mediated Communication
- Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering