Mining meaning from Wikipedia

https://doi.org/10.1016/j.ijhcs.2009.05.004

Abstract

Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks.

This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; using it for information extraction; and treating it as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.

Introduction

Wikipedia requires little introduction or explanation. As everyone knows, it was launched in 2001 with the goal of building free encyclopedias in all languages. Today it is easily the largest and most widely used encyclopedia in existence. Wikipedia has become something of a phenomenon among computer scientists as well as the general public. It represents a vast investment of freely given manual effort and judgment, and the last few years have seen a multitude of papers that apply it to a host of different problems. This paper provides the first comprehensive summary of this research (up to mid-2008), which we collect under the deliberately vague umbrella of mining meaning from Wikipedia. By meaning we encompass everything from concepts, topics, and descriptions to facts, semantic relations, and ways of organizing information. Mining involves both gathering meaning into machine-readable structures (such as ontologies) and using it in areas like information retrieval and natural language processing.

Traditional approaches to mining meaning fall into two broad camps. On one side are carefully hand-crafted resources, such as thesauri and ontologies. These resources are generally of high quality, but by necessity are restricted in size and coverage. They rely on the input of experts, who cannot hope to keep abreast of the incalculable tide of new discoveries and topics that arise constantly. Even the most extensive manually created resource—the Cyc ontology, whose hundreds of contributors have toiled for 20 years—has limited size and patchy coverage (Sowa, 2004). The other extreme is to sacrifice quality for quantity and obtain knowledge by performing large-scale analysis of unstructured text. However, human language is rife with inconsistency, and our intuitive understanding of it cannot be entirely replicated in rules or trends, no matter how much data they are based upon. Approaches based on statistical inference might emulate human intelligence for particular purposes and in specific situations, but cracks appear when generalizing or moving into new domains and tasks.

Wikipedia provides a middle ground between these two camps—quality and quantity—by offering a rare mix of scale and structure. With two million articles and thousands of contributors, it dwarfs any other manually created resource by an order of magnitude in the number of concepts covered, has far greater potential for growth, and offers a wealth of further useful structural features. It contains around 18 GB of text, and its extensive network of links, categories and infoboxes provides a variety of explicitly defined semantics that other corpora lack. One must, however, keep Wikipedia in perspective. It does not always engender the same level of trust or expectations of quality as traditional resources, because its contributors are largely unknown and unqualified. It is also far smaller and less representative of all human language use than the web as a whole. Nevertheless, Wikipedia has received enthusiastic attention as a promising natural language and informational resource of unexpected quality and utility. Here we focus on research that makes use of Wikipedia, and as far as possible leave aside its controversial nature.
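To make these structural features concrete, the following sketch pulls the three kinds of explicit structure (links, category tags and infobox fields) out of raw wikitext with regular expressions. It is a minimal illustration of our own, not from the paper: the sample wikitext is invented, and a real system would use a full parser such as mwparserfromhell to handle the many edge cases of MediaWiki markup.

```python
import re

# Internal links, category tags and infobox fields are all explicit
# markup in an article's wikitext, so simple patterns recover them.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:\|[^\]]*)?\]\]")
INFOBOX_FIELD_RE = re.compile(r"^\s*\|\s*([\w ]+?)\s*=\s*(.+?)\s*$", re.MULTILINE)

def extract_structure(wikitext: str) -> dict:
    """Split an article's explicit structure into links, categories and infobox fields."""
    targets = [t.strip() for t in LINK_RE.findall(wikitext)]
    categories = [t[len("Category:"):] for t in targets if t.startswith("Category:")]
    links = [t for t in targets if ":" not in t]  # drop namespaced links
    fields = dict(INFOBOX_FIELD_RE.findall(wikitext))
    return {"links": links, "categories": categories, "infobox": fields}

sample = """{{Infobox company
| name = Apple Inc.
| location = [[Cupertino, California]]
}}
Apple designs [[consumer electronics]].
[[Category:Companies based in Cupertino, California]]"""

print(extract_structure(sample))
```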

This paper is structured as follows. In Section 2 we describe Wikipedia's creation process and structure, and how it is viewed by computer scientists as anything from a corpus, taxonomy, thesaurus, or hierarchy of knowledge topics to a full-blown ontology. The next three sections describe different research applications. Section 3 explains how it is being drawn upon for natural language processing: understanding written text. In Section 4 we describe its applications for information retrieval: searching through documents, organizing them and answering questions. Section 5 focuses on information extraction and ontology building—mining text for topics, relations and facts—and asks whether this adds up to Tim Berners-Lee's vision of the Semantic Web. Section 6 documents the people and research groups involved, and the resources they have produced, with URLs. Section 7 gives a brief overall summary.

Section snippets

Wikipedia: a resource for mining meaning

Wikipedia, one of the most visited sites on the web, outstrips all other encyclopedias in size and coverage. Its English-language articles alone are 10 times the size of the Encyclopedia Britannica, its nearest rival. But material in English constitutes only a quarter of Wikipedia—it has articles in 250 other languages as well. Co-founder Jimmy Wales is on record as saying that he aspires to distribute a free encyclopedia to every person on the planet, in their own language.

This section …

Solving natural language processing tasks

Natural language processing applications fall into two major groups: (i) symbolic methods, in which the system utilizes a manually encoded repository of knowledge about human language, and (ii) statistical methods, which infer properties of language by processing large text corpora. The problem with the former is a dearth of high-quality knowledge bases. Even the lexical database WordNet, which, as the largest of its kind, receives substantial attention (Fellbaum, 1998), has been criticized for low …
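For a concrete feel of the symbolic camp, here is a minimal sketch that queries WordNet through the NLTK toolkit. The choice of library is our assumption, not the survey's; it requires nltk to be installed and the WordNet corpus downloaded.

```python
# Querying a hand-crafted symbolic resource: WordNet via NLTK.
# Assumes: pip install nltk, then a one-off nltk.download("wordnet").
from nltk.corpus import wordnet as wn

# Each synset is a manually encoded word sense with a gloss and
# hand-coded semantic relations such as hypernymy.
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```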

Information retrieval

Wikipedia is already one of the most popular web sites for locating information. Here we ask how it can be used to make information easier to obtain from elsewhere—how to apply it to organize and locate other resources.

Given its applications for natural language processing (Section 3), it is not surprising to see Wikipedia leveraged to gain a deeper understanding of both queries and documents, and improve how they are matched to each other. Section 4.1 describes how it has been used to expand …
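One simple technique of this kind is to expand a query with the titles of Wikipedia redirects, which act as alternative names for the same concept. The sketch below is our own illustration, not a method prescribed by the survey: the expand_query helper is hypothetical, and it assumes the public MediaWiki web API and the requests library.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # public MediaWiki API

def expand_query(query: str, limit: int = 5) -> list[str]:
    """Expand a query with redirect titles of its best-matching article."""
    # 1. Find the top-ranked Wikipedia article for the query.
    hits = requests.get(API, params={
        "action": "query", "list": "search", "srsearch": query,
        "srlimit": 1, "format": "json",
    }).json()["query"]["search"]
    if not hits:
        return [query]
    title = hits[0]["title"]
    # 2. Titles that redirect to that article name the same concept,
    #    so they make natural expansion terms.
    pages = requests.get(API, params={
        "action": "query", "titles": title, "prop": "redirects",
        "rdlimit": limit, "format": "json",
    }).json()["query"]["pages"]
    page = next(iter(pages.values()))
    aliases = [r["title"] for r in page.get("redirects", [])]
    return [query, title] + aliases

# Prints the query, the top article title, and up to `limit` redirect aliases.
print(expand_query("global warming"))
```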

Information extraction and ontology building

Whereas information retrieval aims to answer specific questions, information extraction seeks to deduce meaningful structures from unstructured data such as natural language text—though in practice the dividing line between the fields is not sharp. These structures are usually represented as relations. For example, from:

  • Apple Inc.'s world corporate headquarters are located in the middle of Silicon Valley, at 1 Infinite Loop, Cupertino, California.

a relation hasHeadquarters(Apple Inc., 1 Infinite Loop, Cupertino, California) can be extracted.
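As a toy illustration of pattern-based relation extraction in general (a sketch under our own assumptions, not the method of any specific system in the survey), a single hand-written pattern suffices to recover the relation from the sentence above; real systems use many such patterns, often learned from data.

```python
import re

# One hand-written pattern standing in for the learned or manually
# engineered patterns that real relation extractors employ.
PATTERN = re.compile(
    r"(?P<org>[A-Z][\w.]*(?: [A-Z][\w.]*)*)'s .*?headquarters are located .*?at (?P<loc>[^.]+)\."
)

def extract_headquarters(sentence: str):
    """Return a (relation, subject, object) triple, or None if no match."""
    m = PATTERN.search(sentence)
    return ("hasHeadquarters", m.group("org"), m.group("loc")) if m else None

s = ("Apple Inc.'s world corporate headquarters are located in the middle "
     "of Silicon Valley, at 1 Infinite Loop, Cupertino, California.")
print(extract_headquarters(s))
# ('hasHeadquarters', 'Apple Inc.', '1 Infinite Loop, Cupertino, California')
```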

People, places and resources

Wikipedia began with the goal of distributing a free encyclopedia to every person on the planet, in their own language. It is reassuring to see the research community that benefits so much from Wikipedia maintaining the same international perspective. The research described in this survey is scattered across the globe. Fig. 14 shows prominent countries and institutions at the time of writing (mid-2008, as stated at the outset).

The US and Germany are the largest contributors. In the US, research is …

Summary

A whole host of researchers have been quick to grasp the potential of Wikipedia as a resource for mining meaning: the literature is large and growing rapidly. We began this article by describing Wikipedia's creation process and structure (Section 2). The unique open editing philosophy, which accounts for its success, is subversive. Although regarded as suspect by the academic establishment, it is a remarkable concrete realization of the American pragmatist philosopher Peirce's proposal that …

Acknowledgements

We warmly thank Evgeniy Gabrilovich, Rada Mihalcea, Dan Weld, Sören Auer, Fabian Suchanek and the YAGO team for their valuable comments on a draft of this paper. We are also grateful to Enrico Motta and Susan Wiedenbeck for guiding us in the right direction. Medelyan is supported by a scholarship from Google, Milne by the New Zealand Tertiary Education Commission.

References (148)

• S. Brin et al. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems (1998)
• M. Krötzsch et al. Semantic Wikipedia. Journal of Web Semantics (2007)
• E. Mays et al. Context-based spelling correction. Information Processing and Management (1991)
• S.F. Adafre et al. Fact discovery in Wikipedia
• S.F. Adafre et al. Finding similar sentences across multiple languages in Wikipedia
• Multilingual Agricultural Thesaurus (1995)
• D. Ahn et al. Using Wikipedia at the TREC QA Track
• J. Allan. HARD track overview in TREC 2005: high accuracy retrieval from documents
• Auer, S., Bizer, C., Lehmann, J., Kobilarov, G., Cyganiak, R., Ives, Z., 2007. DBpedia: a nucleus for a web of open...
• S. Auer et al. What have Innsbruck and Leipzig in common? Extracting semantics from Wiki content
• S. Banerjee. Boosting inductive transfer for text classification using Wikipedia
• Banerjee, S., Ramanathan, K., Gupta, A., 2007. Clustering short texts using Wikipedia. In: Proceedings of the 30th...
• M. Banko et al. Open information extraction from the Web
• A. Bhole et al. Extracting named entities and relating them over time based on Wikipedia. Informatica (2007)
• F. Bellomi et al. Network analysis for Wikipedia
• T. Berners-Lee et al. The semantic web. Scientific American (2001)
• P. Brown et al. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics (1993)
  • Budanitsky, A., Hirst, G., 2001. Semantic distance in WordNet: an experimental, application-oriented evaluation of five...
• R. Bunescu et al. Using encyclopedic knowledge for named entity disambiguation
• D. Buscaldi et al. A bag-of-words based ranking method for the Wikipedia question answering task. Evaluation of Multilingual and Multi-modal Information Retrieval (2007)
  • Buscaldi, D., Rosso, P.A., 2007b. Comparison of methods for the automatic identification of locations in Wikipedia. In:...
• W.B. Cavnar et al. N-Gram-based text categorization
  • Chernov, S., Iofciu, T., Nejdl, W., Zhou, X., 2006. Extracting semantic relationships between Wikipedia categories. In:...
• R.L. Cilibrasi et al. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering (2007)
• P. Cimiano et al. Towards large-scale, open-domain and ontology-based named entity classification
• A. Csomai et al. (2007)
  • Cucerzan, S., 2007. Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 Joint...
  • Culotta, A., McCallum, A., Betz, J., 2006. Integrating probabilistic extraction models and data mining to discover...
  • Dakka, W., Cucerzan, S., 2008. Augmenting Wikipedia with Named Entity Tags. In: Proceedings of the Third International...
• P. Denning et al. Wikipedia risks. Communications of the ACM (2005)
• L. Denoyer et al. The Wikipedia XML corpus. SIGIR Forum (2006)
• P. Dondio et al. Extracting trust from domain analysis: a case study on the Wikipedia Project. Autonomous and Trusted Computing (2006)
• S. Dumais et al. Inductive learning algorithms and representations for text categorization
• P. Edmonds et al. Introduction to the special issue on evaluating word sense disambiguation systems. Journal of Natural Language Engineering (1998)
  • Egozi, O., Gabrilovich, E., Markovitch, S., 2008. Concept-based feature generation and selection for information...
• W. Emigh et al. Collaborative authoring on the Web: a genre analysis of online encyclopedias
• M. Erdmann et al. An approach for extracting bilingual terminology from Wikipedia
  • Ferrández, F., Toral, A., Ferrández, Ó., Ferrández, A., Muñoz, R., 2007. Applying Wikipedia's multilingual knowledge to...
• L. Finkelstein et al. Placing search in context: the concept revisited. ACM Transactions on Information Systems (2002)
  • Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G., 1999. Domain-specific keyphrase extraction....
• Gabrilovich, E., Markovitch, S., 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis....
• Gabrilovich, E., Markovitch, S., 2006. Overcoming the brittleness bottleneck using Wikipedia: enhancing text...
• J. Giles. Internet encyclopaedias go head to head. Nature (2005)
  • Gleim, R., Mehler, A., Dehmer, M., 2007. Web corpus mining by instance of Wikipedia. In: Kilgarriff Adam, Baroni Marco...
  • Gregorowicz, A., Kramer, M.A., 2006. Mining a large-scale term-concept network from Wikipedia. Mitre Technical Report...
• A. Halavais et al. An analysis of topical coverage of Wikipedia. Journal of Computer-Mediated Communication (2008)
• T.H. Haveliwala. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering (2003)