
Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis

Evgeniy Gabrilovich and Shaul Markovitch Department of Computer Science Technion—Israel Institute of Technology, 32000 Haifa, Israel {gabr,shaulm}@cs.technion.ac.il

Abstract

Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.

1 Introduction

How related are “cat” and “mouse”? And what about “preparing a manuscript” and “writing an article”? Reasoning about semantic relatedness of natural language utterances is routinely performed by humans but remains an insurmountable obstacle for computers. Humans do not judge text relatedness merely at the level of text words. Words trigger reasoning at a much deeper level that manipulates concepts: the basic units of meaning that serve humans to organize and share their knowledge. Thus, humans interpret the specific wording of a document in the much larger context of their background knowledge and experience.

It has long been recognized that in order to process natural language, computers require access to vast amounts of common-sense and domain-specific world knowledge [Buchanan and Feigenbaum, 1982; Lenat and Guha, 1990]. However, prior work on semantic relatedness was based on purely statistical techniques that did not make use of background knowledge [Baeza-Yates and Ribeiro-Neto, 1999; Deerwester et al., 1990], or on lexical resources that incorporate very limited knowledge about the world [Budanitsky and Hirst, 2006; Jarmasz, 2003].

We propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic representation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of natural concepts derived from Wikipedia (http://en.wikipedia.org), the largest encyclopedia in existence. We employ text classification techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text.

The contributions of this paper are threefold. First, we present Explicit Semantic Analysis, a new approach to representing semantics of natural language texts using natural concepts. Second, we propose a uniform way of computing relatedness of both individual words and arbitrarily long text fragments. Finally, the results of using ESA for computing semantic relatedness of texts are superior to the existing state of the art. Moreover, using Wikipedia-based concepts makes our model easy to interpret, as we illustrate with a number of examples in what follows.

2 Explicit Semantic Analysis

Our approach is inspired by the desire to augment text representation with massive amounts of world knowledge. We represent texts as a weighted mixture of a predetermined set of natural concepts, which are defined by humans themselves and can be easily explained. To achieve this aim, we use concepts defined by Wikipedia articles, e.g., COMPUTER SCIENCE, INDIA, or LANGUAGE. An important advantage of our approach is thus the use of vast amounts of highly organized human knowledge encoded in Wikipedia. Furthermore, Wikipedia undergoes constant development, so its breadth and depth steadily increase over time.
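To make this representation concrete, the following is a minimal Python sketch of the idea: a text is mapped to a weighted vector over Wikipedia concepts, and relatedness is the cosine of two such vectors. The inverted index INDEX, its toy weights, and the helper names are illustrative assumptions for this sketch, not the paper's actual implementation or real Wikipedia data.

```python
# Minimal sketch of the ESA idea: texts become weighted vectors over
# Wikipedia concepts, and relatedness is the cosine of those vectors.
# The tiny INDEX below is a hypothetical stand-in for an index built
# from real Wikipedia articles.
import math
from collections import Counter, defaultdict

# Hypothetical word -> {concept: weight} index (e.g., TF-IDF weights).
INDEX = {
    "cat":      {"Cat": 0.9, "Felidae": 0.4, "Pet": 0.3},
    "mouse":    {"Mouse": 0.9, "Rodent": 0.5, "Cat": 0.2},
    "keyboard": {"Computer keyboard": 0.8, "Mouse": 0.3},
}

def interpretation_vector(text):
    """Represent a text as a weighted vector over Wikipedia concepts:
    sum the concept vectors of its words, weighted by word frequency."""
    vec = defaultdict(float)
    for word, freq in Counter(text.lower().split()).items():
        for concept, weight in INDEX.get(word, {}).items():
            vec[concept] += freq * weight
    return vec

def cosine(u, v):
    """Conventional cosine metric between two sparse vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Relatedness of two texts = cosine of their concept vectors.
print(cosine(interpretation_vector("cat"), interpretation_vector("mouse")))
```

Note that this treatment is uniform: an individual word is simply a one-word text, so the same cosine comparison applies to single words and to arbitrarily long fragments alike.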

We opted to use Wikipedia because it is currently the largest knowledge repository on the Web. Wikipedia is available in dozens of languages, while its English version is the largest of all, with 400+ million words in over one million articles (compared to 44 million words in 65,000 articles in Encyclopaedia Britannica¹). Interestingly, the open editing approach yields remarkable quality: a recent study [Giles, 2005] found Wikipedia's accuracy to rival that of Britannica.

¹ http://store.britannica.com (visited on May 12, 2006).

