We use machine learning techniques to build a semantic interpreter that maps fragments of natural language text into a weighted sequence of Wikipedia concepts ordered by their relevance to the input. This way, input texts are represented as weighted vectors of concepts, called interpretation vectors. The meaning of a text fragment is thus interpreted in terms of its affinity with a host of Wikipedia concepts. Computing semantic relatedness of texts then amounts to comparing their vectors in the space defined by the concepts, for example, using the cosine metric [Zobel and Moffat, 1998]. Our semantic analysis is explicit in the sense that we manipulate manifest concepts grounded in human cognition, rather than “latent concepts” used by Latent Semantic Analysis.
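For concreteness, the comparison step can be sketched as follows. This is a minimal illustration of the cosine metric over sparse interpretation vectors; the dictionary representation and the function name are ours rather than part of the original implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse interpretation vectors,
    each given as a {concept: weight} dictionary."""
    # Dot product over the concepts that appear in both vectors.
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```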

Observe that input texts are given in the same form as Wikipedia articles, that is, as plain text. Therefore, we can use conventional text classification algorithms [Sebastiani, 2002] to rank the concepts represented by these articles according to their relevance to the given text fragment. It is this key observation that allows us to use the encyclopedia directly, without the need for deep language understanding or pre-cataloged common-sense knowledge. The choice of encyclopedia articles as concepts is quite natural, as each article is focused on a single issue, which it discusses in detail.

Each Wikipedia concept is represented as an attribute vector of words that occur in the corresponding article. Entries of these vectors are assigned weights using the TFIDF scheme [Salton and McGill, 1983]. These weights quantify the strength of association between words and concepts.
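As an illustrative sketch, attribute vectors of this kind can be computed as follows. The log-weighted TF and IDF used here are just one common variant of the TFIDF scheme of [Salton and McGill, 1983], not necessarily the exact formulation used in our system; the tokenization and data structures are assumptions of ours.

```python
import math
from collections import Counter

def build_concept_vectors(articles):
    """articles: {concept: list of word tokens in the article}.
    Returns {concept: {word: tfidf_weight}}, i.e. one attribute
    vector of words per Wikipedia concept."""
    n = len(articles)
    # Document frequency: number of articles in which each word occurs.
    df = Counter()
    for tokens in articles.values():
        df.update(set(tokens))
    vectors = {}
    for concept, tokens in articles.items():
        tf = Counter(tokens)
        vectors[concept] = {
            word: (1.0 + math.log(count)) * math.log(n / df[word])
            for word, count in tf.items()
        }
    return vectors
```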

To speed up semantic interpretation, we build an inverted index, which maps each word into a list of concepts in which it appears. We also use the inverted index to discard insignificant associations between words and concepts by removing those concepts whose weights for a given word are too low.
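A corresponding sketch of the inverted index construction, including the pruning of weak word-concept associations, might look like the following. The fixed threshold and its value are illustrative assumptions; the exact pruning criterion is not specified here.

```python
from collections import defaultdict

def build_inverted_index(concept_vectors, min_weight=0.01):
    """concept_vectors: {concept: {word: weight}} as built above.
    Returns {word: [(concept, weight), ...]}, dropping associations
    whose weight for the word falls below min_weight."""
    index = defaultdict(list)
    for concept, vector in concept_vectors.items():
        for word, weight in vector.items():
            if weight >= min_weight:
                index[word].append((concept, weight))
    return index
```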

[Figure 1: Semantic interpreter. Building the semantic interpreter: Wikipedia articles (a weighted list of concepts) are processed into a weighted inverted index. Using the semantic interpreter: Text1 and Text2 are each mapped to a weighted vector of Wikipedia concepts; vector comparison then yields the relatedness estimation.]

#  | Input: “equipment”                | Input: “investor”
1  | Tool                              | Investment
2  | Digital Equipment Corporation     | Angel investor
3  | Military technology and equipment | Stock trader
4  | Camping                           | Mutual fund
5  | Engineering vehicle               | Margin (finance)
6  | Weapon                            | Modern portfolio theory
7  | Original equipment manufacturer   | Equity investment
8  | French Army                       | Exchange-traded fund
9  | Electronic test equipment         | Hedge fund
10 | Distance Measuring Equipment      | Ponzi scheme

Table 1: First ten concepts in sample interpretation vectors.

We implemented the semantic interpreter as a centroid-based classifier [Han and Karypis, 2000], which, given a text fragment, ranks all the Wikipedia concepts by their relevance to the fragment. Given a text fragment, we first represent it as a vector using the TFIDF scheme. The semantic interpreter iterates over the text words, retrieves corresponding entries from the inverted index, and merges them into a weighted vector of concepts that represents the given text. Let $T = \{w_i\}$ be the input text, and let $\langle v_i \rangle$ be its TFIDF vector, where $v_i$ is the weight of word $w_i$. Let $\langle k_j \rangle$ be an inverted index entry for word $w_i$, where $k_j$ quantifies the strength of association of word $w_i$ with Wikipedia concept $c_j$, $c_j \in \{c_1, \ldots, c_N\}$ (where $N$ is the total number of Wikipedia concepts). Then, the semantic interpretation vector $V$ for text $T$ is a vector of length $N$, in which the weight of each concept $c_j$ is defined as $\sum_{w_i \in T} v_i \cdot k_j$. Entries of this vector reflect the relevance of the corresponding concepts to text $T$. To compute semantic relatedness of a pair of text fragments we compare their vectors using the cosine metric.
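The interpretation step itself can then be sketched as below; it merges the inverted-index entries of the input words into a single weighted concept vector, following the formula above. The term-frequency weighting of the input is a simplification of the TFIDF weights $v_i$, and the helper names are ours.

```python
from collections import Counter, defaultdict

def interpret(text_tokens, inverted_index):
    """Map a tokenized text fragment T to its interpretation vector V,
    where V[c_j] = sum over words w_i in T of v_i * k_j."""
    # Simple term-frequency weights stand in for the TFIDF weights v_i.
    input_weights = Counter(text_tokens)
    vector = defaultdict(float)
    for word, v_i in input_weights.items():
        for concept, k_j in inverted_index.get(word, []):
            vector[concept] += v_i * k_j
    return dict(vector)
```

Relatedness of two fragments is then obtained by applying the cosine function sketched earlier to their two interpretation vectors.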

Figure 1 illustrates the process of Wikipedia-based semantic interpretation. Further implementation details are available in [Gabrilovich, In preparation].

In our earlier work [Gabrilovich and Markovitch, 2006], we used a similar method for generating features for text categorization. Since text categorization is a supervised learning task, words occurring in the training documents serve as valuable features; consequently, in that work we used Wikipedia concepts to augment the bag of words. On the other hand, computing semantic relatedness of a pair of texts is essentially a “one-off” task; therefore, we replace the bag of words representation with one based on concepts.

To illustrate our approach, we show the ten highest-scoring Wikipedia concepts in the interpretation vectors for sample text fragments. When concepts in each vector are sorted in decreasing order of their score, the top ten concepts are the most relevant ones for the input text. Table 1 shows the most relevant Wikipedia concepts for individual words (“equipment” and “investor”, respectively), while Table 2 uses longer passages as examples. It is particularly interesting to juxtapose the interpretation vectors for fragments that contain ambiguous words. Table 3 shows the first entries in the vectors for phrases that contain the ambiguous words “bank” and “jaguar”. As can be readily seen, our semantic interpretation methodology is capable of performing word sense disambiguation by considering ambiguous words in the context of their neighbors.

3 Empirical Evaluation

We implemented our ESA approach using a Wikipedia snapshot as of March 26, 2006. After parsing the Wikipedia XML dump, we obtained 2.9 Gb of text in 1,187,839 articles. Upon

