Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis


Input: “U.S. intelligence cannot say conclusively that Saddam Hussein has weapons of mass destruction, an information gap that is complicating White House efforts to build support for an attack on Saddam’s Iraqi regime. The CIA has advised top administration officials to assume that Iraq has some weapons of mass destruction. But the agency has not given President Bush a “smoking gun,” according to U.S. intelligence and administration officials.”

 1. Iraq disarmament crisis
 2. Yellowcake forgery
 3. Senate Report of Pre-war Intelligence on Iraq
 4. Iraq and weapons of mass destruction
 5. Iraq Survey Group
 6. September Dossier
 7. Iraq War
 8. Scott Ritter
 9. Iraq War- Rationale
10. Operation Desert Fox

Input: “The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.”

 1. Leukemia
 2. Severe combined immunodeficiency
 3. Cancer
 4. Non-Hodgkin lymphoma
 5. AIDS
 6. ICD-10 Chapter II: Neoplasms; Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
 7. Bone marrow transplant
 8. Immunosuppressive drug
 9. Acute lymphoblastic leukemia
10. Multiple sclerosis

Table 2: First ten concepts of the interpretation vectors for sample text fragments.

Ambiguous word: “Bank”

#  | “Bank of America”                    | “Bank of Amazon”
1  |                                      | Amazon River
2  | Bank of America                      | Amazon Basin
3  | Bank of America Plaza (Atlanta)      | Amazon Rainforest
4  | Bank of America Plaza (Dallas)       |
5  |                                      |
6  | VISA (credit card)                   | Atlantic Ocean
7  | Bank of America Tower, New York City |
8  | NASDAQ                               | Loreto Region
9  | MasterCard                           | River
10 | Bank of America Corporate Center     | Economy of Brazil

Ambiguous word: “Jaguar”

#  | “Jaguar car models”               | “Jaguar (Panthera onca)”
1  | Jaguar (car)                      |
2  | Jaguar S-Type                     |
3  | Jaguar X-type                     | Black panther
4  | Jaguar E-Type                     |
5  | Jaguar XJ                         |
6  |                                   |
7  | British Leyland Motor Corporation | Panthera hybrid
8  | Luxury vehicles                   | Cave lion
9  | V8 engine                         | American lion
10 | Jaguar Racing                     |


Table 3: First ten concepts of the interpretation vectors for texts with ambiguous words.

removing small and overly specific concepts (those having fewer than 100 words and fewer than 5 incoming or outgoing links), 241,393 articles were left. We processed the text of these articles by removing stop words and rare words, and stemming the remaining words; this yielded 389,202 distinct terms, which served for representing Wikipedia concepts as attribute vectors.
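The preprocessing described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Article` record, the toy stop-word list, the crude `toy_stem` function (a stand-in for a real stemmer), the `min_doc_freq` threshold for rare words, and the reading of the link criterion as the sum of incoming and outgoing links.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Toy stop-word list (assumption; a real list would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


@dataclass
class Article:
    """Hypothetical record for one Wikipedia article."""
    title: str
    text: str
    in_links: int
    out_links: int


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def is_too_small(article, min_words=100, min_links=5):
    """True for articles that are both short and sparsely linked."""
    n_words = len(tokenize(article.text))
    n_links = article.in_links + article.out_links  # one possible reading
    return n_words < min_words and n_links < min_links


def build_vocabulary(articles, min_doc_freq=2):
    """Filter articles, then collect stemmed terms that are not rare."""
    kept = [a for a in articles if not is_too_small(a)]
    doc_freq = Counter()
    for article in kept:
        terms = {toy_stem(w) for w in tokenize(article.text)
                 if w not in STOP_WORDS}
        doc_freq.update(terms)  # each term counted once per article
    vocab = {t for t, df in doc_freq.items() if df >= min_doc_freq}
    return kept, vocab
```

The set returned by `build_vocabulary` plays the role of the distinct terms used as attributes; the thresholds are parameters so that other readings of the pruning rule can be tried.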

To better evaluate Wikipedia-based semantic interpretation, we also implemented a semantic interpreter based on another large-scale knowledge repository, the Open Directory Project (ODP, http://www.dmoz.org). The ODP is the largest Web directory to date, where concepts correspond to categories of the directory, e.g., TOP/COMPUTERS/ARTIFICIAL INTELLIGENCE. In this case, interpretation of a text fragment amounts to computing a weighted vector of ODP concepts, ordered by their affinity to the input text.
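As a rough illustration of interpreting a text fragment as a weighted vector over concepts, the sketch below scores each concept by summing term weights from an inverted index. The index contents, the second category name, and the use of raw term counts as input weights are all assumptions for illustration, not the weighting scheme used in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy inverted index: term -> {concept: weight}.
INVERTED_INDEX = {
    "neural": {"TOP/COMPUTERS/ARTIFICIAL INTELLIGENCE": 0.9},
    "network": {
        "TOP/COMPUTERS/ARTIFICIAL INTELLIGENCE": 0.4,
        "TOP/COMPUTERS/INTERNET": 0.7,
    },
    "router": {"TOP/COMPUTERS/INTERNET": 0.8},
}


def interpret(text, index=INVERTED_INDEX):
    """Return (concept, score) pairs sorted by decreasing score."""
    term_counts = Counter(text.lower().split())
    scores = defaultdict(float)
    for term, count in term_counts.items():
        # Terms missing from the index contribute nothing.
        for concept, weight in index.get(term, {}).items():
            scores[concept] += count * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `interpret("neural network")` ranks the artificial intelligence category first, because both input terms carry weight for it.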








We built the ODP-based semantic interpreter using an ODP snapshot as of April 2004. After pruning the Top/World branch that contains non-English material, we obtained a hierarchy of over 400,000 concepts and 2,800,000 URLs.
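Pruning a branch of the hierarchy can be sketched as a prefix filter over category paths. The dictionary layout (path mapped to a list of URLs) and the sample paths are hypothetical; only the Top/World branch name comes from the text.

```python
def prune_branch(categories, branch="Top/World"):
    """Drop the given branch and everything below it.

    `categories` maps a category path to its list of URLs (assumed layout).
    """
    prefix = branch.rstrip("/") + "/"
    return {
        path: urls
        for path, urls in categories.items()
        if path != branch and not path.startswith(prefix)
    }
```

Matching on `branch + "/"` rather than on the bare branch name keeps sibling categories whose names merely begin with the same characters.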

Textual descriptions of the concepts and URLs amounted to 436 MB of text. In order to increase the amount of training information, we further populated the ODP hierarchy by crawling all of its URLs, and taking the first 10 pages encountered at each site. After eliminating HTML markup and truncating overly long files, we ended up with 70 GB of additional textual data. After removing stop words and rare words, we obtained 20,700,000 distinct terms that were used to represent ODP nodes as attribute vectors. Up to 1000 most informative attributes were selected for each ODP node using the document frequency criterion [Sebastiani, 2002]. A centroid classifier was then trained, with the training set for each concept built by concatenating the crawled content of all the URLs cataloged under this concept. Further implementation details are available in [Gabrilovich and Markovitch, 2005].
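The attribute selection and centroid training described above can be sketched as follows. This is a simplified illustration under stated assumptions: document frequency is counted over the documents of a single concept, terms are weighted by raw counts, and similarity is cosine; the function names are hypothetical and the paper's actual implementation may differ.

```python
import math
from collections import Counter


def select_attributes(docs, k):
    """Keep up to k terms that occur in the most documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    # Sort by document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def train_centroid(docs, k=1000):
    """Concatenate a concept's documents into one unit-length vector."""
    keep = select_attributes(docs, k)
    counts = Counter(w for w in " ".join(docs).lower().split() if w in keep)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine_to_centroid(text, centroid):
    """Cosine similarity between a text and a unit-length centroid."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return sum(c * centroid.get(t, 0.0) for t, c in counts.items()) / norm


def classify(text, centroids):
    """Return the name of the concept whose centroid is most similar."""
    return max(centroids,
               key=lambda name: cosine_to_centroid(text, centroids[name]))
```

The full list of similarities, rather than only the best match returned by `classify`, would give a weighted vector of concepts for the input text.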

Using world knowledge requires additional computation. This extra computation includes the (one-time) preprocessing step where the semantic interpreter is built, as well as the actual mapping of input texts into interpretation vectors, performed online. On a standard workstation, the throughput of

IJCAI-07 1608
