scale knowledge repositories. We use Wikipedia and the ODP, the largest knowledge repositories of their kind, which contain hundreds of thousands of human-defined concepts and provide a cornucopia of information about each concept. Our approach is called Explicit Semantic Analysis, since it uses concepts explicitly defined and described by humans.
Compared to LSA, which only uses statistical cooccurrence information, our methodology explicitly uses the knowledge collected and organized by humans. Compared to lexical resources such as WordNet, our methodology leverages knowledge bases that are orders of magnitude larger and more comprehensive.
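To make the contrast with LSA concrete, the following is a minimal sketch of the idea behind explicit concept vectors: a text is mapped to a weighted vector over human-defined concepts, and two texts are compared by the cosine of their vectors. The three-concept repository, the whitespace tokenizer, and the function names (`build_index`, `concept_vector`, `relatedness`) are assumptions made for illustration only; they stand in for a real repository such as Wikipedia and are not the implementation evaluated in this paper.

```python
import math
from collections import Counter

# Hypothetical toy "knowledge repository": concept name -> article text.
# A real system would use hundreds of thousands of Wikipedia articles.
CONCEPTS = {
    "Computer": "computer hardware software program processor memory",
    "Banking": "bank money loan deposit account interest",
    "River": "river water bank stream flow flood",
}

def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()

def build_index(concepts):
    """Build an inverted index: word -> {concept: TF-IDF weight}."""
    n = len(concepts)
    tf = {c: Counter(tokenize(t)) for c, t in concepts.items()}
    # Document frequency: in how many concept articles each word occurs.
    df = Counter(w for counts in tf.values() for w in counts)
    index = {}
    for concept, counts in tf.items():
        for word, k in counts.items():
            index.setdefault(word, {})[concept] = k * math.log(n / df[word])
    return index

def concept_vector(text, index):
    """Sum the per-word concept weights to get a vector over concepts."""
    vec = Counter()
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return vec

def relatedness(a, b, index):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = concept_vector(a, index), concept_vector(b, index)
    dot = sum(va[c] * vb[c] for c in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

INDEX = build_index(CONCEPTS)
print(relatedness("bank", "river", INDEX))
print(relatedness("bank", "processor", INDEX))
```

Because every dimension of the vector is a named concept, a score can be explained by listing the concepts the two texts share, which is the sense in which the representation is explicit.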
[Gabrilovich and Markovitch, 2006] Evgeniy Gabrilovich and Shaul Markovitch. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In AAAI’06, pages 1301–1306, July 2006.
[Gabrilovich, In preparation] Evgeniy Gabrilovich. Feature Generation for Textual Information Retrieval Using World Knowledge. PhD thesis, Department of Computer Science, Technion—Israel Institute of Technology, Haifa, Israel, In preparation.
[Giles, 2005] Jim Giles. Internet encyclopaedias go head to head. Nature, 438:900–901, 2005.
[Grefenstette, 1992] Gregory Grefenstette. SEXTANT: Exploring unexplored contexts for semantic extraction from syntactic analysis. In ACL’92, pages 324–326, 1992.
Empirical evaluation confirms that using ESA leads to substantial improvements in computing word and text relatedness. Compared with the previous state of the art, using ESA results in notable improvements in correlation of computed relatedness scores with human judgements: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Furthermore, due to the use of natural concepts, the ESA model is easy to explain to human users.
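The correlation figures quoted in the empirical evaluation compare computed relatedness scores with human judgements over the same set of items. As a minimal sketch of how such a figure can be obtained, the snippet below computes Pearson's r from two parallel lists. The function name `pearson_r` and the five score pairs are made up for illustration; they are not data from this paper, and the passage does not state which correlation coefficient was used for each dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Illustrative, made-up scores for five word pairs (not from the paper).
human = [3.9, 0.4, 2.8, 1.1, 3.5]          # averaged human ratings
computed = [0.81, 0.05, 0.47, 0.30, 0.66]  # scores from some relatedness model

print(round(pearson_r(human, computed), 3))
```

Pearson's r is unaffected by shifting or positively rescaling either list, so the human ratings and the model scores do not need to share a scale.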
[Han and Karypis, 2000] Eui-Hong (Sam) Han and George Karypis. Centroid-based document classification: Analysis and experimental results. In PKDD’00, September 2000.
[Jarmasz, 2003] Mario Jarmasz. Roget’s thesaurus as a lexical resource for natural language processing. Master’s thesis, University of Ottawa, 2003.
[Jiang and Conrath, 1997] Jay J. Jiang and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In ROCLING’97, 1997.
We thank Michael D. Lee and Brandon Pincombe for making available their document similarity data. This work was partially supported by funding from the EC-sponsored MUSCLE Network of Excellence.
[Lee et al., 2005] Michael D. Lee, Brandon Pincombe, and Matthew Welsh. An empirical evaluation of models of text document similarity. In CogSci2005, pages 1254–1259, 2005.
[Lee, 1999] Lillian Lee. Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the ACL, 1999.
[Lenat and Guha, 1990] D. Lenat and R. Guha. Building Large Knowledge Based Systems. Addison Wesley, 1990.
[Baeza-Yates and Ribeiro-Neto, 1999] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, New York, NY, 1999.
[Banerjee and Pedersen, 2003] Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In IJCAI, pages 805–810, 2003.
[Buchanan and Feigenbaum, 1982] B. G. Buchanan and E. A. Feigenbaum. Foreword. In R. Davis and D. B. Lenat, editors, Knowledge-Based Systems in Artificial Intelligence. McGraw-Hill, 1982.
[Budanitsky and Hirst, 2006] Alexander Budanitsky and Graeme Hirst. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47, 2006.
[Dagan et al., 1999] Ido Dagan, Lillian Lee, and Fernando C. N. Pereira. Similarity-based models of word cooccurrence probabilities. Machine Learning, 34(1–3):43–69, 1999.
[Lin, 1998] Dekang Lin. An information-theoretic definition of word similarity. In ICML’98, 1998.
[Mihalcea et al., 2006] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI’06, July 2006.
[Miller and Charles, 1991] George A. Miller and Walter G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.
[Resnik, 1999] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. JAIR, 11:95–130, 1999.
[Roget, 1852] Peter Roget. Roget’s Thesaurus of English Words and Phrases. Longman Group Ltd., 1852.
[Rubenstein and Goodenough, 1965] Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.
[Deerwester et al., 1990] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.
[Fellbaum, 1998] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
[Finkelstein et al., 2002] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. ACM TOIS, 20(1):116–131, January 2002.
[Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. Feature generation for text categorization using world knowledge. In IJCAI’05, pages 1048–1053, 2005.
[Sahami and Heilman, 2006] Mehran Sahami and Timothy Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW’06. ACM Press, May 2006.
[Salton and McGill, 1983] G. Salton and M.J. McGill. An Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[Sebastiani, 2002] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Comp. Surv., 34(1):1–47, 2002.
[Strube and Ponzetto, 2006] Michael Strube and Simon Paolo Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In AAAI’06, Boston, MA, 2006.
[Zobel and Moffat, 1998] Justin Zobel and Alistair Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18–34, 1998.