the semantic interpreter is several hundred words per second.
Datasets and Evaluation Procedure
Humans have an innate ability to judge semantic relatedness of texts. Human judgements on a reference set of text pairs can thus be considered correct by definition, a kind of “gold standard” against which computer algorithms are evaluated. Several studies measured inter-judge correlations and found them to be consistently high [Budanitsky and Hirst, 2006; Jarmasz, 2003; Finkelstein et al., 2002], r = 0.88–0.95. These findings are to be expected—after all, it is this consensus that allows people to understand each other.
For word relatedness, we used the WordSimilarity-353 collection2 [Finkelstein et al., 2002]. The human judgements were averaged for each pair to produce a single relatedness score, and the Spearman rank-order correlation coefficient was used to compare with human judgements.

Table 4: Computing word relatedness

  Algorithm                                  Correlation with humans
  WordNet [Jarmasz, 2003]                    0.33–0.35
  Roget’s Thesaurus [Jarmasz, 2003]          0.55
  LSA [Finkelstein et al., 2002]             0.56
  WikiRelate! [Strube and Ponzetto, 2006]    0.19–0.48
  ESA-Wikipedia                              0.75
  ESA-ODP                                    0.65

Table 5: Computing text relatedness

  Algorithm                                  Correlation with humans
  Bag of words [Lee et al., 2005]            –
  LSA [Lee et al., 2005]                     0.60
  ESA-Wikipedia                              0.72
  ESA-ODP                                    0.69
For document similarity, we used a collection of 50 documents from the Australian Broadcasting Corporation’s news mail service [Lee et al., 2005]. These documents were paired in all possible ways, and each of the 1,225 pairs has 8–12 human judgements. When the human judgements are averaged for each pair, the resulting 1,225 relatedness scores have only 67 distinct values. Spearman correlation is not appropriate in this case because of the many tied ranks, and we therefore used Pearson’s linear correlation coefficient.
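The pair count quoted above follows directly from the collection size; a quick sanity check (the document names are placeholders):

```python
from itertools import combinations

docs = [f"doc{i:02d}" for i in range(50)]  # stand-ins for the 50 ABC news documents
pairs = list(combinations(docs, 2))        # every unordered pair exactly once
print(len(pairs))                          # 1225 = C(50, 2), matching the text
```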
Table 4 shows the results of applying our methodology to estimating relatedness of individual words. As we can see, both ESA techniques yield substantial improvements over prior studies. ESA also achieves much better results than the other Wikipedia-based method recently introduced [Strube and Ponzetto, 2006]. Table 5 shows the results for computing relatedness of entire documents.
On both test collections, Wikipedia-based semantic interpretation is superior to the ODP-based one. Two factors contribute to this phenomenon. First, the axes of a multi-dimensional interpretation space should ideally be as orthogonal as possible. However, the hierarchical organization of the ODP defines a generalization relation between concepts, and thus obviously violates this orthogonality requirement. Second, to increase the amount of training data for building the ODP-based semantic interpreter, we crawled all the URLs cataloged in the ODP. This increased the amount of textual data by several orders of magnitude, but also brought in a non-negligible amount of noise, which is common in Web pages. On the other hand, Wikipedia articles are virtually noise-free, and mostly qualify as Standard Written English.

4 Related Work

The ability to quantify semantic relatedness of texts underlies many fundamental tasks in computational linguistics, including word sense disambiguation, information retrieval, word and text clustering, and error correction [Budanitsky and Hirst, 2006]. Prior work in the field pursued three main directions: comparing text fragments as bags of words in vector space [Baeza-Yates and Ribeiro-Neto, 1999], using lexical resources, and using Latent Semantic Analysis (LSA) [Deerwester et al., 1990]. The first technique is the simplest, but performs sub-optimally when the texts to be compared share few words, for instance, when they use synonyms to convey similar messages. It is also trivially inappropriate for comparing individual words. The latter two techniques attempt to circumvent this limitation.
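The weakness of the bag-of-words approach discussed in Related Work is easy to demonstrate. In this minimal sketch (the sentences are invented), two texts that convey the same message through synonyms share no tokens, so their cosine similarity is zero:

```python
from collections import Counter
from math import sqrt

def cosine_bow(a: str, b: str) -> float:
    # Bag-of-words cosine similarity over whitespace-separated tokens.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Synonymous sentences with no words in common score zero:
print(cosine_bow("doctors cure illness", "physicians heal sickness"))  # 0.0
```

Lexical resources and LSA both try to bridge exactly this gap, by relating words through explicit links or through co-occurrence statistics rather than surface identity.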
2 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353
3 Despite its name, this test collection is designed for testing word relatedness and not merely similarity, as instructions for human judges specifically directed the participants to assess the degree of relatedness of the words. For example, in the case of antonyms, judges were instructed to consider them as “similar” rather than “dissimilar”.
Lexical databases such as WordNet [Fellbaum, 1998] or Roget’s Thesaurus [Roget, 1852] encode relations between words such as synonymy and hypernymy. Quite a few metrics have been defined that compute relatedness using various properties of the underlying graph structure of these resources [Budanitsky and Hirst, 2006; Jarmasz, 2003; Banerjee and Pedersen, 2003; Resnik, 1999; Lin, 1998; Jiang and Conrath, 1997; Grefenstette, 1992]. The obvious drawback of this approach is that the creation of lexical resources requires lexicographic expertise as well as a great deal of time and effort, and consequently such resources cover only a small fragment of the language lexicon. Specifically, such resources contain few proper names, neologisms, slang terms, and domain-specific technical terms. Furthermore, these resources have a strong lexical orientation: they mainly contain information about individual words but little world knowledge in general.
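The path-based family of metrics cited above can be illustrated with a toy example. The miniature hypernym graph below is invented; real metrics operate over WordNet's full graph, and many also weight edges by information content. A common simple form maps shortest-path length d to a similarity of 1 / (1 + d):

```python
from collections import deque

# Invented miniature hypernym hierarchy, stored as an undirected adjacency list.
edges = {
    "entity": ["animal", "vehicle"],
    "animal": ["entity", "dog", "cat"],
    "vehicle": ["entity", "car"],
    "dog": ["animal"], "cat": ["animal"], "car": ["vehicle"],
}

def path_similarity(a: str, b: str) -> float:
    # BFS shortest-path length in the graph, mapped into (0, 1].
    if a == b:
        return 1.0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt == b:
                return 1.0 / (1 + d + 1)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return 0.0  # no path: unrelated in this resource

print(path_similarity("dog", "cat"))  # 0.333...: dog -> animal -> cat (2 edges)
print(path_similarity("dog", "car"))  # 0.2: dog -> animal -> entity -> vehicle -> car (4 edges)
```

The sketch also makes the coverage problem concrete: any word absent from the hand-built graph (a proper name, a neologism) simply gets similarity 0 to everything.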
WordNet-based techniques are similar to ESA in that both approaches manipulate a collection of concepts. There are, however, several important differences. First, WordNet-based methods are inherently limited to individual words, and their