scale knowledge repositories. We use Wikipedia and the ODP, the largest knowledge repositories of their kind, which contain hundreds of thousands of human-defined concepts and provide a cornucopia of information about each concept. Our approach is called Explicit Semantic Analysis, since it uses concepts explicitly defined and described by humans.

Compared to LSA, which only uses statistical cooccurrence information, our methodology explicitly uses the knowledge collected and organized by humans. Compared to lexical resources such as WordNet, our methodology leverages knowledge bases that are orders of magnitude larger and more comprehensive.

Empirical evaluation confirms that using ESA leads to substantial improvements in computing word and text relatedness. Compared with the previous state of the art, using ESA results in notable improvements in correlation of computed relatedness scores with human judgements: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Furthermore, due to the use of natural concepts, the ESA model is easy to explain to human users.
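As an illustration of how a correlation between computed relatedness scores and human judgements can be obtained, the following minimal sketch computes a correlation coefficient over paired scores. It assumes Pearson's linear coefficient; a rank-based coefficient would apply the same formula to ranks instead of raw values. The function name and the sample scores are hypothetical and are not taken from our experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, made-up scores for four word pairs (illustration only).
human_scores = [9.0, 7.5, 3.0, 1.0]         # averaged human ratings
computed_scores = [0.82, 0.64, 0.30, 0.05]  # scores from a relatedness model
r = pearson(human_scores, computed_scores)
print(f"r = {r:.2f}")
```

A value of r close to 1 indicates that the computed scores order and space the pairs much as the human raters did.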

6 Acknowledgments

We thank Michael D. Lee and Brandon Pincombe for making available their document similarity data. This work was partially supported by funding from the EC-sponsored MUSCLE Network of Excellence.

IJCAI-07 1611

