
Python for NLP and the Natural Language Toolkit - page 19 / 47


Tokenization (continued)

Tokenization is harder than it seems

I’ll see you in New York.

The aluminum-export ban.

The simplest approach is to use “graphic words” (i.e., separate words using whitespace)
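A quick sketch of the "graphic words" approach on the slide's own example sentences — plain `str.split()` in Python behaves this way:

```python
# "Graphic words": split on whitespace only.
sentences = ["I'll see you in New York.", "The aluminum-export ban."]
for s in sentences:
    print(s.split())
# Punctuation stays glued to the last word ("York.", "ban."), and the
# splitter cannot decide whether "New York" is one token or two, or
# whether "aluminum-export" should be split at the hyphen.
```

This is why the examples above are hard: whitespace alone cannot separate punctuation or resolve multi-word names and hyphenated compounds.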

Another approach is to use regular expressions to specify which substrings are valid words.
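A minimal regex tokenizer along these lines, using Python's standard `re` module. The pattern here is an illustrative assumption (a "word" is a run of word characters, possibly joined by internal hyphens or apostrophes; anything else that is not whitespace is its own token), not NLTK's built-in default:

```python
import re

# Illustrative pattern: words may contain internal hyphens/apostrophes;
# other non-space characters (e.g. punctuation) become separate tokens.
PATTERN = r"\w+(?:[-']\w+)*|[^\w\s]"

def regex_tokenize(text):
    """Return the substrings of `text` that the pattern accepts as tokens."""
    return re.findall(PATTERN, text)

print(regex_tokenize("I'll see you in New York."))
print(regex_tokenize("The aluminum-export ban."))
```

Unlike whitespace splitting, this keeps "I'll" and "aluminum-export" whole while splitting the final period off as its own token.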

NLTK provides a generic tokenization interface: TokenizerI
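The idea behind the interface can be sketched without importing NLTK: every tokenizer exposes the same `tokenize(text)` method, so implementations are interchangeable. The standalone classes below mirror the shape of NLTK's `TokenizerI` and its `RegexpTokenizer`/`WhitespaceTokenizer` subclasses, but they are a simplified assumption, not NLTK's actual code:

```python
import re
from abc import ABC, abstractmethod

class TokenizerI(ABC):
    """Sketch of NLTK's tokenizer interface: one required method."""

    @abstractmethod
    def tokenize(self, text):
        """Return a list of tokens for the given text."""

class WhitespaceTokenizer(TokenizerI):
    """'Graphic words': split on whitespace only."""

    def tokenize(self, text):
        return text.split()

class RegexpTokenizer(TokenizerI):
    """Tokens are the substrings matching a caller-supplied pattern."""

    def __init__(self, pattern):
        self._regex = re.compile(pattern)

    def tokenize(self, text):
        return self._regex.findall(text)

# Any code written against TokenizerI works with either implementation.
for tokenizer in (WhitespaceTokenizer(), RegexpTokenizer(r"\w+|[^\w\s]")):
    print(tokenizer.tokenize("The aluminum-export ban."))
```

Programming against the interface rather than a concrete tokenizer is what lets the rest of an NLP pipeline swap tokenization strategies without other changes.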
