Python for NLP and the Natural Language Toolkit - page 19 / 47


Tokenization (continued)

Tokenization is harder than it seems

I’ll see you in New York.

The aluminum-export ban.

The simplest approach is to use “graphic words” (i.e., separate words using whitespace)
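A quick illustration of why "graphic words" fall short (plain `str.split` stands in for whitespace tokenization here): punctuation stays glued to the last word.

```python
# Whitespace ("graphic word") tokenization: split on whitespace only.
# Note the trailing period stays attached to "York." -- the simplest
# approach cannot separate punctuation from words.
text = "I'll see you in New York."
tokens = text.split()
print(tokens)
```

Running this yields `["I'll", 'see', 'you', 'in', 'New', 'York.']`, so downstream code would treat `York.` and `York` as different words.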

Another approach is to use regular expressions to specify which substrings are valid words.
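A sketch of the regex approach using the standard-library `re` module; the pattern below is an illustrative choice (not NLTK's default): it keeps contractions like "I'll" together while splitting punctuation into its own tokens.

```python
import re

# The pattern defines which substrings count as valid words:
#   \w+(?:'\w+)?  -- a word, optionally with a contraction suffix ("I'll")
#   [^\w\s]       -- any single punctuation character as its own token
pattern = r"\w+(?:'\w+)?|[^\w\s]"
text = "I'll see you in New York."
tokens = re.findall(pattern, text)
print(tokens)
```

This produces `["I'll", 'see', 'you', 'in', 'New', 'York', '.']`: the period is now a separate token, unlike with whitespace splitting.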

NLTK provides a generic tokenization interface: TokenizerI
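To keep this example self-contained, the sketch below uses a minimal stand-in for the interface (the real `TokenizerI` lives in `nltk.tokenize.api`): every tokenizer exposes a `tokenize()` method, so different strategies are interchangeable. The `SimpleRegexpTokenizer` subclass is hypothetical, not an NLTK class.

```python
from abc import ABC, abstractmethod
import re

# Minimal stand-in for NLTK's TokenizerI interface: subclasses must
# implement tokenize(), which returns a list of token strings.
class TokenizerI(ABC):
    @abstractmethod
    def tokenize(self, s):
        """Return the list of tokens found in the string s."""

# Hypothetical implementation: tokenization driven by a regex pattern.
class SimpleRegexpTokenizer(TokenizerI):
    def __init__(self, pattern):
        self._regexp = re.compile(pattern)

    def tokenize(self, s):
        return self._regexp.findall(s)

tok = SimpleRegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")
print(tok.tokenize("The aluminum-export ban."))
```

On the slide's second example this prints `['The', 'aluminum', '-', 'export', 'ban', '.']`; whether the hyphenated compound should stay one token is exactly the kind of decision the pattern encodes.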
