Tokenization is harder than it seems
I’ll see you in New York.
The aluminum-export ban.
The simplest approach is to use “graphic words” (i.e., separate words using whitespace).
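
As a minimal sketch of the whitespace approach, using only Python's built-in `str.split` (the helper name `whitespace_tokenize` is made up for illustration), here is what happens to the two example sentences. Punctuation stays glued to the neighbouring word, and "New York" is split in two:

```python
def whitespace_tokenize(text):
    """Split text into "graphic words": maximal runs of non-whitespace characters."""
    return text.split()


examples = ["I’ll see you in New York.", "The aluminum-export ban."]
for sentence in examples:
    print(whitespace_tokenize(sentence))
# ['I’ll', 'see', 'you', 'in', 'New', 'York.']
# ['The', 'aluminum-export', 'ban.']
```

Whether "York." and "ban." should keep their trailing period, and whether "aluminum-export" is one word or two, are exactly the decisions this approach cannot express.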
Another approach is to use regular expressions to specify which substrings are valid words.
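
A rough sketch of the regular-expression approach, using Python's standard `re` module. The pattern below is only one possible definition of a "valid word" (an assumption for illustration, not a standard): a run of word characters that may contain internal hyphens or apostrophes, or else a single punctuation mark.

```python
import re

# Hypothetical pattern: a word with optional internal hyphens/apostrophes,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-’']\w+)*|[^\w\s]")


def regexp_tokenize_sketch(text):
    """Return every substring of text that matches TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


print(regexp_tokenize_sketch("I’ll see you in New York."))
# ['I’ll', 'see', 'you', 'in', 'New', 'York', '.']
print(regexp_tokenize_sketch("The aluminum-export ban."))
# ['The', 'aluminum-export', 'ban', '.']
```

Changing the pattern changes the tokenization policy: dropping the hyphen from the character class, for instance, would split "aluminum-export" into three tokens.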
NLTK provides a generic tokenization interface, TokenizerI, which individual tokenizers implement.
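
As an illustration of how the interface is meant to be used, a custom tokenizer can subclass `TokenizerI` and implement its `tokenize(s)` method. The class `HyphenSplittingTokenizer` below is a made-up example, not part of NLTK, and the `except ImportError` branch is only a stand-in so the sketch still runs where NLTK is not installed.

```python
try:
    from nltk.tokenize.api import TokenizerI
except ImportError:
    # Stand-in for illustration only; NOT NLTK's real class.
    class TokenizerI:
        def tokenize(self, s):
            raise NotImplementedError


class HyphenSplittingTokenizer(TokenizerI):
    """Hypothetical tokenizer: whitespace splitting that also breaks on hyphens."""

    def tokenize(self, s):
        # Treat hyphens as word separators, then split on whitespace.
        return s.replace("-", " ").split()


tokenizer = HyphenSplittingTokenizer()
print(tokenizer.tokenize("The aluminum-export ban."))
# ['The', 'aluminum', 'export', 'ban.']
```

Because every tokenizer exposes the same `tokenize` method, code that consumes tokens does not need to know which tokenization policy is in use.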