Some tokenization "test" to do ! We are checking so-so and so - so and | |
--- also... Especially we are interested in abrs e.g. Ave. which are | |
very special. Single char abrs like John C. Mills. This is zyz. but | |
not known as abbreviation. This is zyz. BUT not known as abbreviation. | |
Another case is . in a sentence??? Or .Net .12 or so. Numbers 9.23 1,23 | |
$12 22% #2 and so on !!! Parentheses (which are important) and [numeric] | |
{expressions} ((*)) like (3 - 5) + 2 * -1 / 12 or 1/2 must work too. | |
Also mark@twain.com and 9.4.124.8 and | |
www.ibm-research.com @are also@ ### $$ @@ -checked. Commas, and semicolons; and colons: | |
are ::: interesting,,, ,too? Apostrophes ''' are' 'interesting as well: L'Oreal | |
Tom's 'don't' 1'2'3 8''. Also 'used' as 'quotations'. Let's go to the internet-cafe and | |
chat with foo-bar. | |
The next lines are paragraph boundary tests: | |
tok1 tok2 | |
tok3 tok4 | |
tok5 | |
- tokX tokY | |
tok6 |