Some tokenization "test" to do ! We are checking so-so and so - so and
--- also... Especially we are interested in abrs e.g. Ave. which are
very special. Single char abrs like John C. Mills. This is zyz. but
not known as abbreviation. This is zyz. BUT not known as abbreviation.
Another case is . in a sentence??? Or .Net .12 or so. Numbers 9.23 1,23
$12 22% #2 and so on !!! Parentheses (which are important) and [numeric]
{expressions} ((*)) like (3 - 5) + 2 * -1 / 12 or 1/2 must work too.
Also mark@twain.com and 9.4.124.8 and
www.ibm-research.com @are also@ ### $$ @@ -checked. Commas, and semicolons; and colons:
are ::: interesting,,, ,too? Apostrophes ''' are' 'interesting as well: L'Oreal
Tom's 'don't' 1'2'3 8''. Also 'used' as 'quotations'. Let's go to the internet-cafe and
chat with foo-bar.
The next lines are paragraph boundary tests:
tok1 tok2
tok3 tok4
tok5
- tokX tokY
tok6