website/versioned_docs/version-0.66.0/pe/org.apache.streampipes.processors.textmining.jvm.tokenizer.md

id: version-0.66.0-org.apache.streampipes.processors.textmining.jvm.tokenizer
title: Tokenizer (English)
sidebar_label: Tokenizer (English)
original_id: org.apache.streampipes.processors.textmining.jvm.tokenizer

Description

Segments a given text into tokens (usually words, numbers, punctuation marks, ...). Works best with English text.

Required input

A stream with a string property that contains the text to tokenize.

Configuration

Assign the text property from the output of the previous stream to the tokenizer input. To use this component, you have to download or train an OpenNLP model: https://opennlp.apache.org/models.html

Output

Adds a list to the stream that contains all tokens of the corresponding text.

Example:

Input: (text: "Hi, how are you?")

Output: (text: "Hi, how are you?", tokens: ["Hi", ",", "how", "are", "you", "?"])
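
The splitting behavior in the example above can be approximated with a short, self-contained Java sketch. This is not the processor's implementation: the actual component relies on an OpenNLP model, whereas the sketch below uses a simple regular expression as a stand-in. The class name `TokenizerSketch` and the method `tokenize` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {

    // Assumption: a run of letters or digits is one token, and every other
    // non-whitespace character (e.g. punctuation) is a token of its own.
    // A trained OpenNLP model makes more nuanced decisions than this rule.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    // Returns the tokens of the given text in their original order.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line so punctuation tokens are easy to spot.
        for (String token : tokenize("Hi, how are you?")) {
            System.out.println(token);
        }
    }
}
```

For the example input, this rule yields the same six tokens as shown in the output above. It will likely diverge from a model-based tokenizer on harder cases such as contractions, abbreviations, or decimal numbers.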