| = LangChain4j Tokenizer Component |
| :doctitle: LangChain4j Tokenizer |
| :shortname: langchain4j-tokenizer |
| :artifactid: camel-langchain4j-tokenizer |
| :description: LangChain4j Tokenizer |
| :since: 4.8 |
| :supportlevel: Experimental |
| :tabs-sync-option: |
| //Manually maintained attributes |
| :group: AI |
| :camel-spring-boot-name: langchain4j-tokenizer |
| |
| *Since Camel {since}* |
| |
The LangChain4j tokenizer component provides support for tokenizing (chunking) larger blocks of text into text segments
| that can be used when interacting with LLMs. Tokenization is particularly helpful when used with |
| https://en.wikipedia.org/wiki/Vector_database[vector databases] to provide better and more contextual search results |
| for https://en.wikipedia.org/wiki/Retrieval-augmented_generation[retrieval-augmented generation (RAG)]. |
| |
| This component uses the https://docs.langchain4j.dev/tutorials/rag/#document-splitter[LangChain4j document splitter] |
| to handle chunking. |
| |
| Maven users will need to add the following dependency to their `pom.xml` |
| for this component: |
| |
| [source,xml] |
| ---- |
| <dependency> |
| <groupId>org.apache.camel</groupId> |
| <artifactId>camel-langchain4j-tokenizer</artifactId> |
| <version>x.x.x</version> |
| <!-- use the same version as your Camel core version --> |
| </dependency> |
| ---- |
| |
| == Usage |
| |
| === Chunking DSL |
| |
The tokenization process is done in the route, using a DSL that configures the parameters of the tokenization:
| |
| [tabs] |
| ==== |
| Java:: |
| + |
| [source,java] |
| ------------------------------------------------------- |
| from("direct:start") |
| .tokenize(tokenizer() |
| .byParagraph() |
| .maxSegmentSize(1024) |
| .maxOverlap(10) |
| .using(LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI) |
| .end()) |
| .split().body() |
| .to("mock:result"); |
| ------------------------------------------------------- |
| |
| ==== |
| |
The tokenization creates a composite message (i.e., an array of Strings). This composite message can then be split
using the xref:eips:split-eip.adoc[Split EIP] so that each text segment is sent separately to an endpoint. Alternatively, the
contents of the composite message may be passed through a processor so that invalid data is filtered out.
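
As an illustration, a simple processor could drop blank segments before the split. This is a minimal sketch: it assumes the composite body arrives as a plain `String[]`, and it treats blank segments as the invalid data to filter out.

[source,java]
----
from("direct:start")
    .tokenize(tokenizer()
            .byParagraph()
            .maxSegmentSize(1024)
            .maxOverlap(10)
            .using(LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI)
            .end())
    .process(exchange -> {
        // assumption: the composite body is an array of Strings, as described above
        String[] segments = exchange.getMessage().getBody(String[].class);
        String[] kept = java.util.Arrays.stream(segments)
                .filter(segment -> segment != null && !segment.isBlank())
                .toArray(String[]::new);
        exchange.getMessage().setBody(kept);
    })
    .split().body()
    .to("mock:result");
----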
| |
| === Supported Splitters |
| |
The following types of splitters are supported (an example of switching splitters follows the list):
| |
| * By paragraph: using the DSL `tokenizer().byParagraph()` |
| * By sentence: using the DSL `tokenizer().bySentence()` |
| * By word: using the DSL `tokenizer().byWord()` |
| * By line: using the DSL `tokenizer().byLine()` |
| * By character: using the DSL `tokenizer().byCharacter()` |
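
Switching splitters only changes the first call in the tokenizer DSL chain. For instance, a sentence-based variant of the earlier route would look like this (a minimal sketch; the size and overlap values are illustrative):

[source,java]
----
from("direct:start")
    .tokenize(tokenizer()
            .bySentence()  // split on sentence boundaries instead of paragraphs
            .maxSegmentSize(1024)
            .maxOverlap(10)
            .using(LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI)
            .end())
    .split().body()
    .to("mock:result");
----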
| |
| === Supported Tokenizers |
| |
| The following tokenizers are supported: |
| |
| * OpenAI: using `LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI` |
| * Azure: using `LangChain4jTokenizerDefinition.TokenizerType.AZURE` |
| * Qwen: using `LangChain4jTokenizerDefinition.TokenizerType.QWEN` |
| |
The application must provide the specific tokenizer implementation from LangChain4j. At the moment, these are:
| |
| [tabs] |
| ==== |
| Open AI:: |
| + |
| [source,xml] |
| ------------------------------------------------------- |
| <dependency> |
| <groupId>dev.langchain4j</groupId> |
| <artifactId>langchain4j-open-ai</artifactId> |
| <version>${langchain4j-version}</version> |
| </dependency> |
| ------------------------------------------------------- |
| |
| Azure:: |
| + |
| [source,xml] |
| ------------------------------------------------------- |
| <dependency> |
| <groupId>dev.langchain4j</groupId> |
| <artifactId>langchain4j-azure-open-ai</artifactId> |
| <version>${langchain4j-version}</version> |
| </dependency> |
| ------------------------------------------------------- |
| |
| Qwen:: |
| + |
| [source,xml] |
| ------------------------------------------------------- |
| <dependency> |
| <groupId>dev.langchain4j</groupId> |
| <artifactId>langchain4j-dashscope</artifactId> |
| <version>${langchain4j-version}</version> |
| </dependency> |
| ------------------------------------------------------- |
| ==== |
| |
| |
Starting with Camel 4.12, the component defaults to using segment sizes to determine the maximum amount of data to
tokenize. To tokenize by actual token counts instead, you must specify the underlying model to use. For instance:
| |
| [tabs] |
| ==== |
| Java:: |
| + |
| [source,java] |
| ------------------------------------------------------- |
| from("direct:start") |
| .tokenize(tokenizer() |
| .byParagraph() |
| .maxTokens(1024, "gpt-4o-mini") |
| .maxOverlap(10) |
| .using(LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI) |
| .end()) |
| .split().body() |
| .to("mock:result"); |
| ------------------------------------------------------- |
| ==== |