= Unstructured Data Extraction with LangChain4j: A Camel Quarkus example
:cq-example-description: An example that shows how to convert unstructured text data into structured Java objects with the help of a Large Language Model and LangChain4j

{cq-description}

TIP: Check the https://camel.apache.org/camel-quarkus/latest/first-steps.html[Camel Quarkus User guide] for prerequisites
and other general information.

Suppose the volume of https://en.wikipedia.org/wiki/Unstructured_data[unstructured data] grows at a fast pace in a given organization.
How could one transform those scattered gold particles into bullion that a bank would accept?
For instance, imagine an insurance company that records the transcripts of the conversations customers have with its hotline.
There is probably a lot of valuable information that could be extracted from those conversation transcripts.
In this example, we'll convert those text conversations into Java objects that can then be used in the rest of the Camel route.

image::schema.png[]

In order to achieve this extraction, we'll need a https://en.wikipedia.org/wiki/Large_language_model[Large Language Model (LLM)] that natively supports JSON output.
Here, we arbitrarily choose https://ollama.com/library/codellama[codellama] served through https://ollama.com/[ollama].
To request inference from the served model, we'll use the high-level LangChain4j APIs like https://docs.langchain4j.dev/tutorials/ai-services[AiServices].
More precisely, we'll set up the https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html[Quarkus LangChain4j extension] to register an AiService bean.
Finally, we'll invoke the AiService extraction method via the https://camel.apache.org/camel-quarkus/latest/reference/extensions/bean.html[Camel Quarkus bean extension].
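
To give a rough idea of how such an AiService is declared, here is a minimal sketch; the prompt, method name and POJO fields are illustrative assumptions, while the actual interface ships in `src/main/java/org/acme/extraction/CustomPojoExtractionService.java`:

[source,java]
----
// A minimal sketch, not the exact code shipped with this example.
package org.acme.extraction;

import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface CustomPojoExtractionService {

    // The return type drives the JSON schema the LLM answer has to conform to.
    record CustomPojo(boolean customerSatisfied, String customerName,
                      String customerBirthday, String summary) {
    }

    @UserMessage("Extract information about the customer from the transcript below: {text}")
    CustomPojo extractFromText(String text);
}
----

Behind the scenes, LangChain4j appends format instructions derived from the return type to the prompt, and then deserializes the model's JSON answer into the POJO.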

=== Start the Large Language Model

Let's start a container to serve the LLM with Ollama:

[source,shell]
----
docker run -p 11434:11434 langchain4j/ollama-codellama:latest
----

After a moment, a log line like the one below should be output:

[source,shell]
----
time=2024-09-03T08:03:15.532Z level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.5 GiB" available="54.4 GiB"
----

That's it, the LLM is now ready to serve our data extraction requests.
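
If you want to double check that the server is up, you can list the models it serves through Ollama's HTTP API, for instance:

[source,shell]
----
curl http://localhost:11434/api/tags
----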

=== Package and run the application

You are now ready to package and run the application.

TIP: Find more details about the JVM mode and Native mode in the Package and run section of
https://camel.apache.org/camel-quarkus/latest/first-steps.html#_package_and_run_the_application[Camel Quarkus User guide]

==== JVM mode

[source,shell]
----
mvn clean package -DskipTests
java -jar target/quarkus-app/quarkus-run.jar
----

==== Extracting data from unstructured conversations

To avoid the Camel file consumer picking up half-written files, let's atomically copy/move the transcript files to the input folder named `target/transcripts/`, for instance like below:

[source,shell]
----
cp -rf src/test/resources/transcripts/ target/transcripts-tmp
mv target/transcripts-tmp/*.json target/transcripts/
----

The Camel route should output a log like the one below:

[source,shell]
----
2024-09-03 10:14:34,757 INFO [route1] (Camel (camel-1) thread #1 - file://target/transcripts) A document has been received by the camel-quarkus-file extension: {
"id": 1,
"content": "Operator: Hello, how may I help you ?\nCustomer: Hello, I'm calling because I need to declare an accident on my main vehicle.\nOperator: Ok, can you please give me your name ?\nCustomer: My name is Sarah London.\nOperator: Could you please give me your birth date ?\nCustomer: 1986, July the 10th.\nOperator: Ok, I've got your contract and I'm happy to share with you that we'll be able to reimburse all expenses linked to this accident.\nCustomer: Oh great, many thanks."
}
----

In the first log above, we can see that a JSON file carrying transcript-related information has been consumed.
The conversation is present in the JSON field named `content`.
This content will be injected into the LLM prompt.
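
For context, below is a minimal sketch of what such a route could look like; the endpoint options, the JSONPath expression and the log messages are illustrative assumptions, and the JSONPath step would rely on the Camel Quarkus `jsonpath` extension:

[source,java]
----
// A minimal sketch, not the exact route shipped with this example.
package org.acme.extraction;

import org.apache.camel.builder.RouteBuilder;

public class TranscriptRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file:target/transcripts?include=.*\\.json")
            .log("A document has been received by the camel-quarkus-file extension: ${body}")
            // Pull the raw conversation out of the "content" JSON field
            .transform().jsonpath("$.content")
            // Delegate the structured extraction to the LangChain4j AiService
            .bean(CustomPojoExtractionService.class, "extractFromText")
            .log("Extracted POJO: ${body}");
    }
}
----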

After a few seconds or minutes, depending on your hardware setup, the LLM provides an answer strictly conforming to the expected JSON schema.
It's then easy for LangChain4j to convert the returned JSON into a Java object.
In the end, we are provided with a Plain Old Java Object (POJO) holding the extracted data, like below.

[source,shell]
----
2024-09-03 10:14:51,284 INFO [org.acm.ext.CustomPojoStore] (Camel (camel-1) thread #1 - file://target/transcripts) An extracted POJO has been added to the store:
{
"customerSatisfied": "true",
"customerName": "Sarah London",
"customerBirthday": "10 July 1986",
"summary": "Declare an accident on main vehicle and receive reimbursement for expenses."
}
----

See how the LLM shows its capacity to:

* Extract a human-friendly sentiment, like `customerSatisfied`
* Perform https://nlp.stanford.edu/projects/coref.shtml[coreference resolution], like `customerName`, which is deduced from information spread across the whole conversation
* Handle issues related to date formats, like the field `customerBirthday`
* Mix structured and unstructured data (semi-structured data), as with the field `summary`

As the cherry on the cake, all of those pieces of information are computed simultaneously, during a single LLM inference.

In the end, the application should have extracted 3 POJOs.
For each of them, it could be interesting to compare the unstructured input text with the corresponding structured POJO.

More details can be found in the `src/main/java/org/acme/extraction/CustomPojoExtractionService.java` class.

==== Native mode

IMPORTANT: Native mode requires having GraalVM and other tools installed. Please check the Prerequisites section
of https://camel.apache.org/camel-quarkus/latest/first-steps.html#_prerequisites[Camel Quarkus User guide].

If the application is still running in JVM mode, please kill it, for instance with `CTRL+C`.

Now, to prepare a native executable using GraalVM, run the following commands:

[source,shell]
----
mvn clean package -DskipTests -Dnative
./target/*-runner
----

The compilation takes a bit longer, but beyond that, notice how the application behaves the same way.
Indeed, you should be able to send the JSON files and see the extracted data exactly as in JVM mode.
The only difference compared to the JVM mode is that the application is now packaged as a native executable.

== Feedback

Please report bugs and propose improvements via the https://github.com/apache/camel-quarkus/issues[Camel Quarkus GitHub issues].