blob: 5c5bd4bf1ca8f4181a416fdc92814a373bb17213 [file] [log] [blame] [view]
# Embedding
> Embedding Transform Plugin
## Description
The `Embedding` transform plugin leverages embedding models to convert text and multimodal data into vectorized representations. This
transformation can be applied to various fields including text, images, and videos. The plugin supports multiple model providers and can be integrated with
different API endpoints.
> **Important Note:** The current embedding precision only supports float32 format.
## Options
| Name | Type | Required | Default Value | Description |
|--------------------------------|--------|----------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| model_provider | enum | yes | - | The model provider for embedding. Options may include `AMAZON`, `QIANFAN`, `OPENAI`, etc. |
| api_key | string | yes | - | The API key required to authenticate with the embedding service. |
| secret_key | string | yes | - | The secret key required for additional authentication with the embedding service. |
| aws_region | string | no | | AWS Region. Required for use Amazon Bedrock model. |
| single_vectorized_input_number | int | no | 1 | The number of inputs vectorized in one request. Default is 1. |
| vectorization_fields | map | yes | - | A mapping between input fields and their corresponding output vector fields. |
| model | string | yes | - | The specific model to use for embedding (e.g: `text-embedding-3-small` for OPENAI). |
| api_path | string | no | - | The API endpoint for the embedding service. Typically provided by the model provider. |
| dimension | int | no | - | TThe vector dimension defaults to 2048. The Embedding-3 model supports custom vector dimensions, and it is recommended to choose dimensions of 256, 512, 1024, or 2048. |
| oauth_path | string | no | - | The API endpoint for the oauth service. |
| custom_config | map | no | | Custom configurations for the model. |
| custom_response_parse | string | no | | Specifies how to parse the response from the model using JsonPath. Example: `$.choices[*].message.content`. |
| custom_request_headers | map | no | | Custom headers for the request to the model. |
| custom_request_body | map | no | | Custom body for the request. Supports placeholders like `${model}`, `${input}`. |
## Precision Support
**Important:** The current version of the Embedding plugin only supports **float32** precision for vector data.
- All generated embedding vectors will be stored in float32 format
- If your model or API returns other precision formats (such as float64), the plugin will automatically convert them to float32
### model_provider
The providers for generating embeddings include common options such as `AMAZON`, `DOUBAO`, `QIANFAN`, and `OPENAI`. Additionally,
you can choose `CUSTOM` to implement requests and retrievals for custom embedding models.
### api_key
The API key for authenticating requests to the embedding service. This is typically provided by the model provider when
you register for their service.
### secret_key
The secret key used for additional authentication. Some providers may require this for secure API requests.
### single_vectorized_input_number
Specifies how many inputs are processed in a single vectorization request. The default is 1. Adjust based on your
processing
capacity and the model provider's API limitations.
### vectorization_fields
A mapping between input fields and their respective output vector fields. This allows the plugin to understand which
fields to vectorize and how to store the resulting vectors. The plugin supports multimodal data by allowing you to specify
the modality type for each field.
**Basic Text Vectorization:**
```hocon
vectorization_fields {
book_intro_vector = book_intro
author_biography_vector = author_biography
}
```
**Multimodal Vectorization:**
```hocon
vectorization_fields {
# Basic text field
text_vector = text_field
# Explicit modality type configuration
product_image_vector = {
field = product_image_url
modality = jpeg
format = url
}
# Auto-detect modality type (based on file suffix)
thumbnail_vector = {
field = thumbnail_image # If value is "image.png", auto-detects as PNG modality
format = url
}
# Video field configuration
demo_video_vector = {
field = product_video_url
modality = mp4
format = url
}
# Binary data configuration
binary_image_vector = {
field = image_data
modality = jpeg
format = binary
}
}
```
**Field Specification Formats:**
**Supported Modality Types:**
- **Images:** `jpeg` (jpg, jpeg), `png` (png, apng), `gif`, `webp`, `bmp` (bmp, dib), `tiff` (tiff, tif), `ico`, `icns`, `sgi`, `jpeg2000` (j2c, j2k, jp2, jpc, jpf, jpx)
- **Videos:** `mp4`, `avi`, `mov`
- **Text:** `text` (default)
**Payload Formats:**
- `text` - Text format (default)
- `url` - URL format
- `binary` - Binary data format
**Automatic Modality Detection:**
When `modality` is not explicitly specified and `format` is not `binary`, the system automatically detects the modality type based on the file suffix of the field value:
> **Important:** When using multimodal fields (image or video), ensure your model provider supports multimodal embedding. Image and video fields must contain valid URLs or binary data. Currently, `DOUBAO` provider supports multimodal data processing.
### model
The specific embedding model to use. This depends on the `embedding_model_provider`. For example, if using OPENAI, you
might specify `text-embedding-3-small`.
### api_path
The API endpoint to use for making requests to the embedding service. This might vary based on the provider and model
used. Generally, this is provided by the model provider.
### oauth_path
The API endpoint for the oauth service. Get certification information. This might vary based on the provider and model
used. Generally, this is provided by the model provider.
### custom_config
The `custom_config` option allows you to provide additional custom configurations for the model. This is a map where you
can define various settings that might be required by the specific model you're using.
### custom_response_parse
The `custom_response_parse` option allows you to specify how to parse the model's response. You can use JsonPath to
extract the specific data you need from the response. For example, by using `$.data[*].embedding`, you can extract
the `embedding` field values from the following JSON and obtain a `List` of nested `List` results. For more details on
using JsonPath, please refer to
the [JsonPath Getting Started guide](https://github.com/json-path/JsonPath?tab=readme-ov-file#getting-started).
```json
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [
-0.006929283495992422,
-0.005336422007530928,
-0.00004547132266452536,
-0.024047505110502243
]
}
],
"model": "text-embedding-3-small",
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
```
### custom_request_headers
The `custom_request_headers` option allows you to define custom headers that should be included in the request sent to
the model's API. This is useful if the API requires additional headers beyond the standard ones, such as authorization
tokens, content types, etc.
### custom_request_body
The `custom_request_body` option supports placeholders:
- `${model}`: Placeholder for the model name.
- `${input}`: Placeholder to determine input value and define request body request type based on the type of body
value. Example: `["${input}"]` -> ["input"] (list)
### common options
Transform plugin common parameters, please refer to [Transform Plugin](common-options.md) for details.
## Example Configurations
### Basic Text Embedding
```hocon
env {
job.mode = "BATCH"
}
source {
FakeSource {
row.num = 5
schema = {
fields {
book_id = "int"
book_name = "string"
book_intro = "string"
author_biography = "string"
}
}
rows = [
{fields = [1, "To Kill a Mockingbird",
"Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.",
"Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel."
], kind = INSERT}
{fields = [2, "1984",
"1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.",
"George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society."
], kind = INSERT}
{fields = [3, "Pride and Prejudice",
"Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.",
"Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language."
], kind = INSERT}
{fields = [4, "The Great GatsbyThe Great Gatsby",
"The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.",
"F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s."
], kind = INSERT}
{fields = [5, "Moby-Dick",
"Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.",
"Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors."
], kind = INSERT}
]
plugin_output = "fake"
}
}
transform {
Embedding {
plugin_input = "fake"
embedding_model_provider = QIANFAN
model = bge_large_en
api_key = xxxxxxxxxx
secret_key = xxxxxxxxxx
api_path = xxxxxxxxxx
vectorization_fields {
book_intro_vector = book_intro
author_biography_vector = author_biography
}
plugin_output = "embedding_output"
}
}
sink {
Assert {
plugin_input = "embedding_output"
rules =
{
field_rules = [
{
field_name = book_id
field_type = int
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_name
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}
```
### Multimodal Embedding (Volcengine Doubao)
Multimodal Embedding supports input as accessible URL or Binary data formats to process multimodal data.
#### URL
```hocon
env {
job.mode = "BATCH"
}
source {
FakeSource {
row.num = 5
schema = {
fields {
id = "int"
product_name = "string"
description = "string"
product_image_url = "string"
product_video_url = "string"
thumbnail_image = "string"
promotional_video = "string"
category = "string"
price = "decimal(10,2)"
created_at = "timestamp"
}
}
rows = [
{
fields = [
1,
"iPhone 15 Pro",
"Latest iPhone with advanced camera system and A17 Pro chip",
"https://example.com/images/iphone15pro.jpg",
"https://example.com/videos/iphone15pro_demo.mp4",
"https://example.com/thumbnails/iphone15pro_thumb.png",
"https://example.com/videos/iphone15pro_promo.mov",
"Electronics",
999.99,
"2024-01-15T10:30:00"
],
kind = INSERT
},
{
fields = [
2,
"MacBook Air M3",
"Ultra-thin laptop with M3 chip for incredible performance",
"https://example.com/images/macbook_air_m3.jpeg",
"https://example.com/videos/macbook_air_review.avi",
"https://example.com/thumbnails/macbook_thumb.webp",
"https://example.com/videos/macbook_commercial.mp4",
"Computers",
1299.99,
"2024-02-20T14:15:00"
],
kind = INSERT
}
]
plugin_output = "fake"
}
}
transform {
Embedding {
plugin_input = "fake"
model_provider = DOUBAO
model = "doubao-embedding-vision"
api_key = "your-api-key"
api_path = "https://ark.cn-beijing.volces.com/api/v3/embeddings/multimodal"
single_vectorized_input_number = 1
vectorization_fields {
# Text field - defaults to text modality
description_vector = description
product_image_vector = {
field = product_image_url
modality = jpeg
format = url
}
thumbnail_vector = {
field = thumbnail_image # If value is "thumb.png", auto-detects as PNG
format = url
}
demo_video_vector = {
field = product_video_url
modality = mp4
format = url
}
promo_video_vector = {
field = promotional_video # If value is "promo.mov", auto-detects as MOV
format = url
}
# Mixed content - product name
product_name_vector = product_name
}
plugin_output = "multimodal_embedding_output"
}
}
sink {
Assert {
plugin_input = "multimodal_embedding_output"
rules = {
field_rules = [
{
field_name = id
field_type = int
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = description_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = product_image_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = thumbnail_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = demo_video_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}
```
#### Binary
```hocon
env {
job.mode = "BATCH"
}
source {
LocalFile {
path = "/seatunnel/read/binary/"
file_format_type = "binary"
binary_complete_file_mode = false
binary_chunk_size = 1024
plugin_output = "binary_source"
}
}
transform {
Embedding {
plugin_input = "binary_source"
model_provider = DOUBAO
model = "doubao-embedding-vision-250615"
api_key = "test-api-key"
api_path = "http://mockserver:1080/api/v3/embeddings/multimodal"
single_vectorized_input_number = 1
vectorization_fields = {
image_embedding = {
field = "data"
modality = "jpeg"
format = "binary"
}
}
plugin_output = "binary_embedding_output"
}
}
sink {
Assert {
plugin_input = "binary_embedding_output"
rules = {
row_rules = [
{
rule_type = MAX_ROW
rule_value = 1
}
],
field_rules = [
{
field_name = image_embedding
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = relativePath
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}
```
### Customize the embedding model
```hocon
env {
job.mode = "BATCH"
}
source {
FakeSource {
row.num = 5
schema = {
fields {
book_id = "int"
book_name = "string"
book_intro = "string"
author_biography = "string"
}
}
rows = [
{fields = [1, "To Kill a Mockingbird",
"Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.",
"Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel."
], kind = INSERT}
{fields = [2, "1984",
"1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.",
"George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society."
], kind = INSERT}
{fields = [3, "Pride and Prejudice",
"Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.",
"Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language."
], kind = INSERT}
{fields = [4, "The Great GatsbyThe Great Gatsby",
"The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.",
"F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s."
], kind = INSERT}
{fields = [5, "Moby-Dick",
"Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.",
"Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors."
], kind = INSERT}
]
plugin_output = "fake"
}
}
transform {
Embedding {
plugin_input = "fake"
model_provider = CUSTOM
model = text-embedding-3-small
api_key = xxxxxxxx
api_path = "http://mockserver:1080/v1/doubao/embedding"
single_vectorized_input_number = 2
vectorization_fields {
book_intro_vector = book_intro
author_biography_vector = author_biography
}
custom_config={
custom_response_parse = "$.data[*].embedding"
custom_request_headers = {
"Content-Type"= "application/json"
"Authorization"= "Bearer xxxxxxx
}
custom_request_body ={
modelx = "${model}"
inputx = ["${input}"]
}
}
plugin_output = "embedding_output_1"
}
}
sink {
Assert {
plugin_input = "embedding_output_1"
rules =
{
field_rules = [
{
field_name = book_id
field_type = int
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_name
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}
```