Embedding Transform Plugin
The Embedding
transform plugin leverages embedding models to convert text and multimodal data into vectorized representations. This transformation can be applied to various fields including text, images, and videos. The plugin supports multiple model providers and can be integrated with different API endpoints.
Important Note: The current embedding precision only supports float32 format.
Name | Type | Required | Default Value | Description |
---|---|---|---|---|
model_provider | enum | yes | - | The model provider for embedding. Options may include AMAZON , QIANFAN , OPENAI , etc. |
api_key | string | yes | - | The API key required to authenticate with the embedding service. |
secret_key | string | yes | - | The secret key required for additional authentication with the embedding service. |
aws_region | string | no | AWS Region. Required for use Amazon Bedrock model. | |
single_vectorized_input_number | int | no | 1 | The number of inputs vectorized in one request. Default is 1. |
vectorization_fields | map | yes | - | A mapping between input fields and their corresponding output vector fields. |
model | string | yes | - | The specific model to use for embedding (e.g: text-embedding-3-small for OPENAI). |
api_path | string | no | - | The API endpoint for the embedding service. Typically provided by the model provider. |
dimension | int | no | - | TThe vector dimension defaults to 2048. The Embedding-3 model supports custom vector dimensions, and it is recommended to choose dimensions of 256, 512, 1024, or 2048. |
oauth_path | string | no | - | The API endpoint for the oauth service. |
custom_config | map | no | Custom configurations for the model. | |
custom_response_parse | string | no | Specifies how to parse the response from the model using JsonPath. Example: $.choices[*].message.content . | |
custom_request_headers | map | no | Custom headers for the request to the model. | |
custom_request_body | map | no | Custom body for the request. Supports placeholders like ${model} , ${input} . |
Important: The current version of the Embedding plugin only supports float32 precision for vector data.
The providers for generating embeddings include common options such as AMAZON
, DOUBAO
, QIANFAN
, and OPENAI
. Additionally, you can choose CUSTOM
to implement requests and retrievals for custom embedding models.
The API key for authenticating requests to the embedding service. This is typically provided by the model provider when you register for their service.
The secret key used for additional authentication. Some providers may require this for secure API requests.
Specifies how many inputs are processed in a single vectorization request. The default is 1. Adjust based on your processing capacity and the model provider's API limitations.
A mapping between input fields and their respective output vector fields. This allows the plugin to understand which fields to vectorize and how to store the resulting vectors. The plugin supports multimodal data by allowing you to specify the modality type for each field.
Basic Text Vectorization:
vectorization_fields { book_intro_vector = book_intro author_biography_vector = author_biography }
Multimodal Vectorization:
vectorization_fields { # Basic text field text_vector = text_field # Explicit modality type configuration product_image_vector = { field = product_image_url modality = jpeg format = url } # Auto-detect modality type (based on file suffix) thumbnail_vector = { field = thumbnail_image # If value is "image.png", auto-detects as PNG modality format = url } # Video field configuration demo_video_vector = { field = product_video_url modality = mp4 format = url } # Binary data configuration binary_image_vector = { field = image_data modality = jpeg format = binary } }
Field Specification Formats:
Supported Modality Types:
jpeg
(jpg, jpeg), png
(png, apng), gif
, webp
, bmp
(bmp, dib), tiff
(tiff, tif), ico
, icns
, sgi
, jpeg2000
(j2c, j2k, jp2, jpc, jpf, jpx)mp4
, avi
, mov
text
(default)Payload Formats:
text
- Text format (default)url
- URL formatbinary
- Binary data formatAutomatic Modality Detection: When modality
is not explicitly specified and format
is not binary
, the system automatically detects the modality type based on the file suffix of the field value:
Important: When using multimodal fields (image or video), ensure your model provider supports multimodal embedding. Image and video fields must contain valid URLs or binary data. Currently,
DOUBAO
provider supports multimodal data processing.
The specific embedding model to use. This depends on the embedding_model_provider
. For example, if using OPENAI, you might specify text-embedding-3-small
.
The API endpoint to use for making requests to the embedding service. This might vary based on the provider and model used. Generally, this is provided by the model provider.
The API endpoint for the oauth service. Get certification information. This might vary based on the provider and model used. Generally, this is provided by the model provider.
The custom_config
option allows you to provide additional custom configurations for the model. This is a map where you can define various settings that might be required by the specific model you're using.
The custom_response_parse
option allows you to specify how to parse the model's response. You can use JsonPath to extract the specific data you need from the response. For example, by using $.data[*].embedding
, you can extract the embedding
field values from the following JSON and obtain a List
of nested List
results. For more details on using JsonPath, please refer to the JsonPath Getting Started guide.
{ "object": "list", "data": [ { "object": "embedding", "index": 0, "embedding": [ -0.006929283495992422, -0.005336422007530928, -0.00004547132266452536, -0.024047505110502243 ] } ], "model": "text-embedding-3-small", "usage": { "prompt_tokens": 5, "total_tokens": 5 } }
The custom_request_headers
option allows you to define custom headers that should be included in the request sent to the model's API. This is useful if the API requires additional headers beyond the standard ones, such as authorization tokens, content types, etc.
The custom_request_body
option supports placeholders:
${model}
: Placeholder for the model name.${input}
: Placeholder to determine input value and define request body request type based on the type of body value. Example: ["${input}"]
-> [“input”] (list)Transform plugin common parameters, please refer to Transform Plugin for details.
env { job.mode = "BATCH" } source { FakeSource { row.num = 5 schema = { fields { book_id = "int" book_name = "string" book_intro = "string" author_biography = "string" } } rows = [ {fields = [1, "To Kill a Mockingbird", "Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.", "Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel." ], kind = INSERT} {fields = [2, "1984", "1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.", "George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society." ], kind = INSERT} {fields = [3, "Pride and Prejudice", "Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.", "Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language." ], kind = INSERT} {fields = [4, "The Great GatsbyThe Great Gatsby", "The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.", "F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s." ], kind = INSERT} {fields = [5, "Moby-Dick", "Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.", "Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors." ], kind = INSERT} ] plugin_output = "fake" } } transform { Embedding { plugin_input = "fake" embedding_model_provider = QIANFAN model = bge_large_en api_key = xxxxxxxxxx secret_key = xxxxxxxxxx api_path = xxxxxxxxxx vectorization_fields { book_intro_vector = book_intro author_biography_vector = author_biography } plugin_output = "embedding_output" } } sink { Assert { plugin_input = "embedding_output" rules = { field_rules = [ { field_name = book_id field_type = int field_value = [ { rule_type = NOT_NULL } ] }, { field_name = book_name field_type = string field_value = [ { rule_type = NOT_NULL } ] }, { field_name = book_intro field_type = string field_value = [ { rule_type = NOT_NULL } ] }, { field_name = author_biography field_type = string field_value = [ { rule_type = NOT_NULL } ] }, { field_name = book_intro_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] }, { field_name = author_biography_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] } ] } } }
Multimodal Embedding supports input as accessible URL or Binary data formats to process multimodal data.
env { job.mode = "BATCH" } source { FakeSource { row.num = 5 schema = { fields { id = "int" product_name = "string" description = "string" product_image_url = "string" product_video_url = "string" thumbnail_image = "string" promotional_video = "string" category = "string" price = "decimal(10,2)" created_at = "timestamp" } } rows = [ { fields = [ 1, "iPhone 15 Pro", "Latest iPhone with advanced camera system and A17 Pro chip", "https://example.com/images/iphone15pro.jpg", "https://example.com/videos/iphone15pro_demo.mp4", "https://example.com/thumbnails/iphone15pro_thumb.png", "https://example.com/videos/iphone15pro_promo.mov", "Electronics", 999.99, "2024-01-15T10:30:00" ], kind = INSERT }, { fields = [ 2, "MacBook Air M3", "Ultra-thin laptop with M3 chip for incredible performance", "https://example.com/images/macbook_air_m3.jpeg", "https://example.com/videos/macbook_air_review.avi", "https://example.com/thumbnails/macbook_thumb.webp", "https://example.com/videos/macbook_commercial.mp4", "Computers", 1299.99, "2024-02-20T14:15:00" ], kind = INSERT } ] plugin_output = "fake" } } transform { Embedding { plugin_input = "fake" model_provider = DOUBAO model = "doubao-embedding-vision" api_key = "your-api-key" api_path = "https://ark.cn-beijing.volces.com/api/v3/embeddings/multimodal" single_vectorized_input_number = 1 vectorization_fields { # Text field - defaults to text modality description_vector = description product_image_vector = { field = product_image_url modality = jpeg format = url } thumbnail_vector = { field = thumbnail_image # If value is "thumb.png", auto-detects as PNG format = url } demo_video_vector = { field = product_video_url modality = mp4 format = url } promo_video_vector = { field = promotional_video # If value is "promo.mov", auto-detects as MOV format = url } # Mixed content - product name product_name_vector = product_name } plugin_output = "multimodal_embedding_output" } } sink { Assert { plugin_input = "multimodal_embedding_output" rules = { field_rules = [ { field_name = id field_type = int field_value = [ { rule_type = NOT_NULL } ] }, { field_name = description_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] }, { field_name = product_image_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] }, { field_name = thumbnail_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] }, { field_name = demo_video_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] } ] } } }
env { job.mode = "BATCH" } source { LocalFile { path = "/seatunnel/read/binary/" file_format_type = "binary" binary_complete_file_mode = false binary_chunk_size = 1024 plugin_output = "binary_source" } } transform { Embedding { plugin_input = "binary_source" model_provider = DOUBAO model = "doubao-embedding-vision-250615" api_key = "test-api-key" api_path = "http://mockserver:1080/api/v3/embeddings/multimodal" single_vectorized_input_number = 1 vectorization_fields = { image_embedding = { field = "data" modality = "jpeg" format = "binary" } } plugin_output = "binary_embedding_output" } } sink { Assert { plugin_input = "binary_embedding_output" rules = { row_rules = [ { rule_type = MAX_ROW rule_value = 1 } ], field_rules = [ { field_name = image_embedding field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] }, { field_name = relativePath field_type = string field_value = [ { rule_type = NOT_NULL } ] } ] } } }
env { job.mode = "BATCH" } source { FakeSource { row.num = 5 schema = { fields { book_id = "int" book_name = "string" book_intro = "string" author_biography = "string" } } rows = [ {fields = [1, "To Kill a Mockingbird", "Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.", "Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel." ], kind = INSERT} {fields = [2, "1984", "1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.", "George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society." ], kind = INSERT} {fields = [3, "Pride and Prejudice", "Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.", "Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language." ], kind = INSERT} {fields = [4, "The Great GatsbyThe Great Gatsby", "The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.", "F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s." ], kind = INSERT} {fields = [5, "Moby-Dick", "Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.", "Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors." ], kind = INSERT} ] plugin_output = "fake" } } transform { Embedding { plugin_input = "fake" model_provider = CUSTOM model = text-embedding-3-small api_key = xxxxxxxx api_path = "http://mockserver:1080/v1/doubao/embedding" single_vectorized_input_number = 2 vectorization_fields { book_intro_vector = book_intro author_biography_vector = author_biography } custom_config={ custom_response_parse = "$.data[*].embedding" custom_request_headers = { "Content-Type"= "application/json" "Authorization"= "Bearer xxxxxxx } custom_request_body ={ modelx = "${model}" inputx = ["${input}"] } } plugin_output = "embedding_output_1" } } sink { Assert { plugin_input = "embedding_output_1" rules = { field_rules = [ { field_name = book_id field_type = int field_value = [ { rule_type = NOT_NULL } ] }, { field_name = book_name field_type = string field_value = [ { rule_type = NOT_NULL } ] }, { field_name = book_intro field_type = string field_value = [ { rule_type = NOT_NULL } ] }, { field_name = author_biography field_type = string field_value = [ { rule_type = NOT_NULL } ] }, { field_name = book_intro_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] }, { field_name = author_biography_vector field_type = float_vector field_value = [ { rule_type = NOT_NULL } ] } ] } } }