| --- |
| title: "Vector Storage" |
| weight: 7 |
| type: docs |
| aliases: |
| - /append-table/vector-storage.html |
| --- |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # Vector Storage |
| |
| ## Overview |
| |
| As AI workloads grow, efficient storage of vector data has become increasingly important. |
| |
| Paimon provides storage optimized for vector data: a dedicated `VECTOR` data type, an engine-level mapping from `ARRAY` columns, and the option to store vector columns in a dedicated file format. |
| |
| ## Vector Data Type |
| |
| Vector data comes in many types, among which dense vectors are the most commonly used. They are typically expressed as fixed-length, densely packed arrays, generally without `null` elements. |
| |
| Paimon supports columns of type `VECTOR<t, n>`, which represents a fixed-length, dense vector, where: |
| - **`t`**: The element type of the vector. Seven primitive types are currently supported: `BOOLEAN`, `TINYINT`, `SMALLINT`, `INT`, `BIGINT`, `FLOAT`, `DOUBLE`; |
| - **`n`**: The vector dimension. It must be a positive integer not exceeding `2,147,483,647` (the maximum 32-bit signed integer); |
| - **Nullability**: A `VECTOR` column is nullable by default and can be declared `NOT NULL`. A `VECTOR` value that is not `null` must not contain `null` elements. |
| |
| Compared to variable-length arrays, these constraints make dense vectors more compact in storage and in memory, with the following benefits: |
| - More natural semantic constraints: mismatched lengths, `null` elements, and similar anomalies are rejected at the data storage layer; |
| - Better point-lookup performance, because no offset array has to be stored or read; |
| - Closer alignment with the type representations of specialized vector engines, which often avoids memory copies and type conversions during queries. |
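The point-lookup benefit can be illustrated with a small, self-contained sketch. It is not Paimon's storage code: the class `VectorLayoutSketch` and both helper methods are hypothetical and only contrast a fixed-length layout with a variable-length layout that needs an offset array.

```java
import java.util.Arrays;

/** Illustrative only: contrasts two in-memory layouts; this is not Paimon's file format. */
final class VectorLayoutSketch {

    /** Fixed-length layout: row i occupies values[i * dim, (i + 1) * dim). */
    static float[] readDense(float[] values, int dim, int row) {
        int start = row * dim; // the position follows from the row index alone
        return Arrays.copyOfRange(values, start, start + dim);
    }

    /** Variable-length layout: row i occupies values[offsets[i], offsets[i + 1]). */
    static float[] readVariable(float[] values, int[] offsets, int row) {
        // An extra offsets array has to be stored and consulted for every lookup.
        return Arrays.copyOfRange(values, offsets[row], offsets[row + 1]);
    }

    public static void main(String[] args) {
        float[] values = {1f, 2f, 3f, 4f, 5f, 6f}; // two rows of dimension 3
        int[] offsets = {0, 3, 6};                  // needed only by the variable-length layout

        System.out.println(Arrays.toString(readDense(values, 3, 1)));
        System.out.println(Arrays.toString(readVariable(values, offsets, 1)));
    }
}
```

Both calls return the second row, `[4.0, 5.0, 6.0]`, but only the variable-length variant has to read `offsets` first.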
| |
| Example: Define a table with a `VECTOR` column using the Java API, write one row of data, and read it back. |
| ```java |
| import java.io.File; |
| import java.util.Arrays; |
| import java.util.UUID; |
| |
| // Paimon classes used below live under org.apache.paimon.*; their imports are omitted for brevity. |
| public class CreateTableWithVector { |
| |
|     public static void main(String[] args) throws Exception { |
|         // Schema: a BIGINT id and a 3-dimensional FLOAT vector, stored in Lance files |
|         Schema.Builder schemaBuilder = Schema.newBuilder(); |
|         schemaBuilder.column("id", DataTypes.BIGINT()); |
|         schemaBuilder.column("embed", DataTypes.VECTOR(3, DataTypes.FLOAT())); |
|         schemaBuilder.option(CoreOptions.FILE_FORMAT.key(), "lance"); |
|         schemaBuilder.option(CoreOptions.FILE_COMPRESSION.key(), "none"); |
|         Schema schema = schemaBuilder.build(); |
| |
|         // Create catalog in a fresh temporary directory. |
|         // java.io.tmpdir may not end with a separator, so build the path with File. |
|         String database = "default"; |
|         String tempPath = |
|                 new File(System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString()).getPath(); |
|         Path warehouse = new Path(TraceableFileIO.SCHEME + "://" + tempPath); |
|         Identifier identifier = Identifier.create(database, "my_table"); |
|         try (Catalog catalog = CatalogFactory.createCatalog(CatalogContext.create(warehouse))) { |
| |
|             // Create table (the `true` flags ignore an already existing database or table) |
|             catalog.createDatabase(database, true); |
|             catalog.createTable(identifier, schema, true); |
|             FileStoreTable table = (FileStoreTable) catalog.getTable(identifier); |
| |
|             // Write one row: id = 1, embed = [1.0, 2.0, 3.0] |
|             BatchWriteBuilder builder = table.newBatchWriteBuilder(); |
|             InternalVector vector = BinaryVector.fromPrimitiveArray(new float[] {1.0f, 2.0f, 3.0f}); |
|             try (BatchTableWrite batchTableWrite = builder.newWrite()) { |
|                 try (BatchTableCommit commit = builder.newCommit()) { |
|                     batchTableWrite.write(GenericRow.of(1L, vector)); |
|                     commit.commit(batchTableWrite.prepareCommit()); |
|                 } |
|             } |
| |
|             // Read data back and print the vector stored in column 1 ("embed") |
|             ReadBuilder readBuilder = table.newReadBuilder(); |
|             TableScan.Plan plan = readBuilder.newScan().plan(); |
|             try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) { |
|                 reader.forEachRemaining(row -> { |
|                     float[] readVector = row.getVector(1).toFloatArray(); |
|                     System.out.println(Arrays.toString(readVector)); |
|                 }); |
|             } |
|         } |
|     } |
| } |
| ``` |
| |
| **Notes**: |
| - Columns of `VECTOR` type cannot be used as primary key columns, partition columns, or for sorting. |
| |
| ## Engine-Level Representation |
| |
| Compute engines typically have no dedicated vector type. To support the `VECTOR` type in engine SQL, Paimon provides table options that map a column declared with the engine's `ARRAY` type to Paimon's `VECTOR` type. |
| |
| Usage: |
| - **`'vector-field'`**: The columns to treat as `VECTOR` type. Separate multiple columns with commas (`,`); |
| - **`'field.{field-name}.vector-dim'`**: The dimension of the vector column `{field-name}`. |
| |
| Example: Define a table with a `VECTOR` column using Flink SQL. |
| ```sql |
| CREATE TABLE IF NOT EXISTS ts_table ( |
| id BIGINT, |
| embed1 ARRAY<FLOAT>, |
| embed2 ARRAY<FLOAT> |
| ) WITH ( |
| 'file.format' = 'lance', |
| 'vector-field' = 'embed1,embed2', |
| 'field.embed1.vector-dim' = '128', |
| 'field.embed2.vector-dim' = '768' |
| ); |
| ``` |
| |
| **Notes**: |
| - Every column listed in `vector-field` must have its dimension declared; otherwise, the `CREATE TABLE` statement fails; |
| - Currently, only Flink SQL supports this configuration; support in other engines has not been implemented yet. |
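The naming pattern of these options can be sketched in plain Java. The helper `vectorOptions` below is hypothetical and not part of any Paimon API; it only derives the option keys and values described above from a map of column names to dimensions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: derives the vector-related table options for a set of columns. */
final class VectorOptionsSketch {

    /** Builds 'vector-field' and 'field.{name}.vector-dim' entries from name -> dimension. */
    static Map<String, String> vectorOptions(Map<String, Integer> dims) {
        Map<String, String> options = new LinkedHashMap<>();
        // All vector columns go into one comma-separated option value.
        options.put("vector-field", String.join(",", dims.keySet()));
        for (Map.Entry<String, Integer> e : dims.entrySet()) {
            if (e.getValue() <= 0) {
                throw new IllegalArgumentException("Dimension must be positive: " + e.getKey());
            }
            // Each column additionally needs its own dimension option.
            options.put("field." + e.getKey() + ".vector-dim", String.valueOf(e.getValue()));
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, Integer> dims = new LinkedHashMap<>();
        dims.put("embed1", 128);
        dims.put("embed2", 768);
        vectorOptions(dims).forEach((k, v) -> System.out.println("'" + k + "' = '" + v + "'"));
    }
}
```

For the two columns of the Flink SQL example, this yields the same three vector-related entries that appear in its `WITH` clause.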
| |
| ## Dedicated File Format for Vector |
| |
| At the file format layer, the ideal representation of the `VECTOR` type is `FixedSizeList`. Currently, this is only supported for certain file formats (such as `lance`) through the `paimon-arrow` integration. As a result, using the `VECTOR` type requires selecting such a format via `file.format`, and that setting applies to every column of the table. This may be unfavorable for scalar and multimodal (Blob) data in particular. |
| |
| Therefore, Paimon can store vector columns in separate files, with their own file format, within Data Evolution tables. |
| |
| Layout: |
| ``` |
| table/ |
| ├── bucket-0/ |
| │ ├── data-uuid-0.parquet # Contains id, name columns |
| │ ├── data-uuid-1.blob # Contains blob data |
| │ ├── data-uuid-2.vector.lance # Contains vector data using lance format |
| │ └── ... |
| ├── manifest/ |
| ├── schema/ |
| └── snapshot/ |
| ``` |
| |
| Usage: |
| - **`vector.file.format`**: Store `VECTOR` type columns in separate files using the specified file format; |
| - **`vector.target-file-size`**: The target file size for separately stored vector data. Defaults to `10 * 'target-file-size'`. |
| |
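| Example: Store `VECTOR` columns separately using Flink SQL. The table enables `row-tracking.enabled` and `data-evolution.enabled`, since separate vector storage applies to Data Evolution tables. |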
| ```sql |
| CREATE TABLE IF NOT EXISTS ts_table ( |
| id BIGINT, |
| embed ARRAY<FLOAT> |
| ) WITH ( |
| 'file.format' = 'parquet', |
| 'vector.file.format' = 'lance', |
| 'vector-field' = 'embed', |
| 'field.embed.vector-dim' = '128', |
| 'row-tracking.enabled' = 'true', |
| 'data-evolution.enabled' = 'true' |
| ); |
| ``` |
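The default rule for `vector.target-file-size` amounts to simple arithmetic, sketched below. The class and method names are hypothetical, and the 128 MiB value in `main` is only an example input, not a statement about Paimon's default `target-file-size`.

```java
import java.util.OptionalLong;

/** Illustrative only: resolves the effective target size of vector files in bytes. */
final class VectorFileSizeSketch {

    /** Uses the explicit value if present, otherwise 10 * targetFileSize. */
    static long vectorTargetFileSize(OptionalLong explicit, long targetFileSize) {
        return explicit.isPresent()
                ? explicit.getAsLong()
                : Math.multiplyExact(10L, targetFileSize); // fails loudly on overflow
    }

    public static void main(String[] args) {
        long targetFileSize = 128L * 1024 * 1024; // example value: 128 MiB

        // No explicit 'vector.target-file-size': falls back to 10 * 'target-file-size'.
        System.out.println(vectorTargetFileSize(OptionalLong.empty(), targetFileSize));

        // An explicit value takes precedence over the derived default.
        System.out.println(vectorTargetFileSize(OptionalLong.of(512L * 1024 * 1024), targetFileSize));
    }
}
```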