blob: bdd3ce541be29b2682b079008e5747d7869c85c9 [file] [log] [blame] [view]
# Intro To Schema Feature
## Why We Need Schema
Some NoSQL databases or message queue are not strongly limited schema, so the schema cannot be obtained through the api.
At this time, a schema needs to be defined to convert to TableSchema and obtain data.
## SchemaOptions
We can use SchemaOptions to define schema, the SchemaOptions contains some configs to define the schema. e.g. columns, primaryKey, constraintKeys.
```
schema = {
table = "database.schema.table"
schema_first = false
comment = "comment"
columns = [
...
]
primaryKey {
...
}
constraintKeys {
...
}
}
```
### table
The table full name of the table identifier which the schema belongs to, it contains database, schema, table name. e.g. `database.schema.table`, `database.table`, `table`.
### schema_first
Default is false.
If the schema_first is true, the schema will be used first, this means if we set `table = "a.b"`, `a` will be parsed as schema rather than database, then we can support write `table = "schema.table"`.
### comment
The comment of the CatalogTable which the schema belongs to.
### Columns
Columns is a list of configs used to define the column in schema, each column can contains name, type, nullable, defaultValue, comment field.
```
columns = [
{
name = id
type = bigint
nullable = false
columnLength = 20
defaultValue = 0
comment = "primary key id"
}
]
```
| Field | Required | Default Value | Description |
|:-------------|:---------|:--------------|----------------------------------------------------------------------------------|
| name | Yes | - | The name of the column |
| type | Yes | - | The data type of the column |
| nullable | No | true | If the column can be nullable |
| columnLength | No | 0 | The length of the column which will be useful when you need to define the length |
| columnScale | No | - | The scale of the column which will be useful when you need to define the scale |
| defaultValue | No | null | The default value of the column |
| comment | No | null | The comment of the column |
#### What type supported at now
| Data type | Value type in Java | Description |
|:-------------|:---------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| string | `java.lang.String` | string |
| boolean | `java.lang.Boolean` | boolean |
| tinyint | `java.lang.Byte` | -128 to 127 regular. 0 to 255 unsigned*. Specify the maximum number of digits in parentheses. |
| smallint | `java.lang.Short` | -32768 to 32767 General. 0 to 65535 unsigned*. Specify the maximum number of digits in parentheses. |
| int | `java.lang.Integer` | All numbers from -2,147,483,648 to 2,147,483,647 are allowed. |
| bigint | `java.lang.Long` | All numbers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 are allowed. |
| float | `java.lang.Float` | Float-precision numeric data from -1.79E+308 to 1.79E+308. |
| double | `java.lang.Double` | Double precision floating point. Handle most decimals. |
| decimal | `java.math.BigDecimal` | Double type stored as a string, allowing a fixed decimal point. |
| null | `java.lang.Void` | null |
| bytes | `byte[]` | bytes |
| date | `java.time.LocalDate` | Only the date is stored. From January 1, 0001 to December 31, 9999. |
| time | `java.time.LocalTime` | Only store time. Accuracy is 100 nanoseconds. |
| timestamp | `java.time.LocalDateTime` | Stores date and time information without time zone. Represents the time of an event in local time. It does not include any offset or zone information. |
| timestamp_tz | `java.time.OffsetDateTime` | Stores date and time information with an offset from UTC. It includes both the local date-time and the offset from UTC, providing more precise temporal information when working with multiple time zones. |
| row | `org.apache.seatunnel.api.table.type.SeaTunnelRow` | Row type, can be nested. |
| map | `java.util.Map` | A Map is an object that maps keys to values. The key type includes `int` `string` `boolean` `tinyint` `smallint` `bigint` `float` `double` `decimal` `date` `time` `timestamp` `null` , and the value type includes `int` `string` `boolean` `tinyint` `smallint` `bigint` `float` `double` `decimal` `date` `time` `timestamp` `null` `array` `map` `row`. |
| array | `ValueType[]` | A array is a data type that represents a collection of elements. The element type includes `int` `string` `boolean` `tinyint` `smallint` `bigint` `float` `double`. |
#### How to declare type supported
SeaTunnel provides a simple and direct way to declare basic types. Basic type keywords include `string`, `boolean`, `tinyint`, `smallint`, `int`, `bigint`, `float`, `double`, `date`, `time`, `timestamp`, and `null`. The keyword names for basic types can be used directly as type declarations, and SeaTunnel is case-insensitive to type keywords. For example, if you need to declare a field with integer type, you can simply define the field as `int` or `"int"`.
> The null type declaration must be enclosed in double quotes, like `"null"`. This approach helps avoid confusion with [HOCON](https://github.com/lightbend/config/blob/main/HOCON.md)'s `null` type which represents undefined object.
When declaring complex types (such as **decimal**, **array**, **map**, and **row**), pay attention to specific considerations.
- When declaring a decimal type, precision and scale settings are required, and the type definition follows the format `decimal(precision, scale)`. It's essential to emphasize that the declaration of the decimal type must be enclosed in `"`; you cannot use the type name directly, as with basic types. For example, when declaring a decimal field with precision 10 and scale 2, you specify the field type as `"decimal(10,2)"`.
- When declaring an array type, you need to specify the element type, and the type definition follows the format `array<T>`, where `T` represents the element type. The element type includes `int`,`string`,`boolean`,`tinyint`,`smallint`,`bigint`,`float` and `double`. Similar to the decimal type declaration, it also be enclosed in `"`. For example, when declaring a field with an array of integers, you specify the field type as `"array<int>"`.
- When declaring a map type, you need to specify the key and value types. The map type definition follows the format `map<K,V>`, where `K` represents the key type and `V` represents the value type. `K` can be any basic type and decimal type, and `V` can be any type supported by SeaTunnel. Similar to previous type declarations, the map type declaration must be enclosed in double quotes. For example, when declaring a field with map type, where the key type is string and the value type is integer, you can declare the field as `"map<string, int>"`.
- When declaring a row type, you need to define a [HOCON](https://github.com/lightbend/config/blob/main/HOCON.md) object to describe the fields and their types. The field types can be any type supported by SeaTunnel. For example, when declaring a row type containing an integer field `a` and a string field `b`, you can declare it as `{a = int, b = string}`. Enclosing the definition in `"` as a string is also acceptable, so `"{a = int, b = string}"` is equivalent to `{a = int, c = string}`. Since HOCON is compatible with JSON, `"{\"a\":\"int\", \"b\":\"string\"}"` is equivalent to `"{a = int, b = string}"`.
Here is an example of complex type declarations:
```hocon
schema {
fields {
c_decimal = "decimal(10, 2)"
c_array = "array<int>"
c_row = {
c_int = int
c_string = string
c_row = {
c_int = int
}
}
# Hocon style declare row type in generic type
map0 = "map<string, {c_int = int, c_string = string, c_row = {c_int = int}}>"
# Json style declare row type in generic type
map1 = "map<string, {\"c_int\":\"int\", \"c_string\":\"string\", \"c_row\":{\"c_int\":\"int\"}}>"
}
}
```
### PrimaryKey
Primary key is a config used to define the primary key in schema, it contains name, columns field.
```
primaryKey {
name = id
columns = [id]
}
```
| Field | Required | Default Value | Description |
|:--------|:---------|:--------------|-----------------------------------|
| name | Yes | - | The name of the primaryKey |
| columns | Yes | - | The column list in the primaryKey |
### ConstraintKeys
Constraint keys is a list of config used to define the constraint keys in schema, it contains constraintName, constraintType, constraintColumns field.
```
constraintKeys = [
{
constraintName = "id_index"
constraintType = KEY
constraintColumns = [
{
columnName = "id"
sortType = ASC
}
]
},
]
```
| Field | Required | Default Value | Description |
|:------------------|:---------|:--------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| constraintName | Yes | - | The name of the constraintKey |
| constraintType | No | KEY | The type of the constraintKey |
| constraintColumns | Yes | - | The column list in the primaryKey, each column should contains constraintType and sortType, sortType support ASC and DESC, default is ASC |
#### What constraintType supported at now
| ConstraintType | Description |
|:---------------|:------------|
| INDEX_KEY | key |
| UNIQUE_KEY | unique key |
## Multi table schemas
```
tables_configs = [
{
schema {
table = "database.schema.table1"
schema_first = false
comment = "comment"
columns = [
...
]
primaryKey {
...
}
constraintKeys {
...
}
}
},
{
schema = {
table = "database.schema.table2"
schema_first = false
comment = "comment"
columns = [
...
]
primaryKey {
...
}
constraintKeys {
...
}
}
}
]
```
## How to use schema
### Recommended
```
source {
FakeSource {
parallelism = 2
plugin_output = "fake"
row.num = 16
schema {
table = "FakeDatabase.FakeTable"
columns = [
{
name = id
type = bigint
nullable = false
defaultValue = 0
comment = "primary key id"
},
{
name = name
type = "string"
nullable = true
comment = "name"
},
{
name = age
type = int
nullable = true
comment = "age"
}
]
primaryKey {
name = "id"
columnNames = [id]
}
constraintKeys = [
{
constraintName = "unique_name"
constraintType = UNIQUE_KEY
constraintColumns = [
{
columnName = "name"
sortType = ASC
}
]
},
]
}
}
}
```
### Deprecated
If you only need to define the column, you can use fields to define the column, this is a simple way but will be remove in the future.
```
source {
FakeSource {
parallelism = 2
plugin_output = "fake"
row.num = 16
schema = {
fields {
id = bigint
c_map = "map<string, smallint>"
c_array = "array<tinyint>"
c_string = string
c_boolean = boolean
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
c_decimal = "decimal(2, 1)"
c_bytes = bytes
c_date = date
c_timestamp = timestamp
}
}
}
}
```
## When we should use it or not
If there is a `schema` configuration project in Options,the connector can then customize the schema. Like `Fake` `Pulsar` `Http` source connector etc.