TsFile Tools Manual

Introduction

Development

Prerequisites

To build the Java version of TsFile Tools, you must have the following dependencies installed:

  1. Java >= 17. Make sure the Java environment variables are set (typically JAVA_HOME and PATH).
  2. Maven >= 3.6 (if you are compiling TsFile from source).

Build with Maven

mvn clean package -P with-java -DskipTests

Install to local machine

mvn install -P with-java -DskipTests

Schema Definition

Parameters

| Parameter | Description | Required | Default |
|---|---|---|---|
| table_name | Table name | Yes | |
| time_precision | Time precision (ms / us / ns / s) | No | ms |
| has_header | Whether CSV contains a header (true / false). Ignored for Parquet / Arrow. | No | true |
| separator | CSV delimiter (, / tab / ;). Ignored for Parquet / Arrow. | No | , |
| null_format | String value treated as null in CSV. Ignored for Parquet / Arrow (native null). | No | |
| tag_columns | Tag columns (device identifiers / primary key). Supports virtual columns with DEFAULT value. | No | |
| time_column | Time column name | Yes | |
| source_columns | Column definitions mapping to source file columns | Yes | |

Backward compatibility: id_columns and csv_columns are still accepted as aliases for tag_columns and source_columns.

Column Concepts

  • time_column: Exactly one per table. Written as the time column, with type TIMESTAMP, in TsFile.
  • tag_columns: Device identifiers (composite primary key), zero or more. Supports virtual columns that are not present in the source file via the DEFAULT keyword.
    • Data type is always STRING and cannot be changed. Any type declared for a tag column in source_columns is ignored. We recommend writing tag columns in source_columns with the column name only (no type).
  • source_columns: Maps every column in the source file by position (CSV) or by name (Parquet / Arrow). Use SKIP to ignore a column.
  • FIELD (derived, not configured): All columns in source_columns that are not time_column, not in tag_columns, and not SKIP. These are the measurement columns whose values change over time.

Column name case: TsFile table-model column and table names are case-insensitive and stored as lowercase. Regardless of whether you write Time / TIME / time in import.schema, the on-disk and read-back name is time.

Schema Example

Duplicate timestamps within the same device are not supported: rows that share identical tag column values and the same timestamp will fail to write.

CSV file content:

Region,FactoryNumber,DeviceNumber,Model,MaintenanceCycle,Time,Temperature,Emission
hebei,1001,1,10,1,1,80.0,1000.0
hebei,1001,1,10,1,4,80.0,1000.0
hebei,1002,7,5,2,1,90.0,1200.0
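
Because rows with identical tag values and the same timestamp fail to write, it can help to check a CSV before importing it. The following is a minimal sketch of such a pre-check, not part of TsFile Tools; the function name find_duplicate_keys is hypothetical, and the sketch assumes the file has a header row and fits in memory.

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(csv_text, tag_columns, time_column):
    """Return every (tag values..., timestamp) key that occurs more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        tuple(row[c] for c in tag_columns) + (row[time_column],)
        for row in reader
    )
    return [key for key, n in counts.items() if n > 1]


# The sample CSV from this section: each device/timestamp pair is unique.
sample = """Region,FactoryNumber,DeviceNumber,Model,MaintenanceCycle,Time,Temperature,Emission
hebei,1001,1,10,1,1,80.0,1000.0
hebei,1001,1,10,1,4,80.0,1000.0
hebei,1002,7,5,2,1,90.0,1200.0
"""

tags = ["Region", "FactoryNumber", "DeviceNumber"]
print(find_duplicate_keys(sample, tags, "Time"))  # []
```

An empty result means no two rows share the same device and timestamp.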

Schema file (import.schema):

table_name=root.db1
time_precision=ms
has_header=true
separator=,
null_format=\N

tag_columns
Group DEFAULT Datang
Region
FactoryNumber
DeviceNumber

time_column=Time

source_columns
Region,
FactoryNumber,
DeviceNumber,
SKIP,
SKIP,
Time INT64,
Temperature FLOAT,
Emission DOUBLE,

In this example:

  • Group is a virtual tag column (not in CSV) with default value Datang
  • Region, FactoryNumber, DeviceNumber are tag columns read from CSV; their type is fixed as STRING and need not be declared
  • Model and MaintenanceCycle are skipped via SKIP
  • Temperature and Emission are automatically derived as FIELD columns
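
The FIELD derivation rule can be sketched in Python as follows. This only illustrates the rule as described in Column Concepts and is not the tool's implementation; the function name derive_fields and the list-based representation of the schema are assumptions made for this example, as is the exact-match test for SKIP.

```python
def derive_fields(source_columns, tag_columns, time_column):
    """Return the source columns that end up as FIELD columns.

    source_columns: column names in file order, with "SKIP" for ignored columns.
    tag_columns:    names of all tag columns (virtual ones included).
    time_column:    name of the time column.
    Names are compared and returned in lowercase, since column names are
    case-insensitive and stored as lowercase.
    """
    excluded = {name.lower() for name in tag_columns} | {time_column.lower()}
    fields = []
    for name in source_columns:
        if name == "SKIP":
            continue  # skipped columns are never imported
        if name.lower() in excluded:
            continue  # time and tag columns are not FIELD columns
        fields.append(name.lower())
    return fields


# Columns from the import.schema example (types omitted).
source = ["Region", "FactoryNumber", "DeviceNumber", "SKIP", "SKIP",
          "Time", "Temperature", "Emission"]
tags = ["Group", "Region", "FactoryNumber", "DeviceNumber"]

print(derive_fields(source, tags, "Time"))  # ['temperature', 'emission']
```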

For Parquet / Arrow in schema mode, source_columns matches by column name instead of position. Named SKIP is also supported:

source_columns
Time INT64,
unused_col SKIP,
Temperature FLOAT,
Emission DOUBLE,

Validation rules for Parquet / Arrow schema mode (these are enforced: a mismatch raises an error and the source file is moved to --fail_dir):

  • Column count must match exactly. The number of entries in source_columns must equal the number of columns in the Parquet / Arrow file. Use SKIP for any file column you don't want to import.
  • Every name must exist in the source file. Each non-SKIP column name and every named SKIP must resolve to an actual column in the file.
  • Unnamed SKIP is not allowed. Because matching is by name, an unqualified SKIP cannot identify a column. Always use columnName SKIP.
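
The three validation rules above can be sketched as a single check. This is an illustration under stated assumptions rather than the tool's actual code: the function name validate_source_columns, the (name, kind) pair representation, the error wording, and the exact (case-sensitive) name comparison are all hypothetical.

```python
def validate_source_columns(entries, file_columns):
    """Check schema entries against the column names of a Parquet / Arrow file.

    entries:      list of (name, kind) pairs; name is None for an unnamed SKIP,
                  kind is "SKIP" or a type name such as "FLOAT".
    file_columns: column names found in the source file.
    Returns a list of error messages; an empty list means the schema passes.
    """
    errors = []
    # Rule 1: the column count must match exactly.
    if len(entries) != len(file_columns):
        errors.append(
            f"column count mismatch: schema has {len(entries)}, "
            f"file has {len(file_columns)}"
        )
    known = set(file_columns)
    for name, kind in entries:
        if name is None:
            # Rule 3: matching is by name, so an unnamed SKIP is rejected.
            errors.append("unnamed SKIP is not allowed; use columnName SKIP")
        elif name not in known:
            # Rule 2: every name, including a named SKIP, must exist in the file.
            errors.append(f"column not found in source file: {name}")
    return errors


file_columns = ["Time", "unused_col", "Temperature", "Emission"]
good = [("Time", "INT64"), ("unused_col", "SKIP"),
        ("Temperature", "FLOAT"), ("Emission", "DOUBLE")]
bad = [("Time", "INT64"), (None, "SKIP"), ("Temp", "FLOAT")]

print(validate_source_columns(good, file_columns))  # []
for message in validate_source_columns(bad, file_columns):
    print(message)
```

The second call reports three problems: the count mismatch, the unnamed SKIP, and the unknown column Temp.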

CLI Parameters

| Parameter | Description | Required | Default |
|---|---|---|---|
| -s, --source | Input file or directory | Yes | |
| -t, --target | Output directory | Yes | |
| --schema | Schema file path. Omit for auto mode. | No | |
| --fail_dir | Directory for failed source files | No | failed |
| --format | Source format: csv / parquet / arrow. Auto-detected by file extension if omitted. | No | auto-detect |
| --table_name | Table name override (auto mode) | No | derived from filename |
| --time_precision | Time precision override (auto mode): ms / us / ns / s | No | ms |
| --separator | CSV delimiter (auto mode): , / tab / ; | No | , |
| -b, --block_size | CSV chunk size (e.g. 256M, 1G) | No | 256M |
| -tn, --thread_num | Thread count for parallel processing | No | 8 |

Modes

Schema Mode

Provide a --schema file to explicitly define column mapping, types, tags, and time column.

# CSV
csv2tsfile.sh --source ./data/csv --target ./output --fail_dir ./failed --schema ./schema/import.schema
csv2tsfile.bat --source .\data\csv --target .\output --fail_dir .\failed --schema .\schema\import.schema

# Parquet
parquet2tsfile.sh --source ./data/parquet --target ./output --fail_dir ./failed --schema ./schema/import.schema
parquet2tsfile.bat --source .\data\parquet --target .\output --fail_dir .\failed --schema .\schema\import.schema

# Arrow
arrow2tsfile.sh --source ./data/arrow --target ./output --fail_dir ./failed --schema ./schema/import.schema
arrow2tsfile.bat --source .\data\arrow --target .\output --fail_dir .\failed --schema .\schema\import.schema

Auto Mode

Omit --schema to automatically infer column types and detect the time column.

Auto mode rules:

  • Time column: must be named exactly time or TIME (case-sensitive, strict match).
    • Parquet / Arrow: if the source file contains multiple Timestamp-typed columns, only the one named time / TIME is selected as the time axis. The remaining Timestamp columns become FIELD columns and are stored as INT64 (raw value preserved, TIMESTAMP semantic dropped). To keep them as TIMESTAMP, switch to schema mode and declare them explicitly.
  • All other columns become FIELD (no tag inference)
  • CSV type inference uses a 100-row sampling window per column. Each non-null cell is classified into a base type (BOOLEAN / INT64 / DOUBLE / STRING).
    • If only one base type appears across the sampled rows for a column, that type is used.
    • When different base types appear in the same column, the column is promoted: INT64 + DOUBLE → DOUBLE; any other mixed combination (including BOOLEAN with any numeric type) → STRING.
  • Parquet / Arrow use native schema types directly
  • Default table name: derived from the source filename (e.g. sensor.csv → table sensor). Sanitization rules applied in order:
    1. Strip the file extension (.csv / .parquet / .arrow / .ipc / .feather, or the last .-suffix as a fallback).
    2. Keep only ASCII letters (a–z, A–Z), digits (0–9), underscore (_), and dot (.). Every other character is replaced with _.
    3. Collapse consecutive _ into a single _; strip leading and trailing _.
    4. If the result is empty, use a format-specific default: csv_data / parquet_data / arrow_data.
    5. If the result starts with a digit, prefix t_ (TsFile table names cannot start with a digit).
  • Default null tokens (CSV only): empty cell and \N
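
The five sanitization steps for the default table name can be sketched in Python as follows. This is a minimal illustration of the listed steps, not the tool's implementation; the function name default_table_name is hypothetical, and matching the known extensions case-insensitively is an assumption.

```python
import re

KNOWN_EXTENSIONS = (".csv", ".parquet", ".arrow", ".ipc", ".feather")
FORMAT_DEFAULTS = {"csv": "csv_data", "parquet": "parquet_data", "arrow": "arrow_data"}


def default_table_name(filename, source_format):
    """Derive a table name from a source filename (no directory part)."""
    # 1. Strip a known extension, or the last ".suffix" as a fallback.
    lowered = filename.lower()
    for ext in KNOWN_EXTENSIONS:
        if lowered.endswith(ext):
            name = filename[: -len(ext)]
            break
    else:
        name = filename.rsplit(".", 1)[0] if "." in filename else filename
    # 2. Replace everything except ASCII letters, digits, "_" and "." with "_".
    name = re.sub(r"[^A-Za-z0-9_.]", "_", name)
    # 3. Collapse runs of "_" and strip leading and trailing "_".
    name = re.sub(r"_+", "_", name).strip("_")
    # 4. Fall back to a format-specific default when nothing is left.
    if not name:
        name = FORMAT_DEFAULTS[source_format]
    # 5. Table names cannot start with a digit.
    if name[0].isdigit():
        name = "t_" + name
    return name


print(default_table_name("sensor.csv", "csv"))                 # sensor
print(default_table_name("2024 data-01.csv", "csv"))           # t_2024_data_01
print(default_table_name("温度.parquet", "parquet"))            # parquet_data
```

The order matters: the digit check runs last, so a name that only becomes digit-leading after stripping underscores still receives the t_ prefix.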

Auto mode example:

CSV file (sensor.csv):

time,temperature,humidity,status
1000,25.5,60.0,true
2000,26.1,55.3,false
3000,27.0,58.1,true

Auto mode infers:

table name:  sensor        (from filename)
time column: time
fields:      temperature DOUBLE, humidity DOUBLE, status BOOLEAN
tags:        (none)
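
The CSV type inference that produces the result above can be sketched as follows. This is an illustration of the promotion rules only, not the tool's code. Several details are assumptions: the names classify and infer_column_type, the way individual cells are recognized as BOOLEAN / INT64 / DOUBLE (Python's own parsing is used here), taking the first 100 rows as the sampling window, and returning STRING for a column whose sampled cells are all null.

```python
NULL_TOKENS = {"", "\\N"}  # default null tokens: empty cell and \N
SAMPLE_ROWS = 100          # size of the sampling window per column


def classify(cell):
    """Classify one non-null cell into a base type (parsing rules assumed)."""
    if cell.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(cell)
        return "INT64"
    except ValueError:
        pass
    try:
        float(cell)
        return "DOUBLE"
    except ValueError:
        return "STRING"


def infer_column_type(cells):
    """Infer a column type from the sampled cells of one column."""
    seen = {classify(c) for c in cells[:SAMPLE_ROWS] if c not in NULL_TOKENS}
    if len(seen) == 1:
        return seen.pop()          # only one base type appeared
    if seen == {"INT64", "DOUBLE"}:
        return "DOUBLE"            # numeric promotion
    return "STRING"                # any other mix (or nothing sampled)


# Columns of sensor.csv from the example above.
print(infer_column_type(["25.5", "26.1", "27.0"]))   # DOUBLE
print(infer_column_type(["true", "false", "true"]))  # BOOLEAN
# Mixed columns are promoted.
print(infer_column_type(["1", "2.5", "\\N"]))        # DOUBLE
print(infer_column_type(["true", "1"]))              # STRING
```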

Commands:

# CSV
csv2tsfile.sh --source ./data/csv --target ./output --fail_dir ./failed
csv2tsfile.bat --source .\data\csv --target .\output --fail_dir .\failed

# CSV with options
csv2tsfile.sh --source ./data/csv --target ./output --table_name my_table --separator tab --time_precision us

# Parquet
parquet2tsfile.sh --source ./data/parquet --target ./output --fail_dir ./failed
parquet2tsfile.bat --source .\data\parquet --target .\output --fail_dir .\failed

# Arrow (.arrow / .ipc / .feather)
arrow2tsfile.sh --source ./data/arrow --target ./output --fail_dir ./failed
arrow2tsfile.bat --source .\data\arrow --target .\output --fail_dir .\failed

Output File Naming

  • Single batch: {source_basename}.tsfile
  • Multiple batches: {source_basename}_1.tsfile, {source_basename}_2.tsfile, ...
  • Table name and output filename are independent: the table name comes from the schema or --table_name, and the output filename comes from the source file.
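
The naming pattern above can be sketched as a small helper. This is a hypothetical illustration, not the tool's code; the function name output_names is invented for the example, and the sketch assumes that {source_basename} is the source filename with its last extension removed.

```python
from pathlib import Path


def output_names(source_path, batch_count):
    """Return the expected .tsfile names for one source file."""
    base = Path(source_path).stem  # assumed meaning of {source_basename}
    if batch_count == 1:
        return [f"{base}.tsfile"]
    return [f"{base}_{i}.tsfile" for i in range(1, batch_count + 1)]


print(output_names("./data/csv/sensor.csv", 1))  # ['sensor.tsfile']
print(output_names("./data/csv/sensor.csv", 3))
# ['sensor_1.tsfile', 'sensor_2.tsfile', 'sensor_3.tsfile']
```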