blob: 3fc251aed86e366ca61242cabca2f8bdbd0a2174 [file] [view]
# doris-native: Rust Native Readers for Apache Doris
This workspace contains the Rust-based format readers for Doris BE, starting with Lance support.
## Architecture
```
C++ (Doris BE) Rust (doris-native)
┌─────────────────┐ ┌──────────────────┐
│ LanceRustReader │──JSON config─>│ lance_reader_open │
│ (GenericReader) │ │ lance_reader_next │──> lance-rs
│ │<─Arrow C ABI──│ lance_reader_close│ Dataset::scan()
└─────────────────┘ └──────────────────┘
```
Data exchange uses the Arrow C Data Interface (zero-copy between Rust and C++).
Each reader owns a single-threaded tokio runtime (`block_on()` on the scanner thread).
## Prerequisites
- Rust stable toolchain (see `rust-toolchain.toml`)
- For BE integration: `BUILD_RUST_READERS=ON` in CMake
## Quick Start
### Run Rust tests
```bash
cd be/src/rust/doris-native
cargo test
```
Expected output: 24 tests passing (error handling, lance reader, FFI bridge).
### Build release library
```bash
cargo build --release
# Output: target/release/libdoris_ffi.a (linked into doris_be)
```
### Build with Doris BE
```bash
# From repo root:
export DORIS_HOME=$PWD
export DORIS_THIRDPARTY=/path/to/thirdparty
export BUILD_RUST_READERS=ON
# Via build.sh:
./build.sh --be
# Or via cmake directly:
cd be/build_Release
cmake -DBUILD_RUST_READERS=ON ...
make -j$(nproc) doris_be
```
## Crate Structure
```
doris-native/
├── Cargo.toml # Workspace root
├── rust-toolchain.toml # Rust version pin
└── crates/
└── doris-ffi/ # Static library linked into doris_be
├── Cargo.toml
└── src/
├── lib.rs # Module root + rust_echo FFI
├── error.rs # Thread-local error handling (FFI_OK, FFI_ERR_*)
├── lance_reader.rs # LanceReader + LanceReaderConfig
└── ffi.rs # extern "C" functions (lance_reader_open, etc.)
```
## FFI Functions
| Function | Purpose |
|----------|---------|
| `lance_reader_open(uri, columns, batch_size, handle_out)` | Open dataset (simple API) |
| `lance_reader_open_json(config_json, len, handle_out)` | Open with full config (S3 creds, version, vector search) |
| `lance_reader_next_batch(handle, schema, array, eof, bytes)` | Read next Arrow batch |
| `lance_reader_get_schema(handle, schema_out)` | Get dataset schema |
| `lance_reader_close(handle)` | Free resources |
| `lance_reader_last_error(buf, len)` | Get error message |
| `lance_test_create_dataset(path, len)` | Create 5-row test dataset |
| `lance_test_create_multi_fragment_dataset(path, len)` | Create 15-row, 3-fragment test dataset |
## JSON Config
The `lance_reader_open_json` accepts a JSON config string:
```json
{
"uri": "s3://bucket/data.lance",
"columns": ["id", "name"],
"batch_size": 4096,
"version": 0,
"storage_options": {
"AWS_ACCESS_KEY_ID": "...",
"AWS_SECRET_ACCESS_KEY": "..."
},
"filter": "category = 'shoes'",
"vector_search": {
"column": "embedding",
"query": [0.1, 0.2, 0.3],
"k": 10,
"metric": "cosine",
"nprobes": 20,
"ef": 100
},
"full_text_search": "machine learning",
"limit": 100,
"offset": 0,
"fragment_ids": [0, 1, 2]
}
```
## Running E2E Tests
### Standalone C++ test (no Doris cluster needed)
```bash
# Build test binary:
RUST_LIB=be/src/rust/doris-native/target/release/libdoris_ffi.a
ARROW_LIB=/path/to/thirdparty/installed/lib64
clang++ -std=c++20 -O2 \
-I/path/to/thirdparty/installed/include \
be/test/format/lance/standalone_lance_test.cpp \
$RUST_LIB -Wl,--start-group $ARROW_LIB/libarrow.a ... -Wl,--end-group \
-lpthread -ldl -lm -lrt -o lance_test
./lance_test
# All 8 tests PASSED!
```
### Live Doris cluster test
```bash
# 1. Create test datasets on BE:
./lance_create single /opt/apache-doris/be/lance_test_data/single.lance
./lance_create multi /opt/apache-doris/be/lance_test_data/multi.lance
# 2. Query via MySQL client:
mysql -h 127.0.0.1 -P 9030 -u root -e "
SELECT * FROM local(
\"file_path\" = \"lance_test_data/single.lance/data/\",
\"backend_id\" = \"<BE_ID>\",
\"format\" = \"lance\"
) ORDER BY id;"
# Expected:
# id name score
# 1 alice 90.5
# 2 bob 85.0
# 3 carol 92.3
# 4 dave 78.1
# 5 eve 88.7
```
### Regression test
```bash
# Run the lance TVF regression test suite:
./run-regression-test.sh --run -s test_lance_tvf
# Run by file:
./run-regression-test.sh --run \
-f regression-test/suites/external_table_p0/tvf/lance/test_lance_tvf.groovy
# Generate expected output (first time):
./run-regression-test.sh --run -s test_lance_tvf -genOut
```
## Test Summary
| Layer | Tests | What's verified |
|-------|-------|----------------|
| Rust unit (24) | `cargo test` | Error handling, LanceReader open/read/close, FFI lifecycle, JSON config |
| C++ standalone (8) | `lance_test` binary | FFI bridge, Arrow import, schema inference, data verification, multi-fragment |
| Live cluster (8) | MySQL queries | Full TVF: SELECT *, projection, COUNT, WHERE, LIMIT, multi-fragment, aggregation |
| Regression (9) | `test_lance_tvf.groovy` | Automated CI-ready version of live cluster tests |