Add EmbeddingSearch — a new SearchService implementation that enables natural language search across Zeppelin notebooks using ONNX-based sentence embeddings. This is a drop-in replacement for LuceneSearch that understands meaning, not just keywords.
Example: Searching “yesterday's spending” finds paragraphs containing SELECT sum(cost) FROM analytics.daily_sales WHERE date = current_date - interval '1' day — something keyword search cannot do (returns 0 results with LuceneSearch).
Zeppelin’s current search (LuceneSearch) uses keyword-based full-text search with Lucene’s StandardAnalyzer. This has several limitations for notebook search:
- current_date - 1
- StandardAnalyzer breaks on underscores and dots in table names like analytics_db.daily_sales

For teams with hundreds or thousands of notebooks (common in data/analytics teams), finding the right query becomes a significant productivity bottleneck.
SearchService (abstract)
├── LuceneSearch (existing, keyword-based)
├── EmbeddingSearch (new, semantic)
└── NoSearchService (existing, no-op)
┌─────────────────────────────────────────────────────────────┐
│ EmbeddingSearch │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ HuggingFace │ │ ONNX Runtime │ │ In-Memory Index │ │
│ │ Tokenizer │→ │ Inference │→ │ float[][] + meta │ │
│ │ (DJL) │ │ (CPU) │ │ ConcurrentHashMap│ │
│ └──────────────┘ └──────────────┘ └────────┬─────────┘ │
│ │ │
│ Two-phase query: │ │
│ 1. Embed query → cosine sim → find tables │ │
│ 2. Re-rank with table boost → top-20 │ │
│ ▼ │
│ Index: text + title + output + tables embedding_index.bin│
│ (persisted to disk, versioned) │
└─────────────────────────────────────────────────────────────┘
- zeppelin.search.index.path/models/
- ConcurrentHashMap<String, IndexEntry> with ReadWriteLock
- embedding_index.bin, currently v3)

| Content | LuceneSearch | EmbeddingSearch |
|---|---|---|
| Paragraph text | ✓ | ✓ |
| Paragraph title | ✓ | ✓ |
| Notebook name | ✓ | ✓ (in embedding context) |
| Paragraph output (TABLE, TEXT) | ✗ | ✓ |
| SQL table names (FROM/JOIN) | ✗ | ✓ (extracted + boosted) |
| Interpreter prefix stripped | ✗ | ✓ |
This helps queries like “click funnel analysis” surface all paragraphs that query the same tables, even if their SQL text is very different.
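The two-phase query with table boost could be sketched roughly as follows. This is an illustration, not the code in EmbeddingSearch.java: the class `TableBoostRerank`, the `Candidate` type, and the `seedCount`, `boost`, and `limit` parameters are all hypothetical, and the additive boost is an assumed scoring rule.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a two-phase query; names and scoring rule are assumptions. */
final class TableBoostRerank {

  /** One indexed paragraph: id, cosine similarity to the query, referenced tables. */
  static final class Candidate {
    final String id;
    final double similarity;
    final Set<String> tables;

    Candidate(String id, double similarity, Set<String> tables) {
      this.id = id;
      this.similarity = similarity;
      this.tables = tables;
    }
  }

  static List<String> rerank(List<Candidate> candidates, int seedCount,
                             double boost, int limit) {
    // Phase 1: rank by raw similarity and collect the tables of the best hits.
    List<Candidate> bySimilarity = new ArrayList<>(candidates);
    bySimilarity.sort(Comparator.comparingDouble((Candidate c) -> -c.similarity));
    Set<String> hotTables = new HashSet<>();
    int seeds = Math.min(seedCount, bySimilarity.size());
    for (Candidate c : bySimilarity.subList(0, seeds)) {
      hotTables.addAll(c.tables);
    }

    // Phase 2: re-rank, adding a fixed boost to candidates that share a table.
    List<Candidate> boosted = new ArrayList<>(candidates);
    boosted.sort(Comparator.comparingDouble(
        (Candidate c) -> -score(c, hotTables, boost)));

    List<String> ids = new ArrayList<>();
    for (Candidate c : boosted.subList(0, Math.min(limit, boosted.size()))) {
      ids.add(c.id);
    }
    return ids;
  }

  private static double score(Candidate c, Set<String> hotTables, double boost) {
    for (String table : c.tables) {
      if (hotTables.contains(table)) {
        return c.similarity + boost;
      }
    }
    return c.similarity;
  }
}
```

With a boost of this kind, a paragraph whose SQL text is only weakly similar to the query can still move up the ranking when it reads from the same table as the strongest hit.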
Disabled by default. Enable with a single property:
    <!-- In zeppelin-site.xml -->
    <property>
      <name>zeppelin.search.semantic.enable</name>
      <value>true</value>
    </property>
Requires zeppelin.search.enable = true (already the default).
| search.enable | search.semantic.enable | Result |
|---|---|---|
| true | false (default) | LuceneSearch (existing behavior) |
| true | true | EmbeddingSearch (semantic) |
| false | any | NoSearchService |
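The selection table above maps onto a small decision function. The sketch below uses a hypothetical `SearchServiceSelector` class and `Backend` enum as stand-ins for the real wiring in ZeppelinServer.java, which is not shown here.

```java
/** Hypothetical sketch of the selection table; not the actual ZeppelinServer wiring. */
final class SearchServiceSelector {

  enum Backend { LUCENE, EMBEDDING, NONE }

  static Backend select(boolean searchEnable, boolean semanticEnable) {
    if (!searchEnable) {
      // Search is off entirely: NoSearchService, whatever the semantic flag says.
      return Backend.NONE;
    }
    // Search is on: the semantic flag picks between the two implementations.
    return semanticEnable ? Backend.EMBEDDING : Backend.LUCENE;
  }
}
```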
- zeppelin-zengine/.../search/EmbeddingSearch.java: Core implementation (~700 lines)
- zeppelin-zengine/.../search/EmbeddingSearchTest.java: 11 tests including semantic validation
- docs/embedding-search.md: This document
- zeppelin-zengine/pom.xml: Add onnxruntime and djl-tokenizers dependencies
- zeppelin-zengine/.../conf/ZeppelinConfiguration.java: Add ZEPPELIN_SEARCH_SEMANTIC_ENABLE
- zeppelin-server/.../server/ZeppelinServer.java: Wire EmbeddingSearch based on config
- NOTICE: Attribution for ONNX Runtime and DJL
- zeppelin-web-angular/.../result-item/: Render search results with separate code block, output block, and table name display (replaces Monaco editor)
- zeppelin-web/src/app/search/: Same improvements for Classic UI

- com.microsoft.onnxruntime:onnxruntime:1.18.0 (~50MB, Apache 2.0 compatible)
- ai.djl.huggingface:tokenizers:0.28.0 (~2MB, Apache 2.0, JNA excluded to avoid version conflict with Zeppelin's existing JNA 4.1.0)

Both LuceneSearch and EmbeddingSearch return List<Map<String, String>> with these keys:
| Key | LuceneSearch | EmbeddingSearch |
|---|---|---|
| id | noteId or noteId/paragraph/paragraphId | Same |
| name | Notebook title | Notebook title |
| snippet | Highlighted paragraph text (<B> tags) | Paragraph text (no highlighting) |
| text | Full paragraph text | Full paragraph text |
| header | Highlighted paragraph title (<B> tags) | Paragraph title (plain) |
| title | Same as header | Paragraph title (plain) |
| tables | "" (empty) | Space-separated SQL table names |
| output | "" (empty) | Paragraph output (truncated to 300 chars) |
The title, tables, and output fields are dedicated structured fields. The header field preserves backward compatibility — for LuceneSearch it contains the highlighted paragraph title, for EmbeddingSearch it contains the plain title.
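To make the result shape concrete, here is a minimal sketch of how one EmbeddingSearch-style result map might be assembled. The class `SearchResultSketch`, the regular expression used to pull table names out of FROM/JOIN clauses, and the method names are assumptions for illustration; only the keys and the 300-character output limit come from the table above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; the regex and helper names are not taken from the PR. */
final class SearchResultSketch {

  // Captures the identifier that follows FROM or JOIN, e.g. analytics.daily_sales.
  private static final Pattern TABLE = Pattern.compile(
      "\\b(?:FROM|JOIN)\\s+([A-Za-z_][\\w.]*)", Pattern.CASE_INSENSITIVE);

  static final int MAX_OUTPUT_CHARS = 300;

  /** Returns the distinct table names in order of appearance, space-separated. */
  static String extractTables(String sql) {
    Set<String> tables = new LinkedHashSet<>();
    Matcher m = TABLE.matcher(sql);
    while (m.find()) {
      tables.add(m.group(1));
    }
    return String.join(" ", tables);
  }

  /** Builds one result entry with the keys listed in the table above. */
  static Map<String, String> toResult(String id, String noteName, String title,
                                      String text, String output) {
    Map<String, String> result = new LinkedHashMap<>();
    result.put("id", id);
    result.put("name", noteName);
    result.put("snippet", text);   // plain text, no <B> highlighting
    result.put("text", text);
    result.put("header", title);   // kept for backward compatibility
    result.put("title", title);
    result.put("tables", extractTables(text));
    result.put("output", output.length() > MAX_OUTPUT_CHARS
        ? output.substring(0, MAX_OUTPUT_CHARS)
        : output);
    return result;
  }
}
```

A simple pattern like this one skips subqueries (`FROM (`) and quoted identifiers, so a production extractor would likely need more cases.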
Both Angular and Classic UIs render search results with:
- output field)
- tables field)
- sql, python, md, etc.

ONNX Runtime is a widely used inference engine for transformer models. It runs the same model files used from Python (HuggingFace, ChromaDB, etc.), which keeps embeddings compatible across tools.
For Zeppelin's scale (typically < 50K paragraphs), brute-force cosine similarity on normalized vectors is fast enough (< 50ms), exact (no approximation error), and adds zero complexity.
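A brute-force search over normalized vectors reduces to one dot product per indexed paragraph followed by a sort. The sketch below shows the idea; the class `BruteForceCosine` and its method names are hypothetical and not taken from EmbeddingSearch.java.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of exact nearest-neighbor search by exhaustive scan. */
final class BruteForceCosine {

  /** Scales a vector to unit length so that a dot product equals cosine similarity. */
  static float[] normalize(float[] v) {
    double sumSq = 0.0;
    for (float x : v) {
      sumSq += (double) x * x;
    }
    double norm = Math.sqrt(sumSq);
    float[] out = new float[v.length];
    if (norm == 0.0) {
      return out; // a zero vector stays zero
    }
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }

  /** Dot product; equals cosine similarity when both inputs are unit length. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Returns the row indices of the k highest-scoring vectors, best first. */
  static List<Integer> topK(float[][] index, float[] query, int k) {
    float[] scores = new float[index.length];
    List<Integer> order = new ArrayList<>();
    for (int i = 0; i < index.length; i++) {
      scores[i] = dot(index[i], query);
      order.add(i);
    }
    order.sort(Comparator.comparingDouble((Integer i) -> -scores[i]));
    return new ArrayList<>(order.subList(0, Math.min(k, order.size())));
  }
}
```

Normalizing once at index time means the query path does no square roots or divisions per entry, which is part of why an exhaustive scan stays cheap at this scale.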
The ONNX model is 86MB. Bundling it would bloat the Zeppelin distribution. Downloading on first use keeps the distribution lean and allows users to swap models.
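A download-on-first-use step might look like the sketch below. The class `ModelCache`, the method `ensureModel`, and its parameters are hypothetical; the real download source and file layout are not specified here, so the caller supplies both.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of lazy model download; not the PR's implementation. */
final class ModelCache {

  /** Returns the cached model file, fetching it from source only if it is missing. */
  static Path ensureModel(Path modelDir, String fileName, URI source)
      throws IOException {
    Path target = modelDir.resolve(fileName);
    if (Files.exists(target)) {
      return target; // already downloaded on an earlier run
    }
    Files.createDirectories(modelDir);
    // Download to a temporary file first so a partial download is never used.
    Path tmp = Files.createTempFile(modelDir, fileName, ".part");
    try (InputStream in = source.toURL().openStream()) {
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    } finally {
      Files.deleteIfExists(tmp); // no-op after a successful move
    }
    return target;
  }
}
```

Because the lookup is by file name only, a user could swap in a different model by placing a file with the expected name in the directory before first use.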
Zeppelin uses Lucene 8.7.0. Upgrading to 9.x is a separate, larger effort.
    # Run embedding search tests (requires model download, ~86MB first time)
    ZEPPELIN_EMBEDDING_TEST=true mvn test -pl zeppelin-zengine \
      -Dtest=EmbeddingSearchTest

    # Run existing Lucene tests (should still pass, no changes)
    mvn test -pl zeppelin-zengine -Dtest=LuceneSearchTest
- semanticSearchFindsRelatedConcepts: validates that “yesterday's spending” ranks a SQL spend query above an unrelated user count query
- newParagraphIsLiveIndexed: validates that newly added paragraphs are immediately searchable without restart