fe-sql-parser is a standalone ANTLR4-based syntax parser for Apache Doris SQL. It produces an ANTLR parse tree (CST) for any Doris-dialect SQL string. It performs no semantic analysis: identifiers are not resolved, tables and columns are not validated, and types are not checked. The module has a single runtime dependency: org.antlr:antlr4-runtime.
The module is the single source of truth for the Doris SQL grammar; fe-core consumes the parser through this module rather than maintaining its own copy.

    fe-sql-parser/
    ├── pom.xml
    ├── src/main/antlr4/org/apache/doris/nereids/
    │   ├── DorisLexer.g4                 # Doris SQL lexer grammar
    │   └── DorisParser.g4                # Doris SQL parser grammar
    └── src/main/java/
        ├── org/apache/doris/nereids/
        │   ├── parser/                   # Parser support: CaseInsensitiveStream,
        │   │                             # Origin, OriginAware, ParserUtils,
        │   │                             # ParseErrorListener, PostProcessor
        │   ├── exceptions/               # ParseException, SyntaxParseException
        │   └── errors/QueryParsingErrors.java
        └── org/apache/doris/sqlparser/
            ├── DorisSqlParser.java       # Public library facade
            └── DorisSqlParserCli.java    # Command-line entry point

At build time the ANTLR Maven plugin generates org.apache.doris.nereids.DorisLexer, DorisParser, DorisParserBaseVisitor, and DorisParserBaseListener into target/generated-sources/antlr4/.
The module has two build modes: the default mode produces a thin library jar that fe-core and downstream tools depend on; the cli profile additionally produces a self-contained executable jar.

    # From the fe/ directory
    mvn -pl fe-sql-parser -am package

Output: fe/fe-sql-parser/target/doris-fe-sql-parser.jar (~1.3 MB). This jar contains only the parser classes; it expects org.antlr:antlr4-runtime:4.13.1 to be provided by the consuming project's classpath.
To install it to your local Maven repository so other projects can resolve it:
mvn -pl fe-sql-parser -am -Pflatten install -DskipTests
The flatten profile is required so the installed POM has ${revision} resolved to a concrete version.

    # From the fe/ directory
    mvn -pl fe-sql-parser -Pcli package -DskipTests

Output: fe/fe-sql-parser/target/fe-sql-parser-1.2-SNAPSHOT-cli.jar (~1.7 MB).
This is a self-contained executable jar produced by maven-shade-plugin:
- Bundles antlr4-runtime, so the jar runs anywhere with a JRE 8+
- Sets Main-Class: org.apache.doris.sqlparser.DorisSqlParserCli
- `<minimizeJar>true</minimizeJar>` strips unused classes (transitively-inherited logging, test utilities, etc.) so the final jar contains only the parser plus its actual reachable dependencies

The CLI profile is gated so default Doris builds do not pay the shading cost. The thin library jar produced by the default build is unaffected; fe-core continues to consume it directly.
java -jar fe-sql-parser-1.2-SNAPSHOT-cli.jar [OPTIONS] [SQL]
| Source | Example |
|---|---|
| Positional argument | `java -jar ...-cli.jar "SELECT 1"` |
| `-e` / `--exec <SQL>` | `java -jar ...-cli.jar -e "SELECT 1"` |
| `-f` / `--file <PATH>` | `java -jar ...-cli.jar -f query.sql` |
| stdin (when none of the above) | `echo "SELECT 1" \| java -jar ...-cli.jar` |
| Flag | Grammar rule | Use case |
|---|---|---|
| (default) | `singleStatement` | One SQL statement |
| `--multi` | `multiStatements` | Multiple statements separated by `;` |
| `--expression` | `expressionWithEof` | A single SQL expression |
| Flag | Output |
|---|---|
| (default) | ANTLR LISP-style tree on one line |
| `--pretty` | Indented multi-line tree, two-space indent per level |
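
The relationship between the two output formats can be illustrated with a small converter that turns the one-line LISP-style form into an indented one. This is only a sketch and not the CLI's own implementation: the class name `LispTreeIndenter` is hypothetical, it assumes token text contains no parentheses or whitespace, and it does not reproduce the quoting that `--pretty` applies to terminal tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: re-indents an ANTLR LISP-style tree string.
final class LispTreeIndenter {

    /** Splits the input into "(", ")" and whitespace-delimited atoms. */
    static List<String> tokenize(String tree) {
        List<String> tokens = new ArrayList<>();
        StringBuilder atom = new StringBuilder();
        for (char c : tree.toCharArray()) {
            boolean paren = c == '(' || c == ')';
            if (paren || Character.isWhitespace(c)) {
                if (atom.length() > 0) {
                    tokens.add(atom.toString());
                    atom.setLength(0);
                }
                if (paren) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                atom.append(c);
            }
        }
        if (atom.length() > 0) {
            tokens.add(atom.toString());
        }
        return tokens;
    }

    /** Emits one node per line, two spaces of indent per nesting level. */
    static String indent(String tree) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean nextIsRuleName = false;
        for (String token : tokenize(tree)) {
            if (token.equals("(")) {
                nextIsRuleName = true;       // the next atom names a rule with children
            } else if (token.equals(")")) {
                depth--;                     // leave the current rule
            } else {
                for (int i = 0; i < depth; i++) {
                    out.append("  ");
                }
                out.append(token).append('\n');
                if (nextIsRuleName) {
                    depth++;                 // children of this rule are one level deeper
                    nextIsRuleName = false;
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(indent("(singleStatement (statement (query SELECT 1)) <EOF>)"));
    }
}
```

A rule with no children appears in the LISP form without parentheses, so the sketch prints it like any other leaf.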
| Flag | Effect |
|---|---|
| `--no-backslash-escapes` | Maps to MySQL's NO_BACKSLASH_ESCAPES sql_mode: backslash is not a string-literal escape character |
| `--ansi` | Enables ANSI SQL syntax variants in the few grammar rules that branch on it |
| Code | Meaning |
|---|---|
| 0 | Parse succeeded |
| 1 | Parse failed — ParseException thrown; the error message is printed to stderr with the offending line/column and a ^^^ pointer |
| 2 | Usage error or I/O error (bad flag, unreadable file, empty input) |
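
When the CLI is driven from another JVM process, the exit codes above can be mapped to an outcome as in the following sketch. The class and enum names are hypothetical; the only things taken from this document are the `-e` flag and the meaning of codes 0, 1, and 2. The jar path is supplied by the caller.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the CLI jar; not part of fe-sql-parser.
final class DorisCliRunner {

    enum Outcome { PARSED, SYNTAX_ERROR, USAGE_OR_IO_ERROR, UNKNOWN }

    /** Translates the documented exit codes into an outcome. */
    static Outcome classify(int exitCode) {
        switch (exitCode) {
            case 0:
                return Outcome.PARSED;
            case 1:
                return Outcome.SYNTAX_ERROR;
            case 2:
                return Outcome.USAGE_OR_IO_ERROR;
            default:
                return Outcome.UNKNOWN; // anything else is not documented
        }
    }

    /** Runs the CLI on one SQL string; the tree and any error go to this process's stdout/stderr. */
    static Outcome run(String cliJarPath, String sql) throws IOException, InterruptedException {
        List<String> command = Arrays.asList("java", "-jar", cliJarPath, "-e", sql);
        Process process = new ProcessBuilder(command).inheritIO().start();
        return classify(process.waitFor());
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DorisCliRunner /path/to/cli.jar "SELECT 1"
        System.out.println(run(args[0], args[1]));
    }
}
```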
Single statement, default LISP format:

    $ java -jar ...-cli.jar "SELECT 1"
    (singleStatement (statement (statementBase (query (queryTerm (queryPrimary (querySpecification (selectClause SELECT (selectColumnClause (namedExpressionSeq (namedExpression (expression (booleanExpression (valueExpression (primaryExpression (constant (number 1)))))))))) queryOrganization))) queryOrganization))) <EOF>)

Single statement, pretty format:

    $ java -jar ...-cli.jar --pretty "SELECT a FROM t WHERE a > 1"
    singleStatement
      statement
        statementBase
          query
            queryTerm
              queryPrimary
                querySpecification
                  selectClause
                    'SELECT'
                    ...
                  fromClause
                    'FROM'
                    ...
                  whereClause
                    'WHERE'
                    ...
      '<EOF>'

Multiple statements:
$ java -jar ...-cli.jar --multi "USE db1; SELECT 1; SELECT 2"
Single expression:
$ java -jar ...-cli.jar --expression "a + 1 * COALESCE(b, 0)"
From file:
$ java -jar ...-cli.jar -f path/to/my-query.sql
From stdin (pipe a heredoc or another command's output):
$ cat my-query.sql | java -jar ...-cli.jar
Parse error — note the non-zero exit code:

    $ java -jar ...-cli.jar "SELEKT 1"
    mismatched input 'SELEKT' expecting {...}(line 1, pos 0)
    $ echo $?
    1

For frequent use, drop a wrapper script on your PATH, for example at ~/bin/doris-sql-parse:

    #!/usr/bin/env bash
    exec java -jar /path/to/fe-sql-parser-1.2-SNAPSHOT-cli.jar "$@"

Make it executable, then call it like any other command:

    chmod +x ~/bin/doris-sql-parse
    doris-sql-parse --pretty "SELECT 1"

To embed the parser in another JVM application rather than shelling out to the CLI, add the library as a dependency:

    <dependency>
      <groupId>org.apache.doris</groupId>
      <artifactId>fe-sql-parser</artifactId>
      <version>1.2-SNAPSHOT</version>
    </dependency>

    <!-- antlr4-runtime is pulled in transitively; declare it explicitly
         if you want to pin a specific version -->
    <dependency>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
      <version>4.13.1</version>
    </dependency>

Until the artifact is published to a public repository you need to mvn install it locally (see Library jar above).

    import org.apache.doris.sqlparser.DorisSqlParser;
    import org.apache.doris.nereids.DorisParser.SingleStatementContext;

    DorisSqlParser parser = new DorisSqlParser();
    SingleStatementContext tree = parser.parseStatement("SELECT a, b FROM t WHERE a > 1");
    // `tree` is a standard ANTLR ParseTree; walk it with a Visitor or Listener.


    import org.apache.doris.nereids.DorisParser;
    import org.apache.doris.nereids.DorisParser.SingleStatementContext;
    import org.apache.doris.nereids.DorisParserBaseVisitor;
    import org.apache.doris.sqlparser.DorisSqlParser;
    import java.util.ArrayList;
    import java.util.List;

    DorisSqlParser parser = new DorisSqlParser();
    SingleStatementContext tree = parser.parseStatement(
            "SELECT u.id FROM users u JOIN orders o ON u.id = o.uid");

    // Collect the name of every table referenced in the statement.
    List<String> tables = new ArrayList<>();
    new DorisParserBaseVisitor<Void>() {
        @Override
        public Void visitTableName(DorisParser.TableNameContext ctx) {
            tables.add(ctx.multipartIdentifier().getText());
            return super.visitTableName(ctx);
        }
    }.visit(tree);
    System.out.println(tables); // [users, orders]

DorisParserBaseVisitor<T> and DorisParserBaseListener are generated by ANTLR — every grammar rule has a corresponding visitXxx / enterXxx / exitXxx method you can override.
ParseException is a RuntimeException. You do not have to declare or catch it, but you usually want to:

    import org.apache.doris.nereids.exceptions.ParseException;

    try {
        parser.parseStatement("SELEKT 1");
    } catch (ParseException e) {
        // e.getMessage() includes "line N, pos M" and a `^^^` pointer into the SQL.
        System.err.println(e.getMessage());
    }

If you only need tokens (SQL formatter, comment extractor, keyword finder, hint inspector), skip the parser:

    import org.apache.doris.nereids.DorisLexer;
    import org.apache.doris.sqlparser.DorisSqlParser;
    import org.antlr.v4.runtime.Token;

    DorisSqlParser parser = new DorisSqlParser();
    DorisLexer lexer = parser.newLexer("SELECT /*+ HINT */ a FROM t");

    // Print each token's symbolic name next to its text until EOF.
    Token token;
    while ((token = lexer.nextToken()).getType() != Token.EOF) {
        System.out.printf("%-20s %s%n",
                DorisLexer.VOCABULARY.getSymbolicName(token.getType()),
                token.getText());
    }

Downstream projects can plug in custom logic (lineage tracking, policy enforcement, audit, SQL rewriting, metrics) without modifying fe-sql-parser itself. There are four extension points:
| Mechanism | When it fires | Typical use |
|---|---|---|
| Subclass `DorisParserBaseVisitor<T>` | After parsing, when you call `visitor.visit(tree)` | Extract information, rewrite, lineage |
| Subclass `DorisParserBaseListener` | After parsing, when you call `ParseTreeWalker.walk(...)` | Simple enter/exit interception |
| `parser.addParseListener(...)` | Live, while the parser is building the tree | Token-level processing, on-the-fly mutation |
| Wrap `DorisSqlParser` | Around the `parseStatement` call | Metrics, caching, request-level policy |
All ANTLR-generated classes (DorisParser, DorisParserBaseVisitor, DorisParserBaseListener) and the DorisSqlParser facade are public, so downstream code uses them directly.
The most common pattern. Extract “which tables were read” and “which table was written” from a single statement.

    import org.apache.doris.nereids.DorisParser;
    import org.apache.doris.nereids.DorisParserBaseVisitor;
    import org.apache.doris.sqlparser.DorisSqlParser;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class LineageExtractor extends DorisParserBaseVisitor<Void> {
        public final Set<String> sources = new LinkedHashSet<>();
        public String target;

        // INSERT INTO target_db.target_tbl SELECT ... FROM source ...
        @Override
        public Void visitInsertTable(DorisParser.InsertTableContext ctx) {
            target = ctx.tableName.getText();
            return super.visitInsertTable(ctx); // keep descending to collect sources
        }

        // Any FROM <table> / JOIN <table> hits this
        @Override
        public Void visitTableName(DorisParser.TableNameContext ctx) {
            sources.add(ctx.multipartIdentifier().getText());
            return null;
        }
    }

    // Usage
    DorisSqlParser parser = new DorisSqlParser();
    LineageExtractor lineage = new LineageExtractor();
    lineage.visit(parser.parseStatement(
            "INSERT INTO sink SELECT a.x, b.y FROM src1 a JOIN src2 b ON a.id = b.id"));
    System.out.println(lineage.target);  // sink
    System.out.println(lineage.sources); // [src1, src2]

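
The extractor above returns `super.visitInsertTable(ctx)` to keep descending but returns `null` from `visitTableName`, which stops the traversal at that node. The difference can be seen in isolation with a stand-in tree. Everything in this sketch (`Node`, `BaseVisitor`, `Collector`) is hypothetical and only mimics the recursion behavior of an ANTLR base visitor; it does not use the generated Doris classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in types that imitate "default = recurse into children".
final class PruneDemo {

    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Plays the role of a generated base visitor: the default visit walks all children. */
    static class BaseVisitor {
        void visit(Node node) {
            for (Node child : node.children) {
                visit(child);
            }
        }
    }

    /** Records every node it reaches; optionally skips the call to super. */
    static final class Collector extends BaseVisitor {
        final List<String> seen = new ArrayList<>();
        private final boolean callSuper;

        Collector(boolean callSuper) {
            this.callSuper = callSuper;
        }

        @Override
        void visit(Node node) {
            seen.add(node.name);
            if (callSuper) {
                super.visit(node); // keep descending
            }
            // Without the super call the subtree below this node is never reached.
        }
    }

    static List<String> collect(boolean callSuper) {
        Node tree = new Node("insert",
                new Node("target"),
                new Node("query", new Node("src1"), new Node("src2")));
        Collector collector = new Collector(callSuper);
        collector.visit(tree);
        return collector.seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(true));
        System.out.println(collect(false));
    }
}
```

With the super call the collector reaches all five nodes; without it only the root is recorded.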
For column-level lineage, also override visitColumnReference / visitNamedExpression and maintain a stack of “current SELECT scope” so each column reference can be attributed to the right output column.
Use the listener pattern when you only care whether the parser entered a certain rule, not its return value.

    import org.apache.doris.nereids.DorisParser;
    import org.apache.doris.nereids.DorisParserBaseListener;
    import org.antlr.v4.runtime.tree.ParseTreeWalker;

    public class DropGuardListener extends DorisParserBaseListener {
        @Override
        public void enterSupportedDropStatement(DorisParser.SupportedDropStatementContext ctx) {
            throw new SecurityException("DROP statements are not allowed: " + ctx.getText());
        }
    }

    // Usage
    ParseTreeWalker.DEFAULT.walk(
            new DropGuardListener(),
            parser.parseStatement(userSql));

Audit-style collection:

    import org.apache.doris.nereids.DorisParser;
    import org.apache.doris.nereids.DorisParserBaseListener;
    import java.util.ArrayList;
    import java.util.List;

    // Records one entry per write-like statement encountered during the walk.
    public class AuditListener extends DorisParserBaseListener {
        public final List<String> writes = new ArrayList<>();

        @Override
        public void enterInsertTable(DorisParser.InsertTableContext ctx) {
            writes.add("INSERT " + ctx.tableName.getText());
        }

        @Override
        public void enterUpdate(DorisParser.UpdateContext ctx) {
            writes.add("UPDATE " + ctx.tableName.getText());
        }

        @Override
        public void enterDelete(DorisParser.DeleteContext ctx) {
            writes.add("DELETE " + ctx.tableName.getText());
        }

        @Override
        public void enterSupportedDropStatement(DorisParser.SupportedDropStatementContext ctx) {
            writes.add("DROP " + ctx.getText());
        }
    }

ParseTreeListener: fire during parsing
Most cases are covered by Examples 1 and 2. If you need to intervene while the parser is building each node (mutating tokens, injecting metadata, streaming work), attach a listener with parser.addParseListener(...). This is exactly how fe-sql-parser's internal PostProcessor rewrites identifier case at parse time.
DorisSqlParser.parseStatement does not expose the parser instance; use newLexer + newParser to take ownership:

    import org.apache.doris.nereids.DorisLexer;
    import org.apache.doris.nereids.DorisParser;
    import org.apache.doris.nereids.DorisParserBaseListener;
    import org.apache.doris.sqlparser.DorisSqlParser;
    import java.util.ArrayList;
    import java.util.List;

    public class HintCollectorListener extends DorisParserBaseListener {
        public final List<String> hints = new ArrayList<>();

        @Override
        public void exitOptimizeHint(DorisParser.OptimizeHintContext ctx) {
            hints.add(ctx.getText());
        }
    }

    // Usage; `sql` holds the statement text to parse.
    DorisSqlParser facade = new DorisSqlParser();
    DorisLexer lexer = facade.newLexer(sql);
    DorisParser parser = facade.newParser(lexer);

    HintCollectorListener hintListener = new HintCollectorListener();
    parser.addParseListener(hintListener); // fires while the tree is being built

    DorisParser.SingleStatementContext tree = parser.singleStatement();
    System.out.println(hintListener.hints);

newParser already attaches PostProcessor and ParseErrorListener; your listener is added on top.
For “do something before and after every parse” (instrumentation, PII redaction, request-level routing), composition is the cleanest pattern:

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import io.micrometer.core.instrument.MeterRegistry;
    import org.apache.doris.nereids.DorisParser;
    import org.apache.doris.sqlparser.DorisSqlParser;

    import static java.util.concurrent.TimeUnit.NANOSECONDS;

    public class InstrumentedDorisSqlParser {
        private final DorisSqlParser delegate;
        private final Cache<String, DorisParser.SingleStatementContext> cache;
        private final MeterRegistry metrics;

        public InstrumentedDorisSqlParser(MeterRegistry metrics) {
            this.delegate = new DorisSqlParser();
            this.cache = Caffeine.newBuilder().maximumSize(10_000).build();
            this.metrics = metrics;
        }

        public DorisParser.SingleStatementContext parse(String sql) {
            // pre-hook: redact literals so semantically equivalent queries share a cache entry
            String normalized = redactLiterals(sql);
            return cache.get(normalized, key -> {
                long start = System.nanoTime();
                try {
                    return delegate.parseStatement(key);
                } finally {
                    metrics.timer("sql.parse").record(System.nanoTime() - start, NANOSECONDS);
                }
            });
        }

        // Placeholder: literal redaction is application-specific and not part of fe-sql-parser.
        private String redactLiterals(String sql) {
            return sql;
        }
    }

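
The wrapper above calls `redactLiterals` without defining it. One possible shape for such a helper is sketched below, purely as an assumption: it replaces single-quoted strings and standalone numbers with `?` using regular expressions. It ignores comments and back-quoted identifiers, and this document does not say whether the Doris grammar accepts `?` in place of a literal, so it may be safer to use the redacted text only as a cache key and to parse the original SQL.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of fe-sql-parser.
final class LiteralRedactor {

    // A single-quoted string; '' and backslash-escaped characters stay inside the literal.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("'(?:[^'\\\\]|\\\\.|'')*'");

    // An integer or decimal number that is not part of an identifier such as t1 or col_2.
    private static final Pattern NUMBER_LITERAL =
            Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    /** Replaces string and numeric literals with '?'. Strings go first so digits inside them are not touched twice. */
    static String redactLiterals(String sql) {
        String withoutStrings = STRING_LITERAL.matcher(sql).replaceAll("?");
        return NUMBER_LITERAL.matcher(withoutStrings).replaceAll("?");
    }

    public static void main(String[] args) {
        System.out.println(redactLiterals("SELECT a FROM t1 WHERE a > 10 AND b = 'x''y'"));
    }
}
```

Two statements that differ only in their literal values then map to the same key.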
Different teams can maintain their own hook classes; you do not need to merge them into one giant visitor. ParseTreeWalker can walk the same tree multiple times:

    ParseTree tree = parser.parseStatement(sql);

    LineageExtractor lineage = new LineageExtractor();
    AuditListener audit = new AuditListener();
    HintCollectorListener hints = new HintCollectorListener();

    lineage.visit(tree);
    ParseTreeWalker.DEFAULT.walk(audit, tree);
    ParseTreeWalker.DEFAULT.walk(hints, tree);

- Each visitXxx / enterXxx / exitXxx method corresponds 1:1 to a xxx: rule (or a # xxx labeled alternative) in DorisParser.g4. Open DorisParserBaseVisitor in your IDE to see the full list, or run the CLI with --pretty to see the actual rule names that appear in the tree for your SQL, then target them in your visitor.
- Call super.visitXxx(ctx): a visitor's default behavior is to recurse into children. If you override a method and forget super, nothing below the current node will be visited. Either return super.visitXxx(ctx) to keep recursing, or return null to explicitly prune.
- Reuse fe-sql-parser's own error-location plumbing. If you need to fail inside a visitor, throw an exception that carries Origin-style line/column info (see ParserUtils.position(Token)).
- The CLI's --pretty output tells you exactly what rule names show up for any SQL, which is much faster than guessing.

DorisSqlParser is configured via two constructor flags. Both default to false, which matches the most common Doris query behavior.

    DorisSqlParser parser = new DorisSqlParser(
            /* noBackslashEscapes = */ false,
            /* ansiSqlSyntax      = */ false
    );

| Flag | Effect |
|---|---|
| `noBackslashEscapes` | When true, `\` inside string literals is a literal backslash rather than an escape character. Matches MySQL's NO_BACKSLASH_ESCAPES sql_mode. |
| `ansiSqlSyntax` | When true, enables ANSI SQL behavior in a small number of grammar rules (mainly around GROUP BY / ORDER BY resolution). Matches the enable_ansi_query_organization_behavior Doris session variable. |
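
To make the effect of `noBackslashEscapes` concrete, the sketch below finds where a single-quoted literal ends under each setting. It is an illustration of the MySQL-style rule only, not the Doris lexer's code: the class and method names are hypothetical, and it assumes a doubled quote (`''`) stays inside the literal in both modes.

```java
// Hypothetical scanner illustrating NO_BACKSLASH_ESCAPES semantics.
final class StringLiteralScanner {

    /**
     * Returns the index just past the closing quote of the literal that
     * starts at {@code start}, or -1 if the literal is unterminated.
     */
    static int literalEnd(String sql, int start, boolean noBackslashEscapes) {
        if (start >= sql.length() || sql.charAt(start) != '\'') {
            throw new IllegalArgumentException("no single-quoted literal at index " + start);
        }
        int i = start + 1;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (c == '\\' && !noBackslashEscapes) {
                i += 2;                       // backslash escapes the next character
            } else if (c == '\'') {
                if (i + 1 < sql.length() && sql.charAt(i + 1) == '\'') {
                    i += 2;                   // doubled quote stays inside the literal
                } else {
                    return i + 1;             // closing quote
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String sql = "'a\\' AND b = 'c'";     // the SQL text is: 'a\' AND b = 'c'
        System.out.println(sql.substring(0, literalEnd(sql, 0, false)));
        System.out.println(sql.substring(0, literalEnd(sql, 0, true)));
    }
}
```

With escapes enabled, `\'` does not close the literal, so it runs on to the next quote; with `noBackslashEscapes` set, the literal ends right after the backslash.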
ParserUtils.withOrigin pushes the current ANTLR rule's line/column onto a per-thread stack so that ParseException can report the exact source location of any error raised during tree construction. By default this uses a ThreadLocal; threads that run the parser on a hot path can opt into a faster field-based storage by implementing org.apache.doris.nereids.parser.OriginAware:
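
    import org.apache.doris.nereids.parser.Origin;
    import org.apache.doris.nereids.parser.OriginAware;

    // A thread that stores the current Origin in a plain field instead of a ThreadLocal.
    public class MyParserThread extends Thread implements OriginAware {
        private Origin origin;

        @Override
        public Origin getOrigin() {
            return origin;
        }

        @Override
        public void setOrigin(Origin o) {
            this.origin = o;
        }
    }
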
Any thread that does not implement OriginAware falls back to the ThreadLocal path. Correctness is identical either way; the fast path saves one ThreadLocal hash lookup per withOrigin call.
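
The dispatch between the two storage paths can be pictured roughly as follows. This is a stand-alone sketch and not the module's actual code: `PositionAware` and the `String` position are stand-ins for `OriginAware` and `Origin`, and all names are hypothetical.

```java
// Stand-in sketch of "field on the thread if available, ThreadLocal otherwise".
final class OriginStorageSketch {

    /** Plays the role of OriginAware. */
    interface PositionAware {
        String getPosition();
        void setPosition(String position);
    }

    private static final ThreadLocal<String> FALLBACK = new ThreadLocal<>();

    static void set(String position) {
        Thread current = Thread.currentThread();
        if (current instanceof PositionAware) {
            ((PositionAware) current).setPosition(position); // fast path: plain field write
        } else {
            FALLBACK.set(position);                          // default path: ThreadLocal
        }
    }

    static String get() {
        Thread current = Thread.currentThread();
        if (current instanceof PositionAware) {
            return ((PositionAware) current).getPosition();
        }
        return FALLBACK.get();
    }

    /** A thread that opts into the field-based path. */
    static final class FastThread extends Thread implements PositionAware {
        private String position;

        FastThread(Runnable body) {
            super(body);
        }

        @Override
        public String getPosition() {
            return position;
        }

        @Override
        public void setPosition(String position) {
            this.position = position;
        }
    }

    /** Stores and reads back a position on a FastThread; returns null if the ThreadLocal was used. */
    static String roundTripOnFastThread(String position) {
        String[] result = new String[1];
        FastThread thread = new FastThread(() -> {
            set(position);
            if (FALLBACK.get() == null) {
                result[0] = get();
            }
        });
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }
}
```

Callers see the same value on either path; only the storage location differs.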
DorisSqlParser is stateless aside from its constructor flags and can be reused as a shared singleton across threads. Each parse call constructs a fresh Lexer, TokenStream, and Parser internally.
- Even if you only care about SELECT, you still parse with the full parser and just visit the relevant subtree.
- Identifiers such as t, a, u.id come back as syntactic tokens. Resolving them against a catalog requires additional logic in your application.
- antlr4-runtime:4.13.1 is a transitive dependency of the thin jar. Align with this version in your project or you will hit NoSuchMethodError at runtime.
- The CLI jar bundles antlr4-runtime, so it has no classpath conflicts when run with java -jar.
- Until the artifact is published to a public repository, run mvn install -Pflatten or pull it from an internal repository.