| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # Apache Paimon Security Threat Model |
| |
| This document describes Apache Paimon's detailed security threat model for |
| maintainers and automated security triage. |
| |
| It complements the shorter public-facing security model in |
| [`docs/docs/project/security.md`](docs/docs/project/security.md) (published on the project website) by making |
| Paimon's trust assumptions, security boundaries, and recurring non-security |
| bug classes more explicit. |
| |
| ## Purpose |
| |
| Apache Paimon is a streaming data lake platform that is often deployed as a |
| library and integration layer inside larger systems (Flink, Spark, Hive, and |
| other query engines) that provide their own authentication, authorization, and |
| credential management. Because of that deployment model, many bug classes that |
| look security-relevant in the abstract are not actually security |
| vulnerabilities in Paimon itself. |
| |
| This model is intended to answer: |
| |
| - what Paimon generally treats as a security vulnerability |
| - what Paimon generally treats as correctness, hardening, or deployment work |
| - which boundaries are primarily owned by Paimon versus the surrounding |
| catalog, engine, or service |
| - which issue classes scanners should downgrade by default |
| |
| ## Scope |
| |
| This model is scoped to the Apache Paimon project itself: |
| |
| - the table format implementation (paimon-core) |
| - client libraries (paimon-api, paimon-common) |
| - the REST Catalog client and protocol (paimon-api, paimon-core) |
| - engine integrations (Flink, Spark, Hive connectors) |
| - the Python client (pypaimon) |
| |
| It is not a general threat model for every deployment that embeds Paimon. |
| |
| In particular, it does not attempt to define the complete security model for: |
| |
| - query engines or applications that embed Paimon |
| - storage-level authorization enforced outside Paimon |
| - REST Catalog server implementations (Paimon defines the client and protocol, |
| not the server) |
| |
| ## Security Goals |
| |
| Paimon should: |
| |
| - avoid exposing secrets or delegated credentials to principals that were not |
| already trusted with them |
| - avoid creating new unauthorized capabilities in Paimon-owned components or |
| integrations |
| - avoid violating trust boundaries that Paimon itself owns, such as leaking |
| auth, signer, or credential-bearing state across catalog or session |
| boundaries in the same process |
| - avoid leaking delegated storage tokens (data tokens) across table or |
| principal boundaries |
| |
| Paimon does not aim to be the primary enforcement point for: |
| |
| - user-to-user authorization inside a query engine |
| - storage-level authorization (e.g., object store IAM policies) |
| - service-side authorization performed by a REST Catalog server |
| - row-level or column-level access control (Paimon relays server-provided |
| row filters and column masking rules, but enforcement is shared between the |
| server and the engine integration) |
| |
| ## Roles |
| |
| ### Operator |
| |
| The operator deploys and configures the catalog, REST Catalog server, engine, |
| and storage integration around Paimon. This role is trusted to choose |
| endpoints, warehouses, and storage integrations, configure credentials, and |
| decide which users may create, read, or modify tables. |
| |
| ### Catalog Control Plane |
| |
| The catalog control plane is responsible for resolving tables and supplying |
| metadata, locations, configuration, and delegated credentials to Paimon. |
| This role may be implemented by: |
| |
| - a REST Catalog server |
| - a Hive Metastore |
| - a JDBC-backed catalog |
| - a filesystem-based catalog |
| |
| Regardless of implementation, it should not expose secrets to unintended |
| principals or leak credential-bearing state across unintended boundaries. |
| |
| Paimon assumes a trusted catalog or metastore, which is outside its primary |
| security boundary. |
| |
| ### REST Catalog Server |
| |
| In REST deployments, part of the catalog control plane is implemented by a |
| server that returns metadata, configuration, delegated storage credentials |
| (data tokens), and query-level authorization (row filters and column masking) |
| to the client. This server is generally treated as a trusted control-plane |
| component. |
| |
| The REST Catalog server is responsible for: |
| |
| - authenticating clients |
| - authorizing catalog operations (create/drop/alter databases, tables, views, |
| functions) |
| - issuing scoped, time-limited data tokens for storage access |
| - providing row-level filters and column masking rules via the auth table |
| query API |
| - returning server-side configuration to merge with client configuration |
| |
| ### REST Catalog Client |
| |
| In REST deployments, the client-side catalog (`RESTCatalog`, `RESTApi`) |
| consumes server-provided metadata, configuration, and credentials. Where the |
| client and server are meaningfully distinct, client-side bugs in token |
| handling, caching, or reuse may still be security-relevant. This is especially |
| true when the Paimon-owned client implementation leaks credential-bearing |
| state across catalog, session, or principal boundaries it is expected to |
| preserve. |
| |
| The REST Catalog client is responsible for: |
| |
| - sending authenticated requests using a configured `AuthProvider` |
| - refreshing tokens before expiration (with a configurable safe time margin) |
| - caching `FileIO` instances keyed by data token (via `RESTTokenFileIO`) |
| and evicting them when tokens expire |
| - not mixing data tokens or auth state across different catalog instances or |
| tables in the same process |
| |
| ### Engine or Embedding Application |
| |
| Query engines (Flink, Spark, Hive, Trino, StarRocks, etc.) and applications |
| may expose only a subset of Paimon capabilities to users. They are responsible |
| for their own user-facing authorization boundaries unless Paimon explicitly |
| documents otherwise. |
| |
| ### Table Writer or Maintainer |
| |
| This role may already have legitimate power to write or replace table |
| metadata, write or delete data files, manage snapshots, create or delete |
| branches and tags, and invoke destructive maintenance operations (compaction, |
| expiration, rollback). If a report only shows a new way to achieve an effect |
| that this role can already cause legitimately, it is usually not a security |
| issue in Paimon. |
| |
| ## Trust Boundaries |
| |
| ### Boundary 1: Operator-Trusted Configuration |
| |
| The following are generally treated as trusted operator or deployment inputs: |
| |
| - catalog properties (including `uri`, `warehouse`, `token.provider`) |
| - REST Catalog server endpoint configuration |
| - warehouse and storage roots |
| - authentication credentials |
| - Kerberos keytab paths and principal names |
| (`security.kerberos.login.keytab`, `security.kerberos.login.principal`) |
| - metastore wiring (Hive Metastore URI, JDBC connection strings) |
| - custom HTTP headers (`header.*`) |
| |
| If a report depends on the attacker controlling those values directly, it is |
| usually not a vulnerability in Paimon itself. |
| |
| ### Boundary 2: Catalog-Supplied Metadata |
| |
| Paimon often accepts metadata locations, table properties, database |
| properties, schema definitions, and related control-plane information from a |
| catalog or metastore. By default, Paimon treats those sources as trusted. |
| |
| This means a malicious or compromised catalog that supplies incorrect or |
| hostile metadata is usually not a Paimon vulnerability by itself. |
| |
| ### Boundary 3: REST Catalog Server-Supplied Configuration and Delegated Storage Access |
| |
| In REST deployments, Paimon accepts the following from the REST Catalog server: |
| |
| - **Server configuration**: merged into client options via the `/v1/config` |
| endpoint, including catalog prefix and additional headers |
| - **Data tokens**: time-limited storage credentials returned by the |
| `/v1/{prefix}/databases/{database}/tables/{table}/token` endpoint, used by |
| `RESTTokenFileIO` to access the underlying object store |
| - **Auth table query responses**: row-level filters and column masking rules |
| returned by the `/v1/{prefix}/databases/{database}/tables/{table}/auth` |
| endpoint |
| |
| By default, these are treated as trusted control-plane inputs unless Paimon |
| explicitly documents a stronger guarantee. |
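
The two table-scoped endpoints listed above embed the database and table names in the URL path, so each request is bound to exactly one table. The following is a minimal Python sketch of building those paths. The helper names are hypothetical, and percent-encoding each segment is an assumption made for illustration, not a description of how the Paimon client builds URLs.

```python
from urllib.parse import quote


def _segment(value: str) -> str:
    """Percent-encode one path segment, including any '/' it contains."""
    return quote(value, safe="")


def table_token_path(prefix: str, database: str, table: str) -> str:
    """Path of the data token endpoint for exactly one table."""
    return (
        f"/v1/{_segment(prefix)}/databases/{_segment(database)}"
        f"/tables/{_segment(table)}/token"
    )


def table_auth_path(prefix: str, database: str, table: str) -> str:
    """Path of the auth table query endpoint for exactly one table."""
    return (
        f"/v1/{_segment(prefix)}/databases/{_segment(database)}"
        f"/tables/{_segment(table)}/auth"
    )
```

Encoding `/` inside a name keeps a crafted database or table name from adding path segments and thereby addressing a different table's endpoint.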
| |
| This means a malicious REST Catalog server sending dangerous configuration or |
| overly broad data tokens is usually not a Paimon vulnerability by itself. It |
| also means that client-side token-selection bugs are often correctness or |
| specification issues rather than security boundary failures. |
| |
| The major exception is **secret exposure**. If Paimon surfaces credentials or |
| secrets to a new audience that was not already trusted with them, that is |
| security-relevant. In particular: |
| |
| - data tokens for one table leaking to operations on a different table |
| - auth state from one catalog instance leaking into another |
| - credentials appearing in logs, error messages, or serialized state |
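
As an illustration of the last item, the sketch below shows one way to keep credential values out of log output using Python's standard `logging` module. The key-name pattern and the `RedactingFilter` class are hypothetical and would need to match whatever names a real deployment uses; this is not Paimon's logging code.

```python
import logging
import re

# Hypothetical list of sensitive key names, for illustration only.
_SENSITIVE = re.compile(
    r"(?i)\b(token|secret|password|access[-_.]?key)\b(\s*[=:]\s*)(\S+)"
)


def redact(text: str) -> str:
    """Replace the value of any 'key=value' or 'key: value' pair whose key looks sensitive."""
    return _SENSITIVE.sub(lambda m: f"{m.group(1)}{m.group(2)}***", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so that secrets passed as %-style arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True
```

Attaching the filter to a handler, rather than to individual loggers, covers every logger that writes through that handler.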
| |
| ### Boundary 4: Storage-Level Authorization |
| |
| Storage permissions (e.g., OSS or S3 policies, HDFS ACLs) are enforced by the |
| storage provider and the credentials the surrounding deployment chooses to |
| hand to Paimon. Paimon is not the root authority for bucket- or object-level |
| authorization. |
| |
| Reports that depend primarily on over-broad IAM policies or permissive |
| storage ACLs are usually deployment concerns rather than product-security |
| issues in Paimon. |
| |
| ### Boundary 5: Engine-Level User Authorization |
| |
| Paimon integrations may surface data and operations through a query engine or |
| application, but Paimon is not a complete user-authorization framework for |
| those systems. |
| |
| Paimon does provide a mechanism for the REST Catalog server to supply |
| row-level filters and column masking rules via `authTableQuery`, but |
| enforcement of those rules is a shared responsibility between the engine |
| integration and the catalog server. Paimon relays the rules; the engine |
| must apply them. |
| |
| ## In-Scope Security Vulnerabilities |
| |
| The following categories are generally security-relevant in Paimon when the |
| report is credible and reproducible. |
| |
| ### 1. Secret or Credential Disclosure to a New Audience |
| |
| Examples include: |
| |
| - catalog credentials exposed through a user-visible engine surface |
| (e.g., query results, EXPLAIN output, table properties) |
| - one catalog's credentials or auth state leaking into another catalog or |
| session within the same process |
| - data tokens for table A being used for (or exposed to) table B |
| - credentials or tokens logged at INFO or lower levels without redaction |
| - credentials surviving in serialized `RESTTokenFileIO` or `RESTApi` state |
| beyond their intended scope |
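
To make the serialized-state example concrete, here is a hedged Python sketch of one hardening pattern: serialize only the table identity and let the receiving side fetch a fresh credential. The `TableIOState` class and its wire format are hypothetical and do not describe how `RESTTokenFileIO` is actually serialized.

```python
import json
from typing import Optional


class TableIOState:
    """Hypothetical holder of a table identity plus its current data token."""

    def __init__(self, table: str, token: Optional[str] = None):
        self.table = table
        self.token = token

    def to_wire(self) -> str:
        # Only the non-secret identity is serialized; the token is deliberately omitted.
        return json.dumps({"table": self.table})

    @classmethod
    def from_wire(cls, payload: str) -> "TableIOState":
        # The receiver starts without a token and must obtain its own.
        return cls(json.loads(payload)["table"], token=None)
```

The trade-off is an extra token request after deserialization in exchange for never placing the credential in a job graph, checkpoint, or shuffle payload.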
| |
| ### 2. Paimon-Owned Trust-Boundary Violations |
| |
| Security issues exist when Paimon itself is expected to separate catalogs, |
| principals, or sessions and fails to do so. |
| |
| Examples include: |
| |
| - process-global auth provider or signer state crossing catalog instances |
| (e.g., the `FILE_IO_CACHE` in `RESTTokenFileIO` returning a `FileIO` |
| belonging to a different principal) |
| - a data token obtained for one table being reused for a different table's |
| data access |
| - auth header state from one `RESTApi` instance leaking into another |
| |
| ### 3. Row-Level and Column-Level Access Control Bypass |
| |
| If Paimon's client-side handling of `authTableQuery` responses (row filters |
| or column masking rules) allows a caller to bypass filters that the server |
| intended to enforce, that is security-relevant when the bypass occurs within |
| Paimon-owned code rather than in the engine integration. |
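
The sketch below illustrates what applying such rules means and why skipping either step is a bypass. It is a simplified Python model: the row filter is represented as a predicate and the masks as per-column functions, which is an assumption for illustration and not the format of real `authTableQuery` responses.

```python
from typing import Callable, Dict, Iterable, Iterator, Mapping

Row = Dict[str, object]


def apply_auth_rules(
    rows: Iterable[Row],
    row_filter: Callable[[Row], bool],
    column_masks: Mapping[str, Callable[[object], object]],
) -> Iterator[Row]:
    """Yield only rows the filter allows, with masked columns rewritten."""
    for row in rows:
        if not row_filter(row):
            continue  # the caller must never see this row
        yield {
            name: column_masks[name](value) if name in column_masks else value
            for name, value in row.items()
        }
```

Any read path that hands rows to the caller without passing through an equivalent of `apply_auth_rules` would be a bypass in this model.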
| |
| ## Usually Out of Scope or Non-Security by Default |
| |
| These categories may still be real bugs worth fixing, but they are not usually |
| security vulnerabilities in Paimon itself. |
| |
| ### 1. Correctness Bugs |
| |
| Examples: |
| |
| - wrong byte offsets or stale decoded values in file formats |
| - incorrect merge-tree compaction producing wrong query results |
| - race conditions or logic bugs that do not create a new trust-boundary |
| violation |
| - snapshot or schema version conflicts that produce incorrect metadata |
| |
| ### 2. Parser Hardening and Malformed-Input Robustness |
| |
| Malformed-input crashes, raw runtime exceptions from invalid JSON or Avro |
| data, and memory amplification from oversized manifests or schemas are usually |
| treated as robustness or hardening work rather than security issues in Paimon |
| itself. |
| |
| ### 3. Malicious Catalog, Metastore, or External Service Scenarios |
| |
| Reports that require a malicious catalog, metastore, REST Catalog server, or |
| other external service are usually outside Paimon's primary security boundary. |
| |
| Examples: |
| |
| - a REST Catalog server returning a data token with overly broad storage |
| permissions |
| - a Hive Metastore returning a table location pointing to a sensitive path |
| - a REST Catalog server returning malicious row filters designed to extract |
| data through side channels |
| |
| ### 4. Equivalent-Harm Reports |
| |
| If the actor already has a legitimate capability that can cause the same harm, |
| the new path is usually not a security issue. This often applies to writers or |
| maintainers who already control metadata layout, file layout, or destructive |
| maintenance operations (snapshot expiration, orphan file cleanup, branch |
| deletion). |
| |
| ### 5. Denial of Service Through Normal Operations |
| |
| Resource exhaustion caused by legitimate but expensive operations (e.g., large |
| compaction, scanning many partitions, listing all snapshots) is usually |
| treated as an operational concern rather than a security vulnerability. |
| |
| ## REST Catalog Specific Security Considerations |
| |
| ### Authentication |
| |
| Paimon's REST Catalog client supports pluggable authentication through the |
| `AuthProvider` interface. |
| |
| Authentication providers are created via the `AuthProviderFactory` SPI, loaded |
| using Java's `ServiceLoader` mechanism based on the `token.provider` |
| configuration. Each catalog instance holds its own in-process authentication |
| provider, and providers must not share mutable state across instances. |
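
The isolation requirement can be sketched in Python as a factory lookup that builds a fresh provider for each catalog. The registry below stands in for the `ServiceLoader` lookup; the `demo-bearer` identifier, the `demo.token` option, and the class names are hypothetical and are not Paimon's actual providers or option keys.

```python
from typing import Callable, Dict, Mapping


class DemoBearerAuthProvider:
    """Hypothetical provider; all state lives on the instance, none on the class."""

    def __init__(self, token: str):
        self._token = token

    def headers(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self._token}"}


# Stand-in for the factories an SPI lookup would discover.
FACTORIES: Dict[str, Callable[[Mapping[str, str]], DemoBearerAuthProvider]] = {
    "demo-bearer": lambda options: DemoBearerAuthProvider(options["demo.token"]),
}


def create_auth_provider(options: Mapping[str, str]) -> DemoBearerAuthProvider:
    """Build a new provider per call, selected by the `token.provider` option."""
    name = options["token.provider"]
    if name not in FACTORIES:
        raise ValueError(f"unknown token.provider: {name}")
    return FACTORIES[name](options)
```

Because the factory returns a new object on every call and keeps the token in instance fields, two catalogs in the same process cannot observe each other's headers in this sketch.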
| |
| ### Data Token Lifecycle |
| |
| When `data-token.enabled` is `true`, `RESTTokenFileIO` manages delegated |
| storage credentials: |
| |
| 1. The client calls the table token endpoint to obtain a time-limited data |
| token |
| 2. The token is cached and used to construct a `FileIO` instance for storage |
| access |
| 3. Tokens are refreshed before expiration (1-hour safe time margin by default) |
| 4. `FileIO` instances are cached in a process-global cache |
| (`FILE_IO_CACHE`) keyed by `RESTToken`, with a maximum size of 1000 |
| entries and 10-hour expiry |
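
Steps 1 to 3 can be sketched as a small holder that refetches a token once the remaining lifetime drops below the safe time margin. This is an illustrative Python model, not the `RESTTokenFileIO` implementation: `RefreshingToken`, the `fetch` callback, and the injectable clock are hypothetical names, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional, Tuple

ONE_HOUR_MS = 60 * 60 * 1000  # default safe time margin stated above


class RefreshingToken:
    """Holds one table's data token and refreshes it ahead of expiry."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, int]],  # returns (token, expire_at_ms)
        safe_margin_ms: int = ONE_HOUR_MS,
        clock_ms: Callable[[], int] = lambda: int(time.time() * 1000),
    ):
        self._fetch = fetch
        self._safe_margin_ms = safe_margin_ms
        self._clock_ms = clock_ms
        self._token: Optional[str] = None
        self._expire_at_ms = 0

    def get(self) -> str:
        """Return a token with at least the safe margin of lifetime remaining."""
        remaining = self._expire_at_ms - self._clock_ms()
        if self._token is None or remaining < self._safe_margin_ms:
            self._token, self._expire_at_ms = self._fetch()
        return self._token
```

Keeping one holder per table, rather than one per process, is what prevents a token fetched for one table from being handed to another.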
| |
| Security-relevant invariants: |
| |
| - Data tokens must be scoped to specific tables by the server |
| - The `FILE_IO_CACHE` keys on the full `RESTToken` (token content + |
| expiration), so different tokens produce different `FileIO` instances |
| - Token refresh creates a new `RESTApi` instance from the catalog context if |
| the original instance is unavailable (e.g., after deserialization) |
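
The keying invariant can be illustrated with a small bounded cache in Python. This is a sketch under stated assumptions: `FileIOCache` is a hypothetical stand-in for `FILE_IO_CACHE`, the token is modelled as a string-to-string mapping, and the 10-hour expiry is measured from insertion here, which may differ from the real cache's policy.

```python
import time
from collections import OrderedDict
from typing import Callable, Mapping

TEN_HOURS_MS = 10 * 60 * 60 * 1000


class FileIOCache:
    """Bounded cache keyed by the full token: its content plus its expiration."""

    def __init__(
        self,
        max_size: int = 1000,
        ttl_ms: int = TEN_HOURS_MS,
        clock_ms: Callable[[], int] = lambda: int(time.time() * 1000),
    ):
        self._max_size = max_size
        self._ttl_ms = ttl_ms
        self._clock_ms = clock_ms
        self._entries: "OrderedDict[tuple, tuple]" = OrderedDict()

    def get(
        self,
        token: Mapping[str, str],
        expire_at_ms: int,
        factory: Callable[[Mapping[str, str]], object],
    ) -> object:
        """Return the cached FileIO for this exact token, building it on a miss."""
        key = (tuple(sorted(token.items())), expire_at_ms)
        now = self._clock_ms()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_ms:
            self._entries.move_to_end(key)  # mark as most recently used
            return entry[1]
        file_io = factory(token)
        self._entries[key] = (now, file_io)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return file_io
```

In this model a `FileIO` is shared only between callers that present byte-for-byte identical credentials, so the cache cannot return another principal's `FileIO` unless the server issued both principals the same token.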
| |
| ### Kerberos |
| |
| Paimon supports Kerberos authentication for Hadoop-based deployments through |
| `SecurityContext` and `SecurityConfiguration`. Keytab paths and principals |
| are treated as trusted operator configuration. |
| |
| ## Scanner Calibration Rules |
| |
| A scanner targeting Paimon should treat a finding as higher-confidence only if |
| it plausibly shows one of the following: |
| |
| - exposure of a secret or delegated credential to a new audience |
| - creation of a new unauthorized capability in a Paimon-owned component |
| - violation of a Paimon-owned trust boundary (e.g., cross-catalog credential |
| leak, cross-table data token reuse) |
| |
| A finding should be downgraded or rejected by default if it instead depends |
| primarily on: |
| |
| - malformed-input robustness or denial-of-service behavior |
| - a malicious catalog, metastore, REST Catalog server, or external service |
| - a principal that already has equivalent power through legitimate write or |
| maintenance capabilities |
| - operator misconfiguration (overly broad credentials, missing TLS, etc.) |
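
The calibration rules above can be sketched as a small triage function. This Python sketch is illustrative only: the trait labels and result strings are hypothetical, and letting a demonstrated higher-confidence trait take precedence over a downgrade dependency is an assumption about how the two lists interact.

```python
from typing import Iterable, Optional

HIGHER_CONFIDENCE_TRAITS = {
    "secret_exposure_to_new_audience",
    "new_unauthorized_capability",
    "paimon_owned_boundary_violation",
}

DOWNGRADE_DEPENDENCIES = {
    "malformed_input_or_dos",
    "malicious_external_service",
    "equivalent_existing_power",
    "operator_misconfiguration",
}


def triage(shows: Iterable[str], depends_primarily_on: Optional[str] = None) -> str:
    """Classify a finding by what it shows and what it primarily depends on."""
    shown = set(shows)
    unknown = shown - HIGHER_CONFIDENCE_TRAITS
    if unknown:
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    if depends_primarily_on not in DOWNGRADE_DEPENDENCIES | {None}:
        raise ValueError(f"unknown dependency: {depends_primarily_on}")
    if shown:
        return "higher-confidence"
    if depends_primarily_on is not None:
        return "downgrade"
    return "not-higher-confidence"
```

For example, `triage(set(), "malicious_external_service")` returns `"downgrade"`, while a finding that shows a cross-table data token reuse would be passed as `{"paimon_owned_boundary_violation"}` and returns `"higher-confidence"`.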