title: “Python API” url: python-api-intro aliases: - “python/api-intro” menu: main: parent: “API” weight: 500

Iceberg Python API

Much of the python api conforms to the java api. You can get more info about the java api here.

Catalog

The Catalog interface, like java provides search and management operations for tables.

To create a catalog:

from iceberg.hive import HiveTables

# instantiate Hive Tables
conf = {"hive.metastore.uris": 'thrift://{hms_host}:{hms_port}',
        "hive.metastore.warehouse.dir": {tmpdir} }
tables = HiveTables(conf)

and to create a table from a catalog:

from iceberg.api.schema import Schema\
from iceberg.api.types import TimestampType, DoubleType, StringType, NestedField
from iceberg.api.partition_spec import PartitionSpecBuilder

schema = Schema(NestedField.optional(1, "DateTime", TimestampType.with_timezone()),
                NestedField.optional(2, "Bid", DoubleType.get()),
                NestedField.optional(3, "Ask", DoubleType.get()),
                NestedField.optional(4, "symbol", StringType.get()))
partition_spec = PartitionSpecBuilder(schema).add(1, 1000, "DateTime_day", "day").build()

tables.create(schema, "test.test_123", partition_spec)

Tables

The Table interface provides access to table metadata

  • schema returns the current table Schema
  • spec returns the current table PartitonSpec
  • properties returns a map of key-value TableProperties
  • currentSnapshot returns the current table Snapshot
  • snapshots returns all valid snapshots for the table
  • snapshot(id) returns a specific snapshot by ID
  • location returns the table’s base location

Tables also provide refresh to update the table to the latest version.

Scanning

Iceberg table scans start by creating a TableScan object with newScan.

scan = table.new_scan();

To configure a scan, call filter and select on the TableScan to get a new TableScan with those changes.

filtered_scan = scan.filter(Expressions.equal("id", 5))

String expressions can also be passed to the filter method.

filtered_scan = scan.filter("id=5")

Schema projections can be applied against a TableScan by passing a list of column names.

filtered_scan = scan.select(["col_1", "col_2", "col_3"])

Because some data types cannot be read using the python library, a convenience method for excluding columns from projection is provided.

filtered_scan = scan.select_except(["unsupported_col_1", "unsupported_col_2"])

Calls to configuration methods create a new TableScan so that each TableScan is immutable.

When a scan is configured, planFiles, planTasks, and Schema are used to return files, tasks, and the read projection.

scan = table.new_scan() \
    .filter("id=5") \
    .select(["id", "data"])

projection = scan.schema
for task in scan.plan_tasks():
    print(task)

Types

Iceberg data types are located in iceberg.api.types.types

Primitives

Primitive type instances are available from static methods in each type class. Types without parameters use get, and types like DecimalType use factory methods:

IntegerType.get()    # int
DoubleType.get()     # double
DecimalType.of(9, 2) # decimal(9, 2)

Nested types

Structs, maps, and lists are created using factory methods in type classes.

Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track field IDs and nullability.

Struct fields are created using NestedField.optional or NestedField.required. Map value and list element nullability is set in the map and list factory methods.

# struct<1 id: int, 2 data: optional string>
struct = StructType.of([NestedField.required(1, "id", IntegerType.get()),
                        NestedField.optional(2, "data", StringType.get()])
  )
# map<1 key: int, 2 value: optional string>
map_var = MapType.of_optional(1, IntegerType.get(),
                          2, StringType.get())
# array<1 element: int>
list_var = ListType.of_required(1, IntegerType.get());

Expressions

Iceberg’s Expressions are used to configure table scans. To create Expressions, use the factory methods in Expressions.

Supported Predicate expressions are:

  • is_null
  • not_null
  • equal
  • not_equal
  • less_than
  • less_than_or_equal
  • greater_than
  • greater_than_or_equal

Supported expression Operationsare:

  • and
  • or
  • not

Constant expressions are:

  • always_true
  • always_false