:mod:`airflow.providers.apache.druid.transfers.hive_to_druid`
=============================================================
.. py:module:: airflow.providers.apache.druid.transfers.hive_to_druid
.. autoapi-nested-parse::
This module contains an operator to move data from Hive to Druid.
Module Contents
---------------
.. data:: LOAD_CHECK_INTERVAL
:annotation: = 5
.. data:: DEFAULT_TARGET_PARTITION_SIZE
:annotation: = 5000000
.. py:class:: HiveToDruidOperator(*, sql: str, druid_datasource: str, ts_dim: str, metric_spec: Optional[List[Any]] = None, hive_cli_conn_id: str = 'hive_cli_default', druid_ingest_conn_id: str = 'druid_ingest_default', metastore_conn_id: str = 'metastore_default', hadoop_dependency_coordinates: Optional[List[str]] = None, intervals: Optional[List[Any]] = None, num_shards: float = -1, target_partition_size: int = -1, query_granularity: str = 'NONE', segment_granularity: str = 'DAY', hive_tblproperties: Optional[Dict[Any, Any]] = None, job_properties: Optional[Dict[Any, Any]] = None, **kwargs)
Bases: :class:`airflow.models.BaseOperator`
Moves data from Hive to Druid. Note that for now the data is loaded
into memory before being pushed to Druid, so this operator should
be used for small amounts of data.
:param sql: SQL query to execute against the Hive database. (templated)
:type sql: str
:param druid_datasource: the Druid datasource you want to ingest into
:type druid_datasource: str
:param ts_dim: the timestamp dimension
:type ts_dim: str
:param metric_spec: the metrics you want to define for your data
:type metric_spec: list
:param hive_cli_conn_id: the hive connection id
:type hive_cli_conn_id: str
:param druid_ingest_conn_id: the druid ingest connection id
:type druid_ingest_conn_id: str
:param metastore_conn_id: the metastore connection id
:type metastore_conn_id: str
:param hadoop_dependency_coordinates: list of coordinates to squeeze
into the ingest json
:type hadoop_dependency_coordinates: list[str]
:param intervals: list of time intervals that define the segments;
this is passed as-is to the json object. (templated)
:type intervals: list
:param num_shards: Directly specify the number of shards to create.
:type num_shards: float
:param target_partition_size: Target number of rows to include in a partition.
:type target_partition_size: int
:param query_granularity: The minimum granularity to be able to query results at and the granularity of
the data inside the segment. E.g. a value of "minute" will mean that data is aggregated at minutely
granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it
will aggregate values together using the aggregators instead of storing individual rows.
A granularity of 'NONE' means millisecond granularity.
:type query_granularity: str
:param segment_granularity: The granularity to create time chunks at. Multiple segments can be created per
time chunk. For example, with 'DAY' segmentGranularity, the events of the same day fall into the
same time chunk which can be optionally further partitioned into multiple segments based on other
configurations and input size.
:type segment_granularity: str
:param hive_tblproperties: additional properties for TBLPROPERTIES of
the Hive staging table
:type hive_tblproperties: dict
:param job_properties: additional properties for the ingestion job
:type job_properties: dict
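Below is a minimal usage sketch. The DAG id, schedule, Hive query, datasource
name, metric spec, and interval are illustrative assumptions, not values
defined by this module; only the operator class and the parameters documented
above come from the API.

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.druid.transfers.hive_to_druid import HiveToDruidOperator

    with DAG(
        dag_id="hive_to_druid_example",  # hypothetical DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Execute the Hive query and ingest the result set into the Druid datasource.
        hive_to_druid = HiveToDruidOperator(
            task_id="hive_to_druid",
            sql="SELECT ts, country, clicks FROM events WHERE ds = '{{ ds }}'",  # hypothetical query
            druid_datasource="events",  # hypothetical datasource
            ts_dim="ts",
            metric_spec=[{"type": "longSum", "name": "clicks", "fieldName": "clicks"}],
            segment_granularity="DAY",
            query_granularity="HOUR",
            intervals=["{{ ds }}/{{ macros.ds_add(ds, 1) }}"],  # one-day interval, templated
        )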
.. attribute:: template_fields
:annotation: = ['sql', 'intervals']
.. attribute:: template_ext
:annotation: = ['.sql']
.. method:: execute(self, context: Dict[str, Any])
.. method:: construct_ingest_query(self, static_path: str, columns: List[str])
Builds an ingest query for an HDFS TSV load.
:param static_path: The path on HDFS where the data is located
:type static_path: str
:param columns: List of all the columns that are available
:type columns: list
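The sketch below illustrates roughly how this method can be exercised. The
operator arguments, HDFS path, and column names are assumptions for
illustration; only the class and the method itself come from this module.

.. code-block:: python

    from airflow.providers.apache.druid.transfers.hive_to_druid import HiveToDruidOperator

    # Hypothetical configuration (values are not taken from this module).
    op = HiveToDruidOperator(
        task_id="hive_to_druid",
        sql="SELECT ts, country, clicks FROM events",  # hypothetical query
        druid_datasource="events",  # hypothetical datasource
        ts_dim="ts",
        metric_spec=[{"type": "longSum", "name": "clicks", "fieldName": "clicks"}],
    )

    # static_path and columns are assumptions: the HDFS location of the staged
    # TSV data and the columns it contains.
    ingest_spec = op.construct_ingest_query(
        static_path="/tmp/hive_to_druid_staging",  # hypothetical HDFS path
        columns=["ts", "country", "clicks"],
    )
    # ``ingest_spec`` is a dict describing a Druid batch ingestion task, built
    # from the operator's datasource, timestamp dimension, metrics, and
    # granularity settings.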