docs-archive/apache-airflow-providers-apache-druid/3.0.0/_sources/_api/airflow/providers/apache/druid/transfers/hive_to_druid/index.rst.txt - airflow-site - Git at Google

 :py:mod:`airflow.providers.apache.druid.transfers.hive_to_druid`
 ================================================================

 .. py:module:: airflow.providers.apache.druid.transfers.hive_to_druid

 .. autoapi-nested-parse::

    This module contains operator to move data from Hive to Druid.


 Module Contents
 ---------------

 Classes
 ~~~~~~~

 .. autoapisummary::

    airflow.providers.apache.druid.transfers.hive_to_druid.HiveToDruidOperator


 Attributes
 ~~~~~~~~~~

 .. autoapisummary::

    airflow.providers.apache.druid.transfers.hive_to_druid.LOAD_CHECK_INTERVAL
    airflow.providers.apache.druid.transfers.hive_to_druid.DEFAULT_TARGET_PARTITION_SIZE


 .. py:data:: LOAD_CHECK_INTERVAL
    :annotation: = 5


 .. py:data:: DEFAULT_TARGET_PARTITION_SIZE
    :annotation: = 5000000


 .. py:class:: HiveToDruidOperator(*, sql, druid_datasource, ts_dim, metric_spec = None, hive_cli_conn_id = 'hive_cli_default', druid_ingest_conn_id = 'druid_ingest_default', metastore_conn_id = 'metastore_default', hadoop_dependency_coordinates = None, intervals = None, num_shards = -1, target_partition_size = -1, query_granularity = 'NONE', segment_granularity = 'DAY', hive_tblproperties = None, job_properties = None, **kwargs)

    Bases: :py:obj:`airflow.models.BaseOperator`

    Moves data from Hive to Druid, [del]note that for now the data is loaded
    into memory before being pushed to Druid, so this operator should
    be used for smallish amount of data.[/del]

    :param sql: SQL query to execute against the Druid database. (templated)
    :param druid_datasource: the datasource you want to ingest into in druid
    :param ts_dim: the timestamp dimension
    :param metric_spec: the metrics you want to define for your data
    :param hive_cli_conn_id: the hive connection id
    :param druid_ingest_conn_id: the druid ingest connection id
    :param metastore_conn_id: the metastore connection id
    :param hadoop_dependency_coordinates: list of coordinates to squeeze
        int the ingest json
    :param intervals: list of time intervals that defines segments,
        this is passed as is to the json object. (templated)
    :param num_shards: Directly specify the number of shards to create.
    :param target_partition_size: Target number of rows to include in a partition,
    :param query_granularity: The minimum granularity to be able to query results at and the granularity of
        the data inside the segment. E.g. a value of "minute" will mean that data is aggregated at minutely
        granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it
        will aggregate values together using the aggregators instead of storing individual rows.
        A granularity of 'NONE' means millisecond granularity.
    :param segment_granularity: The granularity to create time chunks at. Multiple segments can be created per
        time chunk. For example, with 'DAY' segmentGranularity, the events of the same day fall into the
        same time chunk which can be optionally further partitioned into multiple segments based on other
        configurations and input size.
    :param hive_tblproperties: additional properties for tblproperties in
        hive for the staging table
    :param job_properties: additional properties for job

    .. py:attribute:: template_fields
       :annotation: :Sequence[str] = ['sql', 'intervals']


    .. py:attribute:: template_ext
       :annotation: :Sequence[str] = ['.sql']


    .. py:attribute:: template_fields_renderers


    .. py:method:: execute(self, context)

       This is the main method to derive when creating an operator.
       Context is the same dictionary used as when rendering jinja templates.

       Refer to get_template_context for more context.


    .. py:method:: construct_ingest_query(self, static_path, columns)

       Builds an ingest query for an HDFS TSV load.

       :param static_path: The path on hdfs where the data is
       :param columns: List of all the columns that are available
	:py:mod:`airflow.providers.apache.druid.transfers.hive_to_druid`
	================================================================

	.. py:module:: airflow.providers.apache.druid.transfers.hive_to_druid

	.. autoapi-nested-parse::

	This module contains operator to move data from Hive to Druid.



	Module Contents
	---------------

	Classes
	~~~~~~~

	.. autoapisummary::

	airflow.providers.apache.druid.transfers.hive_to_druid.HiveToDruidOperator




	Attributes
	~~~~~~~~~~

	.. autoapisummary::

	airflow.providers.apache.druid.transfers.hive_to_druid.LOAD_CHECK_INTERVAL
	airflow.providers.apache.druid.transfers.hive_to_druid.DEFAULT_TARGET_PARTITION_SIZE


	.. py:data:: LOAD_CHECK_INTERVAL
	:annotation: = 5



	.. py:data:: DEFAULT_TARGET_PARTITION_SIZE
	:annotation: = 5000000



	.. py:class:: HiveToDruidOperator(, sql, druid_datasource, ts_dim, metric_spec = None, hive_cli_conn_id = 'hive_cli_default', druid_ingest_conn_id = 'druid_ingest_default', metastore_conn_id = 'metastore_default', hadoop_dependency_coordinates = None, intervals = None, num_shards = -1, target_partition_size = -1, query_granularity = 'NONE', segment_granularity = 'DAY', hive_tblproperties = None, job_properties = None, *kwargs)

	Bases: :py:obj:`airflow.models.BaseOperator`

	Moves data from Hive to Druid, [del]note that for now the data is loaded
	into memory before being pushed to Druid, so this operator should
	be used for smallish amount of data.[/del]

	:param sql: SQL query to execute against the Druid database. (templated)
	:param druid_datasource: the datasource you want to ingest into in druid
	:param ts_dim: the timestamp dimension
	:param metric_spec: the metrics you want to define for your data
	:param hive_cli_conn_id: the hive connection id
	:param druid_ingest_conn_id: the druid ingest connection id
	:param metastore_conn_id: the metastore connection id
	:param hadoop_dependency_coordinates: list of coordinates to squeeze
	int the ingest json
	:param intervals: list of time intervals that defines segments,
	this is passed as is to the json object. (templated)
	:param num_shards: Directly specify the number of shards to create.
	:param target_partition_size: Target number of rows to include in a partition,
	:param query_granularity: The minimum granularity to be able to query results at and the granularity of
	the data inside the segment. E.g. a value of "minute" will mean that data is aggregated at minutely
	granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it
	will aggregate values together using the aggregators instead of storing individual rows.
	A granularity of 'NONE' means millisecond granularity.
	:param segment_granularity: The granularity to create time chunks at. Multiple segments can be created per
	time chunk. For example, with 'DAY' segmentGranularity, the events of the same day fall into the
	same time chunk which can be optionally further partitioned into multiple segments based on other
	configurations and input size.
	:param hive_tblproperties: additional properties for tblproperties in
	hive for the staging table
	:param job_properties: additional properties for job

	.. py:attribute:: template_fields
	:annotation: :Sequence[str] = ['sql', 'intervals']



	.. py:attribute:: template_ext
	:annotation: :Sequence[str] = ['.sql']



	.. py:attribute:: template_fields_renderers




	.. py:method:: execute(self, context)

	This is the main method to derive when creating an operator.
	Context is the same dictionary used as when rendering jinja templates.

	Refer to get_template_context for more context.


	.. py:method:: construct_ingest_query(self, static_path, columns)

	Builds an ingest query for an HDFS TSV load.

	:param static_path: The path on hdfs where the data is
	:param columns: List of all the columns that are available