| :mod:`airflow.contrib.hooks.bigquery_hook` |
| ========================================== |
| |
| .. py:module:: airflow.contrib.hooks.bigquery_hook |
| |
| .. autoapi-nested-parse:: |
| |
| This module contains a BigQuery Hook, as well as a very basic PEP 249 |
| implementation for BigQuery. |
| |
| |
| |
| Module Contents |
| --------------- |
| |
| .. py:class:: BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=True, location=None) |
| |
| Bases: :class:`airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook`, :class:`airflow.hooks.dbapi_hook.DbApiHook` |
| |
| Interact with BigQuery. This hook uses the Google Cloud Platform |
| connection. |
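| |
| **Example** (a minimal usage sketch; ``bigquery_default`` is the hook's default connection |
| ID, and all other names in the examples below are placeholders): :: |
| |
|     from airflow.contrib.hooks.bigquery_hook import BigQueryHook |
| |
|     hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False) |
|     conn = hook.get_conn()    # PEP 249 style connection |
|     cursor = conn.cursor()    # BigQueryCursor exposing the helper methods below |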
| |
| .. attribute:: conn_name_attr |
| :annotation: = bigquery_conn_id |
| |
| |
| |
| |
| .. method:: get_conn(self) |
| |
| Returns a BigQuery PEP 249 connection object. |
| |
| |
| |
| |
| .. method:: get_service(self) |
| |
| Returns a BigQuery service object. |
| |
| |
| |
| |
| .. method:: insert_rows(self, table, rows, target_fields=None, commit_every=1000) |
| |
| Insertion is currently unsupported. Theoretically, you could use |
| BigQuery's streaming API to insert rows into a table, but this hasn't |
| been implemented. |
| |
| |
| |
| |
| .. method:: get_pandas_df(self, sql, parameters=None, dialect=None) |
| |
| Returns a Pandas DataFrame for the results produced by a BigQuery |
| query. The DbApiHook method must be overridden because Pandas |
| doesn't support PEP 249 connections, except for SQLite. See: |
| |
| https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447 |
| https://github.com/pydata/pandas/issues/6900 |
| |
| :param sql: The BigQuery SQL to execute. |
| :type sql: str |
| :param parameters: The parameters to render the SQL query with (not |
| used; present only to match the superclass method signature) |
| :type parameters: mapping or iterable |
| :param dialect: Dialect of BigQuery SQL – legacy SQL or standard SQL; |
| defaults to `self.use_legacy_sql` if not specified |
| :type dialect: str in {'legacy', 'standard'} |
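| |
| **Example** (an illustrative sketch; it assumes the ``hook`` instance created in the |
| class-level example above, and the query is a placeholder): :: |
| |
|     df = hook.get_pandas_df( |
|         sql='SELECT 1 AS x, 2 AS y', |
|         dialect='standard', |
|     ) |
|     print(df.head()) |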
| |
| |
| |
| |
| .. method:: table_exists(self, project_id, dataset_id, table_id) |
| |
| Checks for the existence of a table in Google BigQuery. |
| |
| :param project_id: The Google cloud project in which to look for the |
| table. The connection supplied to the hook must provide access to |
| the specified project. |
| :type project_id: str |
| :param dataset_id: The name of the dataset in which to look for the |
| table. |
| :type dataset_id: str |
| :param table_id: The name of the table to check the existence of. |
| :type table_id: str |
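| |
| **Example** (a sketch; project, dataset and table names are placeholders, and ``hook`` is |
| assumed to be the instance from the class-level example above): :: |
| |
|     if hook.table_exists(project_id='my-project', |
|                          dataset_id='my_dataset', |
|                          table_id='my_table'): |
|         print('table is present') |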
| |
| |
| |
| |
| .. py:class:: BigQueryPandasConnector(project_id, service, reauth=False, verbose=False, dialect='legacy') |
| |
| Bases: :class:`pandas_gbq.gbq.GbqConnector` |
| |
| This connector behaves identically to GbqConnector (from Pandas), except |
| that it allows the service to be injected, and disables a call to |
| self.get_credentials(). This allows Airflow to use BigQuery with Pandas |
| without forcing a three legged OAuth connection. Instead, we can inject |
| service account credentials into the binding. |
| |
| |
| .. py:class:: BigQueryConnection(*args, **kwargs) |
| |
| Bases: :class:`object` |
| |
| BigQuery does not have a notion of a persistent connection. Thus, these |
| objects are small stateless factories for cursors, which do all the real |
| work. |
| |
| |
| .. method:: close(self) |
| |
| BigQueryConnection does not have anything to close. |
| |
| |
| |
| |
| .. method:: commit(self) |
| |
| BigQueryConnection does not support transactions. |
| |
| |
| |
| |
| .. method:: cursor(self) |
| |
| Return a new :py:class:`Cursor` object using the connection. |
| |
| |
| |
| |
| .. method:: rollback(self) |
| |
| |
| |
| |
| .. py:class:: BigQueryBaseCursor(service, project_id, use_legacy_sql=True, api_resource_configs=None, location=None, num_retries=5) |
| |
| Bases: :class:`airflow.utils.log.logging_mixin.LoggingMixin` |
| |
| The BigQuery base cursor contains helper methods to execute queries against |
| BigQuery. The methods can be used directly by operators, in cases where a |
| PEP 249 cursor isn't needed. |
| |
| |
| .. method:: create_empty_table(self, project_id, dataset_id, table_id, schema_fields=None, time_partitioning=None, cluster_fields=None, labels=None, view=None, encryption_configuration=None, num_retries=5) |
| |
| Creates a new, empty table in the dataset. |
| To create a view, which is defined by a SQL query, pass a dictionary to the 'view' kwarg. |
| |
| :param project_id: The project in which to create the table. |
| :type project_id: str |
| :param dataset_id: The dataset in which to create the table. |
| :type dataset_id: str |
| :param table_id: The name of the table to be created. |
| :type table_id: str |
| :param schema_fields: If set, the schema field list as defined here: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema |
| :type schema_fields: list |
| :param labels: a dictionary containing labels for the table, passed to BigQuery |
| :type labels: dict |
| |
| **Example**: :: |
| |
| schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"}, |
| {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}] |
| |
| :param time_partitioning: configure optional time partitioning fields i.e. |
| partition by field, type and expiration as per API specifications. |
| |
| .. seealso:: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#timePartitioning |
| :type time_partitioning: dict |
| :param cluster_fields: [Optional] The fields used for clustering. |
| Must be specified together with time_partitioning; data in the table will be first |
| partitioned and subsequently clustered. |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#clustering.fields |
| :type cluster_fields: list |
| :param view: [Optional] A dictionary containing definition for the view. |
| If set, it will create a view instead of a table: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#view |
| :type view: dict |
| |
| **Example**: :: |
| |
| view = { |
| "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 1000", |
| "useLegacySql": False |
| } |
| |
| :param encryption_configuration: [Optional] Custom encryption configuration (e.g., Cloud KMS keys). |
| **Example**: :: |
| |
| encryption_configuration = { |
| "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key" |
| } |
| :type encryption_configuration: dict |
| :return: None |
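| |
| **Example** (a sketch, assuming a ``cursor`` obtained via |
| ``BigQueryHook().get_conn().cursor()``; project, dataset and table names are placeholders): :: |
| |
|     cursor.create_empty_table( |
|         project_id='my-project', |
|         dataset_id='my_dataset', |
|         table_id='employees', |
|         schema_fields=[ |
|             {"name": "emp_name", "type": "STRING", "mode": "REQUIRED"}, |
|             {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}, |
|         ], |
|         time_partitioning={"type": "DAY"},  # optional daily partitioning |
|     ) |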
| |
| |
| |
| |
| .. method:: create_external_table(self, external_project_dataset_table, schema_fields, source_uris, source_format='CSV', autodetect=False, compression='NONE', ignore_unknown_values=False, max_bad_records=0, skip_leading_rows=0, field_delimiter=',', quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, encoding='UTF-8', src_fmt_configs=None, labels=None, encryption_configuration=None) |
| |
| Creates a new external table in the dataset with the data in Google |
| Cloud Storage. See here: |
| |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource |
| |
| for more details about these parameters. |
| |
| :param external_project_dataset_table: |
| The dotted ``(<project>.|<project>:)<dataset>.<table>($<partition>)`` BigQuery |
| table name to create external table. |
| If ``<project>`` is not included, project will be the |
| project defined in the connection json. |
| :type external_project_dataset_table: str |
| :param schema_fields: The schema field list as defined here: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource |
| :type schema_fields: list |
| :param source_uris: The source Google Cloud |
| Storage URI (e.g. gs://some-bucket/some-file.txt). A single wildcard |
| per-object name can be used. |
| :type source_uris: list |
| :param source_format: File format of the source data. |
| :type source_format: str |
| :param autodetect: Try to detect schema and format options automatically. |
| Any option specified explicitly will be honored. |
| :type autodetect: bool |
| :param compression: [Optional] The compression type of the data source. |
| Possible values include GZIP and NONE. |
| The default value is NONE. |
| This setting is ignored for Google Cloud Bigtable, |
| Google Cloud Datastore backups and Avro formats. |
| :type compression: str |
| :param ignore_unknown_values: [Optional] Indicates if BigQuery should allow |
| extra values that are not represented in the table schema. |
| If true, the extra values are ignored. If false, records with extra columns |
| are treated as bad records, and if there are too many bad records, an |
| invalid error is returned in the job result. |
| :type ignore_unknown_values: bool |
| :param max_bad_records: The maximum number of bad records that BigQuery can |
| ignore when running the job. |
| :type max_bad_records: int |
| :param skip_leading_rows: Number of rows to skip when loading from a CSV. |
| :type skip_leading_rows: int |
| :param field_delimiter: The delimiter to use when loading from a CSV. |
| :type field_delimiter: str |
| :param quote_character: The value that is used to quote data sections in a CSV |
| file. |
| :type quote_character: str |
| :param allow_quoted_newlines: Whether to allow quoted newlines (true) or not |
| (false). |
| :type allow_quoted_newlines: bool |
| :param allow_jagged_rows: Accept rows that are missing trailing optional columns. |
| The missing values are treated as nulls. If false, records with missing |
| trailing columns are treated as bad records, and if there are too many bad |
| records, an invalid error is returned in the job result. Only applicable when |
| source_format is CSV. |
| :type allow_jagged_rows: bool |
| :param encoding: The character encoding of the data. See: |
| |
| .. seealso:: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.csvOptions.encoding |
| :type encoding: str |
| :param src_fmt_configs: configure optional fields specific to the source format |
| :type src_fmt_configs: dict |
| :param labels: a dictionary containing labels for the table, passed to BigQuery |
| :type labels: dict |
| :param encryption_configuration: [Optional] Custom encryption configuration (e.g., Cloud KMS keys). |
| **Example**: :: |
| |
| encryption_configuration = { |
| "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key" |
| } |
| :type encryption_configuration: dict |
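| |
| **Example** (a sketch; the bucket, dataset and table names are placeholders, and ``cursor`` is |
| assumed to be a :class:`BigQueryBaseCursor` created as shown earlier): :: |
| |
|     cursor.create_external_table( |
|         external_project_dataset_table='my-project.my_dataset.external_sales', |
|         schema_fields=[{"name": "amount", "type": "FLOAT", "mode": "NULLABLE"}], |
|         source_uris=['gs://my-bucket/sales/*.csv'], |
|         source_format='CSV', |
|         skip_leading_rows=1, |
|     ) |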
| |
| |
| |
| |
| .. method:: patch_table(self, dataset_id, table_id, project_id=None, description=None, expiration_time=None, external_data_configuration=None, friendly_name=None, labels=None, schema=None, time_partitioning=None, view=None, require_partition_filter=None, encryption_configuration=None) |
| |
| Patch information in an existing table. |
| It only updates fields that are provided in the request object. |
| |
| Reference: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/patch |
| |
| :param dataset_id: The dataset containing the table to be patched. |
| :type dataset_id: str |
| :param table_id: The name of the table to be patched. |
| :type table_id: str |
| :param project_id: The project containing the table to be patched. |
| :type project_id: str |
| :param description: [Optional] A user-friendly description of this table. |
| :type description: str |
| :param expiration_time: [Optional] The time when this table expires, |
| in milliseconds since the epoch. |
| :type expiration_time: int |
| :param external_data_configuration: [Optional] A dictionary containing |
| properties of a table stored outside of BigQuery. |
| :type external_data_configuration: dict |
| :param friendly_name: [Optional] A descriptive name for this table. |
| :type friendly_name: str |
| :param labels: [Optional] A dictionary containing labels associated with this table. |
| :type labels: dict |
| :param schema: [Optional] If set, the schema field list as defined here: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema |
| The supported and unsupported schema modifications are listed here: |
| https://cloud.google.com/bigquery/docs/managing-table-schemas |
| **Example**: :: |
| |
| schema=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"}, |
| {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}] |
| |
| :type schema: list |
| :param time_partitioning: [Optional] A dictionary containing time-based partitioning |
| definition for the table. |
| :type time_partitioning: dict |
| :param view: [Optional] A dictionary containing definition for the view. |
| If set, it will patch a view instead of a table: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#view |
| **Example**: :: |
| |
| view = { |
| "query": "SELECT * FROM `test-project-id.test_dataset_id.test_table_prefix*` LIMIT 500", |
| "useLegacySql": False |
| } |
| |
| :type view: dict |
| :param require_partition_filter: [Optional] If true, queries over this table require a |
| partition filter that can be used for partition elimination. If false, queries |
| over the table may be run without a partition filter. |
| :type require_partition_filter: bool |
| :param encryption_configuration: [Optional] Custom encryption configuration (e.g., Cloud KMS keys). |
| **Example**: :: |
| |
| encryption_configuration = { |
| "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key" |
| } |
| :type encryption_configuration: dict |
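| |
| **Example** (a sketch; only the supplied fields are patched, and the identifiers and labels are |
| placeholders): :: |
| |
|     cursor.patch_table( |
|         dataset_id='my_dataset', |
|         table_id='employees', |
|         description='Employee master data', |
|         labels={'team': 'analytics'}, |
|     ) |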
| |
| |
| |
| |
| .. method:: run_query(self, bql=None, sql=None, destination_dataset_table=None, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=None, udf_config=None, use_legacy_sql=None, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', query_params=None, labels=None, schema_update_options=None, priority='INTERACTIVE', time_partitioning=None, api_resource_configs=None, cluster_fields=None, location=None, encryption_configuration=None) |
| |
| Executes a BigQuery SQL query. Optionally persists results in a BigQuery |
| table. See here: |
| |
| https://cloud.google.com/bigquery/docs/reference/v2/jobs |
| |
| for more details about these parameters. |
| |
| :param bql: (Deprecated. Use `sql` parameter instead) The BigQuery SQL |
| to execute. |
| :type bql: str |
| :param sql: The BigQuery SQL to execute. |
| :type sql: str |
| :param destination_dataset_table: The dotted ``<dataset>.<table>`` |
| BigQuery table to save the query results. |
| :type destination_dataset_table: str |
| :param write_disposition: What to do if the table already exists in |
| BigQuery. |
| :type write_disposition: str |
| :param allow_large_results: Whether to allow large results. |
| :type allow_large_results: bool |
| :param flatten_results: If true and query uses legacy SQL dialect, flattens |
| all nested and repeated fields in the query results. ``allowLargeResults`` |
| must be true if this is set to false. For standard SQL queries, this |
| flag is ignored and results are never flattened. |
| :type flatten_results: bool |
| :param udf_config: The User Defined Function configuration for the query. |
| See https://cloud.google.com/bigquery/user-defined-functions for details. |
| :type udf_config: list |
| :param use_legacy_sql: Whether to use legacy SQL (true) or standard SQL (false). |
| If `None`, defaults to `self.use_legacy_sql`. |
| :type use_legacy_sql: bool |
| :param api_resource_configs: a dictionary containing 'configuration' |
| parameters to be applied to the Google BigQuery Jobs API: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs |
| for example, {'query': {'useQueryCache': False}}. You can use it |
| if you need to provide parameters that are not supported as |
| BigQueryHook arguments. |
| :type api_resource_configs: dict |
| :param maximum_billing_tier: Positive integer that serves as a |
| multiplier of the basic price. |
| :type maximum_billing_tier: int |
| :param maximum_bytes_billed: Limits the bytes billed for this job. |
| Queries that will have bytes billed beyond this limit will fail |
| (without incurring a charge). If unspecified, this will be |
| set to your project default. |
| :type maximum_bytes_billed: float |
| :param create_disposition: Specifies whether the job is allowed to |
| create new tables. |
| :type create_disposition: str |
| :param query_params: a list of dictionary containing query parameter types and |
| values, passed to BigQuery |
| :type query_params: list |
| :param labels: a dictionary containing labels for the job/query, |
| passed to BigQuery |
| :type labels: dict |
| :param schema_update_options: Allows the schema of the destination |
| table to be updated as a side effect of the query job. |
| :type schema_update_options: Union[list, tuple, set] |
| :param priority: Specifies a priority for the query. |
| Possible values include INTERACTIVE and BATCH. |
| The default value is INTERACTIVE. |
| :type priority: str |
| :param time_partitioning: configure optional time partitioning fields i.e. |
| partition by field, type and expiration as per API specifications. |
| :type time_partitioning: dict |
| :param cluster_fields: Request that the result of this query be stored sorted |
| by one or more columns. This is only available in combination with |
| time_partitioning. The order of columns given determines the sort order. |
| :type cluster_fields: list[str] |
| :param location: The geographic location of the job. Required except for |
| US and EU. See details at |
| https://cloud.google.com/bigquery/docs/locations#specifying_your_location |
| :type location: str |
| :param encryption_configuration: [Optional] Custom encryption configuration (e.g., Cloud KMS keys). |
| **Example**: :: |
| |
| encryption_configuration = { |
| "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key" |
| } |
| :type encryption_configuration: dict |
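| |
| **Example** (a sketch; the destination table is a placeholder, the source is a public sample |
| dataset, and ``cursor`` is assumed to be a :class:`BigQueryBaseCursor` created as shown earlier): :: |
| |
|     cursor.run_query( |
|         sql='SELECT corpus, COUNT(*) AS n FROM `bigquery-public-data.samples.shakespeare` GROUP BY corpus', |
|         destination_dataset_table='my_dataset.shakespeare_counts', |
|         write_disposition='WRITE_TRUNCATE', |
|         use_legacy_sql=False, |
|     ) |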
| |
| |
| |
| |
| .. method:: run_extract(self, source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=',', print_header=True, labels=None) |
| |
| Executes a BigQuery extract command to copy data from BigQuery to |
| Google Cloud Storage. See here: |
| |
| https://cloud.google.com/bigquery/docs/reference/v2/jobs |
| |
| for more details about these parameters. |
| |
| :param source_project_dataset_table: The dotted ``<dataset>.<table>`` |
| BigQuery table to use as the source data. |
| :type source_project_dataset_table: str |
| :param destination_cloud_storage_uris: The destination Google Cloud |
| Storage URI (e.g. gs://some-bucket/some-file.txt). Follows |
| convention defined here: |
| https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple |
| :type destination_cloud_storage_uris: list |
| :param compression: Type of compression to use. |
| :type compression: str |
| :param export_format: File format to export. |
| :type export_format: str |
| :param field_delimiter: The delimiter to use when extracting to a CSV. |
| :type field_delimiter: str |
| :param print_header: Whether to print a header for a CSV file extract. |
| :type print_header: bool |
| :param labels: a dictionary containing labels for the job/query, |
| passed to BigQuery |
| :type labels: dict |
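| |
| **Example** (a sketch; the source table and destination bucket are placeholders): :: |
| |
|     cursor.run_extract( |
|         source_project_dataset_table='my_dataset.shakespeare_counts', |
|         destination_cloud_storage_uris=['gs://my-bucket/exports/shakespeare_*.csv'], |
|         export_format='CSV', |
|         print_header=True, |
|     ) |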
| |
| |
| |
| |
| .. method:: run_copy(self, source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', labels=None, encryption_configuration=None) |
| |
| Executes a BigQuery copy command to copy data from one BigQuery table |
| to another. See here: |
| |
| https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy |
| |
| for more details about these parameters. |
| |
| :param source_project_dataset_tables: One or more dotted |
| ``(project:|project.)<dataset>.<table>`` |
| BigQuery tables to use as the source data. Use a list if there are |
| multiple source tables. |
| If ``<project>`` is not included, project will be the project defined |
| in the connection json. |
| :type source_project_dataset_tables: list|string |
| :param destination_project_dataset_table: The destination BigQuery |
| table. Format is: ``(project:|project.)<dataset>.<table>`` |
| :type destination_project_dataset_table: str |
| :param write_disposition: The write disposition if the table already exists. |
| :type write_disposition: str |
| :param create_disposition: The create disposition if the table doesn't exist. |
| :type create_disposition: str |
| :param labels: a dictionary containing labels for the job/query, |
| passed to BigQuery |
| :type labels: dict |
| :param encryption_configuration: [Optional] Custom encryption configuration (e.g., Cloud KMS keys). |
| **Example**: :: |
| |
| encryption_configuration = { |
| "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key" |
| } |
| :type encryption_configuration: dict |
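| |
| **Example** (a sketch; source and destination tables are placeholders): :: |
| |
|     cursor.run_copy( |
|         source_project_dataset_tables='my_dataset.shakespeare_counts', |
|         destination_project_dataset_table='my-project.my_dataset.shakespeare_counts_backup', |
|         write_disposition='WRITE_TRUNCATE', |
|     ) |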
| |
| |
| |
| |
| .. method:: run_load(self, destination_project_dataset_table, source_uris, schema_fields=None, source_format='CSV', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=',', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, encoding='UTF-8', schema_update_options=None, src_fmt_configs=None, time_partitioning=None, cluster_fields=None, autodetect=False, encryption_configuration=None) |
| |
| Executes a BigQuery load command to load data from Google Cloud Storage |
| to BigQuery. See here: |
| |
| https://cloud.google.com/bigquery/docs/reference/v2/jobs |
| |
| for more details about these parameters. |
| |
| :param destination_project_dataset_table: |
| The dotted ``(<project>.|<project>:)<dataset>.<table>($<partition>)`` BigQuery |
| table to load data into. If ``<project>`` is not included, project will be the |
| project defined in the connection json. If a partition is specified the |
| operator will automatically append the data, create a new partition or create |
| a new DAY partitioned table. |
| :type destination_project_dataset_table: str |
| :param schema_fields: The schema field list as defined here: |
| https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load |
| Required if autodetect=False; optional if autodetect=True. |
| :type schema_fields: list |
| :param autodetect: Attempt to autodetect the schema for CSV and JSON |
| source files. |
| :type autodetect: bool |
| :param source_uris: The source Google Cloud |
| Storage URI (e.g. gs://some-bucket/some-file.txt). A single wildcard |
| per-object name can be used. |
| :type source_uris: list |
| :param source_format: File format of the data to load. |
| :type source_format: str |
| :param create_disposition: The create disposition if the table doesn't exist. |
| :type create_disposition: str |
| :param skip_leading_rows: Number of rows to skip when loading from a CSV. |
| :type skip_leading_rows: int |
| :param write_disposition: The write disposition if the table already exists. |
| :type write_disposition: str |
| :param field_delimiter: The delimiter to use when loading from a CSV. |
| :type field_delimiter: str |
| :param max_bad_records: The maximum number of bad records that BigQuery can |
| ignore when running the job. |
| :type max_bad_records: int |
| :param quote_character: The value that is used to quote data sections in a CSV |
| file. |
| :type quote_character: str |
| :param ignore_unknown_values: [Optional] Indicates if BigQuery should allow |
| extra values that are not represented in the table schema. |
| If true, the extra values are ignored. If false, records with extra columns |
| are treated as bad records, and if there are too many bad records, an |
| invalid error is returned in the job result. |
| :type ignore_unknown_values: bool |
| :param allow_quoted_newlines: Whether to allow quoted newlines (true) or not |
| (false). |
| :type allow_quoted_newlines: bool |
| :param allow_jagged_rows: Accept rows that are missing trailing optional columns. |
| The missing values are treated as nulls. If false, records with missing |
| trailing columns are treated as bad records, and if there are too many bad |
| records, an invalid error is returned in the job result. Only applicable when |
| source_format is CSV. |
| :type allow_jagged_rows: bool |
| :param encoding: The character encoding of the data. |
| |
| .. seealso:: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.csvOptions.encoding |
| :type encoding: str |
| :param schema_update_options: Allows the schema of the destination |
| table to be updated as a side effect of the load job. |
| :type schema_update_options: Union[list, tuple, set] |
| :param src_fmt_configs: configure optional fields specific to the source format |
| :type src_fmt_configs: dict |
| :param time_partitioning: configure optional time partitioning fields i.e. |
| partition by field, type and expiration as per API specifications. |
| :type time_partitioning: dict |
| :param cluster_fields: Request that the result of this load be stored sorted |
| by one or more columns. This is only available in combination with |
| time_partitioning. The order of columns given determines the sort order. |
| :type cluster_fields: list[str] |
| :param encryption_configuration: [Optional] Custom encryption configuration (e.g., Cloud KMS keys). |
| **Example**: :: |
| |
| encryption_configuration = { |
| "kmsKeyName": "projects/testp/locations/us/keyRings/test-kr/cryptoKeys/test-key" |
| } |
| :type encryption_configuration: dict |
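| |
| **Example** (a sketch; the bucket, destination table and schema are placeholders): :: |
| |
|     cursor.run_load( |
|         destination_project_dataset_table='my_dataset.sales', |
|         source_uris=['gs://my-bucket/sales/2019-01-01.csv'], |
|         schema_fields=[ |
|             {"name": "sale_id", "type": "STRING", "mode": "REQUIRED"}, |
|             {"name": "amount", "type": "FLOAT", "mode": "NULLABLE"}, |
|         ], |
|         source_format='CSV', |
|         skip_leading_rows=1, |
|         write_disposition='WRITE_APPEND', |
|     ) |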
| |
| |
| |
| |
| .. method:: run_with_configuration(self, configuration) |
| |
| Executes a BigQuery SQL query. See here: |
| |
| https://cloud.google.com/bigquery/docs/reference/v2/jobs |
| |
| for more details about the configuration parameter. |
| |
| :param configuration: The configuration parameter maps directly to |
| BigQuery's configuration field in the job object. See |
| https://cloud.google.com/bigquery/docs/reference/v2/jobs for |
| details. |
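| |
| **Example** (a sketch; the dictionary mirrors the ``configuration`` field of the BigQuery |
| Jobs API, and the query is a placeholder): :: |
| |
|     configuration = { |
|         "query": { |
|             "query": "SELECT 1 AS x", |
|             "useLegacySql": False, |
|         } |
|     } |
|     cursor.run_with_configuration(configuration) |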
| |
| |
| |
| |
| .. method:: poll_job_complete(self, job_id) |
| |
| |
| |
| |
| .. method:: cancel_query(self) |
| |
| Cancel all started queries that have not yet completed |
| |
| |
| |
| |
| .. method:: get_schema(self, dataset_id, table_id) |
| |
| Get the schema for a given dataset.table. |
| see https://cloud.google.com/bigquery/docs/reference/v2/tables#resource |
| |
| :param dataset_id: the dataset ID of the requested table |
| :param table_id: the table ID of the requested table |
| :return: a table schema |
| |
| |
| |
| |
| .. method:: get_tabledata(self, dataset_id, table_id, max_results=None, selected_fields=None, page_token=None, start_index=None) |
| |
| Get the data of a given dataset.table and optionally with selected columns. |
| see https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list |
| |
| :param dataset_id: the dataset ID of the requested table. |
| :param table_id: the table ID of the requested table. |
| :param max_results: the maximum results to return. |
| :param selected_fields: List of fields to return (comma-separated). If |
| unspecified, all fields are returned. |
| :param page_token: page token, returned from a previous call, |
| identifying the result set. |
| :param start_index: zero based index of the starting row to read. |
| :return: map containing the requested rows. |
| |
| |
| |
| |
| .. method:: run_table_delete(self, deletion_dataset_table, ignore_if_missing=False) |
| |
| Delete an existing table from the dataset. |
| If the table does not exist, return an error unless ignore_if_missing |
| is set to True. |
| |
| :param deletion_dataset_table: A dotted |
| ``(<project>.|<project>:)<dataset>.<table>`` that indicates which table |
| will be deleted. |
| :type deletion_dataset_table: str |
| :param ignore_if_missing: if True, then return success even if the |
| requested table does not exist. |
| :type ignore_if_missing: bool |
| :return: |
| |
| |
| |
| |
| .. method:: run_table_upsert(self, dataset_id, table_resource, project_id=None) |
| |
| Creates a new, empty table in the dataset; |
| if the table already exists, updates the existing table. |
| Since BigQuery does not natively allow table upserts, this is not an |
| atomic operation. |
| |
| :param dataset_id: the dataset to upsert the table into. |
| :type dataset_id: str |
| :param table_resource: a table resource. see |
| https://cloud.google.com/bigquery/docs/reference/v2/tables#resource |
| :type table_resource: dict |
| :param project_id: the project to upsert the table into. If None, |
| project will be self.project_id. |
| :return: |
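| |
| **Example** (a sketch; the table resource below is a minimal placeholder following the |
| tables#resource layout linked above): :: |
| |
|     table_resource = { |
|         "tableReference": {"tableId": "employees"}, |
|         "friendlyName": "Employees", |
|     } |
|     cursor.run_table_upsert(dataset_id='my_dataset', table_resource=table_resource) |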
| |
| |
| |
| |
| .. method:: run_grant_dataset_view_access(self, source_dataset, view_dataset, view_table, source_project=None, view_project=None) |
| |
| Grant authorized view access of a dataset to a view table. |
| If this view has already been granted access to the dataset, do nothing. |
| This method is not atomic. Running it may clobber a simultaneous update. |
| |
| :param source_dataset: the source dataset |
| :type source_dataset: str |
| :param view_dataset: the dataset that the view is in |
| :type view_dataset: str |
| :param view_table: the table of the view |
| :type view_table: str |
| :param source_project: the project of the source dataset. If None, |
| self.project_id will be used. |
| :type source_project: str |
| :param view_project: the project that the view is in. If None, |
| self.project_id will be used. |
| :type view_project: str |
| :return: the datasets resource of the source dataset. |
| |
| |
| |
| |
| .. method:: create_empty_dataset(self, dataset_id='', project_id='', dataset_reference=None) |
| |
| Create a new empty dataset: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/insert |
| |
| :param project_id: The name of the project where we want to create |
| an empty dataset. Not needed if projectId is provided in dataset_reference. |
| :type project_id: str |
| :param dataset_id: The ID of the dataset. Not needed |
| if datasetId is provided in dataset_reference. |
| :type dataset_id: str |
| :param dataset_reference: Dataset reference that could be provided |
| with request body. More info: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource |
| :type dataset_reference: dict |
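| |
| **Example** (a sketch; dataset and project IDs are placeholders): :: |
| |
|     cursor.create_empty_dataset( |
|         dataset_id='my_new_dataset', |
|         project_id='my-project', |
|     ) |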
| |
| |
| |
| |
| .. method:: delete_dataset(self, project_id, dataset_id, delete_contents=False) |
| |
| Delete a BigQuery dataset in your project. |
| |
| :param project_id: The name of the project that contains the dataset. |
| :type project_id: str |
| :param dataset_id: The dataset to be deleted. |
| :type dataset_id: str |
| :param delete_contents: [Optional] Whether to force the deletion even if the dataset is not empty. |
| Will delete all tables (if any) in the dataset if set to True. |
| Will raise HttpError 400: "{dataset_id} is still in use" if set to False and dataset is not empty. |
| The default value is False. |
| :type delete_contents: bool |
| :return: |
| |
| |
| |
| |
| .. method:: get_dataset(self, dataset_id, project_id=None) |
| |
| Returns the dataset_resource if the dataset exists, |
| and raises a 404 error if it does not. |
| |
| :param dataset_id: The BigQuery Dataset ID |
| :type dataset_id: str |
| :param project_id: The GCP Project ID |
| :type project_id: str |
| :return: dataset_resource |
| |
| .. seealso:: |
| For more information, see Dataset Resource content: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource |
| |
| |
| |
| |
| .. method:: get_datasets_list(self, project_id=None) |
| |
| Returns the full list of BigQuery datasets in the current project. |
| |
| .. seealso:: |
| For more information, see: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list |
| |
| :param project_id: The Google Cloud project for which |
| to list all datasets |
| :type project_id: str |
| :return: datasets_list |
| |
| Example of returned datasets_list: :: |
| |
| [ |
| { |
| "kind":"bigquery#dataset", |
| "location":"US", |
| "id":"your-project:dataset_2_test", |
| "datasetReference":{ |
| "projectId":"your-project", |
| "datasetId":"dataset_2_test" |
| } |
| }, |
| { |
| "kind":"bigquery#dataset", |
| "location":"US", |
| "id":"your-project:dataset_1_test", |
| "datasetReference":{ |
| "projectId":"your-project", |
| "datasetId":"dataset_1_test" |
| } |
| } |
| ] |
| |
| |
| |
| |
| .. method:: patch_dataset(self, dataset_id, dataset_resource, project_id=None) |
| |
| Patches information in an existing dataset. |
| It only replaces fields that are provided in the submitted dataset resource. |
| More info: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/patch |
| |
| :param dataset_id: The BigQuery Dataset ID |
| :type dataset_id: str |
| :param dataset_resource: Dataset resource that will be provided |
| in request body. |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource |
| :type dataset_resource: dict |
| :param project_id: The GCP Project ID |
| :type project_id: str |
| :rtype: dataset |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource |
| |
| |
| |
| |
| .. method:: update_dataset(self, dataset_id, dataset_resource, project_id=None) |
| |
| Updates information in an existing dataset. The update method replaces the entire |
| dataset resource, whereas the patch method only replaces fields that are provided |
| in the submitted dataset resource. |
| More info: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/update |
| |
| :param dataset_id: The BigQuery Dataset ID |
| :type dataset_id: str |
| :param dataset_resource: Dataset resource that will be provided |
| in request body. |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource |
| :type dataset_resource: dict |
| :param project_id: The GCP Project ID |
| :type project_id: str |
| :rtype: dataset |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource |
| |
| |
| |
| |
| .. method:: insert_all(self, project_id, dataset_id, table_id, rows, ignore_unknown_values=False, skip_invalid_rows=False, fail_on_error=False) |
| |
| Method to stream data into BigQuery one record at a time without needing |
| to run a load job |
| |
| .. seealso:: |
| For more information, see: |
| https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll |
| |
| :param project_id: The name of the project where we have the table |
| :type project_id: str |
| :param dataset_id: The name of the dataset where we have the table |
| :type dataset_id: str |
| :param table_id: The name of the table |
| :type table_id: str |
| :param rows: the rows to insert |
| :type rows: list |
| |
| **Example of rows**: :: |
| rows=[{"json": {"a_key": "a_value_0"}}, {"json": {"a_key": "a_value_1"}}] |
| |
| :param ignore_unknown_values: [Optional] Accept rows that contain values |
| that do not match the schema. The unknown values are ignored. |
| The default value is false, which treats unknown values as errors. |
| :type ignore_unknown_values: bool |
| :param skip_invalid_rows: [Optional] Insert all valid rows of a request, |
| even if invalid rows exist. The default value is false, which causes |
| the entire request to fail if any invalid rows exist. |
| :type skip_invalid_rows: bool |
| :param fail_on_error: [Optional] Force the task to fail if any errors occur. |
| The default value is false, which indicates the task should not fail |
| even if any insertion errors occur. |
| :type fail_on_error: bool |
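| |
| **Example** (a sketch; the rows follow the ``{"json": {...}}`` wrapping shown above, and the |
| project, dataset and table identifiers are placeholders): :: |
| |
|     rows = [ |
|         {"json": {"a_key": "a_value_0"}}, |
|         {"json": {"a_key": "a_value_1"}}, |
|     ] |
|     cursor.insert_all( |
|         project_id='my-project', |
|         dataset_id='my_dataset', |
|         table_id='my_table', |
|         rows=rows, |
|         fail_on_error=True, |
|     ) |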
| |
| |
| |
| |
| .. py:class:: BigQueryCursor(service, project_id, use_legacy_sql=True, location=None, num_retries=5) |
| |
| Bases: :class:`airflow.contrib.hooks.bigquery_hook.BigQueryBaseCursor` |
| |
| A very basic BigQuery PEP 249 cursor implementation. The PyHive PEP 249 |
| implementation was used as a reference: |
| |
| https://github.com/dropbox/PyHive/blob/master/pyhive/presto.py |
| https://github.com/dropbox/PyHive/blob/master/pyhive/common.py |
| |
| .. attribute:: description |
| |
| |
| The schema description method is not currently implemented. |
| |
| |
| .. attribute:: rowcount |
| |
| |
| By default, return -1 to indicate that this is not supported. |
| |
| |
| .. attribute:: arraysize |
| |
| |
| |
| |
| |
| .. method:: close(self) |
| |
| By default, do nothing |
| |
| |
| |
| |
| .. method:: execute(self, operation, parameters=None) |
| |
| Executes a BigQuery query, and returns the job ID. |
| |
| :param operation: The query to execute. |
| :type operation: str |
| :param parameters: Parameters to substitute into the query. |
| :type parameters: dict |
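| |
| **Example** (a sketch, assuming the import from the class-level example above; parameters |
| appear to be bound with pyformat-style ``%(name)s`` substitution via the module's |
| ``_bind_parameters`` helper): :: |
| |
|     conn = BigQueryHook(use_legacy_sql=False).get_conn() |
|     cursor = conn.cursor() |
|     cursor.execute('SELECT %(num)s AS num', parameters={'num': 42}) |
|     print(cursor.fetchone()) |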
| |
| |
| |
| |
| .. method:: executemany(self, operation, seq_of_parameters) |
| |
| Execute a BigQuery query multiple times with different parameters. |
| |
| :param operation: The query to execute. |
| :type operation: str |
| :param seq_of_parameters: List of dictionary parameters to substitute into the |
| query. |
| :type seq_of_parameters: list |
| |
| |
| |
| |
| .. method:: flush_results(self) |
| |
| Flush results related cursor attributes. |
| |
| |
| |
| |
| .. method:: fetchone(self) |
| |
| Fetch the next row of a query result set. |
| |
| |
| |
| |
| .. method:: next(self) |
| |
| Helper method for fetchone, which returns the next row from a buffer. |
| If the buffer is empty, attempts to paginate through the result set for |
| the next page, and load it into the buffer. |
| |
| |
| |
| |
| .. method:: fetchmany(self, size=None) |
| |
| Fetch the next set of rows of a query result, returning a sequence of sequences |
| (e.g. a list of tuples). An empty sequence is returned when no more rows are |
| available. The number of rows to fetch per call is specified by the parameter. |
| If it is not given, the cursor's arraysize determines the number of rows to be |
| fetched. The method should try to fetch as many rows as indicated by the size |
| parameter. If this is not possible due to the specified number of rows not being |
| available, fewer rows may be returned. An :py:class:`~pyhive.exc.Error` |
| (or subclass) exception is raised if the previous call to |
| :py:meth:`execute` did not produce any result set or no call was issued yet. |
| |
| |
| |
| |
| .. method:: fetchall(self) |
| |
| Fetch all (remaining) rows of a query result, returning them as a sequence of |
| sequences (e.g. a list of tuples). |
| |
| |
| |
| |
| .. method:: get_arraysize(self) |
| |
| Specifies the number of rows to fetch at a time with .fetchmany() |
| |
| |
| |
| |
| .. method:: set_arraysize(self, arraysize) |
| |
| Specifies the number of rows to fetch at a time with .fetchmany() |
| |
| |
| |
| |
| .. method:: setinputsizes(self, sizes) |
| |
| Does nothing by default |
| |
| |
| |
| |
| .. method:: setoutputsize(self, size, column=None) |
| |
| Does nothing by default |
| |
| |
| |
| |
| .. function:: _bind_parameters(operation, parameters) |
| Helper method that binds parameters to a SQL query. |
| |
| |
| .. function:: _escape(s) |
| Helper method that escapes parameters to a SQL query. |
| |
| |
| .. function:: _bq_cast(string_field, bq_type) |
| Helper method that casts a BigQuery row to the appropriate data types. |
| This is useful because BigQuery returns all fields as strings. |
| |
| |
| .. function:: _split_tablename(table_input, default_project_id, var_name=None) |
| |
| .. function:: _cleanse_time_partitioning(destination_dataset_table, time_partitioning_in) |
| |
| .. function:: _validate_value(key, value, expected_type) |
| Function to check the expected type and raise an |
| error if the type is not correct |
| |
| |
| .. function:: _api_resource_configs_duplication_check(key, value, config_dict, config_dict_name='api_resource_configs') |
| |
| .. function:: _validate_src_fmt_configs(source_format, src_fmt_configs, valid_configs, backward_compatibility_configs=None) |
| Validates the given src_fmt_configs against a valid configuration for the source format. |
| Adds the backward compatibility config to the src_fmt_configs. |
| |
| :param source_format: File format to export. |
| :type source_format: str |
| :param src_fmt_configs: Configure optional fields specific to the source format. |
| :type src_fmt_configs: dict |
| :param valid_configs: Valid configuration specific to the source format |
| :type valid_configs: List[str] |
| :param backward_compatibility_configs: The top-level params for backward-compatibility |
| :type backward_compatibility_configs: dict |
| |
| |