{% include JB/setup %}
BigQuery is a highly scalable no-ops data warehouse in the Google Cloud Platform. Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast SQL queries against append-only tables using the processing power of Google's infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data.
Zeppelin is built against BigQuery API version v2-rev265-1.21.0 - API Javadocs
In a notebook, to enable the BigQuery interpreter, click the Gear icon and select bigquery.
In order to run BigQuery interpreter outside of Google Cloud Engine you need to provide authentication credentials, by following this instructions:
Credentials
page, select the Create credentials
drop-down, then select Service account key
.Key type
, select the JSON
key option, then select Create
. The file automatically downloads to your computer.*.json
file you just downloaded in a directory of your choosing. This directory must be private (you can't let anyone get access to this), but accessible to your Zeppelin instance.GOOGLE_APPLICATION_CREDENTIALS
to the path of the JSON file downloaded.zeppelin-env.sh
: just add it to the end of the file.In a paragraph, use %bigquery.sql
to select the BigQuery interpreter and then input SQL statements against your datasets stored in BigQuery. You can use BigQuery SQL Reference to build your own SQL.
For Example, SQL to query for top 10 departure delays across airports using the flights public dataset
%bigquery.sql SELECT departure_airport,count(case when departure_delay>0 then 1 else 0 end) as no_of_delays FROM [bigquery-samples:airline_ontime_data.flights] group by departure_airport order by 2 desc limit 10
Another Example, SQL to query for most commonly used java packages from the github data hosted in BigQuery
%bigquery.sql SELECT package, COUNT(*) count FROM ( SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, id FROM ( SELECT SPLIT(content, '\n') line, id FROM [bigquery-public-data:github_repos.sample_contents] WHERE content CONTAINS 'import' AND sample_path LIKE '%.java' HAVING LEFT(line, 6)='import' ) GROUP BY package, id ) GROUP BY 1 ORDER BY count DESC LIMIT 40
For in-depth technical details on current implementation please refer to bigquery/README.md.