<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage</title>
<description>The official user documentation for Apache Drill.
</description>
<link>/</link>
<atom:link href="/zh/feed.xml" rel="self" type="application/rss+xml"/>
<pubDate>Wed, 01 May 2024 16:28:41 +0000</pubDate>
<lastBuildDate>Wed, 01 May 2024 16:28:41 +0000</lastBuildDate>
<generator>Jekyll v3.9.1</generator>
<item>
<title>Drill 1.21.1 Released</title>
<description>&lt;p&gt;Today, we’re happy to announce the availability of Drill 1.21.1. You can download it &lt;a href=&quot;https://drill.apache.org/download/&quot;&gt;here&lt;/a&gt;. You can find a complete list of improvements and JIRAs resolved in the 1.21.1 release &lt;a href=&quot;/docs/apache-drill-1-21-1-release-notes/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Sat, 29 Apr 2023 00:00:00 +0000</pubDate>
<link>/blog/2023/04/29/drill-1.21.1-released/</link>
<guid isPermaLink="true">/blog/2023/04/29/drill-1.21.1-released/</guid>
<category>blog</category>
</item>
<item>
<title>Announcing Drill 1.21!</title>
<description>&lt;h1 id=&quot;announcing-drill-121-new-connectors-functions-and-much-better-stability&quot;&gt;Announcing Drill 1.21: New Connectors, Functions and Much Better Stability&lt;/h1&gt;
&lt;p&gt;The Apache Drill PMC is pleased to announce a milestone release of Apache Drill. Since the last release of Drill, the team has been hard at work quashing bugs and making overall functionality improvements. The TL;DR includes the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New connectors including Apache Iceberg, Delta Lake, Microsoft Access, GoogleSheets, and Box&lt;/li&gt;
&lt;li&gt;Efficient cross-cloud query capability&lt;/li&gt;
&lt;li&gt;Greatly improved access controls, including user translation support for all storage plugins&lt;/li&gt;
&lt;li&gt;Greatly improved query planning and implicit casting&lt;/li&gt;
&lt;li&gt;New BI-focused SQL operators including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PIVOT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNPIVOT&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXCEPT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INTERSECT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;New functions for computing regression lines and trends&lt;/li&gt;
&lt;li&gt;New and updated date manipulation functions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Overall, Drill 1.21 is much more capable and stable than previous versions.&lt;/p&gt;
&lt;h2 id=&quot;calcite-were-back&quot;&gt;Calcite, We’re Back!&lt;/h2&gt;
&lt;p&gt;Drill relies on another open source project, Apache Calcite, for its query planning. The query planning process is a huge part of the overall functionality of Drill. Unfortunately, about three years ago, there were some issues in Calcite which forced Drill to fork it and rely on that fork. As a result, Drill was essentially stuck with a three-year-old query planner and, more importantly, bug fixes and new capabilities in Calcite were not finding their way into Drill.&lt;/p&gt;
&lt;p&gt;That is no longer the case. Drill 1.21 now runs on the latest stable version of Calcite, version 1.33. As a result, we’ve been able to close countless JIRA tickets for failing queries and other random issues that turned out to be query planning bugs.&lt;/p&gt;
&lt;p&gt;What this means for you as a user is that you’ll see far fewer queries failing and better overall performance in terms of speed and stability. You’ll also see more optimizations pushed down to JDBC data sources, along with support for BigQuery, Athena and other JDBC sources. We hope to keep Drill off Calcite forks from here on, and to work with the Calcite community to keep our tools in sync.&lt;/p&gt;
&lt;h2 id=&quot;improved-implicit-casting-rules-reduce-schema-change-failures&quot;&gt;Improved Implicit Casting Rules Reduce Schema Change Failures&lt;/h2&gt;
&lt;p&gt;From this author’s perspective, one of the biggest improvements in Drill is also one of the least noticeable: improved implicit casting. One of Drill’s unique features is its ability to infer the structure, or schema, of your data. However, this can be problematic when the schema changes. When I used to teach Drill, I had to spend a considerable amount of time showing students how to cast data from one type to another to ensure that their queries would succeed.&lt;/p&gt;
&lt;p&gt;With the latest version of Drill, you’ll find that queries work without the need for much, if any, casting. In short, they’ll do what you expect them to do. It really does feel like magic.&lt;/p&gt;
&lt;h2 id=&quot;integrations-with-the-modern-and-not-so-modern-data-stack&quot;&gt;Integrations with the Modern and Not-so-Modern Data Stack&lt;/h2&gt;
&lt;p&gt;The new version of Drill features several new connectors and readers that enable users to connect to the “modern data stack”, specifically Apache Iceberg and Delta Lake.&lt;/p&gt;
&lt;h3 id=&quot;breaking-the-iceberg&quot;&gt;Breaking the Iceberg&lt;/h3&gt;
&lt;p&gt;Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Drill to safely work with the same tables, at the same time. In addition to being able to query data directly from Iceberg tables, Drill also allows users to query the Iceberg table metadata as well as snapshots. &lt;a href=&quot;https://drill.apache.org/docs/iceberg-format-plugin/&quot;&gt;Complete documentation is available here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;querying-delta-lake&quot;&gt;Querying Delta Lake&lt;/h3&gt;
&lt;p&gt;Lest we offend someone, we’re not going to wade into the debate between Iceberg and Delta Lake (after all, let’s not argue about who killed whom), but Delta Lake, if you aren’t familiar with it, is another modern table format which supports ACID transactions, versioning and more. In version 1.21, Drill adds support for Delta Lake, so users can query Delta Lake tables as well as their associated metadata. You can also query specific versions of Delta Lake tables. &lt;a href=&quot;https://drill.apache.org/docs/delta-lake-format-plugin/&quot;&gt;Complete documentation is available here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;accessing-access&quot;&gt;Accessing Access&lt;/h3&gt;
&lt;p&gt;A surprising number of people use Microsoft Access as a database for their business data. With version 1.21, Apache Drill can now natively query Microsoft Access database files. This can be a major benefit for those looking to migrate data from Access into more modern formats such as Parquet, or into other relational databases. Drill supports Access files from version 1997 onwards.&lt;/p&gt;
&lt;h3 id=&quot;oh-sheets&quot;&gt;Oh Sheets!&lt;/h3&gt;
&lt;p&gt;In addition to all of the above, Drill can now query data directly from GoogleSheets. Beyond querying, Drill can read, write, delete and append to GoogleSheets. Google doesn’t make it easy, so if this is a feature you are interested in, you’ll definitely want to &lt;a href=&quot;https://drill.apache.org/docs/google-sheets-storage-plugin/&quot;&gt;read the documentation here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;remote-data&quot;&gt;Remote Data&lt;/h3&gt;
&lt;p&gt;As you can see, Drill has significantly expanded the number of data sources and formats that it can query. Part of this work has also been to improve the implementation behind file systems. As a result, Drill can now query data stored on Dropbox and Box. We added support for file systems which use OAuth 2.0 for authorization, which means more file systems are likely coming your way in the next release.&lt;/p&gt;
&lt;h2 id=&quot;greatly-improved-access-controls&quot;&gt;Greatly Improved Access Controls&lt;/h2&gt;
&lt;p&gt;Managing access controls and credentials on a federated query engine is a complicated task. Drill has long supported a concept called user impersonation, which basically means that Drill can execute queries using the credentials of the logged-in user. This works well for querying file systems such as Hadoop and other data sources with the same user model; however, it does not work at all with data sources that have a different concept of users or, in the case of OAuth-enabled plugins, a completely different set of credentials.&lt;/p&gt;
&lt;p&gt;To answer this challenge, Drill 1.21 introduces the concept of user translation. The idea of user translation is that, when enabled, every user will have their own unique credentials for specific data sources. Thus, when that user queries a specific data source, that user’s credentials are used to execute the query. This is configurable on an individual data source basis. Ultimately, what this means is that you no longer have to create service accounts to access data via Drill.&lt;/p&gt;
&lt;h2 id=&quot;drilling-across-the-clouds&quot;&gt;Drilling Across the Clouds&lt;/h2&gt;
&lt;p&gt;While we’re on the subject of clouds, as you may be aware, Drill can query data stored in cloud-based file systems such as S3, Azure and GCP. One of the challenges, however, is that if you have data stored in multiple clouds, it can become very inefficient to query, especially from the perspective of network IO. Drill 1.21 adds a storage plugin which we are calling Drill on Drill.&lt;/p&gt;
&lt;p&gt;Let’s say that you had a Drill cluster running next to data in S3, but you also had data in Azure. With the new Drill on Drill capability, you could install an additional Drill cluster in Azure and then query both from either cluster. The advantage is that queries are pushed down to the Drill cluster where the data resides, so if you query the Azure data from the S3-side cluster, you aren’t sending tons of data back and forth.&lt;/p&gt;
&lt;h2 id=&quot;drill-now-supports-more-bi-operators&quot;&gt;Drill Now Supports More BI Operators&lt;/h2&gt;
&lt;p&gt;While Drill has held more or less to the SQL standard, it was missing some BI operators that have become commonplace among SQL platforms. Drill 1.21 introduces the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PIVOT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNPIVOT&lt;/code&gt; operators, which convert rows to columns or vice versa, much in the same way a pivot table works in Excel. Additionally, we added the set operators &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INTERSECT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EXCEPT&lt;/code&gt;, which are part of the SQL standard.&lt;/p&gt;
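&lt;p&gt;Here’s a quick sketch of the kind of queries the new operators enable, issued through the sqlalchemy-drill engine used elsewhere on this blog. The workspace, table and column names are invented for illustration, and the PIVOT syntax shown follows the general Calcite form rather than a transcript from a real Drill session.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from sqlalchemy import create_engine

# Hypothetical local Drillbit; adjust host, port and credentials to taste.
engine = create_engine('drill+sadrill://localhost:8047/dfs?use_ssl=False')

# PIVOT: one row per (region, quarter) becomes one column per quarter.
pivot_sql = '''
SELECT *
FROM (SELECT region, quarter, amount FROM dfs.tmp.`sales`)
PIVOT (SUM(amount) FOR quarter IN ('Q1' AS q1, 'Q2' AS q2, 'Q3' AS q3, 'Q4' AS q4))
'''

# EXCEPT: customers present this year but missing from last year.
except_sql = '''
SELECT customer_id FROM dfs.tmp.`customers_2023`
EXCEPT
SELECT customer_id FROM dfs.tmp.`customers_2022`
'''

for sql in (pivot_sql, except_sql):
    for row in engine.execute(sql):
        print(row)
&lt;/code&gt;&lt;/pre&gt;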
&lt;h2 id=&quot;new-statistical-functions&quot;&gt;New Statistical Functions&lt;/h2&gt;
&lt;p&gt;Drill 1.21 adds new SQL functions for statistical summaries, including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kendall_correlation&lt;/code&gt; for calculating rank correlation coefficients, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;width_bucket&lt;/code&gt;, a standard SQL function for computing histograms and distributions, and two other functions for computing regression lines.&lt;/p&gt;
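&lt;p&gt;Reusing the engine from the sketch above, here is a rough illustration of the two functions named in this post. The table and column names are again invented, and the exact argument order shown for kendall_correlation is an assumption to check against the documentation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# width_bucket(expr, min, max, buckets) assigns each price to one of ten
# equal-width buckets between 0 and 500, giving a quick histogram.
histogram_sql = '''
SELECT width_bucket(price, 0, 500, 10) AS bucket, COUNT(*) AS n
FROM dfs.tmp.`orders`
GROUP BY width_bucket(price, 0, 500, 10)
ORDER BY bucket
'''

# kendall_correlation computes a rank correlation between two columns
# (argument order assumed here for illustration).
correlation_sql = '''
SELECT kendall_correlation(price, quantity) AS tau
FROM dfs.tmp.`orders`
'''

for sql in (histogram_sql, correlation_sql):
    print(list(engine.execute(sql)))
&lt;/code&gt;&lt;/pre&gt;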
&lt;p&gt;Lastly, we’ve also added additional date/time manipulation functions which will make working with dates significantly easier.&lt;/p&gt;
&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;
&lt;p&gt;The big question is where we go from here. We’ve already started working on support for additional BI operators such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CUBE&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GROUPING SETS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROLLUP&lt;/code&gt;, as well as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REGEXP_EXTRACT&lt;/code&gt;. Since the new version of Calcite supports numerous optimizations around materialized views, this is also something being discussed. If you like what you are seeing, please download Drill and try it out. Feedback is always welcome on the Drill Slack channel or on our mailing lists. Happy Drilling!&lt;/p&gt;
</description>
<pubDate>Thu, 02 Mar 2023 00:00:00 +0000</pubDate>
<link>/blog/2023/03/02/drill-1.21.0-released/</link>
<guid isPermaLink="true">/blog/2023/03/02/drill-1.21.0-released/</guid>
<category>blog</category>
</item>
<item>
<title>Drill 1.20.3 Released</title>
<description>&lt;p&gt;Today, we’re happy to announce the availability of Drill 1.20.3. You can download it &lt;a href=&quot;https://drill.apache.org/download/&quot;&gt;here&lt;/a&gt;. You can find a complete list of improvements and JIRAs resolved in the 1.20.3 release &lt;a href=&quot;/docs/apache-drill-1-20-3-release-notes/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Sat, 07 Jan 2023 00:00:00 +0000</pubDate>
<link>/blog/2023/01/07/drill-1.20.3-released/</link>
<guid isPermaLink="true">/blog/2023/01/07/drill-1.20.3-released/</guid>
<category>blog</category>
</item>
<item>
<title>Drill 1.20.2 Released</title>
<description>&lt;p&gt;Today, we’re happy to announce the availability of Drill 1.20.2. You can download it &lt;a href=&quot;https://drill.apache.org/download/&quot;&gt;here&lt;/a&gt;. You can find a complete list of improvements and JIRAs resolved in the 1.20.2 release &lt;a href=&quot;/docs/apache-drill-1-20-2-release-notes/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Wed, 03 Aug 2022 00:00:00 +0000</pubDate>
<link>/blog/2022/08/03/drill-1.20.2-released/</link>
<guid isPermaLink="true">/blog/2022/08/03/drill-1.20.2-released/</guid>
<category>blog</category>
</item>
<item>
<title>Drill 1.20.1 Released</title>
<description>&lt;p&gt;Today, we’re happy to announce the availability of Drill 1.20.1. You can download it &lt;a href=&quot;https://drill.apache.org/download/&quot;&gt;here&lt;/a&gt;. You can find a complete list of improvements and JIRAs resolved in the 1.20.1 release &lt;a href=&quot;/docs/apache-drill-1-20-1-release-notes/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Mon, 16 May 2022 00:00:00 +0000</pubDate>
<link>/blog/2022/05/16/drill-1.20.1-released/</link>
<guid isPermaLink="true">/blog/2022/05/16/drill-1.20.1-released/</guid>
<category>blog</category>
</item>
<item>
<title>Drill 1.20 Released</title>
<description>&lt;p&gt;Today, we’re happy to announce the availability of Drill 1.20.0. You can download it &lt;a href=&quot;https://drill.apache.org/download/&quot;&gt;here&lt;/a&gt;. Like much of Drill, this release is full of valuable contributions from talented Ukrainian contributors. As a crisis unfolds in their homeland, the Drill community would like to record here its sympathy and concern for the well-being of our Ukrainian colleagues and friends.&lt;/p&gt;
&lt;h2 id=&quot;this-release-provides-the-following-new-features&quot;&gt;This release provides the following new Features:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-1282&quot;&gt;DRILL-1282&lt;/a&gt;] - Add read and write support for Parquet v2&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-7985&quot;&gt;DRILL-7985&lt;/a&gt;] - Support Mongo aggregate, union, project, limit, sort pushdowns&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8027&quot;&gt;DRILL-8027&lt;/a&gt;] - Format plugin for Apache Iceberg&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8073&quot;&gt;DRILL-8073&lt;/a&gt;] - Add support for persistent table and storage aliases&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8107&quot;&gt;DRILL-8107&lt;/a&gt;] - Hadoop2 backport Maven profile&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-7969&quot;&gt;DRILL-7969&lt;/a&gt;] - Add support for reading and writing Parquet files using Brotli, LZ4 and Zstandard codecs&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-7971&quot;&gt;DRILL-7971&lt;/a&gt;] - Support Elasticsearch authentication&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-7988&quot;&gt;DRILL-7988&lt;/a&gt;] - Add credentials provider support for API connections in HTTP plugin&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-7995&quot;&gt;DRILL-7995&lt;/a&gt;] - Add ability to query OCI OS&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8005&quot;&gt;DRILL-8005&lt;/a&gt;] - Add Writer to JDBC Storage Plugin&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8011&quot;&gt;DRILL-8011&lt;/a&gt;] - Add Dropbox File System to Drill&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8022&quot;&gt;DRILL-8022&lt;/a&gt;] - Add Provided Schema Support for Excel Reader&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8028&quot;&gt;DRILL-8028&lt;/a&gt;] - Add PDF Format Plugin&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8047&quot;&gt;DRILL-8047&lt;/a&gt;] - Add a custom authn provider for HashiCorp Vault&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8054&quot;&gt;DRILL-8054&lt;/a&gt;] - Add SAS Format Plugin&lt;/li&gt;
&lt;li&gt;[&lt;a href=&quot;https://issues.apache.org/jira/browse/DRILL-8056&quot;&gt;DRILL-8056&lt;/a&gt;] - Add OAuth2 Support for HTTP Rest Plugin&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can find a complete list of improvements and JIRAs resolved in the 1.20.0 release &lt;a href=&quot;/docs/apache-drill-1-20-0-release-notes/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Fri, 25 Feb 2022 00:00:00 +0000</pubDate>
<link>/blog/2022/02/25/drill-1.20-released/</link>
<guid isPermaLink="true">/blog/2022/02/25/drill-1.20-released/</guid>
<category>blog</category>
</item>
<item>
<title>2021 in Review</title>
<description>&lt;p&gt;What’s this, a short post written only to tantalise us with news of the looming Drill 1.20? That announcement is indeed soon to follow, but this post is no mere teaser. Here we share with the community a copy of Cong Luo’s excellent review of the Drill project in 2021. If you missed out on this presentation in the January 2022 meetup then &lt;a href=&quot;/static/cong-luo-2021-review.pdf&quot;&gt;here, at least, is a digital copy of the content&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The next community meetup is later today, and the details for joining are &lt;a href=&quot;/community-resources/&quot;&gt;on the Community Resources page&lt;/a&gt;. Come and join us!&lt;/p&gt;
</description>
<pubDate>Fri, 04 Feb 2022 00:00:00 +0000</pubDate>
<link>/blog/2022/02/04/2021-in-review/</link>
<guid isPermaLink="true">/blog/2022/02/04/2021-in-review/</guid>
<category>blog</category>
</item>
<item>
<title>The reports of my death have been greatly exaggerated</title>
<description>&lt;p&gt;There’s a somewhat breathless post entitled “The Death of Apache Drill” in a blog that has as a theme the imminent demise of technologies previously or currently associated with Hadoop, with the exception of Trino (formerly known as PrestoSQL). It’s ultimately a promotional piece for the website’s owner, which is entirely normal and usually it wouldn’t warrant further mention. But it’s done whatever it is that it takes to climb up to the first page of the search results for “Apache Drill” so we’d like to clear up some of the misconceptions that it has come to proliferate. I’ll forgo placing a link here but you will not find it difficult to locate the original.&lt;/p&gt;
&lt;p&gt;Firstly, the title proclaims a little too much. Drill did suffer the loss of its primary corporate backer, and of course its pulse has been faint as a result, but we invite the author to visit the project and reconsider his declaration of death. We don’t have hundreds of active contributors making thousands of commits a year but there are enough of us to get bugs fixed, new data sources supported, and performance and reliability improved. In the near future I’ll blog about our recent work on Elasticsearch, Iceberg, JDBC writing, Parquet formats and Phoenix.&lt;/p&gt;
&lt;p&gt;We’ve started talking about speeding up our release cadence to better reflect our recent activity. We’re rekindling the project’s communication channels, and improving and translating our documentation. Metrics like &lt;a href=&quot;https://pepy.tech/project/sqlalchemy-drill&quot;&gt;downloads of Drill-related software&lt;/a&gt; suggest to us that interest has stopped trending down and started trending up. If this is death, in short, then that phenomenon is a lot less about resting in peace than we’ve allowed ourselves to suppose.&lt;/p&gt;
&lt;p&gt;Let’s turn next to the notion that Drill is “tied”, locked in, to MapR and Hadoop. As far as &lt;em&gt;Apache&lt;/em&gt; Drill is concerned, this has never been true in the time I’ve worked with it. You require nothing from MapR, nor do you need to run a single Hadoop service, in order to start querying using the Drill binaries we distribute with default settings. This is not to say that you &lt;em&gt;cannot&lt;/em&gt; integrate Drill with MapR products and Hadoop. It supports these things well and its history is certainly intertwined with theirs, but you need never configure them, e.g. if you’re using cheaper-but-slower cloud object storage instead. We do reuse some library code from Hadoop in our codebase, but so does Trino. The subsequent dichotomy implied by the post’s “Proprietary Solutions vs. Open Source” section heading is a false one: Apache Drill is entirely open source.&lt;/p&gt;
&lt;p&gt;We move on to the sentiment that users of Hadoop should be fearful. Hadoop probably was overdeployed as many of us scrambled for another scrap of open source to drop from Big Tech’s banquet table, where it was developed for a context that only some of us actually share. Some of those deployments will likely revert to something simpler or better matched to the problem at hand. Nevertheless, Hadoop is mature and capable software that solves a certain set of problems very well, lives at Apache, and is not about to vanish in a puff of smoke. I see no need at all for its users to feel afraid, regardless of how their big data stacks might evolve in the future.&lt;/p&gt;
&lt;p&gt;On performance and concurrency issues, I don’t have enough information to add anything useful to this. If they’re code problems, rather than misconfiguration, then we’d certainly make them a priority. It’s worth noting that, while there are projects that focus on speed above all else, contemporary Drill places as much weight on flexibility as it does on speed. And what about all the praise heaped on Trino? Well, we agree: this impressive project has accomplished a tremendous amount and we wish them the best for the future.&lt;/p&gt;
&lt;p&gt;Drill is at a very interesting point in its history. It presents a unique opportunity to developers who would like to challenge themselves, in that individual contributions are not diluted in a sea of commits from others, and even newcomers can have a major impact. If you’d like to come and pick an interesting problem in Drill to solve, please feel welcome: you’ll find us a friendly bunch. If you’d like a job working full time on Drill then send an email to me at dzamo at apache.org.&lt;/p&gt;
</description>
<pubDate>Sat, 30 Oct 2021 00:00:00 +0000</pubDate>
<link>/blog/2021/10/30/reports-of-my-death/</link>
<guid isPermaLink="true">/blog/2021/10/30/reports-of-my-death/</guid>
<category>blog</category>
</item>
<item>
<title>Drill provider for Airflow</title>
<description>&lt;p&gt;You’re building a new report, visualisation or ML model. Most of the data involved comes from sources well known to you but a new source has become available, allowing your team to measure and model new variables. Eager to get to a prototype and an early sense of what the new analytics look like, you head straight for the first order of business and start to construct a first version of the dataset upon which your final output will be based.&lt;/p&gt;
&lt;p&gt;The data sources you need to combine are immediately accessible but heterogeneous: transactional data in PostgreSQL must be combined with data from another team that uses Splunk, lookup data maintained by an operational team in an Excel spreadsheet, thousands of XML exports received from a partner and some Parquet files already in your big data environment, just for good measure.&lt;/p&gt;
&lt;p&gt;Using Drill iteratively, you query and join in each data source one at a time, applying grouping, filtering and other intensive transformations as you go, finally producing a dataset with the fields and grain you need. You store it by adding CREATE TABLE AS in front of your final SELECT, then write a few counting and summing queries against the original data sources and your transformed dataset to check that your code produces the expected outputs.&lt;/p&gt;
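&lt;p&gt;As a rough sketch of that last step (with every storage plugin, workspace, table and column name below invented for illustration), the final SELECT simply gets wrapped in a CTAS so the combined dataset lands in a writable workspace:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from sqlalchemy import create_engine

# Hypothetical Drillbit and sources; dfs.analytics must be a writable workspace.
engine = create_engine('drill+sadrill://localhost:8047/dfs?use_ssl=False')

ctas_sql = '''
CREATE TABLE dfs.analytics.`combined_dataset` AS
SELECT t.order_id, t.amount, lk.segment
FROM pg.public.transactions t
JOIN dfs.lookups.`segments.xlsx` lk ON lk.customer_id = t.customer_id
'''

# Materialise the prototyped dataset in the target workspace.
engine.execute(ctas_sql)
&lt;/code&gt;&lt;/pre&gt;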
&lt;p&gt;Apart from possibly configuring some new storage plugins in the Drill web UI, you have so far not left your SQL editor. The onerous data exploration and plumbing parts of your project have flashed by in a blaze of SQL, and you move your dataset into the next tool for visualisation or modelling. The results are good and you know that your users will immediately ask for the outputs to incorporate new data on a regular schedule.&lt;/p&gt;
&lt;p&gt;While Drill can assemble your dataset on the fly, as it did while you prototyped, doing that for the full set takes over 40 minutes, places more load than you’d like in office hours on to your data sources and limits you to the history that the sources keep, in some cases only a few weeks.&lt;/p&gt;
&lt;p&gt;It’s time for ETL, you concede. In the past that meant you had to choose between keeping your working Drill SQL and scheduling it using 70s Unix tools like Cron and Bash, seeing what you could jury-rig using some ETL software and interfaces to Drill like ODBC and JDBC, or recreating your Drill SQL entirely in (perhaps multiple) other tools and languages, perhaps Spark and Python. But this time you can do things differently…&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://airflow.apache.org&quot;&gt;Apache Airflow&lt;/a&gt; is a workflow engine built in the Python programming ecosystem that has grown into a leading choice for orchestrating big data pipelines, amongst its other applications. Perhaps the first point to understand about Airflow in the context of ETL is that it is designed only for workflow &lt;em&gt;control&lt;/em&gt;, and not for data flow. This makes it different from some of the ETL tools you might have encountered like Microsoft’s SSIS or Pentaho’s PDI which handle both control and data flow: these ETL tools will &lt;em&gt;themselves&lt;/em&gt; connect to a data source and stream data records from it.&lt;/p&gt;
&lt;p&gt;In contrast Airflow is, unless you’re doing it wrong, used only to instruct other software like Spark, Beam, PostgreSQL, Bash, Celery, Scikit-learn scripts, Slack, (… the list of providers is long and varied) to kick off actions at scheduled times. While Airflow does load its schedules from the crontab format, a comparison to cron stops there. Airflow can resolve and execute complex job DAGs with options for clustering, parallelism, retries, backfilling and performance monitoring.&lt;/p&gt;
&lt;p&gt;The exciting news for Drill users is that &lt;a href=&quot;https://pypi.org/project/apache-airflow-providers-apache-drill/&quot;&gt;a new provider package adding support for Drill&lt;/a&gt; was added to Airflow this month. This provider is based on the &lt;a href=&quot;https://pypi.org/project/sqlalchemy-drill/&quot;&gt;sqlalchemy-drill package&lt;/a&gt; which provides Drill connectivity for Python programs. This means that you can add tasks which execute queries on Drill to your Airflow DAGs without any hacky intermediate shell scripts, or build new Airflow operators that use the Drill hook.&lt;/p&gt;
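&lt;p&gt;To give a feel for it, here is a minimal sketch of such a DAG using the provider’s DrillOperator. It assumes an Airflow connection named drill_default pointing at your Drillbit; the DAG id, schedule and SQL are placeholders rather than anything taken from the tutorial linked below.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import datetime

from airflow import DAG
from airflow.providers.apache.drill.operators.drill import DrillOperator

with DAG(
    dag_id='drill_nightly_etl',
    start_date=datetime(2021, 8, 1),
    schedule_interval='0 2 * * *',  # crontab format, as mentioned above
    catchup=False,
) as dag:

    # Re-run, on a nightly schedule, the CTAS prototyped interactively.
    build_dataset = DrillOperator(
        task_id='build_dataset',
        drill_conn_id='drill_default',
        sql='''
            CREATE TABLE dfs.tmp.`report_dataset` AS
            SELECT * FROM dfs.root.`staging`
        ''',
    )
&lt;/code&gt;&lt;/pre&gt;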
&lt;p&gt;A new tutorial that walks through the development of a simple Airflow DAG that uses the Drill provider &lt;a href=&quot;/docs/orchestrating-queries-with-airflow/&quot;&gt;is available here&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Thu, 05 Aug 2021 00:00:00 +0000</pubDate>
<link>/blog/2021/08/05/drill-provider-for-airflow/</link>
<guid isPermaLink="true">/blog/2021/08/05/drill-provider-for-airflow/</guid>
<category>blog</category>
</item>
<item>
<title>Streaming data from the Drill REST API</title>
<description>&lt;p&gt;Anyone who’s used a UNIX pipe, or even just watched something on Netflix, is at least a little familiar with the idea of processing data in a streaming fashion. While your data are small in size compared to available memory and I/O speeds, streaming is something you can afford to dispense with. But when you cannot fit an entire dataset in RAM, or when you have to download an entire 4K movie before you can start playing it, then streaming data processing can make a game changing difference.&lt;/p&gt;
&lt;p&gt;With the release of version 1.19, Drill will stream JSON query result data over an HTTP response to the client that initiated the query using the REST API. And if anything can easily get big compared to your available RAM or network speed, it’s query results coming back from Drill. It’s important to note here that JSON over HTTP is never going to be the most &lt;em&gt;efficient&lt;/em&gt; way to move big data around&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. JDBC, ODBC and &lt;a href=&quot;https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html&quot;&gt;innovations around them&lt;/a&gt; will always win a speed race over JSON/HTTP, and simply network mounting or copying Parquet files out of your big data storage pool (be that HDFS, S3, or something else) is probably going to beat everything else you try once you’ve got really big data volumes.&lt;/p&gt;
&lt;p&gt;Where JSON and HTTP &lt;em&gt;do&lt;/em&gt; win is universality: today it’s hard to imagine a client hardware and software stack that doesn’t provide JSON and HTTP out of the box with minimal effort. So it’s important that they work as well as is possible, in spite of the alternatives that exist. The new streaming query results delivery on the server side means that Drill’s heap memory isn’t pressurised by having to buffer entire result sets before it starts to transmit them over the network. Even existing REST API clients that do no stream processing of their own will benefit from better reliability because of this.&lt;/p&gt;
&lt;p&gt;To fully realise the benefits of streaming query result data, clients can &lt;em&gt;themselves&lt;/em&gt; operate on the HTTP response they receive in a streaming fashion, thereby potentially starting to process records before Drill has even finished materialising them and avoiding holding the full set in memory if they choose. At the transport level (when there are enough results to warrant it), the HTTP response headers from the Drill REST API will include a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Transfer-Encoding: chunked&lt;/code&gt; header indicating that data transmission in chunks will ensue. Most HTTP client libraries will abstract this implementation detail, instead presenting you with a stream which you can provide as the input to a streaming JSON parser.&lt;/p&gt;
&lt;p&gt;If you set out to develop a streaming HTTP client for the Drill REST API, do take note that the schema of the JSON query result is &lt;em&gt;not&lt;/em&gt; a &lt;a href=&quot;https://en.wikipedia.org/wiki/JSON_streaming&quot;&gt;streaming JSON format&lt;/a&gt; like “JSON lines”. This means that you must be careful about which JSON objects you parse entirely in a single call, particularly avoiding any parent of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rows&lt;/code&gt; property in the query result which would see you parse essentially the entire payload in a single step. The newly released version 1.1.0 of the Python &lt;a href=&quot;https://pypi.org/project/sqlalchemy-drill/&quot;&gt;sqlalchemy-drill&lt;/a&gt; library includes &lt;a href=&quot;https://github.com/JohnOmernik/sqlalchemy-drill/blob/master/sqlalchemy_drill/drilldbapi/_drilldbapi.py&quot;&gt;an implementation of a streaming HTTP client&lt;/a&gt; based on the &lt;a href=&quot;https://pypi.org/project/ijson/&quot;&gt;ijson&lt;/a&gt; streaming JSON parser which might make for a useful reference.&lt;/p&gt;
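&lt;p&gt;For a bare-bones illustration of the approach (much simplified relative to the sqlalchemy-drill implementation linked above), here is a sketch using the requests and ijson packages directly; the host and query are placeholders and error handling is omitted.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import ijson
import requests

# Submit the query to the Drill REST API and keep the response unbuffered.
resp = requests.post(
    'http://localhost:8047/query.json',
    json={'queryType': 'SQL', 'query': 'SELECT * FROM dfs.ws.`big_table`'},
    stream=True,
)
resp.raise_for_status()

# ijson walks the chunked response incrementally: asking for 'rows.item'
# yields one row dict at a time instead of parsing the whole payload.
for i, row in enumerate(ijson.items(resp.raw, 'rows.item'), start=1):
    if i % 10000 == 0:
        print(f'streamed {i} rows')
&lt;/code&gt;&lt;/pre&gt;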
&lt;p&gt;In closing, and for a bit of fun, here’s the log from a short IPython session where I use sqlalchemy-drill to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT *&lt;/code&gt; on a remote 17 billion record table over a 0.5 Mbit/s link and start scanning through (a steady trickle of) rows in seconds.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ipython&quot;&gt;In [4]: r = engine.execute('select count(*) from dfs.ws.big_table')
INFO:drilldbapi:received Drill query ID 1f211888-cc20-fe6f-69d1-6584c5caa2df.
INFO:drilldbapi:opened a row data stream of 1 columns.
In [5]: next(r)
Out[5]: (17437571247,)
In [6]: r = engine.execute('select * from dfs.ws.big_table')
INFO:drilldbapi:received Drill query ID 1f211838-73df-1506-a74e-f5695f6b0ff5.
INFO:drilldbapi:opened a row data stream of 21 columns.
In [7]: while True:
...: _ = next(r)
...:
INFO:drilldbapi:streamed 10000 rows.
INFO:drilldbapi:streamed 20000 rows.
INFO:drilldbapi:streamed 30000 rows.
INFO:drilldbapi:streamed 40000 rows.
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
&lt;ol&gt;
&lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Some mitigation is possible. JSON representations of tabular big data are typically extremely compressible and you can reduce the bytes sent over the network by 10-20x by enabling HTTP response compression on a web server like Apache placed in front of the Drill REST API. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
<pubDate>Fri, 09 Jul 2021 00:00:00 +0000</pubDate>
<link>/blog/2021/07/09/streaming-data-from-the-rest-api/</link>
<guid isPermaLink="true">/blog/2021/07/09/streaming-data-from-the-rest-api/</guid>
<category>blog</category>
</item>
</channel>
</rss>