HAWQ-1605. Support INSERT in PXF JDBC plugin

(closes #1353)

Fix incorrect TIMESTAMP handling

PXF JDBC plugin update

* Add support for INSERT queries:
	* INSERT queries are processed by the same classes as SELECT queries;
	* INSERTs are executed through a JDBC PreparedStatement;
	* INSERTs support batching via the JDBC batch API (see the sketch below);
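
A minimal sketch of this batched INSERT path in plain JDBC; the connection URL, table, and batch size are illustrative, not the plugin's actual code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedInsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/db", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO target (id, name) VALUES (?, ?)")) {
            int batchSize = 100;  // flush threshold (illustrative)
            int pending = 0;
            for (int i = 0; i < 1000; i++) {
                stmt.setInt(1, i);
                stmt.setString(2, "row-" + i);
                stmt.addBatch();              // accumulate instead of executing row by row
                if (++pending == batchSize) {
                    stmt.executeBatch();      // one round trip per batch
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch();          // flush the tail
            }
        }
    }
}
```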

* Minor changes in WhereSQLBuilder and JdbcPartitionFragmenter:
	* Removed 'WHERE 1=1';
	* Consistent spacing around operators everywhere ('a = b', not 'a=b');
	* JdbcPartitionFragmenter.buildFragmenterSql() made static to avoid extra checks of InputData (proposed by @sansanichfb);

* Refactoring and minor micro-optimizations;

PXF JDBC refactoring

* The README.md is completely rewritten;

* Numerous changes to comments and Javadoc;

* Code refactoring and minor code style changes

Fixes proposed by @sansanichfb

Add DbProduct for Microsoft SQL Server

Notes on consistency in README and errors

* Add an explicit note that the consistency of INSERT queries is not guaranteed.

* Change error message on INSERT failure

* Minor corrections of README

The fixes were proposed by @sansanichfb

Improve WhereSQLBuilder

* Add support for TIMESTAMP values;

* Add support for the operators <>, LIKE, IS NULL, and IS NOT NULL (see the sketch below).
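
A hedged sketch of the kind of operator-to-SQL mapping WhereSQLBuilder performs; the enum and method names here are illustrative, not the plugin's API:

```java
public class WherePredicateSketch {
    enum Op { EQ, NE, LIKE, IS_NULL, IS_NOT_NULL }

    static String predicate(String column, Op op, String value) {
        switch (op) {
            case EQ:          return column + " = " + value;    // spaces around operators
            case NE:          return column + " <> " + value;
            case LIKE:        return column + " LIKE " + value;
            case IS_NULL:     return column + " IS NULL";
            case IS_NOT_NULL: return column + " IS NOT NULL";
            default: throw new IllegalArgumentException("Unsupported operator: " + op);
        }
    }

    public static void main(String[] args) {
        // TIMESTAMP constants are quoted so the external database can parse them
        System.out.println("WHERE " + predicate("tm", Op.EQ, "'2018-01-01 00:00:00'"));
        System.out.println("WHERE " + predicate("name", Op.LIKE, "'abc%'"));
    }
}
```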

Fix proposed by @sansanichfb

Throw an exception when `openForWrite()` is asked to open a connection to an external database that is already open for writing.

Although the behaviour differs for `openForRead()`, that case does not apply here. A second call to `openForWrite()` could come from another thread, and that would result in a race: the `PreparedStatement` used to write to the external database is the same object for all threads, and `writeNextObject()` is not `synchronized` (or protected in some other way).
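
A minimal sketch of the guard described above, with simplified field names and error message:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class WriteOpenGuardSketch {
    private PreparedStatement statementWrite;

    public boolean openForWrite(Connection connection, String query) throws SQLException {
        if (statementWrite != null && !statementWrite.isClosed()) {
            // A second open could come from another thread and race on the shared statement
            throw new SQLException("The connection to the external database is already open");
        }
        statementWrite = connection.prepareStatement(query);
        return true;
    }
}
```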

Simplify logging; BatchUpdateException

Simplify logging so that the logs produced by pxf-jdbc do not grow too large when DEBUG is enabled. The removed logging calls reported field types and names, which in most cases duplicate the data already provided; exceptions are still logged.

Add processing of BatchUpdateException, so that the real cause of a failure is returned to the user (see the sketch below).
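
A hedged sketch of this unwrapping; BatchUpdateException exposes the underlying error through the standard SQLException.getNextException() call:

```java
import java.sql.BatchUpdateException;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchErrorSketch {
    static void executeBatchSafely(PreparedStatement statement) throws SQLException {
        try {
            statement.executeBatch();
        } catch (BatchUpdateException e) {
            SQLException cause = e.getNextException();
            // Re-throw the underlying exception so the user sees the real cause
            throw (cause != null) ? cause : e;
        }
    }
}
```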

PXF JDBC thread pool support

Implement multi-threaded processing of INSERT queries using a thread pool. To use the feature, set the POOL_SIZE parameter in the LOCATION clause of an external table (<1: pool size equals the number of CPUs available to the JVM; =1: disable the thread pool; >1: use the given pool size).

Not all operations are handled by pool threads: pool threads only execute() the queries; they do not fill the PreparedStatement from OneRow.
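
A minimal sketch of the POOL_SIZE semantics, assuming a plain fixed-size executor (the plugin's actual wiring is omitted):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizeSketch {
    static ExecutorService makePool(int poolSize) {
        if (poolSize < 1) {
            // <1: fall back to the number of CPUs available to the JVM
            poolSize = Runtime.getRuntime().availableProcessors();
        }
        // =1 effectively disables parallelism; >1 uses the given pool size
        return Executors.newFixedThreadPool(poolSize);
    }
}
```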

Redesign connection pooling

* Redesign connection pooling: move OneRow object processing to the pool threads. This decreases the load on the single-threaded part of PXF;

* Introduce WriterCallable and related classes (see the sketch after this list). This significantly simplifies the code of JdbcAccessor, makes it easy to introduce new methods of processing INSERT queries, and enables quick hardcoded tweaks for the same purpose.

* Add documentation on the thread pool feature
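
A hedged sketch of the WriterCallable idea: a self-contained unit of INSERT work that a pool thread can run. OneRow is replaced by Object here to keep the sketch self-contained; the method names follow the description above:

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public interface WriterCallableSketch extends Callable<SQLException> {
    // Accumulate one row; implementations throw IllegalStateException when full
    void supply(Object row);

    // True when enough rows have accumulated and call() should be invoked;
    // call() executes the accumulated INSERTs and returns the first error, or null
    boolean isCallRequired();
}
```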

Support long values in PARTITION clause

Support values of the Java primitive type 'long' in the PARTITION clause (for both RANGE and INTERVAL values).

* Modify JdbcPartitionFragmenter (convert all int variables to long)
* Move parsing of INTERVAL values for PARTITION_TYPE "INT" to the class constructor, and add a parse exception handler (see the sketch after this list)
* Simplify ByteUtil (remove methods to deal with values of type 'int')
* Update JdbcPartitionFragmenterTest
* Minor changes in comments
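
A minimal sketch of the constructor-time INTERVAL parsing with a parse exception handler; the class name and error message are illustrative:

```java
public class IntervalSketch {
    private final long intervalNum;  // was 'int' before the change

    public IntervalSketch(String intervalValue) {
        try {
            intervalNum = Long.parseLong(intervalValue);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "The INTERVAL value '" + intervalValue + "' must be a valid long", e);
        }
    }
}
```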

Fix pxf-profiles-default.xml

Remove an ampersand from the description of the JDBC profile in pxf-profiles-default.xml

Remove explicit throws of IllegalArgumentException

Remove explicit references to 'IllegalArgumentException', as the caller is probably unable to recover from them.
'IllegalStateException' is left unchanged, as it is thrown when the caller must perform an action that will resolve the problem (e.g., a 'WriterCallable' is full).

Other runtime exceptions are explicitly listed in function definitions as before; their causes are usually known to the caller, so it can handle them or at least return a more meaningful error message to the user.

Proposed by Alex Denissov <adenissov@pivotal.io>

Simplify isCallRequired()

Make the body of 'isCallRequired()' a one-line expression in all implementations of 'WriterCallable', as in the sketch below.
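
One possible one-line form, assuming the implementation tracks its batch fill level (field names are assumptions):

```java
public class BatchWriterCallableSketch {
    private int rowsInBatch;
    private final int batchSize = 100;  // illustrative threshold

    public boolean isCallRequired() {
        return rowsInBatch >= batchSize;  // entire body is a single expression
    }
}
```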

Proposed by Alex Denissov <adenissov@pivotal.io>

Remove rollback and change BATCH_SIZE logic

Remove calls to 'tryRollback()' and all processing of rollbacks in INSERT queries.
The reason for the change is that rollback is effective in only one case: the INSERT is performed by a single PXF segment using a single thread, and the external database supports transactions. In most cases, more than one PXF segment performs the INSERT, so rollback is of no use.
On the other hand, rollback logic is cumbersome and notably increases code complexity.

With rollback removed, there is no longer a need to keep BATCH_SIZE infinite as often as possible (an infinite BATCH_SIZE reduces, but does not eliminate, the number of scenarios in which rollback() can fail).
Thus, setting a recommended value (https://docs.oracle.com/cd/E11882_01/java.112/e16548/oraperf.htm#JJDBC28754) makes sense.
The old infinite batch size logic also remains available.

Modify README.md: minor corrections and the new BATCH_SIZE logic (see the sketch below)
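
A sketch of the resulting BATCH_SIZE behaviour; the default value and the encoding of the infinite case are assumptions for illustration, not the final contract:

```java
public class BatchSizeSketch {
    // A finite default in the commonly recommended 50..100 range
    static final int DEFAULT_BATCH_SIZE = 100;

    static boolean flushNeeded(int rowsInBatch, int batchSize) {
        // batchSize <= 0 models the old "infinite" batch: flush only once, at close
        return batchSize > 0 && rowsInBatch >= batchSize;
    }
}
```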

Proposed by Alex Denissov <adenissov@pivotal.io>

Change BATCH_SIZE logic

* Modify BATCH_SIZE parameter processing according to the new proposals (https://github.com/apache/hawq/pull/1353#discussion_r214413534)
* Update README.md
* Restore the fallback to non-batched INSERTs when the external database (or its JDBC driver) does not support batch updates (see the sketch below)
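
A minimal sketch of the capability check behind this fallback, using the standard DatabaseMetaData.supportsBatchUpdates() call:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.SQLException;

public class BatchSupportSketch {
    static boolean useBatching(Connection connection, boolean batchingRequested) throws SQLException {
        DatabaseMetaData metadata = connection.getMetaData();
        // If the driver reports no batch support, degrade to row-by-row INSERTs
        return batchingRequested && metadata.supportsBatchUpdates();
    }
}
```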

Proposed by Alex Denissov <adenissov@pivotal.io>
Proposed by Dmitriy Pavlov <pavlovdmst@gmail.com>

Modify processing of BATCH_SIZE parameter

Modify BATCH_SIZE parameter processing according to the proposal https://github.com/apache/hawq/pull/1353#discussion_r215023775:
* Update allowed values of BATCH_SIZE and their meanings
* Introduce an explicit flag for the presence of the BATCH_SIZE parameter (see the sketch after this list)
* Introduce DEFAULT_BATCH_SIZE constant in JdbcPlugin
* Move processing of BATCH_SIZE values to JdbcAccessor
* Update README.md
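
A hedged sketch of the parameter handling described above; the exact validation rules and default here are placeholders, not the plugin's final behaviour:

```java
public class BatchSizeOptionSketch {
    static final int DEFAULT_BATCH_SIZE = 100;  // mirrors the constant introduced in JdbcPlugin

    final boolean batchSizeIsSet;  // explicit presence flag for the BATCH_SIZE parameter
    final int batchSize;

    BatchSizeOptionSketch(String rawValue) {
        batchSizeIsSet = (rawValue != null);
        batchSize = batchSizeIsSet ? Integer.parseInt(rawValue) : DEFAULT_BATCH_SIZE;
        if (batchSize < 0) {
            throw new IllegalArgumentException("BATCH_SIZE must not be negative");
        }
    }
}
```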

Proposed by @divyabhargov, @denalex

Fix column type for columns converted to TEXT

Modify column type processing so that the column type is set correctly for fields that:
* Are represented as columns of type TEXT by GPDBWritable, but whose actual type is different
* Contain a NULL value

Previously, the column type code was not set correctly for such columns because of a check on the NULL field value (see the sketch below).
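
A minimal sketch of the fix, with illustrative names: the type code comes from the column descriptors, not from the (possibly NULL) field values:

```java
import java.util.List;

public class ColumnTypeFixSketch {
    static int[] resolveTypeCodes(List<Integer> columnTypeCodes, List<Object> fieldValues) {
        int[] codes = new int[columnTypeCodes.size()];
        for (int i = 0; i < codes.length; i++) {
            // Previously the code was skipped when fieldValues.get(i) == null,
            // leaving a wrong type for NULL fields; the schema is authoritative.
            codes[i] = columnTypeCodes.get(i);
        }
        return codes;
    }
}
```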

Proposed and authored by @divyabhargov

Remove parseUnsignedInt


Apache HAWQ


Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users with the tools to confidently and successfully interact with petabyte-range data sets. HAWQ provides users with a complete, standards-compliant SQL interface. More specifically, HAWQ has the following features:

  • On-premise or cloud deployment
  • Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
  • Extremely high performance: many times faster than other Hadoop SQL engines
  • World-class parallel optimizer
  • Full transaction capability and consistency guarantee: ACID
  • Dynamic data flow engine through a high-speed UDP-based interconnect
  • Elastic execution engine based on virtual segments and data locality
  • Support for multi-level partitioning and List/Range partitioned tables
  • Multiple compression method support: snappy, gzip, zlib
  • Multi-language user-defined function support: Python, Perl, Java, C/C++, R
  • Advanced machine learning and data mining functionality through MADLib
  • Dynamic node expansion: in seconds
  • The most advanced three-level resource management: integration with YARN and hierarchical resource queues
  • Easy access to all HDFS data and external system data (for example, HBase)
  • Hadoop native: from storage (HDFS) and resource management (YARN) to deployment (Ambari)
  • Authentication & granular authorization: Kerberos, SSL, and role-based access
  • Advanced C/C++ access libraries for HDFS and YARN: libhdfs3 & libYARN
  • Support for most third-party tools: Tableau, SAS, et al.
  • Standard connectivity: JDBC/ODBC

Build & Install & Test


Please see HAWQ wiki page: https://cwiki.apache.org/confluence/display/HAWQ/Build+and+Install

Export Control


This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.