HAWQ-1605. Support INSERT in PXF JDBC plugin

(closes #1353)

Fix incorrect TIMESTAMP handling

PXF JDBC plugin update

* Add support for INSERT queries:
	* INSERT queries are processed by the same classes as SELECT queries;
	* INSERTs are executed through a JDBC PreparedStatement;
	* INSERTs support batching via the JDBC batch API (see the sketch below);
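
A minimal sketch of this batched INSERT path in plain JDBC; the connection URL, table, and batch size are illustrative, not the plugin's actual code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedInsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/db", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO target (id, name) VALUES (?, ?)")) {
            int batchSize = 100;  // flush threshold (illustrative)
            int pending = 0;
            for (int i = 0; i < 1000; i++) {
                stmt.setInt(1, i);
                stmt.setString(2, "row-" + i);
                stmt.addBatch();              // accumulate instead of executing row by row
                if (++pending == batchSize) {
                    stmt.executeBatch();      // one round trip per batch
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch();          // flush the tail
            }
        }
    }
}
```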

* Minor changes in WhereSQLBuilder and JdbcPartitionFragmenter:
	* Removed 'WHERE 1=1';
	* Consistent spacing around operators everywhere ('a = b', not 'a=b');
	* JdbcPartitionFragmenter.buildFragmenterSql() made static to avoid extra checks of InputData (proposed by @sansanichfb);

* Refactoring and minor micro-optimizations;

PXF JDBC refactoring

* The README.md is completely rewritten;

* Numerous changes to comments and Javadoc;

* Code refactoring and minor code style changes

Fixes proposed by @sansanichfb

Add DbProduct for Microsoft SQL Server

Notes on consistency in README and errors

* Add an explicit note that the consistency of INSERT queries is not guaranteed.

* Change error message on INSERT failure

* Minor corrections of README

The fixes were proposed by @sansanichfb

Improve WhereSQLBuilder

* Add support for TIMESTAMP values;

* Add support for the operators <>, LIKE, IS NULL, and IS NOT NULL (see the sketch below).
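
A hedged sketch of the kind of operator-to-SQL mapping WhereSQLBuilder performs; the enum and method names here are illustrative, not the plugin's API:

```java
public class WherePredicateSketch {
    enum Op { EQ, NE, LIKE, IS_NULL, IS_NOT_NULL }

    static String predicate(String column, Op op, String value) {
        switch (op) {
            case EQ:          return column + " = " + value;    // spaces around operators
            case NE:          return column + " <> " + value;
            case LIKE:        return column + " LIKE " + value;
            case IS_NULL:     return column + " IS NULL";
            case IS_NOT_NULL: return column + " IS NOT NULL";
            default: throw new IllegalArgumentException("Unsupported operator: " + op);
        }
    }

    public static void main(String[] args) {
        // TIMESTAMP constants are quoted so the external database can parse them
        System.out.println("WHERE " + predicate("tm", Op.EQ, "'2018-01-01 00:00:00'"));
        System.out.println("WHERE " + predicate("name", Op.LIKE, "'abc%'"));
    }
}
```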

Fix proposed by @sansanichfb

Throw an exception when `openForWrite()` is asked to open a connection to an external database that is already open for writing.

Although the behaviour differs for `openForRead()`, that case does not apply here. A second call to `openForWrite()` could come from another thread, and that would result in a race: the `PreparedStatement` used to write to the external database is the same object for all threads, and `writeNextObject()` is not `synchronized` (or protected in some other way).
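
A minimal sketch of the guard described above, with simplified field names and error message:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class WriteOpenGuardSketch {
    private PreparedStatement statementWrite;

    public boolean openForWrite(Connection connection, String query) throws SQLException {
        if (statementWrite != null && !statementWrite.isClosed()) {
            // A second open could come from another thread and race on the shared statement
            throw new SQLException("The connection to the external database is already open");
        }
        statementWrite = connection.prepareStatement(query);
        return true;
    }
}
```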

Simplify logging; BatchUpdateException

Simplify logging so that the logs produced by pxf-jdbc do not grow too large when DEBUG is enabled. The removed logging calls reported field types and names, which in most cases duplicate the data already provided; exceptions are still logged.

Add processing of BatchUpdateException, so that the real cause of a failure is returned to the user (see the sketch below).
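
A hedged sketch of this unwrapping; BatchUpdateException exposes the underlying error through the standard SQLException.getNextException() call:

```java
import java.sql.BatchUpdateException;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchErrorSketch {
    static void executeBatchSafely(PreparedStatement statement) throws SQLException {
        try {
            statement.executeBatch();
        } catch (BatchUpdateException e) {
            SQLException cause = e.getNextException();
            // Re-throw the underlying exception so the user sees the real cause
            throw (cause != null) ? cause : e;
        }
    }
}
```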

PXF JDBC thread pool support

Implement multi-threaded processing of INSERT queries using a thread pool. To use the feature, set the POOL_SIZE parameter in the LOCATION clause of an external table (<1: pool size equals the number of CPUs available to the JVM; =1: disable the thread pool; >1: use the given pool size).

Not all operations are handled by pool threads: pool threads only execute() the queries; they do not fill the PreparedStatement from OneRow.
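
A minimal sketch of the POOL_SIZE semantics, assuming a plain fixed-size executor (the plugin's actual wiring is omitted):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizeSketch {
    static ExecutorService makePool(int poolSize) {
        if (poolSize < 1) {
            // <1: fall back to the number of CPUs available to the JVM
            poolSize = Runtime.getRuntime().availableProcessors();
        }
        // =1 effectively disables parallelism; >1 uses the given pool size
        return Executors.newFixedThreadPool(poolSize);
    }
}
```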

Redesign connection pooling

* Redesign connection pooling: move OneRow object processing to the pool threads. This decreases the load on the single-threaded part of PXF;

* Introduce WriterCallable and related classes (see the sketch after this list). This significantly simplifies the code of JdbcAccessor, makes it easy to introduce new methods of processing INSERT queries, and enables quick hardcoded tweaks for the same purpose.

* Add documentation on the thread pool feature
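
A hedged sketch of the WriterCallable idea: a self-contained unit of INSERT work that a pool thread can run. OneRow is replaced by Object here to keep the sketch self-contained; the method names follow the description above:

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public interface WriterCallableSketch extends Callable<SQLException> {
    // Accumulate one row; implementations throw IllegalStateException when full
    void supply(Object row);

    // True when enough rows have accumulated and call() should be invoked;
    // call() executes the accumulated INSERTs and returns the first error, or null
    boolean isCallRequired();
}
```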

Support long values in PARTITION clause

Support values of the Java primitive type 'long' in the PARTITION clause (for both RANGE and INTERVAL values).

* Modify JdbcPartitionFragmenter (convert all int variables to long)
* Move parsing of INTERVAL values for PARTITION_TYPE "INT" to the class constructor, and add a parse exception handler (see the sketch after this list)
* Simplify ByteUtil (remove methods to deal with values of type 'int')
* Update JdbcPartitionFragmenterTest
* Minor changes in comments
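
A minimal sketch of the constructor-time INTERVAL parsing with a parse exception handler; the class name and error message are illustrative:

```java
public class IntervalSketch {
    private final long intervalNum;  // was 'int' before the change

    public IntervalSketch(String intervalValue) {
        try {
            intervalNum = Long.parseLong(intervalValue);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "The INTERVAL value '" + intervalValue + "' must be a valid long", e);
        }
    }
}
```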

Fix pxf-profiles-default.xml

Remove an ampersand from the description of the JDBC profile in pxf-profiles-default.xml

Remove explicit throws of IllegalArgumentException

Remove explicit references to 'IllegalArgumentException', as the caller is probably unable to recover from them.
'IllegalStateException' is left unchanged, as it is thrown when the caller must perform an action that will resolve the problem (e.g., a 'WriterCallable' is full).

Other runtime exceptions are explicitly listed in function definitions as before; their causes are usually known to the caller, so it can handle them or at least return a more meaningful error message to the user.

Proposed by Alex Denissov <adenissov@pivotal.io>

Simplify isCallRequired()

Make the body of 'isCallRequired()' a one-line expression in all implementations of 'WriterCallable', as in the sketch below.
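
One possible one-line form, assuming the implementation tracks its batch fill level (field names are assumptions):

```java
public class BatchWriterCallableSketch {
    private int rowsInBatch;
    private final int batchSize = 100;  // illustrative threshold

    public boolean isCallRequired() {
        return rowsInBatch >= batchSize;  // entire body is a single expression
    }
}
```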

Proposed by Alex Denissov <adenissov@pivotal.io>

Remove rollback and change BATCH_SIZE logic

Remove calls to 'tryRollback()' and all processing of rollbacks in INSERT queries.
The reason for the change is that rollback is effective in only one case: the INSERT is performed by a single PXF segment using a single thread, and the external database supports transactions. In most cases, more than one PXF segment performs the INSERT, so rollback is of no use.
On the other hand, rollback logic is cumbersome and notably increases code complexity.

With rollback removed, there is no longer a need to keep BATCH_SIZE infinite as often as possible (an infinite BATCH_SIZE reduces, but does not eliminate, the number of scenarios in which rollback() can fail).
Thus, setting a recommended value (https://docs.oracle.com/cd/E11882_01/java.112/e16548/oraperf.htm#JJDBC28754) makes sense.
The old infinite batch size logic also remains available.

Modify README.md: minor corrections and the new BATCH_SIZE logic (see the sketch below)
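
A sketch of the resulting BATCH_SIZE behaviour; the default value and the encoding of the infinite case are assumptions for illustration, not the final contract:

```java
public class BatchSizeSketch {
    // A finite default in the commonly recommended 50..100 range
    static final int DEFAULT_BATCH_SIZE = 100;

    static boolean flushNeeded(int rowsInBatch, int batchSize) {
        // batchSize <= 0 models the old "infinite" batch: flush only once, at close
        return batchSize > 0 && rowsInBatch >= batchSize;
    }
}
```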

Proposed by Alex Denissov <adenissov@pivotal.io>

Change BATCH_SIZE logic

* Modify BATCH_SIZE parameter processing according to the new proposals (https://github.com/apache/hawq/pull/1353#discussion_r214413534)
* Update README.md
* Restore the fallback to non-batched INSERTs when the external database (or its JDBC driver) does not support batch updates (see the sketch below)
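
A minimal sketch of the capability check behind this fallback, using the standard DatabaseMetaData.supportsBatchUpdates() call:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.SQLException;

public class BatchSupportSketch {
    static boolean useBatching(Connection connection, boolean batchingRequested) throws SQLException {
        DatabaseMetaData metadata = connection.getMetaData();
        // If the driver reports no batch support, degrade to row-by-row INSERTs
        return batchingRequested && metadata.supportsBatchUpdates();
    }
}
```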

Proposed by Alex Denissov <adenissov@pivotal.io>
Proposed by Dmitriy Pavlov <pavlovdmst@gmail.com>

Modify processing of BATCH_SIZE parameter

Modify BATCH_SIZE parameter processing according to the proposal https://github.com/apache/hawq/pull/1353#discussion_r215023775:
* Update allowed values of BATCH_SIZE and their meanings
* Introduce an explicit flag for the presence of the BATCH_SIZE parameter (see the sketch after this list)
* Introduce DEFAULT_BATCH_SIZE constant in JdbcPlugin
* Move processing of BATCH_SIZE values to JdbcAccessor
* Update README.md
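
A hedged sketch of the parameter handling described above; the exact validation rules and default here are placeholders, not the plugin's final behaviour:

```java
public class BatchSizeOptionSketch {
    static final int DEFAULT_BATCH_SIZE = 100;  // mirrors the constant introduced in JdbcPlugin

    final boolean batchSizeIsSet;  // explicit presence flag for the BATCH_SIZE parameter
    final int batchSize;

    BatchSizeOptionSketch(String rawValue) {
        batchSizeIsSet = (rawValue != null);
        batchSize = batchSizeIsSet ? Integer.parseInt(rawValue) : DEFAULT_BATCH_SIZE;
        if (batchSize < 0) {
            throw new IllegalArgumentException("BATCH_SIZE must not be negative");
        }
    }
}
```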

Proposed by @divyabhargov, @denalex

Fix column type for columns converted to TEXT

Modify column type processing so that the column type is set correctly for fields that:
* Are represented as columns of type TEXT by GPDBWritable, but whose actual type is different
* Contain a NULL value

Previously, the column type code was not set correctly for such columns because of a check on the NULL field value (see the sketch below).
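
A minimal sketch of the fix, with illustrative names: the type code comes from the column descriptors, not from the (possibly NULL) field values:

```java
import java.util.List;

public class ColumnTypeFixSketch {
    static int[] resolveTypeCodes(List<Integer> columnTypeCodes, List<Object> fieldValues) {
        int[] codes = new int[columnTypeCodes.size()];
        for (int i = 0; i < codes.length; i++) {
            // Previously the code was skipped when fieldValues.get(i) == null,
            // leaving a wrong type for NULL fields; the schema is authoritative.
            codes[i] = columnTypeCodes.get(i);
        }
        return codes;
    }
}
```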

Proposed and authored by @divyabhargov

Remove parseUnsignedInt


Apache HAWQ


Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users with the tools to confidently and successfully interact with petabyte-range data sets. HAWQ provides users with a complete, standards-compliant SQL interface. More specifically, HAWQ has the following features:

  • On-premise or cloud deployment
  • Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
  • Extremely high performance: many times faster than other Hadoop SQL engines
  • World-class parallel optimizer
  • Full transaction capability and consistency guarantee: ACID
  • Dynamic data flow engine through a high-speed UDP-based interconnect
  • Elastic execution engine based on virtual segments and data locality
  • Support for multi-level partitioning and List/Range partitioned tables
  • Multiple compression method support: snappy, gzip, zlib
  • Multi-language user-defined function support: Python, Perl, Java, C/C++, R
  • Advanced machine learning and data mining functionality through MADLib
  • Dynamic node expansion: in seconds
  • The most advanced three-level resource management: integration with YARN and hierarchical resource queues
  • Easy access to all HDFS data and external system data (for example, HBase)
  • Hadoop native: from storage (HDFS) and resource management (YARN) to deployment (Ambari)
  • Authentication & granular authorization: Kerberos, SSL, and role-based access
  • Advanced C/C++ access libraries for HDFS and YARN: libhdfs3 & libYARN
  • Support for most third-party tools: Tableau, SAS, et al.
  • Standard connectivity: JDBC/ODBC

Build & Install & Test


Please see HAWQ wiki page: https://cwiki.apache.org/confluence/display/HAWQ/Build+and+Install

Export Control


This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.