| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept rev="1.2" id="create_function"> |
| |
| <title>CREATE FUNCTION Statement</title> |
| <titlealts audience="PDF"><navtitle>CREATE FUNCTION</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="DDL"/> |
| <data name="Category" value="Schemas"/> |
| <data name="Category" value="Impala Functions"/> |
| <data name="Category" value="UDFs"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> Creates a user-defined function (UDF), which you can use to implement |
| custom logic during <codeph>SELECT</codeph> or <codeph>INSERT</codeph> |
| operations. </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <p> |
| The syntax is different depending on whether you create a scalar UDF, which is called once for each row and |
| implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by |
| multiple functions that compute intermediate results across sets of rows. |
| </p> |
| |
| <p rev="2.5.0 IMPALA-2843"> |
| In <keyword keyref="impala25_full"/> and higher, the syntax is also different for creating or dropping scalar Java-based UDFs. |
| The statements for Java UDFs use a new syntax, without any argument types or return type specified. Java-based UDFs |
| created using the new syntax persist across restarts of the Impala catalog server, and can be shared transparently |
| between Impala and Hive. |
| </p> |
| |
| <p> |
| To create a persistent scalar C++ UDF with <codeph>CREATE FUNCTION</codeph>: |
| </p> |
| |
| <codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]) |
| RETURNS <varname>return_type</varname> |
| LOCATION '<varname>hdfs_path_to_dot_so</varname>' |
| SYMBOL='<varname>symbol_name</varname>'</codeblock> |
| |
| <p rev="2.5.0 IMPALA-2843"> |
| To create a persistent Java UDF with <codeph>CREATE FUNCTION</codeph>: |
| <codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname> |
| LOCATION '<varname>hdfs_path_to_jar</varname>' |
| SYMBOL='<varname>class_name</varname>'</codeblock> |
| </p> |
| |
| <p> |
| To create a persistent UDA, which must be written in C++, issue a <codeph>CREATE AGGREGATE FUNCTION</codeph> statement: |
| </p> |
| |
| <codeblock>CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]) |
| RETURNS <varname>return_type</varname> |
| <ph rev="2.3.0 IMPALA-1829">[INTERMEDIATE <varname>type_spec</varname>]</ph> |
| LOCATION '<varname>hdfs_path</varname>' |
| [INIT_FN='<varname>function</varname>] |
| UPDATE_FN='<varname>function</varname> |
| MERGE_FN='<varname>function</varname> |
| [PREPARE_FN='<varname>function</varname>] |
| [CLOSEFN='<varname>function</varname>] |
| <ph rev="2.0.0">[SERIALIZE_FN='<varname>function</varname>]</ph> |
| [FINALIZE_FN='<varname>function</varname>]</codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/ddl_blurb"/> |
| |
| <p> |
| <b>Varargs notation:</b> |
| </p> |
| |
| <note rev=""> |
| <p rev=""> |
| Variable-length argument lists are supported for C++ UDFs, but currently not for Java UDFs. |
| </p> |
| </note> |
| |
| <p> |
| If the underlying implementation of your function accepts a variable number of arguments: |
| </p> |
| |
| <ul> |
| <li> |
| The variable arguments must go last in the argument list. |
| </li> |
| |
| <li> |
| The variable arguments must all be of the same type. |
| </li> |
| |
| <li> |
| You must include at least one instance of the variable arguments in every function call invoked from SQL. |
| </li> |
| |
| <li> |
| You designate the variable portion of the argument list in the <codeph>CREATE FUNCTION</codeph> statement |
| by including <codeph>...</codeph> immediately after the type name of the first variable argument. For |
| example, to create a function that accepts an <codeph>INT</codeph> argument, followed by a |
| <codeph>BOOLEAN</codeph>, followed by one or more <codeph>STRING</codeph> arguments, your <codeph>CREATE |
| FUNCTION</codeph> statement would look like: |
| <codeblock>CREATE FUNCTION <varname>func_name</varname> (INT, BOOLEAN, STRING ...) |
| RETURNS <varname>type</varname> LOCATION '<varname>path</varname>' SYMBOL='<varname>entry_point</varname>'; |
| </codeblock> |
| </li> |
| </ul> |
| |
| <p rev=""> |
| See <xref href="impala_udf.xml#udf_varargs"/> for how to code a C++ UDF to accept |
| variable-length argument lists. |
| </p> |
| |
| <p> |
| <b>Scalar and aggregate functions:</b> |
| </p> |
| |
| <p> |
| The simplest kind of user-defined function returns a single scalar value each time it is called, typically |
| once for each row in the result set. This general kind of function is what is usually meant by UDF. |
| User-defined aggregate functions (UDAs) are a specialized kind of UDF that produce a single value based on |
| the contents of multiple rows. You usually use UDAs in combination with a <codeph>GROUP BY</codeph> clause to |
| condense a large result set into a smaller one, or even a single row summarizing column values across an |
| entire table. |
| </p> |
| |
| <p> |
| You create UDAs by using the <codeph>CREATE AGGREGATE FUNCTION</codeph> syntax. The clauses |
| <codeph>INIT_FN</codeph>, <codeph>UPDATE_FN</codeph>, <codeph>MERGE_FN</codeph>, |
| <ph rev="2.0.0"><codeph>SERIALIZE_FN</codeph>,</ph> <codeph>FINALIZE_FN</codeph>, and |
| <codeph>INTERMEDIATE</codeph> only apply when you create a UDA rather than a scalar UDF. |
| </p> |
| |
| <p> |
| The <codeph>*_FN</codeph> clauses specify functions to call at different phases of function processing. |
| </p> |
| |
| <ul> |
| <li> |
| <b>Initialize:</b> The function you specify with the <codeph>INIT_FN</codeph> clause does any initial |
| setup, such as initializing member variables in internal data structures. This function is often a stub for |
| simple UDAs. You can omit this clause and a default (no-op) function will be used. |
| </li> |
| |
| <li> |
| <b>Update:</b> The function you specify with the <codeph>UPDATE_FN</codeph> clause is called once for each |
| row in the original result set, that is, before any <codeph>GROUP BY</codeph> clause is applied. A separate |
| instance of the function is called for each different value returned by the <codeph>GROUP BY</codeph> |
| clause. The final argument passed to this function is a pointer, to which you write an updated value based |
| on its original value and the value of the first argument. |
| </li> |
| |
| <li> |
| <b>Merge:</b> The function you specify with the <codeph>MERGE_FN</codeph> clause is called an arbitrary |
| number of times, to combine intermediate values produced by different nodes or different threads as Impala |
| reads and processes data files in parallel. The final argument passed to this function is a pointer, to |
| which you write an updated value based on its original value and the value of the first argument. |
| </li> |
| |
| <li rev="2.0.0"> |
| <b>Serialize:</b> The function you specify with the <codeph>SERIALIZE_FN</codeph> clause frees memory |
| allocated to intermediate results. It is required if any memory was allocated by the Allocate function in |
| the Init, Update, or Merge functions, or if the intermediate type contains any pointers. See |
| <xref keyref="uda-sample.cc">the UDA code samples</xref> for details. |
| </li> |
| |
| <li> |
| <b>Finalize:</b> The function you specify with the <codeph>FINALIZE_FN</codeph> clause does any required |
| teardown for resources acquired by your UDF, such as freeing memory, closing file handles if you explicitly |
| opened any files, and so on. This function is often a stub for simple UDAs. You can omit this clause and a |
| default (no-op) function will be used. It is required in UDAs where the final return type is different than |
| the intermediate type. or if any memory was allocated by the Allocate function in the Init, Update, or |
| Merge functions. See <xref keyref="uda-sample.cc">the UDA code samples</xref> for details. |
| </li> |
| </ul> |
| |
| <p> |
| If you use a consistent naming convention for each of the underlying functions, Impala can automatically |
| determine the names based on the first such clause, so the others are optional. |
| </p> |
| |
| <p> |
| For end-to-end examples of UDAs, see <xref href="impala_udf.xml#udfs"/>. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> |
| |
| <p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <ul> |
| <li> When authorization is enabled, the <codeph>CREATE FUNCTION</codeph> |
| statement requires:<ul> |
| <li>The <codeph>CREATE</codeph> privilege on the database.</li> |
| <li>The <codeph>ALL</codeph> privilege on URI where URI is the value |
| you specified for the <codeph>LOCATION</codeph> in the |
| <codeph>CREATE FUNCTION</codeph> statement. </li> |
| </ul></li> |
| <li> |
| You can write Impala UDFs in either C++ or Java. C++ UDFs are new to Impala, and are the recommended format |
| for high performance utilizing native code. Java-based UDFs are compatible between Impala and Hive, and are |
| most suited to reusing existing Hive UDFs. (Impala can run Java-based Hive UDFs but not Hive UDAs.) |
| </li> |
| |
| <li rev="2.5.0 IMPALA-1748 IMPALA-2843"> |
| <keyword keyref="impala25_full"/> introduces UDF improvements to persistence for both C++ and Java UDFs, |
| and better compatibility between Impala and Hive for Java UDFs. |
| See <xref href="impala_udf.xml#udfs"/> for details. |
| </li> |
| |
| <li> |
| The body of the UDF is represented by a <codeph>.so</codeph> or <codeph>.jar</codeph> file, which you store |
| in HDFS and the <codeph>CREATE FUNCTION</codeph> statement distributes to each Impala node. |
| </li> |
| |
| <li> |
| Impala calls the underlying code during SQL statement evaluation, as many times as needed to process all |
| the rows from the result set. All UDFs are assumed to be deterministic, that is, to always return the same |
| result when passed the same argument values. Impala might or might not skip some invocations of a UDF if |
| the result value is already known from a previous call. Therefore, do not rely on the UDF being called a |
| specific number of times, and do not return different result values based on some external factor such as |
| the current time, a random number function, or an external data source that could be updated while an |
| Impala query is in progress. |
| </li> |
| |
| <li> |
| The names of the function arguments in the UDF are not significant, only their number, positions, and data |
| types. |
| </li> |
| |
| <li> |
| You can overload the same function name by creating multiple versions of the function, each with a |
| different argument signature. For security reasons, you cannot make a UDF with the same name as any |
| built-in function. |
| </li> |
| |
| <li> |
| In the UDF code, you represent the function return result as a <codeph>struct</codeph>. This |
| <codeph>struct</codeph> contains 2 fields. The first field is a <codeph>boolean</codeph> representing |
| whether the value is <codeph>NULL</codeph> or not. (When this field is <codeph>true</codeph>, the return |
| value is interpreted as <codeph>NULL</codeph>.) The second field is the same type as the specified function |
| return type, and holds the return value when the function returns something other than |
| <codeph>NULL</codeph>. |
| </li> |
| |
| <li> |
| In the UDF code, you represent the function arguments as an initial pointer to a UDF context structure, |
| followed by references to zero or more <codeph>struct</codeph>s, corresponding to each of the arguments. |
| Each <codeph>struct</codeph> has the same 2 fields as with the return value, a <codeph>boolean</codeph> |
| field representing whether the argument is <codeph>NULL</codeph>, and a field of the appropriate type |
| holding any non-<codeph>NULL</codeph> argument value. |
| </li> |
| |
| <li> |
| For sample code and build instructions for UDFs, |
| see <xref keyref="udf_samples">the sample UDFs in the Impala github repo</xref>. |
| </li> |
| |
| <li> |
| Because the file representing the body of the UDF is stored in HDFS, it is automatically available to all |
| the Impala nodes. You do not need to manually copy any UDF-related files between servers. |
| </li> |
| |
| <li> |
| Because Impala currently does not have any <codeph>ALTER FUNCTION</codeph> statement, if you need to rename |
| a function, move it to a different database, or change its signature or other properties, issue a |
| <codeph>DROP FUNCTION</codeph> statement for the original function followed by a <codeph>CREATE |
| FUNCTION</codeph> with the desired properties. |
| </li> |
| |
| <li> |
| Because each UDF is associated with a particular database, either issue a <codeph>USE</codeph> statement |
| before doing any <codeph>CREATE FUNCTION</codeph> statements, or specify the name of the function as |
| <codeph><varname>db_name</varname>.<varname>function_name</varname></codeph>. |
| </li> |
| </ul> |
| |
| <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> |
| |
| <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> |
| |
| <p> |
| Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types |
| (not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but |
| not Impala UDFs written in C++. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/current_user_caveat"/> |
| |
| <p><b>Persistence:</b></p> |
| |
| <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> |
| |
| <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> |
| |
| <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| For additional examples of all kinds of user-defined functions, see <xref href="impala_udf.xml#udfs"/>. |
| </p> |
| |
| <p rev="2.5.0 IMPALA-2843"> |
| The following example shows how to take a Java jar file and make all the functions inside one of its classes |
| into UDFs under a single (overloaded) function name in Impala. Each <codeph>CREATE FUNCTION</codeph> or |
| <codeph>DROP FUNCTION</codeph> statement applies to all the overloaded Java functions with the same name. |
| This example uses the signatureless syntax for <codeph>CREATE FUNCTION</codeph> and <codeph>DROP FUNCTION</codeph>, |
| which is available in <keyword keyref="impala25_full"/> and higher. |
| </p> |
| <p rev="2.5.0 IMPALA-2843"> |
| At the start, the jar file is in the local filesystem. Then it is copied into HDFS, so that it is |
| available for Impala to reference through the <codeph>CREATE FUNCTION</codeph> statement and |
| queries that refer to the Impala function name. |
| </p> |
| <codeblock rev="2.5.0 IMPALA-2843"> |
| $ jar -tvf udf-examples.jar |
| 0 Mon Feb 22 04:06:50 PST 2016 META-INF/ |
| 122 Mon Feb 22 04:06:48 PST 2016 META-INF/MANIFEST.MF |
| 0 Mon Feb 22 04:06:46 PST 2016 org/ |
| 0 Mon Feb 22 04:06:46 PST 2016 org/apache/ |
| 0 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/ |
| 2460 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/IncompatibleUdfTest.class |
| 541 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdfException.class |
| 3438 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/JavaUdfTest.class |
| 5872 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdf.class |
| ... |
| $ hdfs dfs -put udf-examples.jar /user/impala/udfs |
| $ hdfs dfs -ls /user/impala/udfs |
| Found 2 items |
| -rw-r--r-- 3 jrussell supergroup 853 2015-10-09 14:05 /user/impala/udfs/hello_world.jar |
| -rw-r--r-- 3 jrussell supergroup 7366 2016-06-08 14:25 /user/impala/udfs/udf-examples.jar |
| </codeblock> |
| <p rev="2.5.0 IMPALA-2843"> |
| In <cmdname>impala-shell</cmdname>, the <codeph>CREATE FUNCTION</codeph> refers to the HDFS path of the jar file |
| and the fully qualified class name inside the jar. Each of the functions inside the class becomes an |
| Impala function, each one overloaded under the specified Impala function name. |
| </p> |
| <codeblock rev="2.5.0 IMPALA-2843"> |
| [localhost:21000] > create function testudf location '/user/impala/udfs/udf-examples.jar' symbol='org.apache.impala.TestUdf'; |
| [localhost:21000] > show functions; |
| +-------------+---------------------------------------+-------------+---------------+ |
| | return type | signature | binary type | is persistent | |
| +-------------+---------------------------------------+-------------+---------------+ |
| | BIGINT | testudf(BIGINT) | JAVA | true | |
| | BOOLEAN | testudf(BOOLEAN) | JAVA | true | |
| | BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | |
| | BOOLEAN | testudf(BOOLEAN, BOOLEAN, BOOLEAN) | JAVA | true | |
| | DOUBLE | testudf(DOUBLE) | JAVA | true | |
| | DOUBLE | testudf(DOUBLE, DOUBLE) | JAVA | true | |
| | DOUBLE | testudf(DOUBLE, DOUBLE, DOUBLE) | JAVA | true | |
| | FLOAT | testudf(FLOAT) | JAVA | true | |
| | FLOAT | testudf(FLOAT, FLOAT) | JAVA | true | |
| | FLOAT | testudf(FLOAT, FLOAT, FLOAT) | JAVA | true | |
| | INT | testudf(INT) | JAVA | true | |
| | DOUBLE | testudf(INT, DOUBLE) | JAVA | true | |
| | INT | testudf(INT, INT) | JAVA | true | |
| | INT | testudf(INT, INT, INT) | JAVA | true | |
| | SMALLINT | testudf(SMALLINT) | JAVA | true | |
| | SMALLINT | testudf(SMALLINT, SMALLINT) | JAVA | true | |
| | SMALLINT | testudf(SMALLINT, SMALLINT, SMALLINT) | JAVA | true | |
| | STRING | testudf(STRING) | JAVA | true | |
| | STRING | testudf(STRING, STRING) | JAVA | true | |
| | STRING | testudf(STRING, STRING, STRING) | JAVA | true | |
| | TINYINT | testudf(TINYINT) | JAVA | true | |
| +-------------+---------------------------------------+-------------+---------------+ |
| </codeblock> |
| <p rev="2.5.0 IMPALA-2843"> |
| These are all simple functions that return their single arguments, or |
| sum, concatenate, and so on their multiple arguments. Impala determines which |
| overloaded function to use based on the number and types of the arguments. |
| </p> |
| <codeblock rev="2.5.0 IMPALA-2843"> |
| insert into bigint_x values (1), (2), (4), (3); |
| select testudf(x) from bigint_x; |
| +-----------------+ |
| | udfs.testudf(x) | |
| +-----------------+ |
| | 1 | |
| | 2 | |
| | 4 | |
| | 3 | |
| +-----------------+ |
| |
| insert into int_x values (1), (2), (4), (3); |
| select testudf(x, x+1, x*x) from int_x; |
| +-------------------------------+ |
| | udfs.testudf(x, x + 1, x * x) | |
| +-------------------------------+ |
| | 4 | |
| | 9 | |
| | 25 | |
| | 16 | |
| +-------------------------------+ |
| |
| select testudf(x) from string_x; |
| +-----------------+ |
| | udfs.testudf(x) | |
| +-----------------+ |
| | one | |
| | two | |
| | four | |
| | three | |
| +-----------------+ |
| select testudf(x,x) from string_x; |
| +--------------------+ |
| | udfs.testudf(x, x) | |
| +--------------------+ |
| | oneone | |
| | twotwo | |
| | fourfour | |
| | threethree | |
| +--------------------+ |
| </codeblock> |
| |
| <p rev="2.5.0 IMPALA-2843"> |
| The previous example used the same Impala function name as the name of the class. |
| This example shows how the Impala function name is independent of the underlying |
| Java class or function names. A second <codeph>CREATE FUNCTION</codeph> statement |
| results in a set of overloaded functions all named <codeph>my_func</codeph>, |
| to go along with the overloaded functions all named <codeph>testudf</codeph>. |
| </p> |
| <codeblock rev="2.5.0 IMPALA-2843"> |
| create function my_func location '/user/impala/udfs/udf-examples.jar' |
| symbol='org.apache.impala.TestUdf'; |
| |
| show functions; |
| +-------------+---------------------------------------+-------------+---------------+ |
| | return type | signature | binary type | is persistent | |
| +-------------+---------------------------------------+-------------+---------------+ |
| | BIGINT | my_func(BIGINT) | JAVA | true | |
| | BOOLEAN | my_func(BOOLEAN) | JAVA | true | |
| | BOOLEAN | my_func(BOOLEAN, BOOLEAN) | JAVA | true | |
| ... |
| | BIGINT | testudf(BIGINT) | JAVA | true | |
| | BOOLEAN | testudf(BOOLEAN) | JAVA | true | |
| | BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | |
| ... |
| </codeblock> |
| <p rev="2.5.0 IMPALA-2843"> |
| The corresponding <codeph>DROP FUNCTION</codeph> statement with no signature |
| drops all the overloaded functions with that name. |
| </p> |
| <codeblock rev="2.5.0 IMPALA-2843"> |
| drop function my_func; |
| show functions; |
| +-------------+---------------------------------------+-------------+---------------+ |
| | return type | signature | binary type | is persistent | |
| +-------------+---------------------------------------+-------------+---------------+ |
| | BIGINT | testudf(BIGINT) | JAVA | true | |
| | BOOLEAN | testudf(BOOLEAN) | JAVA | true | |
| | BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | |
| ... |
| </codeblock> |
| <p rev="2.5.0 IMPALA-2843"> |
| The signatureless <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs ensures that |
| the functions shown in this example remain available after the Impala service |
| (specifically, the Catalog Server) are restarted. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_udf.xml#udfs"/> for more background information, usage instructions, and examples for |
| Impala UDFs; <xref href="impala_drop_function.xml#drop_function"/> |
| </p> |
| </conbody> |
| </concept> |