blob: 83eb4e3f91021c1c69100151444d39192864a35e [file] [log] [blame]
******************************
Python Functions
******************************
=======================
User-defined Functions
=======================
-----------------------
Function registration
-----------------------
To register Python UDFs, you must install script files in all cluster nodes.
After that, you can register your functions by specifying the paths to those script files in ``tajo-site.xml``. Here is an example of the configuration.
.. code-block:: xml
<property>
<name>tajo.function.python.code-dir</name>
<value>/path/to/script1.py,/path/to/script2.py</value>
</property>
Please note that you can specify multiple paths with ``','`` as a delimiter. Each file can contain multiple functions. Here is a typical example of a script file.
.. code-block:: python
# /path/to/udf1.py
@output_type('int4')
def return_one():
return 1
@output_type("text")
def helloworld():
return 'Hello, World'
# No decorator - blob
def concat_py(str):
return str+str
@output_type('int4')
def sum_py(a,b):
return a+b
If the configuration is set properly, every function in the script files are registered when the Tajo cluster starts up.
-----------------------
Decorators and types
-----------------------
By default, every function has a return type of ``BLOB``.
You can use Python decorators to define output types for the script functions. Tajo can figure out return types from the annotations of the Python script.
* ``output_type``: Defines the return data type for a script UDF in a format that Tajo can understand. The defined type must be one of the types supported by Tajo. For supported types, please refer to :doc:`/sql_language/data_model`.
-----------------------
Query example
-----------------------
Once the Python UDFs are successfully registered, you can use them as other built-in functions.
.. code-block:: sql
default> select concat_py(n_name)::text from nation where sum_py(n_regionkey,1) > 2;
==============================================
User-defined Aggregation Functions
==============================================
-----------------------
Function registration
-----------------------
To define your Python aggregation functions, you should write Python classes for each function.
Followings are typical examples of Python UDAFs.
.. code-block:: python
# /path/to/udaf1.py
class AvgPy:
sum = 0
cnt = 0
def __init__(self):
self.reset()
def reset(self):
self.sum = 0
self.cnt = 0
# eval at the first stage
def eval(self, item):
self.sum += item
self.cnt += 1
# get intermediate result
def get_partial_result(self):
return [self.sum, self.cnt]
# merge intermediate results
def merge(self, list):
self.sum += list[0]
self.cnt += list[1]
# get final result
@output_type('float8')
def get_final_result(self):
return self.sum / float(self.cnt)
class CountPy:
cnt = 0
def __init__(self):
self.reset()
def reset(self):
self.cnt = 0
# eval at the first stage
def eval(self):
self.cnt += 1
# get intermediate result
def get_partial_result(self):
return self.cnt
# merge intermediate results
def merge(self, cnt):
self.cnt += cnt
# get final result
@output_type('int4')
def get_final_result(self):
return self.cnt
These classes must provide ``reset()``, ``eval()``, ``merge()``, ``get_partial_result()``, and ``get_final_result()`` functions.
* ``reset()`` resets the aggregation state.
* ``eval()`` evaluates input tuples in the first stage.
* ``merge()`` merges intermediate results of the first stage.
* ``get_partial_result()`` returns intermediate results of the first stage. Output type must be same with the input type of ``merge()``.
* ``get_final_result()`` returns the final aggregation result.
-----------------------
Query example
-----------------------
Once the Python UDAFs are successfully registered, you can use them as other built-in aggregation functions.
.. code-block:: sql
default> select avgpy(n_nationkey), countpy() from nation;
.. warning::
Currently, Python UDAFs cannot be used as window functions.