tajo-docs/src/main/sphinx/functions/python.rst - tajo - Git at Google

 ******************************
 Python Functions
 ******************************

 =======================
 User-defined Functions
 =======================

 -----------------------
 Function registration
 -----------------------

 To register Python UDFs, you must install script files in all cluster nodes.
 After that, you can register your functions by specifying the paths to those script files in ``tajo-site.xml``. Here is an example of the configuration.

 .. code-block:: xml

   <property>
     <name>tajo.function.python.code-dir</name>
     <value>/path/to/script1.py,/path/to/script2.py</value>
   </property>

 Please note that you can specify multiple paths with ``','`` as a delimiter. Each file can contain multiple functions. Here is a typical example of a script file.

 .. code-block:: python

   # /path/to/udf1.py

   @output_type('int4')
   def return_one():
     return 1

   @output_type("text")
   def helloworld():
     return 'Hello, World'

   # No decorator - blob
   def concat_py(str):
     return str+str

   @output_type('int4')
   def sum_py(a,b):
     return a+b

 If the configuration is set properly, every function in the script files are registered when the Tajo cluster starts up.

 -----------------------
 Decorators and types
 -----------------------

 By default, every function has a return type of ``BLOB``.
 You can use Python decorators to define output types for the script functions. Tajo can figure out return types from the annotations of the Python script.

 * ``output_type``: Defines the return data type for a script UDF in a format that Tajo can understand. The defined type must be one of the types supported by Tajo. For supported types, please refer to :doc:`/sql_language/data_model`.

 -----------------------
 Query example
 -----------------------

 Once the Python UDFs are successfully registered, you can use them as other built-in functions.

 .. code-block:: sql

   default> select concat_py(n_name)::text from nation where sum_py(n_regionkey,1) > 2;

 ==============================================
 User-defined Aggregation Functions
 ==============================================

 -----------------------
 Function registration
 -----------------------

 To define your Python aggregation functions, you should write Python classes for each function.
 Followings are typical examples of Python UDAFs.

 .. code-block:: python

   # /path/to/udaf1.py

   class AvgPy:
     sum = 0
     cnt = 0

     def __init__(self):
         self.reset()

     def reset(self):
         self.sum = 0
         self.cnt = 0

     # eval at the first stage
     def eval(self, item):
         self.sum += item
         self.cnt += 1

     # get intermediate result
     def get_partial_result(self):
         return [self.sum, self.cnt]

     # merge intermediate results
     def merge(self, list):
         self.sum += list[0]
         self.cnt += list[1]

     # get final result
     @output_type('float8')
     def get_final_result(self):
         return self.sum / float(self.cnt)


   class CountPy:
     cnt = 0

     def __init__(self):
         self.reset()

     def reset(self):
         self.cnt = 0

     # eval at the first stage
     def eval(self):
         self.cnt += 1

     # get intermediate result
     def get_partial_result(self):
         return self.cnt

     # merge intermediate results
     def merge(self, cnt):
         self.cnt += cnt

     # get final result
     @output_type('int4')
     def get_final_result(self):
         return self.cnt


 These classes must provide ``reset()``, ``eval()``, ``merge()``, ``get_partial_result()``, and ``get_final_result()`` functions.

 * ``reset()`` resets the aggregation state.
 * ``eval()`` evaluates input tuples in the first stage.
 * ``merge()`` merges intermediate results of the first stage.
 * ``get_partial_result()`` returns intermediate results of the first stage. Output type must be same with the input type of ``merge()``.
 * ``get_final_result()`` returns the final aggregation result.

 -----------------------
 Query example
 -----------------------

 Once the Python UDAFs are successfully registered, you can use them as other built-in aggregation functions.

 .. code-block:: sql

   default> select avgpy(n_nationkey), countpy() from nation;

 .. warning::

   Currently, Python UDAFs cannot be used as window functions.
	******************************
	Python Functions
	******************************

	=======================
	User-defined Functions
	=======================

	-----------------------
	Function registration
	-----------------------

	To register Python UDFs, you must install script files in all cluster nodes.
	After that, you can register your functions by specifying the paths to those script files in ``tajo-site.xml``. Here is an example of the configuration.

	.. code-block:: xml

	<property>
	<name>tajo.function.python.code-dir</name>
	<value>/path/to/script1.py,/path/to/script2.py</value>
	</property>

	Please note that you can specify multiple paths with ``','`` as a delimiter. Each file can contain multiple functions. Here is a typical example of a script file.

	.. code-block:: python

	# /path/to/udf1.py

	@output_type('int4')
	def return_one():
	return 1

	@output_type("text")
	def helloworld():
	return 'Hello, World'

	# No decorator - blob
	def concat_py(str):
	return str+str

	@output_type('int4')
	def sum_py(a,b):
	return a+b

	If the configuration is set properly, every function in the script files are registered when the Tajo cluster starts up.

	-----------------------
	Decorators and types
	-----------------------

	By default, every function has a return type of ``BLOB``.
	You can use Python decorators to define output types for the script functions. Tajo can figure out return types from the annotations of the Python script.

	* ``output_type``: Defines the return data type for a script UDF in a format that Tajo can understand. The defined type must be one of the types supported by Tajo. For supported types, please refer to :doc:`/sql_language/data_model`.

	-----------------------
	Query example
	-----------------------

	Once the Python UDFs are successfully registered, you can use them as other built-in functions.

	.. code-block:: sql

	default> select concat_py(n_name)::text from nation where sum_py(n_regionkey,1) > 2;

	==============================================
	User-defined Aggregation Functions
	==============================================

	-----------------------
	Function registration
	-----------------------

	To define your Python aggregation functions, you should write Python classes for each function.
	Followings are typical examples of Python UDAFs.

	.. code-block:: python

	# /path/to/udaf1.py

	class AvgPy:
	sum = 0
	cnt = 0

	def __init__(self):
	self.reset()

	def reset(self):
	self.sum = 0
	self.cnt = 0

	# eval at the first stage
	def eval(self, item):
	self.sum += item
	self.cnt += 1

	# get intermediate result
	def get_partial_result(self):
	return [self.sum, self.cnt]

	# merge intermediate results
	def merge(self, list):
	self.sum += list[0]
	self.cnt += list[1]

	# get final result
	@output_type('float8')
	def get_final_result(self):
	return self.sum / float(self.cnt)


	class CountPy:
	cnt = 0

	def __init__(self):
	self.reset()

	def reset(self):
	self.cnt = 0

	# eval at the first stage
	def eval(self):
	self.cnt += 1

	# get intermediate result
	def get_partial_result(self):
	return self.cnt

	# merge intermediate results
	def merge(self, cnt):
	self.cnt += cnt

	# get final result
	@output_type('int4')
	def get_final_result(self):
	return self.cnt


	These classes must provide ``reset()``, ``eval()``, ``merge()``, ``get_partial_result()``, and ``get_final_result()`` functions.

	* ``reset()`` resets the aggregation state.
	* ``eval()`` evaluates input tuples in the first stage.
	* ``merge()`` merges intermediate results of the first stage.
	* ``get_partial_result()`` returns intermediate results of the first stage. Output type must be same with the input type of ``merge()``.
	* ``get_final_result()`` returns the final aggregation result.

	-----------------------
	Query example
	-----------------------

	Once the Python UDAFs are successfully registered, you can use them as other built-in aggregation functions.

	.. code-block:: sql

	default> select avgpy(n_nationkey), countpy() from nation;

	.. warning::

	Currently, Python UDAFs cannot be used as window functions.