
 .. Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

 Concepts
 ========

 This section introduces the core concepts of *PyDolphinScheduler*.

 Process Definition
 ------------------

 A process definition describes everything about a workflow except its `tasks`_ and `tasks dependence`_,
 including the name, schedule interval, and schedule start time and end time. Scheduling is covered in the
 `Schedule`_ section.

 A process definition can be initialized with a normal assignment statement or with a context manager.

 .. code-block:: python

    from pydolphinscheduler.core.process_definition import ProcessDefinition

    # Initialization with an assignment statement
    pd = ProcessDefinition(name="my first process definition")

    # Or with a context manager
    with ProcessDefinition(name="my first process definition") as pd:
        pd.submit()

 The process definition is the main object through which *PyDolphinScheduler* communicates with the
 DolphinScheduler daemon. After the process definition and its tasks are declared, you can use `submit` or
 `run` to send your definition to the server.

 If you only want to submit your definition and create the workflow, without running it, use the method `submit`.
 If you also want to run the workflow after submitting it, use the method `run`.

 .. code-block:: python

    # Only submit the definition, without running it
    pd.submit()

    # Submit the definition and run it
    pd.run()

 Schedule
 ~~~~~~~~

 The parameter `schedule` determines the schedule interval of the workflow. *PyDolphinScheduler* supports
 seven-field crontab expressions, and the meaning of each position is shown below:

 .. code-block:: text

     * * * * * * *
     ┬ ┬ ┬ ┬ ┬ ┬ ┬
     │ │ │ │ │ │ │
     │ │ │ │ │ │ └─── year
     │ │ │ │ │ └───── day of week (0 - 7) (0 to 6 are Sunday to Saturday, or use names; 7 is Sunday, the same as 0)
     │ │ │ │ └─────── month (1 - 12)
     │ │ │ └───────── day of month (1 - 31)
     │ │ └─────────── hour (0 - 23)
     │ └───────────── min (0 - 59)
     └─────────────── second (0 - 59)

 Here are some example crontab expressions:

 - `0 0 0 * * ? *`: The workflow executes every day at 00:00:00.
 - `10 2 * * * ? *`: The workflow executes every hour, at two minutes and ten seconds past the hour.
 - `10,11 20 0 1,2 * ? *`: The workflow executes on the first and second day of each month at 00:20:10 and 00:20:11.
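
 To make the position diagram concrete, here is a minimal sketch that splits a seven-field expression into
 named fields. The helper `split_schedule` and the constant `CRON_FIELDS` are hypothetical names used only for
 illustration; they are not part of *PyDolphinScheduler*, and the sketch does not validate the field values.

```python
# Field names follow the position diagram, from left to right.
CRON_FIELDS = (
    "second",
    "minute",
    "hour",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
)


def split_schedule(expression):
    """Split a seven-field crontab expression into a dict keyed by field name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


fields = split_schedule("10,11 20 0 1,2 * ? *")
# Seconds 10 and 11, minute 20, hour 0, on days 1 and 2 of the month.
print(fields["second"], fields["minute"], fields["hour"], fields["day_of_month"])
```

 Reading the fields by name makes it easier to check that the seconds come first, which is a common source of
 mistakes for readers used to five-field crontab expressions.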

 Tenant
 ~~~~~~

 The tenant is the user who runs the task command on the machine or virtual machine. It can be assigned as a simple string.

 .. code-block:: python

    # Assign the tenant by name, as a simple string
    pd = ProcessDefinition(name="process definition tenant", tenant="tenant_exists")

 .. note::

    Make sure the tenant exists on the target machine; otherwise an error is raised when you try to run the command.
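
 As a rough way to check this up front, the sketch below looks up a user name with the standard library `pwd`
 module. It assumes the tenant corresponds to an operating system user on a Unix machine and that the check
 runs on the target machine itself. The helper name `os_user_exists` is hypothetical and is not part of
 *PyDolphinScheduler*.

```python
import pwd  # Standard library, available on Unix only.


def os_user_exists(username):
    """Return True if `username` is a known user on this machine."""
    try:
        pwd.getpwnam(username)
    except KeyError:
        # getpwnam raises KeyError when no such user entry exists.
        return False
    return True


# "tenant_exists" is the tenant name used in the example for this section.
print(os_user_exists("tenant_exists"))
```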

 Tasks
 -----

 A task is the smallest unit that runs an actual job, and it is a node of the DAG (directed acyclic graph).
 You define what you want to do in the task. Each task has some required parameters that identify and define it.

 Here we use :py:meth:`pydolphinscheduler.tasks.Shell` as an example. The parameters `name` and `command` are
 required and must be provided. Parameter `name` sets the name of the task, and parameter `command` declares
 the command you wish to run in this task.

 .. code-block:: python

    # Name this task "shell" and have it run the command `echo shell task`
    shell_task = Shell(name="shell", command="echo shell task")

 To see all task types, refer to :doc:`tasks/index`.

 Tasks Dependence
 ~~~~~~~~~~~~~~~~

 You can define many tasks in a single `Process Definition`_. If all of those tasks run in parallel, you can
 leave them as they are without adding any additional information. But if some tasks should not run until an
 earlier task in the workflow has finished, you need to set a dependence between them. There are two main ways
 to set task dependence, and both are easy. You can use the bitwise operators `>>` and `<<`, or the task
 methods `set_downstream` and `set_upstream`.

 .. code-block:: python

    # Set task1 as upstream of task2
    task1 >> task2
    # The method `set_downstream` does the same as `task1 >> task2`
    task1.set_downstream(task2)

    # Set task1 as downstream of task2
    task1 << task2
    # The method `set_upstream` does the same as `task1 << task2`
    task1.set_upstream(task2)

    # You can also set dependence between a task and a sequence of tasks.
    # Here `task1` is upstream of both `task2` and `task3`. This is useful
    # when several tasks share the same dependence.
    task1 >> [task2, task3]
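
 To show how operators like these can map onto `set_downstream` and `set_upstream`, here is a minimal,
 self-contained sketch. `ToyTask` is a hypothetical stand-in written for this example. It is not the
 *PyDolphinScheduler* implementation, and the real task classes may behave differently in detail.

```python
class ToyTask:
    """Hypothetical stand-in for a task; not the PyDolphinScheduler class."""

    def __init__(self, name):
        self.name = name
        self.upstream = set()
        self.downstream = set()

    @staticmethod
    def _as_list(other):
        # Accept a single task or a sequence of tasks.
        return list(other) if isinstance(other, (list, tuple)) else [other]

    def set_downstream(self, other):
        for task in self._as_list(other):
            self.downstream.add(task)
            task.upstream.add(self)

    def set_upstream(self, other):
        for task in self._as_list(other):
            self.upstream.add(task)
            task.downstream.add(self)

    def __rshift__(self, other):
        # task1 >> task2: task1 runs before task2.
        self.set_downstream(other)
        return other

    def __lshift__(self, other):
        # task1 << task2: task1 runs after task2.
        self.set_upstream(other)
        return other


task1, task2, task3 = ToyTask("task1"), ToyTask("task2"), ToyTask("task3")
task1 >> [task2, task3]
print(sorted(task.name for task in task1.downstream))
```

 In this sketch each relation is recorded on both sides, so a task's `upstream` and `downstream` sets always
 agree with each other regardless of which operator or method was used.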

 Task With Process Definition
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 In most data orchestration cases, you should assign the attribute `process_definition` to a task instance to
 decide which workflow the task belongs to. You can set `process_definition` either with a normal assignment
 or in context manager mode.

 .. code-block:: python

    # Normal assignment: declare the `ProcessDefinition` instance explicitly and pass it to the task
    pd = ProcessDefinition(name="my first process definition")
    shell_task = Shell(name="shell", command="echo shell task", process_definition=pd)

    # Context manager: the `ProcessDefinition` instance pd is implicitly assigned to the task
    with ProcessDefinition(name="my first process definition") as pd:
        shell_task = Shell(name="shell", command="echo shell task")

 With `Process Definition`_, `Tasks`_, and `Tasks Dependence`_ together, you can build a workflow with multiple tasks.
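
 The implicit assignment in context manager mode can be pictured with the sketch below. It shows one common
 pattern: the `with` block records the active definition, and a task created without an explicit
 `process_definition` picks it up. `ToyProcessDefinition` and `ToyShell` are hypothetical stand-ins written
 for this example, not the *PyDolphinScheduler* classes, and the sketch does not handle nested `with` blocks.

```python
class ToyProcessDefinition:
    """Hypothetical stand-in; not the PyDolphinScheduler class."""

    _current = None  # The definition whose `with` block is active, if any.

    def __init__(self, name):
        self.name = name
        self.tasks = []

    def __enter__(self):
        ToyProcessDefinition._current = self
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        ToyProcessDefinition._current = None
        return False  # Do not suppress exceptions raised inside the block.


class ToyShell:
    """Hypothetical stand-in for a shell task."""

    def __init__(self, name, command, process_definition=None):
        self.name = name
        self.command = command
        # Fall back to the active definition when none is passed explicitly.
        if process_definition is None:
            process_definition = ToyProcessDefinition._current
        self.process_definition = process_definition
        if process_definition is not None:
            process_definition.tasks.append(self)


with ToyProcessDefinition(name="my first process definition") as pd:
    shell_task = ToyShell(name="shell", command="echo shell task")

print(shell_task.process_definition is pd)
```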