..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements. See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership. The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License. You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied. See the License for the
    specific language governing permissions and limitations
    under the License.

.. _tvm-target-specific-overview:

Device/Target Interactions
==========================

This document is intended for developers interested in understanding
how the TVM framework interacts with specific device APIs, or who
may want to implement support for a new API or new hardware.
There are three main aspects that must be implemented for any new
runtime environment.
* The :ref:`DeviceAPI <tvm-target-specific-device-api>` class gives a
handle to a specific device, and the API used to interact with it.
It defines a common interface for querying device parameters
(e.g. memory available, number of threads, etc.) and for performing
simple actions (e.g. copying memory from the host, or between
buffers on the device).
* The :ref:`Target <tvm-target-specific-target>` class contains a
description of the device on which a function will run. It is
exposed both to the target code generators and to the optimization
passes.
* The :ref:`target code generators <tvm-target-specific-codegen>`
construct a :ref:`Module <tvm-runtime-system-module>` consisting of
one or more :ref:`PackedFunc <tvm-runtime-system-packed-func>`, from
an IRModule.

.. _tvm-target-specific-device-api:

DeviceAPI
---------

The ``DeviceAPI`` represents a handle to a specific hardware device
API. (e.g. ``CUDADeviceAPI`` handles all interactions through the
CUDA framework.) Most ``DeviceAPI`` methods accept a ``device_id``
parameter to specify which device should be accessed. In Python,
these are typically accessed using the :py:func:`tvm.runtime.device`
function, which returns a handle to a specific device, accessed
through a specific API. (e.g. ``tvm.runtime.device('cuda',0)`` gives
access to physical device ``0``, accessed through the CUDA API.)
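
The same resolution can be done from C++ through the ``DeviceAPI``
registry. The snippet below is a minimal sketch, assuming only the
static ``DeviceAPI::Get`` helper declared in `device_api.h`_::

    // Illustrative only: resolving a device handle to its DeviceAPI in C++.
    // This mirrors tvm.runtime.device('cuda', 0) on the Python side.
    #include <tvm/runtime/device_api.h>

    void Example() {
      DLDevice dev{kDLCUDA, 0};  // physical device 0, accessed through the CUDA API
      tvm::runtime::DeviceAPI* api = tvm::runtime::DeviceAPI::Get(dev);
      // All subsequent queries and memory operations for this device go through `api`.
    }
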
.. _device_api.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/device_api.h
* Attribute queries - ``GetAttr`` allows different
device-specific parameters to be queried, such as the device name,
number of threads, etc. The parameters that can be queried are
defined in ``enum DeviceAttrKind`` in `device_api.h`_. Not all
query-able parameters are supported by all devices. If a parameter
cannot be queried (e.g. ``kMaxClockRate`` on Vulkan), or if a
parameter isn't applicable (e.g. ``kWarpSize`` on CPU), then those
queries should return ``nullptr``.
* Setting active device - ``SetDevice`` should set a
particular device as being active. If a ``PackedFunc`` generated by
the target-specific code gen requires execution on a device, it
should run on the active device.
* Memory management - Utilities for allocating and deallocating memory
on the device.
* Allocate data space - ``AllocDataSpace`` and ``FreeDataSpace``
allocate and free space on the device. These allocations can be
provided as inputs and outputs to an operator and make up the
primary data flow of the operator graph. It must be possible to
transfer data between the host and a data space. The return
value is an opaque ``void*``. While some implementations return a
memory address, this is not required, and the ``void*`` may be an
opaque handle that is interpretable only by the device backend
that generated it. The ``void*`` is used as an argument to other
backend-specific functions, such as ``CopyDataFromTo``.
* Allocate work space - ``AllocWorkspace`` and ``FreeWorkspace``
allocate and free space on the device. Unlike data space, these
are used for storage of intermediate values within an operator
definition, and are not required to be transferable to/from the
host device. If a ``DeviceAPI`` subclass does not implement these
methods, they will default to calling the corresponding
``DataSpace`` functions.
* Copy data - ``CopyDataFromTo`` should copy data from one location
to another. The type of copy is determined by the ``dev_from``
and ``dev_to`` parameters. Implementations should support copying
memory from CPU to device, from device to CPU, and from one buffer
to another on a single device. If the source or destination
locations are on the CPU, the corresponding ``void*`` points to a
CPU address that can be passed into ``memcpy``. If the source or
destination locations are on the device, the corresponding
``void*`` was previously generated by either ``AllocDataSpace`` or
``AllocWorkspace``.
These copies are queued to execute on a specific
``TVMStreamHandle``. However, implementations should not assume
that CPU buffers remain valid or accessible after the call to
``CopyDataFromTo`` completes.
* Execution stream management - Utilities for handling
``TVMStreamHandle``, which represents parallel streams of execution
used to execute commands.
* Create stream - ``CreateStream`` and ``FreeStream`` should
allocate/free a handle to a stream of execution. If a device
implements only a single queue of commands, then ``CreateStream``
should return ``nullptr``.
* Set active stream - ``SetStream`` should set a stream as being
active. While active, if a ``PackedFunc`` generated by the
target-specific code gen requires execution on a device, the work
should be submitted to the active stream.
* Synchronize to CPU - ``StreamSync`` should synchronize a stream of
execution to the CPU. The call to ``StreamSync`` should return
once all memory transfers and computations submitted prior to the
``StreamSync`` call have completed.
* Synchronize between streams - ``SyncStreamFromTo`` should
introduce a synchronization barrier between the source and
destination stream. That is, the destination stream may not
proceed beyond commands currently queued until the source stream
has completed all commands that are currently queued.
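
Taken together, a new backend is usually implemented as a single
``DeviceAPI`` subclass that overrides the methods above. The skeleton
below is a sketch only: ``FooDeviceAPI`` is a hypothetical backend, and
the method signatures are paraphrased from `device_api.h`_, which should
be treated as the authoritative source for the virtual interface::

    // Sketch of a hypothetical backend.  Signatures are paraphrased; check
    // device_api.h for the exact virtual methods before implementing a backend.
    #include <tvm/runtime/device_api.h>

    namespace tvm {
    namespace runtime {

    class FooDeviceAPI final : public DeviceAPI {
     public:
      void SetDevice(Device dev) final {
        // Make dev.device_id the active device for subsequent calls.
      }

      void GetAttr(Device dev, DeviceAttrKind kind, TVMRetValue* rv) final {
        // Fill *rv for supported queries; leave it unset for unsupported ones.
      }

      void* AllocDataSpace(Device dev, size_t nbytes, size_t alignment,
                           DLDataType type_hint) final {
        // Return an opaque handle that only this backend needs to understand.
        return nullptr;
      }

      void FreeDataSpace(Device dev, void* ptr) final {}

      void StreamSync(Device dev, TVMStreamHandle stream) final {
        // Block until all work submitted to `stream` has completed.
      }

     protected:
      void CopyDataFromTo(const void* from, size_t from_offset, void* to,
                          size_t to_offset, size_t nbytes, Device dev_from,
                          Device dev_to, DLDataType type_hint,
                          TVMStreamHandle stream) final {
        // Dispatch on dev_from/dev_to: host-to-device, device-to-host,
        // or device-to-device within this backend.
      }
    };

    }  // namespace runtime
    }  // namespace tvm
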
In order to be usable by the TVM framework, the new DeviceAPI should
then be registered with the following steps.
#. Create a function that instantiates the new DeviceAPI, and returns
a pointer to it::

     FooDeviceAPI* FooDeviceAPI::Global() {
       static FooDeviceAPI inst;
       return &inst;
     }

#. Register the function to the tvm registry::

     TVM_FFI_STATIC_INIT_BLOCK() {
       namespace refl = tvm::ffi::reflection;
       refl::GlobalDef().def("device_api.foo", FooDeviceAPI::Global);
     }

.. _base.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/base.h
#. Add an entry for the new DeviceAPI to the ``TVMDeviceExtType`` enum
in `base.h`_. The value should be an unused value greater
than ``DLDeviceType::kDLExtDev``, but less than
``DeviceAPIManager::kMaxDeviceAPI``.
#. Add a case in ``DeviceName`` in `device_api.h`_ to convert from the
enum value to a string representation. This string representation
should match the name given to ``GlobalDef().def``.
#. Add entries to the ``_DEVICE_TYPE_TO_NAME`` and ``_DEVICE_NAME_TO_TYPE`` dictionaries of
:py:class:`tvm.runtime.Device` for the new enum value.

.. _tvm-target-specific-target:

Target Definition
-----------------

The ``Target`` object is a lookup table of properties about a physical
device, its hardware/driver limits, and its capabilities. The
``Target`` is accessible both during optimization and code generation
stages. While the same ``Target`` class is used for all runtime
targets, each runtime target may need to add target-specific options.
.. _target_kind.cc: https://github.com/apache/tvm/blob/main/src/target/target_kind.cc
In `target_kind.cc`_, add a new declaration of
``TVM_REGISTER_TARGET_KIND``, passing a string name of the new target,
and the ``TVMDeviceExtType`` or ``DLDeviceType`` enum value for the
device on which that target should run. Typically, the target name
and the device name will match. (e.g. The ``"cuda"`` target runs on
the ``kDLCUDA`` device.) There are exceptions, such as when multiple
different code generation targets can run on the same physical device.
(e.g. The ``"llvm"`` and ``"c"`` targets both run on the ``kDLCPU``
device type.)
All options for a specific target kind are added with the
``add_attr_option`` function, with optional default values. A ``Target``
parser can be added with ``set_target_parser`` to process
any parameters that are dynamically based on other parameters or
queried from device properties.
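
As a sketch, a hypothetical ``"foo"`` target kind running on the CPU
device type could be registered as follows. The option names, types, and
defaults here are placeholders; the existing entries in `target_kind.cc`_
show the options used by real targets::

    // Illustrative registration of a hypothetical "foo" target kind.
    TVM_REGISTER_TARGET_KIND("foo", kDLCPU)
        .add_attr_option<String>("mattr")                            // illustrative option
        .add_attr_option<Integer>("max_num_threads", Integer(256))   // option with a default
        .set_default_keys({"cpu"})
        // Reuse the generic CPU parser to derive parameters from other
        // parameters or from device properties (an illustrative choice).
        .set_target_parser(tvm::target::parsers::cpu::ParseTarget);
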
These attribute definitions also determine how a string description of a
target is parsed. Parsing is handled by the ``Target::Target(const
String&)`` constructor in C++, which accepts a JSON-formatted string and
is typically invoked through the :py:class:`tvm.target.Target` Python
object. For example, ``tvm.target.Target('{"kind": "cuda",
"max_num_threads": 1024}')`` will create a ``cuda`` target, while
overriding the default maximum number of threads.
In a code generator, the target properties can be accessed using
``target->GetAttr<T>(param_name)`` in C++, or with the
``target.attrs`` dictionary in Python.
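
For example, a code generator that needs the thread limit of a
``cuda``-style target could read it roughly as follows (a sketch;
``max_num_threads`` is a standard option on GPU targets, and the
fallback value here is arbitrary)::

    #include <tvm/target/target.h>

    void ConfigureLaunchBounds(const tvm::Target& target) {
      // GetAttr returns an Optional, so provide a fallback for when the
      // option was not set on this target.
      tvm::Integer max_threads =
          target->GetAttr<tvm::Integer>("max_num_threads").value_or(tvm::Integer(1024));
      // ... use max_threads when emitting kernel launch configuration ...
    }
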

.. _tvm-target-specific-codegen:

Target Code Generators
----------------------

The code generators take an optimized ``IRModule`` and convert it
into an executable representation. Each code generator must be
registered in order to be used by the TVM framework. This is done by
registering a function named ``"target.build.foo"``, where ``foo`` is
the same name as was used in the ``TVM_REGISTER_TARGET_KIND``
definition above. ::

   tvm::runtime::Module GeneratorFooCode(IRModule mod, Target target);

   TVM_FFI_STATIC_INIT_BLOCK() {
     namespace refl = tvm::ffi::reflection;
     refl::GlobalDef().def("target.build.foo", GeneratorFooCode);
   }

The code generator takes two arguments. The first is the ``IRModule``
to compile, and the second is the ``Target`` that describes the device
on which the code should run. Because the environment performing the
compilation is not necessarily the same as the environment that will
be executing the code, code generators should not perform any
attribute lookups on the device itself, and should instead access
parameters stored in the ``Target``.
Each function in the input ``IRModule`` should be accessible by name
in the output ``runtime::Module``.
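
As a rough outline, such a build function typically walks the functions
in the ``IRModule``, emits code for each one, and packages the result so
that every function can be looked up by its global symbol. In the sketch
below, ``FooEmit`` and ``FooModuleCreate`` are hypothetical helpers
standing in for a real backend's code emission and module packaging::

    #include <tvm/ir/module.h>
    #include <tvm/runtime/module.h>
    #include <tvm/target/target.h>
    #include <tvm/tir/function.h>

    // Hypothetical helpers: a real backend would emit device code here and
    // wrap it in a runtime::Module subclass.
    void FooEmit(const tvm::tir::PrimFuncNode* func, tvm::String symbol, tvm::Target target);
    tvm::runtime::Module FooModuleCreate();

    tvm::runtime::Module GeneratorFooCode(tvm::IRModule mod, tvm::Target target) {
      for (const auto& kv : mod->functions) {
        if (const auto* prim_func = kv.second.as<tvm::tir::PrimFuncNode>()) {
          // The generated code must later be retrievable under this symbol name.
          auto symbol = prim_func->GetAttr<tvm::String>(tvm::attr::kGlobalSymbol);
          FooEmit(prim_func, symbol.value(), target);
        }
      }
      return FooModuleCreate();
    }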