======
⚡ Ray
======

This page describes how to get started with Ray and Hamilton in five minutes.
If you're eager to just see code, jump right ahead to a full hello world with Ray
`here <https://github.com/stitchfix/hamilton/tree/main/examples/ray>`_.

.. image:: ../_static/Hamilton_ray_post_image.png

For those unfamiliar with `Ray <https://ray.io/>`_, it is an open source framework for scaling Python applications
that came out of `UC Berkeley <https://rise.cs.berkeley.edu/projects/ray/>`_. It has a growing ecosystem of tooling
that helps with many machine learning related workflows. For example, it promises to let you scale from
your laptop to a cluster very easily, without having to change much code. In terms of real world use, we like to use
Ray as a very quick way to implement `multiprocessing in python <https://machinelearningmastery.com/multiprocessing-in-python/>`_
without worrying about the details!

Ray Primer
----------

Here is a quick Ray primer. The good news is that you don’t have to know much about Ray to use it with Hamilton. You just
need to know that it will easily parallelize your workflow over multiple CPU cores, and allow you to scale beyond your
laptop if you have a Ray cluster set up. But just so you understand how Hamilton connects with it, let’s quickly go over
how one would use Ray directly.

Ray Usage Premise
#################

The basic premise of Ray is that you annotate the functions you want scheduled for execution
via Ray. For example (from `their documentation <https://docs.ray.io/en/latest/ray-core/tasks.html#ray-remote-functions>`_):

.. code-block:: python

    import ray

    # This is a regular Python function.
    def normal_function():
        return 1

    # By adding the `@ray.remote` decorator, a regular Python function
    # becomes a Ray remote function.
    @ray.remote
    def my_ray_function():
        return 1

Then to execute the ``my_ray_function`` function, you would do:

.. code-block:: python

    my_ray_function.remote()

This tells Ray to schedule the function for execution; ``.remote()`` immediately returns an object
reference, and calling ``ray.get`` on that reference blocks until the computed value is available.
To run locally versus on a cluster, all you have to do is instantiate Ray differently before calling
the above code.

.. code-block:: python

    import ray

    ray.init()  # local execution over multiple cores
    ray.init(address=...)  # or instead, connect to an existing cluster

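
If you want a feel for this submit-then-fetch pattern without installing Ray, the standard library’s
``concurrent.futures`` offers a rough analogy: ``submit`` plays the role of ``.remote()``, and
``Future.result()`` plays the role of fetching a value with ``ray.get``. This is an illustrative
sketch only, not Ray code:

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    def my_function():
        return 1

    # submit() returns a future immediately, much like .remote() returns
    # an object reference; result() blocks until the value is ready,
    # much like ray.get.
    with ThreadPoolExecutor() as pool:
        future = pool.submit(my_function)
        value = future.result()
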
Now, 🤔, you might be wondering: this seems like a lot of work to get my existing code to run. How do I pass
parameters to my functions? How should I restructure my application to make better use of Ray? Etc. Good news!
You don’t have to think about any of that with Hamilton! You just write your standard Hamilton functions, and only
change some “_driver_” code to make it run on Ray. More on that in the next section.

Hamilton + Ray
--------------

To use Ray with Hamilton, you first need to install it:

.. code-block:: bash

    pip install "sf-hamilton[ray]"

Next, recall that with Hamilton, all your logic is written as Python functions. So you write your Hamilton
functions as you normally would.

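
For instance, a hypothetical module of Hamilton functions might look like the following (the names are
illustrative; the key point is that each function’s parameter names refer to other functions or to
config inputs, which is how Hamilton builds the DAG):

.. code-block:: python

    # my_functions.py -- hypothetical Hamilton functions.
    # Each parameter name refers to another function (node) or a config
    # input, which is how Hamilton wires these into a DAG.

    def doubled(input_value: int) -> int:
        # Depends on the config/input key "input_value".
        return input_value * 2

    def incremented(input_value: int) -> int:
        # Also depends on "input_value"; independent of "doubled",
        # so Ray could compute these two nodes in parallel.
        return input_value + 1

    def summed(doubled: int, incremented: int) -> int:
        # Depends on the two nodes above.
        return doubled + incremented
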
At the framework level, Hamilton can easily inject ``@ray.remote`` for every single function in the directed
acyclic graph (DAG) your functions define. That is, `you don’t have to change any of your Hamilton code to make use of
Ray!` All you need to do to make Hamilton run on Ray is provide a `“GraphAdapter”` object to the Hamilton `“Driver”`
class you instantiate.

A GraphAdapter is just a simple class that defines a few functions that let you augment how your DAG is
walked and executed. See :doc:`../reference/api-reference/available-graph-adapters` for more information.

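
To build some intuition, here is a hypothetical sketch of the shape of such an adapter. The method names
are illustrative only, not Hamilton’s actual interface (see the API reference for that):

.. code-block:: python

    # Hypothetical sketch of what a graph adapter does; method names
    # are illustrative, not Hamilton's real API.
    class SketchGraphAdapter:
        def execute_node(self, fn, kwargs: dict):
            # A Ray-backed adapter would wrap fn for remote execution
            # here; this sketch just calls it directly.
            return fn(**kwargs)

        def build_result(self, **outputs):
            # Assemble the requested outputs into the final object,
            # e.g. a pandas DataFrame; here, a plain dict.
            return dict(outputs)

    adapter = SketchGraphAdapter()
    value = adapter.execute_node(lambda x: x + 1, {"x": 41})
    result = adapter.build_result(answer=value)
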
In terms of code to add/change, here’s what’s required to augment standard Hamilton driver code (see the
**numbered comments**):

.. code-block:: python

    import ray

    from hamilton import base, driver
    from hamilton.experimental import h_ray

    ...

    ray.init()  # (1) instantiate Ray
    config = {...}  # instantiate your config
    modules = [...]  # provide modules where your Hamilton functions live
    rga = h_ray.RayGraphAdapter(  # (2) object to tell Hamilton to run on Ray
        result_builder=base.PandasDataFrameResult())  # (3) says we want a DF as a result
    dr = driver.Driver(config, *modules, adapter=rga)  # (4) tell Hamilton
    df = dr.execute([...])
    ray.shutdown()  # (5) shut down Ray / our connection to it

Note: no change to Hamilton functions needs to take place.

Let's walk through the numbered code comments:

#. Instantiates Ray. This is where we would provide cluster information; otherwise this just spins up Ray locally.
#. We instantiate a ``RayGraphAdapter``. This object tells Hamilton to do a few special things to execute the DAG.
#. We specify what object we want returned from execution. Here we want a pandas data frame, though it could be any type of Python object; the other common choice is ``base.DictResult()``.
#. We pass the graph adapter as a keyword argument to the Driver constructor.
#. We shut down Ray when finished.

Ray Workflows
#############

The Ray Hamilton integration also supports `Ray Workflows <https://docs.ray.io/en/latest/workflows/concepts.html>`_.
To use them, you just need to replace the graph adapter instantiation with this:

.. code-block:: python

    rga = h_ray.RayWorkflowGraphAdapter(
        result_builder=base.PandasDataFrameResult(),
        # Ray will resume a run if possible based on workflow id,
        # so change this to suit your needs.
        workflow_id="hello-world-123",
    )

Ray Workflows require a ``workflow_id`` argument, so be sure to look into the
`Ray Workflows documentation <https://docs.ray.io/en/latest/workflows/concepts.html>`_ for best practices there.

It’s that simple!
#################

To summarize, the recipe for using Ray with Hamilton doesn’t change much from using plain Hamilton:

#. Install Hamilton + Ray: ``pip install "sf-hamilton[ray]"``.
#. Write Hamilton functions.
#. Write your driver code, adjusting this part if you want it to run on Ray.

Since it’s so easy to switch to using Ray or not, we’d love some benchmarks/anecdotes to see how much switching to Ray
improves the speed or scale at which you can operate your dataflow!

For a full “Ray Hello World” code sample, we direct you to the `examples directory here <https://github.com/stitchfix/hamilton/tree/main/examples/ray/hello_world>`_.

Caveats
-------

A brief note on caveats of using Hamilton + Ray:

#. We are looking to graduate Ray support from `"experimental"`, but to do that we need your feedback! The API
   has been very stable (it hasn’t changed since launch), but to feel good about making it permanent, we’d love
   to know what you think.
#. We don’t expose all of Ray’s functionality, but we could, e.g. memory-aware scheduling, or specifying
   resources for specific functions. Let us know if you want something exposed by creating an issue on GitHub:
   `https://github.com/stitchfix/hamilton <https://github.com/stitchfix/hamilton>`_.

To conclude
-----------

By using Hamilton, you can organize and scale the human side of writing data transforms (no, I didn’t talk about this
in this post, but see :doc:`../talks-or-podcasts-or-blogs-or-papers` to convince yourself there 😉). With Ray, you can
scale your data workflows beyond the limits of your laptop. Together, the sky’s the limit!