Prompt: What is MLTransform in Apache Beam and how can I use it to process my data?
Response: MLTransform is a class in Apache Beam that consolidates various transforms for preprocessing and transforming data in machine learning pipelines. It offers a range of capabilities.

The feature that sets MLTransform apart is its ability to perform a full pass on the dataset, which enables common preprocessing steps that depend on dataset-wide values, such as scaling by the dataset's minimum and maximum values or computing a vocabulary.
MLTransform also facilitates the generation of text embeddings and the implementation of specialized processing modules for diverse machine learning tasks. It includes text embedding transforms such as SentenceTransformerEmbeddings (which uses models from Hugging Face Sentence Transformers) and VertexAITextEmbeddings (which uses models from the Vertex AI text-embeddings API), and data processing transforms such as ApplyBuckets, Bucketize, ComputeAndApplyVocabulary, and more, sourced from the TensorFlow Transform (TFT) library.

Apache Beam ensures consistency in preprocessing steps for both training and test data by implementing a workflow for writing and reading artifacts within the MLTransform class. These artifacts are additional data elements generated by data transformations, such as the minimum and maximum values from a ScaleTo01 transformation. The MLTransform class lets you specify whether it creates or retrieves artifacts using the write_artifact_location and read_artifact_location parameters. When you use the write_artifact_location parameter, MLTransform runs the specified transformations on the dataset, creates artifacts from these transformations, and stores them in the specified location for future use. When you use the read_artifact_location parameter, MLTransform retrieves artifacts from the specified location and incorporates them into the transform.

This artifact workflow ensures that test data undergoes the same preprocessing steps as training data. For instance, after training a machine learning model on data preprocessed by MLTransform with the write_artifact_location parameter, you can run inference on new test data using the read_artifact_location parameter to fetch the transformation artifacts created during preprocessing from the specified location.
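The write-then-read idea behind artifacts can be illustrated without Beam. The following is a minimal plain-Python sketch, not MLTransform's implementation: the helper names (`write_artifacts`, `read_artifacts`, `scale_to_0_1`) and the JSON file name are hypothetical and chosen only for illustration. It computes the minimum and maximum on training data, stores them at an artifact location, and later reuses them on test data so both datasets are scaled identically.

```python
import json
import os
import tempfile

# Hypothetical file name used only in this sketch.
ARTIFACT_FILE = "scale_to_0_1.json"


def write_artifacts(values, artifact_dir):
    """Make a full pass over the training values and store min/max."""
    artifacts = {"min": min(values), "max": max(values)}
    with open(os.path.join(artifact_dir, ARTIFACT_FILE), "w") as f:
        json.dump(artifacts, f)
    return artifacts


def read_artifacts(artifact_dir):
    """Load the min/max that were stored during training."""
    with open(os.path.join(artifact_dir, ARTIFACT_FILE)) as f:
        return json.load(f)


def scale_to_0_1(values, artifacts):
    """Scale values using previously computed min/max artifacts."""
    span = artifacts["max"] - artifacts["min"]
    if span == 0:
        # Avoid division by zero; returning zeros is a choice made for this sketch.
        return [0.0 for _ in values]
    return [(v - artifacts["min"]) / span for v in values]


artifact_dir = tempfile.mkdtemp()

# Training: compute and write artifacts, then transform.
train = [2.0, 4.0, 6.0, 10.0]
train_scaled = scale_to_0_1(train, write_artifacts(train, artifact_dir))

# Inference: read the stored artifacts instead of recomputing them.
test = [4.0, 8.0]
test_scaled = scale_to_0_1(test, read_artifacts(artifact_dir))
```

Because the test data is scaled with the training minimum and maximum rather than its own, a value such as 4.0 maps to the same output in both datasets.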
To preprocess data using MLTransform, follow these steps:
- Import MLTransform and any specific transforms you intend to use.
- Build a pipeline that includes the application of MLTransform with specified transformations, and any additional processing or output steps.

Here is a code snippet that uses MLTransform, which you can modify and integrate into your Apache Beam pipeline:
    import apache_beam as beam
    from apache_beam.ml.transforms.base import MLTransform
    from apache_beam.ml.transforms.tft import <TRANSFORM_NAME>
    import tempfile

    data = [
        {
            <DATA>
        },
    ]

    artifact_location = tempfile.mkdtemp()
    <TRANSFORM_FUNCTION_NAME> = <TRANSFORM_NAME>(columns=['x'])

    with beam.Pipeline() as p:
        transformed_data = (
            p
            | beam.Create(data)
            | MLTransform(write_artifact_location=artifact_location).with_transform(
                <TRANSFORM_FUNCTION_NAME>)
            | beam.Map(print))
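In this example, MLTransform writes artifacts to the specified location using write_artifact_location. In the provided code snippet, replace <TRANSFORM_NAME> with the name of the transform to use, <DATA> with the input data to transform, and <TRANSFORM_FUNCTION_NAME> with the name you assign to the transform in your code.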
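The following example illustrates using MLTransform to standardize data to a mean of 0 and a variance of 1 using the ScaleToZScore transformation:
    from apache_beam.ml.transforms.tft import ScaleToZScore

    scale_to_z_score_transform = ScaleToZScore(columns=['x', 'y'])

    with beam.Pipeline() as p:
        (
            p
            # Feed the input rows into the pipeline before applying MLTransform.
            | beam.Create(data)
            | MLTransform(write_artifact_location=artifact_location).with_transform(
                scale_to_z_score_transform))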
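To make the arithmetic behind z-score scaling concrete, here is a minimal plain-Python sketch. It is not Beam's or TFT's implementation; the function name `scale_to_z_score` is hypothetical, and the use of the population standard deviation (dividing by the number of values) is an assumption of this sketch. It shows why the transformation needs a full pass over the data: the mean and standard deviation must be known before any single value can be scaled.

```python
import math


def scale_to_z_score(values):
    """Return (v - mean) / std for each value, using dataset-wide statistics."""
    mean = sum(values) / len(values)
    # Assumption: population variance (divide by N), not sample variance.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        # All values are identical; returning zeros is a choice made for this sketch.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
x_scaled = scale_to_z_score(x)
```

The scaled values are centered on 0 with unit variance, so they include negative numbers and are not confined to the range between 0 and 1.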
You can pass data processing transforms to MLTransform using either the with_transform method or a list. The with_transform method is useful when you want to apply a single transform to your dataset. On the other hand, providing a list of transforms is beneficial when you need to apply multiple transforms sequentially:
    MLTransform(transforms=transforms, write_artifact_location=write_artifact_location)