Prompt: What metrics are available for monitoring the performance of ML model inference in Apache Beam?
Response: Apache Beam enables efficient inference on both local and remote ML models within your pipelines through the RunInference API. This functionality is available in the Python SDK versions 2.40.0 and later. The Java SDK versions 2.41.0 and later also support the API through Apache Beam’s Multi-language Pipelines framework.
The RunInference API provides built-in support for monitoring the performance of ML model inference. Here is a list of commonly used metrics for inference benchmarking:
model_byte_size: memory footprint of the model after loading and initialization.
load_model_latency_milli_secs: time taken to load and initialize the model.
num_inferences: total number of elements passed to run_inference().
inference_batch_latency_micro_secs: time taken to perform inference across all batches of examples.
inference_request_batch_byte_size: size of the batch in bytes.
inference_request_batch_size: number of examples in the batch.
failed_batches_counter: number of failed batches.
You can derive additional metrics from the ones listed above (see the sketch below). Example: total time taken for inference = num_inferences x inference_batch_latency_micro_secs_MEAN.
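As a quick illustration of that derivation, here is a minimal sketch in Python; the metric values are hypothetical, not from a real run:

# Hypothetical values read from a finished pipeline's metric results.
num_inferences = 10_000                             # counter value
inference_batch_latency_micro_secs_mean = 1_250.0   # distribution mean, in microseconds

# Derived metric from the formula above: total time taken for inference.
total_micro_secs = num_inferences * inference_batch_latency_micro_secs_mean
print(f"Approximate total inference time: {total_micro_secs / 1e6:.2f} s")  # 12.50 s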
Here is a simplified example of how to use the RunInference API to perform inference on a language model (LM):
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference

with beam.Pipeline(options=pipeline_options) as pipeline:
    _ = (
        pipeline
        | "Create inputs" >> beam.Create(<INPUTS>)
        | "Tokenize" >> beam.ParDo(Tokenize(<TOKENIZER_NAME>))
        | "Inference" >> RunInference(<MODEL_HANDLER>)
    )
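The bracketed placeholders and the Tokenize transform are defined by the full example in the repository. As a rough sketch of what such a DoFn might look like, here is one possible implementation that wraps a Hugging Face tokenizer; the class body and the transformers dependency are assumptions for illustration, not the repository's exact code:

import apache_beam as beam
from transformers import AutoTokenizer  # assumed dependency for this sketch

class Tokenize(beam.DoFn):
    """Sketch: converts raw text into model-ready token tensors."""
    def __init__(self, tokenizer_name):
        self._tokenizer_name = tokenizer_name

    def setup(self):
        # Load the tokenizer once per worker, not once per element.
        self._tokenizer = AutoTokenizer.from_pretrained(self._tokenizer_name)

    def process(self, text):
        yield self._tokenizer(text, return_tensors="pt")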
After running the pipeline, you can query the metrics using the following code:
metrics = pipeline.result.metrics().query(beam.metrics.MetricsFilter())
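To pull out a single metric, narrow the filter by name. A minimal sketch, assuming the pipeline above has finished running:

from apache_beam.metrics.metric import MetricsFilter

result = pipeline.result  # available after the with-block completes
query = result.metrics().query(MetricsFilter().with_name("num_inferences"))
for counter in query["counters"]:
    print(counter.key.metric.name, counter.committed)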
Metrics are also available in the Google Cloud Dataflow UI. For the complete source code of the example and instructions to run it in Dataflow, refer to the Apache Beam GitHub repository.
For a sample implementation of a metrics collector that tracks ML-related performance and memory usage, see the base Python inference example in the Apache Beam GitHub repository.
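That collector builds on Beam's standard Metrics API. As a hedged sketch of the general pattern (not the repository's exact code), a DoFn can declare and update its own counters, which then appear in the same query results as the RunInference metrics; the class and counter names here are illustrative:

import apache_beam as beam
from apache_beam.metrics import Metrics

class TrackThroughput(beam.DoFn):
    """Sketch: counts elements flowing past this step."""
    def __init__(self):
        self._elements = Metrics.counter(self.__class__, "elements_processed")

    def process(self, element):
        self._elements.inc()
        yield element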