This directory contains Apache Beam examples for MLTransform pipelines.
mltransform_generate_vocab.py builds a vocabulary artifact from batch input rows using MLTransform + ComputeAndApplyVocabulary.
Pipeline steps:

1. Reads JSONL (--input_file) or BigQuery (--input_table).
2. Selects the configured columns (--columns).
3. Normalizes text (trim, optional lowercasing).
4. Runs ComputeAndApplyVocabulary with top-k and min-frequency constraints using space-delimited token splitting.

Arguments:

- --output_vocab
- --columns
- --input_file
- --input_table
- --vocab_size (default: 50000)
- --min_frequency (default: 1)
- --lowercase (default: true)
- --input_expand_factor (default: 1, useful for perf/load testing)

Example run:

    python -m apache_beam.examples.ml_transform.mltransform_generate_vocab \
      --input_file=/tmp/input.jsonl \
      --output_vocab=/tmp/vocab.txt \
      --columns=text,category \
      --vocab_size=5 \
      --min_frequency=1 \
      --lowercase=true \
      --input_expand_factor=1 \
      --runner=DirectRunner
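As a rough illustration of how the flags listed above could be parsed, here is a minimal argparse sketch. Only the flag names and defaults come from this README. The value types, the `parse_bool` and `parse_columns` helpers, and the choice not to mark any flag as required are assumptions, and the actual example script may parse its options differently.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Interpret 'true'/'false' style strings, as in --lowercase=true."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_columns(value: str) -> list:
    """Split a comma-separated column list such as 'text,category'."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names and defaults mirror the README,
    # the types and helpers are assumptions made for this sketch.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file")
    parser.add_argument("--input_table")
    parser.add_argument("--output_vocab")
    parser.add_argument("--columns", type=parse_columns)
    parser.add_argument("--vocab_size", type=int, default=50000)
    parser.add_argument("--min_frequency", type=int, default=1)
    parser.add_argument("--lowercase", type=parse_bool, default=True)
    parser.add_argument("--input_expand_factor", type=int, default=1)
    return parser


# Unrecognized flags (for example --runner) are left over for the Beam pipeline options.
known, pipeline_args = build_parser().parse_known_args([
    "--input_file=/tmp/input.jsonl",
    "--output_vocab=/tmp/vocab.txt",
    "--columns=text,category",
    "--vocab_size=5",
    "--runner=DirectRunner",
])
print(known.columns, known.vocab_size, pipeline_args)
```

Using `parse_known_args` keeps the example-specific flags separate from runner options, which is a common pattern in Beam example scripts.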
JSONL input with object rows, for example:
    {"id":"1","text":"Beam beam ML pipeline"}
    {"id":"2","text":"Beam pipeline dataflow"}
    {"id":"3","text":"ML transform beam"}
    {"id":"4","text":"vocab vocab vocab test"}
    {"id":"5","text":"rare_token_once"}
    {"id":"6","text":""}
    {"id":"7","text":null}
The integration tests in mltransform_generate_vocab_test.py generate this sample data programmatically.
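To make the row handling concrete, the following standalone sketch parses the sample rows and applies the normalization described in this README (trim, optional lowercasing, space-delimited splitting). It does not use Beam. The `extract_tokens` helper is hypothetical, and skipping null, empty, and missing columns is an assumption about how such rows are treated.

```python
import json

SAMPLE_LINES = [
    '{"id":"1","text":"Beam beam ML pipeline"}',
    '{"id":"2","text":"Beam pipeline dataflow"}',
    '{"id":"3","text":"ML transform beam"}',
    '{"id":"4","text":"vocab vocab vocab test"}',
    '{"id":"5","text":"rare_token_once"}',
    '{"id":"6","text":""}',
    '{"id":"7","text":null}',
]


def extract_tokens(line, columns, lowercase=True):
    """Return the normalized tokens found in the selected columns of one JSONL row."""
    row = json.loads(line)
    tokens = []
    for column in columns:
        value = row.get(column)
        # Assumption: null, missing, and non-string values contribute no tokens.
        if not isinstance(value, str):
            continue
        text = value.strip()
        if lowercase:
            text = text.lower()
        # Space-delimited splitting; drop empty pieces from repeated spaces.
        tokens.extend(piece for piece in text.split(" ") if piece)
    return tokens


# "category" is absent from every sample row, so it is simply skipped.
tokens_per_row = [extract_tokens(line, ["text", "category"]) for line in SAMPLE_LINES]
for tokens in tokens_per_row:
    print(tokens)
```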
One token per line, in the order produced by ComputeAndApplyVocabulary.
Example output:
beam
ml
For this sample and config:
--columns=text --min_frequency=2 --vocab_size=3
the expected output is:
beam
vocab
ml
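The expected output above can be checked by hand with a small frequency count. This is a plain-Python sketch, not the MLTransform implementation: it assumes lowercasing is on and that frequency means total occurrences across rows, which is consistent with vocab (present in only one row) passing --min_frequency=2.

```python
from collections import Counter

# The "text" values from the sample JSONL rows.
SAMPLE_TEXTS = [
    "Beam beam ML pipeline",
    "Beam pipeline dataflow",
    "ML transform beam",
    "vocab vocab vocab test",
    "rare_token_once",
    "",
    None,
]


def token_counts(texts, lowercase=True):
    """Count total token occurrences across all non-empty texts."""
    counts = Counter()
    for text in texts:
        if not text:
            continue
        normalized = text.strip()
        if lowercase:
            normalized = normalized.lower()
        counts.update(piece for piece in normalized.split(" ") if piece)
    return counts


counts = token_counts(SAMPLE_TEXTS)

# Apply the --min_frequency=2 constraint.
min_frequency = 2
eligible = {token: n for token, n in counts.items() if n >= min_frequency}
print(eligible)
```

Four tokens meet the threshold: beam (4), vocab (3), ml (2), and pipeline (2). With --vocab_size=3, beam and vocab are unambiguous, while ml and pipeline are tied, so which of them takes the third slot depends on how ComputeAndApplyVocabulary orders ties. The expected output listed above has ml.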
Test data for happy path and null/empty/missing columns is generated inline in mltransform_generate_vocab_test.py.
Load testing (--input_expand_factor): throughput/cost benchmarking on Dataflow.