| //// |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| http://www.apache.org/licenses/LICENSE-2.0 |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| //// |
| [[MetadataInjection]] |
| :imagesdir: ../assets/images |
| :description: Metadata injection inserts data from various sources into a template pipeline at runtime to reduce repetitive tasks. |
| |
| = Metadata Injection |
| |
| Metadata injection inserts data from various sources into a template pipeline at runtime to reduce repetitive tasks. |
| |
| For example, you might have a simple pipeline to load transaction data values from a supplier, filter specific values, and output them to a file. |
| If you have more than one supplier, you would need to run this simple pipeline for each supplier. |
With metadata injection, however, you can reuse this simple repetitive pipeline by injecting metadata from another pipeline that contains the ETL Metadata Injection transform.
This transform coordinates the data values from the various inputs through the metadata you define.
This process removes the need to adjust and run the repetitive pipeline for each specific input.
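
The pattern is easier to see in code.
The following is a minimal conceptual sketch in Python (not Hop's actual API; the supplier names, file names, and threshold values are invented for illustration): a single template routine replaces one near-identical pipeline per supplier, because the per-supplier details arrive as metadata at runtime.

[source,python]
----
import csv

# Template "pipeline": load -> filter -> output.
# The supplier-specific details are not hard-coded; they are
# injected as metadata when the template runs.
def run_template(metadata):
    with open(metadata["input_file"], newline="") as src, \
         open(metadata["output_file"], "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep only the rows this supplier's metadata asks for.
            if float(row[metadata["filter_field"]]) >= metadata["min_amount"]:
                writer.writerow(row)

# Metadata for each supplier (hypothetical values): one record per run.
suppliers = [
    {"input_file": "supplier_a.csv", "output_file": "out_a.csv",
     "filter_field": "amount", "min_amount": 100.0},
    {"input_file": "supplier_b.csv", "output_file": "out_b.csv",
     "filter_field": "amount", "min_amount": 250.0},
]

# One template, many runs: no copy of the pipeline per supplier.
for metadata in suppliers:
    run_template(metadata)
----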
| |
| The repetitive pipeline is known as the template pipeline. |
| The template pipeline is called by the ETL Metadata Injection transform. |
You create a separate pipeline that prepares the common values you want to use as metadata and injects them into the template pipeline through the ETL Metadata Injection transform.
| |
| We recommend the following basic procedure for using this transform to inject metadata: |
| |
| 1. Optimize your data for injection, such as preparing folder structures and inputs. |
| |
| 2. Develop pipelines for the repetitive process (the template pipeline), for metadata injection through the ETL Metadata Injection transform, and for handling multiple inputs. |
| |
| The metadata is injected into the template pipeline through any transform that supports metadata injection. |
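
As a rough analogy for what happens during injection (again a hedged Python sketch, not Hop's implementation: the template description, transform names, and property names below are invented for illustration), each metadata field is mapped onto a configurable property of a transform in the template before the template runs:

[source,python]
----
# A toy "template pipeline" description: transforms with properties
# left unset, to be filled in by injection. Names are illustrative only.
template = {
    "Load transactions": {"type": "csv-input", "filename": None},
    "Filter values":     {"type": "filter", "field": None, "min": None},
    "Write result":      {"type": "text-output", "filename": None},
}

# Injection mapping: metadata field -> (transform, property).
mapping = {
    "input_file":   ("Load transactions", "filename"),
    "filter_field": ("Filter values", "field"),
    "min_amount":   ("Filter values", "min"),
    "output_file":  ("Write result", "filename"),
}

def inject(template, mapping, metadata):
    """Return a resolved copy of the template with metadata injected."""
    resolved = {name: dict(props) for name, props in template.items()}
    for field, (transform, prop) in mapping.items():
        resolved[transform][prop] = metadata[field]
    return resolved

# One metadata record produces one fully configured pipeline.
resolved = inject(template, mapping,
                  {"input_file": "supplier_a.csv", "filter_field": "amount",
                   "min_amount": 100.0, "output_file": "out_a.csv"})
print(resolved["Filter values"])  # {'type': 'filter', 'field': 'amount', 'min': 100.0}
----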
| |
| == Supported Transforms |
| |
The goal is to add metadata injection (MDI) support to all transforms. The current status (as of 29 July 2022) is:
| |
| |=== |
| |Transform|Supports MDI |
| |Abort|Y |
| |Add a checksum|Y |
| |Add constants|Y |
| |Add sequence|Y |
| |Add value fields changing sequence|Y |
| |Add XML|Y |
| |Analytic query|Y |
| |Apache Tika|Y |
| |Append streams|Y |
| |Avro Decode|Y |
| |Avro Encode|Y |
| |Avro File Input|Y |
| |Avro File Output|Y |
| |Azure Event Hubs Listener|Y |
| |Azure Event Hubs Writer|Y |
| |Beam BigQuery Input|Y |
| |Beam BigQuery Output|Y |
| |Beam Bigtable Input|Y |
| |Beam Bigtable Output|Y |
| |Beam GCP Pub/Sub : Publish|Y |
| |Beam GCP Pub/Sub : Subscribe|Y |
| |Beam Input|Y |
| |Beam Kafka Consume|Y |
| |Beam Kafka Produce|Y |
| |Beam Kinesis Consume|Y |
| |Beam Kinesis Produce|Y |
| |Beam Output|Y |
| |Beam Timestamp|Y |
| |Beam Window|Y |
| |Block until transforms finish|Y |
| |Blocking transform|Y |
| |Calculator|Y |
| |Call DB procedure|Y |
| |Cassandra input|Y |
| |Cassandra output|Y |
| |Change file encoding|Y |
| |Check if file is locked|Y |
| |Check if webservice is available|Y |
| |Clone row|Y |
| |Closure generator|Y |
| |Coalesce Fields|Y |
| |Column exists|Y |
| |Combination lookup/update|Y |
| |Concat Fields|Y |
| |Copy rows to result|Y |
| |Credit card validator|Y |
| |CSV file input|Y |
| |Data grid|Y |
| |Database join|Y |
| |Database lookup|Y |
| |De-serialize from file|Y |
| |Delay row|Y |
| |Delete|Y |
| |Detect empty stream|Y |
| |Dimension lookup/update|Y |
| |Doris bulk loader|Y |
| |Dummy (do nothing)|Y |
| |Dynamic SQL row|Y |
| |EDI to XML|Y |
| |Email messages input|Y |
| |Enhanced JSON Output|Y |
| |ETL metadata injection|Y |
| |Execute a process|Y |
| |Execute row SQL script|Y |
| |Execute SQL script|Y |
| |Execute Unit Tests|Y |
| |Fake data|Y |
| |File exists|Y |
| |File Metadata|Y |
| |Filter rows|Y |
| |Formula|Y |
| |Fuzzy match|Y |
| |Generate random value|Y |
| |Generate rows|Y |
| |Get data from XML|Y |
| |Get file names|Y |
| |Get files from result|Y |
| |Get files rows count|Y |
| |Get ID from hop server|Y |
| |Get Neo4j Logging Info|Y |
| |Get records from stream|Y |
| |Get rows from result|Y |
| |Get Server Status|Y |
| |Get subfolder names|Y |
| |Get system info|Y |
| |Get table names|Y |
| |Get variables|Y |
| |Group by|Y |
| |HTTP client|Y |
| |HTTP post|Y |
| |Identify last row in a stream|Y |
| |If Null|Y |
| |Injector|Y |
| |Insert / update|Y |
| |Java filter|Y |
| |JavaScript|Y |
| |Join rows (cartesian product)|Y |
| |JSON input|Y |
| |JSON output|Y |
| |Kafka Consumer|Y |
| |Kafka Producer|Y |
| |LDAP input|Y |
| |LDAP output|Y |
| |Load file content in memory|Y |
| |Mail|Y |
| |Mapping Input|Y |
| |Mapping Output|Y |
| |Memory group by|Y |
| |Merge join|Y |
| |Merge rows (diff)|Y |
| |Metadata Input|Y |
| |Metadata structure of stream|Y |
| |Microsoft Excel input|Y |
| |Microsoft Excel writer|Y |
| |MonetDB bulk loader|Y |
| |MongoDB Delete|Y |
| |MongoDB input|Y |
| |MongoDB output|Y |
| |Multiway merge join|Y |
| |Neo4j Cypher|Y |
| |Neo4j Generate CSVs|Y |
| |Neo4j Graph Output|Y |
| |Neo4j Import|Y |
| |Neo4J Output|Y |
| |Neo4j Split Graph|Y |
| |Null if|Y |
| |Number range|Y |
| |Parquet File Input|Y |
|Parquet File Output|Y
| |PGP decrypt stream|Y |
| |PGP encrypt stream|Y |
| |Pipeline executor|Y |
| |Pipeline Logging|Y |
| |Pipeline Probe|Y |
| |PostgreSQL Bulk Loader|Y |
| |Process files|Y |
| |Properties input|Y |
| |Properties output|Y |
| |Regex evaluation|Y |
| |Replace in string|Y |
| |Reservoir sampling|Y |
| |REST client|Y |
| |Row denormaliser|Y |
| |Row flattener|Y |
| |Row normaliser|Y |
| |Rules accumulator|Y |
| |Rules executor|Y |
| |Run SSH commands|Y |
| |Salesforce delete|Y |
| |Salesforce input|Y |
| |Salesforce insert|Y |
| |Salesforce update|Y |
| |Salesforce upsert|Y |
| |Sample rows|Y |
| |SAS Input|Y |
| |Select values|Y |
| |Serialize to file|Y |
| |Set field value|Y |
| |Set field value to a constant|Y |
| |Set files in result|Y |
| |Set variables|Y |
| |Simple Mapping|Y |
| |Snowflake Bulk Loader|Y |
| |Sort rows|Y |
| |Sorted merge|Y |
| |Split field to rows|Y |
| |Split fields|Y |
| |Splunk Input|Y |
| |SQL file output|Y |
| |SSTable output|Y |
| |Standardize phone number|Y |
| |Stream lookup|Y |
| |Stream Schema Merge|Y |
| |String operations|Y |
| |Strings cut|Y |
| |Switch / case|Y |
| |Synchronize after merge|Y |
| |Table compare|Y |
| |Table exists|Y |
| |Table input|Y |
| |Table output|Y |
| |Teradata Fastload bulk loader|Y |
| |Text file input|Y |
| |Text file input (deprecated)|Y |
| |Text file output|Y |
| |Token Replacement|Y |
| |Unique rows|Y |
| |Unique rows (HashSet)|Y |
| |Update|Y |
| |User defined Java class|Y |
| |User defined Java expression|Y |
| |Value mapper|Y |
| |Web services lookup|Y |
| |Workflow executor|Y |
| |Workflow Logging|Y |
| |Write to log|Y |
| |XML input stream (StAX)|Y |
| |XML join|Y |
| |XML output|Y |
| |XSD validator|Y |
| |XSL Transformation|Y |
|YAML input|Y
| |Zip file|Y |
| |=== |