////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
////
[[Pipelines]]
:imagesdir: ../assets/images
:description: Pipelines, together with workflows, are the main building blocks in Hop. Pipelines perform the heavy data lifting: in a pipeline, you read data from one or more sources, perform a number of operations (joins, lookups, filters and lots more) and finally write the processed data to one or more target platforms.
= Pipelines
== Pipelines overview
Pipelines, together with workflows, are the main building blocks in Hop. Pipelines perform the heavy data lifting: in a pipeline, you read data from one or more sources, perform a number of operations (joins, lookups, filters and lots more) and finally write the processed data to one or more target platforms.
Pipelines are a network of xref:pipeline/transforms.adoc[transforms], connected by hops. Just like the xref:workflow/actions.adoc[actions] in a workflow, each transform is a small piece of functionality. Combining a number of transforms allows Hop developers to build powerful data processing and, together with workflows, orchestration solutions.
Even though there is some visual resemblance, workflows and pipelines operate very differently.
The core principles of pipelines are:
* pipelines are networks. Each transform in a pipeline is part of the network (the abridged pipeline file after this list shows how such a network is stored).
* a pipeline runs all of its transforms in parallel. All transforms are started and process data simultaneously. In a simple pipeline where you read data from a large file, do some processing and finally write to a database table, you're typically still reading from the file while you're already loading data into the database.
* data flows through the various transforms in a pipeline over hops. In contrast to workflow hops, pipeline hops typically don't have an exit status. Pipelines do have some routing capabilities, e.g. through the xref:pipeline/transforms/filterrows.adoc[Filter Rows] transform and xref:pipeline/errorhandling.adoc[error handling], but the core pipeline principle still applies: the pipeline is a network, and data flows through the network in parallel.
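To make this concrete, the snippet below is a heavily abridged sketch of how a pipeline is stored on disk as a `.hpl` XML file: a list of transforms plus an `<order>` block that defines the hops between them. The transform names, types and layout are illustrative only and many elements are omitted; in practice you build pipelines in the Hop GUI rather than by hand.

[source,xml]
----
<!-- Abridged, illustrative sketch of a pipeline (.hpl) file:
     two transforms connected by one hop. Many required elements are omitted. -->
<pipeline>
  <info>
    <name>minimal-example</name>
  </info>
  <transform>
    <name>read-input</name>
    <type>CsvInput</type>
    <!-- transform-specific settings go here -->
  </transform>
  <transform>
    <name>write-output</name>
    <type>TableOutput</type>
  </transform>
  <order>
    <!-- the hop: rows flow from read-input to write-output -->
    <hop>
      <from>read-input</from>
      <to>write-output</to>
      <enabled>Y</enabled>
    </hop>
  </order>
</pipeline>
----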
== Example pipeline walk-through
The example below shows a very basic pipeline. This is what happens when we run this pipeline:
* the pipeline has 7 transforms. All 7 of these transforms become active when we start the pipeline.
* the "read-25M-records" transform starts reading data from a file, and pushes that data down the stream to "perform-calculations" and the following transforms. Since reading 25 million records takes a while, some data may already have finished processing while we're still reading records from the file.
* the "lookup-sql-data" matches data we read from the file with data we retrieved from the "read-sql-data" transform. The xref:pipeline/transforms/streamlookup.adoc[Stream Lookup] accepts input from the "read-sql-data", which is shown with the information icon image:icons/info.svg[] on the hop.
* once the data from the file and sql query are matched, we check a condition with the xref:pipeline/transforms/filterrows.adoc[Filter Rows] transform in "condition?". The output of this data is passed to "write-to-table" or "write-to-file", depending on whether the condition outcome was true or false.
image:hop-gui/pipeline/basic-pipeline.png[Pipelines - basic pipeline, width="65%"]
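For reference, the hop network of this example pipeline roughly corresponds to an `<order>` block like the one below in the saved `.hpl` file. This is a hand-written, abridged approximation based on the walk-through above, not the actual file behind the screenshot.

[source,xml]
----
<!-- Approximate hop network of the example pipeline (abridged, illustrative) -->
<order>
  <hop><from>read-25M-records</from><to>perform-calculations</to><enabled>Y</enabled></hop>
  <hop><from>perform-calculations</from><to>lookup-sql-data</to><enabled>Y</enabled></hop>
  <!-- info stream: the SQL data feeds the Stream Lookup in "lookup-sql-data" -->
  <hop><from>read-sql-data</from><to>lookup-sql-data</to><enabled>Y</enabled></hop>
  <hop><from>lookup-sql-data</from><to>condition?</to><enabled>Y</enabled></hop>
  <!-- the Filter Rows transform routes each row to one of two targets -->
  <hop><from>condition?</from><to>write-to-table</to><enabled>Y</enabled></hop>
  <hop><from>condition?</from><to>write-to-file</to><enabled>Y</enabled></hop>
</order>
----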
== Next steps
Pipelines are an extensive topic. Check the pages below to learn more about working with pipelines:
* xref:pipeline/hop-pipeline-editor.adoc[Pipeline Editor]
* xref:pipeline/create-pipeline.adoc[Create a Pipeline]
* xref:pipeline/run-preview-debug-pipeline.adoc[Run, Preview and Debug a Pipeline]
* xref:pipeline/pipeline-run-configurations/pipeline-run-configurations.adoc[Pipeline Run Configurations]
* xref:pipeline/metadata-injection.adoc[Metadata Injection]
* xref:pipeline/partitioning.adoc[Partitioning]
* xref:pipeline/beam/getting-started-with-beam.adoc[Getting started with Apache Beam]
* xref:pipeline/transforms.adoc[Transforms]