| # Licensed under the Apache License, Version 2.0 (the "License"); |
| # you may not use this file except in compliance with the License. |
| # You may obtain a copy of the License at |
| # |
| # http://www.apache.org/licenses/LICENSE-2.0 |
| # |
| # Unless required by applicable law or agreed to in writing, software |
| # distributed under the License is distributed on an "AS IS" BASIS, |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| # See the License for the specific language governing permissions and |
| # limitations under the License. |
| |
| capability-matrix: |
| columns: |
| - class: dataflow |
| name: Google Cloud Dataflow |
| - class: flink |
| name: Apache Flink |
| - class: spark-rdd |
| name: Apache Spark (RDD/DStream based) |
| - class: spark-dataset |
| name: Apache Spark Structured Streaming (Dataset based) |
| - class: samza |
| name: Apache Samza |
| - class: nemo |
| name: Apache Nemo |
| - class: jet |
| name: Hazelcast Jet |
| - class: twister2 |
| name: Twister2 |
| - class: python direct |
| name: Python Direct FnRunner |
| - class: go direct |
| name: Go Direct Runner |
| |
| categories: |
| - description: What is being computed? |
| anchor: what |
| color-y: "fff" |
| color-yb: "f6f6f6" |
| color-p: "f9f9f9" |
| color-pb: "d8d8d8" |
| color-n: "e1e0e0" |
| color-nb: "bcbcbc" |
| rows: |
| - name: ParDo |
| description: Element-wise transformation parameterized by a chunk of user code. Elements are processed in bundles, with initialization and termination hooks. Bundle size is chosen by the runner and cannot be controlled by user code. ParDo processes a main input PCollection one element at a time, but provides side input access to additional PCollections. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: Batch mode uses large bundle sizes. Streaming uses smaller bundle sizes. |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: ParDo itself, as per-element transformation with UDFs, is fully supported by Flink for both batch and streaming. |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: ParDo applies per-element transformations as Spark FlatMapFunction. |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: ParDo applies per-element transformations as Spark FlatMapFunction. |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: Supported with per-element transformation. |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: GroupByKey |
| description: Grouping of key-value pairs per key, window, and pane. (See also other tabs.) |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes) the Flink runner uses the Beam code. This guarantees support for all windowing and triggering mechanisms." |
| - class: spark-rdd |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "Using Spark's <tt>groupByKey</tt>. GroupByKey with multiple trigger firings in streaming mode is a work in progress." |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "Using Spark's <tt>groupByKey</tt>." |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "Uses Samza's partitionBy for key grouping and Beam's logic for window aggregation and triggering." |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: Flatten |
        description: Concatenates multiple homogeneously typed collections together.
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: Some corner cases like flatten on empty collections are not yet supported. |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: Combine |
| description: 'Application of an associative, commutative operation over all values ("globally") or over all values associated with each key ("per key"). Can be implemented using ParDo, but often more efficient implementations exist.' |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: "efficient execution" |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: "fully supported" |
| l3: Uses a combiner for pre-aggregation for batch and streaming. |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "Using Spark's <tt>combineByKey</tt> and <tt>aggregate</tt> functions." |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "Using Spark's <tt>Aggregator</tt> and agg function" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: Use combiner for efficient pre-aggregation. |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "Batch mode uses pre-aggregation" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "Batch mode uses pre-aggregation" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: Composite Transforms |
| description: Allows easy extensibility for library writers. In the near future, we expect there to be more information provided at this level -- customized metadata hooks for monitoring, additional runtime/environment hooks, etc. |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: supported via inlining |
| l3: Currently composite transformations are inlined during execution. The structure is later recreated from the names, but other transform level information (if added to the model) will be lost. |
| - class: flink |
| l1: "Partially" |
| l2: supported via inlining |
| l3: "" |
| - class: spark-rdd |
| l1: "Partially" |
| l2: supported via inlining |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: supported via inlining only in batch mode |
| l3: "" |
| - class: samza |
| l1: "Partially" |
| l2: supported via inlining |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Partially" |
| l2: supported via inlining |
| l3: "" |
| - class: twister2 |
| l1: "Partially" |
| l2: supported via inlining |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: Side Inputs |
        description: Side inputs are additional <tt>PCollections</tt> whose contents are computed during pipeline execution and then made accessible to DoFn code. The exact shape of the side input depends both on the <tt>PCollectionView</tt> used to describe the access pattern (iterable, map, singleton) and the window of the element from the main input that is currently being processed.
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: some size restrictions in streaming |
| l3: Batch mode supports a distributed implementation, but streaming mode may force some size restrictions. Neither mode is able to push lookups directly up into key-based sources. |
| - class: flink |
| l1: "Yes" |
| l2: some size restrictions in streaming |
| l3: Batch mode supports a distributed implementation, but streaming mode may force some size restrictions. Neither mode is able to push lookups directly up into key-based sources. |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "Using Spark's broadcast variables. In streaming mode, side inputs may update but only between micro-batches." |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "Using Spark's broadcast variables." |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: Uses Samza's broadcast operator to distribute the side inputs. |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Partially" |
| l2: with restrictions |
            l3: Supported only when the side input source is bounded and windowing uses the global window.
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: Source API |
| description: Allows users to provide additional input sources. Supports both bounded and unbounded data. Includes hooks necessary to provide efficient parallelization (size estimation, progress information, dynamic splitting, etc). |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: Support includes autotuning features (https://cloud.google.com/dataflow/service/dataflow-service-desc#autotuning-features). |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: |
| - class: spark-dataset |
| l1: "Partially" |
| l2: bounded source only |
            l3: "Using Spark's DatasourceV2 API in micro-batch mode (continuous streaming mode is tagged experimental in Spark and does not support aggregation)."
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: Metrics |
| description: Allow transforms to gather simple metrics across bundles in a <tt>PTransform</tt>. Provide a mechanism to obtain both committed and attempted metrics. Semantically similar to using an additional output, but support partial results as the transform executes, and support both committed and attempted values. Will likely want to augment <tt>Metrics</tt> to be more useful for processing unbounded data by making them windowed. |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: "" |
| l3: Gauge metrics are not supported. All other metric types are supported. |
| - class: flink |
| l1: "Partially" |
            l2: All metric types are supported.
| l3: Only attempted values are supported. No committed values for metrics. |
| - class: spark-rdd |
| l1: "Partially" |
| l2: All metric types are supported. |
| l3: Only attempted values are supported. No committed values for metrics. |
| - class: spark-dataset |
| l1: "Partially" |
| l2: All metric types are supported in batch mode. |
| l3: Only attempted values are supported. No committed values for metrics. |
| - class: samza |
| l1: "Partially" |
| l2: Counter and Gauge are supported. |
| l3: Only attempted values are supported. No committed values for metrics. |
| - class: nemo |
| l1: "No" |
| l2: not implemented |
| l3: "" |
| - class: jet |
| l1: "Partially" |
            l2: All metric types are supported, in both batch and streaming mode.
| l3: Doesn't differentiate between committed and attempted values. |
| - class: twister2 |
| l1: "No" |
| l2: not implemented |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - name: Stateful Processing |
| description: Allows fine-grained access to per-key, per-window persistent state. Necessary for certain use cases (e.g. high-volume windows which store large amounts of data, but typically only access small portions of it; complex state machines; etc.) that are not easily or efficiently addressed via <tt>Combine</tt> or <tt>GroupByKey</tt>+<tt>ParDo</tt>. |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: non-merging windows |
| l3: State is supported for non-merging windows. SetState and MapState are not yet supported. |
| - class: flink |
| l1: "Partially" |
| l2: non-merging windows |
| l3: State is supported for non-merging windows. SetState and MapState are not yet supported. |
| - class: spark-rdd |
| l1: "Partially" |
| l2: full support in batch mode |
| l3: |
| - class: spark-dataset |
| l1: "No" |
| l2: not implemented |
| l3: |
| - class: samza |
| l1: "Partially" |
| l2: non-merging windows |
            l3: "State is backed by either a RocksDB KV store or an in-memory hash map, and persisted using a changelog."
| - class: nemo |
| l1: "No" |
| l2: not implemented |
| l3: "" |
| - class: jet |
| l1: "Partially" |
| l2: non-merging windows |
| l3: "" |
| - class: twister2 |
| l1: "No" |
| l2: not implemented |
| l3: "" |
| - class: python direct |
| l1: "" |
| l2: |
| l3: "" |
| - class: go direct |
| l1: "" |
| l2: |
| l3: "" |
| - description: Bounded Splittable DoFn Support Status |
| anchor: what |
| color-y: "fff" |
| color-yb: "f6f6f6" |
| color-p: "f9f9f9" |
| color-pb: "d8d8d8" |
| color-n: "e1e0e0" |
| color-nb: "bcbcbc" |
| rows: |
| - name: Base |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: Only Dataflow Runner V2 supports this. |
| l3: "" |
| - class: flink |
| l1: "Partially" |
| l2: Only portable Flink Runner supports this. |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "Yes" |
| l2: |
| l3: |
| - class: go direct |
| l1: "Yes" |
| l2: |
| l3: |
| - name: Side Inputs |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: Only Dataflow Runner V2 supports this. |
| l3: "" |
| - class: flink |
| l1: "Partially" |
| l2: Only portable Flink Runner supports this. |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: |
| l2: |
| l3: |
| - class: go direct |
| l1: "Yes" |
| l2: |
| l3: |
| - name: Splittable DoFn Initiated Checkpointing |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Partially" |
            l2: Only Dataflow Runner V2 supports this.
| l3: "" |
| - class: flink |
| l1: "Partially" |
| l2: Only portable Flink Runner supports this. |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "Yes" |
| l2: |
| l3: |
| - class: go direct |
| l1: "No" |
| l2: |
| l3: |
| - name: Dynamic Splitting |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: Only Dataflow Runner V2 supports this. |
| l3: "" |
| - class: flink |
| l1: "No" |
| l2: |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "Yes" |
| l2: Only with Python SDK |
| l3: |
| - class: go direct |
| l1: "No" |
| l2: |
| l3: |
| - name: Bundle Finalization |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: Only Dataflow Runner V2 supports this. |
| l3: "" |
| - class: flink |
| l1: "No" |
| l2: |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "Yes" |
| l2: |
| l3: |
| - class: go direct |
| l1: "No" |
| l2: |
| l3: |
| - description: Unbounded Splittable DoFn Support Status |
| anchor: what |
| color-y: "fff" |
| color-yb: "f6f6f6" |
| color-p: "f9f9f9" |
| color-pb: "d8d8d8" |
| color-n: "e1e0e0" |
| color-nb: "bcbcbc" |
| rows: |
| - name: Base |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "Yes" |
| l2: |
| l3: |
| - class: go direct |
| l1: "No" |
| l2: |
| l3: |
| - name: Side Inputs |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: Only Dataflow Runner V2 supports this. |
| l3: "" |
| - class: flink |
| l1: "Partially" |
| l2: Only portable Flink Runner supports this. |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: |
| l2: |
| l3: |
| - class: go direct |
| l1: "Yes" |
| l2: |
| l3: |
| - name: Splittable DoFn Initiated Checkpointing |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "Yes" |
| l2: |
| l3: |
| - class: go direct |
| l1: "No" |
| l2: |
| l3: |
| - name: Dynamic Splitting |
| description: "" |
| values: |
| - class: dataflow |
| l1: "No" |
| l2: |
| l3: "" |
| - class: flink |
| l1: "No" |
| l2: |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "No" |
| l2: |
| l3: |
| - class: go direct |
| l1: "No" |
| l2: |
| l3: |
| - name: Bundle Finalization |
| description: "" |
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: Only Dataflow Runner V2 supports this. |
| l3: "" |
| - class: flink |
| l1: "Partially" |
| l2: Only portable Flink Runner supports this with checkpointing enabled. |
| l3: "" |
| - class: spark-rdd |
| l1: |
| l2: |
| l3: "" |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: "" |
| - class: samza |
| l1: |
| l2: |
| l3: "" |
| - class: nemo |
| l1: |
| l2: |
| l3: "" |
| - class: jet |
| l1: |
| l2: |
| l3: "" |
| - class: twister2 |
| l1: |
| l2: |
| l3: "" |
| - class: python direct |
| l1: "Yes" |
| l2: |
| l3: |
| - class: go direct |
| l1: "No" |
| l2: |
| l3: |
| - description: Where in event time? |
| anchor: where |
| color-y: "fff" |
| color-yb: "f6f6f6" |
| color-p: "f9f9f9" |
| color-pb: "d8d8d8" |
| color-n: "e1e0e0" |
| color-nb: "bcbcbc" |
| rows: |
| - name: Global windows |
| description: The default window which covers all of time. (Basically how traditional batch cases fit in the model.) |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: default |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - name: Fixed windows |
        description: Fixed-size, timestamp-based windows. (Hourly, Daily, etc.)
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: built-in |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - name: Sliding windows |
        description: Possibly overlapping, fixed-size, timestamp-based windows. (Every minute, use the last ten minutes of data.)
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: built-in |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - name: Session windows |
| description: Based on bursts of activity separated by a gap size. Different per key. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: built-in |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - name: Custom windows |
| description: All windows must implement <tt>BoundedWindow</tt>, which specifies a max timestamp. Each <tt>WindowFn</tt> assigns elements to an associated window. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - name: Custom merging windows |
| description: A custom <tt>WindowFn</tt> additionally specifies whether and how to merge windows. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - name: Timestamp control |
| description: For a grouping transform, such as GBK or Combine, an OutputTimeFn specifies (1) how to combine input timestamps within a window and (2) how to merge aggregated timestamps when windows merge. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: supported |
| l3: "" |
| |
| - description: When in processing time? |
| anchor: when |
| color-y: "fff" |
| color-yb: "f6f6f6" |
| color-p: "f9f9f9" |
| color-pb: "d8d8d8" |
| color-n: "e1e0e0" |
| color-nb: "bcbcbc" |
| rows: |
| - name: Configurable triggering |
| description: Triggering may be specified by the user (instead of simply driven by hardcoded defaults). |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: Fully supported in streaming mode. In batch mode, intermediate trigger firings are effectively meaningless. |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| |
| - name: Event-time triggers |
| description: Triggers that fire in response to event-time completeness signals, such as watermarks progressing. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: yes in streaming, fixed granularity in batch |
| l3: Fully supported in streaming mode. In batch mode, currently watermark progress jumps from the beginning of time to the end of time once the input has been fully consumed, thus no additional triggering granularity is available. |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| |
| - name: Processing-time triggers |
| description: Triggers that fire in response to processing-time advancing. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: yes in streaming, fixed granularity in batch |
| l3: Fully supported in streaming mode. In batch mode, from the perspective of triggers, processing time currently jumps from the beginning of time to the end of time once the input has been fully consumed, thus no additional triggering granularity is available. |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: "This is Spark streaming's native model" |
            l3: "Spark processes streams in micro-batches. The micro-batch size is actually a preset, fixed time interval. Currently, the runner takes the first window size in the pipeline and sets its size as the batch interval. Any following window operations will be considered processing-time windows and will affect triggering."
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| |
| - name: Count triggers |
| description: Triggers that fire after seeing at least N elements. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: Fully supported in streaming mode. In batch mode, elements are processed in the largest bundles possible, so count-based triggers are effectively meaningless. |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| |
| - name: Composite triggers |
| description: Triggers which compose other triggers in more complex structures, such as logical AND, logical OR, early/on-time/late, etc. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Partially" |
| l2: |
| l3: "" |
| |
| - name: Allowed lateness |
        description: A way to bound the useful lifetime of a window (in event time), after which any unemitted results may be materialized, the window contents may be garbage collected, and any additional late data that arrive for the window may be discarded.
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: Fully supported in streaming mode. In batch mode no data is ever late. |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "No" |
| l2: "" |
| l3: "" |
| - class: spark-dataset |
| l1: "No" |
| l2: no streaming support in the runner |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Partially" |
| l2: |
| l3: "" |
| |
| - name: Timers |
        description: A fine-grained mechanism for performing work at some point in the future, in either the event-time or processing-time domain. Useful for orchestrating delayed events, timeouts, etc., in complex per-key, per-window state machines.
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: non-merging windows |
| l3: Dataflow supports timers in non-merging windows. |
| - class: flink |
| l1: "Partially" |
| l2: non-merging windows |
| l3: The Flink Runner supports timers in non-merging windows. |
| - class: spark-rdd |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: spark-dataset |
| l1: "No" |
| l2: not implemented |
| l3: "" |
| - class: samza |
| l1: "Partially" |
| l2: non-merging windows |
| l3: The Samza Runner supports timers in non-merging windows. |
| - class: nemo |
| l1: "No" |
| l2: not implemented |
| l3: "" |
| - class: jet |
| l1: "Partially" |
| l2: non-merging windows |
| l3: "" |
| - class: twister2 |
| l1: "Partially" |
| l2: |
| l3: "" |
| |
| - description: How do refinements relate? |
| anchor: how |
| color-y: "fff" |
| color-yb: "f6f6f6" |
| color-p: "f9f9f9" |
| color-pb: "d8d8d8" |
| color-n: "e1e0e0" |
| color-nb: "bcbcbc" |
| rows: |
| - name: Discarding |
| description: Elements are discarded from accumulated state as their pane is fired. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "Yes" |
| l2: fully supported |
| l3: "Spark streaming natively discards elements after firing." |
| - class: spark-dataset |
| l1: "Partially" |
| l2: fully supported in batch mode |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| |
| - name: Accumulating |
| description: Elements are accumulated in state across multiple pane firings for the same window. |
| values: |
| - class: dataflow |
| l1: "Yes" |
| l2: fully supported |
| l3: Requires that the accumulated pane fits in memory, after being passed through the combiner (if relevant) |
| - class: flink |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: spark-rdd |
| l1: "No" |
| l2: "" |
| l3: "" |
| - class: spark-dataset |
| l1: "No" |
| l2: "" |
| l3: "" |
| - class: samza |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: nemo |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: jet |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| - class: twister2 |
| l1: "Yes" |
| l2: fully supported |
| l3: "" |
| |
| - description: Additional common features not yet part of the Beam model |
| anchor: misc |
| color-y: "fff" |
| color-yb: "f6f6f6" |
| color-p: "f9f9f9" |
| color-pb: "d8d8d8" |
| color-n: "e1e0e0" |
| color-nb: "bcbcbc" |
| rows: |
| - name: Drain |
        description: APIs and semantics for draining a pipeline are under discussion. This would cause incomplete aggregations to be emitted regardless of trigger, tagged with metadata indicating that they are incomplete.
| values: |
| - class: dataflow |
| l1: "Partially" |
| l2: |
            l3: Dataflow has a native drain operation, but it does not work in the presence of event-time timer loops. Final implementation is pending model support.
| - class: flink |
| l1: "Partially" |
| l2: |
            l3: Flink supports taking a "savepoint" of the pipeline and shutting the pipeline down once the savepoint completes.
| - class: spark-rdd |
| l1: |
| l2: |
| l3: |
| - class: spark-dataset |
| l1: |
| l2: |
| l3: |
| - class: samza |
| l1: |
| l2: |
| l3: |
| - class: nemo |
| l1: |
| l2: |
| l3: |
| - class: jet |
| l1: |
| l2: |
| l3: |
| - class: twister2 |
| l1: |
| l2: |
| l3: |
| - name: Checkpoint |
| description: APIs and semantics for saving a pipeline checkpoint are under discussion. This would be a runner-specific materialization of the pipeline state required to resume or duplicate the pipeline. |
| values: |
| - class: dataflow |
| l1: "No" |
| l2: |
| l3: |
| - class: flink |
| l1: "Partially" |
| l2: |
| l3: Flink has a native savepoint capability. |
| - class: spark-rdd |
| l1: "Partially" |
| l2: |
| l3: Spark has a native savepoint capability. |
| - class: spark-dataset |
| l1: "No" |
| l2: |
| l3: not implemented |
| - class: samza |
| l1: "Partially" |
| l2: |
| l3: Samza has a native checkpoint capability. |
| - class: nemo |
| l1: |
| l2: |
| l3: |
| - class: jet |
| l1: |
| l2: |
| l3: |
| - class: twister2 |
| l1: |
| l2: |
| l3: |