blob: 57038189e6730a854a30bab3a39b9ee9f2377a4a [file] [log] [blame]
.. ################################################################################
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
################################################################################
==========
DataStream
==========
DataStream
----------
A DataStream represents a stream of elements of the same type.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
DataStream.get_name
DataStream.name
DataStream.uid
DataStream.set_uid_hash
DataStream.set_parallelism
DataStream.set_max_parallelism
DataStream.get_type
DataStream.get_execution_environment
DataStream.force_non_parallel
DataStream.set_buffer_timeout
DataStream.start_new_chain
DataStream.disable_chaining
DataStream.slot_sharing_group
DataStream.set_description
DataStream.map
DataStream.flat_map
DataStream.key_by
DataStream.filter
DataStream.window_all
DataStream.union
DataStream.connect
DataStream.shuffle
DataStream.project
DataStream.rescale
DataStream.rebalance
DataStream.forward
DataStream.broadcast
DataStream.process
DataStream.assign_timestamps_and_watermarks
DataStream.partition_custom
DataStream.add_sink
DataStream.sink_to
DataStream.execute_and_collect
DataStream.print
DataStream.get_side_output
DataStream.cache
DataStreamSink
--------------
A Stream Sink. This is used for emitting elements from a streaming topology.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
DataStreamSink.name
DataStreamSink.uid
DataStreamSink.set_uid_hash
DataStreamSink.set_parallelism
DataStreamSink.set_description
DataStreamSink.disable_chaining
DataStreamSink.slot_sharing_group
KeyedStream
-----------
A Stream Sink. This is used for emitting elements from a streaming topology.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
KeyedStream.map
KeyedStream.flat_map
KeyedStream.reduce
KeyedStream.filter
KeyedStream.sum
KeyedStream.min
KeyedStream.max
KeyedStream.min_by
KeyedStream.max_by
KeyedStream.add_sink
KeyedStream.key_by
KeyedStream.process
KeyedStream.window
KeyedStream.count_window
KeyedStream.union
KeyedStream.connect
KeyedStream.partition_custom
KeyedStream.print
CachedDataStream
----------------
CachedDataStream represents a DataStream whose intermediate result will be cached at the first
time when it is computed. And the cached intermediate result can be used in later job that using
the same CachedDataStream to avoid re-computing the intermediate result.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
CachedDataStream.get_type
CachedDataStream.get_execution_environment
CachedDataStream.set_description
CachedDataStream.map
CachedDataStream.flat_map
CachedDataStream.key_by
CachedDataStream.filter
CachedDataStream.window_all
CachedDataStream.union
CachedDataStream.connect
CachedDataStream.shuffle
CachedDataStream.project
CachedDataStream.rescale
CachedDataStream.rebalance
CachedDataStream.forward
CachedDataStream.broadcast
CachedDataStream.process
CachedDataStream.assign_timestamps_and_watermarks
CachedDataStream.partition_custom
CachedDataStream.add_sink
CachedDataStream.sink_to
CachedDataStream.execute_and_collect
CachedDataStream.print
CachedDataStream.get_side_output
CachedDataStream.cache
CachedDataStream.invalidate
WindowedStream
--------------
A WindowedStream represents a data stream where elements are grouped by key, and for each
key, the stream of elements is split into windows based on a WindowAssigner. Window emission
is triggered based on a Trigger.
The windows are conceptually evaluated for each key individually, meaning windows can trigger
at different points for each key.
Note that the WindowedStream is purely an API construct, during runtime the WindowedStream will
be collapsed together with the KeyedStream and the operation over the window into one single
operation.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
WindowedStream.get_execution_environment
WindowedStream.get_input_type
WindowedStream.trigger
WindowedStream.allowed_lateness
WindowedStream.side_output_late_data
WindowedStream.reduce
WindowedStream.aggregate
WindowedStream.apply
WindowedStream.process
AllWindowedStream
-----------------
A AllWindowedStream represents a data stream where the stream of elements is split into windows
based on a WindowAssigner. Window emission is triggered based on a Trigger.
If an Evictor is specified it will be used to evict elements from the window after evaluation
was triggered by the Trigger but before the actual evaluation of the window.
When using an evictor, window performance will degrade significantly, since pre-aggregation of
window results cannot be used.
Note that the AllWindowedStream is purely an API construct, during runtime the AllWindowedStream
will be collapsed together with the operation over the window into one single operation.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
AllWindowedStream.get_execution_environment
AllWindowedStream.get_input_type
AllWindowedStream.trigger
AllWindowedStream.allowed_lateness
AllWindowedStream.side_output_late_data
AllWindowedStream.reduce
AllWindowedStream.aggregate
AllWindowedStream.apply
AllWindowedStream.process
ConnectedStreams
----------------
ConnectedStreams represent two connected streams of (possibly) different data types.
Connected streams are useful for cases where operations on one stream directly
affect the operations on the other stream, usually via shared state between the streams.
An example for the use of connected streams would be to apply rules that change over time
onto another stream. One of the connected streams has the rules, the other stream the
elements to apply the rules to. The operation on the connected stream maintains the
current set of rules in the state. It may receive either a rule update and update the state
or a data element and apply the rules in the state to the element.
The connected stream can be conceptually viewed as a union stream of an Either type, that
holds either the first stream's type or the second stream's type.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
ConnectedStreams.key_by
ConnectedStreams.map
ConnectedStreams.flat_map
ConnectedStreams.process
BroadcastStream
---------------
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
BroadcastStream
BroadcastConnectedStream
------------------------
A BroadcastConnectedStream represents the result of connecting a keyed or non-keyed stream, with
a :class:`BroadcastStream` with :class:`~state.BroadcastState` (s). As in the case of
:class:`ConnectedStreams` these streams are useful for cases where operations on one stream
directly affect the operations on the other stream, usually via shared state between the
streams.
An example for the use of such connected streams would be to apply rules that change over time
onto another, possibly keyed stream. The stream with the broadcast state has the rules, and will
store them in the broadcast state, while the other stream will contain the elements to apply the
rules to. By broadcasting the rules, these will be available in all parallel instances, and can
be applied to all partitions of the other stream.
.. currentmodule:: pyflink.datastream.data_stream
.. autosummary::
:toctree: api/
BroadcastConnectedStream.process