Each function has a Fully Qualified Function Name (FQFN) with a specified tenant, namespace, and function name. With FQFN, you can create multiple functions in different namespaces with the same function name.
An FQFN looks like this:
tenant/namespace/name
Function instance is the core element of the function execution framework, consisting of the following elements:
The following figure illustrates the internal workflow of a function instance.
A function can have multiple instances, and each instance executes one copy of a function. You can specify the number of instances in the configuration file.
The consumers inside a function instance use FQFN as subscriber names to enable load balancing between multiple instances based on subscription types. The subscription type can be specified at the function level.
Each function has a separate state store with FQFN. You can specify a state interface to persist intermediate results in the BookKeeper. Other users can query the state of the function and extract these results.
Function worker is a logic component to monitor, orchestrate, and execute individual function in cluster-mode deployment of Pulsar Functions.
Within function workers, each function instance can be executed as a thread or process, depending on the selected configurations. Alternatively, if a Kubernetes cluster is available, functions can be spawned as StatefulSets within Kubernetes. See Set up function workers for more details.
The following figure illustrates the internal architecture and workflow of function workers.
Function workers form a cluster of worker nodes and the workflow is described as follows.
A function instance is invoked inside a runtime, and a number of instances can run in parallel. Pulsar supports three types of function runtime with different costs and isolation guarantees to maximize deployment flexibility. You can use one of them to run functions based on your needs. See Configure function runtime for more details.
The following table outlines the three types of function runtime.
Type | Description |
---|---|
Thread runtime | Each instance runs as a thread. Since the code for thread mode is written in Java, it is only applicable to Java instances. When a function runs in thread mode, it runs on the same Java virtual machine (JVM) with a function worker. |
Process runtime | Each instance runs as a process. When a function runs in process mode, it runs on the same machine that the function worker runs. |
Kubernetes runtime | Function is submitted as Kubernetes StatefulSet by workers and each function instance runs as a pod. Pulsar supports adding labels to the Kubernetes StatefulSets and services while launching functions, which facilitates selecting the target Kubernetes objects. |
Pulsar provides three different messaging delivery semantics that you can apply to a function. Different delivery semantic implementations are determined according to the ack time node.
Delivery semantics | Description | Adopted subscription type |
---|---|---|
At-most-once delivery | Each message sent to a function is processed at its best effort. There’s no guarantee that the message will be processed or not. When setting At-most-once, the autoAck configuration must be equal to true, otherwise the startup will fail(autoAck configuration will be deprecated in future releases). Ack time node: Before function processing. | Shared |
At-least-once delivery (default) | Each message sent to the function can be processed more than once (in case of a processing failure or redelivery). If you create a function without specifying the --processing-guarantees flag, the function provides at-least-once delivery guarantee. Ack time node: After sending a message to output. | Shared |
Effectively-once delivery | Each message sent to the function can be processed more than once but it has only one output. Duplicated messages are ignored.Effectively once is achieved on top of at-least-once processing and guaranteed server-side deduplication. This means a state update can happen twice, but the same state update is only applied once, the other duplicated state update is discarded on the server-side. Ack time node: After sending a message to output. | Failover |
Manual delivery | Under this semantics, the user needs to call the method context.getCurrentRecord().ack() inside the function to manually perform the ack operation, and the framework will not help users to do any ack operations. Ack time node: User decides, in function method. | Shared |
:::tip
at-least-once
delivery guarantees. If you create a function without supplying a value for the --processingGuarantees
flag, the function provides at-least-once
guarantees.Exclusive
subscription type is not available in Pulsar Functions because:exclusive
equals failover
.exclusive
may crash and restart when functions restart. In this case, exclusive
does not equal failover
. Because when the master consumer disconnects, all non-acknowledged and subsequent messages are delivered to the next consumer in line.shared
to key_shared
, you can use the —retain-key-ordering
option in pulsar-admin
.:::
You can set the processing guarantees for a function when you create the function. The following command creates a function with effectively-once guarantees applied.
bin/pulsar-admin functions create \ --name my-effectively-once-function \ --processing-guarantees EFFECTIVELY_ONCE \ # Other function configs
You can change the processing guarantees applied to a function using the update
command.
bin/pulsar-admin functions update \ --processing-guarantees ATMOST_ONCE \ # Other function configs
Java, Python, and Go SDKs provide access to a context object that can be used by a function. This context object provides a wide variety of information and functionality to the function including:
:::tip
For more information about code examples, refer to Java, Python and Go.
:::
Pulsar Functions take byte arrays as inputs and spit out byte arrays as output. You can write typed functions and bind messages to types by using either of the following ways:
:::note
Currently, window function is only available in Java.
:::
Window function is a function that performs computation across a data window, that is, a finite subset of the event stream. As illustrated below, the stream is split into “buckets” where functions can be applied.
The definition of a data window for a function involves two policies:
Both trigger policy and eviction policy are driven by either time or count.
:::tip
Both processing time and event time are supported.
Delivery Semantic Guarantees.
MANUAL
and Effectively-once
delivery semantics.:::
Based on whether two adjacent windows can share common events or not, windows can be divided into the following two types:
Tumbling window assigns elements to a window of a specified time length or count. The eviction policy for tumbling windows is always based on the window being full. So you only need to specify the trigger policy, either count-based or time-based.
In a tumbling window with a count-based trigger policy, as illustrated in the following example, the trigger policy is set to 2. Each function is triggered and executed when two items are in the window, regardless of the time.
In contrast, as illustrated in the following example, the window length of the tumbling window is 10 seconds, which means the function is triggered when the 10-second time interval has elapsed, regardless of how many events are in the window.
The sliding window method defines a fixed window length by setting the eviction policy to limit the amount of data retained for processing and setting the trigger policy with a sliding interval. If the sliding interval is smaller than the window length, there is data overlapping, which means the data simultaneously falling into adjacent windows is used for computation more than once.
As illustrated in the following example, the window length is 2 seconds, which means that any data older than 2 seconds will be evicted and not used in the computation. The sliding interval is configured to be 1 second, which means that function is executed every second to process the data within the entire window length.