IR is an abstraction that we use to express the logical notion of data processing applications and the underlying execution runtime behaviors on separate layers. It basically takes a form of directed acyclic graphs (DAGs), with which we can logically express dataflow programs. To express various different execution properties to fully exploit different deployment characteristics, we enable flexible annotations to the IR on a separate layer. On that layer, we can annotate specific execution properties related to the IR component.
Nemo IR is composed of vertices, which each represent a data-parallel operator that transforms data, and edges between them, which each represents the dependency of data flow between the vertices. Nemo IR supports four different types of IR vertices:
Each IR vertex and edge can be annotated to be able to express the different execution properties. For example, edges that the user wants to store intermediate data as local files can be annotated to use the ‘local file’ module for the ‘Data Store’ execution property. Execution properties that can be configured for IR vertices include parallelism, executor placement, stage number, and schedule group, as they are related to the computation itself. For IR edges, it includes data store, data flow model, data communication pattern, and partitioning, as they are used for expressing the behaviors regarding data transfer. By having an IR for expressing workloads and the related execution properties, it enables the optimization phase to be decoupled, making it easier to implement and plug in different optimizations for different deployment characteristics.