Last updated: May 2, 2022
The following terms are arranged in the order of their appearance in the actual user workflow.
A blueprint is the plan that covers all the work to get your raw data ready for query and metric computation in the dashboards. Creating a blueprint consists of three steps: adding data connections, setting the data scope, and adding transformation rules.
The relationship among Blueprint, Data Connections, Data Scope and Transformation Rules is illustrated in the Blueprint diagram below:
A data source is a specific DevOps tool from which you wish to sync your data, such as GitHub, GitLab, Jira and Jenkins.
DevLake normally uses one data plugin to pull data for a single data source. However, in some cases, DevLake uses multiple data plugins for one data source to improve sync speed, among other benefits. For instance, when you pull data from GitHub or GitLab, the Git Extractor plugin is used alongside the GitHub or GitLab plugin to pull data from the repositories. In this case, DevLake still refers to GitHub or GitLab as a single data source.
A data connection is a specific instance of a data source that stores information such as `endpoint` and `auth`. A single data source can have one or more data connections (e.g. two Jira instances). Currently, DevLake supports one data connection for GitHub, GitLab and Jenkins, and multiple connections for Jira.
You can set up a new data connection either during the first step of creating a blueprint, or on the Connections page that can be accessed from the navigation bar. Because a single data connection can be reused in multiple blueprints, you can update the information of a particular data connection in Connections to ensure all its associated blueprints will continue to run properly. For example, you may want to update the GitHub token in a data connection when it expires.
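As an illustration, the stored information for a Jira data connection might look like the following sketch (the field names and values are assumptions for illustration, not an exact schema):

```json
{
  "name": "jira-prod",
  "endpoint": "https://your-company.atlassian.net/rest/",
  "auth": "<base64-encoded email:api_token>"
}
```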
In a blueprint, each data connection has one or more sets of data scope configurations. The data scope selection includes GitHub or GitLab repositories, Jira boards and data entities (e.g. Issue Tracking and Source Code Management). The fields for data scope configuration vary across data sources.
As mentioned in the Blueprint diagram, the choice between one and multiple sets of data scope depends on whether you want to configure a unified set of transformation rules for the entire data collection, or you need distinct sets of transformation rules for different parts of the data collection (e.g. if you use different sets of labels across multiple GitHub repositories).
To learn more about the default data scope of all data sources and data plugins, please refer to Data Support.
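As a sketch, one set of data scope for a GitHub connection could combine a repository with the data entities to collect (the field names here are illustrative assumptions, not the exact Configuration UI schema):

```json
{
  "repo": "apache/incubator-devlake",
  "entities": ["ISSUE_TRACKING", "SOURCE_CODE_MANAGEMENT", "CODE_REVIEW"]
}
```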
Data entities refer to the domain entities of the five domains: Issue Tracking, Source Code Management, Code Review, CI/CD and Cross-Domain.
For instance, if you wish to pull Source Code Management data from GitHub and Issue Tracking data from Jira, you can check the corresponding data entities when setting the data scope of these two data connections.
Although data entities and domain entities can technically be used interchangeably, data entities is the term typically used in the DevLake Configuration UI to reduce the learning curve of the DevLake data models. For detailed definitions of all data entities/domain entities, please refer to Domain Layer Schema.
Transformation rules are a collection of methods that allow you to customize how DevLake normalizes raw data for query and metric computation. Each set of data scope is strictly accompanied by one set of transformation rules.
DevLake uses these normalized values to power advanced dashboards, such as the Weekly Bug Retro dashboard. Although configuring transformation rules is not mandatory, if you leave the rules blank or configure them incorrectly, only the basic dashboards (e.g. GitHub Basic Metrics) will display as expected, while the advanced dashboards will not.
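For example, GitHub transformation rules can map issue labels to the standard issue types via regular expressions. A sketch, assuming rule keys along these lines (check your DevLake version for the exact rule names):

```json
{
  "issueTypeBug": "^(bug|broken)$",
  "issueTypeRequirement": "^(feat|feature|proposal)$",
  "issueTypeIncident": "^(incident|failure)$"
}
```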
A historical run of a blueprint is an actual execution of the data collection and transformation tasks defined in the blueprint at its creation. A list of historical runs of a blueprint is the entire running history of that blueprint, whether executed automatically or manually. Historical runs can be triggered in three ways:

- Automatically, according to the sync frequency set in the blueprint
- Manually, from the Configuration UI
- By calling the `/pipelines` endpoint of the DevLake API manually

However, the name Historical Runs is only used in the Configuration UI. In the DevLake API, they are called pipelines.
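For the API route, a run is started by POSTing a pipeline to the `/pipelines` endpoint. A minimal sketch of the request body, assuming a single-stage plan (the `plan` format is the Advanced Mode JSON described later on this page, and the plugin options shown are illustrative):

```json
{
  "name": "github collection 2022-05-02",
  "plan": [
    [
      { "plugin": "github", "options": { "owner": "apache", "repo": "incubator-devlake" } }
    ]
  ]
}
```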
The following terms do not appear in the Regular Mode of the Configuration UI for the sake of simplicity, but they can be very useful if you want to learn about the underlying framework of DevLake, or use the Advanced Mode and the DevLake API.
A data plugin is a specific module that syncs or transforms data. There are two types of data plugins: Data Collection Plugins and Data Transformation Plugins.
Data Collection Plugins pull data from one or more data sources. DevLake supports 8 data plugins in this category: `ae`, `feishu`, `gitextractor`, `github`, `gitlab`, `jenkins`, `jira` and `tapd`.
Data Transformation Plugins transform the data pulled by other Data Collection Plugins. `refdiff` is currently the only plugin in this category.
Although the names of the data plugins are not displayed in the regular mode of the DevLake Configuration UI, they can be used directly in JSON in the Advanced Mode.
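For instance, a minimal Advanced Mode plan that runs two collection plugins in a single stage might look like the following sketch (the option fields are illustrative assumptions and differ per plugin):

```json
[
  [
    { "plugin": "github", "options": { "owner": "apache", "repo": "incubator-devlake" } },
    { "plugin": "jira", "options": { "connectionId": 1, "boardId": 8 } }
  ]
]
```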
For detailed information about the relationship between data sources and data plugins, please refer to Data Support.
A pipeline is an orchestration of data collection, extraction, conversion and enrichment tasks, defined in the DevLake API. A pipeline is composed of one or multiple stages that are executed in sequential order. Any error occurring during the execution of any stage, task or subtask will cause the pipeline to fail immediately.
The composition of a pipeline is explained below. Notice: you can manually orchestrate the pipeline in the Configuration UI Advanced Mode and the DevLake API, whereas in the Configuration UI regular mode, an optimized pipeline orchestration will be automatically generated for you.
A stage is a collection of tasks performed by data plugins. Stages are executed in sequential order in a pipeline.
A task is a collection of subtasks that perform any of the collection, extraction, conversion and enrichment jobs of a particular data plugin. Tasks are executed in parallel in any stage, as shown in the sketch below.
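To make the ordering concrete, here is a sketch of a two-stage plan in Advanced Mode JSON (the plugin options are illustrative assumptions): the two tasks in the first stage run in parallel, and the second stage starts only after the first stage completes.

```json
[
  [
    { "plugin": "github", "options": { "owner": "apache", "repo": "incubator-devlake" } },
    { "plugin": "jira", "options": { "connectionId": 1, "boardId": 8 } }
  ],
  [
    { "plugin": "gitextractor", "options": { "url": "https://github.com/apache/incubator-devlake.git" } }
  ]
]
```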
A subtask is the minimal work unit in a pipeline that performs one of four roles: Collectors, Extractors, Converters and Enrichers. Subtasks are executed in sequential order.
- Collectors: collect raw data from data sources, normally via the DevLake API, and store it in the raw data tables
- Extractors: extract data from the raw data tables to the tool layer tables
- Converters: convert data from the tool layer tables into the domain layer tables
- Enrichers: enrich data from one domain to other domains. For instance, the Fourier Transformation can examine `issue_changelog` to show the time distribution of an issue across its assignees.