docs/github-user-guide-v0.10.0.md

Summary

GitHub limits authenticated clients to 5,000 REST API requests per hour. As a result, collecting commit data through the GitHub API can take hours for a repo with 10,000+ commits. To speed this up, DevLake introduces GitExtractor, a new plugin that collects git data by cloning the git repo instead of calling GitHub APIs.
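
The arithmetic behind this can be sketched roughly as follows. This is a back-of-the-envelope estimate, not a description of how DevLake schedules its requests: `calls_per_commit` is a hypothetical figure chosen for illustration, the hourly limit is passed in as a parameter, and the token multiplier assumes each token belongs to a different GitHub account with its own request budget.

```python
def estimate_hours(commits, calls_per_commit, calls_per_hour, tokens=1):
    """Estimate the hours needed to collect commits through a rate-limited API.

    Assumes requests are spread evenly and that every token has its own
    independent hourly budget (an assumption, see the text above).
    """
    if commits < 0 or calls_per_commit <= 0 or calls_per_hour <= 0 or tokens <= 0:
        raise ValueError("commits must be >= 0; other arguments must be positive")
    total_calls = commits * calls_per_commit
    return total_calls / (calls_per_hour * tokens)


# Hypothetical repo: 10,000 commits, 3 API calls per commit (assumed figure).
print(estimate_hours(10_000, 3, 5_000))            # 6.0
print(estimate_hours(10_000, 3, 5_000, tokens=3))  # 2.0
```

Cloning the repo avoids this budget entirely for commits and refs, which is why GitExtractor handles them.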

Starting from v0.10.0, DevLake collects GitHub data with 2 separate plugins:

GitHub plugin (via GitHub API): collects repos, issues, and pull requests
GitExtractor (via cloning repos): collects commits and refs

Note that the GitLab plugin still collects commits via the API by default, since GitLab has a much higher API rate limit.

This doc details the process of collecting GitHub data in v0.10.0. We're working on simplifying this process in upcoming releases.

Before you start, please make sure all services are running.

GitHub Data Collection Procedure

There are 3 steps, plus an optional fourth.

Configure GitHub connection
Create a pipeline to run GitHub plugin
Create a pipeline to run GitExtractor plugin
[Optional] Set up a recurring pipeline to keep data fresh

Step 1 - Configure GitHub connection

Visit config-ui at http://localhost:4000 and click the GitHub icon
Click the default connection ‘Github’ in the list
Configure the connection by providing your GitHub API endpoint URL and your personal access token(s).
Endpoint URL: Leave this unchanged if you're using github.com. Otherwise, replace it with your own GitHub instance's REST API endpoint URL. This URL should end with '/'.
Auth Token(s): Fill in your personal access token(s). For how to generate personal access tokens, please see GitHub's official documentation. You can provide multiple tokens to speed up data collection by separating them with commas.
GitHub Proxy URL: This is optional. Enter a valid proxy server address on your network, e.g. http://your-proxy-server.com:1080
Click 'Test Connection' and confirm that it works, then click 'Save Connection'.
[Optional] Help DevLake understand your GitHub data by customizing the data enrichment rules shown below.
1. Pull Request Enrichment Options
  1. Type: If a PR has a label that matches the given regular expression, its type property is set to the value of the first sub-match. For example, with Type set to type/(.*)$, a PR labeled type/bug gets the type bug, and a PR labeled type/doc gets the type doc.
  2. Component: Same as above, but for the component property.
2. Issue Enrichment Options
  1. Severity: Same as above, but for the issue's severity property.
  2. Component: Same as above, but for the issue's component property.
  3. Priority: Same as above, but for the issue's priority property.
  4. Requirement: If an issue has a label that matches the given regular expression, its type property is set to REQUIREMENT. Unlike the PR Type option, the sub-match is ignored. Issue management analysis tends to focus on 3 issue types (Requirement/Bug/Incident), but their concrete names vary from repo to repo and over time, so we standardize them to help analysts build general-purpose metrics.
  5. Bug: Same as above, with type set to BUG.
  6. Incident: Same as above, with type set to INCIDENT.
Click ‘Save Settings’
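
To make the enrichment rules concrete, here is a minimal sketch of how label matching of this kind could work. It is an illustration only, not DevLake's implementation: Python's `re` module stands in for whatever regex engine DevLake uses, the function names `pr_type` and `issue_type` are hypothetical, and the issue patterns in `ISSUE_RULES` are made-up examples that you would replace with your own.

```python
import re


def pr_type(labels, pattern=r"type/(.*)$"):
    """Return the first sub-match of the first label that matches, else None."""
    regex = re.compile(pattern)
    for label in labels:
        match = regex.search(label)
        if match:
            return match.group(1)
    return None


# Hypothetical patterns; the sub-match is ignored and only the fixed type is used.
ISSUE_RULES = {
    "REQUIREMENT": r"^(feat|feature|requirement)$",
    "BUG": r"^(bug|error)$",
    "INCIDENT": r"^(incident|outage)$",
}


def issue_type(labels, rules=ISSUE_RULES):
    """Return the standardized type of the first label that matches any rule."""
    for label in labels:
        for type_name, pattern in rules.items():
            if re.search(pattern, label):
                return type_name
    return None


print(pr_type(["priority/high", "type/bug"]))  # bug
print(pr_type(["type/doc"]))                   # doc
print(issue_type(["feature"]))                 # REQUIREMENT
print(issue_type(["outage", "bug"]))           # INCIDENT
```

The PR rule keeps whatever the capture group matched, while the issue rules collapse differently named labels into one of three fixed types, which is what makes metrics comparable across repos.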