Merico Analysis Engine (AE)

THIS PLUGIN IS ONLY FOR MERICO EMPLOYEES AT THIS TIME. SOON IT WILL BE MADE PUBLIC.

Important notes

Some data looks like it is missing...

The commit data is stored in Trino. If the underlying files are too old, they may be deleted over time by the MinIO expiration strategy.

How do I trigger analysis on my project?

Just add DevLake to the Merico Enterprise Edition and trigger an analysis. You can find this item by searching for "ae staging". You can also log in to the AE staging server and restart an analysis of DevLake. (Login credentials for Merico employees are stored in 1Password.)

Who controls the API for the Merico Analysis Engine?

Jingyang Liang and the Merico AE team

How do I authenticate and why?

The AE API uses a scheme that borrows ideas from HTTP MAC authorization.

The scheme prevents replay attacks while keeping the API server cache from blowing up. It is required because AE will most likely be deployed without HTTPS, so replay attacks are expected. Keep in mind that it does not prevent eavesdropping attacks, which can only be solved by HTTPS (in a RESTful API context).

nonce: a random string that identifies a unique request. Any API request whose nonce already exists in the API server cache is rejected. A nonce alone can prevent replay attacks, but it would eventually blow up the API server cache.
timestamp: keeps the API server cache from filling up with nonce strings. Any API request whose timestamp does not match the server's current time (within a small tolerance, such as 3 minutes) is rejected immediately, so the API server can safely remove expired nonces (normally after double the tolerance period). Why not use a timestamp only? The resolution of a Unix timestamp is 1 second, so without a nonce the API server would be throttled to 1 request per second.
secret_key: a symmetric secret shared between server and client, used to authenticate API requests. The API client signs each request with this key, and the server verifies the signature with the same key. The key should be generated by the API server and transferred to the client over a secure channel. Never send this key in any request.
app_id: identifies the client when the API server has multiple clients, so the server knows which secret_key to use. It should not have any mathematical link to its secret_key.
sign: the request signature. Only a client that holds the correct app_id and secret_key pair can generate a correct signature.

In short, nonce gives each request an identity, timestamp lets the server evict expired nonces, and sign together with app_id provides authentication.
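
The scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AE implementation: this document does not specify how the string to be signed is built or which hash is used, so the sorted key=value canonical form and HMAC-SHA256 below are assumptions, as are the names compute_sign, sign_request, and Verifier.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

TOLERANCE_SECONDS = 180  # "small period of tolerance, like 3 minutes"


def compute_sign(params: dict, secret_key: str) -> str:
    """Sign the sorted key=value pairs with the shared secret (assumed format)."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hmac.new(secret_key.encode(), canonical.encode(), hashlib.sha256).hexdigest()


def sign_request(params: dict, app_id: str, secret_key: str,
                 now: Optional[float] = None) -> dict:
    """Client side: attach app_id, nonce, timestamp, and sign.

    The secret_key is used for signing only and is never part of the request.
    """
    signed = dict(params)
    signed["app_id"] = app_id
    signed["nonce"] = secrets.token_hex(16)
    signed["timestamp"] = int(time.time() if now is None else now)
    signed["sign"] = compute_sign(signed, secret_key)
    return signed


class Verifier:
    """Server side: check timestamp, nonce, and signature, in that order."""

    def __init__(self, secrets_by_app: dict):
        self.secrets_by_app = secrets_by_app  # app_id -> secret_key
        self.seen = {}  # nonce -> time at which it was accepted

    def verify(self, request: dict, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Nonces older than double the tolerance can no longer be replayed.
        self.seen = {n: t for n, t in self.seen.items()
                     if now - t <= 2 * TOLERANCE_SECONDS}
        secret_key = self.secrets_by_app.get(request.get("app_id"))
        timestamp = request.get("timestamp")
        nonce = request.get("nonce")
        if secret_key is None or timestamp is None or nonce is None:
            return False
        if abs(now - timestamp) > TOLERANCE_SECONDS:
            return False  # stale or far-future request
        if nonce in self.seen:
            return False  # replayed request
        unsigned = {k: v for k, v in request.items() if k != "sign"}
        expected = compute_sign(unsigned, secret_key)
        if not hmac.compare_digest(expected.encode(), str(request.get("sign", "")).encode()):
            return False
        # Record the nonce only after the signature checks out, so that
        # unauthenticated requests cannot fill the cache.
        self.seen[nonce] = now
        return True
```

A client would call sign_request(params, app_id, secret_key) and send the resulting fields with the request. The server keeps one Verifier and rejects any request for which verify returns False.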

Data Gathered

Projects

[
  {
    "id": 0,
    "git_url": "string",
    "priority": 0,
    "create_time": "2021-11-23T17:28:10.286Z",
    "update_time": "2021-11-23T17:28:10.286Z"
  }
]

Commits

[
  {
    "hexsha": "string",
    "analysis_id": "string",
    "author_email": "string",
    "dev_eq": 0
  }
]

The most valuable data here is dev_eq, a Merico-owned measurement of code value.
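
As a rough illustration of how the commit records might be used, the sketch below sums dev_eq per author_email. The helper name dev_eq_by_author and the sample values are hypothetical. Only the field names come from the commit schema above.

```python
from collections import defaultdict


def dev_eq_by_author(commits):
    """Sum dev_eq per author_email over a list of commit records."""
    totals = defaultdict(int)
    for commit in commits:
        totals[commit["author_email"]] += commit["dev_eq"]
    return dict(totals)


# Hypothetical sample data shaped like the Commits payload.
sample_commits = [
    {"hexsha": "a1", "analysis_id": "x", "author_email": "dev1@example.com", "dev_eq": 10},
    {"hexsha": "b2", "analysis_id": "x", "author_email": "dev2@example.com", "dev_eq": 4},
    {"hexsha": "c3", "analysis_id": "x", "author_email": "dev1@example.com", "dev_eq": 6},
]
totals = dev_eq_by_author(sample_commits)
```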

Configuration

You will need to set the following settings in order to run this plugin.

These can be set in your .env file as:

AE_APP_ID=xxx
AE_SECRET_KEY=xxx
AE_ENDPOINT=xxx
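
If you want to check that all three settings are present before running anything, a small helper along these lines can do it. This is a hypothetical sketch, not how the plugin itself reads its configuration, and it assumes the values from .env have already been exported into the process environment. The name load_ae_config is made up for illustration.

```python
import os

REQUIRED_KEYS = ("AE_APP_ID", "AE_SECRET_KEY", "AE_ENDPOINT")


def load_ae_config(env=None):
    """Return the three AE settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("missing AE settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```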

TBD: How do non-Merico users get these keys?

Gathering Data with AE

To collect data on a single project, make a POST request to /pipelines:

curl --location --request POST 'localhost:8080/pipelines' \
--header 'Content-Type: application/json' \
--data-raw '
{
    "name": "ae 20211201",
    "tasks": [[{
        "plugin": "ae",
        "options": {
            "projectId": <Your project id>
        }
    }]]
}
'
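
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl command above. The function names are made up for illustration, and the base URL and pipeline name defaults are simply the values from that command.

```python
import json
import urllib.request


def build_pipeline_request(project_id, base_url="http://localhost:8080", name="ae 20211201"):
    """Build the POST /pipelines request for a single AE project."""
    payload = {
        "name": name,
        "tasks": [[{"plugin": "ae", "options": {"projectId": project_id}}]],
    }
    return urllib.request.Request(
        base_url + "/pipelines",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_ae_pipeline(project_id):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_pipeline_request(project_id)) as response:
        return response.read().decode("utf-8")
```

Calling trigger_ae_pipeline with your project id sends the request to the server on localhost:8080. The response format is not described here, so the sketch returns the body unparsed.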