
s2graph

s2graph is a graph database that stores big data as edges and vertices and serves REST APIs for querying them. It provides a fully asynchronous, non-blocking API. This document defines the terms and concepts used in s2graph and describes its REST API.


Getting Started

S2Graph consists of multiple projects.

  1. S2Core: core library of common classes for storing and retrieving data as edges/vertices.
  2. root project: Play REST server that provides the REST APIs.
  3. spark: Spark-related common classes.
  4. loader: Spark jobs that consume events from Kafka and write them to HBase using the S2Core library. Also contains a migration kit from HDFS to s2graph.
  5. asynchbase: a fork of https://github.com/OpenTSDB/asynchbase. We added a few features to GetRequest, all of which rely heavily on pull requests that have not yet been merged into the original project:
     1. rpcTimeout
     2. setFilter
     3. column pagination
     4. retryAttempCount
     5. timestamp filtering

To get up and running, the following are required.

  1. Apache HBase setup.
     1. brew install hadoop and brew install hbase if you are on a Mac.
     2. Otherwise, check the reference for how to set up HBase.
     3. Note that we currently support the latest stable version of Apache HBase, 1.0.1, with Apache Hadoop 2.7.0. If you are using CDH, you can check out our feature/cdh5.3.0 branch. We are working on providing profiles for HBase/Hadoop versions soon.
  2. MySQL setup.
     1. First create a new user for s2graph in MySQL.
     2. Create a database and grant all privileges on it to this user.
     3. Run s2core/migrate/mysql/schema.sql on the created database.
     4. Set the MySQL connection info in conf/reference.conf: db.default.[url, user, password]
     5. Because of a license issue, we are working on changing this to Derby.
  3. Install protobuf.
     1. asynchbase requires protoc, so you should install protobuf.
     2. brew install protobuf if you are on a Mac.
     3. Otherwise install 2.6.1 (the feature/cdh5.3.0 branch expects protobuf 2.5.0).

Once all requirements are set up correctly, install asynchbase locally first.

cd asynchbase; make pom.xml; mvn install 

Then compile the REST project.

sbt compile

Now you can run s2graph.

sbt run

We provide a simple script, script/test.sh, only for checking that everything is set up properly.

sh script/test.sh

Finally, join the mailing list.

The Data Model

There are four important abstractions that define the data model used throughout s2graph: services, columns, labels and properties.

Services, the top level abstraction, are like databases in traditional RDBMS in which all data are contained. A service usually represents one of the company's real services and is named accordingly, e.g. "KakaoTalk", "KakaoStory".

Columns define the type of vertices and a service can have multiple columns. For example, service "KakaoMusic" can have columns "user_id" and "track_id". While columns can be compared to tables in traditional RDBMS in a sense, labels are the primary abstraction for representing schemas and columns are usually referenced only using their names.

Labels represent relations between two columns, and therefore the type of edges. The two columns can be the same, e.g. for a label representing friendships in an SNS, the two columns will both be "user_id" of the service. There can be labels connecting two columns from two different services; for example, one can create a label that stores all events where KakaoStory posts are shared to KakaoTalk.

Properties are metadata linked to vertices or edges that can be queried later. For vertices representing KakaoTalk users, estimated_birth_year is a possible property, and for edges representing similar KakaoMusic songs their cosine_similarity can be a property.

Using these abstractions, a unique vertex can be identified with its (service, column, vertex id), and a unique edge can be identified with its (service, label, source vertex id, target vertex id). Additional information on edges and vertices is stored within their own properties.
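
As a rough illustration of these identifiers, the following Python sketch models the two keys as immutable tuples-with-names. The class names, the label name similar_song, and all property values are made up for this example; they are not part of s2graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VertexKey:
    service: str
    column: str
    vertex_id: object  # user-defined id, long or string


@dataclass(frozen=True)
class EdgeKey:
    service: str
    label: str
    src_id: object
    tgt_id: object


# Properties are stored next to the key, not inside it.
vertices = {
    VertexKey("KakaoTalk", "user_id", 1): {"estimated_birth_year": 1990},
}
edges = {
    EdgeKey("KakaoMusic", "similar_song", 10, 20): {"cosine_similarity": 87},
}

print(vertices[VertexKey("KakaoTalk", "user_id", 1)])
```

Because the keys are frozen dataclasses, two keys built from the same fields compare equal and can be used for dictionary lookups.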

REST API Glossary

The following is a non-exhaustive list of commonly used s2graph APIs and their examples. The full list of the latest REST API can be found in the routes file.

0. Create a Service - POST /graphs/createService

The following describes what you can set with this API.

0.1 service definition

To create a Service, the following fields need to be specified in the request.

| field name | definition | data type | example | note |
| --- | --- | --- | --- | --- |
| serviceName | name of user defined namespace. | string | "talk_friendship" | required. |
| cluster | zookeeper quorum address for your cluster. | string | "abc.com:2181,abd.com:2181" | optional. default value is "hbase.zookeeper.quorum" in your application.conf. if no value for "hbase.zookeeper.quorum" is defined in application.conf, the default value is "localhost" |
| hTableName | physical HBase table name. | string | "test" | optional. default is serviceName-#{phase}. phase is one of dev/real/alpha/sandbox |
| hTableTTL | global time to keep the data alive. | integer | 86000 | optional. default is infinite. |
| preSplitSize | number of pre-splits for the HBase table. | integer | 20 | optional. default is 0 (no pre-split) |

Service is the top level abstraction in s2graph which can be considered like a database in RDBMS. You can create a service using this API:

curl -XPOST localhost:9000/graphs/createService -H 'Content-Type: Application/json' -d '
{"serviceName": "s2graph", "cluster": "address for zookeeper", "hTableName": "hbase table name", "hTableTTL": 86000, "preSplitSize": 0}
'

Note that the optional values for your service are for advanced users only. Stick to the defaults if you don't know what you are doing.

You can also look up all labels corresponding to a service.

curl -XGET localhost:9000/graphs/getLabels/:serviceName
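
The request body for createService can also be assembled programmatically. The sketch below is only an illustration: the helper name create_service_body is hypothetical, and it assumes that leaving an optional field out of the body lets the documented default apply.

```python
import json

OPTIONAL_FIELDS = ("cluster", "hTableName", "hTableTTL", "preSplitSize")


def create_service_body(service_name, **optional):
    """Return the JSON body for POST /graphs/createService."""
    if not service_name:
        raise ValueError("serviceName is required")
    unknown = set(optional) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    body = {"serviceName": service_name}
    # Unset options are left out (assumption: the defaults then apply).
    body.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(body)


print(create_service_body("s2graph"))
print(create_service_body("s2graph", hTableTTL=86000))
```

The resulting string can be passed to curl with -d, in place of the hand-written JSON above.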

1. Create a Label - POST /graphs/createLabel


A label represents a relation between two columns, and plays a role like a table in RDBMS since labels contain the schema information, i.e. what type of data will be collected and which of it needs to be indexed for efficient retrieval. In most scenarios, defining a schema on vertices is pretty straightforward, but defining a schema on edges requires a little effort. Think about the queries you will need first, and then model users' actions/relations as edges to design a label.

1.1 label definition

To create a Label, the following fields need to be specified in the request.

| field name | definition | data type | example | note |
| --- | --- | --- | --- | --- |
| label | name of this relation; be specific. | string | "talk_friendship" | required. |
| srcServiceName | source column's service | string | "kakaotalk" | required. |
| srcColumnName | source column's name | string | "user_id" | required. |
| srcColumnType | source column's data type | long/integer/string | "string" | required. |
| tgtServiceName | target column's service | string | "kakaotalk"/"kakaoagit" | same as srcServiceName when not specified |
| tgtColumnName | target column's name | string | "item_id" | required. |
| tgtColumnType | target column's data type | long/integer/string | "long" | required. |
| indexProps | mapping from indexed properties' names to their default values. indexed properties will be the primary index for this label (like PRIMARY INDEX idx_xxx (p1, p2) in RDBMS). note that _timestamp, _from, _to are reserved properties | json dictionary | {"timestamp":0, "affinity_score":10, "play_count":0} | A default value must be provided for each property. The default value is usually the minimum value permitted for the property. When this field is empty, the default property named timestamp will be automatically added and indexed. The value's type can be one of long/int/bool/byte and cannot be float. If your property is of float type, you need to convert it to long/int first. |
| props | mapping from non-indexed properties' names to their default values. these properties are not indexed and therefore cannot be used efficiently for querying, like non-indexed columns in RDBMS | json dictionary | {"is_hidden": false, "country_iso": "kr", "country_code": 82} | non-indexed properties can be added later, like alter table add column in RDBMS |
| isDirected | whether this label is directed or undirected | true/false | true/false | default true |
| serviceName | which service this label belongs to. | either srcServiceName or tgtServiceName | s2graph | default tgtServiceName |
| hTableName | if this label has a special use case (such as batch upload), its own HBase table name can be used. | string | s2graph-batch | optional. default is the service's hTableName. |
| hTableTTL | time to keep the data alive. | integer | 86000 | optional. default is the service's hTableTTL. |
| consistencyLevel | if this is strong, only one edge between the same from/to can be made. otherwise (weak) multiple edges with the same from/to can exist. | string | strong/weak | default weak |
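
The type rule for indexProps can be checked on the client before calling the API. The following sketch is an assumption-laden illustration: the function names are hypothetical, and the fixed-point scale of 1000 is just one possible way to convert a float property to long/int.

```python
def check_index_props(index_props):
    """Return the names whose default value is not long/int/bool/byte."""
    # In Python, bool is a subclass of int, so True/False pass this check.
    return [name for name, default in index_props.items()
            if not isinstance(default, int)]


def to_fixed(value, scale=1000):
    """Convert a float property to an integer by fixed-point scaling."""
    return round(value * scale)


print(check_index_props({"time": 0, "weight": 0}))
print(check_index_props({"cosine_similarity": 0.0}))
print(to_fixed(0.87))
```

With this convention, a cosine_similarity of 0.87 would be stored as 870, and clients would divide by the same scale when reading it back.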

Note: the following property names are reserved for the system. Users cannot create properties with these names, but they can use these properties in indexProps, props, and the where clause of a query.

  1. _timestamp is reserved for the system-wide timestamp. It can be interpreted as last_modified_at.
  2. _from is reserved for the label's start vertex.
  3. _to is reserved for the label's end vertex.

1.2 label example

The following is an example that creates a label named graph_test, which represents the relation between account_id and item_id, both in the service named s2graph, with indexed properties time and weight which both have a default value of zero.

curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
    "label": "graph_test",
    "srcServiceName": "s2graph",
    "srcColumnName": "account_id",
    "srcColumnType": "long",
    "tgtServiceName": "s2graph",
    "tgtColumnName": "item_id",
    "tgtColumnType": "long",
    "indexProps": {
        "time": 0,
        "weight": 0
    },
    "props": {
        "is_hidden": true,
        "is_blocked": true,
        "error_code": 500
    }, 
    "serviceName": "s2graph",
    "consistencyLevel": "strong"
}
'

Here is another example that creates a label named kakao_group_join between column account_id of service kakao and column group_id of service kakaogroup. Note that the default indexed property timestamp will be created since the indexProps field is empty.

curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
    "label": "kakao_group_join",
    "srcServiceName": "kakao",
    "srcColumnName": "account_id",
    "srcColumnType": "long",
    "tgtServiceName": "kakaogroup",
    "tgtColumnName": "group_id",
    "tgtColumnType": "string",
    "indexProps": {},
    "serviceName": "kakaogroup",
    "props": {}
}
'

The following query will return the information regarding a label, graph_test in this case.

curl -XGET localhost:9000/graphs/getLabel/graph_test

You can delete a label using the following API:

curl -XPUT localhost:9000/graphs/deleteLabel/graph_test

To add a new non-indexed property, use the following API:

curl -XPOST localhost:9000/graphs/addProp/graph_test -H 'Content-Type: Application/json' -d '
{"name": "is_blocked", "defaultValue": false, "dataType": "boolean", "usedInIndex": false}
'

1.3 Consistency level.

One last important constraint on a label is its consistency level.

This defines how edges are stored at the storage level. Note that queries are completely independent of it.

To explain consistency: s2graph identifies an edge uniquely by its (from, label, to) triple, which it calls the unique edge key.

The following example is used to explain the differences between the strong and weak consistency levels.

1418950524721	insert	e	1 	101	graph_test	{"weight": 10} = (1, graph_test, 101)
1418950524723	insert	e	1	101	graph_test	{"weight": 20} = (1, graph_test, 101)

Currently there are two consistency levels.

1. strong

Ensures that only one edge is stored for a given unique edge key ((1, graph_test, 101) above). With the strong consistency level, the last command overwrites the previous command.

2. weak

No consistency check is made on the unique edge key. The example above yields two different edges in storage, with different timestamp and weight values.

For example, the following edges will be stored under each configuration.

Assume that only timestamp is used in indexProps and the user inserts the following.

u1 -> (t1, v1)
u1 -> (t2, v2)
u1 -> (t3, v2)
u1 -> (t4, v1)

With the strong consistencyLevel, the following is stored.

u1 -> (t4, v1), (t3, v2)

Note that u1 -> (t1, v1), (t2, v2) do not exist.

With the weak consistencyLevel, the following is stored.

u1 -> (t4, v1), (t3, v2), (t2, v2), (t1, v1)
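
The two outcomes above can be reproduced with a small in-memory sketch. This is not s2graph code: the function stored_edges is hypothetical, the timestamps t1..t4 are replaced by 1..4, and the label is assumed fixed so that (source, target) stands in for the unique edge key.

```python
def stored_edges(inserts, consistency_level):
    """inserts: list of (src, timestamp, tgt) tuples in arrival order."""
    if consistency_level == "weak":
        kept = list(inserts)  # every insert becomes its own edge
    else:
        latest = {}
        for src, ts, tgt in inserts:
            key = (src, tgt)  # unique edge key, label omitted in this sketch
            if key not in latest or ts > latest[key][1]:
                latest[key] = (src, ts, tgt)
        kept = list(latest.values())
    # Most recent first, matching the listings above.
    return sorted(kept, key=lambda edge: edge[1], reverse=True)


inserts = [("u1", 1, "v1"), ("u1", 2, "v2"), ("u1", 3, "v2"), ("u1", 4, "v1")]
print(stored_edges(inserts, "strong"))  # [('u1', 4, 'v1'), ('u1', 3, 'v2')]
print(stored_edges(inserts, "weak"))
```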

Why weak consistency is the default

In most cases, edges related to a user's activity should use the weak consistencyLevel, since there will be no concurrent updates on the same edge. The strong consistencyLevel is only for edges that expect many concurrent updates.

The consistency level also determines how edges are stored when commands are delivered out of timestamp order.

With the strong consistencyLevel, the following is guaranteed.

Suppose the natural sequence of events on the unique edge key (1, graph_test, 101) is the following.

1418950524721	insert	e	1	101	graph_test	{"is_blocked": false}
1418950524722	delete	e	1	101	graph_test
1418950524723	insert	e	1	101	graph_test	{"is_hidden": false, "weight": 10}
1418950524724	update	e	1	101	graph_test	{"time": 1, "weight": -10}
1418950524726	update	e	1	101	graph_test	{"is_blocked": true}

Even if the commands above arrive out of order, strong consistency ensures the same eventual state for (1, graph_test, 101).

1418950524726	update	e	1	101	graph_test	{"is_blocked": true}
1418950524723	insert	e	1	101	graph_test	{"is_hidden": false, "weight": 10}
1418950524722	delete	e	1	101	graph_test
1418950524721	insert	e	1	101	graph_test	{"is_blocked": false}
1418950524724	update	e	1	101	graph_test	{"time": 1, "weight": -10}

There are many cases in which commands arrive out of order.

  1. Client servers are distributed and each client issues commands asynchronously.
  2. Client servers are distributed and group their commands.
  3. When a Kafka queue is used, global ordering of messages is not guaranteed.

The following is what s2graph does to provide the strong consistency level.

complexity = O(one read) + O(one delete) + O(2 put)

fetchedEdge = fetch edge with (1, graph_test, 101) from lookup table.

if fetchedEdge does not exist:
	create new edge same as current insert operation
	update lookup table as current insert operation
else:
	valid = compare fetchedEdge vs current insert operation.
	if valid: 
		delete fetchedEdge
		create new edge after comparing fetchedEdge and current insert.
		update lookup table
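
The pseudocode above can be turned into a runnable in-memory sketch. The class StrongLabelStore is hypothetical, and the validity check is an assumption made for this example: a request is treated as valid only if its timestamp is newer than the fetched edge. The real comparison in s2graph may be more involved (for example, to handle deletes and partial updates).

```python
class StrongLabelStore:
    """In-memory stand-in for the lookup table of a strong label."""

    def __init__(self):
        self.lookup = {}  # unique edge key -> (timestamp, props)
        self.reads = self.deletes = self.puts = 0

    def insert(self, src, label, tgt, timestamp, props):
        key = (src, label, tgt)
        self.reads += 1  # one read of the lookup table
        fetched = self.lookup.get(key)
        if fetched is not None:
            # Assumed validity rule: only a newer request replaces the edge.
            if timestamp <= fetched[0]:
                return False
            self.deletes += 1  # delete the previously stored edge
        self.puts += 2  # new edge plus lookup table entry
        self.lookup[key] = (timestamp, dict(props))
        return True


store = StrongLabelStore()
# The later command arrives first; the earlier one is then dropped.
store.insert(1, "graph_test", 101, 1418950524723, {"weight": 20})
store.insert(1, "graph_test", 101, 1418950524721, {"weight": 10})
print(store.lookup[(1, "graph_test", 101)])
```

Under this rule, the stored edge is the same whichever of the two commands arrives first, which is the property the strong level is meant to provide.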

Limitation: since we write our data to HBase asynchronously, there is no consistency guarantee on the same edge within our flushInterval (1 second).

2. (Optionally) Add Extra Indexes - POST /graphs/addIndex


A label can have multiple indexed properties, or (for brevity) indexes. The order of the edges returned by a query is determined by the indexes, so indexes essentially define what is included in a topK query.

Edge retrieval queries in s2graph return the topK edges by default. Clients must issue another query to fetch the next K edges, i.e., topK ~ 2topK.

Internally, s2graph stores edges sorted according to the indexes in order to limit the number of edges fetched in one query. If no ordering is given, s2graph uses the timestamp as an index, thus returning the most recent data.

It is impossible to fetch millions of edges and sort them online to get the topK in less than a second. s2graph uses vertex-centric indexes to avoid this.

With a vertex-centric index, having millions of edges is fine as long as the topK value is reasonable (~ 1K). Note that indexes must be created before putting any data on the label (just like in an RDBMS).

New indexes can be added dynamically, but they will not be applied to existing data (this is planned for future versions). The number of indexes on a label is currently limited to 8.

The following is an example of adding indexes play_count and pay_amount to a label named graph_test.

curl -XPOST localhost:9000/graphs/addIndex -H 'Content-Type: Application/json' -d '
{"label": "graph_test", "indexProps": {"play_count":0, "pay_amount":0}}
'
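
To see why a vertex-centric index makes topK cheap, consider the following toy sketch. It is not how s2graph lays out data in HBase; the class VertexCentricIndex is hypothetical and it assumes larger index values should be returned first. The point is only that edges are kept sorted at write time, so a topK read is a slice rather than a sort.

```python
import bisect


class VertexCentricIndex:
    """Edges of one source vertex, kept sorted by indexed property values."""

    def __init__(self):
        self._rows = []  # (sort_key, target_id), ascending by sort_key

    def put(self, index_values, target_id):
        # Negate the values so that larger values come first in scan order.
        sort_key = tuple(-value for value in index_values)
        bisect.insort(self._rows, (sort_key, target_id))

    def top_k(self, k, offset=0):
        # No sorting at read time: just a slice of the presorted rows.
        return [target for _, target in self._rows[offset:offset + k]]


index = VertexCentricIndex()
index.put((10,), 101)
index.put((30,), 103)
index.put((20,), 102)
print(index.top_k(2))            # [103, 102]
print(index.top_k(2, offset=2))  # [101]
```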

3. Insert and Manipulate Edges


An edge represents a relation between two vertices, with properties according to the schema defined in its label. The following fields need to be specified when inserting an edge, and are returned when queried on edges.

| field name | definition | data type | note | example |
| --- | --- | --- | --- | --- |
| timestamp | when this request is issued. | long | required. in millis since the epoch. It is important to use millis, since TTL support is in millis. | 1430116731156 |
| operation | insert/delete/update/increment | string | required only for bulk operation; aliases are insert: i, delete: d, update: u, increment: in. default is insert. | "i", "insert" |
| from | Id of start vertex. | long/string | required. prefer long if possible. maximum string byte length < 249 | 1 |
| to | Id of end vertex. | long/string | required. prefer long if possible. maximum string byte length < 249 | 101 |
| label | name of the corresponding label | string | required. | "graph_test" |
| direction | direction of this relation, one of out/in/undirected | string | required. aliases are out: o, in: i, undirected: u | "out" |
| props | extra properties of this edge. | json dictionary | required. all indexed properties should be present, otherwise the default values will be added. Non-indexed properties can also be present | {"timestamp": 1417616431, "affinity_score":10, "is_hidden": false, "is_valid": true} |
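
The aliases in the table can be expanded on the client before sending a request. The helper below is a hypothetical sketch, not part of s2graph. It fills in the documented default operation and checks only the fields that every example in this document provides; direction is left optional here because the curl examples omit it.

```python
OPERATIONS = {"i": "insert", "d": "delete", "u": "update", "in": "increment"}
DIRECTIONS = {"o": "out", "i": "in", "u": "undirected"}


def normalize_edge(edge):
    """Expand aliases and apply the default operation (insert)."""
    out = dict(edge)
    operation = out.get("operation", "insert")
    out["operation"] = OPERATIONS.get(operation, operation)
    if "direction" in out:
        out["direction"] = DIRECTIONS.get(out["direction"], out["direction"])
    for field in ("timestamp", "from", "to", "label"):
        if field not in out:
            raise ValueError(f"missing required field: {field}")
    return out


edge = {"timestamp": 1430116731156, "from": 1, "to": 101,
        "label": "graph_test", "operation": "in", "direction": "o"}
print(normalize_edge(edge))
```

Note that the alias "in" means increment for operation, while "in" is already the full name of a direction, which is why the two lookups use separate tables.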

Edge Operations

1. Insert - POST /graphs/edges/insert

Insert behaves differently depending on the label's consistency level.

  1. strong consistency level: 1 READ + (1 DELETE + 1 PUT, optional). Insert is equivalent to upsert. s2graph checks whether the unique edge key exists; if an edge with the same unique edge key is found, it runs validation and then decides whether to apply the current request or drop it.

  2. weak consistency level: 2 PUT. There is no consistency check on the unique edge key, so inserting the same edge key multiple times can yield multiple edges.

For consistency reasons, graph databases typically go through the following three steps to insert an edge between a source vertex to a target vertex with some metadata:

  1. fetch the source vertex to make sure it exists
  2. fetch the target vertex to make sure it exists
  3. insert an edge with the metadata on from -> to

Unlike other graph databases like Titan where server-generated vertex ids must be used, any user-defined vertex ids can be used in s2graph. Therefore s2graph will not fetch vertex data during the insert operation, making it one simple write to the underlying database.

This means that you don't have to create source and target vertices prior to inserting edges if you don't need any properties on vertices (i.e., you only need vertex ids). In this case, s2graph will not fetch vertex information from the underlying db, so no read operation is required.

The following is an example inserting edges:

curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
  {"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616431},
  {"from":1,"to":102,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616431},
  {"from":1,"to":103,"label":"graph_test","props":{"time":1, "weight":12},"timestamp":1417616431},
  {"from":1,"to":104,"label":"graph_test","props":{"time":-2, "weight":1},"timestamp":1417616431}
]
'
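
The same request can be built from Python with the standard library. This is a sketch under assumptions: the helper name build_post is hypothetical, and the base URL simply mirrors the localhost:9000 used by the curl examples. The request is only constructed here; sending it requires a running s2graph server.

```python
import json
import urllib.request


def build_post(path, payload, base_url="http://localhost:9000"):
    """Build (but do not send) a JSON POST request for an s2graph endpoint."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


edges = [
    {"from": 1, "to": 101, "label": "graph_test",
     "props": {"time": -1, "weight": 10}, "timestamp": 1417616431},
]
request = build_post("/graphs/edges/insert", edges)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running server.
```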

2. delete - POST /graphs/edges/delete

You can also delete edges.

Note that if the timestamp in a delete request is larger (later) than the actual timestamp of the edge, the delete request will be ignored.

The following is an example deleting edges.

curl -XPOST localhost:9000/graphs/edges/delete -H 'Content-Type: Application/json' -d '
[
 {"from":1,"to":102,"label":"graph_test","timestamp":1417616432},
 {"from":1,"to":103,"label":"graph_test","timestamp":1417616432}
]
'

3. update - POST /graphs/edges/update

An update request on edges will overwrite properties of the corresponding edge.

This is not an upsert operation and a corresponding edge must exist for update operation. Update operations on nonexistent edges will be ignored.

Also remember that previous data stored in the edge is overwritten.

The following is an example updating properties of an edge, first setting is_hidden property to be true and then setting weight property to be 100.

curl -XPOST localhost:9000/graphs/edges/update -H 'Content-Type: Application/json' -d '
[
 {"from":1,"to":104,"label":"graph_test","timestamp":1417616433, "props": {"is_hidden":true}},
 {"from":1,"to":104,"label":"graph_test","timestamp":1417616434, "props": {"weight":100}}
]
'

4. increment - POST /graphs/edges/increment

You can add a certain value to edges' indexed properties. Negative numbers can be used to subtract some value from the properties. Increment operations are only supported for indexed properties.

You don't have to insert an edge prior to its increment operation. If the edge corresponding to an increment request is not found, a new edge filled with the default property values (provided when defining the label) will be automatically created.

The following is an example incrementing edges' properties.

curl -XPOST localhost:9000/graphs/edges/increment -H 'Content-Type: Application/json' -d '
[
  {"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616435},
  {"from":1,"to":102,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616435},
  {"from":1,"to":103,"label":"graph_test","props":{"time":1, "weight":12},"timestamp":1417616435},
  {"from":1,"to":104,"label":"graph_test","props":{"time":-2, "weight":1},"timestamp":1417616435}
]
'
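
The increment semantics described above can be sketched against an in-memory table. The function increment and the dictionary layout are hypothetical; the defaults mirror the indexProps of the graph_test label defined earlier.

```python
def increment(stored, key, deltas, defaults):
    """Apply an increment request to an in-memory edge table."""
    # A missing edge is first created from the label's default values.
    props = stored.setdefault(key, dict(defaults))
    for name, delta in deltas.items():
        if name not in defaults:
            raise KeyError(f"{name} is not an indexed property")
        props[name] += delta  # negative deltas subtract
    return props


defaults = {"time": 0, "weight": 0}  # indexProps of graph_test
stored = {}
key = (1, "graph_test", 101)
print(increment(stored, key, {"time": -1, "weight": 10}, defaults))
print(increment(stored, key, {"weight": -3}, defaults))
```

The first call creates the edge from the defaults and then applies the deltas; the second call only adjusts the existing values.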

5. insertBulk - POST /graphs/edges/insertBulk

Insert edges without checking consistency.

The following is an example inserting edges:

curl -XPOST localhost:9000/graphs/edges/insertBulk -H 'Content-Type: Application/json' -d '
[
  {"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616431},
  {"from":1,"to":101,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616432}
]
'

4. (Optionally) Insert and Manipulate Vertices

Vertices are the two ends that an edge is connecting, and correspond to a column defined for a service. In case you need to store some metadata corresponding to vertices and make queries regarding them, you can insert and manipulate vertices rather than edges.

Unlike edges and their labels, properties on vertices are not indexed and do not require a predefined schema nor default values. The following fields are used when operating on vertices.

| field name | definition | data type | note | example |
| --- | --- | --- | --- | --- |
| timestamp | | long | required. in seconds since the epoch | 1417616431 |
| operation | the operation to perform; one of insert, delete, update, increment | string | required only for bulk operations; aliases are insert: i, delete: d, update: u, increment: in. default is insert. | "i", "insert" |
| serviceName | corresponding service's name | string | required. | "kakaotalk"/"kakaogroup" |
| columnName | corresponding column's name | string | required. | "xxx_service_user_id" |
| id | a unique identifier of this vertex | long/string | required. prefer long if possible. | 101 |
| props | extra properties of this vertex. | json dictionary | required. | {"is_active_user": true, "age":10, "gender": "F", "country_iso": "kr"} |

1. Insert - POST /graphs/vertices/insert/:serviceName/:columnName

curl -XPOST localhost:9000/graphs/vertices/insert/s2graph/account_id -H 'Content-Type: Application/json' -d '
[
  {"id":1,"props":{"is_active":true, "talk_user_id":10},"timestamp":1417616431},
  {"id":2,"props":{"is_active":true, "talk_user_id":12},"timestamp":1417616431},
  {"id":3,"props":{"is_active":false, "talk_user_id":13},"timestamp":1417616431},
  {"id":4,"props":{"is_active":true, "talk_user_id":14},"timestamp":1417616431},
  {"id":5,"props":{"is_active":true, "talk_user_id":15},"timestamp":1417616431}
]
'

2. delete - POST /graphs/vertices/delete/:serviceName/:columnName

This operation will delete only the vertex data of a specified column and will not delete all edges connected to those vertices.

Important notes

This means that edges returned by a query can contain deleted vertices. Clients need to check if those vertices are valid.

3. deleteAll - POST /graphs/vertices/deleteAll/:serviceName/:columnName

This operation will delete all vertex data of a specified column and also delete all edges that are connected to those vertices. Example:

curl -XPOST localhost:9000/graphs/vertices/deleteAll/s2graph/account_id -H 'Content-Type: Application/json' -d '
[{"id": 1, "timestamp": 193829198}]
'

This is an extremely expensive operation. The following pseudocode shows how this operation works:

vertices = vertex list to delete
for vertex in vertices
    labels = fetch all labels that this vertex is included.
    for label in labels
        for index in label.indices
            edges = G.read with limit 50K
            for edge in edges
                edge.delete

The total complexity is O(L * L.I) reads + O(L * L.I * 50K) writes in the worst case. If a vertex to delete has more than 50K edges, the delete operation will not be consistent.
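
As a back-of-the-envelope illustration of this bound, the sketch below counts the worst-case operations for a single vertex. The function name and the sample index counts are made up; only the 50K read limit comes from the pseudocode above.

```python
READ_LIMIT = 50_000  # edges fetched per (label, index) read


def delete_all_cost(index_counts):
    """index_counts: number of indexes of each label the vertex appears in."""
    reads = sum(index_counts)              # one read per (label, index) pair
    worst_case_writes = reads * READ_LIMIT  # every fetched edge is deleted
    return reads, worst_case_writes


# A vertex that appears in three labels with 1, 2 and 8 indexes.
print(delete_all_cost([1, 2, 8]))  # (11, 550000)
```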

4. update - POST /graphs/vertices/update/:serviceName/:columnName

The update operation on vertices uses the same parameters as in the insert operation.

5. increment

Not yet implemented; stay tuned.

5. Query

1. Definition


Once you have your graph data uploaded to s2graph, you can traverse your graph using our REST APIs. Queries contain the vertices to start traversing from and a list of labels paired with filters and scoring weights used during the traversal. Query requests are structured as follows:

| field name | definition | data type | note | example |
| --- | --- | --- | --- | --- |
| srcVertices | vertices to start traversing. | json array of json dictionaries specifying each vertex, with "serviceName", "columnName", "id" fields. | required. | [{"serviceName": "kakao", "columnName": "account_id", "id":1}] |
| steps | list of steps for traversing. | json array of steps | explained below | [[{"label": "graph_test", "direction": "out", "limit": 100, "scoring":{"time": 0, "weight": 1}}]] |
| removeCycle | when traversing to the next step, do not traverse already visited vertices | true/false. default is true | "already visited" is defined by (label, vertex). so if steps are friend -> friend, second-depth friends are removed if they exist among the first-depth friends | |

step: Each step defines what to traverse in a single hop on the graph. The first step has to be a direct neighbor of the starting vertices, the second step is a direct neighbor of vertices from the first step, and so on. A step is specified with a list of query params, which is why the steps field of a query request is an array of arrays of dictionaries.
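
The nesting of steps can be made explicit with a small builder. The helper names query_param and build_query are hypothetical and exist only for this illustration; the friends label is the one used in the query examples later in this document.

```python
import json


def query_param(label, **options):
    """One query param: a dictionary naming the label to traverse."""
    return {"label": label, **options}


def build_query(src_vertices, steps):
    """steps is a list of steps; each step is a list of query params."""
    for step in steps:
        if not isinstance(step, list) or not all(isinstance(p, dict) for p in step):
            raise TypeError("each step must be a list of query-param dictionaries")
    return {"srcVertices": src_vertices, "steps": steps}


query = build_query(
    [{"serviceName": "s2graph", "columnName": "account_id", "id": 1}],
    [
        [query_param("friends", direction="out", limit=100)],  # first hop
        [query_param("friends", direction="out", limit=10)],   # second hop
    ],
)
print(json.dumps(query, indent=2))
```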

query param:

| field name | definition | data type | note | example |
| --- | --- | --- | --- | --- |
| label | name of label to traverse. | string | required. must be an existing label. | "graph_test" |
| direction | in/out direction to traverse | string | optional, default out | "out" |
| limit | how many edges to fetch | int | optional, default 10 | 10 |
| offset | start position on this index | int | optional, default 0 | 50 |
| interval | the range to filter on indexed properties | json dict | optional | {"from": {"time": 0, "weight": 1}, "to": {"time": 1, "weight": 15}} |
| duration | time range | json dict | optional | {"from": 1407616431, "to": 1417616431} |
| scoring | a mapping from indexed properties' names to their weights. the weighted sum of property values will be the final score. | json dict | optional | {"time": 1, "weight": 2} |
| where | filter condition (like SQL's where clause). logical operations (and/or) are supported and each condition can be exact equal (=), sets (in), and range (between x and y). do not use any quotes for string type | string | optional | ex) "((_from = 123 and _to = abcd) or gender = M) and is_hidden = false and weight between 1 and 10 or time in (1, 2, 3)". note that it only supports long/string/boolean types |
| outputField | replace the edge's to field with this field in props | string | optional | "outputField": "service_user_id". this changes the to field into props['service_user_id'] |
| exclude | decide if vertices that appear on this label and different labels in this step should be filtered out | boolean | optional, default false | true: vertices that appear on this label and other labels in this step will be filtered out. |
| include | decide if vertices that appear on this label and different labels in this step should remain in the result. | boolean | optional, default false | |
| duplicate | policy on how to deal with duplicate edges. duplicate edges means edges with the same (from, to, label, direction). | string, one of "first", "sum", "countSum", "raw" | optional, default "first" | "first" means only the first occurrence of an edge survives. "sum" means the scores of the same edges are summed up but only one edge survives. "countSum" means the occurrences of the same edges are counted but only one edge survives. "raw" means the same edges survive as they are. |
| rpcTimeout | timeout for this request | integer | optional, default 100ms | note: maximum value should be less than 1000ms |
| maxAttempt | how many times the client will try to fetch the result from HBase | integer | optional, default 1 | note: maximum value should be less than 5 |
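
The four duplicate policies can be illustrated with a short sketch. This is a reading of the descriptions above, not s2graph's implementation: the function name is hypothetical, and it assumes that for "countSum" the occurrence count replaces the score of the surviving edge.

```python
POLICIES = ("first", "sum", "countSum", "raw")


def apply_duplicate_policy(edges, policy="first"):
    """edges: dicts with from, to, label, direction and score fields."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "raw":
        return [dict(edge) for edge in edges]  # keep every occurrence
    merged = {}
    for edge in edges:
        key = (edge["from"], edge["to"], edge["label"], edge["direction"])
        if key not in merged:
            # Assumption: countSum stores the occurrence count as the score.
            merged[key] = dict(edge, score=1) if policy == "countSum" else dict(edge)
        elif policy == "sum":
            merged[key]["score"] += edge["score"]
        elif policy == "countSum":
            merged[key]["score"] += 1
    return list(merged.values())


def make_edge(to, score):
    return {"from": 1, "to": to, "label": "graph_test",
            "direction": "out", "score": score}


edges = [make_edge(101, 3), make_edge(101, 5), make_edge(102, 1)]
for policy in POLICIES:
    print(policy, [e["score"] for e in apply_duplicate_policy(edges, policy)])
```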

2. Query API

2.1. Edge Queries

The edge query provides the following 4 APIs. s2graph itself does not provide queries that depend on business logic; it instead provides the data needed to implement that logic.

1. POST /graphs/getEdges

Get all edges in a flat hierarchy.

{
    "size": 2,
    "results": [
        {
            "from": 1,
            "to": 88277115755635400,
            "label": "talk_friend_long_term_agg_by_account_id",
            "direction": "out",
            "_timestamp": 1425088498,
            "props": {
                "talk_user_id": 41780,
                "score": 8,
                "service_user_id": 88277115755635400,
                "profile_id": 424,
                "birth_date": 517,
                "birth_year": 1977,
                "gender": "F"
            },
            "score": 8
        },
        {
            "from": 1,
            "to": 88300639020224930,
            "label": "talk_friend_long_term_agg_by_account_id",
            "direction": "out",
            "_timestamp": 1425088493,
            "props": {
                "talk_user_id": 1545029,
                "score": 0,
                "service_user_id": 88300639020224930,
                "profile_id": 9571562,
                "birth_date": 605,
                "birth_year": 1979,
                "gender": "F"
            },
            "score": 0
        }
    ]
}

2. POST /graphs/getEdges/grouped

Get all edges, grouped by each edge's target vertex.

{
    "size": 2,
    "results": [
        {
            "name": "account_id",
            "id": "88277115755635393",
            "scoreSum": 8,
            "aggr": {
                "name": "account_id",
                "ids": [
                    "1"
                ]
            }
        },
        {
            "name": "account_id",
            "id": "88300639020224928",
            "scoreSum": 0,
            "aggr": {
                "name": "account_id",
                "ids": [
                    "1"
                ]
            }
        }
    ]
}

3. POST /graphs/getEdgesExcluded

Get all edges, excluding all edges from srcVertices to the last step.

4. POST /graphs/getEdgesExcluded/grouped

Get all edges, excluding all edges from srcVertices to the last step, grouped by each edge's target vertex.

2.2. Vertex Queries

1. POST /graphs/getVertices

Get all vertex data.

3. Query Examples

3.1. Edge Queries

Example 1. Selecting the first 100 edges of label graph_test, which start from the vertex with account_id=1, sorted using the default index of graph_test.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
    "steps": [
      [{"label": "graph_test", "direction": "out", "offset": 0, "limit": 100
      }]
    ]
}
'

Example 2. Selecting the 50th ~ 100th edges from the same vertex.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
    "steps": [
      [{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50}]
    ]
}
'
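
The offset and limit parameters used in Examples 1 and 2 lend themselves to paging. The sketch below builds the same request bodies programmatically. The helper names `get_edges_body` and `page_bodies` are hypothetical and not part of s2graph, and the code only constructs JSON; it does not send any request.

```python
import json


def get_edges_body(vertex_id, label, direction="out", offset=0, limit=100,
                   service_name="s2graph", column_name="account_id"):
    """Build a single-step getEdges request body like the examples above."""
    return {
        "srcVertices": [
            {"serviceName": service_name, "columnName": column_name, "id": vertex_id}
        ],
        "steps": [
            [{"label": label, "direction": direction, "offset": offset, "limit": limit}]
        ],
    }


def page_bodies(vertex_id, label, page_size, pages, direction="out"):
    """Build one request body per page by advancing offset in page_size steps."""
    return [
        get_edges_body(vertex_id, label, direction,
                       offset=page * page_size, limit=page_size)
        for page in range(pages)
    ]


# The body of Example 2, serialized the way curl would send it.
example_2 = json.dumps(get_edges_body(1, "graph_test", "in", offset=50, limit=50))
```

Each element of `page_bodies(1, "graph_test", 50, 2)` can be posted to /graphs/getEdges in turn to walk through the result set.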

Example 3. Selecting the 50th ~ 100th edges from the same vertex, now with a time range filter.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
    "steps": [
      [{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50, "duration": {"from": 1416214118, "to": 1416214218}}]
    ]
}
'

Example 4. Selecting 50th ~ 100th edges from the same vertex, sorted using the indexed properties time and weight, with the same time range filter, and applying weighted sum using time: 1.5, weight: 10

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
    "steps": [
      [{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50, "duration": {"from": 1416214118, "to": 1416214218}, "scoring": {"time": 1.5, "weight": 10}}]
    ]
}
'
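
As a rough illustration of what a weighted sum over the scoring map could look like, the sketch below multiplies each named property by its weight and adds the results. This is an assumption about the formula, not a description of the server's code; the function name `weighted_score` is hypothetical, and treating a missing property as 0 is also an assumption.

```python
def weighted_score(props, scoring):
    """Weighted sum of edge properties.

    Assumption: score = sum(weight * props[name]) over the names in the
    scoring map, with a missing property counted as 0.
    """
    return sum(weight * props.get(name, 0) for name, weight in scoring.items())


def rank_edges(edges, scoring):
    """Sort edges (dicts with a "props" map) by weighted score, highest first."""
    return sorted(edges, key=lambda e: weighted_score(e["props"], scoring), reverse=True)
```

With the scoring map from Example 4, an edge whose props are time = 2 and weight = 3 would score 1.5 * 2 + 10 * 3 = 33 under this assumption.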

Example 5. Selecting 100 edges representing friends, from the vertex with account_id=1, and again selecting their 10 friends, therefore selecting at most 1,000 “friends of friends”.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
    "steps": [
      [{"label": "friends", "direction": "out", "limit": 100}],
      [{"label": "friends", "direction": "out", "limit": 10}]
    ]
}
'
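
The "at most 1,000" figure in Example 5 is the product of the step limits. The sketch below computes that upper bound for any steps array. The function name `max_result_count` is hypothetical, and it assumes that when a step lists several labels, each vertex can contribute up to the sum of their limits.

```python
from math import prod


def max_result_count(steps):
    """Upper bound on edges returned by the last step for one source vertex.

    Assumption: every vertex reached by a step fans out to at most the sum
    of the "limit" values in the next step, so the bound is the product of
    the per-step sums.
    """
    return prod(sum(query["limit"] for query in step) for step in steps)


friends_of_friends = [
    [{"label": "friends", "direction": "out", "limit": 100}],
    [{"label": "friends", "direction": "out", "limit": 10}],
]
bound = max_result_count(friends_of_friends)  # 100 * 10
```

The actual result can be smaller, since a vertex may have fewer edges than the limit.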

Example 6. Selecting 100 talk_friend edges representing friends and, for each friend, 10 play_music edges, to get “music that my friends have listened to”.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d '
{
    "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}],
    "steps": [
      [{"label": "talk_friend", "direction": "out", "limit": 100}],
      [{"label": "play_music", "direction": "out", "limit": 10}]
    ]
}
'

3.2. Vertex Queries

Example 1. Selecting the vertices with ids 1, 2, and 3 from column account_id of service s2graph and from column user_id of service agit.

curl -XPOST localhost:9000/graphs/getVertices -H 'Content-Type: Application/json' -d '
[
	{"serviceName": "s2graph", "columnName": "account_id", "ids": [1, 2, 3]},
	{"serviceName": "agit", "columnName": "user_id", "ids": [1, 2, 3]}
]
'

6. Bulk Loading

In many cases, the first step to start using s2graph is to migrate a large dataset into s2graph. s2graph provides a bulk loading script for importing the initial dataset.

To use bulk load, you need a running Spark cluster and a TSV file in the bulk load format.

Note that if you don't need extra properties on vertices (i.e., you only need the vertex id), you only need to publish the edges and not the vertices. Publishing edges will effectively create vertices with empty properties.

Edge Format

| timestamp | operation | logType | from | to | label | props |
|---|---|---|---|---|---|---|
| 1416236400 | insert | edge | 56493 | 26071316 | talk_friend_long_term_agg_by_account_id | {"timestamp":1416236400,"score":0} |

Vertex Format

| timestamp | operation | logType | id | serviceName | columnName | props |
|---|---|---|---|---|---|---|
| 1416236400 | insert | vertex | 56493 | kakaotalk | account_id | {"is_active":true, "country_iso": "kr"} |
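
To show how the two formats line up with a TSV file, here is a minimal sketch that writes and parses bulk load lines. It assumes the seven columns appear in the order listed above, separated by single tabs, with props as a JSON object in the last column. The names `format_edge` and `parse_bulk_line` are hypothetical helpers, not part of the loader.

```python
import json

EDGE_COLUMNS = ["timestamp", "operation", "logType", "from", "to", "label", "props"]
VERTEX_COLUMNS = ["timestamp", "operation", "logType", "id", "serviceName", "columnName", "props"]


def format_edge(timestamp, operation, src, dst, label, props):
    """Render one edge as a tab-separated bulk load line."""
    fields = [str(timestamp), operation, "edge", str(src), str(dst), label,
              json.dumps(props, separators=(",", ":"))]
    return "\t".join(fields)


def parse_bulk_line(line):
    """Parse one bulk load line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated columns, got %d" % len(fields))
    log_type = fields[2]
    if log_type == "edge":
        columns = EDGE_COLUMNS
    elif log_type == "vertex":
        columns = VERTEX_COLUMNS
    else:
        raise ValueError("unknown logType: %r" % log_type)
    record = dict(zip(columns, fields))
    record["timestamp"] = int(record["timestamp"])
    record["props"] = json.loads(record["props"])  # props column is JSON
    return record
```

Running a sample of the file through `parse_bulk_line` before submitting the Spark job is a cheap way to catch lines with a wrong column count or malformed props.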

Build

To build the bulk loader, build the loader project by running the following command.

sbt "project loader" "clean" "assembly"

You will see s2graph-loader-assembly-0.0.4-SNAPSHOT.jar under loader/target/scala-2.xx/

Source Data Storage Options

For bulk loading, the source data can be either in HDFS or in a Kafka queue.

1. When the source data is in HDFS.

  • Run subscriber.GraphSubscriber to bulk upload the HDFS TSV file into s2graph.
  • Check how many edges are parsed/stored by looking at the Spark UI.

2. When the source data is in Kafka.

This assumes that the data is in the bulk load format and is constantly coming into Kafka.

  • Run subscriber.GraphSubscriberStreaming to extract events from the Kafka topic and load them into s2graph.
  • Check how many edges are parsed/stored by looking at the Spark UI.

3. Online migration

The following is how we do online migration from an RDBMS to s2graph. It assumes that the client sends the same events to both the primary storage (RDBMS) and s2graph.

  • Mark the label as isAsync true. This queues all events into the Kafka queue.
  • Dump the RDBMS and build a bulk load file in TSV.
  • Upload the TSV file with subscriber.GraphSubscriber.
  • Mark the label as isAsync false. This stops queuing events into the Kafka queue and applies changes to s2graph directly.
  • Since s2graph is idempotent, it is safe to replay the queued messages during the bulk load process, so just use subscriber.GraphSubscriberStreaming to apply the queued events.

7. Benchmark

Test data

  1. Kakao Talk full graph (8.8 billion edges)
  2. a sample of 10 million user ids that have more than 100 friends
  3. number of HBase region servers = 20

1. friend of friend

Find 50 talk friends, then find 20 talk friends of each.

{
    "srcVertices": [{"serviceName": "kakaotalk", "columnName": "talk_user_id", "id":$id}],
    "steps": [
      [{"label": "talk_friend", "direction": "out", "limit": 50}],
      [{"label": "talk_friend", "direction": "out", "limit": 20}]
    ]
}

total vuser = 980

| number of rest server | tps | mean test time |
|---|---|---|
| 10 | 5,981.5 | 151.36 ms |
| 20 | 10,589 | 86.45 ms |
| 30 | 16,295.4 | 56.43 ms |

2. friends

find 100 talk friends

{
    "srcVertices": [{"serviceName": "kakaotalk", "columnName": "talk_user_id", "id":$id}],
    "steps": [
      [{"label": "talk_friend", "direction": "out", "limit": 100}]
    ]
}

total vuser = 2,072

| number of rest server | tps | mean test time |
|---|---|---|
| 20 | 53,713.4 | 37.31 ms |

new benchmark (asynchbase)

1. one step query

{
    "srcVertices": [
        {
            "serviceName": "kakaotalk",
            "columnName": "talk_user_id",
            "id": %s    
        }
    ],
    "steps": [
      [
        {
          "label": "talk_friend_long_term_agg", 
          "direction": "out", 
          "offset": 0, 
          "limit": %d
        }
      ]
    ]
}
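
| number of rest server | vuser | offset | first step limit | tps | latency |
|---|---|---|---|---|---|
| 1 | 30 | 0 | 10 | 9,790 | 3 ms |
| 1 | 30 | 80 | 10 | 9,958.2 | 2.91 ms |
| 1 | 30 | 0 | 20 | 7,418.1 | 3.92 ms |
| 1 | 30 | 0 | 40 | 5,118.5 | 5.72 ms |
| 1 | 30 | 0 | 60 | 3,966.9 | 7.38 ms |
| 1 | 30 | 0 | 80 | 3,408.4 | 8.58 ms |
| 1 | 30 | 0 | 100 | 3,048.1 | 9.76 ms |
| 2 | 60 | 0 | 100 | 5,869.4 | 10.04 ms |
| 4 | 120 | 0 | 100 | 11,473.1 | 10.27 ms |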

2. two step query

{
    "srcVertices": [
        {
            "serviceName": "kakaotalk",
            "columnName": "talk_user_id",
            "id": %s    
        }
    ],
    "steps": [
      [
        {
          "label": "talk_friend_long_term_agg", 
          "direction": "out", 
          "offset": 0, 
          "limit": %d
        }
      ], 
      [
        {
          "label": "talk_friend_long_term_agg", 
          "direction": "out", 
          "offset": 0, 
          "limit": %d
        }
      ]
    ]
}
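
| number of rest server | vuser | first step limit | second step limit | tps | latency |
|---|---|---|---|---|---|
| 1 | 30 | 10 | 10 | 2,008.2 | 14.7 ms |
| 1 | 30 | 10 | 20 | 1,221.3 | 24.13 ms |
| 1 | 30 | 10 | 40 | 678 | 43.92 ms |
| 1 | 30 | 10 | 60 | 488.2 | 60.72 ms |
| 1 | 30 | 10 | 80 | 360.2 | 82.55 ms |
| 1 | 30 | 10 | 100 | 312.1 | 94.7 ms |
| 1 | 20 | 10 | 100 | 297 | 66.73 ms |
| 1 | 10 | 10 | 100 | 302 | 32.86 ms |
| 1 | 30 | 20 | 10 | 1,163.3 | 25.5 ms |
| 1 | 30 | 20 | 20 | 645.9 | 45.79 ms |
| 1 | 30 | 40 | 10 | 618.4 | 47.96 ms |
| 1 | 30 | 60 | 10 | 448.9 | 66.16 ms |
| 1 | 30 | 80 | 10 | 339.3 | 87.82 ms |
| 1 | 30 | 100 | 10 | 272.5 | 108.65 ms |
| 1 | 20 | 100 | 10 | 288.5 | 68.34 ms |
| 1 | 10 | 100 | 10 | 261.4 | 37.49 ms |
| 2 | 60 | 100 | 10 | 412.9 | 143.83 ms |
| 4 | 120 | 100 | 10 | 791.7 | 150.06 ms |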

3. three step query

{
    "srcVertices": [
        {
            "serviceName": "kakaotalk",
            "columnName": "talk_user_id",
            "id": %s    
        }
    ],
    "steps": [
      [
        {
          "label": "talk_friend_long_term_agg", 
          "direction": "out", 
          "offset": 0, 
          "limit": %d
        }
      ], 
      [
        {
          "label": "talk_friend_long_term_agg", 
          "direction": "out", 
          "offset": 0, 
          "limit": %d
        }
      ],
      [
        {
          "label": "talk_friend_long_term_agg", 
          "direction": "out", 
          "offset": 0, 
          "limit": %d
        }
      ]
    ]
}
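
| number of rest server | vuser | first step limit | second step limit | third step limit | tps | latency |
|---|---|---|---|---|---|---|
| 1 | 30 | 10 | 10 | 10 | 250.2 | 118.86 ms |
| 1 | 30 | 10 | 10 | 20 | 90.4 | 329.46 ms |
| 1 | 20 | 10 | 10 | 20 | 83.2 | 238.42 ms |
| 1 | 10 | 10 | 10 | 20 | 82.6 | 120.16 ms |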

Analytics