commit | a38a180358e9277f489dc599524e135118b448ef | [log] [tgz] |
---|---|---|
author | SteamShon <shom83@gmail.com> | Wed May 20 23:12:06 2015 +0900 |
committer | SteamShon <shom83@gmail.com> | Wed May 20 23:12:06 2015 +0900 |
tree | 172b9e659d6b9c0cdfee3263378e8f862fa89790 | |
parent | 8767c362a7f27cbc8a32518860e6c83dcc8a0e39 [diff] |
fix test errors. https://github.com/daumkakao/s2graph/issues/10
s2graph is a GraphDB that stores big data using edges and vertices, and also serves REST APIs for querying information on its edges and vertices. It provide fully asynchronous, non-blocking API. This document defines terms and concepts used in s2graph and describes its REST API.
Table of Contents generated with DocToc
POST /graphs/createService
POST /graphs/createLabel
POST /graphs/addIndex
S2Graph consists of multiple projects.
to getup and running following is required.
brew install hadoop
and brew install hbase
if you are on mac. 3. otherwise checkout reference for how to setup hbase. 4. note that currently we support latest stable version of apache hbase 1.0.1 with apache hadoop version 2.7.0. if you are using cdh, then you can checkout our feature/cdh5.3.0. we are working on providing profile on hbase/hadoop version soon.conf/reference.conf
- db.defaut.[url, user, password] 7. because of license issue, we are working on change this to Derbybrew install protobuf
if you are on mac. 6. otherwise install 2.6.1(feature/cdh5.3.0 branch, expect protobuf 2.5.0)once all requirements are setup correctly, you have to install asynchbase on your local first.
cd asynchbase; make pom.xml; mvn install
then compile rest project
sbt compile
now you are run s2graph.
sbt run
we provide simple script under script/test.sh only for checking if everything setup property.
sh script/test.sh
finally join the mailing list
There are four important abstractions that define the data model used throughout s2graph: services, columns, labels and properties.
Services, the top level abstraction, are like databases in traditional RDBMS in which all data are contained. A service usually represents one of the company's real services and is named accordingly, e.g. "KakaoTalk"
, "KakaoStory"
.
Columns define the type of vertices and a service can have multiple columns. For example, service "KakaoMusic"
can have columns "user_id"
and "track_id"
. While columns can be compared to tables in traditional RDBMS in a sense, labels are the primary abstraction for representing schemas and columns are usually referenced only using their names.
Labels, represent relations between two columns, therefore representing the type of edges. The two columns can be the same, e.g. for a label representing friendships in an SNS, the two column will both be "user_id"
of the service. There can be labels connecting two columns from two different services; for example, one can create a label that stores all events where KakaoStory posts are shared to KakaoTalk.
Properties, are metadata linked to vertices or edges that can be queried upon later. For vertices representing KakaoTalk users, estimated_birth_year
is a possible property, and for edges representing similar KakaoMusic songs their cosine_similarity
can be a property.
Using these abstractions, a unique vertex can be identified with its (service, column, vertex id)
, and a unique edge can be identified with its (service, label, source vertex id, target vertex id)
. Additional information on edges and vertices are stored within their own properties.
The following is a non-exhaustive list of commonly used s2graph APIs and their examples. The full list of the latest REST API can be found in the routes file.
POST /graphs/createService
see following to see what you can set with this API.
To create a Service, the following fields needs to be specified in the request.
field name | definition | data type | example | note |
---|---|---|---|---|
serviceName | name of user defined namespace. | string | “talk_friendship” | required. |
cluster | zookeeper quorum address for your cluster. | string | “abc.com:2181,abd.com:2181” | optional. default value is “hbase.zookeeper.quorum” on your application.conf. if there is no value for “hbase.zookeeper.quorum” is defined on application.conf, then default value is “localhost” |
hTableName | physical HBase table name. | string | “test” | optional. default is serviceName-#{phase}. phase is either dev/real/alpha/sandbox |
hTableTTL | global time to keep the data alive. | integer | 86000 | optional. default is infinite. |
preSplitSize | number of pre split for HBase table. | integer | 20 | optional. default is 0(no pre-split) |
Service is the top level abstraction in s2graph which can be considered like a database in RDBMS. You can create a service using this API:
curl -XPOST localhost:9000/graphs/createService -H 'Content-Type: Application/json' -d ' {"serviceName": "s2graph", "cluster": "address for zookeeper", "hTableName": "hbase table name", "hTableTTL": 86000, "preSplitSize": # of pre split} '
note that optional value for your service is only advanced users only. stick to default if you don`t know what you are doing.
You can also look up all labels corresponding to a service.
curl -XGET localhost:9000/graphs/getLabels/:serviceName
POST /graphs/createLabel
A label represents a relation between two columns, and plays a role like a table in RDBMS since labels contain the schema information, i.e. what type of data will be collected and what among them needs to be indexed for efficient retrieval. In most scenario, defining a schema on vertices is pretty straightforward but defining a schema on edges requires a little effort. Think about queries you will need first, and then model user's actions/relations as edges to design a label.
To create a Label, the following fields needs to be specified in the request.
field name | definition | data type | example | note |
---|---|---|---|---|
label | name of this relation; be specific. | string | “talk_friendship” | required. |
srcServiceName | source column's service | string | “kakaotalk” | required. |
srcColumnName | source column's name | string | “user_id” | required. |
srcColumnType | source column's data type | long/integer/string | “string” | required. |
tgtServiceName | target column's service | string | “kakaotalk”/“kakaoagit” | same as srcServiceName when not specified |
tgtColumnName | target column's name | string | “item_id” | required. |
tgtColumnType | target column's data type | long/integer/string | “long” | required. |
indexProps | mapping from indexed properties' names to their default values. indexed properties will be primary index for this label (like PRIMARY INDEX idx_xxx (p1, p2 )`, in RDBMS.note that _timestamp, _from, _to is reserved property | json dictionary | {“timestamp”:0, “affinity_score”:10, “play_count”:0} | A default value must be provided for each property. The default value is usually the minimum value permitted for the property. When this filed is empty, the default property named timestamp will be automatically added and indexed. The value's type can be one of long/int/bool/byte and cannot be float. If your property is float type, you need to convert them to long/int first. |
props | mapping from non-indexed properties' names to their default values. these properties are indexed and therefore cannot be used efficiently for querying, like non-indexed columns in RDBMS | json dictionary | {“is_hidden”: false, “country_iso”: “kr”, “country_code”: 82} | non-indexed properties can be added later, like alter table add column in RDBMS |
isDirected | if this label is directed or undirected | true/false | true/false | default true |
serviceName | which service this label is belongs to. | either srcServiceName or tgtServiceName | s2graph | default tgtServiceName |
hTableName | if this label need special usecase(such as batch upload), own hbase table name can be used. | string | s2graph-batch | default use service`s hTableName. note that this is optional. |
hTableTTL | time to data keep alive. | integer | 86000 | default use service`s hTableTTL. note that this is optional. |
consistencyLevel | if this is strong, only one edge between same from/to can be made. otherwise(weak) multiple edges with same from/to can be exist. | string | strong/weak | default weak |
Note. following property names are reserved for system. user can not create property same with these reserved property names. user can use this properties for indexProps/props/where clause on query.
- _timestamp is reserved for system wise timestamp. this can be interpreted as last_modified_at
- _from is reserved for label`s start vertex.
- _to is reserved for
The following is an example that creates a label named graph_test
, which represents the relation between account_id
in service named s2graph
and account_id
in the same service, with indexed properties timestamp
and affinity_score
which both have the zero default value.
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d ' { "label": "graph_test", "srcServiceName": "s2graph", "srcColumnName": "account_id", "srcColumnType": "long", "tgtServiceName": "s2graph", "tgtColumnName": "item_id", "tgtColumnType": "long", "indexProps": { "time": 0, "weight": 0 }, "props": { "is_hidden": true, "is_blocked": true, "error_code": 500 }, "serviceName": "s2graph", "consistencyLevel": "strong" } '
Here is another example that creates a label named kakao_group_join
label between column account_id
of service kakao
and column group_id
of service kakaogroup
. Note that the default indexed property timestamp
will be created since the indexedProps
field is empty.
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d ' { "label": "kakao_group_join", "srcServiceName": "kakao", "srcColumnName": "account_id", "srcColumnType": "long", "tgtServiceName": "kakaogroup", "tgtColumnName": "group_id", "tgtColumnType": "string", "indexProps": {}, "serviceName": "kakaogroup", "props": {} } '
The following query will return the information regarding a label, graph_test
in this case.
curl -XGET localhost:9000/graphs/getLabel/graph_test
You can delete a label using the following API:
curl -XPUT localhost:9000/graphs/deleteLabel/graph_test
To add a new non-indexed property, use the following API:
curl -XPOST localhost:9000/graphs/addProp/graph_test -H 'Content-Type: Application/json' -d ' {"name": "is_blocked", "defaultValue": false, "dataType": "boolean", "usedInIndex": false} '
One last important constraint on label is consistency level.
This define how to store edges on storage level. note that query is completely independent with this.
To explain consistency, s2graph defined edge uniquely with their (from, label, to) triple. s2graph call this triple as unique edge key.
following example is used to explain differences between strong/weak consistency level.
1418950524721 insert e 1 101 graph_test {"weight": 10} = (1, graph_test, 101) 1418950524723 insert e 1 101 graph_test {"weight": 20} = (1, graph_test, 101)
currently there are two consistency level
1. strong
make sure there is only one edge stored in storage between same edge key((1, graph_test, 101) above). with strong consistency level, last command overwrite previous command.
2. weak
no consistency check on unique edge key. above example yield two different edge stored in storage with different timestamp and weight value.
for example, with each configuration, following edges will be stored.
assumes that only timestamp is used as indexProps and user inserts following.
u1 -> (t1, v1) u1 -> (t2, v2) u1 -> (t3, v2) u1 -> (t4, v1)
with strong consistencyLevel following is what to be stored.
u1 -> (t4, v1), (t3, v2)
note that u1 -> (t1, v1), (t2, v2) are not exist.
with weak consistencyLevel.
u1 -> (t4, v1), (t3, v2), (t2, v2), (t1, v1)
Reason weak consistency is default.
most case edges related to user`s activity should use weak consistencyLevel since there will be no concurrent update on same edges. strong consistencyLevel is only for edges expecting many concurrent updates.
Consistency level also determine how edges will be stored in storage when command is delivered reversely by their timestamp.
with strong consistencyLevel following is guaranteed.
natural event on (1, graph_test, 101) unique edge key is following.
1418950524721 insert e 1 101 graph_test {"is_blocked": false} 1418950524722 delete e 1 101 graph_test 1418950524723 insert e 1 101 graph_test {"is_hidden": false, "weight": 10} 1418950524724 update e 1 101 graph_test {"time": 1, "weight": -10} 1418950524726 update e 1 101 graph_test {"is_blocked": true}
even if above commands arrive in not in order, strong consistency make sure same eventual state on (1, graph_test, 101).
1418950524726 update e 1 101 graph_test {"is_blocked": true} 1418950524723 insert e 1 101 graph_test {"is_hidden": false, "weight": 10} 1418950524722 delete e 1 101 graph_test 1418950524721 insert e 1 101 graph_test {"is_blocked": false} 1418950524724 update e 1 101 graph_test {"time": 1, "weight": -10}
There are many cases that commands arrive in not in order.
- client servers are distributed and each client issue command asynchronously.
- client servers are distributed and grouped commands.
- by using kafka queue, global ordering or message is not guaranteed.
Following is what s2graph do to make strong consistency level.
complexity = O(one read) + O(one delete) + O(2 put) fetchedEdge = fetch edge with (1, graph_test, 101) from lookup table. if fetchedEdge is not exist: create new edge same as current insert operation update lookup table as current insert operation else: valid = compare fetchedEdge vs current insert operation. if valid: delete fetchedEdge create new edge after comparing fetchedEdge and current insert. update lookup table
Limitation Since we write our data to HBase asynchronously, there is no consistency guarantee on same edge within our flushInterval(1 seconds).
POST /graphs/addIndex
A label can have multiple indexed properties, or (for brevity) indexes. When queried, returned edges' order is determined according to indexes, indexes essentially defines what will be included in the topK query.
Edge retrieval queries in s2graph by default returns topK edges. Clients must issue another query to fetch the next K edges, i.e., topK ~ 2topK.
Internally, s2graph stores edges sorted according to the indexes in order to limit the number of edges to fetch in one query. If no ordering is given, s2graph will use the timestamp as an index, thus resulting in the most recent data.
It is impossible to fetch millions of edges and sort them on-line to get topK in less than a second. s2graph uses vertex-centric indexes to avoid this.
using vertex-centric index, having millions of edges is fine as long as the topK value is reasonable (~ 1K) Note that indexes must be created before putting any data on this label (just like RDBMS).
New indexes can be dynamically added, but it will not be applied to existing data(planned in future versions). the number of indexes on a label is currently limited to 8.
The following is an example of adding indexes play_count
and pay_amount
to a label named graph_test
.
curl -XPOST localhost:9000/graphs/addIndex -H 'Content-Type: Application/json' -d ' {"label": "graph_test", "indexProps": {"play_count":0, "pay_amount":0}} '
##3. Insert and Manipulate Edges ##
An edge represents a relation between two vertices, with properties according to the schema defined in its label. The following fields need to be specified when inserting an edge, and are returned when queried on edges.
field name | definition | data type | note | example |
---|---|---|---|---|
timestamp | when this request is issued. | long | required. in millis since the epoch. It is important to use millis, since TTL support is in millis. | 1430116731156 |
operation | insert/delete/update/increment | string | required only for bulk operation; aliases are insert: i, delete:d, update: u, increment: in, default is insert. | “i”, “insert” |
from | Id of start vertex. | long/string | required. prefer long if possible. maximum string bytes length < 249 | 1 |
to | Id of end vertex. | long/string | required. prefer long if possible. maximum string bytes length < 249 | 101 |
label | name the corresponding label | string | required. | “graph_test” |
direction | direction of this relation, one of out/in/undirected | string | required. alias are out: o, in: i, undirected: u | “out” |
props | extra properties of this edge. | json dictionary | required. all indexed properties should be present, otherwise the default values will be added. Non-indexed properties can also be present | {“timestamp”: 1417616431, “affinity_score”:10, “is_hidden”: false, “is_valid”: true} |
POST /graphs/edges/insert
insert have different behavior according to label`s consistency level.
strong consistency level(default): 1 READ + (1 DELETE+ 1 PUT, optional) insert is equal to upsert. s2graph check if unique edge key exist, then if there is edge with same unique edge key, run validation then decide apply current request or drop it.
weak consistency level: 2 PUT no consistency check on unique edge key, insert same edge key multiple times can possibly yield multiple edges.
For consistency reasons, graph databases typically go through the following three steps to insert an edge between a source vertex to a target vertex with some metadata:
- fetch the source vertex to make sure it exists
- fetch the target vertex to make sure it exists
- insert an edge with the metadata on from -> to
Unlike other graph databases like Titan where server-generated vertex ids must be used, any user-defined vertex ids can be used in s2graph. Therefore s2graph will not fetch vertex data during the insert operation, making it one simple write to the underlying database.
This means that you don‘t have to create source and target vertices prior to inserting edges, if you don’t need any properties on vertices(i.e., you only need vertex id). In this case, s2graph will not fetch vertex information from the underlying db, therefore no read operation is required.
The following is an example inserting edges:
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616431}, {"from":1,"to":102,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616431}, {"from":1,"to":103,"label":"graph_test","props":{"time":1, "weight":12},"timestamp":1417616431}, {"from":1,"to":104,"label":"graph_test","props":{"time":-2, "weight":1},"timestamp":1417616431} ] '
POST /graphs/edges/delete
You can also delete edges.
Note that if the timestamp in a delete request is larger (later) than the actual timestamp of the edge, the delete request will be ignored.
The following is an example deleting edges.
curl -XPOST localhost:9000/graphs/edges/delete -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":102,"label":"graph_test","timestamp":1417616432}, {"from":1,"to":103,"label":"graph_test","timestamp":1417616432} ] '
POST /graphs/edges/update
An update request on edges will overwrite properties of the corresponding edge.
This is not an upsert operation and a corresponding edge must exist for update operation. Update operations on nonexistent edges will be ignored.
Also remember that previous data stored in the edge is overwritten.
The following is an example updating properties of an edge, first setting is_hidden
property to be true and then setting weight
property to be 100.
curl -XPOST localhost:9000/graphs/edges/update -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":104,"label":"graph_test","timestamp":1417616433, "props": {"is_hidden":true}}, {"from":1,"to":104,"label":"graph_test","timestamp":1417616434, "props": {"weight":100}} ] '
POST /graphs/edges/increment
You can add a certain value to edges' indexed properties. Negative numbers can be used to subtract some value from the properties. Increment operations are only supported for indexed properties.
you don't have to insert an edge prior to its increment operation. If the edge corresponding to an increment request is not found, a new edge filled with the default property values (provided when defining the label) will be automatically created.
The following is an example incrementing edges' properties.
curl -XPOST localhost:9000/graphs/edges/increment -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616435}, {"from":1,"to":102,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616435}, {"from":1,"to":103,"label":"graph_test","props":{"time":1, "weight":12},"timestamp":1417616435}, {"from":1,"to":104,"label":"graph_test","props":{"time":-2, "weight":1},"timestamp":1417616435} ] '
POST /graphs/edges/insertBulk
insert edges without checking consistency
.
The following is an example inserting edges:
curl -XPOST localhost:9000/graphs/edges/insertBulk -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616431}, {"from":1,"to":101,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616432} ] '
Vertices are the two ends that an edge is connecting, and correspond to a column defined for a service. In case you need to store some metadata corresponding to vertices and make queries regarding them, you can insert and manipulate vertices rather than edges.
Unlike edges and their labels, properties on vertices are not indexed and do not require a predefined schema nor default values. The following fields are used when operating on vertices.
field name | definition | data type | note | example |
---|---|---|---|---|
timestamp | long | required. in seconds since the epoch | 1417616431 | |
operation | the operation to perform; one of insert, delete, update, increment | string | required only for bulk operations; alias are insert: i, delete:d, update: u, increment: in, default is insert. | “i”, “insert” |
serviceName | corresponding service's name | “string” | required. | “kakaotalk”/“kakaogroup” |
columnName | corresponding column's name | string | required. | “xxx_service_ user_id” |
id | a unique identifier of this vertex | long/string | required. prefer long if possible. | 101 |
props | extra properties of this vertex. | json dictionary | required. | {“is_active_user”: true, “age”:10, “gender”: “F”, “country_iso”: “kr”} |
POST /graphs/vertices/insert/:serviceName/:columnName
curl -XPOST localhost:9000/graphs/vertices/insert/s2graph/account_id -H 'Content-Type: Application/json' -d ' [ {"id":1,"props":{"is_active":true, "talk_user_id":10},"timestamp":1417616431}, {"id":2,"props":{"is_active":true, "talk_user_id":12},"timestamp":1417616431}, {"id":3,"props":{"is_active":false, "talk_user_id":13},"timestamp":1417616431}, {"id":4,"props":{"is_active":true, "talk_user_id":14},"timestamp":1417616431}, {"id":5,"props":{"is_active":true, "talk_user_id":15},"timestamp":1417616431} ] '
POST /graphs/vertices/delete/:serviceName/:columnName
This operation will delete only the vertex data of a specified column and will not delete all edges connected to those vertices.
Important notes
This means that edges returned by a query can contain deleted vertices. Clients need to check if those vertices are valid.
POST /graphs/vertices/delete/:serviceName/:columnName
This operation will delete all vertex data of a specified column and also delete all edges that are connected to those vertices. Example:
curl -XPOST localhost:9000/graphs/vertices/deleteAll/s2graph/account_id -H 'Content-Type: Application/json' -d ' [{"id": 1, "timestamp": 193829198}] '
This is an extremely expensive operation; The following is a pseudocode showing how this operation works:
vertices = vertex list to delete for vertex in vertices labals = fetch all labels that this vertex is included. for label in labels for index in label.indices edges = G.read with limit 50K for edge in edges edge.delete
The total complexity is O(L * L.I) reads + O(L * L.I * 50K) writes in the worst case. If a vertex to delete has more than 50K edges, the delete operation will not be consistent.
POST /graphs/vertices/update/:serviceName/:columnName
The update operation on vertices uses the same parameters as in the insert operation.
Not yet implemented; stay tuned.
Once you have your graph data uploaded to s2graph, you can traverse your graph using our REST APIs. Queries contain the vertex to start traversing, and list of labels paired with filters and scoring weights used during the traversal. Query requests are structures as follows:
field name | definition | data type | note | example |
---|---|---|---|---|
srcVertices | vertices to start traversing. | json array of json dictionary specifying each vertex, with “serviceName”, “columnName”, “id” fields. | required. | [{"serviceName": "kakao", "columnName": "account_id", "id":1}] |
steps | list of steps for traversing. | json array of steps | explained below | [[{"label": "graph_test", "direction": "out", "limit": 100, "scoring":{"time": 0, "weight": 1}}]] |
removeCycle | when traverse to next step, don`t traverse already visited vertices | true/false. default is true | already visited is defined by following(label, vertex). so if steps are friend -> friend, then remove second depth friends if they exist in first depth friends |
step: Each step define what to traverse in a single hop on the graph. The first step has to be a direct neighbor of the starting vertices, the second step is a direct neighbor of vertices from the first step and so on. A step is specified with a list of query params, hence the steps
field of a query request becoming an array of arrays of dictionaries.
query param:
field name | definition | data type | note | example |
---|---|---|---|---|
label | name of label to traverse. | string | required. must be an existing label. | “graph_test” |
direction | in/out direction to traverse | string | optional, default out | “out” |
limit | how many edges to fetch | int | optional, default 10 | 10 |
offset | start position on this index | int | optional, default 0 | 50 |
interval | the range to filter on indexed properties | json dict | optional | {"from": {"time": 0, "weight": 1}, "to": {"time": 1, "weight": 15}} |
duration | time range | json dict | optional | {"from": 1407616431, "to": 1417616431} |
scoring | a mapping from indexed properties' names to their weights the weighted sum of property values will be the final score. | json dict | optional | {"time": 1, "weight": 2} |
where | filter condition(like sql`s where clause). logical operation(and/or) is supported and each condition can have exact equal(=), sets(in), and range(between x and y). do not use any quotes for string type | string | optional | ex) “((_from = 123 and _to = abcd) or gender = M) and is_hidden = false and weight between 1 and 10 or time in (1, 2, 3)”. note that it only support long/string/boolean type |
outputField | replace edge`s to field with this field in props | string | optional | “outputField”: “service_user_id”. this change to field into props[‘service_user_id’] |
exclude | decide if vertices that appear on this label and different labels in this step should be filtered out | boolean | optional, default false | true, exclude vertices that appear on this label and other labels in this step will be filtered out. |
include | decide if vertices that appear on this label and different labels in this step should be remain in result. | boolean | optional, default false | |
duplicate | policy on how to deal with duplicate edges. duplicate edges means edges with same (from, to, label, direction). | string one of “first”, “sum”, “countSum”, “raw” | optional, default “first” | “first” means only first occurrence of edge survive. “sum” means sums up all scores of same edges but only one edge survive. “countSum” means counts up occurrence of same edges but only one edge survive. “raw” means same edges will be survived as they are. |
rpcTimeout | timeout for this request | integer | optional, default 100ms | note: maximum value should be less than 1000ms |
maxAttempt | how many times client will try to fetch result from HBase | integer | optional, default 1 | note: maximum value should be less than 5 |
edge query provide following 4 APIs. s2graph itself would not provide any business logic dependent query. it would rather provide necessary data to help implementing business logic.
get all edges. flat hierarchy.
{ "size": 2, "results": [ { "from": 1, "to": 88277115755635400, "label": "talk_friend_long_term_agg_by_account_id", "direction": "out", "_timestamp": 1425088498, "props": { "talk_user_id": 41780, "score": 8, "service_user_id": 88277115755635400, "profile_id": 424, "birth_date": 517, "birth_year": 1977, "gender": "F" }, "score": 8 }, { "from": 1, "to": 88300639020224930, "label": "talk_friend_long_term_agg_by_account_id", "direction": "out", "_timestamp": 1425088493, "props": { "talk_user_id": 1545029, "score": 0, "service_user_id": 88300639020224930, "profile_id": 9571562, "birth_date": 605, "birth_year": 1979, "gender": "F" }, "score": 0 } ] }
get all edges with group by edge`s target vertex.
{ "size": 2, "results": [ { "name": "account_id", "id": "88277115755635393", "scoreSum": 8, "aggr": { "name": "account_id", "ids": [ "1" ] } }, { "name": "account_id", "id": "88300639020224928", "scoreSum": 0, "aggr": { "name": "account_id", "ids": [ "1" ] } } ] }
get all edges excluding all edges from srcVertices to last step.
get all edges excluding all edges from srcVertices to last step with group by edge`s target vertex.
get all vertex data.
Example 1. Selecting the first 100 edges of label graph_test
, which start from the vertex with account_id=1
, sorted using the default index of graph_test
.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d ' { "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}], "steps": [ [{"label": "graph_test", "direction": "out", "offset": 0, "limit": 100 }] ] } '
Example 2. Selecting the 50th ~ 100th edges from the same vertex.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d ' { "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}], "steps": [ [{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50}] ] } '
Example 3. Selecting the 50th ~ 100th edges from the same vertex, now with a time range filter.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d ' { "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}], "steps": [ [{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50, "duration": {"from": 1416214118, "to": 1416214218}] ] } '
Example 4. Selecting 50th ~ 100th edges from the same vertex, sorted using the indexed properties time
and weight
, with the same time range filter, and applying weighted sum using time: 1.5, weight: 10
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d ' { "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}], "steps": [ [{"label": "graph_test", "direction": "in", "offset": 50, "limit": 50, "duration": {"from": 1416214118, "to": 1416214218}, "scoring": {"time": 1.5, "weight": 10}] ] } '
Example 5. Selecting 100 edges representing friends
, from the vertex with account_id=1
, and again selecting their 10 friends, therefore selecting at most 1,000 “friends of friends”.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d ' { "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}], "steps": [ [{"label": "friends", "direction": "out", "limit": 100}], [{"label": "friends", "direction": "out", "limit": 10}] ] } '
Example 6. Selecting 100 edges representing friends
and their 10 listened_music
edges, to get “music that my friends have listened to”.
curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: Application/json' -d ' { "srcVertices": [{"serviceName": "s2graph", "columnName": "account_id", "id":1}], "steps": [ [{"label": "talk_friend", "direction": "out", "limit": 100}], [{"label": "play_music", "direction": "out", "limit": 10}] ] } '
Example 1. Selecting all vertices from column account_id
of a service s2graph
.
curl -XPOST localhost:9000/graphs/getVertices -H 'Content-Type: Application/json' -d ' [ {"serviceName": "s2graph", "columnName": "account_id", "ids": [1, 2, 3]}, {"serviceName": "agit", "columnName": "user_id", "ids": [1, 2, 3]} ] '
In many cases, the first step to start using s2graph is to migrate a large dataset into s2graph. s2graph provides a bulk loading script for importing the initial dataset.
To use bulk load, you need running Spark cluster and TSV file with bulk load format.
Note that if you don't need extra properties on vertices(i.e., you only need vertex id), you only need to publish the edges and not the vertices. Publishing edges will effectively create vertices with empty properties.
timestamp | operation | logType | from | to | label | props |
---|---|---|---|---|---|---|
1416236400 | insert | edge | 56493 | 26071316 | talk_friend_long_term_agg_by_account_id | {“timestamp”:1416236400,“score”:0} |
timestamp | operation | logType | id | serviceName | columnName | props |
---|---|---|---|---|---|---|
1416236400 | insert | vertex | 56493 | kakaotalk | account_id | {"is_active":true, "country_iso": "kr"} |
to build bulk loader, you need to build loader project. just run following commend.
`sbt “project loader” “clean” “assembly”
you will see s2graph-loader-assembly-0.0.4-SNAPSHOT.jar under loader/target/scala-2.xx/
For bulk loading, source data can be either in HDFS or Kafka queue.
assumes that data is bulk loading format and constantly comming into Kafka MQ.
following is the way we do online migration from RDBMS to s2graph. assumes that client send same events that goes to primary storage(RDBMS) and s2graph.
find 50 talk friends then find 20 talk friends
{ "srcVertices": [{"serviceName": "kakaotalk", "columnName": "talk_user_id", "id":$id}], "steps": [ [{"label": "talk_friend", "direction": "out", "limit": 50}], [{"label": "talk_friend", "direction": "out", "limit": 20}] ] }
total vuser = 980
number of rest server | tps | mean test time |
---|---|---|
10 | 5,981.5 | 151.36 ms |
20 | 10,589 | 86.45 ms |
30 | 16,295.4 | 56.43 ms |
find 100 talk friends
{ "srcVertices": [{"serviceName": "kakaotalk", "columnName": "talk_user_id", "id":$id}], "steps": [ [{"label": "talk_friend", "direction": "out", "limit": 100}] ] }
total vuser = 2,072
number of rest server | tps | mean test time |
---|---|---|
20 | 53,713.4 | 37.31 ms |
{ "srcVertices": [ { "serviceName": "kakaotalk", "columnName": "talk_user_id", "id": %s } ], "steps": [ [ { "label": "talk_friend_long_term_agg", "direction": "out", "offset": 0, "limit": %d } ] ] }
number of rest server | vuser | offset | first step limit | tps | latency |
---|---|---|---|---|---|
1 | 30 | 0 | 10 | 9790TPS | 3ms |
1 | 30 | 80 | 10 | 9,958.2TPS | 2.91ms |
1 | 30 | 0 | 20 | 7,418.1TPS | 3.92ms |
1 | 30 | 0 | 40 | 5,118.5TPS | 5.72ms |
1 | 30 | 0 | 60 | 3,966.9TPS | 7.38ms |
1 | 30 | 0 | 80 | 3,408.4TPS | 8.58ms |
1 | 30 | 0 | 100 | 3,048.1TPS | 9.76ms |
2 | 60 | 0 | 100 | 5,869.4TPS | 10.04ms |
4 | 120 | 0 | 100 | 11,473.1TPS | 10.27ms |
{ "srcVertices": [ { "serviceName": "kakaotalk", "columnName": "talk_user_id", "id": %s } ], "steps": [ [ { "label": "talk_friend_long_term_agg", "direction": "out", "offset": 0, "limit": %d } ], [ { "label": "talk_friend_long_term_agg", "direction": "out", "offset": 0, "limit": %d } ] ] }
number of rest server | vuser | first step limit | second step limit | tps | latency |
---|---|---|---|---|---|
1 | 30 | 10 | 10 | 2,008.2TPS | 14.7ms |
1 | 30 | 10 | 20 | 1,221.3TPS | 24.13ms |
1 | 30 | 10 | 40 | 678TPS | 43.92ms |
1 | 30 | 10 | 60 | 488.2TPS | 60.72ms |
1 | 30 | 10 | 80 | 360.2TPS | 82.55ms |
1 | 30 | 10 | 100 | 312.1TPS | 94.7ms |
1 | 20 | 10 | 100 | 297TPS | 66.73ms |
1 | 10 | 10 | 100 | 302TPS | 32.86ms |
1 | 30 | 20 | 10 | 1163.3TPS | 25.5ms |
1 | 30 | 20 | 20 | 645.9TPS | 45.79ms |
1 | 30 | 40 | 10 | 618.4TPS | 47.96ms |
1 | 30 | 60 | 10 | 448.9TPS | 66.16ms |
1 | 30 | 80 | 10 | 339.3TPS | 87.82ms |
1 | 30 | 100 | 10 | 272.5TPS | 108.65ms |
1 | 20 | 100 | 10 | 288.5TPS | 68.34ms |
1 | 10 | 100 | 10 | 261.4TPS | 37.49ms |
2 | 60 | 100 | 10 | 412.9TPS | 143.83ms |
4 | 120 | 100 | 10 | 791.7TPS | 150.06ms |
{ "srcVertices": [ { "serviceName": "kakaotalk", "columnName": "talk_user_id", "id": %s } ], "steps": [ [ { "label": "talk_friend_long_term_agg", "direction": "out", "offset": 0, "limit": %d } ], [ { "label": "talk_friend_long_term_agg", "direction": "out", "offset": 0, "limit": %d } ], [ { "label": "talk_friend_long_term_agg", "direction": "out", "offset": 0, "limit": %d } ] ] }
number of rest server | vuser | first step limit | second step limit | third step limit | tps | latency |
---|---|---|---|---|---|---|
1 | 30 | 10 | 10 | 10 | 250.2TPS | 118.86ms |
1 | 30 | 10 | 10 | 20 | 90.4TPS | 329.46ms |
1 | 20 | 10 | 10 | 20 | 83.2TPS | 238.42ms |
1 | 10 | 10 | 10 | 20 | 82.6TPS | 120.16ms |