HugeGraphServer 除了上一节提到的遍历(traverser)方法,还提供了一类专门做推荐的方法,我们称为rank API
, 可在图中为一个点推荐与其关系密切的其它点。
Personal Rank 算法典型场景是用于推荐应用中, 根据某个点现有的出边, 推荐具有相近 / 相同关系的其他点, 比如根据某个人的阅读记录 / 习惯, 向它推荐其他可能感兴趣的书, 或潜在的书友, 举例如下:
a,b,c,d,e
5本书, 我们的想给 tom 推荐一些书友, 以及一些书, 最容易的想法就是看看还有哪些人喜欢过这些书 (共同兴趣)b,d,f
3本书, 以及 jay, 它喜欢 c,d,e,g
4本书, lee 它喜欢 a,d,e,f
4本书上面是一个简单的例子, 这里再提供一个公开的 1MB 测试数据集 MovieLens 为例, 用户需下载该数据集,然后使用 HugeGraph-Loader 导入到 HugeGraph 中,简单起见,数据中顶点 user 和 movie 的属性都忽略,仅使用 id 字段即可,边 rating 的具体评分值也忽略。loader 使用的元数据 文件和输入源映射文件内容如下:
//////////////////////////////////////////////////////////// // UserID::Gender::Age::Occupation::Zip-code // MovieID::Title::Genres // UserID::MovieID::Rating::Timestamp //////////////////////////////////////////////////////////// // Define schema schema.propertyKey("id").asInt().ifNotExist().create(); schema.propertyKey("rate").asInt().ifNotExist().create(); schema.vertexLabel("user") .properties("id") .primaryKeys("id") .ifNotExist() .create(); schema.vertexLabel("movie") .properties("id") .primaryKeys("id") .ifNotExist() .create(); schema.edgeLabel("rating") .sourceLabel("user") .targetLabel("movie") .properties("rate") .ifNotExist() .create();
{ "vertices": [ { "label": "user", "input": { "type": "file", "path": "users.dat", "format": "TEXT", "delimiter": "::", "header": ["UserID", "Gender", "Age", "Occupation", "Zip-code"] }, "ignored": ["Gender", "Age", "Occupation", "Zip-code"], "mapping": { "UserID": "id" } }, { "label": "movie", "input": { "type": "file", "path": "movies.dat", "format": "TEXT", "delimiter": "::", "header": ["MovieID", "Title", "Genres"] }, "ignored": ["Title", "Genres"], "mapping": { "MovieID": "id" } } ], "edges": [ { "label": "rating", "source": ["UserID"], "target": ["MovieID"], "input": { "type": "file", "path": "ratings.dat", "format": "TEXT", "delimiter": "::", "header": ["UserID", "MovieID", "Rating", "Timestamp"] }, "ignored": ["Timestamp"], "mapping": { "UserID": "id", "MovieID": "id", "Rating": "rate" } } ] }
注意将映射文件中
input.path
的值修改为自己本地的路径。
适用于二分图,给出所有源顶点相关的其他顶点及其相关性组成的列表。
二分图:也称二部图,是图论里的一种特殊模型,也是一种特殊的网络流。其最大的特点在于,可以将图里的顶点分为两个集合,两个集合之间的点有边相连,但集合内的点之间没有直接关联。
假设有一个用户和物品的二分图,基于随机游走的 PersonalRank 算法步骤如下:
rating
来查找共同的打分人:必填项:
选填项:
0.85
10000
5
BOTH_LABEL
100
0.0001
(后续实现)true
POST http://localhost:8080/graphs/hugegraph/traversers/personalrank
{ "source": "1:1", "label": "rating", "alpha": 0.6, "max_depth": 15, "with_label": "OTHER_LABEL", "sorted": true, "limit": 10 }
200
{ "2:2858": 0.0005014026017816927, "2:1196": 0.0004336708357653617, "2:1210": 0.0004128083140214213, "2:593": 0.00038117341069881513, "2:480": 0.00037005373269728036, "2:1198": 0.000366641614652057, "2:2396": 0.0003622362410538888, "2:2571": 0.0003593312457300953, "2:589": 0.00035922123055598566, "2:110": 0.0003466135844390885 }
两类不同顶点连接形成的二分图中,给某个点推荐相关性最高的其他顶点,例如:
public class Loader { public static void main(String[] args) { HugeClient client = new HugeClient("http://127.0.0.1:8080", "hugegraph"); SchemaManager schema = client.schema(); schema.propertyKey("name").asText().ifNotExist().create(); schema.vertexLabel("person") .properties("name") .useCustomizeStringId() .ifNotExist() .create(); schema.vertexLabel("movie") .properties("name") .useCustomizeStringId() .ifNotExist() .create(); schema.edgeLabel("follow") .sourceLabel("person") .targetLabel("person") .ifNotExist() .create(); schema.edgeLabel("like") .sourceLabel("person") .targetLabel("movie") .ifNotExist() .create(); schema.edgeLabel("directedBy") .sourceLabel("movie") .targetLabel("person") .ifNotExist() .create(); GraphManager graph = client.graph(); Vertex O = graph.addVertex(T.label, "person", T.id, "O", "name", "O"); Vertex A = graph.addVertex(T.label, "person", T.id, "A", "name", "A"); Vertex B = graph.addVertex(T.label, "person", T.id, "B", "name", "B"); Vertex C = graph.addVertex(T.label, "person", T.id, "C", "name", "C"); Vertex D = graph.addVertex(T.label, "person", T.id, "D", "name", "D"); Vertex E = graph.addVertex(T.label, "movie", T.id, "E", "name", "E"); Vertex F = graph.addVertex(T.label, "movie", T.id, "F", "name", "F"); Vertex G = graph.addVertex(T.label, "movie", T.id, "G", "name", "G"); Vertex H = graph.addVertex(T.label, "movie", T.id, "H", "name", "H"); Vertex I = graph.addVertex(T.label, "movie", T.id, "I", "name", "I"); Vertex J = graph.addVertex(T.label, "movie", T.id, "J", "name", "J"); Vertex K = graph.addVertex(T.label, "person", T.id, "K", "name", "K"); Vertex L = graph.addVertex(T.label, "person", T.id, "L", "name", "L"); Vertex M = graph.addVertex(T.label, "person", T.id, "M", "name", "M"); O.addEdge("follow", A); O.addEdge("follow", B); O.addEdge("follow", C); D.addEdge("follow", O); A.addEdge("follow", B); A.addEdge("like", E); A.addEdge("like", F); B.addEdge("like", G); B.addEdge("like", H); C.addEdge("like", I); C.addEdge("like", J); E.addEdge("directedBy", K); F.addEdge("directedBy", B); F.addEdge("directedBy", L); G.addEdge("directedBy", M); } }
在一般图结构中,找出每一层与给定起点相关性最高的前 N 个顶点及其相关度,用图的语义理解就是:从起点往外走, 走到各层各个顶点的概率。
POST http://localhost:8080/graphs/hugegraph/traversers/neighborrank
{ "source":"O", "steps":[ { "direction":"OUT", "labels":[ "follow" ], "max_degree":-1, "top":100 }, { "direction":"OUT", "labels":[ "follow", "like" ], "max_degree":-1, "top":100 }, { "direction":"OUT", "labels":[ "directedBy" ], "max_degree":-1, "top":100 } ], "alpha":0.9, "capacity":-1 }
200
{ "ranks": [ { "O": 1 }, { "B": 0.4305, "A": 0.3, "C": 0.3 }, { "G": 0.17550000000000002, "H": 0.17550000000000002, "I": 0.135, "J": 0.135, "E": 0.09000000000000001, "F": 0.09000000000000001 }, { "M": 0.15795, "K": 0.08100000000000002, "L": 0.04050000000000001 } ] }
为给定的起点在不同的层中找到最应该推荐的顶点。