
HugeGraph Spark Connector


HugeGraph Spark Connector is a Spark connector for reading and writing HugeGraph data in standard Spark DataFrame format.

Building

Required:

  • Java 8+
  • Maven 3.6+

To build without executing tests:

mvn clean package -DskipTests

To build and run the default tests:

mvn clean package

How to use

Suppose we have a graph whose schema is defined as follows:

Schema

schema.propertyKey("name").asText().ifNotExist().create()
schema.propertyKey("age").asInt().ifNotExist().create()
schema.propertyKey("city").asText().ifNotExist().create()
schema.propertyKey("weight").asDouble().ifNotExist().create()
schema.propertyKey("lang").asText().ifNotExist().create()
schema.propertyKey("date").asText().ifNotExist().create()
schema.propertyKey("price").asDouble().ifNotExist().create()

schema.vertexLabel("person")
        .properties("name", "age", "city")
        .useCustomizeStringId()
        .nullableKeys("age", "city")
        .ifNotExist()
        .create()

schema.vertexLabel("software")
        .properties("name", "lang", "price")
        .primaryKeys("name")
        .ifNotExist()
        .create()

schema.edgeLabel("knows")
        .sourceLabel("person")
        .targetLabel("person")
        .properties("date", "weight")
        .ifNotExist()
        .create()

schema.edgeLabel("created")
        .sourceLabel("person")
        .targetLabel("software")
        .properties("date", "weight")
        .ifNotExist()
        .create()

Then we can insert graph data through Spark. First, add the dependency to your pom.xml:

<dependency>
    <groupId>org.apache.hugegraph</groupId>
    <artifactId>hugegraph-spark-connector</artifactId>
    <version>${revision}</version>
</dependency>
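For sbt-based projects, the same Maven coordinates can presumably be used; the version string below is a placeholder, not a published release number:

```scala
// Hypothetical sbt equivalent of the Maven dependency above.
// Replace "<connector-version>" with the actual released version.
libraryDependencies += "org.apache.hugegraph" % "hugegraph-spark-connector" % "<connector-version>"
```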

Vertex Sink

val df = sparkSession.createDataFrame(Seq(
  Tuple3("marko", 29, "Beijing"),
  Tuple3("vadas", 27, "HongKong"),
  Tuple3("Josh", 32, "Beijing"),
  Tuple3("peter", 35, "ShangHai"),
  Tuple3("li,nary", 26, "Wu,han"),
  Tuple3("Bob", 18, "HangZhou")
)).toDF("name", "age", "city")

df.show()

df.write
  .format("org.apache.hugegraph.spark.connector.DataSource")
  .option("host", "127.0.0.1")
  .option("port", "8080")
  .option("graph", "hugegraph")
  .option("data-type", "vertex")
  .option("label", "person")
  .option("id", "name")
  .option("batch-size", 2)
  .mode(SaveMode.Overwrite)
  .save()
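The person label uses a customized string id, which is why the id option above maps the name column to the vertex id. For a label whose id policy is PRIMARY_KEY, such as software in the schema above, the id option must be left out and the primary-key column is simply included in the DataFrame. A sketch under that assumption, with invented sample rows:

```scala
val sw = sparkSession.createDataFrame(Seq(
  Tuple3("lop", "java", 328.0),
  Tuple3("ripple", "java", 199.0)
)).toDF("name", "lang", "price")

sw.write
  .format("org.apache.hugegraph.spark.connector.DataSource")
  .option("host", "127.0.0.1")
  .option("port", "8080")
  .option("graph", "hugegraph")
  .option("data-type", "vertex")
  .option("label", "software")  // id policy is PRIMARY_KEY, so no "id" option
  .option("batch-size", 2)
  .mode(SaveMode.Overwrite)
  .save()
```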

Edge Sink

val df = sparkSession.createDataFrame(Seq(
  Tuple4("marko", "vadas", "20160110", 0.5),
  Tuple4("peter", "Josh", "20230801", 1.0),
  Tuple4("peter", "li,nary", "20130220", 2.0)
)).toDF("source", "target", "date", "weight")

df.show()

df.write
  .format("org.apache.hugegraph.spark.connector.DataSource")
  .option("host", "127.0.0.1")
  .option("port", "8080")
  .option("graph", "hugegraph")
  .option("data-type", "edge")
  .option("label", "knows")
  .option("source-name", "source")
  .option("target-name", "target")
  .option("batch-size", 2)
  .mode(SaveMode.Overwrite)
  .save()

Configs

Client Configs are used to configure hugegraph-client.

Client Configs

| Params | Default Value | Description |
|--------------------|---------------|-------------|
| host | localhost | Address of HugeGraphServer |
| port | 8080 | Port of HugeGraphServer |
| graph | hugegraph | Graph space name |
| protocol | http | Protocol for sending requests to the server, optional `http` or `https` |
| username | null | Username of the current graph when HugeGraphServer enables permission authentication |
| token | null | Token of the current graph when HugeGraphServer enables permission authentication |
| timeout | 60 | Timeout (seconds) for inserting results to return |
| max-conn | CPUS * 4 | The maximum number of HTTP connections between HugeClient and HugeGraphServer |
| max-conn-per-route | CPUS * 2 | The maximum number of HTTP connections for each route between HugeClient and HugeGraphServer |
| trust-store-file | null | The client's certificate file path when the request protocol is https |
| trust-store-token | null | The client's certificate password when the request protocol is https |
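Client configs are supplied the same way as the data configs, as options on the DataFrame writer; the examples above already pass host, port and graph this way. A hedged sketch of writing over https with token authentication, assuming the remaining client params are accepted as writer options too (host, token and paths below are placeholders):

```scala
df.write
  .format("org.apache.hugegraph.spark.connector.DataSource")
  .option("host", "hugegraph.example.com")
  .option("port", "8443")
  .option("graph", "hugegraph")
  // client configs from the table above, passed as writer options
  .option("protocol", "https")
  .option("token", "<your-token>")
  .option("trust-store-file", "/path/to/truststore")
  .option("trust-store-token", "<truststore-password>")
  .option("data-type", "vertex")
  .option("label", "person")
  .option("id", "name")
  .mode(SaveMode.Overwrite)
  .save()
```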
Graph Data Configs

Graph Data Configs are used to set graph space configuration.

| Params | Default Value | Description |
|-----------------|---------------|-------------|
| data-type | | Graph data type, must be `vertex` or `edge` |
| label | | Label to which the vertex/edge data to be imported belongs |
| id | | Specify a column as the id column of the vertex. Required when the vertex id policy is CUSTOMIZE; must be empty when the id policy is PRIMARY_KEY |
| source-name | | Select one or more columns of the input source as the id columns of the source vertex. When the id policy of the source vertex is CUSTOMIZE, exactly one column must be specified as the vertex id; when it is PRIMARY_KEY, one or more columns must be specified for splicing the generated vertex id. This option is required regardless of the id policy |
| target-name | | Specify one or more columns as the id columns of the target vertex, analogous to source-name |
| selected-fields | | Select only some columns to insert; unselected columns are not inserted. Cannot be used together with ignored-fields |
| ignored-fields | | Ignore some columns so that they do not participate in insertion. Cannot be used together with selected-fields |
| batch-size | 500 | The number of data items in each batch when importing data |

Common Configs

Common Configs contains some common configurations.

| Params | Default Value | Description |
|-----------|---------------|-------------|
| delimiter | , | Separator of `source-name`, `target-name`, `selected-fields` or `ignored-fields` |
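Multi-valued options such as selected-fields are split on this delimiter (default `,`). A hedged sketch that writes created edges (targeting the software label, whose primary key is the name column) and selects only the date and weight property columns for insertion; the sample rows are invented, and it is an assumption here that selected-fields applies to the property columns:

```scala
val created = sparkSession.createDataFrame(Seq(
  Tuple4("marko", "lop", "20171210", 0.4),
  Tuple4("Josh", "ripple", "20171210", 1.0)
)).toDF("source", "target", "date", "weight")

created.write
  .format("org.apache.hugegraph.spark.connector.DataSource")
  .option("host", "127.0.0.1")
  .option("port", "8080")
  .option("graph", "hugegraph")
  .option("data-type", "edge")
  .option("label", "created")
  .option("source-name", "source")
  .option("target-name", "target")
  // "date,weight" is split on the default delimiter ","
  .option("selected-fields", "date,weight")
  .mode(SaveMode.Overwrite)
  .save()
```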

License

Like HugeGraph itself, hugegraph-spark-connector is licensed under the Apache 2.0 License.