commit	90899cf22ba2e04c1e783724fe77e71ba1c220af
author	lixueclaire <youli.lx@alibaba-inc.com>	Thu Mar 06 11:20:45 2025 +0800
committer	GitHub <noreply@github.com>	Thu Mar 06 11:20:45 2025 +0800
tree	3936af2b6d21aaece17c7a72c157dfa070de6490
parent	c3f811dcb27a906390980486c45959b6304eaed4

README.md

Artifacts for GraphAr VLDB2025 Submission

[!NOTE]
This branch is provided as artifacts for VLDB2025.
For the latest version of GraphAr, please refer to the branch main.

This repository contains the artifacts for the VLDB2025 paper on GraphAr, including all source code and a guide to reproducing the results presented in the paper:

Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu, Jingren Zhou. GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes. PVLDB, 18(3): 530 - 543, 2024.

Dependencies

GraphAr is developed and tested on Ubuntu 20.04.5 LTS. Building GraphAr requires the following software installed as dependencies:

A C++17-enabled compiler and build-essential tools
CMake 3.16 or higher
curl-devel with SSL (Linux) for S3 filesystem support
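
As a quick sanity check before building, the CMake requirement above can be verified with a few lines of shell. This is a minimal sketch, not part of the GraphAr scripts: `version_ge` is a hypothetical helper, and it assumes a `sort` that supports `-V` (as GNU coreutils on Ubuntu does).

```shell
#!/usr/bin/env bash
# Sketch: check that the CMake on PATH meets the stated minimum (3.16).
# version_ge is a hypothetical helper, not something shipped with GraphAr.

# Succeeds when version $1 >= version $2 (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v cmake >/dev/null 2>&1; then
  # The first line of `cmake --version` looks like "cmake version 3.16.3".
  cmake_version="$(cmake --version | head -n 1 | awk '{print $3}')"
  if version_ge "$cmake_version" "3.16"; then
    echo "cmake $cmake_version: OK"
  else
    echo "cmake $cmake_version is older than 3.16"
  fi
else
  echo "cmake not found on PATH"
fi
```

The same helper could be reused for other version checks; compiler and curl checks are left out because their package names vary by distribution.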

Building Artifacts

# Clone the repository and checkout the research branch
git clone https://github.com/apache/incubator-graphar.git
cd incubator-graphar
git checkout research
git submodule update --init

# Build the artifacts
mkdir build
cd build
chmod +x ../script/build.sh
../script/build.sh

Getting Graph Data

Table 1 of the paper lists the graphs used in the evaluation, sourced from various datasets. We offer instructions on how to prepare the data for the evaluation, for both public and synthetic datasets.

Additionally, we have stored all graph data for benchmarking in an Aliyun OSS bucket. To download the graphs, please use the following command:

../script/download_data.sh {path_to_dataset}

The data will be downloaded to the specified directory. Please be aware that the total size of the data exceeds 2 TB, and the download may take a long time. Alternatively, some small datasets are provided in the dataset directory for testing purposes.
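
Given the size of the full download, it may be worth checking free disk space first. The following is a minimal sketch under stated assumptions: `has_free_kb` is a hypothetical helper, and the 2 TiB threshold is only a rough lower bound derived from the figure above, not an exact requirement.

```shell
#!/usr/bin/env bash
# Sketch: check free space on the target filesystem before downloading.
# has_free_kb is a hypothetical helper, not part of the GraphAr scripts.

# has_free_kb DIR REQUIRED_KB
# Succeeds if the filesystem holding DIR has at least REQUIRED_KB KiB available.
has_free_kb() {
  local avail
  # `df -Pk` prints POSIX-format output in 1024-byte blocks;
  # column 4 of the second line is the available space.
  avail="$(df -Pk "$1" | awk 'NR==2 {print $4}')"
  [ "$avail" -ge "$2" ]
}

target_dir="${1:-.}"
required_kb=$((2 * 1024 * 1024 * 1024))   # 2 TiB in KiB (assumed lower bound)

if has_free_kb "$target_dir" "$required_kb"; then
  echo "enough free space in $target_dir"
else
  echo "less than 2 TiB free in $target_dir; consider the small bundled datasets"
fi
```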

Micro-Benchmark of Neighbor Retrieval

This section outlines the steps to reproduce the neighbor retrieval benchmarking results reported in Section 6.2 of the paper. Run the following commands:

cd incubator-graphar/build
../script/run_neighbor_retrieval.sh {graph_path} {vertex_num} {source_vertex}

For example:

../script/run_neighbor_retrieval.sh {path_to_graphar}/dataset/facebook/facebook 4039 1642

Other datasets can be used in the same way, with the corresponding parameters specified as needed. We also provide a script in script/run_neighbor_retrieval_all.sh for reference.
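
To run several datasets in one pass, the invocation can be driven from a small table of parameters. The sketch below is illustrative only and is not the content of script/run_neighbor_retrieval_all.sh: `run_or_print`, `DATASET_ROOT`, and `RUN` are hypothetical names, `../dataset` assumes the commands are run from the build directory, and the only row taken from this README is the facebook example. By default it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: build run_neighbor_retrieval.sh invocations from a parameter table.
# DATASET_ROOT, RUN, and run_or_print are hypothetical names for illustration.
DATASET_ROOT="${DATASET_ROOT:-../dataset}"

# Prints the command by default; executes it only when RUN=1.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Each row: graph path (relative to DATASET_ROOT), vertex count, source vertex.
while read -r graph vertex_num source_vertex; do
  [ -n "$graph" ] || continue
  run_or_print ../script/run_neighbor_retrieval.sh \
    "$DATASET_ROOT/$graph" "$vertex_num" "$source_vertex"
done <<'EOF'
facebook/facebook 4039 1642
EOF
```

Additional rows can be appended to the table once the vertex count and a source vertex are known for each graph; setting `RUN=1` switches from printing to executing.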

Micro-Benchmark of Label Filtering

This section outlines the steps to reproduce the label filtering benchmarking results reported in Section 6.3 of the paper.

To run the label filtering benchmark, adjust the parameters according to the dataset (refer to script/label_filtering.md) for both the simple-condition and complex-condition tests.

Then, run the tests using the following commands:

# simple-condition filtering
./release/parquet-graphar-label-all-example < {graph_path}

# complex-condition filtering
./release/parquet-graphar-label-example < {graph_path}

For example:

./release/parquet-graphar-label-all-example < {path_to_graphar}/dataset/bloom/bloom-43-nodes.csv

./release/parquet-graphar-label-example < {path_to_graphar}/dataset/bloom/bloom-43-nodes.csv

Storage Media

The evaluation of different storage media is reported in Section 6.4 of the paper. This test employs the same methodology as the previously mentioned micro-benchmarks, using graph data stored across various storage options. The storage medium can be specified in the path, e.g., OSS://bucket/dataset/facebook/facebook, to indicate that the data is stored on OSS rather than on the local file system.

End-to-End Graph Query Workloads

This section contains the scripts to reproduce the end-to-end graph query results reported in Section 6.5 of the paper.

Once the LDBC dataset is converted into Parquet and GraphAr format, you can run the LDBC workload using a command like the following:

./release/run-work-load {path_to_dataset}/sf-30/person_knows_person {path_to_dataset}/sf-30/person_knows_person-vertex-base 165430 70220 delta

This command will run the LDBC workload IS-3 on the SF-30 dataset, formatted in GraphAr. The total number of person vertices is 165,430, and the query vertex id is 70,220. The delta parameter specifies the use of the delta encoding technique. For complete end-to-end LDBC workload execution, please refer to script/run-is3.sh, script/run-ic8.sh, and script/run-bi2.sh.
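
For readers unfamiliar with delta encoding, the general idea is to store a sorted list of IDs as the first value followed by the differences between consecutive values, which tend to be small and compress well. The sketch below illustrates only this general idea on a toy neighbor list; it is not GraphAr's implementation, and `delta_encode` and `delta_decode` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch of the general delta-encoding idea (not GraphAr's implementation).
# Input and output are one integer per line.

# Emit the first value as-is, then each value minus its predecessor.
delta_encode() {
  awk 'NR == 1 { print $1; prev = $1; next } { print $1 - prev; prev = $1 }'
}

# Reverse the transform with a running sum.
delta_decode() {
  awk '{ sum += $1; print sum }'
}

# Toy sorted neighbor list: 3 7 8 15
printf '%s\n' 3 7 8 15 | delta_encode                  # prints 3, 4, 1, 7 (one per line)
printf '%s\n' 3 7 8 15 | delta_encode | delta_decode   # prints 3, 7, 8, 15 (one per line)
```

Production columnar formats typically combine such deltas with bit-packing, which is omitted here for brevity.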

Integration with GraphScope

This section contains a brief guide on how to reproduce the integration results with GraphScope, as reported in Section 6.6 of the paper.

Serving as the Archive Format

To run the graph loading benchmark:

First, build and install Vineyard, which is the default storage backend of GraphScope, following the instructions in the official documentation.
Enter the build directory of Vineyard, and run the script/graphscope_run_writer.sh script to dump the graph data into GraphAr format.
Run the script/graphscope_run_loader.sh script to load the graph data into GraphScope using GraphAr format.

Please refer to this page for more details on integrating GraphAr with GraphScope. Additionally, consult the documentation to learn how to use GraphAr inside GraphScope.

Serving as the Storage Backend

Leveraging GraphAr's capabilities for graph-related querying, the graph query engine within GraphScope can execute queries directly on GraphAr data in an out-of-core manner. The source code for this integration is available in the GraphScope project.

To run the BI execution benchmark:

First, build and install the GraphScope project with GraphAr integration.
Then, deploy the GIE (GraphScope Interactive Engine) following the instructions in the documentation.
Finally, run the generic benchmark tool for GIE, following the steps outlined in the documentation.

Citation

Please cite the paper in your publications if our work helps your research.

@article{li2024graphar,
  author = {Xue Li and Weibin Zeng and Zhibin Wang and Diwen Zhu and Jingbo Xu and Wenyuan Yu and Jingren Zhou},
  title = {GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes},
  journal = {Proceedings of the VLDB Endowment},
  year = {2024},
  volume = {18},
  number = {3},
  pages = {530--543},
  publisher = {VLDB Endowment},
}