| commit | 90899cf22ba2e04c1e783724fe77e71ba1c220af | |
|---|---|---|
| author | lixueclaire <youli.lx@alibaba-inc.com> | Thu Mar 06 11:20:45 2025 +0800 |
| committer | GitHub <noreply@github.com> | Thu Mar 06 11:20:45 2025 +0800 |
| tree | 3936af2b6d21aaece17c7a72c157dfa070de6490 | |
| parent | c3f811dcb27a906390980486c45959b6304eaed4 | |

update citation information (#669)
> [!NOTE]
> This branch is provided as artifacts for VLDB2025.
> For the latest version of GraphAr, please refer to the branch main.
This repository contains the artifacts for the VLDB2025 paper of GraphAr, including all source code and a guide to reproducing the results presented in the paper.
GraphAr is developed and tested on Ubuntu 20.04.5 LTS. Building GraphAr requires the following software installed as dependencies:

    # Clone the repository and checkout the research branch
    git clone https://github.com/apache/incubator-graphar.git
    cd incubator-graphar
    git checkout research
    git submodule update --init

    # Build the artifacts
    mkdir build
    cd build
    chmod +x ../script/build.sh
    ../script/build.sh

Table 1 of the paper lists the graphs used in the evaluation, sourced from various datasets. We provide instructions for preparing the evaluation data, for both public and synthetic datasets.
Additionally, we have stored all graph data for benchmarking in an Aliyun OSS bucket. To download the graphs, please use the following command:

    ../script/download_data.sh {path_to_dataset}

The data will be downloaded to the specified directory. Please be aware that the total size of the data exceeds 2TB, so the download may take a long time. Alternatively, we provide some small datasets in the dataset directory for testing purposes.
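Because the full download exceeds 2TB, it can help to check the free space in the target directory first. The following is a minimal sketch, not part of the repository: `has_free_space` is a hypothetical helper, and the 2TB threshold is only a lower bound taken from the size stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GraphAr): succeed only if DIR has at
# least REQUIRED_KB kilobytes available.
has_free_space() {
  local dir="$1" required_kb="$2" available_kb
  # With -P and -k, df prints one POSIX-format line per filesystem;
  # column 4 of the second line is the available space in KB.
  available_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$available_kb" -ge "$required_kb" ]
}

# Assumption: 2TB expressed in KB, used as a lower bound for the download.
target_dir="${1:-.}"
required_kb=$((2 * 1024 * 1024 * 1024))

if has_free_space "$target_dir" "$required_kb"; then
  echo "Enough space in $target_dir for the full download."
else
  echo "Less than 2TB free in $target_dir; consider the small datasets instead." >&2
fi
```

If the check passes, `../script/download_data.sh "$target_dir"` can be run as described above.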
This section outlines the steps to reproduce the neighbor retrieval benchmarking results reported in Section 6.2 of the paper. Run the following commands:

    cd incubator-graphar/build
    ../script/run_neighbor_retrieval.sh {graph_path} {vertex_num} {source_vertex}

For example:

    ../script/run_neighbor_retrieval.sh {path_to_graphar}/dataset/facebook/facebook 4039 1642

Other datasets can be used in the same way, with the corresponding parameters specified as needed. We also provide a script in script/run_neighbor_retrieval_all.sh for reference.
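To keep the per-dataset parameters in one place, the invocations can be generated from a small table. The sketch below is illustrative only: `print_neighbor_retrieval_cmds` and the `GRAPHAR_ROOT` variable are hypothetical names, and only the facebook row comes from the example above. It prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "graph_path vertex_num source_vertex" rows from
# stdin and print one neighbor-retrieval command per row (dry run).
print_neighbor_retrieval_cmds() {
  local graph_path vertex_num source_vertex
  while read -r graph_path vertex_num source_vertex; do
    # Skip blank lines and comment lines.
    case "$graph_path" in ''|\#*) continue ;; esac
    printf '../script/run_neighbor_retrieval.sh %s %s %s\n' \
      "$graph_path" "$vertex_num" "$source_vertex"
  done
}

# Assumption: the script is run from the build directory, so the repository
# root is one level up unless GRAPHAR_ROOT says otherwise.
graphar_root="${GRAPHAR_ROOT:-..}"

print_neighbor_retrieval_cmds <<EOF
# graph_path vertex_num source_vertex
$graphar_root/dataset/facebook/facebook 4039 1642
EOF
```

Piping the printed lines to `sh` from the build directory would execute them; further rows need the vertex count and source vertex appropriate to each dataset.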
This section outlines the steps to reproduce the label filtering benchmarking results reported in Section 6.3 of the paper.
To run the label filtering benchmarking component, first adjust the parameters according to the dataset (refer to script/label_filtering.md) for both the simple-condition test and the complex-condition test.
Then, run the tests using the following commands:

    # simple-condition filtering
    ./release/parquet-graphar-label-all-example < {graph_path}

    # complex-condition filtering
    ./release/parquet-graphar-label-example < {graph_path}

For example:

    ./release/parquet-graphar-label-all-example < {path_to_graphar}/dataset/bloom/bloom-43-nodes.csv
    ./release/parquet-graphar-label-example < {path_to_graphar}/dataset/bloom/bloom-43-nodes.csv

The evaluation of different storage media is reported in Section 6.4 of the paper. This test employs the same methodology as the previously mentioned micro-benchmarks, using graph data stored across various storage options. The storage media can be specified in the path, e.g., OSS:://bucket/dataset/facebook/facebook, to indicate that the data is stored on OSS rather than relying on the local file system.
This section contains the scripts to reproduce the end-to-end graph query results reported in Section 6.5 of the paper.
Once the LDBC dataset is converted into Parquet and GraphAr format, you can run the LDBC workload using a command like the following:

    ./release/run-work-load {path_to_dataset}/sf-30/person_knows_person {path_to_dataset}/sf-30/person_knows_person-vertex-base 165430 70220 delta

This command will run the LDBC workload IS-3 on the SF-30 dataset, formatted in GraphAr. The total number of person vertices is 165,430, and the query vertex id is 70,220. The delta parameter specifies the use of the delta encoding technique. For complete end-to-end LDBC workload execution, please refer to script/run-is3.sh, script/run-ic8.sh, and script/run-bi2.sh.
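The positional arguments of that command can be easier to follow when given names. The sketch below is a hypothetical wrapper, not part of the repository: `run_work_load_cmd` and `DATASET_ROOT` are illustrative names, and the argument order simply mirrors the example command above. It prints the command rather than executing it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: build the run-work-load command line for IS-3 from
# named parameters, following the argument order of the example above.
run_work_load_cmd() {
  local dataset_dir="$1"   # e.g. {path_to_dataset}/sf-30
  local vertex_num="$2"    # total number of person vertices
  local query_vertex="$3"  # query vertex id
  local encoding="$4"      # e.g. delta
  printf './release/run-work-load %s %s %s %s %s\n' \
    "$dataset_dir/person_knows_person" \
    "$dataset_dir/person_knows_person-vertex-base" \
    "$vertex_num" "$query_vertex" "$encoding"
}

# Assumption: DATASET_ROOT points at the directory holding the converted data.
run_work_load_cmd "${DATASET_ROOT:-/path/to/dataset}/sf-30" 165430 70220 delta
```

For the full workloads, the provided scripts script/run-is3.sh, script/run-ic8.sh, and script/run-bi2.sh remain the reference.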
This section contains a brief guide on how to reproduce the integration results with GraphScope, as reported in Section 6.6 of the paper.
To run the graph loading benchmarking:

- Use the script/graphscope_run_writer.sh script to dump the graph data into GraphAr format.
- Use the script/graphscope_run_loader.sh script to load the graph data into GraphScope using GraphAr format.

Please refer to this page for more details on integrating GraphAr with GraphScope. Additionally, consult the documentation to learn how to use GraphAr inside GraphScope.
Leveraging the capabilities for graph-related querying, the graph query engine within GraphScope can execute queries directly on the GraphAr data in an out-of-core manner. The source code for this integration is available in the GraphScope project.
For running the BI execution benchmarking, please:
Please cite the paper in your publications if our work helps your research.

    @article{li2024graphar,
      author = {Xue Li and Weibin Zeng and Zhibin Wang and Diwen Zhu and Jingbo Xu and Wenyuan Yu and Jingren Zhou},
      title = {GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes},
      journal = {Proceedings of the VLDB Endowment},
      year = {2024},
      volume = {18},
      number = {3},
      pages = {530--543},
      publisher = {VLDB Endowment},
    }