[SYSTEMDS-393] Performance distributed connected components

This patch makes a few tweaks to significantly improve the performance
of the new connected components builtin function where the graph G does
not fit in the driver memory and thus spawns distributed Spark
operations.

The test case was a 1M x 1M graph with 1G edges, run with 10GB driver
memory and 9 executors with 80GB each. The baseline runtime of 10 calls
to connected components (each requiring 4 iterations until convergence)
was a poor 1,512s due to excessive shuffle and GC overhead.
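
For reference, the measured workload corresponds roughly to the
following driver script (a sketch only; the actual benchmark script,
data generation, and the exact builtin signature may differ):

  # Hypothetical benchmark sketch: ~1M x 1M graph with ~1G edges,
  # and 10 calls to the connected components builtin as measured above.
  n = 1000000;
  G = rand(rows=n, cols=n, sparsity=1e-3, min=1, max=1); # ~1e9 non-zeros
  G = max(G, t(G));                                      # undirected graph
  for(i in 1:10) {
    c = components(G=G, verbose=FALSE);
    print("run " + i + ": " + sum(c));                   # force evaluation
  }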

1) Modified script: Removed the unnecessary removal of self-edges, as
the chosen update rule is robust enough to handle both cases (see the
sketch below). This eliminated the excessive shuffle overhead of
matrix-matrix binary operations on inputs without an existing hash
partitioning. This change alone reduced the total runtime of 10 calls
to 760s.
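
Why self-edges are harmless is visible in the update rule itself (a
rough sketch of the builtin's loop; variable names are illustrative): a
self-edge G[i,i]=1 only re-contributes node i's own label c[i] to the
row maximum, which max(..., c) includes anyway, so stripping self-edges
buys nothing.

  # Label-propagation sketch: each node takes the maximum label among
  # itself and its neighbors until no label changes anymore.
  c = seq(1, nrow(G));                  # initialize labels with node ids
  diff = Inf;
  while( diff > 0 ) {
    u = max(rowMaxs(G * t(c)), c);      # a self-edge only re-adds c[i]
    diff = sum(u != c);                 # number of changed labels
    c = u;
  }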

2) Handling of approximately known sparsity: The large GC overhead was
due to not converting the MCSR representation into read-optimized CSR
during checkpointing (Spark caching). We now evaluate the conversion
conditions with the upper-bound nnz information that is available in
cases where the exact nnz is unknown. This further reduced the total
runtime to 131s.
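
A typical case where only an upper bound on nnz is known is a graph
assembled from an edge list, because duplicate edges collapse; a hedged
sketch (parameter names are illustrative):

  # The exact nnz of G is unknown at compile time: duplicate (src, dst)
  # pairs collapse in table(), so nrow(E) only bounds the non-zeros.
  E = read($edges);                     # two-column edge list (src, dst)
  G = table(E[,1], E[,2], $n, $n);      # adjacency matrix, nnz <= nrow(E)
  G = max(G, t(G)) != 0;                # symmetrize and binarize
  c = components(G=G, verbose=FALSE);   # G is cached for iterative reuse

With this patch, the checkpoint of G derives the sparse-format decision
(and thus the MCSR-to-CSR conversion) from that upper bound instead of
skipping the conversion altogether.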

With codegen, the runtime further improves slightly to 120s (including
Spark context creation and matrix creation), as we avoid materializing
G * t(c) in memory by fusing it into rowMaxs(G * t(c)). For 40 update
rule computations (and thus scans of the graph), this is fairly
reasonable.
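
The fusion opportunity is exactly the inner expression of the update
rule: without codegen, the element-wise product is materialized as an
n x n intermediate before the row-wise aggregation, whereas with
codegen both collapse into a single pass over the blocks of G (a
conceptual sketch, not the generated operator):

  # Without fusion: the n x n intermediate T is materialized, then aggregated.
  T = G * t(c);
  u = max(rowMaxs(T), c);

  # With fusion: rowMaxs(G * t(c)) is computed in one scan of G, never
  # materializing the element-wise product.
  u = max(rowMaxs(G * t(c)), c);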