[SYSTEMDS-393] Performance distributed connected components

This patch makes a few tweaks to significantly improve the performance
of the new connected components builtin function where the graph G does
not fit in the driver memory and thus spawns distributed Spark
operations.

The test case was a 1M x 1M graph with 1G edges, run with 10GB driver
memory and 9 executors with 80GB each. The baseline runtime of 10 calls
to connected components (each requiring 4 iterations until convergence)
was a poor 1,512s due to excessive shuffle and GC overhead.
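
For reference, the measured workload corresponds roughly to the
following driver script (a sketch only; the actual benchmark script,
data generation, and the exact builtin signature may differ):

  # Hypothetical benchmark sketch: ~1M x 1M graph with ~1G edges,
  # and 10 calls to the connected components builtin as measured above.
  n = 1000000;
  G = rand(rows=n, cols=n, sparsity=1e-3, min=1, max=1); # ~1e9 non-zeros
  G = max(G, t(G));                                      # undirected graph
  for(i in 1:10) {
    c = components(G=G, verbose=FALSE);
    print("run " + i + ": " + sum(c));                   # force evaluation
  }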

1) Modified script: Removed the unnecessary removal of self-edges, as
the chosen update rule is robust enough to handle both cases (see the
sketch below). This eliminated the excessive shuffle overhead of
matrix-matrix binary operations on inputs without an existing hash
partitioning. This change alone reduced the total runtime of 10 calls
to 760s.
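
Why self-edges are harmless is visible in the update rule itself (a
rough sketch of the builtin's loop; variable names are illustrative): a
self-edge G[i,i]=1 only re-contributes node i's own label c[i] to the
row maximum, which max(..., c) includes anyway, so stripping self-edges
buys nothing.

  # Label-propagation sketch: each node takes the maximum label among
  # itself and its neighbors until no label changes anymore.
  c = seq(1, nrow(G));                  # initialize labels with node ids
  diff = Inf;
  while( diff > 0 ) {
    u = max(rowMaxs(G * t(c)), c);      # a self-edge only re-adds c[i]
    diff = sum(u != c);                 # number of changed labels
    c = u;
  }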

2) Handling of approximately known sparsity: The large GC overhead was
due to not converting the MCSR representation into read-optimized CSR
during checkpointing (Spark caching). We now evaluate the conversion
conditions with the upper-bound nnz information that is available in
cases where the exact nnz is unknown. This further reduced the total
runtime to 131s.
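
A typical case where only an upper bound on nnz is known is a graph
assembled from an edge list, because duplicate edges collapse; a hedged
sketch (parameter names are illustrative):

  # The exact nnz of G is unknown at compile time: duplicate (src, dst)
  # pairs collapse in table(), so nrow(E) only bounds the non-zeros.
  E = read($edges);                     # two-column edge list (src, dst)
  G = table(E[,1], E[,2], $n, $n);      # adjacency matrix, nnz <= nrow(E)
  G = max(G, t(G)) != 0;                # symmetrize and binarize
  c = components(G=G, verbose=FALSE);   # G is cached for iterative reuse

With this patch, the checkpoint of G derives the sparse-format decision
(and thus the MCSR-to-CSR conversion) from that upper bound instead of
skipping the conversion altogether.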

With codegen, the runtime further improves slightly to 120s (including
Spark context creation and matrix creation), as we avoid materializing
G * t(c) in memory by fusing it into rowMaxs(G * t(c)). For 40 update
rule computations (and thus scans of the graph), this is fairly
reasonable.
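
The fusion opportunity is exactly the inner expression of the update
rule: without codegen, the element-wise product is materialized as an
n x n intermediate before the row-wise aggregation, whereas with
codegen both collapse into a single pass over the blocks of G (a
conceptual sketch, not the generated operator):

  # Without fusion: the n x n intermediate T is materialized, then aggregated.
  T = G * t(c);
  u = max(rowMaxs(T), c);

  # With fusion: rowMaxs(G * t(c)) is computed in one scan of G, never
  # materializing the element-wise product.
  u = max(rowMaxs(G * t(c)), c);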