[SYSTEMDS-2816] Fix incorrect tracking of spark broadcast sizes

Running multiLogReg on one day of the Criteo dataset (~65GB,
192215183 x 40) initially showed good performance, but after a while
individual Spark jobs got significantly slower. The underlying issue is
incorrect tracking of live broadcast objects (and their sizes), which
are taken into account when deciding whether to collect the output of an
RDD operation or pipe it to HDFS and read it back in to avoid the double
memory requirement (the list of blocks plus the target matrix).
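This collect-vs-spill decision can be illustrated with a minimal sketch. The class and method names below are hypothetical (not SystemDS's actual code); the sketch only shows why over-counted live broadcasts push results onto the HDFS path: collecting transiently needs roughly twice the matrix size, and the tracked broadcast sizes shrink the remaining budget.

```java
// Hedged sketch (hypothetical names, not the actual SystemDS logic):
// decide whether an RDD result can be collected into driver memory or
// must be written to HDFS and read back. Collecting needs ~2x the matrix
// size (list of collected blocks plus the assembled target matrix), and
// live broadcasts already pinned in driver memory reduce the budget.
public class CollectDecision {
    /** Returns true if collecting is safe within the driver memory budget. */
    static boolean canCollect(long matrixSize, long liveBroadcastSize, long memBudget) {
        // double memory requirement: block list + target matrix
        return 2 * matrixSize + liveBroadcastSize <= memBudget;
    }

    public static void main(String[] args) {
        long budget = 10_000_000_000L; // assumed 10 GB driver budget
        // a 6 GB result does not fit (12 GB transient + 1 GB broadcasts): spill to HDFS
        System.out.println(canCollect(6_000_000_000L, 1_000_000_000L, budget)); // false
        // a 1 GB result fits: collect directly
        System.out.println(canCollect(1_000_000_000L, 1_000_000_000L, budget)); // true
    }
}
```

If broadcast sizes are never subtracted on removal (the bug described below), `liveBroadcastSize` only grows, so over time ever smaller results get forced onto the slower HDFS path.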

In detail, the root cause was that there are two kinds of broadcasts
(partitioned and non-partitioned), which have different sizes for the
same matrix. The removal bookkeeping used the non-partitioned sizes and
thus ignored our default partitioned broadcasts. This patch simply
cleans this up by taking whatever size is available.
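The bookkeeping fix can be sketched as follows. All names here are hypothetical, chosen only to illustrate the commit message's description; the actual SystemDS classes differ.

```java
// Hedged sketch (hypothetical names, not the actual patch): a matrix may
// have a partitioned and/or a non-partitioned broadcast with different
// serialized sizes. The bug was that removal bookkeeping consulted only
// the non-partitioned size, so the default partitioned broadcasts were
// never accounted as freed; the fix takes whatever size is available.
public class BroadcastTracker {
    private long liveBroadcastSize = 0;

    static class BroadcastEntry {
        Long partitionedSize;    // null if no partitioned broadcast exists
        Long nonPartitionedSize; // null if no non-partitioned broadcast exists
        BroadcastEntry(Long p, Long np) { partitionedSize = p; nonPartitionedSize = np; }
    }

    void add(BroadcastEntry e) { liveBroadcastSize += size(e); }

    void remove(BroadcastEntry e) {
        // fixed behavior: use the same size lookup as add(), so partitioned
        // broadcasts are correctly subtracted when destroyed
        liveBroadcastSize -= size(e);
    }

    /** Take whatever size is available (partitioned preferred). */
    static long size(BroadcastEntry e) {
        if (e.partitionedSize != null) return e.partitionedSize;
        if (e.nonPartitionedSize != null) return e.nonPartitionedSize;
        return 0;
    }

    long getLiveSize() { return liveBroadcastSize; }

    public static void main(String[] args) {
        BroadcastTracker t = new BroadcastTracker();
        // partitioned-only broadcast: the buggy version would subtract 0 on removal
        BroadcastEntry e = new BroadcastEntry(2_000_000L, null);
        t.add(e);
        t.remove(e);
        System.out.println(t.getLiveSize()); // 0: size fully reclaimed
    }
}
```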

This issue was introduced w/ SYSTEMML-1313 (Apr 2018), so every
release afterward is affected. The performance penalty depends on the
size of broadcasts, the total number of iterations, and the performance
difference between hdfs-write/read and collect.
3 files changed
README.md

Apache SystemDS

Overview: SystemDS is a versatile system for the end-to-end data science lifecycle, from data integration, cleaning, and feature engineering, over efficient local and distributed ML model training, to deployment and serving. To this end, we aim to provide a stack of declarative languages with R-like syntax for (1) the different tasks of the data science lifecycle, and (2) users with different expertise. These high-level scripts are compiled into hybrid execution plans of local, in-memory CPU and GPU operations, as well as distributed operations on Apache Spark. In contrast to existing systems, which provide either homogeneous tensors or 2D datasets, and in order to serve the entire data science lifecycle, the underlying data model is DataTensors, i.e., tensors (multi-dimensional arrays) whose first dimension may have a heterogeneous and nested schema.

Quick Start: Install, Quick Start and Hello World

Documentation: SystemDS Documentation

Python Documentation: Python SystemDS Documentation

Issue Tracker: Jira Dashboard

Status and Build: SystemDS is the renamed successor of SystemML, an Apache Top Level Project. To build from source, visit SystemDS Install from source.

CI workflows: Build, Documentation, Component Test, Application Test, Function Test, Python Test, Federated Python Test