[SYSTEMDS-2816] Fix incorrect tracking of spark broadcast sizes

Running multiLogReg on one day of the Criteo dataset (~65GB,
192215183 x 40) initially showed good performance, but after a while
individual Spark jobs got significantly slower. The underlying issue is
incorrect tracking of live broadcast objects (and their sizes), which
are taken into account when deciding whether to collect the output of an
RDD operation or pipe it to HDFS and read it back in to avoid the double
memory requirement (the list of blocks plus the target matrix).
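This collect-vs-spill decision can be illustrated with a minimal sketch. The class and method names below are hypothetical (not SystemDS's actual code); the sketch only shows why over-counted live broadcasts push results onto the HDFS path: collecting transiently needs roughly twice the matrix size, and the tracked broadcast sizes shrink the remaining budget.

```java
// Hedged sketch (hypothetical names, not the actual SystemDS logic):
// decide whether an RDD result can be collected into driver memory or
// must be written to HDFS and read back. Collecting needs ~2x the matrix
// size (list of collected blocks plus the assembled target matrix), and
// live broadcasts already pinned in driver memory reduce the budget.
public class CollectDecision {
    /** Returns true if collecting is safe within the driver memory budget. */
    static boolean canCollect(long matrixSize, long liveBroadcastSize, long memBudget) {
        // double memory requirement: block list + target matrix
        return 2 * matrixSize + liveBroadcastSize <= memBudget;
    }

    public static void main(String[] args) {
        long budget = 10_000_000_000L; // assumed 10 GB driver budget
        // a 6 GB result does not fit (12 GB transient + 1 GB broadcasts): spill to HDFS
        System.out.println(canCollect(6_000_000_000L, 1_000_000_000L, budget)); // false
        // a 1 GB result fits: collect directly
        System.out.println(canCollect(1_000_000_000L, 1_000_000_000L, budget)); // true
    }
}
```

If broadcast sizes are never subtracted on removal (the bug described below), `liveBroadcastSize` only grows, so over time ever smaller results get forced onto the slower HDFS path.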

In detail, the root cause was that there are two kinds of broadcasts
(partitioned and non-partitioned), which have different sizes for the
same matrix. The removal bookkeeping used the non-partitioned sizes and
thus ignored our default partitioned broadcasts. This patch simply
cleans this up by taking whatever size is available.
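The bookkeeping fix can be sketched as follows. All names here are hypothetical, chosen only to illustrate the commit message's description; the actual SystemDS classes differ.

```java
// Hedged sketch (hypothetical names, not the actual patch): a matrix may
// have a partitioned and/or a non-partitioned broadcast with different
// serialized sizes. The bug was that removal bookkeeping consulted only
// the non-partitioned size, so the default partitioned broadcasts were
// never accounted as freed; the fix takes whatever size is available.
public class BroadcastTracker {
    private long liveBroadcastSize = 0;

    static class BroadcastEntry {
        Long partitionedSize;    // null if no partitioned broadcast exists
        Long nonPartitionedSize; // null if no non-partitioned broadcast exists
        BroadcastEntry(Long p, Long np) { partitionedSize = p; nonPartitionedSize = np; }
    }

    void add(BroadcastEntry e) { liveBroadcastSize += size(e); }

    void remove(BroadcastEntry e) {
        // fixed behavior: use the same size lookup as add(), so partitioned
        // broadcasts are correctly subtracted when destroyed
        liveBroadcastSize -= size(e);
    }

    /** Take whatever size is available (partitioned preferred). */
    static long size(BroadcastEntry e) {
        if (e.partitionedSize != null) return e.partitionedSize;
        if (e.nonPartitionedSize != null) return e.nonPartitionedSize;
        return 0;
    }

    long getLiveSize() { return liveBroadcastSize; }

    public static void main(String[] args) {
        BroadcastTracker t = new BroadcastTracker();
        // partitioned-only broadcast: the buggy version would subtract 0 on removal
        BroadcastEntry e = new BroadcastEntry(2_000_000L, null);
        t.add(e);
        t.remove(e);
        System.out.println(t.getLiveSize()); // 0: size fully reclaimed
    }
}
```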

This issue was introduced w/ SYSTEMML-1313 (Apr 2018), so every
release afterward is affected. The performance penalty depends on the
size of broadcasts, the total number of iterations, and the performance
difference between hdfs-write/read and collect.
3 files changed
README.md

Apache SystemDS

Overview: SystemDS is a versatile system for the end-to-end data science lifecycle, from data integration, cleaning, and feature engineering, over efficient local and distributed ML model training, to deployment and serving. To this end, we aim to provide a stack of declarative languages with R-like syntax for (1) the different tasks of the data science lifecycle, and (2) users with different expertise. These high-level scripts are compiled into hybrid execution plans of local, in-memory CPU and GPU operations, as well as distributed operations on Apache Spark. In contrast to existing systems, which provide either homogeneous tensors or 2D datasets, and in order to serve the entire data science lifecycle, the underlying data model is DataTensors, i.e., tensors (multi-dimensional arrays) whose first dimension may have a heterogeneous and nested schema.

Quick Start: Install, Quick Start and Hello World

Documentation: SystemDS Documentation

Python Documentation: Python SystemDS Documentation

Issue Tracker: Jira Dashboard

Status and Build: SystemDS is the renamed successor of SystemML, an Apache Top Level Project. To build from source, visit SystemDS Install from source.

CI workflows: Build, Documentation, Component Test, Application Test, Function Test, Python Test, Federated Python Test