commit    11a23342cbd82c15e21a324c37b6d3b013a236f5
author    Matthias Boehm <mboehm7@gmail.com>  Sat Jan 30 21:01:23 2021 +0100
committer Matthias Boehm <mboehm7@gmail.com>  Sat Jan 30 21:01:23 2021 +0100
tree      78c9024be17f65f7dc0893b72b9958626349d2ae
parent    ced2ae8212124b49afb1dff330e2592131e324d0
[SYSTEMDS-2816] Fix incorrect tracking of Spark broadcast sizes

Running multiLogReg on one day of the Criteo dataset (~65GB, 192215183 x 40) initially showed good performance, but after a while individual Spark jobs became significantly slower. The underlying issue is incorrect tracking of live broadcast objects (and their sizes), which are taken into account when deciding whether to collect the output of an RDD operation or to pipe it to HDFS and read it back in, which avoids the double memory requirement (list of blocks plus target matrix).

In detail, the root cause is that there are two kinds of broadcasts (partitioned and non-partitioned), which have different sizes for the same matrix. The removal bookkeeping used the non-partitioned sizes and thus ignored our default partitioned broadcasts. This patch cleans this up by taking whatever size is available.

This issue was introduced with SYSTEMML-1313 (Apr 2018), so every release afterward is affected. The magnitude of the performance penalty depends on the size of the broadcasts, the total number of iterations, and the performance difference between HDFS write/read and collect.
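The bookkeeping bug and its fix can be illustrated with a minimal sketch (class and method names here are hypothetical, not the actual SystemDS internals): sizes are recorded per broadcast variant, and removal must subtract whichever size was actually recorded rather than assuming the non-partitioned one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of broadcast-size bookkeeping (illustrative names,
// not the actual SystemDS classes). Two broadcast variants of the same
// matrix may have different sizes; if removal only consults one variant's
// map, the live-size estimate grows without bound and later collect/HDFS
// decisions are made against an inflated memory estimate.
public class BroadcastTracker {
    private final Map<String, Long> partitionedSizes = new HashMap<>();
    private final Map<String, Long> nonPartitionedSizes = new HashMap<>();
    private long liveSize = 0; // total tracked size of live broadcasts

    public void register(String var, long size, boolean partitioned) {
        (partitioned ? partitionedSizes : nonPartitionedSizes).put(var, size);
        liveSize += size;
    }

    // Buggy removal (pre-fix): only consults the non-partitioned map,
    // so default partitioned broadcasts are never subtracted.
    public void removeBuggy(String var) {
        Long size = nonPartitionedSizes.remove(var);
        if (size != null)
            liveSize -= size;
    }

    // Fixed removal: take whatever size is available for this variable.
    public void removeFixed(String var) {
        Long size = partitionedSizes.remove(var);
        if (size == null)
            size = nonPartitionedSizes.remove(var);
        if (size != null)
            liveSize -= size;
    }

    public long getLiveSize() {
        return liveSize;
    }
}
```

With the buggy removal, a registered partitioned broadcast remains counted as live forever; the fixed removal brings the tracked size back to zero.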
Overview: SystemDS is a versatile system for the end-to-end data science lifecycle, from data integration, cleaning, and feature engineering, over efficient local and distributed ML model training, to deployment and serving. To this end, we aim to provide a stack of declarative languages with R-like syntax for (1) the different tasks of the data science lifecycle, and (2) users with different expertise. These high-level scripts are compiled into hybrid execution plans of local, in-memory CPU and GPU operations, as well as distributed operations on Apache Spark. In contrast to existing systems, which provide either homogeneous tensors or 2D datasets, and in order to serve the entire data science lifecycle, the underlying data model is DataTensors, i.e., tensors (multi-dimensional arrays) whose first dimension may have a heterogeneous and nested schema.
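The DataTensor idea can be sketched as follows (a hedged, illustrative model only, not the actual SystemDS implementation): along the first dimension, each entry follows a column schema that mixes value types, much like a database row.

```java
// Illustrative sketch of a DataTensor's heterogeneous first-dimension
// schema (hypothetical types, not the SystemDS implementation): each row
// mixes value types, e.g. a string label, a floating-point feature, and
// an integer count.
public class DataTensorSketch {
    enum ValueType { STRING, FP64, INT64 }

    // Column schema shared by all rows of the first dimension.
    static final ValueType[] SCHEMA =
        { ValueType.STRING, ValueType.FP64, ValueType.INT64 };

    // A row conforms to the schema if it has one value per column.
    static boolean conforms(Object[] row) {
        return row.length == SCHEMA.length;
    }
}
```

A homogeneous tensor would instead force all cells to a single value type, which is why a heterogeneous first dimension is useful for raw, untyped input data early in the lifecycle.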
Quick Start: Install, Quick Start and Hello World
Documentation: SystemDS Documentation
Python Documentation: Python SystemDS Documentation
Issue Tracker: Jira Dashboard
Status and Build: SystemDS is the renamed continuation of SystemML, an Apache Top Level Project. To build from source, visit SystemDS Install from source.