commit | 15c30ea40d526b34cce929716a1a2363549c6c44 | |
---|---|---|
author | Kyle Krueger <kyle.s.krueger@gmail.com> | Sat Jun 17 20:10:53 2023 +0200 |
committer | baunsgaard <baunsgaard@tu-berlin.de> | Fri Jul 14 15:35:19 2023 +0200 |
tree | 81068308cadd3b8e23b2dc256d1cd2a955adfd6a | |
parent | e8f5af8734768a3bb3d3694f4db37bdc8c823474 | |
[SYSTEMDS-2834] Python I/O Benchmarking

This commit extends the performance benchmarks to include a Python benchmark for the transfer of data from the Python API into and out of SystemDS.

Results include:

data type | script | result |
---|---|---|
double | read.dml | 40.781715454 |
double | load_native.py | 39.19094614699134 |
int | read.dml | 32.824596657 |
int | load_native.py | 36.457156577002024 |
string | read.dml | 34.440663763 |
string | load_native.py | 38.71029913998791 |
boolean | read.dml | 33.266684618 |
boolean | load_native.py | 36.68671202700352 |
double | load_numpy.py | 32.85507999898982 |
double | load_pandas.py | 512.6433556610136 |
float | load_numpy.py | 38.261559439997654 |
float | load_pandas.py | 546.0650390849914 |
long | load_numpy.py | 39.400702337006805 |
long | load_pandas.py | 536.5950958920002 |
int64 | load_numpy.py | 32.98173662999761 |
int64 | load_pandas.py | 487.0634801320266 |
int32 | load_numpy.py | 32.48500068101566 |
int32 | load_pandas.py | 489.97116349000135 |
uint8 | load_numpy.py | 31.86706029099878 |
uint8 | load_pandas.py | 496.9151880980062 |
string | load_pandas.py | 504.3096235789999 |
bool | load_numpy.py | 33.19832509398111 |
bool | load_pandas.py | 479.9256292580103 |

Pandas reading and writing is underperforming and needs to be refined, while NumPy transfer is on par with normal reads. Both instances indicate potential for improvement, especially pandas.

Closes #1847
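The gap between the pandas and NumPy paths reported in the commit message can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the commit: it copies the `load_numpy.py` and `load_pandas.py` figures verbatim from the message (the message does not state their unit) and divides one by the other per data type. The helper names `parse_results` and `pandas_to_numpy_ratio` are hypothetical and exist only for this illustration.

```python
# Figures copied verbatim from the commit message; the unit is not stated there.
RESULTS = """\
double: load_numpy.py; 32.85507999898982
double: load_pandas.py; 512.6433556610136
float: load_numpy.py; 38.261559439997654
float: load_pandas.py; 546.0650390849914
long: load_numpy.py; 39.400702337006805
long: load_pandas.py; 536.5950958920002
int64: load_numpy.py; 32.98173662999761
int64: load_pandas.py; 487.0634801320266
int32: load_numpy.py; 32.48500068101566
int32: load_pandas.py; 489.97116349000135
uint8: load_numpy.py; 31.86706029099878
uint8: load_pandas.py; 496.9151880980062
string: load_pandas.py; 504.3096235789999
bool: load_numpy.py; 33.19832509398111
bool: load_pandas.py; 479.9256292580103
"""


def parse_results(text):
    """Parse 'type: script; value' lines into {type: {script: value}}."""
    table = {}
    for line in text.strip().splitlines():
        dtype, rest = line.split(":", 1)
        script, value = rest.split(";", 1)
        table.setdefault(dtype.strip(), {})[script.strip()] = float(value)
    return table


def pandas_to_numpy_ratio(table):
    """Return pandas/NumPy ratios for types that were measured with both scripts."""
    ratios = {}
    for dtype, by_script in table.items():
        if "load_numpy.py" in by_script and "load_pandas.py" in by_script:
            ratios[dtype] = by_script["load_pandas.py"] / by_script["load_numpy.py"]
    return ratios


if __name__ == "__main__":
    for dtype, ratio in pandas_to_numpy_ratio(parse_results(RESULTS)).items():
        print(f"{dtype}: pandas/numpy = {ratio:.1f}x")
```

For every data type measured with both scripts, the pandas figure is more than an order of magnitude larger than the NumPy one. The `string` row is skipped because the message reports no NumPy counterpart for it.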
Overview: SystemDS is an open-source ML system for the end-to-end data science lifecycle, from data integration, cleaning, and feature engineering, over efficient local and distributed ML model training, to deployment and serving. To this end, we aim to provide a stack of declarative languages with R-like syntax for (1) the different tasks of the data science lifecycle, and (2) users with different expertise. These high-level scripts are compiled into hybrid execution plans of local, in-memory CPU and GPU operations, as well as distributed operations on Apache Spark. In contrast to existing systems, which provide either homogeneous tensors or 2D Datasets, and in order to serve the entire data science lifecycle, the underlying data model is the DataTensor, i.e., a tensor (multi-dimensional array) whose first dimension may have a heterogeneous and nested schema.
Quick Start: Install, Quick Start and Hello World
Documentation: SystemDS Documentation
Python Documentation: Python SystemDS Documentation
Issue Tracker: Jira Dashboard
Status and Build: SystemDS was renamed from SystemML, which is an Apache Top Level Project. To build from source, see SystemDS Install from source.