[SYSTEMDS-3548] Optimize python dataframe transfer

This commit optimizes how the pandas_to_frame_block function accesses Java types.
It also fixes a small regression in which exceptions raised in the parallelization threads were not propagated properly.
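
To illustrate the kind of regression described above, here is a minimal sketch of exception propagation from worker threads using Python's standard `concurrent.futures`. It is not the code from this commit: `convert_column` and `convert_all` are hypothetical names, and the worker fails deliberately for one column so the propagation path is visible.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_column(index, values):
    # Hypothetical per-column worker (an assumption, not the SystemDS code).
    # It fails on purpose for column 2 to demonstrate propagation.
    if index == 2:
        raise ValueError(f"cannot convert column {index}")
    return [str(v) for v in values]


def convert_all(columns, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(convert_column, i, col) for i, col in enumerate(columns)
        ]
        # Future.result() re-raises any exception raised inside the worker,
        # so a failure in a thread surfaces in the calling thread.
        return [f.result() for f in futures]
```

If the futures were submitted and never queried with `result()`, the `ValueError` would stay inside the future object and the caller would not see it, which is one common way such regressions arise.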

- Fix perftests not working with large, split-up datasets. IO datagen splits large datasets into multiple files (for example 100k_1k). This commit makes load_pandas.py and load_numpy.py able to read those.
- Add row-wise parallel processing for the pandas to FrameBlock conversion in the case of cols > rows. Also add some other small, currently unused utility methods.
- Add javadocs
- Adjust Py4jConverterUtilsTest to reflect the code changes in the main class.
- Add missing tests for the code added in SYSTEMDS-3548. This includes the FrameBlock and Py4jConverterUtils functions, as well as Python pandas to SystemDS IO end-to-end tests.
- Fix pandas io test (rows have to be >4)
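
The row-wise versus column-wise choice mentioned in the list above can be sketched as follows. This is a pure-Python stand-in under stated assumptions, not the pandas_to_frame_block implementation: `choose_axis` and `to_string_cells` are hypothetical helpers, and the per-cell "conversion" is just `str()`.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_axis(n_rows, n_cols):
    # Assumed heuristic: parallelize over rows when the table is wider
    # than it is tall (cols > rows), otherwise over columns.
    return "rows" if n_cols > n_rows else "cols"


def to_string_cells(table, max_workers=4):
    """Convert a list-of-rows table to strings, one task per row or column."""
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0
    axis = choose_axis(n_rows, n_cols)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        if axis == "rows":
            # One task per row.
            out = list(pool.map(lambda row: [str(v) for v in row], table))
        else:
            # One task per column, then transpose back to row order.
            cols = list(
                pool.map(
                    lambda j: [str(table[i][j]) for i in range(n_rows)],
                    range(n_cols),
                )
            )
            out = [[cols[j][i] for j in range(n_cols)] for i in range(n_rows)]
    return axis, out
```

Both branches return the same cells; only the unit of work handed to each thread differs.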

Closes #2189
8 files changed
tree: 78637ef26bac7f8ab8c50486517fe6a1bd6b4859
  1. .github/
  2. .mvn/
  3. bin/
  4. conf/
  5. dev/
  6. docker/
  7. docs/
  8. scripts/
  9. src/
  10. .asf.yaml
  11. .gitattributes
  12. .gitignore
  13. .gitmodules
  14. CITATION
  15. codecov.yml
  16. CONTRIBUTING.md
  17. doap.rdf
  18. LICENSE
  19. NOTICE
  20. pom.xml
  21. README.md
README.md

Apache SystemDS

Overview: Apache SystemDS is an open-source machine learning (ML) system for the end-to-end data science lifecycle, from data preparation and cleaning, through efficient ML model training, to debugging and serving. ML algorithms or pipelines are specified in a high-level language with R-like syntax or in the related Python and Java APIs (with many built-in primitives), and the system automatically generates hybrid runtime plans of local, in-memory operations and distributed operations on Apache Spark. Additional backends exist for GPUs and federated learning.

| Resource | Links |
| --- | --- |
| Quick Start | Install, Quick Start and Hello World |
| Documentation | SystemDS Documentation |
| Python Documentation | Python SystemDS Documentation |
| Issue Tracker | Jira Dashboard |

Status and Build: SystemDS was renamed from SystemML, which is an Apache Top Level Project. To build from source, see SystemDS Install from source.
