commit | 4f62f4b1772ac8f46b3bbdfba5986ce7280b2da8 | |
---|---|---|
author | Ekta Khanna <ekhanna@pivotal.io> | Fri Feb 14 02:02:56 2020 +0000 |
committer | kaknikhil <kaknikhil@users.noreply.github.com> | Tue Mar 24 13:06:25 2020 -0700 |
tree | c6826b5cdb983e363db3f2be5bee19e2ec389e28 | |
parent | ef763f15c4f008a3ec9cf69b9898bf2918d4f9cf | |
DL: Fix disk issue by using truncate guc

JIRA: MADLIB-1406

While testing places10 with fit multiple (gpdb5 and gpdb6, 10 iterations and 20 msts), we ran out of disk space although we had at least 1.5T left at the beginning of the query. The main contributor to the disk bloat is the update statement that we run at the end of each hop:

```
UPDATE {self.model_output_table}
SET {self.model_weights_col} = {self.weights_to_update_tbl}.{self.model_weights_col}
FROM {self.weights_to_update_tbl}
WHERE {self.model_output_table}.{self.mst_key_col} = {self.weights_to_update_tbl}.{self.mst_key_col}
```

In postgres/gpdb, every update command is really two commands, i.e. an insert and then a delete. Because of this, the actual space is not freed and only gets freed when vacuum is run consistently or vacuum full is run at the end. We verified this by printing the {self.model_output_table} size for each mst_key, and it kept on growing with each update statement.

Also, the disk space for other intermediate tables that get created in the run_training function never gets cleared even though we drop these tables inside the said function. This is because drop/truncate does not release disk space inside a pl function since it's in a sub transaction (this is so that it can roll back). It only releases the space once the pl function has completed execution.

The only way to make this work was to add a truncate statement and change the gpdb code to do a truncate inside a sub transaction. gpdb 6.5 introduced a guc https://github.com/greenplum-db/gpdb/commit/b4692794a0abd8d8051c23019d26e22f7f3d0aa5 which, when turned 'on', allows for truncating the disk space inside a sub transaction. Note that this guc is only available in gpdb 6.5 and up.

run_training workflow (this function is called per hop)
1. join schedule table with mst_weights table to do the hop
2. Call the uda and copy the output to an intermediate table
3. Update the model table with the results of the previous step
4. set the truncate guc to on
5. Create temp table from model table
6. truncate the model table to release disk space
7. rename temp table to model table so that it can be reused for the next hop

Warm Start:
For warm start, we can't keep calling truncate on the user passed output table because then we won't be able to roll it back in case of a failure. So for warm start, we create a copy of the output table passed by the user and then operate on the copied table. At the end, we drop the original output table and rename the copied table to the original table name.

Co-authored-by: Ekta Khanna <ekhanna@pivotal.io>
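Steps 4 to 7 of the workflow, and the final warm start swap, can be sketched as a list of SQL statements built in Python. This is only an illustration under stated assumptions, not the code from this commit: the function names, the table names and the `truncate_guc` argument are hypothetical placeholders (the commit message does not name the guc), and the explicit DROP between steps 6 and 7 is an assumption about how the original table name is freed for the rename.

```python
def hop_cleanup_statements(model_table, tmp_table, truncate_guc):
    """Return SQL for steps 4-7 of one hop, in execution order.

    All arguments are hypothetical placeholders; table names are
    assumed to be unqualified (no schema prefix).
    """
    return [
        # Step 4: let TRUNCATE free disk space inside a sub transaction.
        "SET {0} TO on".format(truncate_guc),
        # Step 5: copy the current model rows into a temp table.
        "CREATE TABLE {0} AS SELECT * FROM {1}".format(tmp_table, model_table),
        # Step 6: truncate the bloated model table to release disk space.
        "TRUNCATE TABLE {0}".format(model_table),
        # Assumption: drop the emptied table so its name can be reused.
        "DROP TABLE {0}".format(model_table),
        # Step 7: the copy becomes the model table for the next hop.
        "ALTER TABLE {0} RENAME TO {1}".format(tmp_table, model_table),
    ]


def warm_start_finalize_statements(user_table, working_copy):
    """Return SQL that swaps the working copy in for the user's table.

    During training only the copy is truncated, so the user's table is
    still intact and can survive a rollback if a hop fails.
    """
    return [
        "DROP TABLE {0}".format(user_table),
        "ALTER TABLE {0} RENAME TO {1}".format(working_copy, user_table),
    ]


if __name__ == "__main__":
    # Print the per-hop statements for made-up table and guc names.
    for stmt in hop_cleanup_statements("model_out", "model_out_tmp",
                                       "my_truncate_guc"):
        print(stmt)
```

Returning plain strings keeps the sketch independent of any database driver; in a pl function each statement would be passed to whatever execute call is available.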
MADlib® is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
See the project website MADlib Home for links to the latest binary and source packages.
We appreciate all forms of project contributions to MADlib including bug reports, providing help to new users, documentation, or code patches. Please refer to Contribution Guidelines for instructions.
For more installation and contribution guides, please refer to the MADlib Wiki.
Details on compiling from source on Linux are also on the wiki.
We provide a Docker image with the dependencies required to compile and test MADlib on PostgreSQL 10.5. You can view the dependency Docker file at ./tool/docker/base/Dockerfile_ubuntu16_postgres10. The image is hosted on Docker Hub at madlib/postgres_10:latest. Later we will provide a similar Docker image for Greenplum Database.
We provide a script to quickly run this docker image at ./tool/docker_start.sh, which will mount your local madlib directory, build MADlib and run install check on this Docker image. At the end, it will run docker exec as the postgres user. Note that you have to run this script from inside your madlib directory, and you can specify your docker CONTAINER_NAME (default is madlib) and IMAGE_TAG (default is latest). Here is an example:
CONTAINER_NAME=my_madlib IMAGE_TAG=LaTex ./tool/docker_start.sh
Notice that this script only needs to be run once. After that, you will have a local docker container with CONTAINER_NAME running. To get access to the container, run the following command and you can keep working on it.
docker exec -it CONTAINER_NAME bash
To kill this docker container, run:
docker kill CONTAINER_NAME
docker rm CONTAINER_NAME
You can also manually run those commands to do the same thing:
## 1) Pull down the `madlib/postgres_10:latest` image from docker hub:
docker pull madlib/postgres_10:latest

## 2) Launch a container corresponding to the MADlib image, name it
##    madlib, mounting the source code folder to the container:
docker run -d -it --name madlib \
    -v (path to madlib directory):/madlib/ madlib/postgres_10
# where madlib is the directory where the MADlib source code resides.

################################# * WARNING * #################################
# Please be aware that when mounting a volume as shown above, any changes you
# make in the "madlib" folder inside the Docker container will be
# reflected on your local disk (and vice versa). This means that deleting data
# in the mounted volume from a Docker container will delete the data from your
# local disk also.
###############################################################################

## 3) When the container is up, connect to it and build MADlib:
docker exec -it madlib bash
mkdir /madlib/build_docker
cd /madlib/build_docker
cmake ..
make
make doc
make install

## 4) Install MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install

## 5) Several other commands can now be run, such as:
# Run install check, on all modules:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check
# Run install check, on a specific module, say svm:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check -t svm
# Reinstall MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres reinstall

## 6) Kill and remove containers (after exiting the container):
docker kill madlib
docker rm madlib
Instructions for building the design pdf on Docker:
For users who want to build the design pdf, make sure you use the IMAGE_TAG=LaTex parameter when running the script. After launching your docker container, run the following to get design.pdf:
cd /madlib/build_docker
make design_pdf
cd doc/design
Detailed build instructions are available in ReadMe_Build.txt
The latest documentation of MADlib modules can be found at MADlib Docs.
The following block-diagram gives a high-level overview of MADlib's architecture.
MADlib incorporates software from the following third-party components.

Bundled with source code:
libstemmer: “small string processing language”
m_widen_init: “allows compilation with recent versions of gcc with runtime dependencies from earlier versions of libstdc++”
argparse 1.2.1: “provides an easy, declarative interface for creating command line tools”
PyYAML 3.10: “YAML parser and emitter for Python”
UseLATEX.cmake: “CMAKE commands to use the LaTeX compiler”

Downloaded at build time (or supplied as build dependencies):
Boost 1.61.0 (or newer): “provides peer-reviewed portable C++ source libraries”
PyXB 1.2.6: “Python library for XML Schema Bindings”
Eigen 3.2.2: “C++ template library for linear algebra”

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this project to You under the Apache License, Version 2.0 (the “License”); you may not use this project except in compliance with the License. You may obtain a copy of the License at LICENSE.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
As specified in LICENSE, additional license information regarding included third-party libraries can be found inside the licenses directory.
Changes between MADlib versions are described in the ReleaseNotes.txt file.
MAD Skills : New Analysis Practices for Big Data (VLDB 2009)
Hybrid In-Database Inference for Declarative Information Extraction (SIGMOD 2011)
Towards a Unified Architecture for In-Database Analytics (SIGMOD 2012)
The MADlib Analytics Library or MAD Skills, the SQL (VLDB 2012)