DBSCAN: Fast parallel-optimized DBSCAN

This optimized version of DBSCAN offers a dramatic improvement over
the original brute-force implementation.  It should have roughly
O(N/S log N) runtimes, where N is the size of the input dataset
(number of points/rows) and S is the number of segments used.
(Note:  the algorithm now decides on its own how many segments to use,
since sometimes splitting things up further only increases the runtime.
Therefore S is not necessarily the total number of segments in the
cluster; it may be less, depending on the structure of the dataset.)

This borrows many aspects from the previous attempt (using
an overlapping spatial binary tree to segment the dataset and
using an R tree index to speed up range queries), but differs in
its approach in other ways.

The brute-force DBSCAN runs in O(N^2) time.  To improve on this,
we split the data into overlapping regions, run DBSCAN on each
in parallel, then merge the results together on the coordinator.
More specifically:

1.  The data is segmented into connected spatial regions, using an
    in-house binary tree designed specifically for DBSCAN.  This custom
    tree is similar to a kd-tree, but tries to simultaneously keep the
    child nodes of each node as balanced as possible while splitting along
    a hyperplane that favors passing through the least dense regions of the
    space.  To accomplish this, it constructs a coarse-grained density map
    of each node before deciding where to split, minimizing a loss function
    which estimates the longest expected runtime of any segment assigned to
    the descendants of that node.

2.  Each leaf of the optimized spatial tree runs the dbscan
    algorithm on the points in its spatial region, including
    some points from other regions near its boundaries, using
    an R tree index for efficient range queries.

3.  Merge the clusters found in each leaf with each other by keeping track
    of which points in neighboring leaves are within eps of a cluster's home
    leaf.  MADlib's wcc (weakly connected components) graph module is used to
    identify the equivalence classes of clusters across all leaves.
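
    For intuition, the merge in step 3 amounts to computing equivalence
    classes over per-leaf cluster labels.  Below is a minimal standalone
    sketch in python; the actual implementation builds an edge table and
    runs MADlib's wcc module in SQL, so this union-find version is only
    meant to illustrate the equivalence-class idea.

        def merge_leaf_clusters(edges, all_cluster_ids):
            """edges: pairs of per-leaf cluster labels (e.g. (leaf_id, local
            cluster id) tuples) that must end up in the same global cluster,
            because they share a point or have points within eps of each
            other across a leaf boundary."""
            parent = {c: c for c in all_cluster_ids}

            def find(c):
                while parent[c] != c:
                    parent[c] = parent[parent[c]]   # path halving
                    c = parent[c]
                return c

            for a, b in edges:
                parent[find(a)] = find(b)

            # map every per-leaf cluster label to a single global representative
            return {c: find(c) for c in all_cluster_ids}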

================ Partial List of Detailed Improvements ==============

- Initial build_kdtree() re-write

    Previously, we were including as border points a lot of points that
    were nowhere near the border.  The new version should only include
    those which are actually within epsilon distance of a border.

    It also outputs the information in a different format that
    provides a bit more information, and will make a fast merge of
    clusters from different segments possible in a later stage.

    The output of build_kdtree (renamed to build_optimized_tree) is now
    a single table of annotated points instead of many tables.  The __leaf_id__
    column has been split into two similar concepts, __leaf_id__ and
    __dist_id__.  __leaf_id__ tells you which leaf a point is in, and
    __dist_id__ tells you which (extended) leaf a row is in, which in turn
    determines the segment on which it will be stored.

    Note that each point can have several entries in the segmented_output
    table.  There is not a 1-to-1 correspondence between points and rows,
    since we store each border point in multiple overlapping leaves
    (segments) at once.
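
    For illustration, a single border point with id 17, whose home leaf is 3
    and which lies within eps of leaf 4's region, would show up as two rows
    (the values and the column names other than __leaf_id__ and __dist_id__
    are purely illustrative):

        rows = [
            #  id, point,      __leaf_id__, __dist_id__
            (17, [0.9, 2.1],            3,           3),  # home copy
            (17, [0.9, 2.1],            3,           4),  # border copy, stored with leaf 4's segment
        ]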

- Convert find_core_points() UDA to dbscan_leaf() UDF

- Add dbscan_record class, for managing input and output of dbscan_leaf
   (maps to the dbscan_record SQL TYPE, but with some hidden fields added
   for keeping track of record-specific variables in python; these are
   automatically removed before yielding the final result, when the
   record gets cloned via the copy constructor)
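
    A rough sketch of the idea (the exact field names of the SQL TYPE and
    of the hidden python-only attributes are assumptions, not the real
    definitions):

        class dbscan_record(object):
            # fields assumed to mirror the dbscan_record SQL TYPE
            SQL_FIELDS = ('id', 'point', 'cluster_id', 'is_core_point',
                          'leaf_id', 'dist_id')

            def __init__(self, other=None, **kwargs):
                if other is not None:
                    # copy constructor: keep only the SQL-visible fields, so
                    # the python-only bookkeeping is dropped before the row
                    # is yielded back to the database
                    for f in self.SQL_FIELDS:
                        setattr(self, f, getattr(other, f))
                else:
                    for f in self.SQL_FIELDS:
                        setattr(self, f, kwargs.get(f))
                    # hidden, python-only fields (illustrative names)
                    self._num_internal_neighbors = 0
                    self._possible_border_points = []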

- Allow passing id_column as an expression:
    The id_column input param is now handled the same way as the
    point column, so any expression evaluating to an integer type
    should work, instead of requiring the id to be a single column.
    The name of the column in the output table will now always be 'id',
    but the actual expression used on the original input table is still
    stored in the summary table.

- Use sqrt(eps) only for kd_tree, raw eps for dbscan_leaf
    The latter compares a squared distance to a squared-distance
    threshold, while generating the kd-tree needs an actual distance.
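
    To make the distinction concrete, here is a small sketch of the two
    comparisons (function and variable names are illustrative, not the
    actual code):

        import numpy as np

        def within_eps_squared(p, q, eps_sq):
            # dbscan_leaf style: compare a squared distance against a squared
            # threshold, so no sqrt is taken per candidate pair
            d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
            return float(np.dot(d, d)) <= eps_sq

        def within_eps(p, q, eps):
            # kd-tree construction works with an actual distance, e.g. when
            # widening a leaf's region by eps to collect border points
            d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
            return float(np.sqrt(np.dot(d, d))) <= eps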

- Further re-write of build_kd_tree():

      It was taking 30s to generate the overlapping kd-tree for 10,000 points
    (and several minutes for 100,000 points), because we were running many
    long queries iteratively in python.  But this should be the fast part!

      Now, it generates the non-overlapping kd-tree first, then adds the
    external points for the overlap regions.  The generation of the
    regular kd-tree, along with the cutoff table, is working now and is
    very fast (< 1s for 10,000 points, < 5s for 100,000 points).
    It's just a single 0.5s recursive query for generating the segmented
    source (without external point augmentation), and an even simpler 0.1s
    query afterwards for the cutoffs.
       I don't expect adding the overlap points to take too much time
    either, since we already have the non-overlapping cutoffs.  We can
    generate a leaf bounds table from those (min & max boundary points
    for each leaf) without even touching the source table, and should
    be able to join it to the source table in a final query which
    scans through N/S * max_depth rows per segment (similar to the
    first two fast queries).
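
    A rough python sketch of the membership test behind that final join
    (the real version would be a single SQL query against the leaf bounds
    table, and the per-dimension expansion by eps here is a simplification
    of the actual predicate):

        import numpy as np

        def extra_leaves(point, leaf_bounds, home_leaf, eps):
            """leaf_bounds: {leaf_id: (lower, upper)}, the per-dimension min
            and max corners taken from the non-overlapping cutoffs.  Returns
            the other leaves whose bounding box, expanded by eps in every
            dimension, still contains the point, i.e. the extra copies of
            the row to emit."""
            p = np.asarray(point, dtype=float)
            hits = []
            for leaf_id, (lower, upper) in leaf_bounds.items():
                if leaf_id == home_leaf:
                    continue
                if np.all(p >= np.asarray(lower, dtype=float) - eps) and \
                   np.all(p <= np.asarray(upper, dtype=float) + eps):
                    hits.append(leaf_id)
            return hits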

-  Adds print_timings debug statements, for detailed performance profiling

-  Changes __dist_id__ to dist_id and adds a new __dist_id__:
   This adds another layer of abstraction to the mapping from
   leaves to dist_id's to segments, but it was needed to ensure that
   the leaves are balanced among the physical cluster segments...
   otherwise, for a small number of leaves, the dist_id's map somewhat
   randomly to the segments, and several can end up on the same segment
   while other segments handle no data at all.  This always places
   leaves on the segments in a round-robin fashion rather than
   based on gpdb's internal DISTRIBUTED BY hash function.

-  Move dbscan_optimized into its own class

-  Fix use of wrong eps in build_kd_tree (there were cases where we
   were using eps instead of eps^2 for the squared-dist-norm).

-  Add functions for optimized tree segmentation (uses a loss function
   instead of a kd-tree for making cuts)

-  Skip missing dist_id's while creating dist_map:

   If all the dist_id's were sequential with no gaps, as with the kd-tree,
   then this wouldn't be necessary.  But now, since some branches of
   the tree stop splitting before others, we need to account for these
   missing dist_id's... otherwise you can end up with several dist_id's
   on the same segment and other segments with nothing.
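
   A small sketch of what the dist_map needs to accomplish (the real map is
   built in SQL, and the exact contents and names here are assumptions):

        def build_dist_map(dist_ids, num_segments):
            """Compress the (possibly gappy) dist_id values into consecutive
            __dist_id__'s and spread them over the physical segments
            round-robin.  Gaps appear because some branches of the optimized
            tree stop splitting before others."""
            dist_map = {}
            for new_id, old_id in enumerate(sorted(set(dist_ids))):
                dist_map[old_id] = {'__dist_id__': new_id,
                                    'segment': new_id % num_segments}
            return dist_map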

- Accelerates range_query() using vectorized numpy operations
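
    For example, the candidate list returned by the R-tree can be filtered
    in a single vectorized pass instead of a python loop.  A sketch only,
    not the actual range_query signature:

        import numpy as np

        def filter_candidates(query_point, candidates, eps_sq):
            """candidates: (k, d) array of points returned by the R-tree.
            Returns a boolean mask of those actually within eps, computed
            on squared distances in one vectorized pass."""
            diffs = (np.asarray(candidates, dtype=float)
                     - np.asarray(query_point, dtype=float))
            return np.einsum('ij,ij->i', diffs, diffs) <= eps_sq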

- Move many functions into a self-contained DBSCANStorage class

  This makes things much easier to manage, with fewer params to pass around.

- Early removal of points from rtree, only storing the minimum necessary

    Delete each point from the rtree as soon as the first range query is run
    on it, to speed up range queries run on neighboring points.  The smaller
    the tree, the faster the results get returned.  (Searching the tree
    for a single point is log N, but a range query returns many points... so
    it's at *least* O(log N + k) and possibly more like O(k log N), where k
    is the number of candidate neighbors returned, which we have to go
    through one at a time to check whether they are actually in range.)

    Deleting points as soon as we label them, without any further
    modifications, would throw off the calculation of is_core_point.
    We can still delete them early, but to ensure is_core_point is
    right we need to keep track of the number of neighbors each internal
    point has.  Some of those neighbors are internal and others are
    external.  The sum of both determines the final value of is_core_point.

    After doing this, a lot of things we were previously doing no longer
    made sense.  Since we need to query the external neighbors of the
    internal points anyway, we may as well do all of the inverse queries
    up front instead (searching for the internal neighbors of each external
    point).  In general, external points are expected to be fewer than
    internal points, so hopefully this will be faster.  But then we have
    no more use for the cluster-specific trees (unless we want to add
    them back for use by the internal points?).  range_query no longer
    has min_results or max_results params.
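
    A self-contained sketch of the early-deletion bookkeeping described
    above, assuming 2-D points and the python rtree package; this is not
    the actual dbscan_leaf code or the real range_query signature:

        import numpy as np
        from rtree import index   # assumed dependency: the python 'rtree' package

        def range_query_once(rt, points, counts, qid, eps):
            """Find the still-indexed neighbors of point qid, credit both
            sides of every pair (so is_core_point stays correct even though
            qid is about to disappear from the index), then delete qid so
            that later queries search a smaller tree."""
            q = points[qid]
            lo, hi = q - eps, q + eps
            cands = rt.intersection((lo[0], lo[1], hi[0], hi[1]))  # bbox search
            nbrs = [c for c in cands
                    if c != qid and np.sum((points[c] - q) ** 2) <= eps * eps]
            counts[qid] += len(nbrs)       # neighbors still in the tree
            for c in nbrs:
                counts[c] += 1             # qid won't be found once it is deleted
            rt.delete(qid, (q[0], q[1], q[0], q[1]))   # early removal
            return nbrs

        # usage sketch (2-D):
        #   points = np.random.rand(1000, 2)
        #   rt = index.Index()
        #   for i, p in enumerate(points):
        #       rt.insert(i, (p[0], p[1], p[0], p[1]))
        #   counts = {i: 0 for i in range(len(points))}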

- All examples passing, one each for 3D, 4D, 9D, 16D, 25D, 36D, 49D,
    and 64D, tested on both a 3-segment cluster and a 16-segment cluster

- Change default depth to num_segs instead of log_2(num_segs)

    log_2(num_segs) made sense for the kd_tree, because each node of the tree
    is always split exactly in half, resulting in a balanced binary tree.
    With the optimized tree algorithm, splits can happen anywhere: there can
    be 1 segment on the left and 50 on the right, and then further down the
    tree the 50 get split up more while the 1 does not.  In the worst case,
    one node gets split off from the rest at each cut, meaning that in order
    to populate num_segs leaves you need num_segs cuts.  This could result in
    a longer optimization time, but the optimization phase is usually very
    fast compared to the actual parallel-DBSCAN phase that follows.
    It should only matter for small datasets, and the user always has the
    option of setting a lower max_depth than the default if optimization
    is taking too much time.

- Cast input to DOUBLE PRECISION[] in source_view

    (This is necessary to be able to handle input tables where the points
     column is REAL[], INTEGER[], etc.)

- Rename depth -> max_segmentation_depth
- Rename "kd_tree" method option to "optimized"
- Remove unused kd_tree code

- Fix a subtle issue involving early removal of border points from rtree:

    If range_query is called on a point in the outer for loop (where we're
    looking for a new cluster seed, not growing a cluster yet), then it will be
    tentatively labelled as NOISE_POINT... but may be assigned to a cluster
    later, if it borders a core point while we're growing a cluster.
    Instead of keeping these points in the rtree, we go ahead and delete them
    (same as for core points or noise points, which we know won't need
    further processing) but must keep track of a list of _possible_border_points
    in each internal db_rec.  During update_neighbors (called by range_query),
    we can also add the id to each of its unlabelled neighbors'
    _possible_border_points list, in case they turn out to be core points later.
    (None of them will be labelled as core points yet, since anything marked as
    a core point has already been removed from the tree; but there may be some
    NOISE_POINT's; we skip the neighboring NOISE_POINT's since we know they
    won't ever be core points.)
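
    A sketch of that bookkeeping, reusing the dbscan_record sketch from
    earlier in these notes (attribute and function names are illustrative,
    not the actual dbscan_leaf internals):

        def remember_possible_border(rec, neighbor_recs):
            """Called while handling a range_query for internal point 'rec'
            in the outer loop.  Each still-unlabelled neighbor remembers
            rec's id, so that if that neighbor later turns out to be a core
            point it can claim rec as a border point, even though rec has
            already been deleted from the rtree.  Neighbors that are already
            labelled (including NOISE_POINT, which can never become core)
            are skipped."""
            for nb in neighbor_recs:
                if nb.cluster_id is None:       # unlabelled so far
                    nb._possible_border_points.append(rec.id)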

- Add scripts for creating blobs for perf testing
- Add scripts for generating and running dbscan perf tests
README.md

MADlib® is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.

Build Status

Installation and Contribution

See the project website MADlib Home for links to the latest binary and source packages.

We appreciate all forms of project contributions to MADlib including bug reports, providing help to new users, documentation, or code patches. Please refer to Contribution Guidelines for instructions.

For more installation and contribution guides, please refer to the MADlib Wiki.

Compiling from source on Linux details are also on the wiki.

Development with Docker

We provide a Docker image with necessary dependencies required to compile and test MADlib on PostgreSQL 10.5. You can view the dependency Docker file at ./tool/docker/base/Dockerfile_ubuntu16_postgres10. The image is hosted on Docker Hub at madlib/postgres_10:latest. Later we will provide a similar Docker image for Greenplum Database.

We provide a script to quickly run this docker image at ./tool/docker_start.sh, which will mount your local madlib directory, build MADlib and run install check on this Docker image. At the end, it will docker exec into the container as the postgres user. Note that you have to run this script from inside your madlib directory, and you can specify your docker CONTAINER_NAME (default is madlib) and IMAGE_TAG (default is latest). Here is an example:

CONTAINER_NAME=my_madlib IMAGE_TAG=LaTex ./tool/docker_start.sh

Notice that this script only needs to be run once. After that, you will have a local docker container with CONTAINER_NAME running. To get access to the container, run the following command and you can keep working on it.

docker exec -it CONTAINER_NAME bash

To kill this docker container, run:

docker kill CONTAINER_NAME
docker rm CONTAINER_NAME

You can also manually run those commands to do the same thing:

## 1) Pull down the `madlib/postgres_10:latest` image from docker hub:
docker pull madlib/postgres_10:latest

## 2) Launch a container corresponding to the MADlib image, name it
##    madlib, mounting the source code folder to the container:
docker run -d -it --name madlib \
    -v (path to madlib directory):/madlib/ madlib/postgres_10
# where madlib is the directory where the MADlib source code resides.

################################# * WARNING * #################################
# Please be aware that when mounting a volume as shown above, any changes you
# make in the "madlib" folder inside the Docker container will be
# reflected on your local disk (and vice versa). This means that deleting data
# in the mounted volume from a Docker container will delete the data from your
# local disk also.
###############################################################################

## 3) When the container is up, connect to it and build MADlib:
docker exec -it madlib bash
mkdir /madlib/build_docker
cd /madlib/build_docker
cmake ..
make
make doc
make install

## 4) Install MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install

## 5) Several other commands can now be run, such as:
# Run install check, on all modules:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check
# Run install check, on a specific module, say svm:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres install-check -t svm
# Reinstall MADlib:
src/bin/madpack -p postgres -c postgres/postgres@localhost:5432/postgres reinstall

## 6) Kill and remove containers (after exiting the container):
docker kill madlib
docker rm madlib

Instructions for building the design pdf on Docker:

For users who want to build the design pdf, make sure you use the IMAGE_TAG=LaTex parameter when running the script. After launching your docker container, run the following to get design.pdf:

cd /madlib/build_docker
make design_pdf
cd doc/design

Detailed build instructions are available in ReadMe_Build.txt

User and Developer Documentation

The latest documentation of MADlib modules can be found at MADlib Docs.

Architecture

The following block-diagram gives a high-level overview of MADlib's architecture.

MADlib Architecture

Third Party Components

MADlib incorporates software from the following third-party components. Bundled with source code:

  1. libstemmer “small string processing language”
  2. m_widen_init “allows compilation with recent versions of gcc with runtime dependencies from earlier versions of libstdc++”
  3. argparse 1.2.1 “provides an easy, declarative interface for creating command line tools”
  4. PyYAML 3.10 “YAML parser and emitter for Python”
  5. UseLATEX.cmake “CMAKE commands to use the LaTeX compiler”

Downloaded at build time (or supplied as build dependencies):

  1. Boost 1.61.0 (or newer) “provides peer-reviewed portable C++ source libraries”
  2. PyXB 1.2.6 “Python library for XML Schema Bindings”
  3. Eigen 3.2.2 “C++ template library for linear algebra”

Licensing

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this project to You under the Apache License, Version 2.0 (the “License”); you may not use this project except in compliance with the License. You may obtain a copy of the License at LICENSE.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

As specified in LICENSE additional license information regarding included third-party libraries can be found inside the licenses directory.

Release Notes

Changes between MADlib versions are described in the ReleaseNotes.txt file.

Papers and Talks

Related Software

  • PivotalR - PivotalR also lets the user run the functions of the open-source big-data machine learning package MADlib directly from R.
  • PyMADlib - PyMADlib is a python wrapper for MADlib, which brings you the power and flexibility of python with the number crunching power of MADlib.