tree 6ebe4f498481b31d4de0bba9cbb97fb6b9251163
parent da9fce524be12d8a4a68e7314bb7471449ff957c
author Domino Valdano <dvaldano@vmware.com> 1601520471 -0700
committer Domino Valdano <dvaldano@vmware.com> 1611101902 -0800

DL: Major Refactor of Model Hopper

JIRA: MADLIB-1428

- Use only 2 temporary tables (model_input_tbl & model_output_tbl)
  for moving the model weights around during hopping and training,
  instead of 3 (mst_weights_tbl, weights_to_update_tbl, and model_output_table).
  This eliminates the UPDATE step, leaving only HOP and UDF steps.

- Add dist_key column to model_output table and DISTRIBUTE BY this instead
   of mst_key.  This removes Redistribute Motion from UDF query plan, so
   that weights only ever move during the hop query, not during the
   training query.

- Simplified schedule rotation: schedule table is created only once, then gets
  rotated on segments, instead of being re-created many times by transferring
  data back and forth from master to segments to master each hop.  No longer
  need separate "current_schedule" and "grand_schedule" data structures.

- Skip first hop of each iteration
   (just rename model_output to model_input instead)

- Split get_model_arch_and_weights() into query_weights() and get_model_arch(),
    so we don't have to transfer weights from segment to master in places
    where we only need the model_arch json.

- Much faster initialization code:  previously, we were reading the weights
  in from the original model output table (during warm start) and the model
  arch table (for transfer learning) one mst row at a time from segment to
  master, then writing them each back out one row at a time from master
  back to segments with a large number of SELECT and INSERT queries.
  Now, we just use a single query to copy the weights directly from the
  original model output table into the new model output table on the
  segments, without ever sending them to master.  And a similar single
  query copies the transfer learning weights directly from model_arch to
  model_output for training.  Both of these happen in parallel on the
  segments, instead of in sequence on master.  During testing on
  a 20-segment cluster with 20 models, this resulted in a 10x reduction
  in initialization time (26s instead of 5 mins)

- Add some debugging that can be enabled to help profile the
  performance of fit multiple, and track which segment each mst_key
  is located on during each hop. This also serves as an example for
  the utils/debug PR this is rebased on top of.

- Add "unit" tests for fit multiple model hopping code (implemented
  as dev-check tests so they can access the db)

- Send Traceback of stack from segment back to coordinator

- Cache plans for Hop & UDF queries
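
Illustration of the schedule rotation described above. This is only a
minimal sketch, not the code in this commit: it assumes a simple round-robin
rotation with as many models as segments, and the names rotate_schedule and
hop_schedule are hypothetical. The real rotation happens in SQL on the
segments; this Python version only shows the idea that the schedule is built
once and then shifted by one position per hop.

```python
from collections import deque


def rotate_schedule(dist_keys):
    """Return the segment assignment after one hop (shift left by one)."""
    rotated = deque(dist_keys)
    rotated.rotate(-1)  # each model moves on to the next segment in the ring
    return list(rotated)


def hop_schedule(mst_keys, dist_keys):
    """Yield one {mst_key: dist_key} mapping per hop of a single iteration.

    Assumption for this sketch: len(mst_keys) == len(dist_keys), so every
    model visits every segment exactly once per iteration.
    """
    current = list(dist_keys)  # the schedule is created once ...
    for _ in range(len(dist_keys)):
        yield dict(zip(mst_keys, current))
        current = rotate_schedule(current)  # ... and only rotated afterwards


if __name__ == "__main__":
    for hop, mapping in enumerate(hop_schedule([1, 2, 3], [0, 1, 2])):
        print(hop, mapping)
```

In this sketch the first mapping is the unrotated schedule, which loosely
mirrors the "skip first hop" item: the models start the iteration where the
previous one left them, and weights only need to move on later hops.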
