commit    0c826c26a547597b9f97e24c809ff073c2ffd30b
author    John Yang <johnyangk@gmail.com>  Tue Oct 30 10:13:18 2018 +0900
committer Jangho Seo <jangho@jangho.io>    Tue Oct 30 10:13:18 2018 +0900
tree      c8c26d70382c3e4854a7736ce93675b00d46bdfc
parent    970751acfb4f9ca357310db6d12f1ccd6841cc43
[NEMO-8] Implement PipeManagerMaster/Worker (#129)

JIRA: [NEMO-8: Implement PipeManagerMaster/Worker](https://issues.apache.org/jira/projects/NEMO/issues/NEMO-8)

**Major changes:**
- Supports fully-pipelined data streaming for bounded sources (not unbounded sources)
  - Tasks 'finish' after processing all input data, as the data is finite
  - When a task finishes, it emits all data it holds (e.g., GroupByKey accumulated results) and closes the corresponding outgoing pipes, notifying downstream tasks of the end of the pipes
  - Stream processing of unbounded sources requires watermarks (https://issues.apache.org/jira/browse/NEMO-233)
- Introduces PipeManagerMaster/Worker
  - Shares code with BlockManagerMaster/Worker
  - Naive, element-wise serialization+compression+writeAndFlush
  - Very likely to cause serious overheads; proper benchmarks and fixes will follow in a later PR

**Minor changes to note:**
- JobConf#SchedulerImplClassName: Batch and Streaming options
- StreamingPolicyParallelismFive: the default policy + PipeTransferEverythingPass
- Fixes the StreamingScheduler to pass the new streaming integration tests
- Fixes a coder bug in the Beam frontend (PCollectionView coder)

**Tests for the changes:**
- WindowedWordCountITCase#testStreamingFixedWindow
- WindowedWordCountITCase#testStreamingSlidingWindow

**Other comments:**
- Also closes "Implement common API for data transfer" (https://issues.apache.org/jira/browse/NEMO-9)

Closes #129
A Data Processing System for Flexible Employment With Different Deployment Characteristics.
Details about Nemo and its development can be found in:
Please refer to the Contribution guideline to contribute to our project.
```bash
export HADOOP_HOME=/path/to/hadoop-2.7.2
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin
```
On Ubuntu 14.04 LTS and its point releases:

```bash
sudo apt-get install protobuf-compiler
```

On Ubuntu 16.04 LTS and its point releases:

```bash
sudo add-apt-repository ppa:snuspl/protobuf-250
sudo apt update
sudo apt install protobuf-compiler=2.5.0-9xenial1
```

On macOS:

```bash
brew tap homebrew/versions
brew install protobuf@2.5
```
Or build from source:

```bash
./configure
make
make check
sudo make install
```

To check for a successful installation of version 2.5.0, run:

```bash
protoc --version
```
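To automate this check in a setup script, something like the following works (a sketch; it assumes `protoc` is on the `PATH` and that `protoc --version` prints `libprotoc <version>`):

```shell
#!/bin/sh
# Succeeds only when the given `protoc --version` output matches
# the 2.5.0 release that Nemo expects.
check_protoc_version() {
  case "$1" in
    "libprotoc 2.5.0") return 0 ;;
    *) return 1 ;;
  esac
}

# `2>/dev/null` keeps the script quiet when protoc is missing entirely.
if check_protoc_version "$(protoc --version 2>/dev/null)"; then
  echo "protoc 2.5.0 found"
else
  echo "protoc 2.5.0 not found; please install it first" >&2
fi
```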
```bash
mvn clean install -T 2C
```

To skip the integration tests:

```bash
mvn clean install -DskipITs -T 2C
```
- `-job_id`: ID of the Beam job
- `-user_main`: Canonical name of the Beam application
- `-user_args`: Arguments that the Beam application accepts
- `-optimization_policy`: Canonical name of the optimization policy to apply to a job DAG in Nemo Compiler
- `-deploy_mode`: `yarn` is supported (default value is `local`)

## MapReduce example

```bash
./bin/run_beam.sh \
    -job_id mr_default \
    -executor_json `pwd`/examples/resources/beam_test_executor_resources.json \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.DefaultPolicy \
    -user_main org.apache.nemo.examples.beam.WordCount \
    -user_args "`pwd`/examples/resources/test_input_wordcount `pwd`/examples/resources/test_output_wordcount"
```

## YARN cluster example

```bash
./bin/run_beam.sh \
    -deploy_mode yarn \
    -job_id mr_transient \
    -executor_json `pwd`/examples/resources/beam_test_executor_resources.json \
    -user_main org.apache.nemo.examples.beam.WordCount \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.TransientResourcePolicy \
    -user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"
```
The `-executor_json` command line option can be used to provide a path to the JSON file that describes the resource configuration for executors. Its default value is `config/default.json`, which initializes one `Transient`, one `Reserved`, and one `Compute` executor, each of which has one core and 1024MB of memory.
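Based on that description, the default `config/default.json` is likely equivalent to the following (a sketch inferred from the prose above, not the literal file contents):

```json
[
  { "type": "Transient", "memory_mb": 1024, "capacity": 1 },
  { "type": "Reserved",  "memory_mb": 1024, "capacity": 1 },
  { "type": "Compute",   "memory_mb": 1024, "capacity": 1 }
]
```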
- `num` (optional): Number of containers. Default value is 1.
- `type`: Three container types are supported:
  - `Transient`: Containers that store eviction-prone resources. When batch jobs use idle resources in `Transient` containers, they can be arbitrarily evicted when latency-critical jobs attempt to use the resources.
  - `Reserved`: Containers that store eviction-free resources. `Reserved` containers are used to reliably store intermediate data which have high eviction cost.
  - `Compute`: Containers that are mainly used for computation.
- `memory_mb`: Memory size in MB.
- `capacity`: Number of `Task`s that can be run in an executor. Set this value to be the same as the number of CPU cores of the container.

```json
[
  {
    "num": 12,
    "type": "Transient",
    "memory_mb": 1024,
    "capacity": 4
  },
  {
    "type": "Reserved",
    "memory_mb": 1024,
    "capacity": 2
  }
]
```
This example configuration specifies 12 `Transient` containers and one `Reserved` container, each with 1024MB of memory; each `Transient` container can run 4 `Task`s concurrently, and the `Reserved` container can run 2.
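As a quick sanity check of what the example configuration above requests in aggregate (a POSIX shell sketch; the numbers are taken directly from the JSON example):

```shell
#!/bin/sh
# Aggregate resources implied by the example configuration:
# 12 Transient containers (capacity 4) and 1 Reserved container (capacity 2),
# each with 1024 MB of memory. "num" defaults to 1 when omitted.
num_transient=12; cap_transient=4
num_reserved=1;   cap_reserved=2
mem_per_container=1024

total_containers=$(( num_transient + num_reserved ))
total_memory_mb=$(( total_containers * mem_per_container ))
total_task_slots=$(( num_transient * cap_transient + num_reserved * cap_reserved ))

echo "containers: $total_containers"   # 13
echo "memory_mb:  $total_memory_mb"    # 13312
echo "task slots: $total_task_slots"   # 50
```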
Nemo Compiler and Engine can store JSON representation of intermediate DAGs.
The `-dag_dir` command line option is used to specify the directory where the JSON files are stored. The default directory is `./dag`. Using our online visualizer, you can easily visualize a DAG: just drop the JSON file of the DAG as an input to it.

```bash
./bin/run_beam.sh \
    -job_id als \
    -executor_json `pwd`/examples/resources/beam_test_executor_resources.json \
    -user_main org.apache.nemo.examples.beam.AlternatingLeastSquare \
    -optimization_policy org.apache.nemo.compiler.optimizer.policy.TransientResourcePolicy \
    -dag_dir "./dag/als" \
    -user_args "`pwd`/examples/resources/test_input_als 10 3"
```