Apache Tez

Apache Tez is a generic data-processing pipeline engine, envisioned as a low-level engine for higher-level abstractions such as Apache Hadoop MapReduce, Apache Pig, and Apache Hive.

At its heart, Tez is very simple and has just two components:

  • The data-processing pipeline engine, into which one can plug input, processing, and output implementations to perform arbitrary data processing. Every ‘task’ in Tez has the following:
      • An Input to consume key/value pairs from.
      • A Processor to process them.
      • An Output to collect the processed key/value pairs.
  • A master for the data-processing application, with which one can assemble the arbitrary data-processing ‘tasks’ described above into a task-DAG to process data as desired. The generic master is implemented as an Apache Hadoop YARN ApplicationMaster.
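
To make the two components concrete, here is a minimal sketch in plain Python. It is not the Tez API: `run_dag`, `tokenize`, `count`, and the task names are hypothetical and exist only for illustration. Each task consumes key/value pairs (input), transforms them (processor), and hands the result downstream (output), while a small in-process "master" runs the tasks in DAG order. The real master is a YARN ApplicationMaster that schedules tasks across a cluster, which this sketch does not attempt to model.

```python
from collections import defaultdict, deque


def run_dag(processors, edges, source_data):
    """Run each task once, in topological order, and return every task's output.

    processors:  {task_name: function(list of (key, value)) -> list of (key, value)}
    edges:       list of (producer_name, consumer_name) pairs
    source_data: {task_name: list of (key, value)} for tasks with no producer
    """
    consumers = defaultdict(list)
    pending = {name: 0 for name in processors}
    for producer, consumer in edges:
        consumers[producer].append(consumer)
        pending[consumer] += 1

    # Input side: key/value pairs waiting to be consumed by each task.
    inputs = {name: list(source_data.get(name, [])) for name in processors}
    ready = deque(name for name, waiting in pending.items() if waiting == 0)
    outputs = {}

    while ready:
        name = ready.popleft()
        # Processor: turn the task's input pairs into output pairs.
        outputs[name] = list(processors[name](inputs[name]))
        # Output side: hand the pairs to every downstream task.
        for consumer in consumers[name]:
            inputs[consumer].extend(outputs[name])
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if len(outputs) != len(processors):
        raise ValueError("edges contain a cycle, so this is not a DAG")
    return outputs


def tokenize(pairs):
    # Emit (word, 1) for every word in every input line.
    return [(word, 1) for _, line in pairs for word in line.split()]


def count(pairs):
    # Sum the values for each key and return them sorted by key.
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return sorted(totals.items())


result = run_dag(
    processors={"tokenizer": tokenize, "summer": count},
    edges=[("tokenizer", "summer")],
    source_data={"tokenizer": [(0, "a b a"), (1, "b a")]},
)
print(result["summer"])
```

The processors know nothing about where their pairs come from or go to; only the edge list does. That separation is the point of the sketch: the same processing logic can be rewired into a different DAG without being changed.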