Merge pull request #58 from nadav-har-tzvi/documentation/python

documentation for pyspark sdk
diff --git a/docs/docs/config.md b/docs/docs/config.md
index 52adfc2..c449245 100644
--- a/docs/docs/config.md
+++ b/docs/docs/config.md
@@ -28,12 +28,15 @@
 |   +-- dev/
 |   |   +-- job.yaml
 |   |   +-- spark.yaml
+|   |   +-- datasets.yaml
 |   +-- test/
 |   |   +-- job.yaml
 |   |   +-- spark.yaml
+|   |   +-- datasets.yaml
 |   +-- prod/
 |       +-- job.yaml
 |       +-- spark.yaml
+|       +-- datasets.yaml
 +-- src/
 |   +-- start/
 |       +-- dev/
@@ -77,4 +80,82 @@
 For more information about specific framework configuration options, look at the [frameworks](frameworks/) section of this documentation.
 
 ### Datasets 
+
+One of the main concerns when running data pipelines is where and how to get the data required to run the jobs.
+
+Amaterasu provides a mechanism for managing and configuring such data sources. Each of our SDKs provides a way to load and persist data easily, based on this configuration.
+
+In a job repository, each environment contains a ```datasets.yaml``` file. This file contains the configurations of all datasets to be used in the job.
+
+Below is an example of a simple configuration for a dataset stored as Parquet in Amazon S3.
+
+```yaml
+file:
+  - uri: s3a://amaterasu-example/input/random-beers
+    format: parquet
+    name: random-beers
+```
+
+#### Detailed configuration
+Amaterasu supports different types of datasets and their corresponding configuration options.
+Do note that different Apache Amaterasu frameworks may have their own take on the configurations below.
+##### File
+Currently, the following formats are supported: JSON, Parquet, CSV and ORC.
+
+The following storage types are currently supported: S3, Azure Blob Storage, local file system and HDFS.
+
+The following configuration options are currently supported:
+
+| Parameter | Description |
+|:---------:|:-----------:|
+| name      |The name of the dataset|
+| uri       |The file URI. Supported schemes are: s3a, file, hdfs, wasb, wasbs, gs|
+| format    |The file format: JSON, Parquet, CSV or ORC|
+
+Example:
+```yaml
+file:
+  - uri: s3a://amaterasu-example/input/random-beers
+    format: parquet
+    name: random-beers
+``` 
+
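+The same structure applies to the other supported stores. For example, an ORC dataset on HDFS might be defined as follows (the path below is illustrative):
+
+```yaml
+file:
+  - uri: hdfs:///data/input/random-beers
+    format: orc
+    name: random-beers-hdfs
+```
+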
+> Note! If the scheme or format isn't currently supported by the available Apache Amaterasu frameworks, it is still possible to define it using a [generic dataset](#generic-datasets).
+ 
+##### Hive Table
+ 
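+As shown in the example below, a Hive dataset is configured with at least the following options:
+
+| Parameter | Description |
+|:---------:|:-----------:|
+| name      |The name of the dataset|
+| table     |The Hive table backing the dataset|
+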
+Example:
+ 
+```yaml
+hive:
+  - table: mytable
+    name: mydataset
+```
+
+##### Generic Datasets
+Aside from Hive tables and files, there is a need to allow configuration of other data sources, even if they aren't currently supported by Apache Amaterasu.
+
+There is a wide range of use cases:
+
+* Pulling data from external APIs, where API keys change between environments (prod and dev keys)
+* Pulling data from organizational web services
+* Pulling data from relational databases
+* and many more.
+
+To support configuration management for such use cases, Apache Amaterasu provides the means to define generic datasets.
+
+Example:
+```yaml
+generic:
+    - name: mygenericds
+      key1: value1
+      key2: value2
+
+    - name: myothergenericds
+      key1: value1
+      key2: value2
+```
+
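+For instance, going back to the external API use case above, the development environment could carry a sandbox endpoint and key while production carries the real ones (all keys and values below are illustrative):
+
+```yaml
+generic:
+    - name: external-api
+      endpoint: https://sandbox.api.example.com/v2
+      api_key: my-dev-api-key
+```
+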
+The usage of generic datasets is explained in the relevant SDK documentation [section](./frameworks.md#integration-with-unsupported-data-sources).
+
 ### Custom Configuration
\ No newline at end of file
diff --git a/docs/docs/frameworks.md b/docs/docs/frameworks.md
index 3ee07c5..a0c787c 100644
--- a/docs/docs/frameworks.md
+++ b/docs/docs/frameworks.md
@@ -41,13 +41,147 @@
 
 # Amaterasu Frameworks
 
+## Python 
+Amaterasu supports a variety of Python frameworks:
+
+1. PySpark workload ([See below](#pyspark))
+
+2. Pandas workload 
+
+3. Generic Python workload
+
+Each workload type has a dedicated Apache Amaterasu SDK. 
+The Apache Amaterasu SDK is available on PyPI and can be installed as follows:
+```bash
+pip install apache-amaterasu
+```
+
+Alternatively, it is possible to download the SDK source distribution and install it manually, either via ```easy_install``` or by executing the setup script:
+
+```bash
+wget <link to source distribution>
+tar -xzf apache-amaterasu-0.2.1-incubating.tar.gz
+cd apache-amaterasu-0.2.1-incubating
+python setup.py install
+```
+
+### Action dependencies
+Apache Amaterasu can ensure that Python dependencies are present on all execution nodes when executing action sources.
+
+In order to define the required dependencies, a ```requirements.txt``` file has to be added to the job repository.
+Currently, only a global ```requirements.txt``` is supported.
+
+Below you can see where the requirements file has to be added:
+```
+repo
++-- deps/
+|   +-- requirements.txt <-- This is the place where you define Python dependencies
++-- env/
+|   +-- dev/
+|   |   +-- job.yaml
+|   |   +-- spark.yaml
+|   +-- test/
+|   |   +-- job.yaml
+|   |   +-- spark.yaml
+|   +-- prod/
+|       +-- job.yaml
+|       +-- spark.yaml
++-- src/
+|   +-- start/
+|       +-- dev/
+|       |   +-- job.yaml
+|       |   +-- spark.yaml
+|       +-- test/
+|       |   +-- job.yaml
+|       |   +-- spark.yaml
+|       +-- prod/
+|           +-- job.yaml
+|           +-- spark.yaml
++-- maki.yaml 
+
+```
+
+When a ```requirements.txt``` file exists, Apache Amaterasu distributes it to the execution containers and locally installs the dependencies in each container.
+
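+For example, a ```requirements.txt``` pinning the libraries your actions need might look like this (package names and versions are purely illustrative):
+
+```
+pandas==0.25.3
+requests==2.22.0
+boto3==1.10.14
+```
+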
+> **Important** - Your execution nodes need to have an egress connection available in order to use pip.
+
+### Pandas
+### Generic Python
+
+
+## Java and JVM programs
+
 ## Apache Spark
 
 ### Spark Configuration
+Apache Amaterasu can deploy Spark applications and provides configuration and integration 
+utilities that work with DataFrames and RDDs.
 
 ### Scala
 ### PySpark
+Assuming that Apache Amaterasu has been [configured](./config.md) correctly for your cluster and all required 
+Spark [configurations](#spark-configuration) are in place, all you need to do is integrate with the Apache Amaterasu 
+PySpark SDK from within your source scripts.
 
-## Python 
+#### Integration with supported data sources
 
-## Java and JVM programs
\ No newline at end of file
+The Spark SDK provides a mechanism to seamlessly consume and persist datasets. To do so, you must first define a [dataset configuration](./config.md#datasets). 
+Let's assume that you have two datasets defined. The first, ```mydataset```, is the input dataset, and ```resultdataset``` is the output dataset.
+Let's also assume that you have two different environments, production and development.
+Let's take a sneak peek at how ```mydataset``` is defined in each environment:
+
+__production__
+```yaml
+file:
+  - uri: s3a://myprodbucket/path/myinputdata
+    format: parquet
+    name: mydataset
+```
+
+__development__
+```yaml
+file:
+  - uri: s3a://mydevbucket/path/myinputdata
+    format: parquet
+    name: mydataset
+```
+
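+```resultdataset``` would be defined the same way in each environment. For example (the output path below is illustrative):
+
+```yaml
+file:
+  - uri: s3a://myprodbucket/path/myresultdata
+    format: parquet
+    name: resultdataset
+```
+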
+The snippet below shows how to use the Amaterasu SDK to integrate with the data sources easily.
+
+```python
+from pyspark.sql import DataFrame
+
+from amaterasu.pyspark.runtime import AmaContext
+
+ama_context = AmaContext.builder().build()
+some_dataset: DataFrame = ama_context.load_dataset("mydataset")
+some_dataset_after_transformation = some_dataset  # Do some black magic here instead
+ama_context.persist_dataset("resultdataset", some_dataset_after_transformation)
+```
+
+The code above will work regardless of whether you run in the development or the production environment.
+> Important: You will need to have the datasets above defined in both environments!
+
+#### Integration with unsupported data sources
+
+Apache Amaterasu is still an evolving project, and we add more and more built-in integrations as we go (e.g. Pandas dataframes, new data source types and so on). 
+Since we are aware of the need to manage datasets across environments, regardless of whether Apache Amaterasu supports them by default or not, we designed Apache Amaterasu to provide a way to define [generic datasets](./config.md#generic-datasets). 
+In addition to defining generic datasets, Apache Amaterasu also provides an API to reference any dataset configuration defined in the environment.
+
+The snippet below shows an example of referencing a dataset configuration in the code.
+
+```python
+from amaterasu.pyspark.runtime import AmaContext
+
+ama_context = AmaContext.builder().build()
+
+# "mygenericdataset" is a dataset without builtin support
+dataset_conf = ama_context.dataset_manager.get_dataset_configuration("mygenericdataset")
+
+some_prop = dataset_conf['mydataset_prop1']
+some_prop2 = dataset_conf['mydataset_prop2']
+
+# my_custom_loading_logic and black_magic stand in for your own code
+my_df = my_custom_loading_logic(some_prop, some_prop2)
+my_new_df = black_magic(my_df)
+ama_context.persist('magic_df', my_new_df)  # 'magic_df' is a dataset with builtin support
+```
+
diff --git a/docs/docs/index.md b/docs/docs/index.md
index 3d92cfb..d6d1692 100755
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -65,6 +65,7 @@
 | zk         | The ZooKeeper connection<br> string to be used by<br> amaterasu | The address of a zookeeper node  |
 | master     | The clusters' Mesos master | The address of the Mesos Master    |
 | user       | The user that will be used<br> to run amaterasu | root          |
+| pythonPath | The path to the Python3 executable | python3 (/usr/bin/python3) |
 
 #### Apache YARN
 
@@ -75,7 +76,7 @@
 | ---------- | ------------------------------ | -------------- |
 | Mode       | The cluster manager to be used | mesos          |
 | zk         | The ZooKeeper connection<br> string to be used by<br> amaterasu | The address of a zookeeper node  |
-
+| pythonPath | The path to the Python3 executable | python3 (/usr/bin/python3) |
 
 ## Running a Job