Reworded to apply Yaniv's comments
diff --git a/docs/docs/config.md b/docs/docs/config.md
index 17d3ec6..6fed6e7 100644
@@ -81,9 +81,9 @@
-One aspect of maintaining different deployment environments is where and how you get the data required to run the jobs.
+One of the main concerns when running a data pipeline is where and how you get the data required to run the jobs.
-To provide an abstraction, each of our SDKs provides a way to load and persist data easily. This functionality is based on prior configuration.
+Amaterasu provides a mechanism for managing and configuring such data sources. Each of our SDKs provides a way to load and persist data easily. This functionality is based on prior configuration.
In a job repository, each environment contains a ```datasets.yml``` file. This file contains the configurations of all datasets to be used in the job.
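For illustration, a ```datasets.yml``` might group dataset entries like this. The key names here are an assumption for illustration only; the canonical schema is covered in the detailed configuration section below:

```yaml
# Illustrative sketch only -- key names are assumptions, not the canonical schema.
file:
  - name: mydataset            # the name used to reference the dataset from code
    uri: /data/input/mydataset
    format: parquet
```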
@@ -97,10 +97,10 @@
#### Detailed configuration
-Below are the different types of datasets and their corresponding configuration options.
+Amaterasu supports several types of datasets; their corresponding configuration options are described below.
Do note that different Apache Amaterasu frameworks may have their own take on the configurations below.
-The following formats are currently supported - JSON, parquet, CSV, ORC.
+Currently, the following formats are supported: JSON, Parquet, CSV, ORC.
The following storage types are currently supported: S3, Azure Blob Storage, file system, HDFS.
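As an example, an S3-backed Parquet dataset entry might combine a format with a storage type like this. Again, the exact key names are an assumption for illustration:

```yaml
# Illustrative sketch only: an S3-backed Parquet dataset entry.
s3:
  - name: mydataset
    bucket: my-data-bucket     # assumed key names
    path: input/mydataset
    format: parquet
```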
diff --git a/docs/docs/frameworks.md b/docs/docs/frameworks.md
index 209c0c7..ffb48c0 100644
@@ -42,7 +42,7 @@
# Amaterasu Frameworks
-Apache Amaterasu supports the following types of Python workloads:
+Amaterasu supports a variety of Python frameworks:
1. PySpark workload ([See below](#pyspark))
@@ -75,7 +75,7 @@
-| +-- requirements.txt <-- This is the place for defining dependencies
+| +-- requirements.txt <-- This is the place where you define Python dependencies
| +-- dev/
| | +-- job.yaml
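For instance, a minimal ```requirements.txt``` pinning a couple of dependencies (the packages shown are hypothetical examples, not requirements of Amaterasu itself):

```
pandas==1.3.5
requests>=2.25,<3
```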
@@ -114,19 +114,18 @@
## Apache Spark
### Spark Configuration
+Apache Amaterasu can deploy Spark applications and provides configuration and integration
+utilities that work in the realm of dataframes and RDDs.
-Apache Amaterasu has the capability of deploying PySpark applications and provide configuration and integration
-utilities that work in the realm of dataframes and RDDs.
Assuming that Apache Amaterasu has been [configured](./config.md) correctly for your cluster and all required
spark [configurations](#spark-configuration) are in place, all you need to do is integrate with the Apache Amaterasu
PySpark SDK from within your source scripts.
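Conceptually, the SDK resolves a dataset name against the configuration of the active environment. The following is a plain-Python sketch of that lookup only; it is not the Amaterasu SDK API, and every name, URI, and key in it is illustrative:

```python
# Illustrative sketch only: resolving a dataset name against
# per-environment configuration. This is NOT the Amaterasu SDK API.

DATASETS = {
    "dev": {
        "mydataset": {"format": "csv", "uri": "file:///tmp/dev/mydataset"},
        "resultdataset": {"format": "parquet", "uri": "file:///tmp/dev/result"},
    },
    "production": {
        "mydataset": {"format": "parquet", "uri": "s3://prod-bucket/mydataset"},
        "resultdataset": {"format": "parquet", "uri": "s3://prod-bucket/result"},
    },
}

def resolve(env: str, name: str) -> dict:
    """Return the configuration for dataset `name` in environment `env`."""
    return DATASETS[env][name]

print(resolve("dev", "mydataset")["format"])       # csv
print(resolve("production", "mydataset")["uri"])   # s3://prod-bucket/mydataset
```

The same application code can then ask for ```mydataset``` in every environment, while the environment's configuration decides where the data actually lives.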
#### Integration with supported data sources
-The starting point of this, is to write a [dataset configuration](./config.md#datasets).
+The Spark SDK provides a mechanism to seamlessly consume and persist datasets. To do so, you must first define a [dataset configuration](./config.md#datasets).
Let's assume that you have 2 datasets defined: ```mydataset```, the input dataset, and ```resultdataset```, the output dataset.
Let's also assume that you have 2 different environments, production and development.
Let's take a sneak peek at an example of how ```mydataset``` is defined in each environment: