Merge pull request #4 from wphyojpl/master

breaking: multiple new features and performance enhancement attempts
tree: df1924731d09500bb4bcb12234a565b72d6a8c3a
  1. docker/
  2. documentations/
  3. k8s_spark/
  4. local.spark.cluster/
  5. parquet_flask/
  6. terraform/
  7. tests/
  8. .gitignore
  9. CONTRIBUTING.md
  10. Deployment-in-AWS.md
  11. in_situ_record_schema.json
  12. in_situ_schema.json
  13. LICENSE
  14. README.md
  15. s3a.parquet.performance.issue.md
  16. setup.py
README.md

parquet_test_1

Ref:

  • How to partially replace a Parquet file (overwrite specific partitions):
https://stackoverflow.com/questions/38487667/overwrite-specific-partitions-in-spark-dataframe-write-method?noredirect=1&lq=1
> Finally! This is now a feature in Spark 2.3.0: SPARK-20236
> To use it, set spark.sql.sources.partitionOverwriteMode to dynamic, partition the dataset, and use write mode overwrite. Example:
> https://stackoverflow.com/questions/50006526/overwrite-only-some-partitions-in-a-partitioned-spark-dataset

# Spark 2.3.0+: only the partitions present in `data` are replaced;
# other partitions under the save path are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
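
To make the difference between the default static mode and dynamic mode concrete, here is a plain-Python sketch that simulates the overwrite semantics without needing a Spark cluster. The `overwrite` function, the `store` dict (standing in for partition directories under the save path), and the sample keys are all illustrative, not Spark APIs:

```python
# Illustrative simulation of Spark's partitionOverwriteMode semantics.
# "store" maps a partition key (e.g. (date, name)) to that partition's rows,
# playing the role of partition directories under the save path.

def overwrite(store, new_partitions, mode):
    """Write new_partitions (partition-key -> rows) into store.

    mode="static"  (Spark's default): the whole path is wiped before the write.
    mode="dynamic": only partitions present in new_partitions are replaced.
    """
    if mode == "static":
        store.clear()                  # every existing partition is dropped
    else:
        for key in new_partitions:     # drop only the partitions being rewritten
            store.pop(key, None)
    store.update(new_partitions)
    return store

# Two date partitions already "on disk".
existing = {("2021-01-01", "a"): ["row1"], ("2021-01-02", "a"): ["row2"]}

# Re-write only the 2021-01-02 partition.
update = {("2021-01-02", "a"): ["row2-fixed"]}

static_result = overwrite(dict(existing), update, mode="static")
dynamic_result = overwrite(dict(existing), update, mode="dynamic")

print(static_result)   # only the rewritten partition survives
print(dynamic_result)  # the 2021-01-01 partition is preserved
```

In static mode the 2021-01-01 partition is lost even though the new write never touched it; dynamic mode keeps it, which is the behavior the SPARK-20236 setting above enables.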