Software for in situ data analytics



Ingest in-situ data (in JSON) into AWS S3 as Parquet object files.


Follow this guide to deploy SDAP In-Situ to the AWS cloud.


How to ingest an in-situ JSON file to Parquet

  • Assumption: K8s is successfully deployed

  • Download this repo

  • (optional) create a separate Python 3.6 environment

  • install dependencies

      python3 setup.py install
  • set up AWS tokens

      export AWS_ACCESS_KEY_ID=xxx
      export AWS_SECRET_ACCESS_KEY=xxx
      export AWS_SESSION_TOKEN=really.long.token
      export AWS_REGION=us-west-2
    • alternatively, set up the default profile in ~/.aws/credentials
  • add the current directory to PYTHONPATH

  • run the script:

      python3 -m parquet_cli.ingest_s3 --help
    • sample script:

        python3 -m parquet_cli.ingest_s3 \
          --LOG_LEVEL 30 \
          --CDMS_DOMAIN  \
          --CDMS_BEARER_TOKEN Mock-CDMS-Flask-Token  \
          --PARQUET_META_TBL_NAME cdms_parquet_meta_dev_v1  \
          --BUCKET_NAME cdms-dev-ncar-in-situ-stage  \
          --KEY_PREFIX cdms_icoads_2017-01-01.json
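
The setup steps above can also be checked from Python before launching the ingest. The following is a minimal sketch, not part of this repo: the helper names (`missing_aws_vars`, `with_repo_on_pythonpath`, `build_ingest_command`) are hypothetical, and it assumes credentials come from environment variables rather than ~/.aws/credentials.

```python
import os
import sys

# Variables exported in the setup steps. AWS_SESSION_TOKEN is only needed for
# temporary credentials, so this sketch treats it as optional (an assumption).
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


def with_repo_on_pythonpath(repo_dir, env=None):
    """Return a copy of the environment with repo_dir prepended to PYTHONPATH."""
    env = dict(os.environ if env is None else env)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = (
        repo_dir if not existing else os.pathsep.join([repo_dir, existing])
    )
    return env


def build_ingest_command(options):
    """Build the argument list for `python3 -m parquet_cli.ingest_s3`.

    `options` maps flag names as shown in the sample (e.g. "BUCKET_NAME")
    to their values; each becomes `--NAME value`.
    """
    cmd = [sys.executable, "-m", "parquet_cli.ingest_s3"]
    for flag, value in options.items():
        cmd += ["--" + flag, str(value)]
    return cmd
```

If `missing_aws_vars()` returns an empty list, the command from `build_ingest_command(...)` can be passed to `subprocess.run` together with `env=with_repo_on_pythonpath(os.getcwd())`.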

Useful Commands

  • to replace only part of a Parquet dataset (selected partitions)
> Finally! This is now a feature in Spark 2.3.0: SPARK-20236
> To use it, you need to set the spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite. Example:

      data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
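
The same partial overwrite can be sketched in PySpark. This is an illustration under assumptions, not code from this repo: the helper name `overwrite_partitions` is hypothetical, `spark` is expected to be an existing `SparkSession` (Spark 2.3.0 or later), and `df` a DataFrame that contains the partition columns.

```python
def overwrite_partitions(spark, df, path, partition_cols):
    """Overwrite only the partitions present in df; other partitions are kept."""
    # With the default ("static") mode, mode("overwrite") would first delete
    # every existing partition under `path`. "dynamic" limits the overwrite
    # to the partitions that appear in `df`.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    (
        df.write.mode("overwrite")
        .format("parquet")
        .partitionBy(*partition_cols)
        .save(path)
    )
```

For the example quoted above, the call would be `overwrite_partitions(spark, df, "s3://path/to/somewhere", ["date", "name"])`.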