System Packages:
sudo yum update
sudo yum install gcc ruby
Python 3:
sudo yum install epel-release
sudo yum install -y https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install -y python35u python35u-libs python35u-devel python35u-pip
ln -s /usr/bin/python3.5 ~/.local/bin/python3
ln -s /usr/bin/pip3.5 ~/.local/bin/pip3
Add ~/.local/bin to the PATH.
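The symlinks above only take effect if ~/.local/bin comes before any other directory on the PATH that provides a python3. A minimal sketch of left-to-right PATH resolution (pure Python; the directory contents below are hypothetical, for illustration only):

```python
import os

def resolve_command(command, path, dir_contents):
    """Return the full path of the first directory on `path` whose contents
    include `command`, mimicking how the shell resolves executables."""
    for directory in path.split(os.pathsep):
        if command in dir_contents.get(directory, set()):
            return os.path.join(directory, command)
    return None

# Hypothetical PATH with ~/.local/bin first, as the setup above requires.
contents = {
    "/home/user/.local/bin": {"python3", "pip3"},
    "/usr/bin": {"python3.5", "pip3.5"},
}
path = os.pathsep.join(["/home/user/.local/bin", "/usr/bin"])
```

With this ordering, `python3` resolves to the symlink in ~/.local/bin rather than failing over to the versioned binaries in /usr/bin.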
OpenSlide:
sudo yum install openslide
Python packages:
pip3 install -U matplotlib numpy pandas scipy jupyter ipython scikit-learn scikit-image flask openslide-python
SystemML (driver only):
git clone https://github.com/apache/incubator-systemml.git
cd incubator-systemml
mvn clean package
pip3 install -e src/main/python
Add the following to the data folder (same location on all nodes):
- training_image_data: folder with the training slides.
- testing_image_data: folder with the testing slides.
- training_ground_truth.csv: file containing the tumor & molecular scores for each slide.
Layout:
- MachineLearning.ipynb
- Preprocessing.ipynb
- breastcancer/
  - preprocessing.py
  - visualization.py
- convnet.dml
- nn/
  - ...
- data/
  - training_ground_truth.csv
  - training_image_data
    - TUPAC-TR-001.svs
    - TUPAC-TR-002.svs
    - ...
  - testing_image_data
    - TUPAC-TE-001.svs
    - TUPAC-TE-002.svs
    - ...
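Before launching a job, it can be worth confirming that the data folder actually contains the three required entries. A small illustrative helper (the entry names come from the layout above; the function itself is not part of the project):

```python
import os

REQUIRED_ENTRIES = [
    "training_ground_truth.csv",  # per-slide tumor & molecular scores
    "training_image_data",        # folder of training slides
    "testing_image_data",         # folder of testing slides
]

def missing_data_entries(data_dir):
    """Return the required entries that are absent from `data_dir`."""
    return [entry for entry in REQUIRED_ENTRIES
            if not os.path.exists(os.path.join(data_dir, entry))]
```

Since the data folder must exist at the same location on all nodes, a check like this would need to run on each node (or via a Spark job) to be conclusive.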
Adjust the Spark settings in $SPARK_HOME/conf/spark-defaults.conf
using the following examples, depending on the job being executed:
All jobs:
# Use most of the driver memory.
spark.driver.memory 70g
# Remove the max result size constraint.
spark.driver.maxResultSize 0
# Increase the message size.
spark.akka.frameSize 128
# Extend the network timeout threshold.
spark.network.timeout 1000s
# Setup some extra Java options for performance.
spark.driver.extraJavaOptions -server -Xmn12G
spark.executor.extraJavaOptions -server -Xmn12G
# Setup local directories on separate disks for intermediate read/write performance, if running
# on Spark Standalone clusters.
spark.local.dirs /disk2/local,/disk3/local,/disk4/local,/disk5/local,/disk6/local,/disk7/local,/disk8/local,/disk9/local,/disk10/local,/disk11/local,/disk12/local
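Each line in spark-defaults.conf is a property name followed by whitespace and the value, with # starting a comment. A minimal parser sketch to illustrate the format (not part of Spark itself):

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf text into a dict.

    Non-empty, non-comment lines split on the first run of whitespace;
    the remainder of the line is taken verbatim as the value, so values
    like '-server -Xmn12G' survive intact."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1]
    return conf

EXAMPLE = """
# Use most of the driver memory.
spark.driver.memory 70g
spark.driver.maxResultSize 0
spark.driver.extraJavaOptions -server -Xmn12G
"""
```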
Preprocessing:
# Save 1/2 executor memory for Python processes.
spark.executor.memory 50g
Machine Learning:
# Use all executor memory for JVM.
spark.executor.memory 100g
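The two settings above encode a simple rule on (here) 100g executor nodes: preprocessing leaves half of each executor's memory to the Python worker processes, while the DML-based machine learning runs entirely in the JVM and can take all of it. A tiny illustrative helper expressing that rule (not part of the project):

```python
def executor_memory_gb(node_memory_gb, preprocessing):
    """Suggested spark.executor.memory in GB: half the node's memory for
    preprocessing (the rest goes to Python workers), all of it for the
    JVM-only machine learning job."""
    return node_memory_gb // 2 if preprocessing else node_memory_gb
```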
cd to this breast_cancer folder.
Start Jupyter + PySpark with the following command (could also use Yarn in client mode with --master yarn --deploy-mode client):
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://MASTER_URL:7077 --driver-class-path $SYSTEMML_HOME/target/SystemML.jar --jars $SYSTEMML_HOME/target/SystemML.jar
Visualization:
To view the slides, clone the openslide-python repo and run its example DeepZoom server over the data folder:
git clone https://github.com/openslide/openslide-python.git
python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 path/to/data/
To make the server reachable from other machines, bind it to a hostname or IP address with -l:
python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 -l HOSTING_URL_HERE path/to/data/
Then browse to HOSTING_URL_HERE:5000.
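The DeepZoom format served here represents each slide as a pyramid of levels that halve in each dimension down to 1x1, with every level cut into fixed-size tiles. A sketch of the level/tile arithmetic under the standard DeepZoom scheme (pure Python; tile size is a parameter, and the examples below are illustrative):

```python
import math

def deepzoom_levels(width, height):
    """Number of levels in a DeepZoom pyramid; level 0 is 1x1 and the
    top level is the full-resolution image."""
    return max(1, math.ceil(math.log2(max(width, height))) + 1)

def level_dimensions(width, height, level):
    """Pixel dimensions of the given level, halving per level from the top."""
    max_level = deepzoom_levels(width, height) - 1
    scale = 2 ** (max_level - level)
    return (max(1, math.ceil(width / scale)),
            max(1, math.ceil(height / scale)))

def tiles_at_level(width, height, level, tile_size=254):
    """Number of tiles needed to cover the given level."""
    w, h = level_dimensions(width, height, level)
    return math.ceil(w / tile_size) * math.ceil(h / tile_size)
```

Whole-slide images are often on the order of 100,000 pixels per side, which is why serving them as small tiles on demand is far more practical than loading a slide whole.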