In an effort to automate the classification process, this project aims to develop a large-scale deep learning approach for predicting tumor scores directly from the pixels of whole-slide histopathology images. Our proposed approach is based on a recent research paper from Stanford [1]. Starting with 500 extremely high-resolution tumor slide images with accompanying score labels, we use Apache Spark in a preprocessing step to cut and filter the images into smaller square samples, generating 4.7 million samples for a total of ~7TB of data [2]. We then utilize Apache SystemML on top of Spark to develop and train a custom, large-scale, deep convolutional neural network on these samples, making use of SystemML's familiar linear algebra syntax and automatically distributed execution [3]. Our model takes as input the pixel values of the individual samples and is trained to predict the correct tumor score classification for each one. In addition to distributed linear algebra, we exploit task parallelism via parallel for-loops for hyperparameter optimization, as well as hardware acceleration via a GPU-backed runtime for faster training. Ultimately, we aim to develop a model that clearly outperforms existing approaches to breast cancer tumor proliferation score classification.
References:
[1] https://web.stanford.edu/group/rubinlab/pubs/2243353.pdf
[2] See Preprocessing.ipynb.
[3] See MachineLearning.ipynb, softmax_clf.dml, and convnet.dml.
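The preprocessing idea above — cutting each slide into smaller square samples and filtering out low-content ones — can be illustrated with a toy stand-in. The `tile_image` and `filter_tiles` functions below are hypothetical and operate on nested lists of grayscale values rather than real SVS slides; the actual pipeline (see Preprocessing.ipynb) uses OpenSlide and Spark.

```python
def tile_image(pixels, tile_size):
    """Cut a 2D grid of pixel values into non-overlapping square tiles.

    `pixels` is a list of rows; incomplete edge tiles are discarded,
    mirroring the idea of keeping only full-size square samples.
    """
    height, width = len(pixels), len(pixels[0])
    tiles = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tile = [row[left:left + tile_size]
                    for row in pixels[top:top + tile_size]]
            tiles.append(tile)
    return tiles


def filter_tiles(tiles, threshold):
    """Keep only tiles whose mean intensity is below `threshold` --
    a crude stand-in for discarding mostly-background (bright) samples."""
    def mean(tile):
        flat = [v for row in tile for v in row]
        return sum(flat) / len(flat)
    return [t for t in tiles if mean(t) < threshold]
```

For example, an 8x8 image whose left half is tissue (dark) and right half background (bright) yields four 4x4 tiles, of which only the two dark ones survive filtering.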
System Packages:

```
sudo yum update
sudo yum install gcc ruby
```
Python 3:

```
sudo yum install epel-release
sudo yum install -y https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install -y python35u python35u-libs python35u-devel python35u-pip
ln -s /usr/bin/python3.5 ~/.local/bin/python3
ln -s /usr/bin/pip3.5 ~/.local/bin/pip3
```

Add `~/.local/bin` to the `PATH`.
OpenSlide:

```
sudo yum install openslide
```
Python packages:

```
pip3 install -U matplotlib numpy pandas scipy jupyter ipython scikit-learn scikit-image flask openslide-python
```
SystemML (driver only):

```
git clone https://github.com/apache/incubator-systemml.git
cd incubator-systemml
mvn clean package
pip3 install -e src/main/python
```
Create a `data` folder with the following contents (same location on all nodes):
- `training_image_data` folder with the training slides.
- `testing_image_data` folder with the testing slides.
- `training_ground_truth.csv` file containing the tumor & molecular scores for each slide.

Create a project folder (e.g. `breast_cancer`) with the following contents (driver only):
- All notebooks (`*.ipynb`).
- All DML scripts (`*.dml`).
- SystemML-NN installed as an `nn` folder containing the contents of `$SYSTEMML_HOME/scripts/staging/SystemML-NN/nn` (either copy & paste, or use a softlink).
- The `data` folder (or a softlink pointing to it).
Layout:

```
- MachineLearning.ipynb
- Preprocessing.ipynb
- ...
- data/
  - training_ground_truth.csv
  - training_image_data
    - TUPAC-TR-001.svs
    - TUPAC-TR-002.svs
    - ...
  - testing_image_data
    - TUPAC-TE-001.svs
    - TUPAC-TE-002.svs
    - ...
```
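As a convenience, the expected data layout can be verified before launching a job. The `missing_paths` helper below is a hypothetical addition, not part of the project; the path names come from the layout above, and the slide folders are checked for existence only, since the number of `.svs` files varies.

```python
import os

# Paths from the layout above, relative to the project folder.
EXPECTED = [
    "data/training_ground_truth.csv",
    "data/training_image_data",
    "data/testing_image_data",
]


def missing_paths(project_root):
    """Return the expected data paths that are absent under `project_root`."""
    return [p for p in EXPECTED
            if not os.path.exists(os.path.join(project_root, p))]
```

Running it against the project folder should return an empty list when the layout is complete.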
Adjust the Spark settings in `$SPARK_HOME/conf/spark-defaults.conf` using the following examples, depending on the job being executed:

All jobs:

```
# Use most of the driver memory.
spark.driver.memory 70g
# Remove the max result size constraint.
spark.driver.maxResultSize 0
# Increase the message size.
spark.akka.frameSize 128
# Extend the network timeout threshold.
spark.network.timeout 1000s
# Setup some extra Java options for performance.
spark.driver.extraJavaOptions -server -Xmn12G
spark.executor.extraJavaOptions -server -Xmn12G
# Setup local directories on separate disks for intermediate read/write performance, if running
# on Spark Standalone clusters.
spark.local.dirs /disk2/local,/disk3/local,/disk4/local,/disk5/local,/disk6/local,/disk7/local,/disk8/local,/disk9/local,/disk10/local,/disk11/local,/disk12/local
```
Preprocessing:

```
# Save 1/2 executor memory for Python processes.
spark.executor.memory 50g
```
Machine Learning:

```
# Use all executor memory for the JVM.
spark.executor.memory 100g
```
Start Jupyter + PySpark with the following command (could also use YARN in client mode with `--master yarn --deploy-mode client`):

```
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://MASTER_URL:7077 --driver-class-path $SYSTEMML_HOME/target/SystemML.jar --jars $SYSTEMML_HOME/target/SystemML.jar
```
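Within the notebooks, hyperparameter optimization uses the task parallelism mentioned in the overview (SystemML's parallel for-loops over hyperparameter settings). A pure-Python stand-in sketches the idea; the `evaluate` loss surface below is an invented toy, not the project's actual training loop, which lives in the DML scripts.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Toy stand-in for training a model and returning its validation loss."""
    lr, reg = params
    return (lr - 0.01) ** 2 + (reg - 0.001) ** 2  # invented loss surface


def grid_search(learning_rates, regs):
    """Evaluate every hyperparameter combination in parallel and return the
    lowest-loss setting, mimicking a parallel for-loop over the grid."""
    grid = list(product(learning_rates, regs))
    with ThreadPoolExecutor() as pool:
        losses = list(pool.map(evaluate, grid))
    return min(zip(losses, grid))[1]
```

In the real pipeline, each grid point would train a network on a cluster rather than evaluate a closed-form function, but the independent-tasks structure is the same.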
To view the slides, clone the OpenSlide Python examples and start the included Deep Zoom server over the data folder:

```
git clone https://github.com/openslide/openslide-python.git
python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 path/to/data/
```

To serve on a specific address:

```
python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 -l HOSTING_URL_HERE path/to/data/
```

Then navigate to `HOSTING_URL_HERE:5000`.