---
layout: home
title: Home
custom_title: Apache Spark™ - Unified Engine for large-scale data analytics
description: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
type: page
navigation:
  weight: 1
  show: true
---

{% highlight python %}
from pyspark.ml.regression import RandomForestRegressor

# df is assumed to be an existing DataFrame with "label" and "features" columns

# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)

# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)

# Fit the model to the training data
model = rf.fit(train_df)

# Generate predictions on the test dataset
model.transform(test_df).show()
{% endhighlight %}
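To gauge how well the model generalizes, the held-out predictions can be scored with an evaluator. A minimal sketch, continuing the snippet above and assuming the default "label" and "prediction" column names:

{% highlight python %}
from pyspark.ml.evaluation import RegressionEvaluator

# Score the held-out predictions with root-mean-squared error
predictions = model.transform(test_df)
evaluator = RegressionEvaluator(metricName="rmse")
print(evaluator.evaluate(predictions))
{% endhighlight %}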

{% highlight python %}
df = spark.read.csv("accounts.csv", header=True)

# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")

# Generate summary statistics
filtered_df.summary().show()
{% endhighlight %}

Run now

    $ SPARK_HOME/bin/spark-sql
    spark-sql>

{% highlight sql %}
SELECT
  name.first AS first_name,
  name.last AS last_name,
  age
FROM json.`logs.json`
WHERE age > 21;
{% endhighlight %}

Run now

    $ SPARK_HOME/bin/spark-shell
    scala>

{% highlight scala %}
val df = spark.read.json("logs.json")
df.where("age > 21")
  .select("name.first").show()
{% endhighlight %}

Run now

    $ SPARK_HOME/bin/spark-shell
    scala>

{% highlight java %}
Dataset<Row> df = spark.read().json("logs.json");
df.where("age > 21")
  .select("name.first").show();
{% endhighlight %}

Run now

    $ SPARK_HOME/bin/sparkR
    >

{% highlight r %}
df <- read.json(path = "logs.json")
df <- filter(df, df$age > 21)
head(select(df, df$name.first))
{% endhighlight %}
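The same logs.json query can also be expressed with the Python DataFrame API. A minimal sketch, assuming an active SparkSession named `spark` and that logs.json is reachable from the driver:

{% highlight python %}
# Read the JSON file, then apply the same filter and projection as above
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
{% endhighlight %}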