docs/ml-security.md - spark - Git at Google

 ---
 layout: global
 title: "ML Model Security"
 displayTitle: "Spark ML Model Security"
 license: |
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
 ---

 # Overview

 In Apache Spark, loading a machine learning (ML) model is fundamentally equivalent to loading and executing code.
 Spark ML models often contain serialized objects, transformation logic, and execution graphs that are evaluated by the Spark runtime
 during model loading and inference.
 The principle is not unique to Spark, it applies equally to scikit-learn, PyTorch, TensorFlow, and other modern ML ecosystems.
 As a result, loading a model from an untrusted source introduces the same security risks as executing untrusted software.

 # Why Loading an ML Model Is Equivalent to Loading Code?

 Spark ML frameworks serialize not only data (such as weights and parameters) but also executable structures and behaviors.
 Because of this, model loading is not merely data parsing. It involves interpreting and executing instructions, which means a malicious model can:

 * Execute arbitrary commands
 * Access or exfiltrate data
 * Modify system state
 * Install backdoors or malware

 In practice, loading a model from an untrusted source is equivalent to running a program downloaded from the internet.

 # Security Implications

 Because Spark ML models can embed executable logic, loading untrusted models can lead to:

 * Remote code execution (RCE)
 * Data exfiltration from Spark jobs
 * Compromise of cluster nodes
 * Privilege escalation within Spark environments
 * Supply-chain attacks through model distribution

 These risks are amplified in distributed environments, where a malicious model may execute across multiple cluster nodes.

 # Responsibility of End Users

 Because loading ML models is equivalent to loading executable code, the responsibility for security ultimately lies with the end user or deploying organization.
 End users are responsible for ensuring that ML models are subject to the same security assessment, validation, and operational controls as any third-party software.
 This includes:

 * Verifying the source and authenticity of the model
 * Ensuring integrity and provenance
 * Applying organizational security policies
 * Performing risk assessments before deployment

 Frameworks and libraries can provide safeguards, but they cannot guarantee security when loading arbitrary third-party models.

 # Best Practices

 * Load models only from trusted and verified sources
 * Validate cryptographic hashes or digital signatures
 * Execute models in isolated environments
 * Restrict filesystem, network, and credential access
 * Keep Spark, ML libraries, and dependencies fully patched
	---
	layout: global
	title: "ML Model Security"
	displayTitle: "Spark ML Model Security"
	license: \|
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	---

	# Overview

	In Apache Spark, loading a machine learning (ML) model is fundamentally equivalent to loading and executing code.
	Spark ML models often contain serialized objects, transformation logic, and execution graphs that are evaluated by the Spark runtime
	during model loading and inference.
	The principle is not unique to Spark, it applies equally to scikit-learn, PyTorch, TensorFlow, and other modern ML ecosystems.
	As a result, loading a model from an untrusted source introduces the same security risks as executing untrusted software.

	# Why Loading an ML Model Is Equivalent to Loading Code?

	Spark ML frameworks serialize not only data (such as weights and parameters) but also executable structures and behaviors.
	Because of this, model loading is not merely data parsing. It involves interpreting and executing instructions, which means a malicious model can:

	* Execute arbitrary commands
	* Access or exfiltrate data
	* Modify system state
	* Install backdoors or malware

	In practice, loading a model from an untrusted source is equivalent to running a program downloaded from the internet.

	# Security Implications

	Because Spark ML models can embed executable logic, loading untrusted models can lead to:

	* Remote code execution (RCE)
	* Data exfiltration from Spark jobs
	* Compromise of cluster nodes
	* Privilege escalation within Spark environments
	* Supply-chain attacks through model distribution

	These risks are amplified in distributed environments, where a malicious model may execute across multiple cluster nodes.

	# Responsibility of End Users

	Because loading ML models is equivalent to loading executable code, the responsibility for security ultimately lies with the end user or deploying organization.
	End users are responsible for ensuring that ML models are subject to the same security assessment, validation, and operational controls as any third-party software.
	This includes:

	* Verifying the source and authenticity of the model
	* Ensuring integrity and provenance
	* Applying organizational security policies
	* Performing risk assessments before deployment

	Frameworks and libraries can provide safeguards, but they cannot guarantee security when loading arbitrary third-party models.

	# Best Practices

	* Load models only from trusted and verified sources
	* Validate cryptographic hashes or digital signatures
	* Execute models in isolated environments
	* Restrict filesystem, network, and credential access
	* Keep Spark, ML libraries, and dependencies fully patched