| --- |
| layout: global |
| title: "ML Model Security" |
| displayTitle: "Spark ML Model Security" |
| license: | |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
| # Overview |
| |
| In Apache Spark, loading a machine learning (ML) model is fundamentally equivalent to loading and executing code. |
| Spark ML models often contain serialized objects, transformation logic, and execution graphs that are evaluated by the Spark runtime |
| during model loading and inference. |
| The principle is not unique to Spark, it applies equally to scikit-learn, PyTorch, TensorFlow, and other modern ML ecosystems. |
| As a result, loading a model from an untrusted source introduces the same security risks as executing untrusted software. |
| |
| # Why Loading an ML Model Is Equivalent to Loading Code? |
| |
| Spark ML frameworks serialize not only data (such as weights and parameters) but also executable structures and behaviors. |
| Because of this, model loading is not merely data parsing. It involves interpreting and executing instructions, which means a malicious model can: |
| |
| * Execute arbitrary commands |
| * Access or exfiltrate data |
| * Modify system state |
| * Install backdoors or malware |
| |
| In practice, loading a model from an untrusted source is equivalent to running a program downloaded from the internet. |
| |
| # Security Implications |
| |
| Because Spark ML models can embed executable logic, loading untrusted models can lead to: |
| |
| * Remote code execution (RCE) |
| * Data exfiltration from Spark jobs |
| * Compromise of cluster nodes |
| * Privilege escalation within Spark environments |
| * Supply-chain attacks through model distribution |
| |
| These risks are amplified in distributed environments, where a malicious model may execute across multiple cluster nodes. |
| |
| # Responsibility of End Users |
| |
| Because loading ML models is equivalent to loading executable code, the responsibility for security ultimately lies with the end user or deploying organization. |
| End users are responsible for ensuring that ML models are subject to the same security assessment, validation, and operational controls as any third-party software. |
| This includes: |
| |
| * Verifying the source and authenticity of the model |
| * Ensuring integrity and provenance |
| * Applying organizational security policies |
| * Performing risk assessments before deployment |
| |
| Frameworks and libraries can provide safeguards, but they cannot guarantee security when loading arbitrary third-party models. |
| |
| # Best Practices |
| |
| * Load models only from trusted and verified sources |
| * Validate cryptographic hashes or digital signatures |
| * Execute models in isolated environments |
| * Restrict filesystem, network, and credential access |
| * Keep Spark, ML libraries, and dependencies fully patched |
| |