license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
This document aims to explain and demystify JDBC connection providers as they are used by Spark, because neither their usage nor the implementation of a custom provider is obvious.
Spark initially provided non-authenticated or user/password authenticated connections. This is quite insecure and must be avoided when possible. JDBC connection providers (CPs from now on) make the JDBC connections initiated by JDBC sources and can be a more secure alternative.
Spark provides two ways to deal with stronger authentication:

* Providing `keytab` and `principal` (but only if the JDBC driver supports keytab).
* The `org.apache.spark.sql.jdbc.JdbcConnectionProvider` developer API, which allows developers to implement any kind of database/use-case specific authentication method.

CPs are loaded independently with the service loader, so if one CP fails to load it has no effect on the other CPs.
There are cases where a built-in CP doesn't provide exactly the feature that is needed, so built-in CPs can be turned off and replaced with a custom implementation. All CPs must provide a name which must be unique. One can set the following configuration entry in `SparkConf` to turn off CPs: `spark.sql.sources.disabledJdbcConnProviderList=name1,name2`.
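For example, a disabled-provider list can be set when the session is created. This is only a sketch: the application name is a placeholder, and `basic` is the provider name used in this document (any other provider names can be added comma separated).

```scala
import org.apache.spark.sql.SparkSession

// Sketch: turn off the built-in provider named "basic" before the session is created,
// so a custom provider can take over those connections.
val spark = SparkSession.builder()
  .appName("CustomJdbcProviderApp") // hypothetical application name
  .config("spark.sql.sources.disabledJdbcConnProviderList", "basic")
  .getOrCreate()
```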
When more than one JDBC connection provider can handle a specific driver and options, it is possible to disambiguate and enforce a particular CP for the JDBC data source. One can set the DataFrame option `connectionProvider` to specify the name of the CP to use.
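A minimal sketch of pinning the provider by name, assuming a `SparkSession` named `spark`; the URL, table and provider name (`myToken`) are illustrative placeholders:

```scala
// Sketch: force a specific provider when more than one CP could respond to this driver/options.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbserver:5432/mydb")
  .option("dbtable", "schema.tablename")
  .option("connectionProvider", "myToken") // name returned by the chosen provider
  .load()
```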
When a Spark source initiates a JDBC connection it looks for a CP which supports the included driver; the user just needs to provide the `keytab` location and the `principal`. The `keytab` file must exist on every node where a connection is initiated.
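A sketch of a secure read using these options, again assuming a `SparkSession` named `spark`; the URL, keytab path and principal are placeholders:

```scala
// Sketch: secure JDBC read. The keytab file referenced here has to exist on every node
// where a connection may be initiated.
val secureDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbserver:5432/mydb")
  .option("dbtable", "schema.tablename")
  .option("keytab", "/etc/security/keytabs/dbuser.keytab")
  .option("principal", "dbuser@EXAMPLE.COM")
  .load()
```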
CPs have a mandatory API which must be implemented:

```scala
def canHandle(driver: Driver, options: Map[String, String]): Boolean
```

If this function returns `true` then Spark considers that the CP can handle the connection setup. The built-in CPs return `true` in the following cases:
* If the connection is not secure (in other words, no `keytab` or `principal` is provided) then the `basic` named CP responds.
* If the connection is secure (in other words, `keytab` and `principal` are provided) then the database specific CP responds. Database specific providers check the JDBC driver class name and the decision is made based on that. For example, `PostgresConnectionProvider` responds only when the driver class name is `org.postgresql.Driver`.

It is important to mention that exactly one CP must return `true` from `canHandle` for a particular connection request, because otherwise Spark can't decide which CP should be used to make the connection. In such cases an exception is thrown and the data processing stops.
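As an illustration of this rule, a database specific provider could restrict itself to secure requests for a single driver class so that it never overlaps with the `basic` CP. This is only a sketch of the idea, not the built-in implementation; the option names follow the `keytab`/`principal` options mentioned earlier.

```scala
import java.sql.Driver

// Sketch: respond only to secure requests made through the PostgreSQL driver,
// so that exactly one provider answers a given request.
def canHandle(driver: Driver, options: Map[String, String]): Boolean = {
  val secure = options.contains("keytab") && options.contains("principal")
  secure && driver.getClass.getName == "org.postgresql.Driver"
}
```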
Spark provides an example CP in the examples project (which does nothing). There are basically 2 files:

* `examples/src/main/scala/org/apache/spark/examples/sql/jdbc/ExampleJdbcConnectionProvider.scala`, which contains the main logic that can be further developed.
* `examples/src/main/resources/META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider`, a manifest file used by the service loader (this tells Spark that this CP needs to be loaded).

Implementation considerations:
* If the `modifiesSecurityContext` method returns `true` then the CP's `getConnection` method will be called synchronized on `org.apache.spark.security.SecurityConfigurationLock` to avoid races.
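Putting the pieces together, a custom provider might look like the sketch below. It is only modelled on the example provider mentioned above; the package, class name and the `myTokenFile` option are hypothetical, and real authentication logic would go into `getConnection`.

```scala
package org.example.jdbc

import java.sql.{Connection, Driver}
import java.util.Properties

import org.apache.spark.sql.jdbc.JdbcConnectionProvider

class MyTokenConnectionProvider extends JdbcConnectionProvider {
  // Unique provider name, usable in the connectionProvider option and in
  // spark.sql.sources.disabledJdbcConnProviderList.
  override val name: String = "myToken"

  // Respond only to requests carrying the hypothetical custom option,
  // so exactly one CP answers a given request.
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.contains("myTokenFile")

  // Build the authenticated connection. When modifiesSecurityContext returns true,
  // Spark calls this while holding SecurityConfigurationLock, so JVM-global security
  // settings (e.g. JAAS configuration) could be changed here safely.
  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val props = new Properties()
    // ... read credentials from options("myTokenFile") and set driver properties here ...
    driver.connect(options("url"), props)
  }

  // Return true only if getConnection touches JVM-global security state.
  override def modifiesSecurityContext(driver: Driver, options: Map[String, String]): Boolean =
    false
}
```

To have the service loader pick this class up, its fully qualified name must be listed in a `META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider` file on the classpath, just like the example provider above does.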