layout: global title: TRANSFORM displayTitle: TRANSFORM license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
The TRANSFORM
clause is used to specify a Hive-style transform query specification to transform the inputs by running a user-specified command or script.
Spark's script transform supports two modes:
spark.sql.catalogImplementation=true
or SparkSession.builder.enableHiveSupport()
. In this case, now Spark only uses the script transform with ROW FORMAT DELIMITED
and treats all values passed to the script as strings.spark.sql.catalogImplementation=true
or Spark SQL is started with SparkSession.builder.enableHiveSupport()
, Spark can use the script transform with both Hive SerDe and ROW FORMAT DELIMITED
.SELECT TRANSFORM ( expression [ , ... ] ) [ ROW FORMAT row_format ] [ RECORDWRITER record_writer_class ] USING command_or_script [ AS ( [ col_name [ col_type ] ] [ , ... ] ) ] [ ROW FORMAT row_format ] [ RECORDREADER record_reader_class ]
expression
Specifies a combination of one or more values, operators and SQL functions that results in a value.
row_format
Specifies the row format for input and output. See HIVE FORMAT for more syntax details.
RECORDWRITER
Specifies a fully-qualified class name of a custom RecordWriter. The default value is org.apache.hadoop.hive.ql.exec.TextRecordWriter
.
RECORDREADER
Specifies a fully-qualified class name of a custom RecordReader. The default value is org.apache.hadoop.hive.ql.exec.TextRecordReader
.
command_or_script
Specifies a command or a path to script to process data.
When Spark uses ROW FORMAT DELIMITED
format:
\u0001
as the default field delimiter and this delimiter can be overridden by FIELDS TERMINATED BY
.\n
as the default line delimiter and this delimiter can be overridden by LINES TERMINATED BY
.\N
as the default NULL
value in order to differentiate NULL
values from the literal string NULL
. This delimiter can be overridden by NULL DEFINED AS
.STRING
and combines columns by tabs before feeding to the user script. For complex types such as ARRAY
/MAP
/STRUCT
, Spark uses to_json
casts it to an input JSON
string and uses from_json
to convert the result output JSON
string to ARRAY
/MAP
/STRUCT
data.COLLECTION ITEMS TERMINATED BY
and MAP KEYS TERMINATED BY
are delimiters to split complex data such as ARRAY
/MAP
/STRUCT
, Spark uses to_json
and from_json
to handle complex data types with JSON
format. So COLLECTION ITEMS TERMINATED BY
and MAP KEYS TERMINATED BY
won't work in default row format.STRING
columns. Any cell containing only a string \N
is re-interpreted as a literal NULL
value, and then the resulting STRING
column will be cast to the data types specified in col_type
.NULL
. For example:output tabs: 1, 2 output columns: A: INT, B INT, C: INT result: +---+---+------+ | a| b| c| +---+---+------+ | 1| 2| NULL| +---+---+------+
output tabs: 1, 2, 3 output columns: A: INT, B INT result: +---+---+ | a| b| +---+---+ | 1| 2| +---+---+
AS
clause after USING my_script
, the output schema is key: STRING, value: STRING
. The key
column contains all the characters before the first tab and the value
column contains the remaining characters after the first tab. If there are no tabs, Spark returns the NULL
value. For example:output tabs: 1, 2, 3 output columns: result: +-----+-------+ | key| value| +-----+-------+ | 1| 2| +-----+-------+ output tabs: 1, 2 output columns: result: +-----+-------+ | key| value| +-----+-------+ | 1| NULL| +-----+-------+
When Hive support is enabled and Hive SerDe mode is used:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
by default, so columns are cast to STRING
and combined by tabs before feeding to the user script.NULL
values are converted to a string \N
in order to differentiate literal NULL
values from the literal string NULL
.STRING
columns, any cell containing only a string \N
is re-interpreted as a NULL
value, and then the resulting STRING column will be cast to the data type specified in col_type
.NULL
.AS
clause after USING my_script
, the output schema is key: STRING, value: STRING
. The key
column contains all the characters before the first tab and the value
column contains the remaining characters after the first tab. If there is no tab, Spark returns the NULL
value.ROW FORMAT SERDE
or ROW FORMAT DELIMITED
.CREATE TABLE person (zip_code INT, name STRING, age INT); INSERT INTO person VALUES (94588, 'Zen Hui', 50), (94588, 'Dan Li', 18), (94588, 'Anil K', 27), (94588, 'John V', NULL), (94511, 'David K', 42), (94511, 'Aryan B.', 18), (94511, 'Lalit B.', NULL); -- With specified output without data type SELECT TRANSFORM(zip_code, name, age) USING 'cat' AS (a, b, c) FROM person WHERE zip_code > 94511; +-------+---------+-----+ | a | b| c| +-------+---------+-----+ | 94588| Anil K| 27| | 94588| John V| NULL| | 94588| Zen Hui| 50| | 94588| Dan Li| 18| +-------+---------+-----+ -- With specified output with data type SELECT TRANSFORM(zip_code, name, age) USING 'cat' AS (a STRING, b STRING, c STRING) FROM person WHERE zip_code > 94511; +-------+---------+-----+ | a | b| c| +-------+---------+-----+ | 94588| Anil K| 27| | 94588| John V| NULL| | 94588| Zen Hui| 50| | 94588| Dan Li| 18| +-------+---------+-----+ -- Using ROW FORMAT DELIMITED SELECT TRANSFORM(name, age) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' USING 'cat' AS (name_age string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '@' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM person; +---------------+ | name_age| +---------------+ | Anil K,27| | John V,null| | ryan B.,18| | David K,42| | Zen Hui,50| | Dan Li,18| | Lalit B.,null| +---------------+ -- Using Hive Serde SELECT TRANSFORM(zip_code, name, age) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'field.delim' = '\t' ) USING 'cat' AS (a STRING, b STRING, c STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'field.delim' = '\t' ) FROM person WHERE zip_code > 94511; +-------+---------+-----+ | a | b| c| +-------+---------+-----+ | 94588| Anil K| 27| | 94588| John V| NULL| | 94588| Zen Hui| 50| | 94588| Dan Li| 18| +-------+---------+-----+ -- Schema-less mode SELECT TRANSFORM(zip_code, name, age) USING 'cat' FROM person WHERE zip_code > 94500; +-------+---------------------+ | key| value| +-------+---------------------+ | 94588| Anil K 27| | 94588| John V \N| | 94511| Aryan B. 18| | 94511| David K 42| | 94588| Zen Hui 50| | 94588| Dan Li 18| | 94511| Lalit B. \N| +-------+---------------------+