| //// |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| http://www.apache.org/licenses/LICENSE-2.0 |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| //// |
| :documentationPath: /pipeline/transforms/ |
| :language: en_US |
| :description: The Microsoft Excel Input transform reads data from Microsoft Excel spreadsheet. The default spreadsheet type (engine) is set to Excel XLSX, XLS. |
| |
| = image:transforms/icons/excelinput.svg[Microsoft Excel input transform Icon, role="image-doc-icon"] Microsoft Excel input |
| |
| [%noheader,cols="3a,1a", role="table-no-borders" ] |
| |=== |
| | |
| == Description |
| |
| The Microsoft Excel Input transform reads data from Microsoft Excel spreadsheet. |
| |
| The default spreadsheet type (engine) is set to Excel XLSX, XLS. |
| |
| When you read other file types like OpenOffice ODS and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Content tab accordingly. |
| |
| | |
| == Supported Engines |
| [%noheader,cols="2,1a",frame=none, role="table-supported-engines"] |
| !=== |
| !Hop Engine! image:check_mark.svg[Supported, 24] |
| !Spark! image:check_mark.svg[Supported, 24] |
| !Flink! image:check_mark.svg[Supported, 24] |
| !Dataflow! image:check_mark.svg[Supported, 24] |
| !=== |
| |=== |
| |
| == Options |
| |
| === Files Tab |
| |
| [options="header"] |
| |=== |
| |Option|Description |
| |transform Name|Name of the transform; the name has to be unique in a single transform. |
| |Spread sheet type (engine) a|This field allows you to specify the spreadsheet type. |
| Currently the following are supported: |
| |
| * Excel XLSX, XLS: This is the default, if you select this spread sheet type you can read all known Excel file types. |
| Functionality provided by the Apache POI project. |
| * Excel XLSX (Streaming): This spread sheet type allows to read in large Excel files. |
| * Open Office ODS: By selecting this type you can read OpenOffice spreadsheet using the ODFDOM engine. |
| |File or directory|Specifies the location and/or name of the input text file. |
| Note: Click Add to add the file/directory/wildcard combination to the list of selected files (grid) below. |
| |Regular expression|Specify the regular expression you want to use to select the files in the directory specified in the previous option. |
| For example, you want to process all files that have a .txt extension. |
| (For further details see the Text File Input transform and the section "Selecting file using Regular Expressions") |
| |Exclude Regular Expression|Excludes all files (in a given location) that meet the criteria specified by this regular expression. |
| |Selected Files|Contains a list of selected files (or wildcard selections) and a property specifying if file is required or not. |
| If a file is required and it is not found, an error is generated;otherwise, the file name is skipped. |
| |Accept filenames from previous transforms|Allows you to read in file names from a previous transform in the transform. |
| You must also specify which transform you are importing from, and the input field in that transform from which you will retrieve the filename data |
| |Show filenames(s)...|Displays a list of all files that will be loaded based on the current selected file definitions |
| |Preview rows|Click Preview to examine the contents of the specified Excel file |
| |=== |
| |
| === Sheets |
| |
| In this tab you can specify the names of the sheets in the Excel workbook to read. |
| For each of the sheet names you can specify the row and column to start at. |
| |
| === Content |
| |
| |=== |
| |Option|Description |
| |Header|Enable if the sheets specified contain a header row to skip |
| |No empty rows|Enable if you don't want empty rows in the output of this transform |
| |Stop on empty row|Makes the transform stop reading the current sheet of a file when a empty line is encountered |
| |Limit|Limits the number of rows to this number (zero (0) means all rows). |
| |Encoding|Specifies the text file encoding to use. |
| Leave blank to use the default encoding on your system. |
| To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.) |
| |=== |
| |
| === Error handling |
| |
| |=== |
| |Option|Description |
| |Strict types?|If checked, Hop will report data type errors in the input. |
| |Ignore errors?|Enable if you want to ignore errors during parsing |
| |Skip error lines?|If checked, Hop will skip lines that contain errors. |
| These lines can be dumped to a separate file by specifying a path in the Failing line numbers files directory field below. |
| If this is not checked, lines with errors will appear as NULL values in the output. |
| |Warnings file directory|When warnings are generated, they are placed in this directory. |
| The name of that file is <warning dir>/filename.<date_time>.<warning extension> |
| |Error files directory|When errors occur, they are placed in this directory. |
| The name of that file is <errorfile_dir>/filename.<date_time>.<errorfile_extension> |
| |Failing line numbers files directory|When a parsing error occurs on a line, the line number is placed in this directory. |
| The name of that file is <errorline dir>/filename.<date_time>.<errorline extension> |
| |=== |
| |
| === Fields tab |
| |
| The fields tab is for specifying the fields that must be read from the Excel files. |
| Use Get fields from header row to fill in the available fields if the sheets have a header row automatically. |
| |
| The Type column performs type conversions for a given field. |
| For example, if you want to read a date and you have a String value in the Excel file, specify the conversion mask. |
| Note: In the case of Number to Date conversion (for example, 20051028--> October 28th, 2005) specify the conversion mask yyyyMMdd because there will be an implicit Number to String conversion taking place before doing the String to Date conversion. |
| |
| |=== |
| |Option|Description |
| |Name|The name of the field. |
| |Type|The field's data type; String, Date or Number. |
| |Length|The length option depends on the field type. |
| Number: total number of significant figures in a number; String: total length of a string; Date: determines how much of the date string is printed or recorded. |
| |Precision|The precision option depends on the field type, but only Number is supported; it returns the number of floating point digits. |
| |Trim type|Truncates the field (left, right, both) before processing. |
| Useful for fields that have no static length. |
| |Repeat|If set to Y, will repeat this value if the field in the next row is empty. |
| |Format|The format mask (number type). |
| See Text File Input transform and the section "Number Formats" for a complete description of format symbols. |
| |Currency|Symbol used to represent currencies. |
| |Decimal|A decimal point; this is either a dot or a comma. |
| |Grouping|A method of separating units of thousands in numbers of four digits or larger. |
| This is either a dot or a comma. |
| |=== |
| |
| === Additional output fields tab |
| |
| This tab retrieves custom metadata fields to add to the transform's output. |
| The purpose of each field is defined in its name, but you can use these fields for whatever you want. |
| Each item defines an output field that will contain the following information. |
| Some of these are missing. |
| |
| |=== |
| |Option|Description |
| |Full filename field|The full file name plus the extension. |
| |Sheetname field|The worksheet name you're using. |
| |Sheet row nr field|The current sheet row number. |
| |Row nr written field|Number of rows written |
| |Short filename field|The field name that contains the filename without path information but with an extension. |
| |Extension field|The field name that contains the extension of the filename. |
| |Path field|The field name that contains the path in operating system format. |
| |Size field|The field name that contains the size of the file, in bytes. |
| |Is hidden field|The field name that contains if the file is hidden or not (boolean). |
| |Uri field|The field name that contains the URI. |
| |Root uri field|The field name that contains only the root part of the URI. |
| |=== |
| |