blob: 73061cd5a8b8a690991328187847feae98221f99 [file] [log] [blame]
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
////
:documentationPath: /plugins/transforms/
:language: en_US
:page-alternativeEditUrl: https://github.com/apache/incubator-hop/edit/master/plugins/transforms/fuzzymatch/src/main/doc/fuzzymatch.adoc
= Fuzzy match
== Description
The Fuzzy Match transform finds strings that potentially match using duplicate-detecting algorithms that calculate the similarity of two streams of data. This transform returns matching values as a separated list as specified by user-defined minimal or maximal values.
== Options
=== General tab
[width="90%", options="header"]
|===
|Option|Description
|transform name|Name of this transform as it appears in the pipeline workspace
|Lookup transform|Identifies the transform that contains the fields to match
|Lookup field|Identifies the field to match
|Main stream field|Identifies the primary stream to match the Lookup field with
|Algorithm|Identifies which string-matching algorithm to use---options include Levenshtein, Damerau-Levenshtein, Needleman Wunsch, Jaro, Jaro Winkler, Pair letters similarity, Metaphone, Double Metaphone, SoundEx, or Refined SoundEx
|Case sensitive|Identifies if streams can or cannot differ based on the use of uppercase and lowercase letters---only for use with the Levenshtein algorithms
|Get closer value|When checked, returns a single result with the highest similarity score---when unchecked, returns all matches that satisfy the minimal and maximal value setting as a separated list, separated by the values separator
|Minimum value|Identifies the lowest possible similarity score
|Maximal value|Identifies the highest possible similarity score
|Values separator|Identifies the string that separate the matches. Only available for specific algorithms and when the Get closer value option is unchecked.
|===
*Algorithm Definitions*
Within the Algorithm field, there are several options available to compare and match strings.
* Levenshtein and Damerau-Levenshtein---calculate the distance between two strings by looking at how many edit transforms are needed to get from one string to another. The former only looks at inserts, deletes, and replacements. The latter adds transposition. The score indicates the minimum number of changes needed. For instance, the difference between John and Jan would be two; to turn the name John into Jan you need one transform to replace the O with an A, and another transform to delete the H.
* Needleman Wunsch---calculates the similarity of two sequences and is mainly used in bioinformatics. The algorithm calculates a gap penalty. The aforementioned example would have a score of negative two.
* Jaro and Jaro Winkler---calculate a similarity index between two strings. The result is a fraction between zero, indicating no similarity, and one, indicating an identical match.
* Pair letters similarity---dissects the two strings in pairs and calculates the similarity of the two strings by dividing the number of common pairs by the sum of the pairs from both strings.
* Metaphone, Double Metaphone, SoundEx, and Refined SoundEx---are phonetic algorithms, which try to match strings based on how they would sound. Each is based on the English language and would not be useful to compare other languages.
** The Metaphone algorithm returns an encoded value based on the English pronunciation of a given word. The encoded value of the names John and Jan would return the value JN for both names.
** The Double Metaphone algorithm has fundamental design improvements over its predecessor and uses a more complex ruleset for coding. It can return a primary and a secondary encoded value for a string. The names John and Jan each return metaphone key values of JN and AN.
** The Soundex algorthim returns a single encoded value for a name that consists of a letter followed by three numerical digits. The letter is the first letter of the name, and the digits encode the remaining consonants.
** The Refined SoundEx algorithm is an improvement over its predecessor. Encoded values for this algorithm are six digits long, the initial character is encoded, and multiple possible encodings can be returned for a single name. Using this algorithm, the name John returns the values 160000 and 460000, as does the name Jan.
=== Fields tab
The Fields tab enables you to define how to return the results of a comparison.
[width="90%", options="header"]
|===
|Option|Description
|Match field|Defines the name of the column that contains the comparison value
|Value field|Defines the similarity score for which to return a value
|===
You can also specify the list of additional fields to retrieve from the lookup stream.