Phonetic matching algorithms may be used to encode tokens so that two different spellings that are pronounced similarly will match.
For overviews of and comparisons between algorithms, see http://en.wikipedia.org/wiki/Phonetic_algorithm and http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html
Beider-Morse Phonetic Matching (BMPM)
For examples of how to use this encoding in your analyzer, see Beider Morse Filter in the Filter Descriptions section.
Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a phonetic matching system. BMPM helps you search for personal names (or just surnames) in a Solr/Lucene index, and is designed to return fewer irrelevant matches than older phonetic codecs such as Soundex, Metaphone, and Caverphone.
In general, phonetic matching lets you search a name list for names that are phonetically equivalent to the desired name. BMPM is similar to a soundex search in that an exact spelling is not required. Unlike soundex, it does not generate a large quantity of false hits.
From the spelling of the name, BMPM attempts to determine the language. It then applies phonetic rules for that particular language to transliterate the name into a phonetic alphabet. If it is not possible to determine the language with a fair degree of certainty, it uses generic phonetic rules instead. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.
For example, assume that the candidate matches found when searching for "Stephen" in a database are "Stefan", "Steph", "Stephen", "Steve", "Steven", "Stove", and "Stuffin". "Stefan", "Stephen", and "Steven" are probably relevant, and are names that you want to see; BMPM keeps them. "Stuffin", however, is probably not relevant, and BMPM rejects it. BMPM also rejects "Steph", "Steve", and "Stove". Of those, "Stove" is probably not one that you would have wanted, but "Steph" and "Steve" are ones that you might have been interested in.
For Solr, BMPM searching is available for the following languages:
- English
- French
- German
- Greek
- Hebrew written in Hebrew letters
- Hungarian
- Italian
- Polish
- Romanian
- Russian written in Cyrillic letters
- Russian transliterated into English letters
- Spanish
- Turkish
Although BMPM was originally developed for matching Jewish surnames, the name matching is also applicable to non-Jewish surnames from the countries in which those languages are spoken.
For more information, see here: http://stevemorse.org/phoneticinfo.htm and http://stevemorse.org/phonetics/bmpm.htm.
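A BMPM analysis chain might look like the following schema fragment. This is a minimal sketch, not a definitive configuration: the field type name text_bmpm is hypothetical, and the attribute values shown are assumed to match the filter's defaults, so verify them against the Beider Morse Filter entry in the Filter Descriptions section. Setting languageSet="auto" corresponds to the language-guessing step described above.

```xml
<!-- Hypothetical field type name; the values below are assumed defaults. -->
<fieldType name="text_bmpm" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- nameType:    rule set for the kind of names (GENERIC assumed here)
         ruleType:    APPROX for looser matching, EXACT for stricter matching
         concat:      whether multiple phonetic encodings are joined with "|"
         languageSet: "auto" lets the encoder guess the language from the spelling -->
    <filter class="solr.BeiderMorseFilterFactory"
            nameType="GENERIC"
            ruleType="APPROX"
            concat="true"
            languageSet="auto"/>
  </analyzer>
</fieldType>
```

Because the language guess depends on the spelling of each token, the same analyzer should generally be applied at both index and query time so that the phonetic codes are comparable.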
Daitch-Mokotoff Soundex
To use this encoding in your analyzer, see Daitch-Mokotoff Soundex Filter in the Filter Descriptions section.
The Daitch-Mokotoff Soundex algorithm is a refinement of the Russell and American Soundex algorithms. It yields greater accuracy when matching surnames, especially Slavic and Yiddish ones, that have similar pronunciation but different spellings.
The main differences compared to the other soundex variants are:
- coded names are 6 digits long
- the initial character of the name is coded
- rules to encode multi-character n-grams
- multiple possible encodings for the same name (branching)
Note: the implementation used by Solr (commons-codec’s DaitchMokotoffSoundex) has additional branching rules compared to the original description of the algorithm.
For more information, see http://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex and http://www.avotaynu.com/soundex.htm
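As a rough illustration, a field type using this encoding could be declared as below. The field type name text_dmsoundex is hypothetical, and the sketch assumes the filter is exposed as solr.DaitchMokotoffSoundexFilterFactory with an inject attribute; check the Daitch-Mokotoff Soundex Filter entry in the Filter Descriptions section before relying on it.

```xml
<!-- Hypothetical field type name. -->
<fieldType name="text_dmsoundex" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- inject="true" (assumed default) adds the 6-digit codes alongside the
         original tokens; inject="false" replaces the tokens with the codes.
         A branching name can produce more than one code for a single token. -->
    <filter class="solr.DaitchMokotoffSoundexFilterFactory" inject="true"/>
  </analyzer>
</fieldType>
```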
Double Metaphone
To use this encoding in your analyzer, see Double Metaphone Filter in the Filter Descriptions section. Alternatively, you may specify encoder="DoubleMetaphone" with the Phonetic Filter, but note that the Phonetic Filter version will not provide the second ("alternate") encoding that is generated by the Double Metaphone Filter for some tokens.
Encodes tokens using the double metaphone algorithm by Lawrence Philips. See the original article at http://www.drdobbs.com/the-double-metaphone-search-algorithm/184401251?pgno=2
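The dedicated filter can be sketched as follows. The field type name text_dmetaphone is hypothetical, and the inject and maxCodeLength values are assumptions chosen for illustration; the Double Metaphone Filter entry in the Filter Descriptions section is the authority on the supported attributes.

```xml
<!-- Hypothetical field type name. -->
<fieldType name="text_dmetaphone" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Emits the primary code and, for some tokens, the alternate code.
         inject="false" indexes only the codes, not the original tokens.
         maxCodeLength caps the length of each generated code. -->
    <filter class="solr.DoubleMetaphoneFilterFactory"
            inject="false"
            maxCodeLength="4"/>
  </analyzer>
</fieldType>
```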
Metaphone
To use this encoding in your analyzer, specify encoder="Metaphone" with the Phonetic Filter.
Encodes tokens using the Metaphone algorithm by Lawrence Philips, described in "Hanging on the Metaphone" in Computer Language, Dec. 1990.
Another reference for more information is Double Metaphone Search Algorithm, by Lawrence Philips.
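A Phonetic Filter configuration for this encoding might look like the sketch below. It assumes the Phonetic Filter is provided by solr.PhoneticFilterFactory and that the encoding name is passed in its encoder attribute; the field type name text_metaphone is hypothetical.

```xml
<!-- Hypothetical field type name. -->
<fieldType name="text_metaphone" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- encoder selects the algorithm; inject="true" keeps the original
         token and adds its phonetic code to the token stream. -->
    <filter class="solr.PhoneticFilterFactory"
            encoder="Metaphone"
            inject="true"/>
  </analyzer>
</fieldType>
```

The same pattern should apply to the other encodings that are selected through the Phonetic Filter (Soundex, RefinedSoundex, Caverphone, ColognePhonetic, and Nysiis) by changing only the encoder value.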
Soundex
To use this encoding in your analyzer, specify encoder="Soundex" with the Phonetic Filter.
Encodes tokens using the Soundex algorithm, which is used to relate similar names, but can also be used as a general purpose scheme to find words with similar phonemes.
See also http://en.wikipedia.org/wiki/Soundex.
Refined Soundex
To use this encoding in your analyzer, specify encoder="RefinedSoundex" with the Phonetic Filter.
Encodes tokens using an improved version of the Soundex algorithm.
Caverphone
To use this encoding in your analyzer, specify encoder="Caverphone" with the Phonetic Filter.
Caverphone is an algorithm created by the Caversham Project at the University of Otago. The algorithm is optimised for accents present in the southern part of the city of Dunedin, New Zealand.
See http://en.wikipedia.org/wiki/Caverphone and the Caverphone 2.0 specification at http://caversham.otago.ac.nz/files/working/ctp150804.pdf
Kölner Phonetik a.k.a. Cologne Phonetic
To use this encoding in your analyzer, specify encoder="ColognePhonetic" with the Phonetic Filter.
The Kölner Phonetik, an algorithm published by Hans Joachim Postel in 1969, is optimized for the German language.
NYSIIS
To use this encoding in your analyzer, specify encoder="Nysiis" with the Phonetic Filter.
NYSIIS (New York State Identification and Intelligence System) is an encoding used to relate similar names, but can also be used as a general purpose scheme to find words with similar phonemes.