String::Approx - phpMan

Che Dong
NAME
    String::Approx - Perl extension for approximate matching (fuzzy
    matching)

SYNOPSIS
      use String::Approx 'amatch';

      print if amatch("foobar");

      my @matches = amatch("xyzzy", @inputs);

      my @catches = amatch("plugh", ['2'], @inputs);

DESCRIPTION
    String::Approx lets you match and substitute strings approximately. With
    this you can emulate errors: typing errors, spelling errors, closely
    related vocabularies (colour color), genetic mutations (GAG ACT),
    abbreviations (McScot, MacScot).

    NOTE: String::Approx suits the task of string matching, not string
    comparison, and it works for strings, not for text.

    If you want to compare strings for similarity, you probably just want
    the Levenshtein edit distance (explained below), the Text::Levenshtein
    and Text::LevenshteinXS modules in CPAN. See also Text::WagnerFischer
    and Text::PhraseDistance. (There are functions for this in
    String::Approx, e.g. adist(), but their results sometimes differ from
    the bare Levenshtein et al.)

    If you want to compare things like text or source code, consisting of
    words or tokens and phrases and sentences, or expressions and
    statements, you should probably use some other tool than String::Approx,
    like for example the standard UNIX diff(1) tool, or the Algorithm::Diff
    module from CPAN.

    The measure of approximateness is the *Levenshtein edit distance*. It is
    the total number of "edits": insertions,

            word world

    deletions,

            monkey money

    and substitutions

            sun fun

    required to transform a string to another string. For example, to
    transform *"lead"* into *"gold"*, you need three edits:

            lead gead goad gold

    The edit distance of "lead" and "gold" is therefore three, or 75%.

    String::Approx uses the Levenshtein edit distance as its measure, but
    String::Approx is not well-suited for comparing strings of different
    length, in other words, if you want a "fuzzy eq", see above.
    String::Approx is more like regular expressions or index(), it finds
    substrings that are close matches.>

MATCH
            use String::Approx 'amatch';

            $matched     = amatch("pattern")
            $matched     = amatch("pattern", [ modifiers ])

            $any_matched = amatch("pattern", @inputs)
            $any_matched = amatch("pattern", [ modifiers ], @inputs)

            @match       = amatch("pattern")
            @match       = amatch("pattern", [ modifiers ])

            @matches     = amatch("pattern", @inputs)
            @matches     = amatch("pattern", [ modifiers ], @inputs)

    Match pattern approximately. In list context return the matched @inputs.
    If no inputs are given, match against the $_. In scalar context return
    true if *any* of the inputs match, false if none match.

    Notice that the pattern is a string. Not a regular expression. None of
    the regular expression notations (^, ., *, and so on) work. They are
    characters just like the others. Note-on-note: some limited form of
    *"regular expressionism"* is planned in future: for example character
    classes ([abc]) and *any-chars* (.). But that feature will be turned on
    by a special *modifier* (just a guess: "r"), so there should be no
    backward compatibility problem.

    Notice also that matching is not symmetric. The inputs are matched
    against the pattern, not the other way round. In other words: the
    pattern can be a substring, a submatch, of an input element. An input
    element is always a superstring of the pattern.

  MODIFIERS
    With the modifiers you can control the amount of approximateness and
    certain other control variables. The modifiers are one or more strings,
    for example "i", within a string optionally separated by whitespace. The
    modifiers are inside an anonymous array: the [ ] in the syntax are not
    notational, they really do mean [ ], for example [ "i", "2" ]. ["2 i"]
    would be identical.

    The implicit default approximateness is 10%, rounded up. In other words:
    every tenth character in the pattern may be an error, an edit. You can
    explicitly set the maximum approximateness by supplying a modifier like

            number
            number%

    Examples: "3", "15%".

    Note that "0%" is not rounded up, it is equal to 0.

    Using a similar syntax you can separately control the maximum number of
    insertions, deletions, and substitutions by prefixing the numbers with
    I, D, or S, like this:

            Inumber
            Inumber%
            Dnumber
            Dnumber%
            Snumber
            Snumber%

    Examples: "I2", "D20%", "S0".

    You can ignore case ("A" becames equal to "a" and vice versa) by adding
    the "i" modifier.

    For example

            [ "i 25%", "S0" ]

    means *ignore case*, *allow every fourth character to be "an edit"*, but
    allow *no substitutions*. (See NOTES about disallowing substitutions or
    insertions.)

    NOTE: setting "I0 D0 S0" is not equivalent to using index(). If you want
    to use index(), use index().

SUBSTITUTE
            use String::Approx 'asubstitute';

            @substituted = asubstitute("pattern", "replacement")
            @substituted = asubstitute("pattern", "replacement", @inputs)
            @substituted = asubstitute("pattern", "replacement", [ modifiers ])
            @substituted = asubstitute("pattern", "replacement",
                                       [ modifiers ], @inputs)

    Substitute approximate pattern with replacement and return as a list
    <copies> of @inputs, the substitutions having been made on the elements
    that did match the pattern. If no inputs are given, substitute in the
    $_. The replacement can contain magic strings $&, $`, $' that stand for
    the matched string, the string before it, and the string after it,
    respectively. All the other arguments are as in "amatch()", plus one
    additional modifier, "g" which means substitute globally (all the
    matches in an element and not just the first one, as is the default).

    See "BAD NEWS" about the unfortunate stinginess of "asubstitute()".

INDEX
            use String::Approx 'aindex';

            $index   = aindex("pattern")
            @indices = aindex("pattern", @inputs)
            $index   = aindex("pattern", [ modifiers ])
            @indices = aindex("pattern", [ modifiers ], @inputs)

    Like "amatch()" but returns the index/indices at which the pattern
    matches approximately. In list context and if @inputs are used, returns
    a list of indices, one index for each input element. If there's no
    approximate match, -1 is returned as the index.

    NOTE: if there is character repetition (e.g. "aa") either in the pattern
    or in the text, the returned index might start "too early". This is
    consistent with the goal of the module of matching "as early as
    possible", just like regular expressions (that there might be a "less
    approximate" match starting later is of somewhat irrelevant).

    There's also backwards-scanning "arindex()".

SLICE
            use String::Approx 'aslice';

            ($index, $size)   = aslice("pattern")
            ([$i0, $s0], ...) = aslice("pattern", @inputs)
            ($index, $size)   = aslice("pattern", [ modifiers ])
            ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)

    Like "aindex()" but returns also the size (length) of the match. If the
    match fails, returns an empty list (when matching against $_) or an
    empty anonymous list corresponding to the particular input.

    NOTE: size of the match will very probably be something you did not
    expect (such as longer than the pattern, or a negative number). This may
    or may not be fixed in future releases. Also the beginning of the match
    may vary from the expected as with aindex(), see above.

    If the modifier

            "minimal_distance"

    is used, the minimal possible edit distance is returned as the third
    element:

            ($index, $size, $distance) = aslice("pattern", [ modifiers ])
            ([$i0, $s0, $d0], ...)     = aslice("pattern", [ modifiers ], @inputs)

DISTANCE
            use String::Approx 'adist';

            $dist = adist("pattern", $input);
            @dist = adist("pattern", @input);

    Return the *edit distance* or distances between the pattern and the
    input or inputs. Zero edit distance means exact match. (Remember that
    the match can 'float' in the inputs, the match is a substring match.) If
    the pattern is longer than the input or inputs, the returned distance or
    distances is or are negative.

            use String::Approx 'adistr';

            $dist = adistr("pattern", $input);
            @dist = adistr("pattern", @inputs);

    Return the relative *edit distance* or distances between the pattern and
    the input or inputs. Zero relative edit distance means exact match, one
    means completely different. (Remember that the match can 'float' in the
    inputs, the match is a substring match.) If the pattern is longer than
    the input or inputs, the returned distance or distances is or are
    negative.

    You can use adist() or adistr() to sort the inputs according to their
    approximateness:

            my %d;
            @d{@inputs} = map { abs } adistr("pattern", @inputs);
            my @d = sort { $d{$a} <=> $d{$b} } @inputs;

    Now @d contains the inputs, the most like "pattern" first.

CONTROLLING THE CACHE
    "String::Approx" maintains a LU (least-used) cache that holds the
    'matching engines' for each instance of a *pattern+modifiers*. The cache
    is intended to help the case where you match a small set of patterns
    against a large set of string. However, the more engines you cache the
    more you eat memory. If you have a lot of different patterns or if you
    have a lot of memory to burn, you may want to control the cache
    yourself. For example, allowing a larger cache consumes more memory but
    probably runs a little bit faster since the cache fills (and needs
    flushing) less often.

    The cache has two parameters: *max* and *purge*. The first one is the
    maximum size of the cache and the second one is the cache flushing
    ratio: when the number of cache entries exceeds *max*, *max* times
    *purge* cache entries are flushed. The default values are 1000 and 0.75,
    respectively, which means that when the 1001st entry would be cached,
    750 least used entries will be removed from the cache. To access the
    parameters you can use the calls

            $now_max = String::Approx::cache_max();
            String::Approx::cache_max($new_max);

            $now_purge = String::Approx::cache_purge();
            String::Approx::cache_purge($new_purge);

            $limit = String::Approx::cache_n_purge();

    To be honest, there are actually two caches: the first one is used far
    the patterns with no modifiers, the second one for the patterns with
    pattern modifiers. Using the standard parameters you will therefore
    actually cache up to 2000 entries. The above calls control both caches
    for the same price.

    To disable caching completely use

            String::Approx::cache_disable();

    Note that this doesn't flush any possibly existing cache entries, to do
    that use

            String::Approx::cache_flush_all();

NOTES
    Because matching is by *substrings*, not by whole strings, insertions
    and substitutions produce often very similar results: "abcde" matches
    "axbcde" either by insertion or substitution of "x".

    The maximum edit distance is also the maximum number of edits. That is,
    the "I2" in

            amatch("abcd", ["I2"])

    is useless because the maximum edit distance is (implicitly) 1. You may
    have meant to say

            amatch("abcd", ["2D1S1"])

    or something like that.

    If you want to simulate transposes

            feet fete

    you need to allow at least edit distance of two because in terms of our
    edit primitives a transpose is first one deletion and then one
    insertion.

  TEXT POSITION
    The starting and ending positions of matching, substituting, indexing,
    or slicing can be changed from the beginning and end of the input(s) to
    some other positions by using either or both of the modifiers

            "initial_position=24"
            "final_position=42"

    or the both the modifiers

            "initial_position=24"
            "position_range=10"

    By setting the "position_range" to be zero you can limit (anchor) the
    operation to happen only once (if a match is possible) at the position.

VERSION
    Major release 3.

CHANGES FROM VERSION 2
  GOOD NEWS
    The version 3 is 2-3 times faster than version 2
    No pattern length limitation
        The algorithm is independent on the pattern length: its time
        complexity is *O(kn)*, where *k* is the number of edits and *n* the
        length of the text (input). The preprocessing of the pattern will of
        course take some *O(m)* (*m* being the pattern length) time, but
        "amatch()" and "asubstitute()" cache the result of this
        preprocessing so that it is done only once per pattern.

  BAD NEWS
    You do need a C compiler to install the module
        Perl's regular expressions are no more used; instead a faster and
        more scalable algorithm written in C is used.

    "asubstitute()" is now always stingy
        The string matched and substituted is now always stingy, as short as
        possible. It used to be as long as possible. This is an unfortunate
        change stemming from switching the matching algorithm. Example: with
        edit distance of two and substituting for "word" from "cork" and
        "wool" previously did match "cork" and "wool". Now it does match
        "or" and "wo". As little as possible, or, in other words, with as
        much approximateness, as many edits, as possible. Because there is
        no *need* to match the "c" of "cork", it is not matched.

    no more "aregex()" because regular expressions are no more used
    no more "compat1" for String::Approx version 1 compatibility

ACKNOWLEDGEMENTS
    The following people have provided valuable test cases, documentation
    clarifications, and other feedback:

    Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
    Aldo Calpini, David Curiel, Teun van den Dool, Alberto Fontaneda, Rob
    Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner, B. Elijah
    Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton, 'idallen', Helmut
    Jarausch, Damian Keefe, Ben Kennedy, Craig Kelley, Franz Kirsch, Dag
    Kristian, Mark Land, J. D. Laub, John P. Linderman, Tim Maher, Juha
    Muilu, Sergey Novoselov, Andy Oram, Ji Y Park, Eric Promislow, Nikolaus
    Rath, Stefan Ram, Slaven Rezic, Dag Kristian Rognlien, Stewart Russell,
    Slaven Rezic, Chris Rosin, Pasha Sadri, Ilya Sandler, Bob J.A.
    Schijvenaars, Ross Smith, Frank Tobin, Greg Ward, Rich Williams, Rick
    Wise.

    The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
    Gopal in the Department of Computer Science, University of Arizona.

AUTHOR
    Jarkko Hietaniemi <jhi AT iki.fi>

COPYRIGHT AND LICENSE
    Copyright 2001-2013 by Jarkko Hietaniemi

    This library is free software; you can redistribute it and/or modify
    under either the terms of the Artistic License 2.0, or the GNU Library
    General Public License, Version 2. See the files Artistic and LGPL for
    more details.

    Furthermore: no warranties or obligations of any kind are given, and the
    separate file COPYRIGHT must be included intact in all copies and
    derived materials.