NAME
    Lingua::EN::Sentence - split text into sentences

SYNOPSIS
            use Lingua::EN::Sentence qw( get_sentences add_acronyms );

            my $text = 'Lt. Gen. Smith arrived. He spoke briefly.';   ## sample input
            add_acronyms('lt','gen');               ## adding support for 'Lt. Gen.'
            my $sentences=get_sentences($text);     ## Get the sentences.
            foreach my $sentence (@$sentences) {
                    ## do something with $sentence
            }

DESCRIPTION
    The "Lingua::EN::Sentence" module contains the function get_sentences, which splits text into
    its constituent sentences, based on a regular expression and a list of abbreviations (built in
    and given).

    Certain well-known exceptions, such as abbreviations, may cause incorrect segmentation. Some of
    them are already built into this code and are handled. Still, if you find words that cause
    the get_sentences function to split incorrectly, you can add them to the module with
    add_acronyms, so that it recognizes them.

ALGORITHM
    Basically, I use a 'brute' regular expression to split the text into sentences. (Well, nothing
    is split yet - I just mark the end-of-sentence.) Then I apply a set of rules which decide
    when an end-of-sentence is justified and when it's a mistake. In case of a mistake, the
    end-of-sentence mark is removed.

    What are such mistakes? Cases of abbreviations, for example. I have a list of such abbreviations
    (please see the 'Acronym/Abbreviations list' section below), and more general rules (for
    example, the abbreviations 'i.e.' and 'e.g.' need not be in the list, as a special rule takes
    care of all single-letter abbreviations).
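
    The mark-then-unmark approach described above can be sketched in plain Perl. This is only an
    illustration under stated assumptions, not the module's actual code: the marker value, the
    short abbreviation list, and the function name split_sketch are all hypothetical, and the
    real rules are more extensive.

```perl
use strict;
use warnings;

# Hypothetical names throughout: $EOS, @ABBREVIATIONS and split_sketch()
# are illustrative and are not taken from the module's source.
my $EOS           = "\001";
my @ABBREVIATIONS = ( 'mr', 'dr', 'lt', 'gen' );

sub split_sketch {
    my ($text) = @_;

    # Step 1: 'brute' marking. Put an EOS mark after every ., ? or !
    # that is followed by whitespace or by the end of the text.
    $text =~ s/([.?!])(?=\s|\z)/$1$EOS/g;

    # Step 2: remove marks that follow a listed abbreviation.
    my $abbr = join '|', @ABBREVIATIONS;
    $text =~ s/\b($abbr)\.\Q$EOS\E/$1./gi;

    # Step 3: remove marks that follow a single letter, which covers
    # dotted forms such as 'e.g.' without listing them.
    $text =~ s/\b([a-z])\.\Q$EOS\E/$1./gi;

    # Step 4: split on the surviving marks, trim each piece, and keep
    # only pieces that contain an alphanumeric character.
    my @sentences;
    for my $piece ( split /\Q$EOS\E/, $text ) {
        $piece =~ s/^\s+|\s+$//g;
        push @sentences, $piece if $piece =~ /[[:alnum:]]/;
    }
    return \@sentences;
}

my $result = split_sketch(
    'Dr. Smith met Lt. Gen. Jones, e.g. at noon. They talked. Was it useful?'
);
print "$_\n" for @$result;
```

    With this input the sketch yields three sentences, because the marks after 'Dr.', 'Lt.',
    'Gen.' and 'e.g.' are removed before the split.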

FUNCTIONS
    All functions used should be requested in the 'use' clause. None is exported by default.

    get_sentences( $text )
        The get_sentences function takes a scalar containing ASCII text as an argument and returns a
        reference to an array of the sentences that the text has been split into. Leading and
        trailing white space is trimmed from each returned sentence. Strings with no alphanumeric
        characters in them are not returned as sentences.

    add_acronyms( @acronyms )
        This function is used for adding acronyms not supported by this code. The input should be
        regular expressions for matching the desired acronyms, but should not include the final
        period ("."). So, for example, "blv?d" matches "blvd." and "bld.". "a\.mlf" will match
        "a.mlf.". You do not need to bother with acronyms consisting of single letters and dots
        (e.g. "U.S.A."), as these are found automatically. Note also that acronyms are searched for
        on a case insensitive basis.

        Please see 'Acronym/Abbreviations list' section for the abbreviations already supported by
        this module.
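
        The pattern rules above can be checked with ordinary Perl regular expressions,
        independently of the module. The sketch below assumes that each pattern is matched
        case-insensitively and is followed by the final period; the helper matches_acronym is
        hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $word is the acronym pattern followed
# by the final period, ignoring case.
sub matches_acronym {
    my ( $pattern, $word ) = @_;
    return $word =~ /\A(?:$pattern)\.\z/i ? 1 : 0;
}

my @checks = (
    [ 'blv?d',  'blvd.'  ],    # optional 'v' present
    [ 'blv?d',  'bld.'   ],    # optional 'v' absent
    [ 'blv?d',  'BLVD.'  ],    # case is ignored
    [ 'a\.mlf', 'a.mlf.' ],    # escaped dot matches a literal dot
    [ 'a\.mlf', 'axmlf.' ],    # ...and nothing else
);

for my $check (@checks) {
    my ( $pattern, $word ) = @$check;
    printf "%-8s %-8s %s\n", $pattern, $word,
        matches_acronym( $pattern, $word ) ? 'match' : 'no match';
}
```

        Only the last pair fails to match, which is why an internal period in a pattern should be
        escaped.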

    get_acronyms( )
        This function will return the defined list of acronyms.

    set_acronyms( @my_acronyms )
        This function replaces the predefined acronym list with the given list. See "add_acronyms"
        for details on the input specifications.

    get_EOS( )
        This function returns the value of the string used to mark the end of sentence. You might
        want to see what it is, and to make sure your text doesn't contain it. You can use set_EOS()
        to alter the end-of-sentence string to whatever you desire.

    set_EOS( $new_EOS_string )
        This function alters the end-of-sentence string used to mark the end of sentences.
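
        One way to make sure the text does not contain the end-of-sentence string is to test for it
        before splitting and to pick a different string if it is present. The sketch below uses
        plain Perl only; the helper safe_marker is hypothetical, and the value "\001" is an assumed
        stand-in for whatever get_EOS() returns on your installation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a marker that does not occur in $text,
# starting from $candidate and appending "\002" until it is unique.
sub safe_marker {
    my ( $text, $candidate ) = @_;
    $candidate .= "\002" while index( $text, $candidate ) >= 0;
    return $candidate;
}

my $current = "\001";                            # assumed stand-in for get_EOS()
my $text    = "First part.\001 Second part.";    # text that happens to contain it
my $marker  = safe_marker( $text, $current );

printf "marker changed: %s\n", $marker eq $current ? 'no' : 'yes';
```

        The resulting string could then be passed to set_EOS() before calling get_sentences().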

    set_locale( $new_locale )
        Receives a language locale in the form language_country.character-set, for example
        "fr_CA.ISO8859-1" for Canadian French using character set ISO8859-1.

        Returns a reference to a hash containing the current locale formatting values. Returns undef
        if given undef.

        The following will set the locale behaviour to Argentinian Spanish. NOTE: The naming and
        availability of locales depends on your operating system. Please consult the perllocale
        manpage to find out which locales are available on your system.

        $loc = set_locale( "es_AR.ISO8859-1" );

        This actually does this:

        $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );
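
        Because locale availability varies between systems, it can help to check the result of the
        underlying call. The sketch below uses the core POSIX module directly rather than
        set_locale(), and relies on POSIX::setlocale returning undef when the requested locale
        cannot be set; the variable names are illustrative.

```perl
use strict;
use warnings;
use POSIX qw( setlocale LC_ALL );

my $wanted = 'es_AR.ISO8859-1';

# setlocale() returns the new locale string on success and undef when
# the requested locale is unknown to the operating system.
my $loc = setlocale( LC_ALL, $wanted );

if ( defined $loc ) {
    print "locale set to $loc\n";
}
else {
    # With a single argument, setlocale() only queries the current locale.
    my $current = setlocale(LC_ALL);
    print "locale $wanted is not available; still using $current\n";
}
```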

Acronym/Abbreviations list
    You can use the get_acronyms() function to get the current list of acronyms. The list has become
    too long to reproduce in the documentation.

    If I come across a good general-purpose list - I'll incorporate it into this module. Feel free
    to suggest such lists.

FUTURE WORK
            [1] Object Oriented like usage
            [2] Supporting more than just English/French
            [3] Code optimization. Currently everything is RE based, and the REs are not well optimized
            [4] Possibly use more semantic heuristics for detecting a beginning of a sentence

SEE ALSO
            Text::Sentence

REPOSITORY
    <https://github.com/kimryan/Lingua-EN-Sentence>

AUTHOR
    Shlomo Yona shlomo AT cs.il

    Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

COPYRIGHT AND LICENSE
    Copyright (c) 2001-2016 Shlomo Yona. All rights reserved. Copyright (c) 2018 Kim Ryan. All
    rights reserved.

    This library is free software; you can redistribute it and/or modify it under the same terms as
    Perl itself.
