PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

       Certain items that may appear in regular expression patterns are more efficient
       than others. It is more efficient to use a character class like [aeiou] than a set
       of alternatives such as (a|e|i|o|u). The simplest construction that provides the
       required behaviour is usually the most efficient. Jeffrey Friedl’s book
       "Mastering Regular Expressions" contains a lot of useful general discussion about
       optimizing regular expressions for efficient performance. This document contains
       a few observations about PCRE.
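
       The character class and the set of alternatives accept the same characters, so
       one can usually be swapped for the other. The sketch below checks that and times
       both forms. It uses Python's re module, which is a different backtracking engine
       from PCRE, so the timings are only an illustration; the helper names same_matches
       and time_pattern are made up for this example.

```python
import re
import timeit

# Two patterns that accept exactly the same single characters.
CHAR_CLASS = re.compile(r"[aeiou]")
ALTERNATION = re.compile(r"(a|e|i|o|u)")

def same_matches(subject):
    """Return True if both patterns find matches at the same positions."""
    class_spans = [m.span() for m in CHAR_CLASS.finditer(subject)]
    alt_spans = [m.span() for m in ALTERNATION.finditer(subject)]
    return class_spans == alt_spans

def time_pattern(pattern, subject, repeats=200):
    """Time `repeats` full scans of the subject; results depend on the machine."""
    return timeit.timeit(lambda: pattern.findall(subject), number=repeats)

subject = "the quick brown fox jumps over the lazy dog " * 20
print(same_matches(subject))
print("class:      ", time_pattern(CHAR_CLASS, subject))
print("alternation:", time_pattern(ALTERNATION, subject))
```

       The absolute numbers are not meaningful; only the ratio between the two forms on
       the same machine is worth looking at.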

       Using  Unicode  character  properties (the \p, \P, and \X escapes) is slow, because
       PCRE has to scan a structure that contains data for over fifteen  thousand  charac-
       ters  whenever it needs a character’s property. If you can find an alternative pat-
       tern that does not use character properties, it will probably be faster.

       When a pattern begins with .* not in parentheses, or in parentheses  that  are  not
       the  subject  of a backreference, and the PCRE_DOTALL option is set, the pattern is
       implicitly anchored by PCRE, since it can match only at  the  start  of  a  subject
       string.  However,  if  PCRE_DOTALL  is not set, PCRE cannot make this optimization,
       because the . metacharacter does not then match  a  newline,  and  if  the  subject
       string contains newlines, the pattern may match from the character immediately fol-
       lowing one of them instead of from the very start. For example, the pattern

         .*second

       matches the subject "first\nand second" (where \n stands for a newline  character),
       with  the match starting at the seventh character. In order to do this, PCRE has to
       retry the match starting after every newline in the subject.

       If you are using such a pattern with subject strings that do not contain  newlines,
       the  best  performance  is obtained by setting PCRE_DOTALL, or starting the pattern
       with ^.* to indicate explicit anchoring. That saves PCRE from having to scan  along
       the subject looking for a newline to restart at.
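
       The effect of the newline on where the match starts can be observed with a short
       script. This is a sketch using Python's re module rather than PCRE itself;
       re.DOTALL plays the role of PCRE_DOTALL here, and offsets are counted from zero,
       so offset 6 is the seventh character.

```python
import re

subject = "first\nand second"

# Without DOTALL, "." stops at the newline, so the match starts after it.
m = re.search(r".*second", subject)
print(m.start())          # 6, i.e. the seventh character

# With DOTALL, ".*" can cross the newline and the match starts at offset 0.
m_dotall = re.search(r".*second", subject, re.DOTALL)
print(m_dotall.start())   # 0

# An explicit "^" pins the attempt to the start of the subject. On a subject
# with a newline this no longer matches, because "." cannot cross it.
print(re.search(r"^.*second", subject))                      # None

# On a subject without newlines the anchored form matches from offset 0.
print(re.search(r"^.*second", "first and second").start())   # 0
```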

       Beware  of  patterns  that contain nested indefinite repeats. These can take a long
       time to run when applied to a string that does  not  match.  Consider  the  pattern
       fragment

         (a+)*

       This can match "aaaa" in 16 different ways, and this number increases very rapidly
       as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 times, and  for
       each  of  those  cases  other  than 0, the + repeats can match different numbers of
       times.) When the remainder of the pattern is such that the entire match is going to
       fail,  PCRE  has in principle to try every possible variation, and this can take an
       extremely long time.
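
       The growth can be made concrete by counting. The sketch below assumes one
       particular counting convention: a "way" is any split of a prefix of the subject
       (including the empty prefix) into non-empty runs of "a", one run per iteration of
       the outer repeat. Under that convention it gives 16 for "aaaa" and doubles with
       every additional character; other conventions give different totals. The function
       name count_prefix_splits is hypothetical, and the code does not model PCRE's
       internals.

```python
def count_prefix_splits(n):
    """Count the ways (a+)* can consume a prefix of a string of n 'a' characters.

    splits[k] is the number of ways to cut the first k characters into
    non-empty runs, one run per iteration of the outer "*".
    """
    splits = [1] + [0] * n   # one way to match nothing: zero iterations
    for k in range(1, n + 1):
        # The last run has length j; the k - j characters before it can be
        # split in splits[k - j] ways.
        splits[k] = sum(splits[k - j] for j in range(1, k + 1))
    return sum(splits)

for n in (4, 8, 16, 24):
    print(n, count_prefix_splits(n))
```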

       An optimization catches some of the simpler cases such as

         (a+)*b

       where a literal character follows. Before embarking on the standard matching proce-
       dure,  PCRE checks that there is a "b" later in the subject string, and if there is
       not, it fails the match immediately. However, when there is  no  following  literal
       this  optimization  cannot  be  used.  You  can see the difference by comparing the
       behaviour of

         (a+)*\d

       with the pattern above. The pattern that ends in the literal "b" fails almost
       instantly when applied to a whole line of "a" characters, whereas the pattern that
       ends in \d takes an appreciable time with strings longer than about 20 characters.
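
       The idea behind the literal check can be sketched outside the library. The
       function below is a hypothetical wrapper, not PCRE's implementation: the caller
       names the literal that the pattern requires, and the regular expression is run
       only if that literal occurs in the subject. It uses Python's re module for the
       matching step.

```python
import re

def search_with_required_literal(pattern, required, subject):
    """Run re.search only if the literal the pattern needs is in the subject."""
    if required not in subject:
        return None   # cheap linear scan, no backtracking at all
    return re.search(pattern, subject)

# No "b" anywhere in the subject, so the expensive match is never attempted.
print(search_with_required_literal(r"(a+)*b", "b", "a" * 40))

# The literal is present, so the pattern runs normally.
print(search_with_required_literal(r"(a+)*b", "b", "aaab").group(0))
```

       No such shortcut is available for a pattern ending in \d, because there is no
       single literal character to look for.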

       In many cases, the solution to this kind of performance issue is to use  an  atomic
       group or a possessive quantifier.
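
       As an illustration of the atomic-group approach, the sketch below rewrites the
       slow pattern so that the inner repeat cannot give back characters once it has
       matched. It uses Python's re module, not PCRE, and emulates the atomic group with
       a lookahead followed by a backreference, which works because a lookahead is not
       re-entered on backtracking. Python 3.11 and later also accept the direct forms
       (?>a+) and a++; the emulation is used here only so the example runs on older
       versions.

```python
import re

# Plain nested repeat: exponential backtracking on a long run of "a".
NESTED = r"(a+)*\d"

# Emulated atomic group: the lookahead captures the longest run of "a", and
# the backreference \1 then consumes exactly that text with no alternatives.
ATOMIC_EMULATED = r"(?:(?=(a+))\1)*\d"

# Fails quickly even on a long subject with no digit.
print(re.search(ATOMIC_EMULATED, "a" * 200))         # None

# Both forms agree when a match exists.
print(re.search(ATOMIC_EMULATED, "aaa7").group(0))   # aaa7
print(re.search(NESTED, "aaa7").group(0))            # aaa7
```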

Last updated: 09 September 2004
Copyright (c) 1997-2004 University of Cambridge.



                                                                       PCRE(3)
