[% setvar title Regex modifier for support of chunk processing and prefix matching %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/

TITLE

Regex modifier for support of chunk processing and prefix matching

VERSION

  Maintainer: Bart Lateur <bart.lateur@skynet.be>
  Date: 23 Sep 2000
  Mailing List: perl6-language-regex@perl.org
  Number: 316
  Version: 1
  Status: Developing

ABSTRACT

This RFC proposes to add a regex modifier, which allows to reliably recognize possible matches that span multiple records. In addition, it can recognize if a string is/ends with a string that may possibly be the start of a matching string.

DESCRIPTION

Currently, string processing happens using one of two approaches:

Sometimes, you are not able to reliably find a way to split up the input so that no match could possibly span multiple input records. But at the same time, the input file is too big to be loaded in memory in one piece. What to do?

If you just split the data into chunks of arbitrary length and process them independently, you may have the problem that some matching strings aren't recognized, or worse still, that the wrong kind of match is found instead.

Example

You want to extract occurrences of the string 'abcd', or, as a second choice, 'bc', using the regex:

    /abcd|bc/
    

You process this in chunks of 1k each. It can happen that an occurrence of the string abcd is broken into two parts, between chunks. What will happen, is one of the following:

<pre>

    end of  start of    /abcd/bc/   
    first   second      will allow
    chunk   chunk       you to find:
    ------  --------    ------------
             abcd        abcd
     a       bcd         bc
     ab      cd         (no match)
     abc     d           bc
     abcd                abcd

</pre>

Remedy

My solution is to still use the same regex, but with an extra modifier. This modifier makes the regex engine abort its search, as soon as the end of the string is prematurely reached. In the above example, if "abc" is found, no attempt would be made to try to match it against "bc".

For this modifier, I chose the letter 'z', for its mnemonic relationship with "\z", the regex marker for end-of-string.

In addition, pos() is set to the offset of the start of the recognized match prefix. In case of a plain succesful match, or of a normal not-found termination, pos is undef() on exit.

This serves both as a flag, as pos will only be defined if the search has been aborted for this reason, and it allows more optimized searching, because after you have appended the next chunk to the current one, the next try will simply start again at the position where the pattern may first match, skipping any earlier matches. In the above example, that would be at the "a".

Exception-happy people would most likely implement it using exceptions, and throw an exception when the end of the buffer is prematurely reached. But I'm not like that. Reaching the end of a string is not an exception.

Application

The following code can do the task described, provided that matches are rare enough so that the buffer probably won't grow too big:

    my $chunksize = 1024;
    while(read FH, my $buffer, $chunksize) {
        while(/(abcd|bc)/gz) {
            # do something boring with the matched string:
            print "$1\n";
        }
        if(defined pos) {  # end-of-buffer exception
            # append the next chunk to the current one
            read FH, $buffer, $chunksize, length $buffer;
            # retry matching
            redo;
        }
    }

    

Procedural programming style

As you can see from the example code, the program flow stays very close to what people would ordinarily program under normal circumstances.

By contrast, RFC 93 proposes another solution to the same problem, but using callbacks. Since the same sub must do one of several things, the first thing that needs to be done is to channel different kinds of requests to their own handler. As a result, you need a complete rewrite from what you'd use in the ordinary case.

I think that a lot of people will find my approach far less intimidating.

Match prefix

It can be useful to be able to recognize if a string could possibly be a prefix for a potential match. For example in an interactive program, you want to allow a user to enter a number into an input field, but nothing else. After every single keystroke, you can test what he just entered against a regex matching the valid format for a number, so that 1234E can be recognized as a prefix for the regex

    /^\d+\.?\d*(?:E[+-]?\d+)$/
    

I originally had thought of providing a separate, dedicated regex modifier, just for the match prefix, but I don't think too many people need this that desperately. You can easily build a working application with just the '/z' modifier. If you can't, you're in over your head, anyway. ;-)

IMPLEMENTATION

The regex engine must be given the option to abort the search, and not do any backtracking, as soon as it attempts to do a lookahead to the next character, but it finds the end of the string instead.

In addition, pos() must be patched so that under this modifier, it normally is reset to undef(), but when this exception takes place, pos is set to the offset of the start of the prefix.

REFERENCES

RFC 93: Regex: Support for incremental pattern matching