[% setvar title Regex modifier for support of chunk processing and prefix matching %]
Note: these documents may be out of date. Do not use as reference! |
To see what is currently happening visit http://www.perl6.org/
Regex modifier for support of chunk processing and prefix matching
Maintainer: Bart Lateur <bart.lateur@skynet.be> Date: 23 Sep 2000 Mailing List: perl6-language-regex@perl.org Number: 316 Version: 1 Status: Developing
This RFC proposes to add a regex modifier, which allows to reliably recognize possible matches that span multiple records. In addition, it can recognize if a string is/ends with a string that may possibly be the start of a matching string.
Currently, string processing happens using one of two approaches:
Sometimes, you are not able to reliably find a way to split up the input so that no match could possibly span multiple input records. But at the same time, the input file is too big to be loaded in memory in one piece. What to do?
If you just split the data into chunks of arbitrary length and process them independently, you may have the problem that some matching strings aren't recognized, or worse still, that the wrong kind of match is found instead.
You want to extract occurrences of the string 'abcd', or, as a second choice, 'bc', using the regex:
/abcd|bc/
You process this in chunks of 1k each. It can happen that an occurrence of the string abcd is broken into two parts, between chunks. What will happen, is one of the following:
<pre>
end of start of /abcd/bc/ first second will allow chunk chunk you to find: ------ -------- ------------ abcd abcd a bcd bc ab cd (no match) abc d bc abcd abcd
</pre>
My solution is to still use the same regex, but with an extra modifier. This modifier makes the regex engine abort its search, as soon as the end of the string is prematurely reached. In the above example, if "abc" is found, no attempt would be made to try to match it against "bc".
For this modifier, I chose the letter 'z', for its mnemonic relationship with "\z", the regex marker for end-of-string.
In addition, pos() is set to the offset of the start of the recognized match prefix. In case of a plain succesful match, or of a normal not-found termination, pos is undef() on exit.
This serves both as a flag, as pos will only be defined if the search has been aborted for this reason, and it allows more optimized searching, because after you have appended the next chunk to the current one, the next try will simply start again at the position where the pattern may first match, skipping any earlier matches. In the above example, that would be at the "a".
Exception-happy people would most likely implement it using exceptions, and throw an exception when the end of the buffer is prematurely reached. But I'm not like that. Reaching the end of a string is not an exception.
The following code can do the task described, provided that matches are rare enough so that the buffer probably won't grow too big:
my $chunksize = 1024; while(read FH, my $buffer, $chunksize) { while(/(abcd|bc)/gz) { # do something boring with the matched string: print "$1\n"; } if(defined pos) { # end-of-buffer exception # append the next chunk to the current one read FH, $buffer, $chunksize, length $buffer; # retry matching redo; } }
As you can see from the example code, the program flow stays very close to what people would ordinarily program under normal circumstances.
By contrast, RFC 93 proposes another solution to the same problem, but using callbacks. Since the same sub must do one of several things, the first thing that needs to be done is to channel different kinds of requests to their own handler. As a result, you need a complete rewrite from what you'd use in the ordinary case.
I think that a lot of people will find my approach far less intimidating.
It can be useful to be able to recognize if a string could possibly be a
prefix for a potential match. For example in an interactive program,
you want to allow a user to enter a number into an input field, but
nothing else. After every single keystroke, you can test what he just
entered against a regex matching the valid format for a number, so that
1234E
can be recognized as a prefix for the regex
/^\d+\.?\d*(?:E[+-]?\d+)$/
I originally had thought of providing a separate, dedicated regex modifier, just for the match prefix, but I don't think too many people need this that desperately. You can easily build a working application with just the '/z' modifier. If you can't, you're in over your head, anyway. ;-)
The regex engine must be given the option to abort the search, and not do any backtracking, as soon as it attempts to do a lookahead to the next character, but it finds the end of the string instead.
In addition, pos() must be patched so that under this modifier, it normally is reset to undef(), but when this exception takes place, pos is set to the offset of the start of the prefix.
RFC 93: Regex: Support for incremental pattern matching