Variable-length lookbehind.
Maintainer: Peter Heslin <Peter.Heslin@ucd.ie>
Date: 9 Aug 2000
Last Modified: 30 Sept 2000
Mailing List: perl6-language-regex@perl.org
Number: 72
Version: 4
Status: Developing
In Perl 6, regular-expression lookbehind should be extended to permit variable-length as well as fixed-length patterns.
Version 4 is a withdrawal of the originally proposed syntax in favour, in part, of RFC 317; the present RFC has not been withdrawn, because the proposal to implement variable-length lookbehind still stands.
Version 3 is a revision to make it clearer that the purpose of the proposal is to implement a general scheme for variable-length lookbehind. The title has been changed to reflect this.
Version 2 of this RFC redirects discussion of this topic to perl6-language-regex@perl.org.
The original proposal offered some new syntax as a portmanteau solution
to two very different problems: the lack of variable-length lookbehind
in Perl 5, and the lack of a method, other than the monolithic and
inflexible study(), by which the programmer could pass hints to the
regexp optimizer, based upon her knowledge of the peculiarities of the
target string. The latter problem has now been addressed much more
cleanly by Hugo van der Sanden in RFC 317, and so I have withdrawn the
originally proposed syntax, and all that remains for this RFC to discuss
is variable-length lookbehind.
I originally proposed new syntax designed to kill two birds with one stone. For variable-length lookbehind alone, the syntax currently used for fixed-length lookbehind would serve just as well: (?<=pattern) and (?<!pattern).
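As a rough illustration of the gap, the sketch below compiles one fixed-length and one variable-length lookbehind in Perl 5. The sample patterns and strings are illustrative only. The variable-length pattern is built from a string so that it is compiled at run time, where a rejection by the regex engine can be trapped with eval instead of aborting the program.

```perl
use strict;
use warnings;

# Fixed-length lookbehind: accepted by Perl 5.
my $fixed    = qr/(?<=foo)bar/;
my $fixed_ok = "foobar" =~ $fixed ? 1 : 0;
print "fixed-length lookbehind matched: $fixed_ok\n";

# Variable-length lookbehind: interpolate the pattern so that it is
# compiled at run time, where eval can catch a compile failure.
my $pattern  = '(?<=\w\s*\d)x';
my $variable = eval { qr/$pattern/ };
print defined $variable
    ? "variable-length lookbehind compiled\n"
    : "variable-length lookbehind rejected: $@";
```

Whether the second pattern compiles depends on the perl binary running it; the sketch reports either outcome rather than assuming one.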
The original proposal suggested that it would be nice if pos() in list
context were to return the offset into the target string where
the match begins, as well as where it ends. Some people seemed to like
this idea, so I am leaving it in, even though it no longer has much to
do with the content of this RFC.
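For comparison, Perl 5 already exposes both offsets, though through the special arrays @- and @+ rather than through pos(). A minimal sketch of recovering the start and end of a match follows; the sample string and variable names are illustrative only.

```perl
use strict;
use warnings;

my $target = "the quick brown fox";

my ($start, $end, $pos_after);
if ($target =~ /quick/g) {
    $start     = $-[0];          # offset where the whole match begins
    $end       = $+[0];          # offset just past the end of the match
    $pos_after = pos($target);   # scalar pos() reports only the end offset
    print "start=$start end=$end pos=$pos_after\n";
}
```

A list-context pos() as suggested above would return the equivalent of ($start, $end) in one call.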
Ilya Zakharevich says it is trivial ;-) Here is part of his comment on an earlier version of this RFC:
=================================================================
As I said it for many times: this is absolutely trivial to implement. First of all, if you agree to rewrite
(?<= \w\s*\d ) # Semantic X: match "a 1"
as
(?<= \d\s*\w ) # Semantic Y: match "a 1"
then it is as simple as inserting go-back-by(1) nodes before each node for \s \d and \w.
And to support the more intuitive ;-) semantic X, the only more-or-less tricky part is to recursively go through the compile tree, and put "concatenated" nodes in the opposite order. A piece of cake.
=================================================================
The original RFC proposed syntax that would have corresponded to Ilya's semantic Y, precisely because I thought it would be easier to implement. Several people objected to this as being obscure, so I changed it to semantic X, which means that Perl, and not the programmer, would have to reverse the regexp in order to match it backwards from the start of the current match.
One thing that should be avoided is to implement lookbehind by starting at offset 0 in the target string and simply matching forwards from there. As Hugo has pointed out, this would throw away the work of the optimizer and lead to quadratic behaviour. This is why variable-length lookbehind has to approach the match in reverse, starting from the current offset of the start of match.
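The backwards approach can be imitated in today's Perl 5, purely as an illustration and not as the proposed implementation. The sketch below assumes the programmer supplies the lookbehind already reversed (Ilya's semantic Y), reverses the target string once, and anchors the reversed pattern with \G at the mirrored offset, so the test starts at the current offset and never rescans from offset 0. The helper name behind_matches and the sample data are hypothetical; a real engine would step backwards through its compiled nodes instead of reversing the string.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text ending at offset $p of the
# original string satisfies the lookbehind whose *reversed* form is
# $rev_re. $rev is a reference to the reversed target string.
sub behind_matches {
    my ($rev, $p, $rev_re) = @_;
    pos($$rev) = length($$rev) - $p;    # offset $p as seen from the far end
    return $$rev =~ /\G$rev_re/gc ? 1 : 0;
}

my $str = "x a 1 foo";
my $rev = reverse $str;                 # reversed once, reused for every test

# Stands in for (?<= \w\s*\d ), written backwards as \d\s*\w.
my $rev_re = qr/\d\s*\w/;

for my $offset (3, 5) {
    my $result = behind_matches(\$rev, $offset, $rev_re);
    print "offset $offset: $result\n";
}
```

Offset 5 lies just after "a 1", so the lookbehind holds there; offset 3 lies just after "x a", where it does not.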
It would be acceptable if, for simplicity of implementation, variable-length lookbehind were restricted to a subset of regular expression syntax, since even that would be an improvement over what we have now.
RFC 317: Access to optimisation information for regular expressions