[% setvar title Variable-length lookbehind. %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/

TITLE

Variable-length lookbehind.

VERSION

  Maintainer: Peter Heslin <Peter.Heslin@ucd.ie>
  Date: 9 Aug 2000
  Last Modified: 30 Sept 2000
  Mailing List: perl6-language-regex@perl.org
  Number: 72
  Version: 4
  Status: Developing

ABSTRACT

In Perl 6, lookbehind in regular expressions should be extended to permit variable-length as well as fixed-length patterns.

CHANGES

Version 4 is a withdrawal of the originally proposed syntax in favour, in part, of RFC 317; the present RFC has not been withdrawn, because the proposal to implement variable-length lookbehind still stands.

Version 3 is a revision to make it clearer that the purpose of the proposal is to implement a general scheme for variable-length lookbehind. The title has been changed to reflect this.

Version 2 of this RFC redirects discussion of this topic to perl6-language-regex@perl.org.

DESCRIPTION

The original proposal offered new syntax as a portmanteau solution to two very different problems. The first was the lack of variable-length lookbehind in Perl 5. The second was the lack of any method, other than the monolithic and inflexible study function, by which the programmer could pass hints to the regexp optimizer based upon her knowledge of the peculiarities of the target string. The latter problem has now been addressed much more cleanly by Hugo van der Sanden in RFC 317, so I have withdrawn the originally proposed syntax, and all that remains for this RFC to discuss is variable-length lookbehind.

Syntax

I originally proposed new syntax designed to kill two birds with one stone, but variable-length lookbehind needs no new syntax of its own: it can use the forms already used for fixed-length lookbehind, (?<=pattern) and (?<!pattern).
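To make the distinction concrete, here is a minimal sketch. The sample string and variable names are invented for illustration. The fixed-length form works in Perl 5; the variable-length form is shown only as a comment, since the Perl 5 this RFC describes rejects it at compile time. The capture-based workaround is one common alternative, not the only one.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for this illustration.
my $text = "price: USD 42, EUR   17";

# Fixed-length lookbehind: accepted because "USD " is always 4 characters.
my @usd = $text =~ /(?<=USD )\d+/g;

# What this RFC asks for: the same syntax with a pattern of varying length.
# Left as a comment because it does not compile in the Perl 5 of this RFC:
#   my @eur = $text =~ /(?<=EUR\s+)\d+/g;

# A workaround available today: match the prefix and capture what follows.
my @eur = $text =~ /EUR\s+(\d+)/g;

print "USD: @usd, EUR: @eur\n";
```

The workaround consumes the prefix as part of the match, which is exactly what a lookbehind avoids, so the two are not interchangeable in substitutions or when matches may overlap.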

Related Features

The original proposal suggested that it would be nice if pos in list context were to return the offset into the target string where the match begins, as well as where it ends. Some people seemed to like this idea, so I am leaving it in, even though it no longer has much to do with the content of this RFC.
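For comparison, the sketch below shows how both offsets can be obtained in Perl 5 today, using pos together with the @- array. The list-context form of pos appears only as a comment: it is hypothetical, and the order of the two returned values is my assumption, since the text above does not specify it.

```perl
use strict;
use warnings;

my $target = "foo bar baz";
my ($start, $end);

if ($target =~ /bar/g) {
    $start = $-[0];          # offset where the match begins
    $end   = pos($target);   # offset where the match ends (same as $+[0])
}

# Hypothetical form suggested above. In Perl 5, pos returns a single offset,
# and the (start, end) ordering here is an assumption:
#   my ($start, $end) = pos($target);

print "match spans offsets $start to $end\n";
```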

IMPLEMENTATION

Ilya Zakharevich says it is trivial ;-) Here is part of his comment on an earlier version of this RFC:

================================================================= 

As I said it for many times: this is absolutely trivial to implement. First of all, if you agree to rewrite

 (?<= \w\s*\d )         # Semantic X: match "a  1"

as

 (?<= \d\s*\w )         # Semantic Y: match "a  1"

then it is as simple as inserting go-back-by(1) nodes before each node for \s \d and \w.

And to support the more intuitive ;-) semantic X, the only more-or-less tricky part is to recursively go through the compile tree, and put "concatenated" nodes in the opposite order. A piece of cake.

================================================================= 

The original RFC proposed syntax corresponding to Ilya's semantic Y, precisely because I thought it would be easier to implement. Several people objected that this was obscure, so I changed it to semantic X. This means that Perl, and not the programmer, would have to reverse the regexp in order to match it backwards from the start of the current match.
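The reversal Ilya describes can be sketched with a toy matcher. This is not how Perl's regex engine is built: the node representation (a one-character class plus an optional '*' quantifier) and the names lookbehind_matches and _match_back are hypothetical, and only concatenation is handled. The pattern is written in reading order (semantic X); the matcher reverses the node list and steps back one character per node.

```perl
use strict;
use warnings;

# Each node is [ one-character class, quantifier ]; the quantifier is
# '' (exactly one) or '*' (zero or more). Reading order: \w \s* \d
my @nodes = ( [ qr/\w/, '' ], [ qr/\s/, '*' ], [ qr/\d/, '' ] );

# True if the nodes match a stretch of $text ending exactly at offset $pos.
sub lookbehind_matches {
    my ($text, $pos, @nodes) = @_;
    # Put the concatenated nodes in the opposite order, then match backwards.
    return _match_back($text, $pos, reverse @nodes);
}

sub _match_back {
    my ($text, $pos, @rev) = @_;
    return 1 unless @rev;                 # every node consumed: success
    my ($class, $quant) = @{ shift @rev };

    if ($quant eq '*') {
        # Greedy: step back over as many characters as the class allows,
        # then give characters back one at a time if the rest fails.
        my $p = $pos;
        $p-- while $p > 0 && substr($text, $p - 1, 1) =~ $class;
        for (my $q = $p; $q <= $pos; $q++) {
            return 1 if _match_back($text, $q, @rev);
        }
        return 0;
    }

    # Single-character node: go back by one and test that character.
    return 0 unless $pos > 0 && substr($text, $pos - 1, 1) =~ $class;
    return _match_back($text, $pos - 1, @rev);
}

# Offset 4 is just before the "b", so "a  1" lies immediately behind it.
my $text = "a  1b";
print lookbehind_matches($text, 4, @nodes) ? "match\n" : "no match\n";
```

Under semantic Y the call to reverse would be dropped, and the programmer would have to supply the nodes already reversed.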

One thing that should be avoided is to implement lookbehind by starting at offset 0 in the target string and simply matching forwards from there. As Hugo has pointed out, this would throw away the work of the optimizer and lead to quadratic behaviour. This is why variable-length lookbehind has to approach the match in reverse, starting from the current offset of the start of match.
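The cost difference can be illustrated with a hand-rolled model. It is a sketch only: it counts character inspections for the hypothetical lookbehind (?<=\d\s*), it does not measure Perl's actual regex engine, and the function names are invented. Strategy A tries every start offset from 0 and matches forwards; strategy B starts at the current offset and walks backwards.

```perl
use strict;
use warnings;

# Lookbehind modelled: a digit, then any amount of whitespace, ending at $pos.

# Strategy A: try each start offset from 0 and match forwards.
sub forward_from_zero {
    my ($text, $pos, $count) = @_;
    for my $start (0 .. $pos - 1) {
        $$count++;                                  # inspect text[$start]
        next unless substr($text, $start, 1) =~ /\d/;
        my $i = $start + 1;
        while ($i < $pos) {
            $$count++;                              # inspect text[$i]
            last unless substr($text, $i, 1) =~ /\s/;
            $i++;
        }
        return 1 if $i == $pos;
    }
    return 0;
}

# Strategy B: start at $pos and walk backwards.
sub backward_from_pos {
    my ($text, $pos, $count) = @_;
    my $i = $pos;
    while ($i > 0) {
        $$count++;                                  # inspect text[$i - 1]
        my $ch = substr($text, $i - 1, 1);
        return 1 if $ch =~ /\d/;
        return 0 unless $ch =~ /\s/;
        $i--;
    }
    return 0;
}

# Test the assertion at every offset of a string in which it never succeeds.
my $text = "x" x 200;
my ($fwd, $back) = (0, 0);
for my $pos (0 .. length($text)) {
    forward_from_zero($text, $pos, \$fwd);
    backward_from_pos($text, $pos, \$back);
}
print "forward: $fwd inspections, backward: $back inspections\n";
```

On this input the forward strategy inspects 20100 characters in total (0 + 1 + ... + 200) and the backward strategy inspects 200, which is the quadratic-versus-linear gap described above.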

It would be acceptable if, for simplicity of implementation, variable-length lookbehind were restricted to a subset of regular expression syntax, since even that would be an improvement over what we have now.

REFERENCES

RFC 317: Access to optimisation information for regular expressions