[% setvar title Consolidate the $1 and C<\1> notations %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/


Consolidate the $1 and \1 notations


  Maintainer: David Storrs <dstorrs@dstorrs.com>
  Date: 28 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: perl6-language-regex@perl.org
  Number: 331
  Version: 2
  Status: Frozen


Currently, \1 and $1 have only slightly different meanings within a regex. It is possible to consolidate them without losing any functionality and, in the process, we gain intuitiveness.


v1->v2: A major rewrite:


Note: For convenience, I am going to talk about \1 and $1 in this RFC. In actuality, these notations extend indefinitely: \1..\n and $1..$n. Take it as read that anything which applies to $1 also applies to $2, $3, etc.

The Problem

In current versions of Perl, \1 and $1 mean different things. Specifically, \1 means "whatever was matched by the first set of grouping parens in this regex match." $1 means "whatever was matched by the first set of grouping parens in the previously-run regex match." For example, take the patterns /(foo)_$1_bar/ and /(foo)_\1_bar/:

The second will match 'foo_foo_bar', while the first will match 'foo_[SOMETHING]_bar', where [SOMETHING] is whatever was captured in the previous match. That match could be a long, long way away, possibly even in some module that you didn't realize you were including (because it was included by a module that was included by a module that was included by a...).
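The difference can be checked directly in Perl 5. The following is only a sketch: the sub name compare_notations and the strings 'prev' and 'foo_prev_bar' are invented for illustration, and ${1} is written with braces purely to make the end of the variable name obvious.

```perl
use strict;
use warnings;

# Returns three 0/1 flags showing how Perl 5 treats $1 and \1 inside a pattern.
sub compare_notations {
    # An earlier, unrelated match leaves $1 set to 'prev'.
    'prev' =~ /(prev)/ or die 'setup match failed';

    # $1 is interpolated before matching, so this pattern is really
    # /(foo)_prev_bar/ and does not match 'foo_foo_bar'.
    my $dollar_on_foo = ('foo_foo_bar' =~ /(foo)_${1}_bar/) ? 1 : 0;

    # The failed match left $1 untouched, so the same pattern text
    # matches a string that contains the stale capture.
    my $dollar_on_prev = ('foo_prev_bar' =~ /(foo)_${1}_bar/) ? 1 : 0;

    # \1 refers to group 1 of this very match.
    my $backslash_on_foo = ('foo_foo_bar' =~ /(foo)_\1_bar/) ? 1 : 0;

    return ($dollar_on_foo, $dollar_on_prev, $backslash_on_foo);
}

my @flags = compare_notations();
print "@flags\n";    # prints "0 1 1"
```

The order of the three matches matters here: each successful match overwrites $1, so the \1 case is run last.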

The primary reason for this distinction is s///, in which the left hand side is a pattern while the right hand side is a string (assuming no 'e' modifier). Consider, for instance, s/(foo)$1/$1bar/ and s/(foo)\1/$1bar/.

Note that, in the first example, the two $1s refer to different things, whereas in the second example, $1 and \1 refer to the same thing. This is counterintuitive and non-Perlish; Perl should be intuitive and DWIMish.
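A small Perl 5 sketch of this asymmetry follows. The sub name substitute_both_ways and the strings 'zip' and 'foozip' are assumptions made up for the demonstration; they do not come from the RFC.

```perl
use strict;
use warnings;

# Runs one substitution with $1 on the left side and one with \1,
# and returns both resulting strings.
sub substitute_both_ways {
    # An earlier match leaves $1 set to 'zip'.
    'zip' =~ /(zip)/ or die 'setup match failed';

    my $first = 'foozip';
    # Left side: $1 is interpolated as the stale 'zip', giving /(foo)zip/.
    # Right side: $1 is the fresh capture 'foo'.
    $first =~ s/(foo)${1}/${1}bar/;

    my $second = 'foofoo';
    # Left side: \1 is the current group 1. Right side: $1 is that same capture.
    $second =~ s/(foo)\1/${1}bar/;

    return ($first, $second);
}

my ($first, $second) = substitute_both_ways();
print "$first $second\n";    # prints "foobar foobar"
```

Both calls end up producing 'foobar', but the first one only works because the input happens to contain the previous capture.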

A separate, though less important, problem with the way backreferences are currently implemented is that it is difficult for a human to tell at a glance whether \10 means "escape character 10" (an octal character escape) or "backreference 10". The only way to tell is to count the sets of capturing parens: if there actually are ten of them, \10 is a backreference; otherwise it is an escape character. In general, this isn't a problem because most patterns don't have ten sets of capturing parens.
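As a rough illustration of the \10 ambiguity in Perl 5, the sketch below runs the same escape in a one-group pattern and in a ten-group pattern. The sub name classify_backslash_ten and the test strings are hypothetical; the behaviour relied on is the rule documented in perlre that \10 is a backreference only when at least ten groups have been opened before it.

```perl
use strict;
use warnings;

# Returns three 0/1 flags showing how \10 is read depending on group count.
sub classify_backslash_ten {
    # One capturing group: \10 is read as octal 010, the backspace character.
    my $one_group = ("a\x08" =~ /(a)\10/) ? 1 : 0;

    # Ten capturing groups: \10 is a backreference to group 10 ('j').
    my $ten_groups =
        ('abcdefghijj' =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/) ? 1 : 0;

    # Same ten-group pattern against a trailing backspace: no match,
    # because \10 no longer means backspace here.
    my $ten_groups_backspace =
        ("abcdefghij\x08" =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/) ? 1 : 0;

    return ($one_group, $ten_groups, $ten_groups_backspace);
}

print join(' ', classify_backslash_ten()), "\n";
```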

The Solution

Ok, so the problem is that $1 and \1 are counterintuitive. How do we make them intuitive without losing any functionality?

First, let's get rid of the \1 form for backreferences.

Second, let's say that $n refers to the nth captured subelement of the pattern match which occurred in this statement--note that this is distinct from "in this pattern match." That means that, in s/(foo)$1/$1bar/, both $1s refer to the same thing (the string 'foo'), even though one of them occurred inside a pattern and one occurred inside a string. (See note [1] in the IMPLEMENTATION section.)

Third, let's create a new special variable, @/ (mnemonic: the / is the default delimiter for a pattern match; if the English module remains extant, then @/ could have the long name of @LAST_MATCH, but there are currently several threads concerning removal of the English module). Much like the current $1, $2... variables, this array will only be created (and hence the speed price will only be paid) if you access its members. The 0th element of @/ will contain the qr()d form of the last pattern match, while successive elements refer to the captured subelements.
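To make the proposed layout of @/ concrete, here is a Perl 5 emulation. It is a sketch under stated assumptions, not the proposed implementation: last_match_array is a hypothetical helper, it returns an ordinary list rather than a special variable, and it builds that list eagerly instead of on first access or at the end of the statement.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list shaped like the proposed @/,
# i.e. (qr// form of the pattern, capture 1, capture 2, ...).
# Returns an empty list if the match fails.
sub last_match_array {
    my ($string, $pattern) = @_;
    my $compiled = qr/$pattern/;
    return () unless $string =~ $compiled;

    # @- and @+ hold the start and end offsets of each group
    # in the last successful match.
    my @captures;
    for my $n (1 .. $#+) {
        push @captures,
            defined($-[$n])
            ? substr($string, $-[$n], $+[$n] - $-[$n])
            : undef;
    }
    return ($compiled, @captures);
}

my @slash = last_match_array('foo_bar', '(foo)_(bar)');
print "$slash[1] $slash[2]\n";    # prints "foo bar"
```

Element 0 is a compiled regex object, so it can be reused directly in a later match, which is the role the 0th element of @/ would play.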

Fourth, let's change when we update the variables which store the captures (the current $1, $2, etc). @/ will only be updated when the entire statement which contains a pattern match has finished running (e.g., when the entire s/// is completed), rather than as soon as the pattern match is done (and therefore before the substitution happens).

Some Examples