[% setvar title Split Scalars and Objects/References into Two Types %]
Note: these documents may be out of date. Do not use as reference! |
To see what is currently happening visit http://www.perl6.org/
Split Scalars and Objects/References into Two Types
Maintainer: Nathan Wiger <nate@wiger.org> Date: 24 Aug 2000 Last Modified: 25 Aug 2000 Mailing List: perl6-language@perl.org Number: 147 Version: 2 Status: Retracted
Sometimes, you start down a train of thought, and you wind up with something that you think is a beautiful idea.
But later on it turns out that, in reality, it sucks. Bad. REALLY Bad.
This is one of those ideas. There's so many problems with it I don't know where to begin. It has some interesting features, but not enough to destroy Perl for.
Anyways, there's much better solutions to the issues. Check out the discussions on -objects about RFCs 159 and 161 in particular. Feel free to read this RFC if you want to become violently ill.
Many RFC's have been proposed dealing with Perl's current prefixing system. Many have criticized it, even suggesting to drop prefixes altogether, but the vast majority of these criticisms center around one thing: Not being able to tell the difference between "true" scalars and references/objects.
The answer to this has frequently been that the $scalar type defines not a quality, but a quantity: one. While true in one sense, Perl does have both @arrays and %hashes, notations which do the opposite: define a quality. While they are both composed of multiple things, those things are fundamentally different, and hence the need for two notations.
References and true scalars ("scalars" henceforth) are also fundamentally different things. Therefore, this RFC proposes that they be split into two separate types in Perl 6: $scalars and *references.
Please take the time to read this. It is a thorough document that explores both the pros AND cons of this approach.
Please note that the * is used as the reference prefix in this document. Nonetheless, this proposal has NOTHING to do with typeglobs, and using the * prefix would require typeglobs to disappear. The prefix is debatable, let's focus on the idea.
Everyone on this list should be familiar with the problem. You can't tell scalars and references apart by looking at them. They are completely ambiguous. Consider:
$stuff{key} # hash value $stuff[0] # array value $stuff # scalar or reference? %stuff # hash @stuff # array $stuff # scalar or reference?
You get the idea. In simple examples, this ambiguity goes away because of the way you address them:
$stuff # scalar or reference? $stuff->{key} # hash reference $stuff->[0] # array reference $stuff->func # object reference
Once we use them, we recognize that the last three are references. However, there are many situations when we can't disambiguate them:
$stuff = Class->method # scalar or reference? read_data($stuff) # scalar or reference? $$stuff # scalar or reference?
Good variable names help, but the fact remains that these are fundamentally different types. For example, you know what will happen when you do this:
print @stuff # the contents of @stuff while(($k,$v) = each %h)) # each key/value of hash %h
But what about these examples?
$z = $x + $y # numeric addition? $stuff ||= "Bob" # is this correct? print ref $stuff # sure about that?
Anyone who has worked on a large project in Perl will relate to the fact that even this code block is difficult to read unless you know exactly what the functions are supposed to return:
$c = Class->new; # ok (maybe) so far... ($u, $g) = $c->get_user; ($d, $f, $j, $m) = $c->read_data; # ... lots of code passes ... print "Thanks $u for your ", $g->purchase; $member = $d || $CLASS_DEFAULT; # scalars or ???? while (($k,$v) = each %$f) { print "Now updating info for $k...\n"; push @$j, @{ $v->{'personal'} }; $name = $member->name($$v->{name}->{first}) || "unknown"; } print "Your confirmation number is $m\n";
Here, we've lost the advantage of prefixes. While there are lots of $'s in the code, they don't tell us anything useful other than "there's a single something here". [1]
However, note that the @ and % still do, because they indicate quality in addition to quantity.
This RFC proposes that references and scalars be split into two distinct types. This is not a novel approach. It is also not a simple one. There are many issues to address.
However, the dividing line is clear:
Type Definition ----------------- ---------------------------------- $scalar Holds a single simple value, such as a string or number *reference Holds a single complex value, such as a hash reference or object
Like %hashes versus @arrays, these two new types have the advantage that they can describe both a quantity and a quality. RFC 23, "Higher order functions" adds a new variable prefix, ^, because it introduces a fundamental new type: the placeholder. The same idea should apply here as well.
Consider the above sample code now:
*c = Class->new; ($u, *g) = *c->get_user; (*d, *f, *j, $m) = *c->read_data; # ... lots of code passes ... print "Thanks $u for your ", *g->purchase; *member = *d || *CLASS_DEFAULT; # oh, they're refs! while (($k,*v) = each %*f) { print "Now updating info for $k...\n"; push @*j, @{$*v{personal}}; $name = *member->name($*v{name}{first}) || "unknown"; } print "Your confirmation number is $m\n";
We now have a much better idea of what's going on. For one thing, we can instantly tell the difference between our $scalar assignments and our *reference assignments. We can now be 100% sure that $name really holds a scalar, and *c really holds a reference of some type. [2]
Moreover, with the complete syntax described below, we can actually tell our objects and references apart more easily than with the current "all refs use ->" approach.
Basically, all references and objects will be prefixed with a *. The single * will take the place of the current $-> combination syntax required for references. All scalars will continue to be prefixed with a $.
This table should help illustrate the new syntax:
Perl 5 Perl 6 ------------------------ -------------------------- $x = 5; $x = 5; $name = "Nate"; $name = "Nate"; $_ = 1; $_ = 1; $_->func; *_->func; # new *_ anon ref # References $n = \$name; *n = $name; # \ redundant print $$n; print $*n; $hash{key} = $val; $hash{key} = $val; $hash = \%hash; *hash = %hash; $hash->{key} $*hash{key} # -> redundant $array[3] = "Nate"; $array[3] = "Nate"; $aref = \@array; *aref = @array; $aref->[0] $*aref[0] # -> redundant $aref->[0][1] $*aref[0][1] @{$aref->[0][1]} @{*aref[0][1]} # Objects -- keep -> to make objects stand out $c = Class->func; *c = Class->func; [no equivalent] $c = Class->func; $r = $c || $DEFAULT; *r = *c || *DEFAULT; print $c->name; print *c->name; # $*c->name implied print @{$c->name}; print @{*c->name}; # Combinations $val = $c->func || $D->{VAL}; $val = *c->func || $*D{VAL}; $name = $h->{name} || $a->[3][1]; *name = *h{name} || *a[3][1]
Notice this buys us another key advantage. Since the * delimits references, we no longer are forced to use -> in all references. As such, -> can now be reserved solely for objects, allowing us to easily locate our objects now.
The exact syntax is still open for discussion. There are many more clarifying examples below as well.
There are many advantages to this approach. There are also potential problems, which will be addressed second.
As mentioned above, this is the key advantage to this approach. Not only is the code clearer, but we can now actually tell what we want from a function without complex casting or syntax rules:
*object = date; # object with accessor functions $scalar = date; # scalar ctime date string
Normally, the date() function would either have to choose what to return, or would require specific casting in order to return the correct thing. And, once it's returned, later in the code you still can't easily tell what you got. Differentiating between *references and $scalars allows us to do both.
Plus, as the table above shows, many references (such as multidimensional array references) become much easier to read.
Currently, the new want() proposed by RFC 21 relies on odd syntax to differentiate between hashrefs and arrayrefs:
$arrayref = @{func()}; $hashref = %{func()};
While this works for simple situations, it requires some contortions in complex situations:
$hashref = %{func() || $def || $r->getdata()};
By differentiating our references with this new syntax, we can move this to the left side of the assignment, which is where it belongs:
@*arrayref = func(); %*hashref = func(); %*hashref = func() || *def || *r->getdata();
The code is cleaner and easier to read, and can more more easily span multiple lines. [4] When integrated with the previous point, we can now easily tell apart objects and refs even in assignment contexts.
Currently, if the leftmost prefix is a $, the actual context is still ambiguous:
print $$$$name; # ref or scalar? $d = $$$$name; # ref or scalar?
However, with our new syntax, $ is only used to delimit an actualy scalar. So:
print $***name; # scalar *d = ****name; # ref
This is also extended to arrays and hashes:
@***name # array %***name # hash
Note that this syntax is less confusing than this:
@$$$name %$$$name
Where you have to infer that $$$name is a ref. The *, by contrast, tells us this explicitly.
Currently, if you're reading a complex type in Perl you have to scan left to right to figure out it's a ref:
@{ $r->{$key} }
You first see the @, which tells you it's an array. However, the next thing you see is a $, which leads you to believe it's a scalar, until you see the ->. In this simple isolated example, it's not too bad. But bury this in a chunk of code, and it's very hard to read and use.
Contrast this with the new syntax:
@{*r{$key}}
Not only is this more compact, but by simply reading left to right you can determine it's an array of a ref to a hash.
Currently, you can get to an individual element of an array or hash with the syntax:
$array[0] # first scalar or ref of @array $hash{key} # scalar or ref indexed by "key"
Differentiating the two would require that the prefix also reflect the return value:
$array[0] # first element (scalar) of @array *array[0] # first element (ref) of @array $hash{key} # "key"-indexed element (scalar) of %hash *hash{key} # "key"-indexed element (ref) of %hash
The structure of @array and %hash would remain identical, in that each slot could only hold one element (so $array[0] and *array[0] would point to the same data, only "casting" them differently). However, you'd have to make sure that you called them with the correct prefix:
my @array; # ok *array[0] = new Class; # ok $array[0]->func(); # wrong *array[0]->func(); # ok $stuff = *array[0]; # wrong $stuff = $array[0]; # ok (but empty)
This actually gives us a key advantage. Here, Perl can now return the value if it's the correct type, or undef otherwise. As such:
$array[3] = "Nate"; # ok print $array[3]; # ok *array[3]->func(); # *array[3] returns undef
So, * is acting a little like an implicit ref call. This means that we'll be guaranteed to be getting what we want without nasty ref checks all over the place. It also means that our error messages can be clearer:
array[3] is a scalar value, not a reference at line 3.
Note that this approach is analagous from a user level to how hashes and arrays work as well:
$array[3] = "Nate"; # ok print $array[3]; # ok print $array{3}; # undef
Just like [] and {} are used to get at different types of values, so too are $ and * now used to get to different types of values.
RFC 49, "Objects should have builtin stringifying STRING method", and RFC 73, "All Perl core functions should return objects", together describe an interface which could allow easily embedded core objects with true polymorphic capabilities.
For example, using these methods the date() function would not have to choose what to return in a scalar context. Instead, an object would be returned which could morph into a "true" scalar value as needed. This approach is quite powerful in embedding objects at a low-level.
However, there are some problems. For one thing, sometimes you really want an actual scalar:
$x = Math->calc; # create $x object $z = $x * $y; # cast via $x->NUMBER to scalar $m = $x; # object reassignment (oops?)
If $x is a polymorphic object, there are a couple of disadvantages:
1. There's a lot of overhead in maintaining objects. 2. Lines 2 and 3 work completely differently, which is counterintuitive. 3. You still can't tell what's an object and what's a scalar by looking at the code.
By differentiating the types, however, we can explictly ask for what we want to get back:
*x = Math->calc; # same function overloaded to $x = Math->calc; # return both $z = $x * $y; # true scalar math *m = *x; # true reference assignment
Now it's obvious what's going on. We know what we're getting, we know what we are passing around, and there's no wasted overhead for passing objects around if we don't need to.
Perl prides itself on being a human-friendly language. Indeed, it is. And fun! Mostly.
However, trying to figure out what's a scalar and what's an object just by looking at a whole bunch of $'s is not fun. And let's face it, most people regard single values and references to be totally different things. And they are, just like arrays and hashes are totally different.
Many Perl programmers I've taught are quite surprised the first (and second, and third, and fourth) time they see something like this:
$cgi = new CGI || $oldcgi; $user = $cgi->param($name) || $ENV{REMOTE_USER}; print "Welcome $user, to " . $cgi->server_name;
This is confusing. Only $scalars can play such double-duty. %hashes and @arrays can only hold hashes and arrays. But Perl 5 $scalars can currently hold values or references, giving Perl 5 OO a "bolted on the side" appearance. [3]
While we have become accustomed to looking at it, the simple fact is it's confusing. The same "type" is being used for things with different qualities, and it's just as confusing as if arrays and hashes had the same prefix.
Now, consider this code:
*cgi = new CGI || *oldcgi; $user = *cgi->param($name) || $ENV{REMOTE_USER}; print "Welcome $user, to " . *cgi->server_name;
This is less confusing. It is now apparent where our *objects are and where our $scalars are. The code is more easily readable and can also be more specific in what types it is requesting. You can instantly pick out your *objects and $scalars at a glance.
This is a simple one, but should not be overlooked. With a separate * type, we now have to ability to have both $name and *name, where $name could be the actual value and *name could be a reference to $name. Increases in namespace are basically always beneficial.
This approach is not without its problems. These need to be carefully addressed before this can be implemented.
Consider an array with mixed references and scalars:
@mixed = qw($x $y *cgi $z *apache);
A Perl 5 loop would do something like this (remember all the prefixes will be $ unlike what's shown above):
for ( @mixed ) { if ( ref $_ ) { $name .= $_->func; } else { $sum += $_; } }
Now consider the Perl 6 for loop over it, where the prefixes are mixed
for ( @mixed ) { if ( $_ ) { # uh-oh # ... }
The problem is that the builtin anonymous scalar $_ can now only handle scalars, since we've separated the types. Luckily, we have already added a new anonymous reference, *_. As such, our for loop can now obey the following rules:
1. $_ and *_ are initialized to undef each loop 2. only the appropriate one is set depending on whether the element is a scalar or reference
So, our for loop becomes:
for ( @mixed ) { if ( *_ ) { $name .= *_->func; } else { $sum += $_; } }
Note that the length of the code is exactly the same, and that the test no longer requires a ref() call. In this specific example, I would argue this could be seen as a benefit, because you can use it to only address certain elements of the array:
for ( @mixed ) { $sum += $_ if $_; # only address true scalars }
This will implicitly only get true scalars, since all the references will be placed in *_ (and the loop ignores them).
However, the situation gets ugly when you want to name your loop parameters:
for my $line ( @mixed ) { # uh-oh }
Now, we've explicitly told the for loop to use $line instead of $_. But this can only hold scalars, not references. There are two possibile solutions:
1. Put the scalars in $line, but the references in *_ still (and vice versa if they had said "my *line"). 2. Automatically convert the name of $line to *line if necessary. 3. Do #1, but require them to use a HOFN if they want automatic conversion (i.e., "^line")
None of these is a perfect solution. This is an issue that needs further examination. Note that an analagous problem exists with mixed hashes.
Differentiating scalars and references gives us many key advantages:
1. Ability to differentiate types in all contexts 2. More compact, readable, and flexible syntax 3. Leftmost $ tells true context 4. Implicit type checking 5. Better cognitive properties 6. Increased namespace
Initially, I started writing this RFC as an idea that should be at least considered, but not necessarily one that I thought would really benefit Perl 6.
However, I have since completely changed my mind. As I went through the process, I found that there were many, many deficiencies with the current "stick everything in $" philosophy of Perl 5. I believe that this proposal could benefit Perl 6 tremendously.
If this gets accepted, it will require a massive rewrite of Perl's internals. Thank heavens we're doing that anyways. :-)
[1] The "single" part isn't true, either, since $arrayrefs and $hashrefs in fact point to multiple values. Some will counter with the idea that $arrayrefs and $hashrefs actually point to "single" arrays and hashes. However, this is a stretch in the author's opinion.
[2] Opponents might poke at this, claiming that you only know * represents a reference "of some type". What next, individual types for all the different numbers, filehandles, and so forth? However, this is a slippery slope fallacy. Numbers and strings are types of scalars, just like objects and filehandles are types of references.
[3] Andy Wardley came up with this one. :-)
[4] Note that this does not have to replace the syntax in RFC 21, but can serve as an alternative form.
The p52p6 translator would have to convert all Perl 5 reference and object notations into the new Perl 6 syntax. This is not actually as hard as it sounds since a few regexp's would do the trick for references. Objects are trickier and would require some backtracking through the code.
RFC 49: Objects should have builtin stringifying STRING method
RFC 73: All Perl core functions should return objects
RFC 9: Highlander Variable Types
RFC 109: Less line noise - let's get rid of @%
RFC 95: Object Classes
Upcoming RFC by brian d foy and myself on polymorphic objects
RFC 21: Replace wantarray
with a generic want
function
RFC 23: Higher order functions