Implementation of hash iterators
Maintainer: Tom Hughes <tom@compton.nu>
Date: 20 Aug 2000
Last Modified: 28 Sep 2000
Mailing List: perl6-internals@perl.org
Number: 136
Version: 3
Status: Frozen
Perl 5 makes very limited use of iterators internally. It has long been suggested that certain common idioms could be implemented much more efficiently using iterators, and this RFC suggests some possible mechanisms for achieving that goal.
In Perl 5 every hash has exactly one iterator, which the programmer can use explicitly via the each function and which perl also uses implicitly to implement the keys and values functions.
This is confusing and dangerous, as you can easily get an action-at-a-distance effect whereby using each, keys or values in a function affects an iteration being performed by the caller.
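A short Perl 5 demonstration of that effect (the hash contents and the guard counter are mine, purely for illustration): calling keys inside an each loop resets the one shared iterator, so without the guard the loop would never terminate.

  use strict;
  use warnings;

  my %h = (a => 1, b => 2, c => 3);

  # each, keys and values all share the hash's single iterator, so the
  # call to keys below resets the iteration the while loop depends on.
  my $passes = 0;
  while (my ($k, $v) = each %h) {
      print "$k => $v\n";
      my $count = keys %h;        # side effect: resets the shared iterator
      last if ++$passes > 10;     # guard against the resulting endless loop
  }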
In Perl 6 the keys and values functions should no longer use the same iterator as the each function - each use of keys or values should get its own private iterator instead.
In fact those iterators should probably be lazy so that a piece of code like this:
foreach (keys %x) { ... }
does not immediately iterate over the whole hash, pushing copies of all the keys onto the stack, but instead creates an iterator object of some sort that can be passed to the loop op, which then pulls off one value for each pass of the loop.
This is significantly more efficient for large hashes, especially if the loop exits early.
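As a rough Perl 5 sketch of the shape such an iterator might take (lazy_keys is an invented name, and this sketch still leans on the single per-hash iterator Perl 5 already has; the real mechanism would be an internal iterator object handed directly to the loop op):

  # Sketch only: a lazy key iterator that pulls one key per call.
  sub lazy_keys {
      my ($hash) = @_;
      my $reset = keys %$hash;        # calling keys resets the hash's iterator
      return sub {
          my $k = each %$hash;        # one key per call, undef when exhausted
          return $k;
      };
  }

  my %x = (one => 1, two => 2, three => 3);
  my $next = lazy_keys(\%x);
  while (defined(my $k = $next->())) {
      print "$k\n";
      last if $k eq 'two';            # exiting early never visits the rest
  }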
The only problem with this system is that the current implementation of keys and values causes the result to be frozen, so that changes made to the hash in the loop body do not affect the values being iterated over.
I believe this problem can be solved by using the vtable for the hash to wrap any mutating functions with a completion routine that will advance the iterator to completion, creating a temporary list of copied keys/values that it can then continue to iterate over.
Obviously, once this completion was done, the wrapper routine would call the original vtable routine to update the hash. It would also remove itself from the vtable, so you would only pay the cost of the wrapper on the first change to the hash. Equally, you would only pay for building the temporary list when it was necessary - i.e. when you change the hash inside the loop.
For keys the iterator would need to trap insertions and deletions; for values it would also need to trap modifications of existing values.
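A Perl-level sketch of the snapshot idea (the class and method names are my own invention; in the proposal itself the wrapping happens in the hash's vtable inside the core, not in Perl code):

  package KeyIterator;
  use strict;
  use warnings;

  sub new {
      my ($class, $hash) = @_;
      my $reset = keys %$hash;              # reset the underlying iterator
      return bless { hash => $hash, snapshot => undef }, $class;
  }

  # The wrapped vtable entry would call this on the first mutation, then
  # restore the original entry so later changes cost nothing extra.
  sub freeze {
      my ($self) = @_;
      return if $self->{snapshot};
      my @rest;
      while (defined(my $k = each %{ $self->{hash} })) {
          push @rest, $k;
      }
      $self->{snapshot} = \@rest;           # iterate over copies from now on
  }

  sub next_key {
      my ($self) = @_;
      if (my $snap = $self->{snapshot}) {
          return @$snap ? shift @$snap : undef;
      }
      return scalar each %{ $self->{hash} };
  }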
Because this mechanism sits above the main hash implementation it should work equally well for builtin hashes and for things like tied hashes, whose implementation details are unknown to the core.
The current tied hash interface is only capable of supporting a single iterator via the FIRSTKEY and NEXTKEY functions.
In order to support the enhanced iteration described above, it is suggested that the tied hash interface be extended, but the precise details of that are a matter for a language RFC rather than for this document.
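For reference, this is roughly what the current interface looks like from Perl (a minimal tied hash in the style of Tie::StdHash; the package name is invented). The point to note is that FIRSTKEY and NEXTKEY between them expose exactly one piece of iteration state:

  package Tie::OneIterator;        # invented name, for illustration only
  use strict;
  use warnings;

  sub TIEHASH { bless {}, $_[0] }
  sub STORE   { $_[0]{$_[1]} = $_[2] }
  sub FETCH   { $_[0]{$_[1]} }
  sub EXISTS  { exists $_[0]{$_[1]} }
  sub DELETE  { delete $_[0]{$_[1]} }
  sub CLEAR   { %{$_[0]} = () }

  # Both methods drive the one built-in iterator of the underlying hash,
  # so two callers iterating the tied hash at once would trample each other.
  sub FIRSTKEY { my $reset = keys %{$_[0]}; each %{$_[0]} }
  sub NEXTKEY  { each %{$_[0]} }

  package main;
  tie my %h, 'Tie::OneIterator';
  %h = (a => 1, b => 2, c => 3);
  print "$_\n" for keys %h;        # keys, values and each all funnel
                                   # through FIRSTKEY/NEXTKEY above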
Certain underlying structures with which tied hashes are used (for example some types of dbm file) only support a single iterator, which makes supporting an enhanced tied hash interface difficult.
It is, however, possible to work around this limitation using some form of buffering. This could be done either by the tied hash implementation or, more helpfully, at the Perl level if the tied hash indicated that it could only support a single iterator. There are basically three ways that this buffering could work:
The first approach requires that, when an iterator is created for a hash which only supports single iteration, the underlying iterator be reset and then advanced to completion, creating a list of values which is saved and then iterated over by the Perl-level iterator.
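A minimal Perl sketch of this first approach (make_buffered_iterator is an invented name; keys and each on a tied hash reach the underlying implementation through FIRSTKEY/NEXTKEY):

  # Reset the single underlying iterator, run it to completion saving
  # the keys, then iterate over the saved copy.
  sub make_buffered_iterator {
      my ($hash) = @_;
      my $reset = keys %$hash;                  # reset the only iterator
      my @saved;
      while (defined(my $k = each %$hash)) {
          push @saved, $k;
      }
      return sub { @saved ? shift @saved : undef };
  }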
The second form of buffering works by having the first iterator save each value as it iterates. If a second iterator is then required, it can begin by traversing the saved values and then, if necessary, move on to use the underlying iterator - the most advanced of the iterators is responsible for reading and saving values from the underlying one.
The buffered values can be discarded whenever the number of iterators in use falls to zero.
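A sketch of how the second approach might look in Perl (all names invented): every iterator over a given hash shares one buffer of the keys seen so far, whichever iterator is furthest ahead appends to it, and the buffer is dropped when the last iterator goes away.

  package SharedBufferIter;
  use strict;
  use warnings;

  my %state;   # per hash: { buffer => [...], done => 0|1, count => N }

  sub new {
      my ($class, $hash) = @_;
      my $s = $state{$hash};
      if (!$s) {
          my $reset = keys %$hash;              # reset underlying iterator once
          $s = $state{$hash} = { buffer => [], done => 0, count => 0 };
      }
      $s->{count}++;
      return bless { hash => $hash, state => $s, pos => 0 }, $class;
  }

  sub next_key {
      my ($self) = @_;
      my $s = $self->{state};
      if ($self->{pos} < @{ $s->{buffer} }) {   # replay keys already saved
          return $s->{buffer}[ $self->{pos}++ ];
      }
      return undef if $s->{done};
      my $k = each %{ $self->{hash} };          # we are the most advanced
      if (!defined $k) {                        # iterator, so pull the next
          $s->{done} = 1;                       # key and save it for the rest
          return undef;
      }
      push @{ $s->{buffer} }, $k;
      $self->{pos}++;
      return $k;
  }

  sub DESTROY {
      my ($self) = @_;
      my $s = $self->{state};
      delete $state{ $self->{hash} } if --$s->{count} == 0;   # drop the buffer
  }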
The final form of buffering works by having the first iterator work as normal, retrieving and returning values from the real underlying iterator, until such time as a second iterator is created.
At this point the underlying iterator is advanced to the end and the resulting values are cached. The second logical iterator can then reset the underlying iterator and start to use it, while the original logical iterator continues by using the buffered values.
It should be possible to share the buffering of the tail values between multiple iterators as appropriate and to destroy the buffer only when all the iterators are destroyed.
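A sketch of the third approach (again with invented names): the first iterator drives the underlying iterator directly and costs nothing; only when a second iterator appears is the remaining tail drained into a buffer for the earlier one, after which the underlying iterator is reset and handed to the newcomer.

  package TailBufferIter;
  use strict;
  use warnings;

  # $current is whichever iterator currently owns the underlying iterator,
  # or undef if there is none; a real implementation would track this in
  # the hash itself and could share one tail buffer between iterators.
  sub new {
      my ($class, $hash, $current) = @_;
      if ($current && !$current->{tail}) {
          my @tail;                                 # displace the current owner
          while (defined(my $k = each %$hash)) {    # onto a buffered copy of
              push @tail, $k;                       # its remaining keys
          }
          $current->{tail} = \@tail;
      }
      my $reset = keys %$hash;                      # reset the underlying
      return bless { hash => $hash, tail => undef }, $class;   # iterator for us
  }

  sub next_key {
      my ($self) = @_;
      if (my $tail = $self->{tail}) {               # we were displaced: replay
          return @$tail ? shift @$tail : undef;     # the buffered tail
      }
      return scalar each %{ $self->{hash} };        # we still own the real one
  }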
Overall I believe that the third of these provides the best results - it is free until a second iterator is created, and then buffers only as much as is necessary to support the goal of multiple iteration.
References

RFC 123: Builtin: lazy

perlfunc manpage for discussion of each, keys and values