[% setvar title Efficient numerics with perl %]
Note: these documents may be out of date. Do not use as reference! |
To see what is currently happening visit http://www.perl6.org/
Efficient numerics with perl
Maintainer: pdl-porters team <pdl-porters@jach.hawaii.edu> Date: 16 Aug 2000 Last Modified: 30 Sep 2000 Mailing List: perl6-language-data@perl.org Number: 116 Version: 3 Status: Frozen
Not much happened. Most seemed to be dumbfounded by the density of the description. What we really need is a good manual or book on PDL as a reference. Short of this we have this RFC :(.
At the very least it should help to point out that good support for numerical functionality is more than just matrix multiplication on multidim arrays.
The RFCs on multi-dimensional arrays for Perl6 address some of the issues discussed in this RFC.
This RFC describes basic requirements of a numerical object/variable type for perl. A reference implementation (that could almost certainly do with some syntactical sweetening) is provided. It is currently available as the PDL distribution. A few key concepts of the reference implementation will be described. Any efforts to get numerical support into the perl core should provide means to either easily hook a numerical core into perl to achieve this or supply the features at the core design level. Current implementation aspects are sketched.
This RFC is not really asking for any feature in particular but rather a description of existing PDL features and implementation aspects to show what's possible and desired in a numerical module.
It is also the pdl-porters' contribution to overstuff everybodies brains [1].
PDL provides an object class to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing.
Apart from the object representation itself PDL implements three key concepts which its users have found critical for efficient and enjoyable numerical computing with perl:
The basic data object of PDL is a zero- to multidimensional array of numbers, all of the same data type (for efficiency reasons, it is desirable to store the numbers in the native binary format and consecutively in memory instead of having a Perl SV for each one). Perl6 could come up with a representation for such types as part of its core (or be sufficiently flexible in its SV design).
It is also critical to be able to specify a range of types,
for example PDL offers support for single and double precision
(float/double) as well as a variety of integer types ('byte',
'ushort', 'short', 'long'). A future implementation should
support a complex type at the core level. In the current
implemenetation this is difficult due to parser limitations
of the code generator (PDL::PP
) and C's lack of an intrinsic
complex type.
During data analysis it is often desirable to be able to refer to parts of the data, for example every other number in this vector without allocating new storage: if the original vector is 100Mb long, we don't want to allocate and copy the data for another 50Mb vector. For this purpose PDL supports "virtual piddles", which are pointers to parts of other piddles (the implementation is transparent, virtual piddles appearing just as any other piddle object to the user).
$a = zeroes(1000,1000,100); # more than 100MB worth of data $b = $a->slice('0:-1:2'); # virtual piddle of every 2nd element $b *= 2; # changes the elements in $a
slice
is just one example of this type of operation. $b
contains
only information how to access $a
's data rather than a copy. Similarly,
a transposed object with the first two dimensions exchanged is easily
obtained:
$c = $a->xchg(0,1); # exchange dimension 0 and 1
Again, $c
only contains information how to access $a
's data.
Similarly we can create an index:
$i = pdl(1,9,27); # a number of indices $c = $a->index($i);
Changing $c
changes $a
(and vice versa). However in this
case PDL makes no nice optimisation and manipulating $c
uses memory
for temps even though the changes are propagated back. The crucial point
is that $c
and $a
are tightly linked objects that propagate
changes in either to the other (in a way similar to dataflow). Find
out more about this in the implementation section.
An optimisation is in principle possible, and would be useful
for all array types not just PDL ones.
Similar optimizations for sparse and non-rectangular arrays would be desirable, this would be especially useful in a perl for numerics.
In a large number of practical situations encountered in scientific numerical programming fundamental (simple) operations are iterated over many (small) chunks of (multi-dimensional) data.
A very simple examples is addition
$c = $a + $b
of two multi-dimensional arrays (PDL objects). The fundamental
operation is the addition of two scalar elements. This fundamental
operation is just iterated over all pairs of elements of $a
and
$b
and results placed element by element into $c
.
Based on this observation PDL introduced the notion of PDL threading
to implement such functionality in a flexible and efficient way:
it provides tools (PDL::PP
) to define those optimized innermost
loops (the fundamental operations) and establishes a framework to
efficiently execute loops over these fundamental operations
automatically as prescribed by the dimensionality of the input data. Most
importantly, those loops over the innermost fundamental operations
can generally be run in parallel (i.e. run in separate threads).
PDL currently has experimental support to exploit this observation.
Two examples using the inner product as the fundamental operation may serve to illustrate the idea. To convert an RGB image into a greyscale image an inner product with a triple of weights with each RGB triple is performed:
$grey = inner $rgb, pdl(0.301,0.586,0.113);
The key is here that regardless of the number of trailing dimensions
of the RGB object $rgb
(which has a shape [3,.....]) the
fundamental operation (the inner product) is speedily iterated
over all triples of the data. So $rgb
could be a single RGB
pixel (shape [3]), a line of RGB data (shape [3,x]), an RGB image
(shape [3,x,y]), a stack of RGB images ([3,x,y,z]) and so on.
Each function that can be used in such a way in PDL has a signature that symbolically states dimensions consumed and generated in a fundamental operation. Illustrating this by the two functions we have used above the signatures of addition and inner product are
a(); b(); [o] c() # addition, we use ';'s as separators a(n); b(n); [o] c() # inner product
In this symbolic notation [o]
flags arguments which
are created (output arguments). The symbols in parentheses name
any dimensions of the variables involved. Empty parentheses refer to
scalars (0-dim). So addition takes two scalar input arguments
and generates one output scalar. The inner product consumes, as
expected, two one-dimensional input args (symbolized by a(n)
and b(n)
) to produce a scalar output ([o] c()
).
To complete the picture PDL has rules how these fundamental operations consume chunks of data of the input args to produce a muti-dimensional result of chunks of output data (the possibly multi-dimensional result piddle). The user decides exactly how those operations are iterated over his data by manipulating the dimensions of the input data appropriately. That is illustrated in the promised second example.
Those familar with linear algebra will know that
a matrix product consists of the result of inner products
between all pairs of row and column vectors of two
input matrices. Not surprisingly we can therefore use the
inner product to compute the matrix product of two PDL
objects $a
and $b
:
$c = inner $a->dummy(1), $b->xchg(0,1)->dummy(2);
dummy
and xchg
are dimension manipulation functions
(of the smart reference type) that are used to change the
dimensionality of the input piddles so that the inner
product is iterated over the desired pairs of sub-vectors
and placed into the resulting piddle $c
. We do not expect
the casual reader to follow the details of this example
but to appreciate the underlying idea.
The essence of PDL threading is here that (1) it is a useful feature in many practical situations (2) can be very efficiently implemented and (3) exposes parallelism in the underlying algorithm.
One could go as far as saying that the ideas underlying threading are PDL's main programming paradigm and the user is expected to rethink his approach accordingly for optimal performance.
One of the foremost design criteria was to implement a small set of core structures, routines and code generators that would depend heavily on each other. The rest of the package (implementing the real numerical functionality) should only depend on the external interface of these core functions.
Encapsulation can be achieved by consciously looking for code that would be repeated in subroutines and either making a new subroutine or a code generator do the job for the programmer. This way, when the repeated code suddenly needs to be different, there is no painstaking need of changing all instances of it - simply changing the subroutine or code generator will do it. We emphasize this because this kind of thinking seems to be forgotten all too often.
PDL introduced the code generator PDL::PP to create subroutines from concise descriptions to allow the programmer to focus on the algorithmic side of things. In many ways PDL::PP can be thought of as the XS of PDL: it is complex to learn, powerful, allows easy interfacing at the C-level and is, once mastered, straightforward to use. Beyond that using PDL::PP ensures that new routines follow uniform calling conventions and support PDL threading, slicing, etc.
In the current implementation it suffers from similar problems as XS in perl5: the syntax is strange and hard to learn, it is also feature incomplete (i.e. certain things can not be done right now). However, it implements a number of ideas that we find are absolutely crucial to pertain and develop further in a numerically friendly perl: algorithmic abstraction, encapsulation and provision of a tool for easy interfacing to the hundreds of existing numerical, graphics and IO libraries that are to be tapped into.
As an illustration for the abstraction that is achieved, here is the
definition for an inner product function. This definition tells the
code generator (PDL::PP
) to generate the appropriate XS code that
will then be compiled as part of a new loadable PDL module:
pp_def( 'inner', Pars => 'a(n); b(n); [o]c(); ', # the signature, see above Code => 'double tmp = 0; loop(n) %{ tmp += $a() * $b(); %} $c() = tmp;' );
The above code specifies firstly that the two first vectors must have
the same first dimension and the third vector will not have that
dimension. Inside the loop it is possible to use $a()
without
specifying the index because we have specified that we are looping
over the dimension called n
. Notice that we do not do any
checking that $a
and $b
have the same sized first
dimension - we just tell PDL::PP that we expect this to be so and it
checks whether any data passed to the inner
routine obeys this
assumption and croak
s if it does not.
After installation the user can then use
$c = inner $a,$b;
to compute the inner product of two piddles.
The code clearly illustrates the abstraction achieved, making no reference to an actual data type or how the vectors are stored. This kind of abstraction is similar to the ideas used in the C++ Standard Template Library. As a bonus, the use of PP means a lot of the tedious housekeeping of programming can be automated. Almost all index troubles vanish when writing and using these subroutines: all the necessary index checks are done automatically by the code generator.
The code generator also ensures that calling the function with arguments that have more dimensions than the function expects works as well - the function is simply repeatedly called in a loop over all these dimensions (threaded over the dimensions in PDL parlance), providing a fast way of performing many similar operations in a row without going back to the Perl level (with the latest versions of PDL, it is possible to use multiple OS-level threads for these loops, enabling an easy way of gaining from a multiprocessor computer). This threading allows many algorithms to be expressed concisely and PDL scripts rarely have cause for loops over elements on the Perl level.
In a future perl this functionality might be integrated with a perl (JIT?) compiler. Code could be specified in perl(-like?) syntax and compiled into an optimized C level transformation as required.
If this sounds rather vague it is not by accident -- we still lack a clear idea how to optimally achieve this goal. A nice perlish syntax to replace the PP gibberish should certainly be the starting point.
The PDL core is currently suffering from similar problems as
perl 5: it works but is poorly documented and many hacks make it
increasingly difficult to modify and maintain. But it works ;)
.
In any case, some of the perl6 guys have asked for a rundown of the principal guts of the implementation. Here it goes...
On the Perl side, an opaque scalar reference turned out to be the most
practical representation: Perl5 allows overloading of operators so
$c=$a+$b
can mean "add the objects (here representing matrices)
$a
and $b
and store the result in the scalar $c
with whatever
type that the result has". As with any perl object, new objects can
be easily derived from piddles for specialised purposes.
In Perl6 it may be desirable to use @syntax
for all arrays - whether
compact, implementing smart references, or not. The use of $a
for
PDL arrays and @a
for normal perl arrays is confusing, we would
like to see some unification. Ideally perl arrays would support some,
or all, of the PDL features whether they are multi-dimensional
or not.
struct pdl { unsigned long magicno; /* Always stores PDL_MAGICNO as a sanity check */ /* This is first so most pointer accesses to wrong type are caught */ int state; /* What's in this pdl */ pdl_trans *trans; /* Opaque pointer to internals of transformation from parent */ pdl_vaffine *vafftrans; void* sv; /* (optional) pointer back to original sv. ALWAYS check for non-null before use. We cannot inc refcnt on this one or we'd never get destroyed */ void *datasv; /* Pointer to SV containing data. Refcnt inced */ void *data; /* Null: no data alloced for this one */ int nvals; /* How many values allocated */ int datatype; PDL_Long *dims; /* Array of data dimensions */ PDL_Long *dimincs; /* Array of data default increments */ short ndims; /* Number of data dimensions */ unsigned char *threadids; /* Starting index of the thread index set n */ unsigned char nthreadids; pdl *progenitor; /* I'm in a mutated family. make_physical_now must copy me to the new generation. */ pdl *future_me; /* I'm the "then" pdl and this is my "now" (or more modern version, anyway */ pdl_children children; short living_for; /* perl side not referenced; delete me when */ PDL_Long def_dims[PDL_NDIMS]; /* Preallocated space for efficiency */ PDL_Long def_dimincs[PDL_NDIMS]; /* Preallocated space for efficiency */ unsigned char def_threadids[PDL_NTHREADIDS]; struct pdl_magic *magic; void *hdrsv; /* "header", settable from outside */ };
This is quite a structure for just storing some data in - what is going on?
We are going to start with some of the simpler members: first of all, there is the member
void *datasv;
which is really a pointer to a Perl SV structure (SV *
). The SV is
expected to be representing a string, in which the data of the piddle
is stored in a tightly packed form. This pointer counts as a reference
to the SV so the reference count has been incremented when the SV *
was placed here. This pointer is allowed to have the value NULL
which
means that there is no Perl SV for this data - for instance, the data
might be allocated by a mmap
operation. Note the use of an SV*
was purely for convenience, it allows easy transformation of
packed data from files into piddles. Other implementations are not
excluded.
The actual pointer to data is stored in the member
void *data;
which contains a pointer to a memory area with space for
int nvals;
data items of the data type of this piddle.
The data type of the data is stored in the variable
int datatype;
the values for this member are given in the enum pdl_datatypes
, discussed
above.
The number of dimensions in the piddle is given by the member
int ndims;
which shows how many entries there are in the arrays
PDL_Long *dims; PDL_Long *dimincs;
These arrays are intimately related: dims
gives the sizes of the dimensions
and dimincs
is always calculated by the code
int inc = 1; for(i=0; i<it->ndims; i++) { it->dimincs[i] = inc; inc *= it->dims[i]; }
in the routine pdl_resize_defaultincs
in pdlapi.c
.
What this means is that the dimincs can be used to calculate the offset
by code like
int offs = 0; for(i=0; i<it->ndims; i++) { offs += it->dimincs[i] * index[i]; }
but this is not always the right thing to do, at least without checking for certain things first.
Since the vast majority of piddles don't have more than 6 dimensions, it is more efficient to have default storage for the dimensions and dimincs inside the PDL struct.
PDL_Long def_dims[PDL_NDIMS]; PDL_Long def_dimincs[PDL_NDIMS];
The dims
and dimincs
may be set to point to the beginning of these
arrays if ndims
is smaller than or equal to the compile-time constant
PDL_NDIMS
. This is important to note when freeing a piddle struct.
The same applies for the threadids:
unsigned char def_threadids[PDL_NTHREADIDS];
It is possible to attach magic to piddles, much like Perl's own magic mechanism. If the member pointer
struct pdl_magic *magic;
is nonzero, the PDL has some magic attached to it. The implementation of magic is discussed below.
One of the first members of the structure is
int state;
The possible flags and their meanings are given in pdl.h
.
As you should already know, piddles often carry information about where they come from. For example, the code
$b = $a->slice("2:5"); $b .= 1;
will alter $a. This information is stored in the members
pdl_trans *trans; pdl_vaffine *vafftrans;
pdl_trans
and pdl_vaffine
are structures that we will look at in
more detail below.
When piddles are referred to through Perl SVs, we store an additional reference to it in the member
void* sv;
in order to be able to return a reference to the user when he wants to inspect the transformation structure on the Perl side.
Also, we store an opaque
void *hdrsv;
which is just for use by the user to hook up arbitrary data with this sv.
Smart references and most other fundamental functions
operating on piddles are implemented via transformations
which are represented by the type pdl_trans
in PDL.
A transformation links input and output piddles and contains all the infrastructure that defines how
slice
here).In general, executing a PDL function on a group of piddles
results in creation of a transformation of the requested
type that links all input and output arguments (at least
those that are piddles). In PDL functions that support
data flow between input and output args (e.g. slice
,
index
) this transformation links parent (input) and
child (output) piddles permanently until either the link is
explicitly broken by user request (sever
at the perl level)
or all parents and childen have been destroyed. In those
cases the transformation is lazy-evaluated, e.g. only executed
when piddle values are actually accessed.
In non-flowing functions, for example addition (+
) and inner
products (inner
), the transformation is installed just as
in flowing functions but then the transformation is immediately
executed and destroyed (breaking the link between input and output args)
before the function returns.
It should be noted that the close link between input and output args of a flowing function requires that piddle objects that are linked in such a way be kept alive beyond the point where they have gone out of scope from the point of view of perl:
$a = zeroes(20); $b = $a->slice('2:4'); undef $a; # last reference to $a is now destroyed
Although $a should now be destroyed according to perl's rules
the underlying *pdl
must actually only be freed when $b
also goes out of scope (since it still references
internally some of $a
's data). This example demonstrates that such
a dataflow paradigm between PDL objects necessitates a special
destruction algorithm that takes the links between piddles
into account and couples the lifespan of those objects. The
non-trivial algorithm is implemented in the function
pdl_destroy
in pdlapi.c.
pdl_trans
All transformations are implemented as structures
struct XXX_trans { int magicno; /* to detect memory overwrites */ short flags; /* state of the trans */ pdl_transvtable *vtable; /* the all important vtable */ void (*freeproc)(struct pdl_trans *); /* Call to free this trans (in case we had to malloc some stuff dor this trans) */ pdl *pdls[NP]; /* The pdls involved in the transformation */ int __datatype; /* the type of the transformation */ /* in general more members /* depending on the actual transformation (slice, add, etc) */ };
The transformation identifies all pdl
s involved in the trans
pdl *pdls[NP];
with NP
depending on the number of piddle args of the particular
trans. It records a state
short flags;
and the datatype
int __datatype;
of the trans (to which all piddles must be converted unless they are explicitly typed). Most important is the pointer to the vtable that contains the actual functionality
pdl_transvtable *vtable;
The vtable structure in turn looks something like (slightly simplified from pdl.h for clarity)
typedef struct pdl_transvtable { pdl_transtype transtype; int flags; int nparents; /* number of parent pdls (input) */ int npdls; /* number of child pdls (output) */ char *per_pdl_flags; /* optimization flags */ void (*redodims)(pdl_trans *tr); /* figure out dims of children */ void (*readdata)(pdl_trans *tr); /* flow parents to children */ void (*writebackdata)(pdl_trans *tr); /* flow backwards */ void (*freetrans)(pdl_trans *tr); /* Free both the contents and it of the trans member */ pdl_trans *(*copy)(pdl_trans *tr); /* Full copy */ int structsize; char *name; /* For debuggers, mostly */ } pdl_transvtable;
We focus on the callback functions:
void (*redodims)(pdl_trans *tr);
redodims
will work out the dimensions of piddles that need
to be created and is called from within the API function that
should be called to ensure that the dimensions of a piddle are
accessible (pdlapi.c):
void pdl_make_physdims(pdl *it)
readdata
and writebackdata
are responsible for the actual
computations of the child data from the parents or parent data
from those of the children, respectively (the dataflow aspect).
The PDL core makes sure that these are called as needed when
piddle data is accessed (lazy-evaluation). The general API
function to ensure that a piddle is up-to-date is
void pdl_make_physvaffine(pdl *it)
which should be called before accessing piddle data from XS/C (see Core.xs for some examples).
freetrans
frees dynamically allocated memory associated
with the trans as needed and copy
can copy the transformation.
The transformation and vtable code is hardly ever written by
hand but rather generated by PDL::PP
from concise descriptions
(see the example of the definition of inner
above).
Certain types of transformations can be optimized very
efficiently obviating the need for explicit readdata
and writebackdata
methods. Those transformations are
called pdl_vaffine. Most dimension manipulating
functions (e.g., slice
, xchg
) belong to this class.
The basic trick is that parent and child of such a transformation work
on the same (shared) block of data which they just choose
to interpret differently (by dusing different dims
, dimincs
and
offs
on the same data, compare the pdl
structure above).
Each operation on a piddle sharing
data with another one in this way is therefore automatically flown
from child to parent and back -- they are reading and writing
the same block of memory. This is currently not perl thread safe --
no big loss since the whole PDL core is not reentrant
(perl threading !=
PDL threading).
A more detailed description of vaffines will be added in a later version of this document.
Most of that functionality is implemented in the file pdlthread.c.
The PDL::PP
generated functions (in particular the
readdata
and writebackdata
callbacks) use this infrastructure to
make sure that the fundamental operation implemented by the
trans is performed in agreement with PDL's threading semantics.
Please, see PDL::PP and examples in the PDL distribution. Implementation and syntax are currently far from perfect.
As the above example on the definition of inner
suggests,
the principal functionality
is implemented through the function pp_def
which takes a hash
of directives as a concise specification of the desired functionality.
The Code
and RedoDimsCode
fields contain the algorithmic specification
in a home grown pseudo code syntax that has some faint similarity to
a mixture of perl, XS and C. Here we could do with some good ideas
for improvement. From this description PDL::PP
produces C and XS code
(which will go away in a future perl?) by processing the hash through a
translation table.
The ability to write the fundamental algorithm in a perlish way (or even in perl) that is optimized by (ideally JIT) compilation is what we are really after.
We have explained the key ideas which make PDL work, and work well. It remains to be determined which of these might belong in the core of perl6, in order to make PDL work more seamlessly, and which belong in the module.
Our take on this: perl6 should support a compact, numerical array type which is
overloadable and can be manipulated with nice syntax. A key point is to abolish the
visual distinction between perl arrays (@x
) and PDL arrays ($x
) which is confusing
but forced upon us by the current object paradigm. There is also considerable
scope for improving syntax to make numerical Perl more user-friendly - this is explained
in more detail by the RFCs on Ranges and Default Methods.
Compact arrays should support virtual slices when being manipulated.
It seems to us that the more detailed/generalised PDL threading paradigm is beyond the core and belongs in a module. Some syntactical hooks would be appreciated though so one could at least say code like this at the Perl level:
sub mysub : PDL { my($a, $b) = @_; signature('a(n), b(n)'); loop(n) { # n a bareword, loop has prototype # Threaded code } }
perl6 RFCs on Ranges and Default Methods.
Matlab
IDL (the one from rsinc)
Yorick
Mathematica
NumPython starship.python.net
Hords of other array-oriented languages, try SAL.KachinaTech.COM
RFCs on multidim arrays 202-207, 231
[1] LW:
Let me just add that I don't mind the brainstorming at all. To be a good language designer, you have to stuff your brain with what you *could* do before you can reasonably choose what you *will* do. At the moment, I'm not only trying to follow along here; I'm also reading all the books on computer languaes I can get my hands on--not just to look for ideas to steal, but also to remind myself of the mindset Perl was designed to escape. In fact, I'd go as far as to say that it's imperative that you overstuff your brain to the point where you no longer feel tempted to overstuff the language. Hypotheticality is your friend.