[% setvar title Normalisation and unicode::exact %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/


Normalisation and unicode::exact


  Maintainer: Simon Cozens <simon@brecon.co.uk>
  Date: 25 Sep 2000
  Mailing List: perl6-internals@perl.org
  Number: 295
  Version: 1
  Status: Developing


Perl 6 should support Unicode normalisation; this is going to make comparing strings confusing.


First, what's normalisation? Unicode gives the user a lot of flexibility over how data is represented. For instance, there are two ways of representing é (that's an e with an acute accent): first, there's the single character U+00E9 (handily named LATIN SMALL LETTER E WITH ACUTE), and second, there's the two-character sequence U+0065 U+0301 (an ordinary Latin letter 'e' followed by a combining acute accent).

Normalisation is the process of turning all data in the first type of representation into the second type. Well, strictly speaking, that step is "decomposition", but its purpose is to let us compare things for their meaning, and not solely for their representation; decomposing for that purpose is normalisation.
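To make the two representations concrete, here is a small sketch in Perl 5. It assumes the Unicode::Normalize module and its NFD function (canonical decomposition), which are not part of this RFC; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# The two representations of e-acute described above.
my $precomposed = chr(0x00E9);                # LATIN SMALL LETTER E WITH ACUTE
my $combining   = chr(0x0065) . chr(0x0301);  # 'e' + COMBINING ACUTE ACCENT

# A plain eq compares representations: one character against two.
my $same_repr = ($precomposed eq $combining) ? 1 : 0;

# Decomposing both sides first compares meanings instead.
my $same_meaning = (NFD($precomposed) eq NFD($combining)) ? 1 : 0;

printf "same representation: %d, same meaning: %d\n",
    $same_repr, $same_meaning;
```

The plain comparison sees different strings, while the decomposed comparison treats them as the same text.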

Perl 6 should support normalisation. But this creates a problem. Should the eq operator compare representations or meanings? After plying the perl5-porters with large quantities of alcohol at YAPC::Europe, the consensus was that it should compare meanings. Good. Perl's always been about handling text for meaning. But then how can we tell whether two strings are really equal in terms of their representation?


The current use bytes pragma (which is the subject of another of my Unicode RFCs) will allow comparison in terms of representation. There's also a problem of optimisation.

If we keep the original data, we need to perform decomposition every time we do any kind of string comparison, and I don't relish the prospect of cmp becoming really slow. Of course, you could store a decomposed PV inside the SV as well, but that's big and heavy; the only sensible, non-destructive optimisation would be to have some kind of IsNormalised flag in the SV which tells us not to bother decomposing this string, since it's already in a normalised representation.
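The flag idea can be sketched in Perl 5 roughly as follows. This is only an illustration under assumptions: the hash stands in for an SV, the names make_sv, decomposed and sv_cmp are hypothetical, and NFD and checkNFD come from the Unicode::Normalize module rather than from anything this RFC specifies. Setting the flag when the value is created is just one choice; it could equally be set the first time the string is examined.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD checkNFD);

# Hypothetical stand-in for an SV: the string plus an IsNormalised flag.
sub make_sv {
    my ($string) = @_;
    return {
        pv            => $string,
        is_normalised => checkNFD($string) ? 1 : 0,
    };
}

# Return a decomposed view of the string without touching the stored data.
# When the flag is set, the decomposition step is skipped entirely.
sub decomposed {
    my ($sv) = @_;
    return $sv->{is_normalised} ? $sv->{pv} : NFD($sv->{pv});
}

# Compare two flagged strings by meaning.
sub sv_cmp {
    my ($left, $right) = @_;
    return decomposed($left) cmp decomposed($right);
}

my $composed_sv   = make_sv(chr(0x00E9));
my $decomposed_sv = make_sv(chr(0x0065) . chr(0x0301));
my $result        = sv_cmp($composed_sv, $decomposed_sv);
```

Only the precomposed string pays for a decomposition here, and its stored value is left alone.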

I propose that by default, Perl 6 is allowed to chew up your data and decompose it. If the exact representation is important to the user, the pragma unicode::exact should be turned on; inside of the scope of unicode::exact, no normalisation is performed, and cmp and friends perform normalisation on a temporary copy of the string so as to be non-destructive to the original data. For instance:

    $x = chr(0x00E9); # LATIN SMALL LETTER E WITH ACUTE
    ... if ($x cmp $y);
    # $x is *actually* chr(0x0065).chr(0x0301) now

    {
        use unicode::exact;
        $x = chr(0x00E9);
        ... if ($x cmp $y);
        # $x is compared as if it were chr(0x0065).chr(0x0301),
        # but it retains its old value of chr(0x00E9)
    }
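The difference between the two behaviours can be emulated in today's Perl 5. This is a sketch only: cmp_default and cmp_exact are hypothetical helper names, not proposed syntax, and NFD is assumed to come from the Unicode::Normalize module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Default behaviour proposed above: comparing may rewrite the operands.
# @_ aliases the caller's variables, so assigning to $_[0] changes $x itself.
sub cmp_default {
    $_[0] = NFD($_[0]);
    $_[1] = NFD($_[1]);
    return $_[0] cmp $_[1];
}

# Behaviour under the proposed unicode::exact: normalise temporary copies only.
sub cmp_exact {
    my ($left, $right) = @_;    # copies; the caller's strings are untouched
    return NFD($left) cmp NFD($right);
}

my $y = chr(0x0065) . chr(0x0301);
my $x = chr(0x00E9);

my $exact_result    = cmp_exact($x, $y);
my $len_after_exact = length $x;       # $x is still the single U+00E9

my $default_result    = cmp_default($x, $y);
my $len_after_default = length $x;     # $x is now U+0065 U+0301
```

Both comparisons give the same answer; the only difference is whether $x has been decomposed afterwards.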

Outside of unicode::exact, this RFC does not specify whether the normalisation is done lazily (necessitating an IsNormalised flag) or when the data is stored; it works fine both ways. I'd personally say it should be done lazily.


The Unicode FAQ; www.unicode.org

RFC 300: use unicode::representation

RFC 312: Unicode Combinatorix

RFC ??: When UTF8 Leaks Out