[% setvar title Unicode Combinatorix %]
Note: these documents may be out of date. Do not use as reference!
To see what is currently happening visit http://www.perl6.org/
Unicode Combinatorix
Maintainer: Simon Cozens <simon@brecon.co.uk>
Date: 25 Sep 2000
Mailing List: perl6-internals@perl.org
Number: 312
Version: 1
Status: Developing
This RFC describes how and when Unicode is used in Perl 6. Its sister RFC, "When UTF8 Leaks Out", deals with how output is processed with respect to Unicode; this RFC deals with input and processing.
Here is a proposed overview of Perl 6's Unicode handling, covering everything except output.
Data comes into Perl in one of two ways: through a filehandle or socket, where line disciplines apply, or by "any other method".
Data which comes in through a line discipline must be in UTF8, unless
the no unicode pragma is in force.
Examples of data which enter Perl by "any other method" are
environment variables, the output of backticks, and globs.
It's assumed that these will come into Perl as ISO8859-1, and will be
converted into UTF8 on entry. If the user wants to tell us that this
data won't be ISO8859-1, then they may say use unicode::system 'discipline',
where 'discipline' is any level 2 processing module. The data will be
filtered through that module, and we will have UTF8 data at the end of
it.
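The two entry paths described above can be sketched roughly as follows. This is an illustration in Python rather than Perl 6, since the proposed pragmas do not exist; the function names, the unicode_on flag, and the use of a Python codec name in place of a processing module are all assumptions made for the example. Raising an error on malformed UTF8 is also an assumption: the RFC only says the data "must be in UTF8".

```python
def from_line_discipline(raw: bytes, unicode_on: bool = True) -> bytes:
    """Data arriving through a filehandle or socket (hypothetical helper)."""
    if not unicode_on:
        # Models "no unicode": no checking, the bytes pass through untouched.
        return raw
    # A strict decode raises UnicodeDecodeError if the input is not valid UTF-8.
    raw.decode("utf-8")
    return raw


def from_system(raw: bytes, discipline: str = "iso8859-1") -> bytes:
    """Environment variables, backtick output, globs (hypothetical helper)."""
    # 'discipline' is a Python codec name standing in for a processing module.
    # Whatever the source encoding, the result is UTF-8.
    return raw.decode(discipline).encode("utf-8")
```

With the default, the ISO8859-1 bytes for "café" come out as UTF8; passing another codec name such as "koi8-r" plays the role of use unicode::system 'discipline'.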
OK, at this stage, everything inside Perl is in Unicode. The next thing
we need to do is to normalise it, and the RFC on normalisation (RFC 295)
covers that: basically, we either do this on demand or immediately. The
setting of unicode::exact comes into play here.
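The difference between normalising immediately and normalising on demand can be sketched like this, again in Python as a stand-in. The class name, the eager flag, and the choice of NFC as the normalisation form are assumptions for illustration; this RFC does not say which form is used or exactly how unicode::exact selects between the two behaviours.

```python
import unicodedata


class LazyNormalised:
    """Holds a string and normalises it either at once or on first use."""

    def __init__(self, text: str, eager: bool = False):
        self.raw = text  # the representation as it arrived
        # Eager mode normalises immediately; lazy mode defers the work.
        self._norm = unicodedata.normalize("NFC", text) if eager else None

    @property
    def normalised(self) -> str:
        if self._norm is None:
            # On-demand normalisation, cached after the first request.
            self._norm = unicodedata.normalize("NFC", self.raw)
        return self._norm
```

Keeping the raw form alongside the cached normalised form means the original representation is never lost, whichever strategy is chosen.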
Comparisons can be carried out in three ways. If unicode::representation
is in force, then the bytes must be exactly equal. If unicode::exact is
set, a normalised copy of each operand is made, and the copies are
compared. Otherwise, the operands themselves are normalised and then
compared. As the FAQ says, "Canonical equivalence matters".
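A minimal sketch of the three comparison modes, in Python as a stand-in for the proposed pragmas. The Scalar class, the mode names, and the use of NFC are assumptions; in particular, reading the default case as normalising the operands in place is one interpretation of the text above, not something the RFC spells out.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Scalar:
    """A mutable holder for a string, standing in for a Perl scalar."""
    text: str


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def equal(a: Scalar, b: Scalar, mode: str = "default") -> bool:
    if mode == "representation":
        # Byte-for-byte comparison, no normalisation at all.
        return a.text.encode("utf-8") == b.text.encode("utf-8")
    if mode == "exact":
        # Compare normalised copies; the operands keep their original form.
        return nfc(a.text) == nfc(b.text)
    # Default: normalise the operands themselves, then compare.
    a.text, b.text = nfc(a.text), nfc(b.text)
    return a.text == b.text
```

For example, a precomposed "é" (U+00E9) and "e" followed by U+0301 are unequal under "representation" but equal under the other two modes; only the default mode rewrites the second operand into its composed form.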
Collation should take place according to the Unicode collation tables;
if use locale is set, then the collation is localised as well. The
Unicode locales RFC suggests other areas affected by locales, such as
word and line breaking and Unicode character classes.
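To see why collation needs more than code point order, here is a deliberately crude Python sketch. The toy_key function is hypothetical and is not the Unicode collation algorithm: it only decomposes the string, drops combining marks, and case-folds, which roughly imitates a primary-level comparison. Real collation would consult the Unicode collation tables and, under use locale, a locale-specific tailoring.

```python
import unicodedata


def toy_key(s: str) -> str:
    """Crude stand-in for a primary collation key (not the real algorithm)."""
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks so accented letters sort with their base letter.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


words = ["zebra", "Äpfel", "apple"]
by_code_point = sorted(words)            # ['apple', 'zebra', 'Äpfel']
by_toy_key = sorted(words, key=toy_key)  # ['Äpfel', 'apple', 'zebra']
```

Plain code point order puts "Äpfel" after "zebra", which is rarely what a user expects; even this simplified key moves it next to the other words beginning with "a".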
The no unicode pragma just throws all of this away: none of the above happens.
Just leave that to me...
"What Level of Support Should I Look For?", www.unicode.org
RFC 311: Line Disciplines
RFC 295: Normalisation and unicode::exact
RFC 300: use unicode::representation and no unicode
RFC ??: When UTF8 Leaks Out
RFC ??: Abstract Internals String Interaction
RFC ??: Unicode Locales