[% setvar title Abstract Internals String Interaction %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/

TITLE

Abstract Internals String Interaction

VERSION

  Maintainer: Simon Cozens <simon@brecon.co.uk>
  Date: 26 Sep 2000
  Mailing List: perl6-internals-unicode@perl.org
  Number: 322
  Version: 1
  Status: Developing

ABSTRACT

All accesses to strings and portions of strings inside the Perl 6 core should be abstracted.

DESCRIPTION

Although RFC 294 states that data will be in UTF8 internally, there's going to come a time when we'll need to move to UTF16 or other encodings. If we start fiddling around "knowing" that we're dealing with UTF8 data, we're going to make it hard on ourselves when it comes to migrate.

But not only that, UTF8 is really rather gammy to deal with. It's a lot nicer if we can just treat each string as an array of wide characters.

Take, for instance, the operation of changing a character U+2654 (WHITE CHESS KING) into U+003F (QUESTION MARK) in the 3rd position of a 12 character string. Because UTF8 is a variable-width character set, we have a big problem: U+2654 is 3 bytes wide, and U+003F is one byte, we have to shift the rest of the string down two bytes, effectively having to rewrite it.

If we used a function or macro to change this single character, it would also have to rewrite the whole string; using this several times would be horrifically inefficient.

Even if we "pretend" that we're dealing with an array of wide chars, and used macros or functions to allow us to keep up this pretence, we'd still have to do that shifting around, which is nasty. The only way to get it working without the inefficiency would be to really deal with an array of wide chars, and have functions to convert such an array from and to our character set.

The obvious question would be, then, why bother with RFC 294 or UTF8 at all? Why not just store these arrays along with our SVs? Several reasons, notably:

IMPLEMENTATION

The following interface is proposed:

    wchar_t *  Perl_utf_to_raw( U8* string,      STRLEN* len )
    U8 *       Perl_raw_to_utf( wchar_t* string, STRLEN* len )

All functions which manipulate strings should convert them to raw form and operate on the array, converting it back when it's time to store it in the string.

REFERENCES

RFC 294: Internally, data is stored as UTF8

As-yet-unsubmitted RFC : When UTF8 Leaks Out