[% setvar title Internally, data is stored as UTF8 %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/


Internally, data is stored as UTF8


  Maintainer: Simon Cozens <simon@brecon.co.uk>
  Date: 25 Sep 2000
  Mailing List: perl6-internals@perl.org
  Number: 294
  Version: 1
  Status: Developing


We need to settle on an internal data format; this RFC proposes that UTF8 should be that format.


Perl 5.6's Unicode support has been hampered by the fact that it was grafted onto the side of the old string support. As a result it tried to handle both Unicode-encoded and non-Unicode data in the same structures, which made it an absolute swine to do any manipulation properly on these strings.
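To illustrate why mixing encoded and unencoded data in one structure is painful, here is a minimal Python sketch. It is an illustration only, not a model of Perl 5.6's internals, and the variable names are made up for the example. The same two bytes give a different string, of a different length, depending on which encoding you assume they are in.

```python
# The same two bytes, with no record of how they were encoded.
raw = b"\xc3\xa9"

# Read as UTF-8, the bytes form a single character (e with acute accent).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, each byte is its own character, giving two characters.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))  # prints: 1 2
```

Every string operation (length, substring, comparison) has to know which reading applies; with a single internal format that question never comes up.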

This could all be made a lot easier if we stuck to one single data format for internal representation, just as most other languages out there do. If we're going to have decent Unicode support, it naturally needs to be a UTF. So which one?

UTF32 is just not going to fly: at four bytes for every character, it's too big and bulky. UTF16 is sensible, but there's probably a lot more legacy ASCII data out there than anything else, and ASCII text is already valid UTF8 byte for byte, so it makes sense to propose UTF8 as a halfway house.
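The size trade-off can be checked with a short Python sketch. The helper name `encoded_sizes` is hypothetical, and the little-endian codec variants are used only so that no byte-order mark is counted in the totals.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store text in each UTF."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Pure ASCII: UTF-8 needs one byte per character and the bytes are
# identical to the original ASCII data.
print(encoded_sizes("hello"))   # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print("hello".encode("utf-8") == "hello".encode("ascii"))  # True

# Three CJK characters: here UTF-16 is the most compact, which is the
# cost of choosing UTF-8 as the halfway house.
print(encoded_sizes("日本語"))  # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
```

For ASCII-heavy data UTF-8 is the smallest of the three and needs no conversion at all; for text outside the ASCII range it can be larger than UTF-16, but never larger than UTF-32.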


We'll need to get data into Unicode, and I have an RFC about that; we need to handle data internally, and I have an RFC about that. This RFC merely settles on the fact that we need a single internal data format for simplicity and that it should be UTF8.


The Unicode FAQ on UTFs and BOMs (www.unicode.org): an excellent introduction to what UTFs are, what they look like and how they work.

RFC 295: Normalisation and unicode::exact

RFC ??: When UTF8 leaks out

RFC 300: use unicode::representation

RFC 312: Unicode Combinatorix

RFC 296: Getting Data Into Unicode Is Not Our Problem

RFC ??: Unicode Locales

RFC ??: Abstract the Internal String Interaction