[% setvar title Internal String Storage to be Opaque %]
Note: these documents may be out of date. Do not use as reference! |
To see what is currently happening visit http://www.perl6.org/
Internal String Storage to be Opaque
Maintainer: Simon Cozens <simon@brecon.co.uk> Date: 18 Aug 2000 Mailing List: perl6-internals@perl.org Number: 131 Version: 1 Status: Developing
Perl 5.6.0 tried to mix UTF8 and non-UTF8 strings inside SVs, meaning that every time you had to do something with a string, you had to check whether it was UTF8 or not, and if so probably do something slightly different to it instead. A single internal encoding which is opaque to the programmer Would Be Nice.
Let's imagine what happens if we have a string encoded in UTF8 and a string encoded in UTF16 and we need to concatenate them. What on earth do you do here? A byte-by-byte concatenation with no encoding change is probably the wrong thing, but it might not be. So, what happens at present is that one or the other has to get recoded. You can either change the original string to the new encoding (which makes you wonder why you bothered having separate encodings anyway) or you can take a copy and convert that. (which is the same, only more expensive)
A single encoding gets rid of all this mess.
'Course not. This is the internal representation we're talking about. It's only a convenience to the core, and doesn't relate specifically to what's coming into or going out of Perl from the user's point of view.
When data enters and leaves the core? Hmm, yes, I suppose it does. But that's life. Sorry.
That's the thing. It doesn't matter. It shouldn't matter. Keep it pluggable; you could have everything in Latin1, in UTF8, in UTF16, or who knows what, but the core developer shouldn't have to care. One good way to achieve this is to have the string presented in the variable as an array of wchars or similar. The point is that it's the same for everything, so comparisons don't suck.
Maybe. But would it matter? The end user certainly doesn't care what internal representation is used, and any module or program authors that depend on that can be found guilty of "unwarranted chumminess". The only difference would be between builds that use Unicode and those that don't.
You got it. But of course, you get to choose your UTF. For what it's worth.