On Fri, 22 Oct 2004 16:03:17 -0700, Jamiil wrote:
> Thanks for the help Guy!
> A long time ago I wrote a class wrapper for what I found to be some of
> the most common methods in std::string. In the past few days I have
> seen the word "UNICODE" and/or "Internationalization" popping up on my
> desk, so I have decided to make my programs multicultural ready. C++
> has this "std::wstring" that might be the bridge to achieve my goal.
No. Do not use wchar_t or wstring.
Instead, if you want to support Unicode you should use a string of
32 bit integers. C++ does not provide one, so no standard
instantiation of basic_string<?> is guaranteed to give you a 1-1
mapping between string elements and code points.
You have two alternatives. For total portability of your
internal code, use basic_string<unsigned char> and assume UTF-8 encoding.
This is the dominant encoding on the Internet and also on Linux.
You can probably get away with basic_string<char>.
With a tiny risk, you can also use basic_string<int>.
An int is likely to be wide enough on most hosted systems,
although it may fail on embedded systems where int is 16 bits.
wchar_t is 32 bit on 32 bit Linux boxes, but it is 16 bit on
Windows (and Solaris?). On a small system (e.g. a microcontroller
or a small embedded platform like a mobile phone) it might even be 8 bit.
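
If you want to see what your own compiler does, a minimal sketch like
the one below reports the width of wchar_t. The function names are
made up for illustration; the 21 bit threshold comes from the largest
Unicode code point, U+10FFFF.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in wchar_t on the platform this is compiled for.
std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if wchar_t is wide enough for any Unicode code point.
// The largest code point, U+10FFFF, needs 21 bits.
bool wchar_holds_any_code_point() {
    return wchar_bits() >= 21;
}
```

Printing wchar_bits() on each target you care about is a quick way to
find out whether one wchar_t per code point is even possible there.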
This situation will change if you adopt the C99 extensions to
C++, which will eventually be part of the next C++ Standard.
Then you can use int32_t (and your program will fail to build on
systems not implementing it).
If you're *only* handling plain text, with simple formatting,
plain old std::string with UTF-8 is the best option, because there
is nothing to do. You're already supporting it :)
However, if you need to index into the string, or count
characters, UTF-8 is harder to work with. Finding position n
in the string takes O(n) time.
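
To make the O(n) cost concrete, here is a rough sketch of counting
code points and finding the byte offset of the n-th one in a UTF-8
std::string. The names utf8_length and utf8_offset are invented for
this example, and the code assumes the input is already well-formed
UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Count code points: every byte that is not a continuation byte
// starts a new code point. Assumes well-formed UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (!is_continuation(static_cast<unsigned char>(s[i])))
            ++n;
    return n;
}

// Byte offset of the code point with 0-based index `index`, or
// std::string::npos if the string has fewer code points than that.
// This has to walk the string from the start, hence O(n).
std::size_t utf8_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i])))
            continue;
        if (seen == index)
            return i;
        ++seen;
    }
    return std::string::npos;
}
```

For the four byte string "a\xC3\xA9z" (a, e-acute, z), utf8_length
gives 3 while size() gives 4, and the third code point starts at byte
offset 3. Note this counts code points, not what a user would call
characters once combining marks are involved.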
Note that in all cases you have to worry about physical I/O.
There's no way to be sure what encoding your input data has,
so you either have to mandate it, or ask the client to tell you
(e.g. with a command line option). There is a standard way to
detect UTF-8 and big/little endian UCS-2 and UCS-4 by examining
the first few bytes of a file in binary mode (the byte order
mark), but it doesn't seem to be used much.
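
The check on the first few bytes could look roughly like this. The
enum and the function name detect_bom are hypothetical; the byte
values are the standard encodings of the byte order mark U+FEFF. The
caller is assumed to have read the start of the file in binary mode
into a std::string.

```cpp
#include <cstddef>
#include <string>

enum Encoding {
    ENC_UNKNOWN,   // no byte order mark found
    ENC_UTF8,
    ENC_UTF16_BE,
    ENC_UTF16_LE,
    ENC_UTF32_BE,
    ENC_UTF32_LE
};

// Inspect the leading bytes of a file for a byte order mark.
// The 4 byte marks are tested first, because the UTF-32 little
// endian mark (FF FE 00 00) begins with the UTF-16 little endian
// mark (FF FE).
Encoding detect_bom(const std::string& head) {
    const unsigned char* b =
        reinterpret_cast<const unsigned char*>(head.data());
    const std::size_t n = head.size();

    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return ENC_UTF32_BE;
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return ENC_UTF32_LE;
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return ENC_UTF16_BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return ENC_UTF16_LE;
    return ENC_UNKNOWN;
}
```

ENC_UNKNOWN only means no mark was present, which is the common case
for UTF-8 files on Linux, so you still need a default or an option
from the user. There is also an unavoidable ambiguity: a little
endian 16 bit file that starts with the mark followed by a NUL
character looks the same as a little endian 32 bit file.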
Help-gplusplus mailing list