If you want to minimize the number of bugs reported for your web application, you should know that code pages are a thing of the past. If you want to create multi-language web content, all you have to remember is to use UTF-8 encoding everywhere.
Also, if you encounter pages in other encodings, you should convert them to UTF-8.
What you should know about UTF-8:
- Any Unicode character can be encoded in UTF-8, so converting from any other encoding to UTF-8 can be done without losing data.
- UTF-8 is the de facto standard for Internet-related protocols.
- Any ASCII text is also valid UTF-8 text.
- UTF-8 is the most simple Unicode encoding and it’s the only one that does not depend on byte order.
- It’s best to mark UTF-8 as the default encoding for any page you create.
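A few of the points above can be checked with a short Python sketch. The helper name `to_utf8` and the sample strings are made up for illustration, and the sketch assumes you already know the source encoding of the legacy data (guessing it is a separate problem).

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# ASCII text is already valid UTF-8: both encoders produce the same bytes.
ascii_bytes = "Hello, world".encode("ascii")
same_as_utf8 = ascii_bytes == "Hello, world".encode("utf-8")

# Converting from a legacy code page (ISO-8859-2 here) loses nothing.
legacy = "Română".encode("iso-8859-2")
converted = to_utf8(legacy, "iso-8859-2")
round_trip_ok = converted.decode("utf-8") == "Română"

# UTF-16 has two byte orders for the same character; UTF-8 has only one form.
utf16_le = "ă".encode("utf-16-le")
utf16_be = "ă".encode("utf-16-be")
utf8_only = "ă".encode("utf-8")

print(same_as_utf8, round_trip_ok, utf16_le != utf16_be, utf8_only)
```

The conversion goes through a decoded string on purpose: decoding with the wrong source encoding is the usual way data gets damaged, so that step is where to be careful.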
The only real disadvantage I’ve discovered when using UTF-8 is that, for some languages such as the Asian ones, the encoded text is larger than in their legacy encodings. Still, this should be no problem if you enable HTTP compression.
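To get a feel for the size difference, here is a rough sketch that compares UTF-8 with a legacy encoding, before and after compression. The function name `encoded_sizes` and the sample text are invented for this example, and `zlib` is used only as a stand-in for what HTTP compression does on the wire.

```python
import zlib

def encoded_sizes(text: str, legacy_encoding: str) -> dict:
    """Return raw and zlib-compressed byte counts for UTF-8 and a legacy encoding."""
    utf8 = text.encode("utf-8")
    legacy = text.encode(legacy_encoding)
    return {
        "utf-8": len(utf8),
        legacy_encoding: len(legacy),
        "utf-8 compressed": len(zlib.compress(utf8)),
        legacy_encoding + " compressed": len(zlib.compress(legacy)),
    }

# Hypothetical sample: a short Japanese phrase repeated to simulate a page.
sample = "日本語のテキスト" * 100
sizes = encoded_sizes(sample, "shift_jis")
for name, size in sizes.items():
    print(name, size)
```

A repeated phrase compresses far better than real content would, so treat the compressed numbers as an illustration only and measure with your own pages.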
When using UTF-8 can break things:
- If the text is going to be used on ancient devices that can only handle 8-bit characters, such as TV-related equipment or old mobile phones.
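For such devices, one possible approach is to convert the text to the single-byte code page the device expects just before sending it. The sketch below assumes the device understands ISO-8859-1, which is purely an assumption for illustration; the function name `encode_for_legacy_device` is also made up.

```python
def encode_for_legacy_device(text: str, encoding: str = "iso-8859-1") -> bytes:
    """Encode text for a device limited to one 8-bit code page.

    Characters missing from the code page are replaced with '?'.
    """
    return text.encode(encoding, errors="replace")

payload = encode_for_legacy_device("Română")
print(payload)  # 'â' exists in ISO-8859-1, 'ă' does not and becomes '?'
```

This conversion is lossy, so it makes sense to keep UTF-8 as the stored format and only downgrade at the boundary with the old device.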
“UTF-8 is the most simple Unicode encoding”
This is debatable. That would be UTF-32. The price you pay for endianness pales compared to what you have to do for processing UTF-8.
Yes, it is debatable, but I considered it the simplest because:
* it is simpler to convert an existing ASCII application to UTF-8 than to UTF-32
* sometime in the far future UTF-32 might not be enough; it did happen with UTF-16 before
* with some luck you can convince ASCII applications to work with UTF-8 strings without breaking them. This would be clearly impossible with UTF-32
* on average UTF-8 consumes far less space than UTF-32.
And about processing: I wouldn’t even try to write my own UTF-8 string parsing routines – there are good, free and open-source solutions for this.
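The points in the list above about ASCII compatibility and size can be illustrated with a small Python sketch. The sample string is invented; the byte-level `split` stands in for an ASCII-only application that knows nothing about Unicode.

```python
text = "name=Română,lang=ro"
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")

# Byte-oriented code that splits on an ASCII comma still works on UTF-8,
# because bytes below 0x80 never occur inside a multi-byte sequence.
fields = [part.decode("utf-8") for part in utf8.split(b",")]

# UTF-32 is full of zero bytes, which C-style string handling treats as
# end-of-string, so ASCII-era code cannot pass it through unchanged.
nul_in_utf8 = b"\x00" in utf8
nul_in_utf32 = b"\x00" in utf32

print(fields, len(utf8), len(utf32), nul_in_utf8, nul_in_utf32)
```

For this mostly-ASCII sample, UTF-8 needs 21 bytes and UTF-32 needs 76; the gap narrows for text dominated by characters that take three or four bytes in UTF-8.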