Sunday, July 19, 2009

What is UTF-8?

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.

UTF-8 encodes each character in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.
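The 1-to-4-octet range is easy to see by encoding a few characters. The following is a small illustrative sketch in Python; the helper name utf8_bytes and the sample characters are my own choices, not part of any standard.

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character (hypothetical helper)."""
    return ch.encode("utf-8")


# One sample character for each encoded length, from 1 to 4 octets.
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = utf8_bytes(ch)
    # Show the code point, the number of octets, and the octets in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} octet(s): {encoded.hex()}")
```

Only the first sample, an ASCII character, fits in a single octet. The others need two, three, and four octets respectively.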

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.

ADVANTAGES:
1. The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take byte strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than to any other Unicode encoding.
2. UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly either externally or through a text declaration.
3. UTF-8 and UTF-16 are the standard encodings for having Unicode in HTML documents, with UTF-8 as the preferred and most used encoding.
4. UTF-8 strings can be fairly reliably recognized as such by a simple algorithm.
5. Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.
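Points 1, 4 and 5 can be checked with a few lines of Python. This is only a rough sketch: the function name looks_like_utf8 and the sample strings are made up for illustration, and the check uses Python's built-in strict decoder rather than any particular "simple algorithm" for recognition.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Point 1: every byte of a non-ASCII character is >= 0x80, so a byte
# such as "/" (0x2F) can never show up inside a multi-byte sequence.
text = "naïve/€/😀"
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F
                   for b in ch.encode("utf-8")]
print(all(b >= 0x80 for b in non_ascii_bytes))

# Point 4: text in another 8-bit encoding rarely passes as valid UTF-8.
print(looks_like_utf8("héllo".encode("utf-8")))
print(looks_like_utf8("héllo".encode("latin-1")))

# Point 5: sorting the encoded bytes gives the same order as sorting
# by code point (Python compares str values by code point).
words = ["zebra", "éclair", "apple", "€uro", "😀"]
by_code_point = sorted(words)
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
print(by_code_point == by_bytes)
```

The Latin-1 bytes are rejected because 0xE9 announces a three-byte sequence but is followed by a plain ASCII byte. A real detector would also have to decide what to do with pure-ASCII input, which is valid in both encodings.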
