UTF-8 Character Encoding · May 6, 07:20 PM by Dylan Doxey
UTF-8 character encoding? What does it all mean?!?!
Question: Why does the character "é" sometimes get corrupted into "Ã©"?
Explanation:
UTF-8 encodes each character as one to four bytes, using these bit patterns:
0xxxxxxx -- seven-bit characters (ASCII/Unicode values 0 through 127)
110xxxxx 10xxxxxx -- 110 marks the first byte of a two-byte representation, 10 marks a continuation byte
1110xxxx 10xxxxxx 10xxxxxx -- 1110 marks the first byte of a three-byte representation
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx -- 11110 marks the first byte of a four-byte representation
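The table above can be turned into a small lookup. This is only a sketch in Python, and the function name utf8_sequence_length is my own invention, not part of any library. It checks only the leading bits listed in the table and does not validate anything else about the sequence.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged by its first byte."""
    if (lead_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8.
    raise ValueError("not a valid first byte of a UTF-8 sequence")


print(utf8_sequence_length(0x41))  # "A" is a single byte
print(utf8_sequence_length(0xC3))  # first byte of a two-byte sequence
```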
Take é as an example:
é == 233 (Unicode code point U+00E9, the same value as in Latin-1; it is outside seven-bit ASCII) == 11101001 (binary)
Its UTF-8 representation is built by splitting off the low six bits and padding the rest:
11101001 -> (11)(101001) -> [110]000(11),[10](101001) -> 11000011,10101001
Where the pattern [110]xxxxx,[10]xxxxxx indicates a two-byte character with up to eleven bits of data.
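The same bit shuffling can be written out by hand. The sketch below is in Python and covers only the two-byte case. The helper name encode_two_byte is hypothetical, and it is compared against Python's built-in encoder as a sanity check.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the two-byte pattern")
    lead = 0b11000000 | (code_point >> 6)                   # 110xxxxx: top five bits
    continuation = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx: low six bits
    return bytes([lead, continuation])


encoded = encode_two_byte(233)  # 233 is é
print(format(encoded[0], "08b"), format(encoded[1], "08b"))  # 11000011 10101001
assert encoded == "\u00e9".encode("utf-8")  # matches the built-in encoder
```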
Therefore, a program that does not know it is processing UTF-8 encoded text will mistakenly interpret every 8 bits as one character in a single-byte encoding such as Latin-1 (ISO-8859-1).
Specifically:
Latin-1 interpretation of 11000011,10101001 is 11000011 & 10101001 -> Ã & ©.
UTF-8 interpretation of 11000011,10101001 is [110]00011,[10]101001 -> 00011101001 -> 233 -> é.
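You can reproduce the corruption in a few lines. This sketch is in Python and assumes the misreading program uses Latin-1; other single-byte encodings, such as Windows-1252, give the same result for these two bytes. The variable names are just for illustration.

```python
utf8_bytes = "\u00e9".encode("utf-8")    # é as UTF-8: the two bytes C3 A9

misread = utf8_bytes.decode("latin-1")   # each byte treated as its own character
correct = utf8_bytes.decode("utf-8")     # both bytes treated as one character

print(misread)  # Ã©
print(correct)  # é

# As long as no bytes were lost, the damage can be undone by reversing the wrong step.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)  # é
```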
Question: What happens to UTF-8 characters in URLs?
The browser will percent-encode the string (each UTF-8 byte becomes a % followed by two hex digits) and, as a security feature, may leave it encoded in the address bar of your browser. (The aim is to make it harder for malicious web developers to set up a lookalike of wellsfargo.com spelled with Cyrillic characters, for example.)
Percent-encoded UTF-8 strings are easy to spot in your address bar because of the distinctive byte pattern of UTF-8, shown here as hex digits:
110xxxxx 10xxxxxx => [C-D][0-F] [8-B][0-F]
1110xxxx 10xxxxxx 10xxxxxx => E[0-F] [8-B][0-F] [8-B][0-F]
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx => F[0-7] [8-B][0-F] [8-B][0-F] [8-B][0-F]
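Those three hex patterns translate fairly directly into a regular expression. The sketch below is in Python, and the names PERCENT_UTF8 and find_utf8_escapes are made up for this example. It is deliberately as loose as the patterns above, so it will also accept a few byte combinations that strict UTF-8 forbids.

```python
import re

# One alternative per multi-byte pattern; hex digits may be upper or lower case.
PERCENT_UTF8 = re.compile(
    r"""
      %[CD][0-9A-F] (?:%[89AB][0-9A-F])      # two-byte sequence
    | %E[0-9A-F]    (?:%[89AB][0-9A-F]){2}   # three-byte sequence
    | %F[0-7]       (?:%[89AB][0-9A-F]){3}   # four-byte sequence
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_utf8_escapes(url: str) -> list[str]:
    """Return every percent-encoded multi-byte UTF-8 sequence found in url."""
    return [match.group(0) for match in PERCENT_UTF8.finditer(url)]


print(find_utf8_escapes("http://example.com/caf%C3%A9"))  # ['%C3%A9']
print(find_utf8_escapes("http://example.com/a%20b"))      # []
```

Note that %20 (a space) is not reported, because 0x20 is a plain seven-bit character.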
Consider the URL: http://yukistore.com/日立
This will probably appear in your browser as: http://yukistore.com/%E6%97%A5%E7%AB%8B
That's because this website is UTF-8 as declared in the charset meta tag in the HTML header.
This is clearly a pair of three-byte UTF-8 Unicode characters because it matches the pattern twice in a row:
E[0-F] [8-B][0-F] [8-B][0-F] E[0-F] [8-B][0-F] [8-B][0-F]
In Perl you might write:
# Match one percent-encoded three-byte UTF-8 character (/i allows lowercase hex digits).
$url =~ m/ %E[0-9A-F] %[89AB][0-9A-F] %[89AB][0-9A-F] /imsx;
Obviously!
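If you would rather let a library do the work, the round trip for the example URL looks like this. This is a sketch using Python's standard urllib.parse module, which percent-encodes with UTF-8 by default. The variable names are mine.

```python
from urllib.parse import quote, unquote

path = "/日立"

encoded = quote(path)      # UTF-8 bytes, percent-encoded; "/" is left alone
decoded = unquote(encoded)

print(encoded)  # /%E6%97%A5%E7%AB%8B
print(decoded)  # /日立
```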

