
UTF-8 Character Encoding · May 6, 07:20 PM by Dylan Doxey

UTF-8 character encoding? What does it all mean?!?!

Question: Why does the character "é" sometimes get corrupted into "Ã©"?

Explanation:

UTF-8 encodes each character as one to four bytes, using one of these bit patterns:

        0xxxxxxx -- one byte: seven-bit characters (ASCII, i.e. Unicode values 0 through 127)
        110xxxxx 10xxxxxx -- 110 marks the lead byte of a two-byte representation; 10 marks a continuation byte
        1110xxxx 10xxxxxx 10xxxxxx -- 1110 marks the lead byte of a three-byte representation
        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx -- 11110 marks the lead byte of a four-byte representation
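
As a rough illustration of the table above, here is a minimal Python sketch that reads the leading bits of a byte and reports how long the sequence is. The function name `sequence_length` is made up for this example, and it only inspects the lead byte; it does not validate the continuation bytes.

```python
def sequence_length(lead_byte: int) -> int:
    """Return the UTF-8 sequence length announced by a lead byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a sequence.
    raise ValueError("not a valid lead byte")

print(sequence_length(0x41))  # 1 (the letter A)
print(sequence_length(0xC3))  # 2
print(sequence_length(0xE6))  # 3
```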

Therefore:

        é == 233 (Unicode U+00E9, beyond the 7-bit ASCII range) == 11101001 (binary)

The UTF-8 representation of that value is built like this:

        11101001 -> (11)(101001) -> [110]000(11),[10](101001) -> 11000011,10101001 

Where the pattern [110]xxxxx,[10]xxxxxx indicates a two byte character with up to eleven bits of data.
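
The same bit shuffling can be sketched in Python. This is only a hand-rolled illustration of the two-byte case (the helper name `encode_two_byte` is invented here); in real code you would just call `str.encode("utf-8")`, which the last line uses as a cross-check.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF using the two-byte pattern."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the two-byte pattern")
    lead = 0b11000000 | (code_point >> 6)                # [110] + top five bits
    continuation = 0b10000000 | (code_point & 0b111111)  # [10] + low six bits
    return bytes([lead, continuation])

encoded = encode_two_byte(233)
print(",".join(f"{b:08b}" for b in encoded))   # 11000011,10101001
print(encoded == chr(233).encode("utf-8"))     # True
```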

Therefore a program that does not know it is processing UTF-8 encoded text will mistakenly treat every byte as one character of a single-byte encoding.
Specifically:

        Non-UTF-8 (e.g. Latin-1) interpretation of 11000011,10101001 is 11000011 & 10101001 -> Ã & ©. 
        UTF-8 interpretation of 11000011,10101001 is [110]00011,[10]101001 -> 00011101001 -> é. 
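
The corruption is easy to reproduce. The sketch below assumes the confused program reads the bytes as Latin-1 (ISO-8859-1); that specific encoding is an assumption for the sake of the example, and the function names are invented.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")

def repair(garbled: str) -> str:
    """Undo the mistake: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")

garbled = misread_as_latin1("\u00e9")      # "\u00e9" is é
print([hex(ord(c)) for c in garbled])      # ['0xc3', '0xa9'], i.e. Ã and ©
print(repair(garbled) == "\u00e9")         # True
```

The repair only works while the garbled text still holds exactly the original bytes; once it has been re-saved or mangled a second time, the round trip may fail.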

Question: What happens to UTF-8 characters in URLs?

The browser will percent-encode the string (each byte becomes % followed by two hex digits) and, as a security feature, will generally leave it encoded in the address bar of your browser. (To prevent malicious web developers from setting up a website on wellsfargo.com spelled with Cyrillic characters, for example.)

UTF-8 strings are easy to spot in your address bar because of the distinctive encoding pattern of UTF-8.

        110xxxxx 10xxxxxx                   => [C-D][0-F] [8-B][0-F] 
        1110xxxx 10xxxxxx 10xxxxxx          => E[0-F] [8-B][0-F] [8-B][0-F] 
        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx => F[0-7] [8-B][0-F] [8-B][0-F] [8-B][0-F] 
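
If the jump from bit patterns to hex digits is not obvious, this small Python sketch computes the lowest and highest byte that can start with a given bit prefix. The helper name `hex_range` is made up for illustration.

```python
def hex_range(prefix_bits: str) -> tuple:
    """Return the lowest and highest byte (as hex) starting with prefix_bits."""
    free = 8 - len(prefix_bits)                 # number of x bits left over
    low = int(prefix_bits + "0" * free, 2)      # all x bits set to 0
    high = int(prefix_bits + "1" * free, 2)     # all x bits set to 1
    return f"{low:02X}", f"{high:02X}"

print(hex_range("110"))    # ('C0', 'DF')
print(hex_range("1110"))   # ('E0', 'EF')
print(hex_range("11110"))  # ('F0', 'F7')
print(hex_range("10"))     # ('80', 'BF')
```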

Consider the URL: http://yukistore.com/日立

This will probably appear in your browser as: http://yukistore.com/%E6%97%A5%E7%AB%8B
That's because this website is UTF-8 as declared in the charset meta tag in the HTML header.
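
You can reproduce that encoding without a browser. Here is a minimal sketch using Python's standard `urllib.parse` module, whose `quote` function encodes as UTF-8 by default; the path segment is the one from the example URL.

```python
from urllib.parse import quote, unquote

segment = "\u65e5\u7acb"               # the two characters 日立
encoded = quote(segment)               # UTF-8 bytes, each written as %XX
print(encoded)                         # %E6%97%A5%E7%AB%8B
print(unquote(encoded) == segment)     # True
```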

This is clearly a pair of three-byte UTF-8 Unicode characters because it matches the pattern:

        E[0-F] [8-B][0-F] [8-B][0-F] E[0-F] [8-B][0-F] [8-B][0-F]

In Perl you might write:

$url =~ m/ %E[0-9A-F] %[89AB][0-9A-F] %[89AB][0-9A-F] /msx;
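
For comparison, here is a rough Python equivalent of that match, using `re.VERBOSE` so the pattern can keep the same spaced-out layout. The names `THREE_BYTE` and `count_three_byte` are invented for this sketch, and like the Perl version it only looks for uppercase hex digits.

```python
import re

# One percent-encoded three-byte UTF-8 sequence: lead byte E0-EF,
# followed by two continuation bytes in the range 80-BF.
THREE_BYTE = re.compile(
    r" %E[0-9A-F] %[89AB][0-9A-F] %[89AB][0-9A-F] ",
    re.VERBOSE,
)

def count_three_byte(url: str) -> int:
    """Count the three-byte UTF-8 sequences found in a percent-encoded URL."""
    return len(THREE_BYTE.findall(url))

print(count_three_byte("http://yukistore.com/%E6%97%A5%E7%AB%8B"))  # 2
```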


Obviously!
