Grauw’s blog
Character set vs. character encoding
Recently I was asked to explain the difference between character encoding and character set, and I thought it would be interesting to write about this over here as well.
In these two terms, ‘set’ refers to the set of characters and their numbers (code points), and ‘encoding’ refers to the representation of these code points. For example, Unicode is a character set, and UTF-8 and UTF-16 are different character encodings of Unicode.
To illustrate this difference: in the Unicode character set, the € character has code point 8364 (usually written as U+20AC, in hexadecimal notation). Using the UTF-16LE character encoding this is stored as AC 20, while UTF-16BE stores it as 20 AC, and the UTF-8 representation is E2 82 AC.
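A quick sketch in Python makes this concrete (assuming Python 3.8 or newer, where bytes.hex() accepts a separator):

ch = "\u20ac"                           # the € character, code point 8364 (U+20AC)
print(ord(ch))                          # 8364, the code point, independent of any encoding
print(ch.encode("utf-16-le").hex(" "))  # ac 20
print(ch.encode("utf-16-be").hex(" "))  # 20 ac
print(ch.encode("utf-8").hex(" "))      # e2 82 ac

The code point stays the same; only its byte representation changes with the chosen encoding.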
In practice, however, the two terms are used interchangeably. The difference described above does not apply to most non-Unicode character sets (such as Latin-1 and SJIS), because their code points are identical to their encoded values. Because of that, there has never been a real distinction from a historical perspective.
The most important difference in English is that the term character set is a little old-fashioned, and character encoding is the term most commonly used nowadays. The likely reason is that it is simply more correct to speak of a character encoding now that UTF-8 and UTF-16 are different possible encodings of the same character set.
Some examples:
- The HTTP protocol uses Content-Type: text/html; charset=UTF-8
- The more recent XML uses <?xml version="1.0" encoding="UTF-8"?>
This illustrates how they are used synonymously. Both describe the character encoding of the content that follows.
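Either way, the receiver simply uses the declared name to turn the bytes back into characters. Roughly, in Python (a small sketch of my own):

raw = "Price: \u20ac5".encode("utf-8")   # the bytes that travel over the wire
print(raw.decode("utf-8"))               # what a browser does when told charset=UTF-8

# An XML parser reads the encoding declaration from the document itself:
import xml.etree.ElementTree as ET
doc = ET.fromstring(b'<?xml version="1.0" encoding="UTF-8"?><p>\xe2\x82\xac</p>')
print(doc.text)                          # €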
p.s. the Dutch terms for this are “tekenset” or “karakterset”. The former is slightly more modern than the latter. There is no distinction between character set and encoding in Dutch.
Grauw
Comments
a little confused by Raghav at 2009-03-24 07:30
I could not get the main difference.
If ‘A’ is a character, is its Unicode value part of the character encoding?
Re: a little confused by Grauw at 2009-03-24 16:55
For the letter A, the code point in the Unicode character set is 65 (usually written as U+0041, in hexadecimal notation). The UTF-8 encoding of that code point is a single byte with value 41.

If a code point is 128 or above, the difference between a character set’s code point and a character’s encoding becomes a little more clear in UTF-8. For example, the character ö has number 246 (U+00F6), and is encoded in UTF-8 as two bytes: C3 B6. If you look at the bits of those two bytes though, the bits of the code point are still there, and they can be decoded back to the Unicode code point.
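In Python you can see those bits for yourself (just a small sketch):

ch = "\u00f6"                              # ö, code point 246
utf8 = ch.encode("utf-8")                  # b'\xc3\xb6'
print(" ".join(f"{b:08b}" for b in utf8))  # 11000011 10110110
# Two-byte UTF-8 follows the pattern 110xxxxx 10xxxxxx; collecting the x bits
# gives the original code point back:
print(((utf8[0] & 0b00011111) << 6) | (utf8[1] & 0b00111111))   # 246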
Hope that makes it a little more clear.
Easy and clear explanation by aspd at 2009-11-18 07:28
You have covered all the key points in an easily understandable manner.
techwriter by at 2009-11-28 20:17
thanks for this clear distinction
still left over issues by Kalyan at 2010-03-25 12:14
I completely agree with the explanation here. But there is a catch which no one could explain to me. If character set and character encoding are different, what are you specifying with the Content-Type meta tag in HTML? According to you, Content-Type: text/html; charset=UTF-8 actually means it uses the UTF-8 character ‘encoding’, which is right, but what is the character set here? Let’s assume that when typing in Windows, the default is the Windows character set. This is not guaranteed to be compatible with the Unicode character set. How was this resolved?
Re: still left over issues by Grauw at 2010-03-26 17:19
The use of ‘charset’ in the HTTP headers is simply because of the term’s legacy that I explained; originally, when HTTP was designed, Unicode was still in its early design phases, and commonly character sets had a direct mapping to their binary representation. In fact before UTF-16 was introduced, Unicode also had a direct mapping of character code to 16-bit word value. So people spoke in terms of character sets.
Then later on as the need grew to make the distinction, the term ‘character encoding’ was invented to cater for this difference. However HTTP by then was already using the term charset, and renaming this to be slightly more accurate would break compatibility for little gain.
As for the character set in browsers, actually all browsers (and Windows NT-based systems as well) use Unicode internally, because practically all other character sets contain only characters that are also in Unicode. If a web page is delivered in ISO-8859-1 format, the characters in this encoding are converted to Unicode code points.
In fact, the first 256 code points in Unicode were made identical to those in ISO-8859-1 to make this mapping easier. So in a way, you could even see ISO-8859-1 as a (limited) character encoding for Unicode :).
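A small sketch of that mapping in Python:

data = bytes([0x61, 0xF6, 0xA9])   # 'a', 'ö', '©' in ISO-8859-1
text = data.decode("iso-8859-1")
print(text)                        # aö©
print([ord(c) for c in text])      # [97, 246, 169], the code points equal the byte values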
Some older Windows programs don’t use Unicode yet, and those will indeed have problems with international content. A lot of music players, for example, used to have trouble with MP3 tags in, say, Japanese. Fortunately nowadays this is rarely an issue anymore.
still confused by hrr at 2011-04-01 19:02
For my server, I have it set to Content-Type: text/html; charset=utf-8
When I enter the Unicode character € in my forms, by the time it reaches the server it becomes %E2%82%AC, which is its UTF-8 value, right? But how do I make my server store Unicode code points, so that what I see is the actual character and not its value in UTF-8?
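For instance, this is roughly what I mean (a Python sketch, assuming the value arrives percent-encoded from the form):

from urllib.parse import unquote
value = unquote("%E2%82%AC")   # undoes the percent-encoding of the UTF-8 bytes E2 82 AC
print(value)                   # €, already a string of code points at this point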
Nice Post by Syed Abdul Baqi at 2009-03-20 06:47
Nice post...
Thanks!!!