Chinese Computing

General Chinese Encoding Information

Computers don't speak any language; they only know numbers. For computers to work with human languages such as Chinese and English, mappings between numbers and letters or characters are defined as standards that different computers and programs understand. These agreed-upon mappings are called character sets or code sets. GB (short for "Guojia Biaozhun" or "National Standard") is the standard used in the People's Republic of China and Singapore, and it has a set of about 7,000 simplified Chinese characters. Big5 is used in Taiwan and Hong Kong and has about 13,000 traditional Chinese characters. Unicode is a standard that attempts to encode all the major languages, including Chinese. Unicode includes all the characters from GB and Big5 and more.

A character set is different from a font that supports that character set. You may have a document written using GB, but to view it you need a font that includes all the GB characters. Viewing a GB-encoded document as if it were in Big5 will produce garbage on the screen. Viewing a Chinese document in a program that assumes it is in English will also produce unintelligible text full of accented letters and symbols.
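The effect of reading a document with the wrong character set can be reproduced in a few lines. The sketch below is only an illustration: it assumes Python's built-in "gb2312", "big5" and "latin-1" codecs stand in for GB, Big5 and an "English" program, and the helper name view_as is made up for this example.

```python
# Illustration only: the same two characters stored under two character sets.
text = "中文"

gb_bytes = text.encode("gb2312")   # the numbers GB assigns to these characters
big5_bytes = text.encode("big5")   # Big5 assigns different numbers


def view_as(data: bytes, assumed_encoding: str) -> str:
    """Decode bytes the way a viewer would if it assumed `assumed_encoding`.

    view_as is a hypothetical helper for this example. Byte sequences that
    are invalid in the assumed encoding are replaced rather than raising.
    """
    return data.decode(assumed_encoding, errors="replace")


if __name__ == "__main__":
    print(gb_bytes.hex(), big5_bytes.hex())   # same text, different bytes
    print(view_as(gb_bytes, "gb2312"))        # correct assumption: readable
    print(view_as(gb_bytes, "big5"))          # wrong Chinese charset: garbage
    print(view_as(gb_bytes, "latin-1"))       # "English" viewer: accented letters
```

The last line prints accented Latin letters because each byte of the GB text is shown as a separate Western European character.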

The characters in Unicode are a superset of the characters in GB and Big5, so it is easy to convert directly from GB or Big5 into Unicode. However, while there is some overlap between GB and Big5, there are also many simplified characters in GB that are not in Big5, and many traditional characters in Big5 that are not in GB. Consequently, conversion from GB to Big5 is not trivial, since many simplified characters map to multiple traditional equivalents in Big5. Going from Big5 to GB is easier, since the conversion from traditional to simplified is much less ambiguous.
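A minimal sketch of these points, again assuming Python's built-in "gb2312" and "big5" codecs. The mapping table is a hypothetical two-entry sample chosen for illustration, not a real conversion table, and the function names are made up for this example.

```python
# Hypothetical two-entry sample table; a real table has thousands of entries.
SIMPLIFIED_TO_TRADITIONAL = {
    "语": ["語"],        # one traditional equivalent: unambiguous
    "发": ["發", "髮"],  # two traditional equivalents: context must decide
}


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def gb_to_unicode(data: bytes) -> str:
    """GB to Unicode is a direct lookup, since Unicode is a superset of GB."""
    return data.decode("gb2312")


def traditional_candidates(ch: str) -> list:
    """List the traditional forms a simplified character may stand for.

    Characters missing from the sample table are returned unchanged.
    """
    return SIMPLIFIED_TO_TRADITIONAL.get(ch, [ch])


if __name__ == "__main__":
    # Characters shared by both sets pass straight through Unicode.
    shared = gb_to_unicode("中文".encode("gb2312"))
    print(shared.encode("big5").hex())

    # A simplified character has no Big5 code, and vice versa.
    print(can_encode("语", "gb2312"), can_encode("语", "big5"))
    print(can_encode("語", "gb2312"), can_encode("語", "big5"))

    # GB to Big5 needs a choice whenever more than one candidate exists.
    print(traditional_candidates("发"))
```

Choosing among several candidates generally requires looking at the surrounding words, which a character-by-character table cannot do on its own.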

Charset Conversion

Charset Detection