Chinese Computing

Charset Attribute

When sending the HTTP header, the CGI program can also specify the character set used by the dynamic HTML page, so that the receiving browser automatically displays the page with the correct character set and fonts. To do this, append a charset parameter to the content type. Examples in Perl:

print "Content-type: text/html; charset=gb2312\n\n";
or
print "Content-type: text/html; charset=big5\n\n";

In the past, Netscape and Internet Explorer used different charset names for GB2312 and Big5, but the two names used above currently seem to work in both browsers.
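
As a minimal sketch of how a script might choose between the two headers above, the following picks the charset from a lookup table. The names %charset_for and html_header, and the keys "simplified" and "traditional", are hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table: which charset name to announce for each script.
my %charset_for = (
    simplified  => "gb2312",
    traditional => "big5",
);

# Hypothetical helper: build the HTTP header for an HTML page.
# Unknown keys fall back to gb2312 (an arbitrary choice for this sketch).
sub html_header {
    my ($script) = @_;
    my $charset = $charset_for{$script} || "gb2312";
    return "Content-type: text/html; charset=$charset\n\n";
}

# The header must be printed before any of the page body.
print html_header("traditional");
print "<html><body>...</body></html>\n";
```

Keeping the charset names in one table means that only one place needs to change if a browser turns out to require a different name.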

Decoding Unicode Escapes

Sometimes Internet Explorer sends Chinese characters entered in a form as escapes based on their Unicode values. Such an escape looks something like &#19968; (the decimal Unicode value of the character 一, U+4E00). The CGI program first needs to convert the escape back into a Unicode character, or into GB or Big5 using the appropriate conversion tables.

Below is a sample Perl function that converts one such escape into a two-byte Unicode value, high byte first:

sub touni {
    my($escape) = @_;

    # Strip the surrounding &# and ; to leave the decimal Unicode value.
    $escape =~ s/\&\#(\d+);/$1/;

    # Convert to hex, zero-padded to four digits so that both substr()
    # calls below always see two digits. Values above FFFF are not handled.
    my($hexval) = sprintf("%04X", $escape);

    # Pack the high byte, then the low byte.
    return pack("CC", hex(substr($hexval, 0, 2)), hex(substr($hexval, 2, 2)));
}
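
For converting the escapes straight to GB or Big5, one possible approach (not the table-based method described above) is to let Perl's standard Encode module do the mapping. The sketch below assumes Perl 5.8 or later, where Encode is part of the core distribution; the function name decode_escapes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: replace every decimal escape (&#NNNNN;) in $text
# with the bytes of that character in the target charset.
# Characters that do not exist in the target charset come out as
# Encode's substitution character.
sub decode_escapes {
    my ($text, $charset) = @_;
    $text =~ s/&#(\d+);/encode($charset, chr($1))/ge;
    return $text;
}

# Show the resulting bytes in hex for one escape in each charset.
for my $charset ("gb2312", "big5") {
    my $bytes = decode_escapes("&#19968;", $charset);
    printf "%s: %s\n", $charset, join(" ", map { sprintf "%02X", ord } split //, $bytes);
}
```

Text in the form value that is not an escape passes through unchanged, so the function can be applied to a whole field at once.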