Chinese Computing Newsletter; July, 2000
========================================

CONTENTS
========
* Java JDK 1.3.1 Fixes Chinese Input Problems; JDK 1.4 Beta Notes
* English to Chinese Translation at Readworld.com
* Chinese Pen Wireless Chinese Handwriting Recognition System
* Unicode 3.1 Released
* IBM Chinese Handwriting Recognizer for Palm
* IBM Cantonese Text to Speech
* Office XP Includes Chinese Speech Recognition
* Chinese Domain Names using RACE Encoding
* Site of the Month: AEG, Inc.
* Code Sample of the Month: Convert UTF-8 HTML to NCR's in Perl

ARTICLES
========

** Java JDK 1.3.1 Fixes Chinese Input Problems; JDK 1.4 Beta Notes

Programs written in Java 1.3 and run on English versions of Microsoft
Windows had problems using Microsoft's Chinese input methods.
Characters would appear as question marks instead of Chinese
characters. Programs run using JDK 1.3.1 no longer have this problem.
Another solution for users of Windows 2000 is to switch your default
locale to traditional or simplified Chinese.

In other Java news, the new JDK 1.4 Beta includes several new
features, including direct access to the encoding converters that
formerly could only be accessed through the I/O classes or String
methods. Also included is a regular expression package. Because the
package is Unicode-based, implementing Chinese regular expressions
should become easier, without needing to worry about the double-byte
issues that exist in other regex packages. Finally, JDK 1.4 has been
updated to Unicode 3.0 as its base encoding. Unicode 3.0 has a variety
of Chinese-related enhancements, including more CJK characters and
ideographic description sequences.

Related Links
http://java.sun.com/j2se/1.3/docs/guide/intl/faq.html#Text%20Rendering
http://www.javasoft.com/j2se/1.3/
http://www.microsoft.com/msdownload/iebuild/ime5_win32/en/ime5_win32.htm
http://java.sun.com/j2se/1.4/relnotes.html

** ReadWorld English to Chinese Translation Service

Readworld includes an English to Chinese translation service.
Included is the ability to translate web pages, free text, and also
e-mails. Users can translate from English into Simplified or
Traditional Chinese. A free downloadable program is also available to
run on your own computer. (Contributed by Nelson Chin)

Related Links
http://www.readworld.com/tran/

** Chinese Pen: Wireless Chinese Handwriting Recognition System

TwinBridge has recently started selling a Chinese Handwriting
Recognition System with a wireless pen. The system can recognize both
simplified and traditional Chinese characters, as well as English and
numbers. The software includes the ability to adapt to individual
users' writing styles.

Related Links
http://www.twinbridge.com/Products/pen/new%20chinese%20pen%20USB.htm

** Unicode 3.1 Released

Unicode version 3.1 was released in May, 2001 and has several
additions and changes related to Chinese. For the first time,
characters have been encoded beyond the Basic Multilingual Plane.
"The Supplementary Ideographic Plane, or Plane 2, contains a very
large collection of additional unified Han ideographs known as
Vertical Extension B, comprising 42,711 characters, as well as 542
additional CJK Compatibility ideographs." This brings the total number
of Unified Han ideographs in Unicode to 70,207, plus 832 CJK
Compatibility ideographs. Minor changes have also been made concerning
CJK punctuation, East Asian Character Width, and bopomofo.

Related Links
http://www.unicode.org/unicode/reports/tr27/

** IBM Chinese Handwriting Recognizer for Palm

From the Website:

IBMCCR for Palm is an embedded Chinese handwriting recognizer designed
for PalmOS. You can use this program to input Chinese characters to
any other Palm applications.

IBM CCR for Palm is an embedded handwriting recognizer designed for
low resource PDAs like IBM Workpad, 3COM PalmPilot etc. IBM CCR is
implemented as a generic input method for Palm, you can invoke the
recognizer from any other Palm application and input Chinese
characters to that program.
At the same time, two handwriting areas are implemented in this
application. You can write characters in the two boxes alternatively.

The simplified Chinese version of IBM CCR for Palm can recognize all
the Chinese characters in level-1 GB2312 and 577 frequently used
GB2312 level-2 characters. To sum up, it can recognize 4332 characters
totally.

Related Links
http://www.alphaworks.ibm.com/aw.nsf/techmain/4B70C6AE31F1D58F88256A100006DBE2?OpenDocument

** IBM Cantonese Text to Speech

From the Website:

ECI (Eloquence Command Interface) is a library that provides an
interface between applications and the IBM ViaVoice Outloud
text-to-speech system. Version 5.0 of ECI has been re-architected to
provide support for multiple, concurrent speech synthesis threads, and
a consistent interface on all supported platforms.

As in prior versions of ECI, text is appended to the input buffer.
Each word takes its voice definition from the active voice. Speech is
synthesized from the input buffer according to the associated voice
parameters, placed in the output audio buffer, and sent to the
appropriate destination.

The active voice can be set from a number of built-in voices. The
language, the dialect, and the voice parameters can be modified
individually using either ECI function calls, or annotations inserted
into the input buffer with the input text. As text is added to the
input buffer, the active voice definition is stored with it, so that
changes to the active voice do not affect text already in the input
buffer.

Indices can be used to determine when the delimited text fragment has
been synthesized. A message will be received when all text inserted
before the index has been synthesized.

Output can be sent to one of three types of destinations: a callback
function, a file, or an audio device. These destination types are
mutually exclusive, so sending output to one of them turns off output
to the previous destination. The default is to send to an available
audio device.

Related Links
http://www.alphaworks.ibm.com/aw.nsf/techmain/7C685C5F5A02B745882569A7007939C0?OpenDocument

** Office XP Includes Chinese Speech Recognition

The latest version of Microsoft's Office suite, Office XP, will
include Chinese speech recognition software. "The Office XP speech
feature, which Microsoft claims can recognize 60,000 words in Mandarin
Chinese, also works with Japanese and English. Characters appear on
the screen as the user speaks into a microphone. Microsoft says the
program analyzes context of words and gets them right about 90 percent
of the time." The Chinese version of Office XP has been available
since June.

Related Links
http://www.siliconvalley.com/docs/news/svfront/xp053101.htm
http://research.microsoft.com/labs/beij.asp
http://research.microsoft.com/speech/

** Standards Bodies Still Working on Chinese Character Domain Names

A system to allow multilingual domain names is still moving toward
official deployment and use. Verisign currently allows registration of
domain names in up to 39 different scripts encoding up to 350
different languages (including simplified and traditional Chinese
characters), but this system is still in the testing stages. The
primary method under consideration uses Unicode as the principal
character set, then applies some compression techniques and a
Base32-style encoding to put the name in a form acceptable to current
domain name servers (ASCII letters, digits, and the hyphen). This
method is called RACE, for Row-based ASCII Compatible Encoding for
Internationalized Domain Names.

Related Links
http://www.siliconvalley.com/docs/news/svfront/icann061801.htm
http://www.ietf.org/internet-drafts/draft-ietf-idn-race-03.txt
http://global.networksolutions.com/en_US/purchasing/welcome.jhtml

** Site of the Month: AEG, Inc.

As part of his Ph.D. dissertation at the University of Hawaii, Manoa,
Roderick Gammon is providing a set of free, open source tools for
Chinese NLP.

From the website:

CMCG illustrates a Cognitive Grammarian (CG) explanation of Mandarin
quantifying determiner phrases (QDP) using a software model applied to
a descriptive survey. That purpose is divided into three goals. First
CG is detailed such that it may be used as a blueprint for a
mechanical model that parses and generates QDP. Second, actual
Mandarin QDP are analyzed such that the general model may be
sufficiently instantiated for application to Mandarin texts. Finally,
the mechanical model itself must be derived from the theoretical model
and applied to the descriptive data as a test of efficacy.

Related Links
http://www.aeg-inc.net/cuttingEdge/main.html

** Code Sample of the Month: Convert UTF-8 HTML to NCR's in Perl

#!/usr/local/bin/perl

# Convert UTF-8 encoded files to use numerical character references,
# e.g. the character 係 (U+4FC2) is written as &#20418;

if ($#ARGV == -1) {
    print "Please supply the name of the file to convert.\n";
    exit 1;
}

open(FD, $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";

while ($inline = <FD>) {
    chomp($inline);
    @chars = ();
    $outline = "";

    # Split the line into UTF-8 sequences of one, two, or three bytes.
    (@chars) = ($inline =~ m/([\x01-\x7f]|
                              [\xc0-\xdf][\x80-\xbf]|
                              [\xe0-\xef][\x80-\xbf][\x80-\xbf])/xg);

    foreach $char (@chars) {
        if (length($char) == 1) {
            # Plain ASCII passes through unchanged.
            $outline .= $char;
        } elsif (length($char) == 2) {
            # 110xxxxx 10xxxxxx: 5 + 6 payload bits
            $unival = (vec($char, 0, 8) & 0x1f) * 0x40 +
                      (vec($char, 1, 8) & 0x3f);
            $outline .= "\&#$unival;";
        } elsif (length($char) == 3) {
            # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
            $unival = (vec($char, 0, 8) & 0x0f) * 0x1000 +
                      (vec($char, 1, 8) & 0x3f) * 0x40 +
                      (vec($char, 2, 8) & 0x3f);
            $outline .= "\&#$unival;";
        }
    }
    print $outline . "\n";
}
close(FD);

--------------------------------------------------------------------

Please send suggestions for future Chinese Computing Newsletter items
to erik@chinesecomputing.com. Submissions are welcome and will receive
credit. Past issues of the newsletter can be accessed through the
www.chinesecomputing.com site. Feel free to redistribute the
newsletter for non-commercial use as long as you retain this notice.
To remove yourself from the list, send an e-mail to newsletter@chinesecomputing.com. On the subject line write "remove your@email-address.com". If you received the newsletter through the CCNet Chinese Computing mailing list, you must unsubscribe from that list directly.
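
A follow-up note on the Code Sample of the Month: the sample in this
issue handles UTF-8 sequences of up to three bytes, which covers the
Basic Multilingual Plane. The Plane 2 ideographs added in Unicode 3.1
need four bytes in UTF-8. The sketch below is one possible way to
extend the same idea to four-byte sequences. It is not part of the
original sample: the helper name utf8_to_ncr is hypothetical, and it
assumes its input is a string of raw, well-formed UTF-8 bytes (bytes
that do not fit the pattern are silently skipped, as in the original).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original sample): convert a string of
# raw UTF-8 bytes to ASCII plus decimal numeric character references.
# Assumes well-formed input; bytes that do not match are dropped.
sub utf8_to_ncr {
    my ($bytes) = @_;
    my $out = '';

    # Split into UTF-8 sequences of one, two, three, or four bytes.
    my @chars = $bytes =~ m/([\x00-\x7f]
                            |[\xc0-\xdf][\x80-\xbf]
                            |[\xe0-\xef][\x80-\xbf]{2}
                            |[\xf0-\xf7][\x80-\xbf]{3})/xgs;

    for my $char (@chars) {
        my @b = unpack('C*', $char);
        if (@b == 1) {
            # Plain ASCII passes through unchanged.
            $out .= $char;
            next;
        }
        # Payload mask for the lead byte, indexed by sequence length.
        my $lead_mask = (0, 0, 0x1f, 0x0f, 0x07)[scalar @b];
        my $code = shift(@b) & $lead_mask;
        # Each continuation byte contributes its low six bits.
        $code = ($code << 6) | ($_ & 0x3f) for @b;
        $out .= "&#$code;";
    }
    return $out;
}

# "A", then U+4FC2 (three bytes), then U+20000 (four bytes, Plane 2)
print utf8_to_ncr("A\xe4\xbf\x82\xf0\xa0\x80\x80"), "\n";
# prints A&#20418;&#131072;
```

To process a file, the helper could be called on each line read from
the file handle, in place of the per-character loop in the sample.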