Chinese Computing Newsletter; September, 2000
=============================================

CONTENTS
========
* CE-STAR for PocketPC Review
* Automatic Chinese Weather Line
* Google CJK Search
* Upcoming Chinese Language Processing Conference
* Penn Chinese Treebank Now Available
* Chinese Partner 2000 for Win 95/98 (CP2000 All IN ONE)
* Updated CEDICT Chinese/English Dictionary Available
* Site of the Month: Chih-Hao Tsai's Technology Page
* Code Sample of the Month: Creating Chinese Character Numbers

ARTICLES
========

** CE-STAR for PocketPC: Review by Daniel Baird

CE-Star 2.0 has just been released. CE-Star is a Chinese add-on
program for handhelds that provides input, display (including web
browsing) and localization. Version 2.0 supports the newest Pocket
PCs as well as any other handheld that runs Windows CE.

CE-Star is a well-rounded product. It includes several input methods
(Pinyin, Zhuyin, CangJie, Cantonese Pinyin and others); supports the
GB (GBK), BIG-5, HZ, UTF-7 and UTF-8 encodings; includes Japanese
support and support for the special HK character set; and has a
phrase prediction function. It also comes with a code conversion
program and, a big plus in my eyes, an English-Chinese/Chinese-English
dictionary.

One of the nicer features is the input and display of either
traditional or simplified characters. The user can choose which
character style to input and which to display--for example, you can
input in simplified and display the characters as either simplified
or traditional. Traditional and simplified characters can also be
displayed simultaneously.

The most important new feature in version 2.0 is handwriting
recognition for input. Users can now choose either the soft keyboard
or handwriting as the input method. The soft keyboard is slow, of
course, but I have found the handwriting recognition to be quite
satisfactory.

The only drawback is that the program takes quite a bit of
space--over 4 MB for the full program.
The trial version (the full version with certain features disabled)
can be downloaded from
http://www.ce-star.com

The cost is $19.50 for an upgrade from 1.1 and $35 for new users.

Daniel Baird

- Related Links
http://www.ce-star.com

** Automatic Chinese Weather Line

From the Web page:

"We are currently developing a weather inquiry system called ...
(Mu4 Xing1), which interacts with people using Mandarin Chinese. You
can call the system at 617-252-7035, and ask for weather information
using natural language -- in Chinese!

"The system knows the weather information for about 100 Chinese
cities, many US cities, and other major cities world-wide. The
information includes general weather actions, such as
sunny/thunderstorm, etc., as well as temperature, wind speed,
sunrise/sunset times and humidity. To know more about the system,
check out the FAQ page -- or even better, give it a try!"

Review: The system worked quite well for me, even with my foreigner's
accent. It uses computer-generated speech, which can take some getting
used to but is generally understandable. It recognized almost all the
Chinese cities I asked about, and when I asked about a province it
listed the cities in that province that I could ask about
specifically. It did fail to understand me the first two times I
tried, but it then gave an example sentence for me to use and I
started speaking a little more clearly. After that it worked great. It
is an interesting tool for those interested in Chinese speech
recognition or generation, or for anyone who wants to know the weather
in China.

- Related Links
http://www.sls.lcs.mit.edu/wangc/datacollection.html

** Google CJK Search

Web search engine Google has added the ability to search the web in
both simplified and traditional Chinese. Google claims to have indexed
24 million searchable Chinese-language web pages. Google will also
supply the web search engine for the Chinese portal NetEase.

- Thanks to Mark Lewellen for submitting this story.
- Related Links
http://www.google.com/intl/zh-CN/
http://www.google.com/intl/zh-TW/
http://www.google.com/intl/zh-CN/pressrelease.html
http://www.google.com/intl/zh-TW/pressrelease.html

** Upcoming Chinese Language Processing Conference

From the website:

"Growing interest in Chinese Language Processing is leading to the
development of resources such as annotated corpora and automatic
segmenters, part-of-speech taggers and parsers. The first Asian ACL
provides an ideal opportunity to bring together influential
researchers from Taiwan, Singapore, Hong Kong, and Beijing, as well as
Chinese language researchers in the rest of the world, to discuss
issues that are specific to the processing of Chinese. A critical tool
for developing Chinese language processing tools is the availability
of annotated corpora. The greater the consensus we have around
guidelines for corpus annotation of part-of-speech tags, syntactic
bracketing and other areas, the more useful this corpora will be."

The 2nd Chinese Language Processing Workshop will be held on
October 8, 2000, in Hong Kong, China. Papers will be presented across
a wide range of areas of Chinese natural language processing.

- Related Links
http://morph.ldc.upenn.edu/ctb/clp00.html
http://www.ldc.upenn.edu/ctb/clp00.cfp

** Penn Chinese Treebank Now Available

The Chinese Treebank is now available from the Linguistic Data
Consortium at the University of Pennsylvania. The Treebank is a corpus
of about 100,000 segmented Chinese words annotated for part of speech,
grammatical structure, and anaphora relations. It is an invaluable
resource for people doing research in statistical methods of Chinese
language analysis. The corpus uses simplified characters and the
GB2312 character set. It is available for US$100 to people who are not
LDC members.
- Related Links
http://www.ldc.upenn.edu/ctb/
http://morph.ldc.upenn.edu/Catalog/LDC2000T48.html

** Chinese Partner 2000 for Win 95/98 (CP2000 All IN ONE)

From the website:

"The Chinese Partner 2000 has built-in Chinese Partner 4.98 enabling
system which is fully compatible with MS Windows 98 (SE) and MS Office
97/2000! With extensive 64 Chinese True Type fonts, it also offers
Unicode compatibility with Office 97 in single cursor movement. It's
fully compatible with Fareast version of MS Office 97. Other features
including enhanced Chinese character display and printer support,
dynamic localization for menu and dialog box, support 3rd party
TrueType fonts, compatible with GBK and BIG5+ encoding, support
ISO-2022-CN and MIME encoding. Auto-Detect GB/BIG5 Codes, super code
converter."

It also includes Chinese pen input, voice recognition, OCR, and a
Chinese/English dictionary. CP2000 All in One lists for US$268. People
who have used CP2000 are encouraged to send in a review for the next
newsletter.

- Related Links
http://www.twinbridge.com/Products/cp/cp2000_allinone.html

** Updated CEDICT Chinese/English Dictionary Available

An updated and corrected version of the CEDICT Chinese/English
dictionary is now available from the link below. The dictionary is
available in both traditional and simplified versions. People are
encouraged to help the free dictionary grow by sending in new entries.
The dictionary currently has about 23,000 entries.

- Related Links
http://www.mandarintools.com/cedict.html

** Site of the Month: Chih-Hao Tsai's Technology Page

Chih-Hao Tsai's web site includes a variety of useful Chinese language
data. Among the most useful are a list of Chinese word lists available
on the Internet, frequency and stroke counts of Chinese characters, a
list of Chinese names and a name frequency count derived from the
list, a Big5 pinyin input method for Traditional Chinese Windows (a
much needed addition), and much more.
People with an interest in Chinese computing are encouraged to take a
look around.

- Related Links
http://www.geocities.com/hao510/

** Code Sample of the Month: Convert an integer into a Chinese number string using Perl (Chinese in Big5)

use strict;
use warnings;

my $minus     = "負";
my @digits    = ("零", "一", "二", "三", "四", "五", "六", "七", "八", "九");
my @beforeWan = ("十", "百", "千");
my @afterWan  = ("", "萬", "億", "兆", "京");
my $TEN       = 10;

# The heart of the program. Does the actual conversion.
sub EnglishToChineseNumber {
    my($enumber) = @_;      # input is an integer, e.g. 938
    my(@powers) = ();       # $powers[$n] holds the digit for 10**$n
    my($power) = 0;
    my($value) = 0;
    my($negative) = 0;      # is it a negative integer?
    my($inzero) = 0;        # are we in a stretch of 1 or more zeros
                            # (only add one zero for the stretch)
    my($canaddzero) = 0;    # only add a zero if there's something
                            # non-zero on both sides of it
    my($cnumber) = "";      # the final result

    # If zero, just return zero
    if ($enumber == 0) {
        return $digits[0];
    }

    # Check if it's negative, set the negative flag and make it positive
    if ($enumber < 0) {
        $negative = 1;
        $enumber = -$enumber;
    }

    # Get the value of the coefficient for each power of ten
    while ($TEN ** $power <= $enumber) {
        $value = ($enumber % ($TEN ** ($power+1))) / ($TEN ** $power);
        $powers[$power] = $value;
        # Subtract out the current power's coefficient and increase the power
        $enumber -= $enumber % ($TEN ** ($power+1));
        $power++;
    }

    # Take the decomposition of the number from above and generate
    # the Chinese equivalent
    for (my $i = 0; $i < $power; $i++) {
        if (($i % 4) == 0) {
            # Reached the next four powers up level
            if ($powers[$i] != 0) {
                $inzero = 0;
                $canaddzero = 1;
                $cnumber = $digits[$powers[$i]] . $afterWan[$i/4] . $cnumber;
            } else {
                # Check that something in the next three powers is
                # non-zero before adding
                if ((($i+3 < $power) && $powers[$i+3] != 0) ||
                    (($i+2 < $power) && $powers[$i+2] != 0) ||
                    (($i+1 < $power) && $powers[$i+1] != 0)) {
                    $cnumber = $afterWan[$i/4] . $cnumber;
                }
            }
        } else {
            # Add the tens, hundreds, or thousands place within each level
            if ($powers[$i] != 0) {
                $inzero = 0;
                $canaddzero = 1;
                if ($power == 2 && $i == 1 && $powers[$i] == 1) {
                    # No 一 with 10 through 19
                    $cnumber = $beforeWan[($i % 4)-1] . $cnumber;
                } else {
                    # Not handled here: when to use liang3 vs. er4 for 2
                    $cnumber = $digits[$powers[$i]] . $beforeWan[($i % 4)-1] . $cnumber;
                }
            } else {
                if ($canaddzero == 1 && $inzero == 0) {
                    # Only insert one 零 for all consecutive zeroes
                    $inzero = 1;
                    $cnumber = $digits[$powers[$i]] . $cnumber;
                }
            }
        }
    }

    # Add the negative character
    if ($negative == 1) {
        $cnumber = $minus . $cnumber;
    }

    return $cnumber;
}

---------------------------------------------------------------------

Please send suggestions for future Chinese Computing Newsletter items
to erik@chinesecomputing.com. Reviews of software programs,
announcements of upcoming releases, and other Chinese computing news
are welcome and will be credited.

Past issues of the newsletter can be accessed through the
www.chinesecomputing.com site. Feel free to redistribute the
newsletter for non-commercial use as long as you retain this notice.

To remove yourself from the list, send an e-mail to
newsletter@chinesecomputing.com. On the subject line write
"remove your@email-address.com". If you received the newsletter
through the CC-Net Chinese Computing mailing list, you must
unsubscribe from that list directly.