Study notes on Chinese text handling in Java: Hello Unicode (2) (notes by 车东)

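Some conclusions from experiment 2:

Every application processes text as byte stream => character stream => byte stream:
byte_stream ==[input decoding]==> unicode_char_stream ==[output encoding]==> byte_stream;
In Java, every conversion from a byte stream to a character stream (or back) involves an implicit decoding or encoding step, which uses the system default encoding unless one is specified;
The earliest byte-stream decoding already happens when javac compiles the source code;
A Java char is stored as a two-byte (16-bit) Unicode code unit.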
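
The pipeline in the list above can be sketched in a few lines. This is only an illustration, not code from HelloUnicode.java: the class name TranscodeSketch and the method transcode are hypothetical names introduced here, and both charsets are passed explicitly so that nothing depends on the system default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    /** Decodes the input bytes with one charset and re-encodes the characters with another. */
    static byte[] transcode(byte[] input, Charset inputCharset, Charset outputCharset) {
        String chars = new String(input, inputCharset); // input decoding: bytes -> chars
        return chars.getBytes(outputCharset);           // output encoding: chars -> bytes
    }

    public static void main(String[] args) {
        byte[] utf8 = "Hello world \u4e16\u754c\u4f60\u597d".getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        // prints "24 UTF-8 bytes -> 32 UTF-16BE bytes"
        System.out.println(utf8.length + " UTF-8 bytes -> " + utf16.length + " UTF-16BE bytes");
    }
}
```

The same 16 characters occupy 24 bytes in one encoding and 32 in the other; only the character stream in the middle is encoding-independent.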

Experiment 2: byte-stream to character-stream conversion during Java input and output

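The HelloUnicode.java program demonstrates how the string "Hello world 世界你好" (16 characters) is handled under different system default encodings. After each encoding/decoding step it prints, for every character of the resulting string, the byte value, the short value, and the Unicode block the character belongs to.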
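
The source of HelloUnicode.java is not reproduced in these notes, so the following is only a guess at how such a per-character dump could be produced. The names CharDumpSketch, describe and dump are hypothetical, and the tab-separated layout is an assumption modelled on the output shown below.

```java
class CharDumpSketch {
    /** Formats one character roughly the way the dumps in this article do. */
    static String describe(int index, char c) {
        byte b = (byte) c;   // keeps only the low 8 bits
        short s = (short) c; // the same 16 bits, read as a signed value
        return "char[" + index + "]='" + c + "'"
                + "\tbyte=" + b + " \\u" + Integer.toHexString(b).toUpperCase()
                + "\tshort=" + s + " \\u" + Integer.toHexString(s).toUpperCase()
                + "\t" + Character.UnicodeBlock.of(c);
    }

    /** Prints the string, its length, and one line per character. */
    static void dump(String s) {
        System.out.println("string=" + s + "\tlength=" + s.length());
        for (int i = 0; i < s.length(); i++) {
            System.out.println(describe(i, s.charAt(i)));
        }
    }

    public static void main(String[] args) {
        dump("Hello world \u4e16\u754c\u4f60\u597d");
    }
}
```

Under this sketch, a negative byte is sign-extended when it is widened to int for Integer.toHexString, which would explain values such as \uFFFFFFCA in the byte column of the dumps.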

The program was run under two locale settings; the output for each follows in this order:

LANG=en_US LC_ALL=en_US

LANG=zh_CN LC_ALL=zh_CN.GBK

========testing1: write hello world to files========
[test 1-1]: with system default encoding=ISO-8859-1
string=Hello world 世界你好     length=20
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?    byte=-54 \uFFFFFFCA     short=202 \uCA  LATIN_1_SUPPLEMENT
char[13]='?    byte=-64 \uFFFFFFC0     short=192 \uC0  LATIN_1_SUPPLEMENT
char[14]='?    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT
char[15]='?    byte=-25 \uFFFFFFE7     short=231 \uE7  LATIN_1_SUPPLEMENT
char[16]='?    byte=-60 \uFFFFFFC4     short=196 \uC4  LATIN_1_SUPPLEMENT
char[17]='?    byte=-29 \uFFFFFFE3     short=227 \uE3  LATIN_1_SUPPLEMENT
char[18]='?    byte=-70 \uFFFFFFBA     short=186 \uBA  LATIN_1_SUPPLEMENT
char[19]='?    byte=-61 \uFFFFFFC3     short=195 \uC3  LATIN_1_SUPPLEMENT

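Step 1: in the English locale the screen appears to show the Chinese correctly, but the program is actually printing "half" Chinese characters: each of the 8 bytes of the Chinese text has become a separate char, which is why the length is 20 instead of 16. The result is written to the first file, hello.orig.html.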
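
The "half character" effect can be reproduced without changing the locale. In this sketch the eight byte values are copied from the dump above (CA C0 BD E7 C4 E3 BA C3); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HalfCharacterSketch {
    // The 8 bytes of the four Chinese characters, as listed in the test 1-1 dump.
    static final byte[] CHINESE_BYTES = {
        (byte) 0xCA, (byte) 0xC0, (byte) 0xBD, (byte) 0xE7,
        (byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3
    };

    /** Decodes every byte as one ISO-8859-1 character. */
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String halves = decodeAsLatin1(CHINESE_BYTES);
        System.out.println(halves.length()); // prints 8: one char per byte
        byte[] back = halves.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(back, CHINESE_BYTES)); // prints true
    }
}
```

Because ISO-8859-1 maps each byte value to exactly one char, writing the string back out reproduces the original bytes, and a terminal that interprets them as GBK shows the Chinese text.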

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='?'    byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='?'    byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='?'    byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

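The string is turned back into a byte stream with the system default encoding and then decoded as GB2312. The Chinese characters print as question marks here
(in the current English locale the system has no way to display characters above 255, so it shows them all as ?),
but the Unicode block and the short values show that the characters are the correct Chinese ones.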

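However, the next step writes the string to the second file, hello.gb2312.html,
without specifying an encoding (so the system default, ISO-8859-1, is used).
As a result, what test 2-2 below reads back really is '?'.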
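
The loss described here is easy to demonstrate in isolation. A minimal sketch (hypothetical class and method names): String.getBytes replaces every character the target charset cannot represent with that charset's replacement byte, which for ISO-8859-1 is '?' (63), and CharsetEncoder.canEncode can detect the problem beforehand.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyEncodeSketch {
    /** Encodes as ISO-8859-1; unmappable characters silently become '?' (byte 63). */
    static byte[] encodeAsLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Reports whether the string can be encoded as ISO-8859-1 without loss. */
    static boolean fitsInLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String chinese = "\u4e16\u754c\u4f60\u597d";
        System.out.println(Arrays.toString(encodeAsLatin1(chinese))); // prints [63, 63, 63, 63]
        System.out.println(fitsInLatin1(chinese));                    // prints false
    }
}
```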

[test 1-3]: convert string to UTF8
string=Hello world 涓栫晫浣犲ソ length=24
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?    byte=-28 \uFFFFFFE4     short=228 \uE4  LATIN_1_SUPPLEMENT
char[13]='?    byte=-72 \uFFFFFFB8     short=184 \uB8  LATIN_1_SUPPLEMENT
char[14]='?    byte=-106 \uFFFFFF96    short=150 \u96  LATIN_1_SUPPLEMENT
char[15]='?    byte=-25 \uFFFFFFE7     short=231 \uE7  LATIN_1_SUPPLEMENT
char[16]='?    byte=-107 \uFFFFFF95    short=149 \u95  LATIN_1_SUPPLEMENT
char[17]='?    byte=-116 \uFFFFFF8C    short=140 \u8C  LATIN_1_SUPPLEMENT
char[18]='?    byte=-28 \uFFFFFFE4     short=228 \uE4  LATIN_1_SUPPLEMENT
char[19]='?    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT
char[20]='?    byte=-96 \uFFFFFFA0     short=160 \uA0  LATIN_1_SUPPLEMENT
char[21]='?    byte=-27 \uFFFFFFE5     short=229 \uE5  LATIN_1_SUPPLEMENT
char[22]='?    byte=-91 \uFFFFFFA5     short=165 \uA5  LATIN_1_SUPPLEMENT
char[23]='?    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT

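The third test encodes the character stream as UTF-8 and writes it to the third test file, hello.utf8.html.
UTF-8 leaves the English (ASCII) text unchanged but encodes each of these Chinese characters in 3 bytes,
so the Chinese part takes 50% more storage than in GB2312, which uses 2 bytes per character.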
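
The 50% figure can be checked by counting encoded bytes. This sketch assumes the running JDK ships the GB2312 charset (it is part of the extended charsets, not one of the guaranteed standard ones); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizeSketch {
    /** Returns how many bytes the string occupies in the given charset. */
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumption: available in this JDK
        String chinese = "\u4e16\u754c\u4f60\u597d";
        System.out.println(encodedLength(chinese, gb2312));                 // prints 8
        System.out.println(encodedLength(chinese, StandardCharsets.UTF_8)); // prints 12
    }
}
```

The ASCII prefix "Hello world " takes 12 bytes in both encodings, so the difference comes entirely from the four Chinese characters.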

========Testing2: reading and decoding from files========
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好     length=20
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?    byte=-54 \uFFFFFFCA     short=202 \uCA  LATIN_1_SUPPLEMENT
char[13]='?    byte=-64 \uFFFFFFC0     short=192 \uC0  LATIN_1_SUPPLEMENT
char[14]='?    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT
char[15]='?    byte=-25 \uFFFFFFE7     short=231 \uE7  LATIN_1_SUPPLEMENT
char[16]='?    byte=-60 \uFFFFFFC4     short=196 \uC4  LATIN_1_SUPPLEMENT
char[17]='?    byte=-29 \uFFFFFFE3     short=227 \uE3  LATIN_1_SUPPLEMENT
char[18]='?    byte=-70 \uFFFFFFBA     short=186 \uBA  LATIN_1_SUPPLEMENT
char[19]='?    byte=-61 \uFFFFFFC3     short=195 \uC3  LATIN_1_SUPPLEMENT

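The intermediate file hello.orig.html is read back with the system default encoding.
Although it is read byte by byte (half a "character" at a time), the bytes are restored exactly, so the output displays without errors.
This is in fact why applications such as PHP rarely run into character-set problems: they handle everything as a byte stream from end to end,
which reproduces the input faithfully but also gives up any control over the characters.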
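
Because the bytes survive intact, a string that was mis-decoded as ISO-8859-1 can still be turned into real characters afterwards. A minimal sketch of that repair, assuming the JDK provides the GBK charset; the names Latin1RepairSketch and redecode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1RepairSketch {
    /** Recovers the original bytes from an ISO-8859-1 mis-decode and decodes them properly. */
    static String redecode(String misdecoded, Charset actualCharset) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actualCharset);
    }

    public static void main(String[] args) {
        // The 8 "half characters" from the test 2-1 dump, written as escapes.
        String halves = "\u00ca\u00c0\u00bd\u00e7\u00c4\u00e3\u00ba\u00c3";
        String repaired = redecode(halves, Charset.forName("GBK")); // assumption: GBK is available
        System.out.println(repaired.length()); // prints 4
    }
}
```

This only works while every char is still in the range 0 to 255; once the characters have been replaced by '?' the information is gone.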

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[13]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[14]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[15]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN

The worst case: on output these '?' really are question marks, char(63).
Once the data is in this state it cannot be recovered.

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='?'    byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='?'    byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='?'    byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

Great! The characters are displayed as '?', but they were in fact decoded correctly,
as the Unicode block and the short values show.
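
Since the terminal may render a correctly decoded character as '?', one way to check a decode is to print the string in an ASCII-only form. The helper below is a hypothetical sketch; the UTF-8 byte values are the ones listed in the test 1-3 dump.

```java
import java.nio.charset.StandardCharsets;

class EscapedViewSketch {
    /** Renders non-ASCII chars as backslash-u escapes so any terminal can display them. */
    static String toEscaped(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = {
            (byte) 0xE4, (byte) 0xB8, (byte) 0x96, (byte) 0xE7, (byte) 0x95, (byte) 0x8C,
            (byte) 0xE4, (byte) 0xBD, (byte) 0xA0, (byte) 0xE5, (byte) 0xA5, (byte) 0xBD
        };
        System.out.println(toEscaped(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

For these bytes the escaped form shows the code units 4E16, 754C, 4F60 and 597D, matching the short column of the dump.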

========Testing1: write hello world to files========
[test 1-1]: with system default encoding=GBK
string=Hello world 世界你好     length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='世'   byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='界'   byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='你'   byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='好'   byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

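Note: to run the tests above in a new locale, the source program must be recompiled.
The earliest byte-stream to character-stream decoding already happens when javac compiles the source file.
The biggest difference between this run and the previous one is whether the four characters "世界你好" in the source file were compiled into the program as Chinese characters,
rather than byte by byte as 8 characters (which really correspond to 8 bytes).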
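
One way to take the compile-time decoding out of the picture is to write the literal with Unicode escapes, which javac reads the same way whatever encoding it assumes for the file; another is to tell the compiler the file's encoding with javac's -encoding option. The sketch below shows the first approach; the class name and constant are hypothetical.

```java
class SourceEncodingSketch {
    // The four Chinese characters are spelled as Unicode escapes, so the compiled
    // constant does not depend on how javac decodes the bytes of the source file.
    static final String GREETING = "Hello world \u4e16\u754c\u4f60\u597d";

    public static void main(String[] args) {
        System.out.println(GREETING.length());                           // prints 16
        System.out.println(Integer.toHexString(GREETING.codePointAt(12))); // prints 4e16
    }
}
```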


[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world 世界你好     length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='世'   byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='界'   byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='你'   byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='好'   byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

In the Chinese locale, the decoding matches the default encoding used above, so the output is identical.

[test 1-3]: convert string to UTF8
string=Hello world 涓栫晫浣犲ソ length=18
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='涓'   byte=-109 \uFFFFFF93    short=28051 \u6D93      CJK_UNIFIED_IDEOGRAPHS
char[13]='栫'   byte=43 \u2B    short=26667 \u682B      CJK_UNIFIED_IDEOGRAPHS
char[14]='晫'   byte=107 \u6B   short=26219 \u666B      CJK_UNIFIED_IDEOGRAPHS
char[15]='浣'   byte=99 \u63    short=28003 \u6D63      CJK_UNIFIED_IDEOGRAPHS
char[16]='犲'   byte=-78 \uFFFFFFB2     short=29362 \u72B2      CJK_UNIFIED_IDEOGRAPHS
char[17]='ソ'   byte=-67 \uFFFFFFBD     short=12477 \u30BD      KATAKANA

The terminal window used for this test is itself a GBK application,
so what it shows here is the effect of decoding the UTF-8 bytes as GBK.
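
The six garbled characters in this dump can be reproduced directly by decoding the 12 UTF-8 bytes as GBK, two bytes per character. The sketch assumes the JDK provides the GBK charset; the class and method names are hypothetical, and the expected code units in the comment are the ones from the dump above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8AsGbkSketch {
    /** Encodes the text as UTF-8 and then (wrongly) decodes those bytes as GBK. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("GBK")); // assumption: GBK is available
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u4e16\u754c\u4f60\u597d");
        // 12 bytes read as 6 two-byte GBK characters: 6D93 682B 666B 6D63 72B2 30BD
        System.out.println(garbled.length()); // prints 6
    }
}
```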


========Testing2: reading and decoding from files========
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好     length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='世'   byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='界'   byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='你'   byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='好'   byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world 世界你好     length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='世'   byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='界'   byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='你'   byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='好'   byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world 世界你好     length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='世'   byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='界'   byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='你'   byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='好'   byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

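Conclusion: if the back-end data is stored as Unicode
and the encoding and decoding charsets are specified explicitly wherever needed, the application is almost entirely unaffected by the character-set settings
of the environment the front end runs in.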
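
As a rough illustration of this conclusion, the sketch below writes and reads a file with an explicitly named charset, so the result does not depend on LANG or LC_ALL. It is not the code of HelloUnicode.java; the class and method names and the temporary file are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetFileSketch {
    /** Writes the text with the given charset, reads it back with the same one. */
    static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("hello", ".html");
            try {
                try (Writer w = Files.newBufferedWriter(tmp, charset)) {
                    w.write(text); // output encoding is explicit
                }
                return new String(Files.readAllBytes(tmp), charset); // so is input decoding
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Hello world \u4e16\u754c\u4f60\u597d";
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```

Unlike String.getBytes, the writer returned by Files.newBufferedWriter reports an error for characters the charset cannot encode instead of writing '?', so silent data loss is avoided as well.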

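Author: 车东. Published: 2002-07-10 22:07. Last updated: 2007-04-12 11:04
Copyright notice: this article may be freely reposted, provided the repost credits the original source and author with a hyperlink and includes this notice.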
https://www.chedong.com/tech/hello_unicode_2.html

Comments

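车东, hello! I would like to ask a question.
In my application, when converting text from the Big5 character set to Simplified Chinese, a small number of characters come out as spaces (occupying 2 bytes) and a small number come out as "?" (occupying 1 byte). My approach is as follows: there are two mapping table files (big5-gb.table and gb-big5.table), and the conversion looks characters up in these two tables. When a character has no match, there are two methods for updating the table files. The problem is this: among the characters that come out as spaces or "?", some of those that come out as spaces can be updated successfully, while none of those that come out as "?" can. The update code is as follows:

protected synchronized void resetBig5Char(String gbChar, String big5Char) throws Exception {

    byte[] TextBig5 = new String(big5Char.getBytes(), "BIG5").getBytes("BIG5");
    byte[] Text = new String(gbChar.getBytes(), "GBK").getBytes("GBK");

    int max = Text.length - 1;
    int h = 0;
    int l = 0;
    int p = 0;
    int b = 256;
    byte[] big = new byte[2];
    for (int i = 0; i < max; i++) {
        h = (int) (Text[i]);
        if (h < 0) {
            // negative byte: convert to the unsigned value of the lead byte
            h = b + h;
            l = (int) (Text[i + 1]);
            if (l < 0) {
                l = b + (int) (Text[i + 1]);
            }
            if (h == 161 && l == 64) {
                // do nothing
            } else {
                // offset of this two-byte character in the mapping table
                p = (h - 160) * 510 + (l - 1) * 2;
                b_gbTable[p] = TextBig5[i];
                b_gbTable[p + 1] = TextBig5[i + 1];
            }
            i++;
        }
    }
    String filepathgb = "E:\\gb-big5.table";
    BufferedOutputStream pWriter = new BufferedOutputStream(new FileOutputStream(filepathgb));
    //BufferedOutputStream pWriter = new BufferedOutputStream(new FileOutputStream(s_gbTableFile));
    pWriter.write(b_gbTable, 0, b_gbTable.length);
    pWriter.close();
}

The conversion itself works the same way. For example, the Big5 characters 黷憃恏昪, when viewed as Big5, look like this: 恒慤鰂邨脷
But updating the table for these characters fails, mainly because byte[] TextBig5 is 1 byte long while byte[] Text is 2 bytes long. I do not understand why this happens. Please advise, thank you!!!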

Posted by: relyang on December 29, 2008 09:57 AM
