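Is there C code for converting UTF-8 to GB2312?
Or any related documentation?
I have seen documents on converting UTF-8 to Unicode, but I could not find anything on UTF-8 to GB2312.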
----
Studying computers without using one. Is this where the signature goes?
|
If you need C code, take a look at iconv and its documentation.
You can also try it from the command line:
$ iconv -f 'utf-8' -t 'gb2312' utf8_file
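For comparison, here is a minimal Python 3 sketch of what the command above does to a file. The function name `convert_file` and the throwaway file names are made up for illustration and are not part of iconv or of any library. It assumes the input really is valid UTF-8 and that every character exists in GB2312; otherwise the strict error handling raises an exception.

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_enc="utf-8", dst_enc="gb2312"):
    """Re-encode a text file, roughly like `iconv -f src_enc -t dst_enc`."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_enc, newline="") as src:
        text = src.read()
    # Strict error handling: a character outside GB2312 raises
    # UnicodeEncodeError instead of being silently dropped.
    with open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        dst.write(text)


# Usage with throwaway files (hypothetical names, created in a temp directory).
tmp_dir = tempfile.mkdtemp()
utf8_file = os.path.join(tmp_dir, "utf8_file")
gb_file = os.path.join(tmp_dir, "gb2312_file")
with open(utf8_file, "wb") as f:
    f.write("中文\n".encode("utf-8"))

convert_file(utf8_file, gb_file)

with open(gb_file, "rb") as f:
    converted = f.read()
```

For C code the iconv library is still the place to look; this only shows the same decode-then-encode idea in a few lines.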
----
Review the old to learn the new
|
According to the documentation for Python 2.4 and later, both gbk and gb2312 are included.
>>> utxt = u'hello world'
>>> utxt.encode('gb2312')
-----------------------------------------------------------
4.9.2 Standard Encodings
Codec | Aliases | Languages
ascii | 646, us-ascii | English
big5 | big5-tw, csbig5 | Traditional Chinese
big5hkscs | big5-hkscs, hkscs | Traditional Chinese
cp037 | IBM037, IBM039 | English
cp424 | EBCDIC-CP-HE, IBM424 | Hebrew
cp437 | 437, IBM437 | English
cp500 | EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 | Western Europe
cp737 | | Greek
cp775 | IBM775 | Baltic languages
cp850 | 850, IBM850 | Western Europe
cp852 | 852, IBM852 | Central and Eastern Europe
cp855 | 855, IBM855 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
cp856 | | Hebrew
cp857 | 857, IBM857 | Turkish
cp860 | 860, IBM860 | Portuguese
cp861 | 861, CP-IS, IBM861 | Icelandic
cp862 | 862, IBM862 | Hebrew
cp863 | 863, IBM863 | Canadian
cp864 | IBM864 | Arabic
cp865 | 865, IBM865 | Danish, Norwegian
cp866 | 866, IBM866 | Russian
cp869 | 869, CP-GR, IBM869 | Greek
cp874 | | Thai
cp875 | | Greek
cp932 | 932, ms932, mskanji, ms-kanji | Japanese
cp949 | 949, ms949, uhc | Korean
cp950 | 950, ms950 | Traditional Chinese
cp1006 | | Urdu
cp1026 | ibm1026 | Turkish
cp1140 | ibm1140 | Western Europe
cp1250 | windows-1250 | Central and Eastern Europe
cp1251 | windows-1251 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
cp1252 | windows-1252 | Western Europe
cp1253 | windows-1253 | Greek
cp1254 | windows-1254 | Turkish
cp1255 | windows-1255 | Hebrew
cp1256 | windows1256 | Arabic
cp1257 | windows-1257 | Baltic languages
cp1258 | windows-1258 | Vietnamese
euc_jp | eucjp, ujis, u-jis | Japanese
euc_jis_2004 | jisx0213, eucjis2004 | Japanese
euc_jisx0213 | eucjisx0213 | Japanese
euc_kr | euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001 | Korean
gb2312 | chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso-ir-58 | Simplified Chinese
gbk | 936, cp936, ms936 | Unified Chinese
gb18030 | gb18030-2000 | Unified Chinese
hz | hzgb, hz-gb, hz-gb-2312 | Simplified Chinese
iso2022_jp | csiso2022jp, iso2022jp, iso-2022-jp | Japanese
iso2022_jp_1 | iso2022jp-1, iso-2022-jp-1 | Japanese
iso2022_jp_2 | iso2022jp-2, iso-2022-jp-2 | Japanese, Korean, Simplified Chinese, Western Europe, Greek
iso2022_jp_2004 | iso2022jp-2004, iso-2022-jp-2004 | Japanese
iso2022_jp_3 | iso2022jp-3, iso-2022-jp-3 | Japanese
iso2022_jp_ext | iso2022jp-ext, iso-2022-jp-ext | Japanese
iso2022_kr | csiso2022kr, iso2022kr, iso-2022-kr | Korean
latin_1 | iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1 | West Europe
iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe
iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese
iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages
iso8859_5 | iso-8859-5, cyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
iso8859_6 | iso-8859-6, arabic | Arabic
iso8859_7 | iso-8859-7, greek, greek8 | Greek
iso8859_8 | iso-8859-8, hebrew | Hebrew
iso8859_9 | iso-8859-9, latin5, L5 | Turkish
iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages
iso8859_13 | iso-8859-13 | Baltic languages
iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages
iso8859_15 | iso-8859-15 | Western Europe
johab | cp1361, ms1361 | Korean
koi8_r | | Russian
koi8_u | | Ukrainian
mac_cyrillic | maccyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian
mac_greek | macgreek | Greek
mac_iceland | maciceland | Icelandic
mac_latin2 | maclatin2, maccentraleurope | Central and Eastern Europe
mac_roman | macroman | Western Europe
mac_turkish | macturkish | Turkish
ptcp154 | csptcp154, pt154, cp154, cyrillic-asian | Kazakh
shift_jis | csshiftjis, shiftjis, sjis, s_jis | Japanese
shift_jis_2004 | shiftjis2004, sjis_2004, sjis2004 | Japanese
shift_jisx0213 | shiftjisx0213, sjisx0213, s_jisx0213 | Japanese
utf_16 | U16, utf16 | all languages
utf_16_be | UTF-16BE | all languages (BMP only)
utf_16_le | UTF-16LE | all languages (BMP only)
utf_7 | U7 | all languages
utf_8 | U8, UTF, utf8 | all languages
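If you want to check how the alias column behaves in practice, something like the following should do. It is a small sketch for Python 3 (not the 2.4 interpreter the table comes from) and uses only the standard `codecs.lookup` function; the particular aliases chosen are just a sample from the gb2312 and gbk rows.

```python
import codecs

# A few aliases taken from the gb2312 and gbk rows of the table.
aliases = ["gb2312", "euc-cn", "euccn", "cp936", "936"]

# codecs.lookup() accepts an alias and returns a CodecInfo object whose
# .name attribute is the canonical codec name.
resolved = {alias: codecs.lookup(alias).name for alias in aliases}
for alias, name in resolved.items():
    print(alias, "->", name)

# Encoding through an alias gives the same bytes as the canonical name.
gb_bytes = "中文".encode("gb2312")
same_bytes = "中文".encode("euc-cn")
```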
----
Review the old to learn the new
|
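There is no other way: try the encodings one by one, and move on to the next whenever an exception is raised :(
Thanks to alula for the tip.
def zhtounicode(s):
    # Try each candidate encoding in turn; return the first successful decode.
    # 'jp' and 'kr' are not registered codec names, so euc_jp and euc_kr are used.
    for c in ('utf-8', 'gbk', 'big5', 'euc_jp', 'euc_kr'):
        try:
            return s.decode(c)
        except UnicodeDecodeError:
            pass
    # No candidate matched: return the byte string unchanged.
    return s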
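In case it helps anyone on Python 3, where `bytes` and `str` are separate types, the same trial-and-error idea might look roughly like this. The name `guess_decode` and the decision to return the encoding alongside the text are my own assumptions, not something from the posts above, and the candidate list uses `euc_jp` and `euc_kr` as stand-ins for the Japanese and Korean entries.

```python
def guess_decode(data, candidates=("utf-8", "gbk", "big5", "euc_jp", "euc_kr")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Returns (None, None) when every candidate fails.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    return None, None


text_utf8, enc_utf8 = guess_decode("中文".encode("utf-8"))
text_gbk, enc_gbk = guess_decode("中文".encode("gbk"))
```

Keep in mind that a decode that raises no error is not proof the guess is right. Big5 text, for instance, will often decode as gbk without complaint and come out as the wrong characters, so the order of the candidates matters.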
|
asmcos:
zhcon does not support UTF-8; cce2000 does.
BTW:
What a situation we have now:
Ubuntu Linux ships Python 2.4,
Debian GNU/Linux still ships Python 2.3,
and on Windows it is ActivePython 2.4.
....................
On Debian, with python2.3-cjkcodecs and python2.3-iconvcodec installed as well, that should be about enough.
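Since the set of codecs differs between these installations, a quick probe of what the local interpreter actually provides can save some guessing. This is only a sketch in Python 3 syntax; `missing_codecs` is a hypothetical helper name, and it relies on `codecs.lookup` raising `LookupError` for an unknown encoding.

```python
import codecs


def missing_codecs(names):
    """Return the names for which this interpreter has no codec registered."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


# "no_such_codec_xyz" is a deliberately bogus name to show the failure path.
print(missing_codecs(["gb2312", "gbk", "big5", "no_such_codec_xyz"]))
```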
|
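Leaving aside whether this approach to the problem is the right one,
I just think you are too hard-working... to write that much code.
Could you be a little lazier and write it like this:
def zhtounicode(s):
    # Try each candidate encoding in turn; return the first successful decode.
    # 'jp' and 'kr' are not registered codec names, so euc_jp and euc_kr are used.
    for c in ('utf-8', 'gbk', 'big5', 'euc_jp', 'euc_kr'):
        try:
            return s.decode(c)
        except UnicodeDecodeError:
            pass
    # No candidate matched: return the byte string unchanged.
    return s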
----
Review the old to learn the new
|
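That is perfectly normal; they are all represented as strings. Debian has a package called libhz0 that can make a reasonable guess based on word frequency. It may be worth a look.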
Package: libhz-dev
Priority: optional
Section: libdevel
Installed-Size: 432
Maintainer: Yu Guanghui <ygh@debian.org>
Architecture: i386
Source: zh-autoconvert
Version: 0.3.14-2.1
Depends: libhz0 (= 0.3.14-2.1), libc6-dev
Filename: pool/main/z/zh-autoconvert/libhz-dev_0.3.14-2.1_i386.deb
Size: 154912
MD5sum: ed1e9ee9ba28f8126fab87cb65584f85
Description: Headers and static libraries for zh-autoconvert
Contains the symlinks, headers, and object files needed to compile and
link programs which use the zh-autoconvert library.
.
Author: Yu Guanghui <ygh@debian.org>
Tag: devel::library, role::sw:devel-lib
|
"text here".decode("gb2312").encode("utf-8")
It should be something along these lines. I do not remember the exact syntax of decode and encode; check the documentation.
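For reference, this is roughly how the pair of calls looks in Python 3, where the input has to be a `bytes` object rather than a plain string literal. Treat it as a sketch: the sample bytes and the euro-sign example are mine, chosen only to show the optional `errors` argument.

```python
# "中文" encoded as GB2312.
gb_bytes = b"\xd6\xd0\xce\xc4"

# bytes.decode(encoding, errors) returns str;
# str.encode(encoding, errors) returns bytes.
text = gb_bytes.decode("gb2312")
utf8_bytes = text.encode("utf-8")

# With errors="replace", a character the target encoding cannot represent
# (the euro sign is not in GB2312) becomes "?" instead of raising an error.
lossy = "中文€".encode("gb2312", errors="replace")
```

The default is `errors="strict"`, which raises `UnicodeDecodeError` or `UnicodeEncodeError` on the first byte or character that does not fit.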