Q: How do I decode garbled text from the Library of Congress?

Stack Overflow user

Asked 2009-12-09 21:05:08

3 answers · 1.4K views · 0 followers · 0 votes

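I am doing a Z39.50 search in Python, but I am having trouble decoding the search results.

The first search result for "Harry Potter" is apparently a Hebrew edition of the book.

How can I turn it into Unicode?

This is the minimal code I use to fetch a record: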

#!/usr/bin/env python
# encoding: utf-8

from PyZ3950 import zoom
from PyZ3950 import zmarc

conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'VOYAGER'

query = zoom.Query('CCL', 'ti="HARRY POTTER"')

res = conn.search(query)

print "%d hits:" % len(res)

for r in res[:1]:
   print unicode( r.data )

Running the script results in "UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 788: ordinal not in range(128)".

Tags: python, encoding, z39.50

3 Answers

Stack Overflow user

Answered 2009-12-09 21:10:22

r.data.decode('windows-1255').encode('utf-8')

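You will have to find out the correct encoding they used and substitute it for 'windows-1255' (which may work if your guess about Hebrew is correct).

Votes: 1

Stack Overflow user

Answered 2009-12-09 22:04:04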

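I tried to reproduce your problem, but I ended up in Python's equivalent of "DLL hell". Please specify the versions of each of Python, PyZ3950, and PLY that you are using.

You will notice from the error message that there are 788 ASCII bytes before the first non-ASCII byte. That does not sound like Hebrew/Arabic/Greek/Cyrillic/etc., which use non-ASCII bytes for the most commonly used characters in those languages.

Instead of print unicode(r.data), do print type(r.data), repr(r.data) and edit your question to show the result.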
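For illustration, the same kind of inspection can be narrowed to the bytes around the failing offset. This is only a sketch, written for Python 3 rather than the Python 2 of the question; show_failure_context is a made-up helper name, and the sample bytes are a short fragment of this record's title field, not the full r.data.

```python
def show_failure_context(raw, width=20):
    """Return (offset, nearby bytes) for the first byte that is not ASCII."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first undecodable byte.
        start = max(exc.start - width, 0)
        return exc.start, raw[start:exc.start + width]
    return None, b""


# Assumed sample: a fragment of the record, not the whole 1688-byte payload.
sample = b"Hari Po\xf2ter \xf2ve-misdar"
offset, context = show_failure_context(sample)
print(type(sample), offset, repr(context))
```

On the real record the offset would be 788, and the surrounding bytes show whether the data looks like plain text in some code page or like a structured binary format.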

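Update: I managed to get it running with the latest version of PyZ3950 and Python 2.6. It needed from ply import lex instead of import lex in PyZ3950/ccl.py (and the same fix for import yacc).

Here are the results of dumping hit 0 and hit 200: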

>>> print repr(res[0].data)
"01688cam  22003614a 45000010009000000050017000090080041000260350018000670350020
00085906004500105925004400150955002400194010001700218020001500235040001300250041
00130026305000180027610000270029488000540032124000330037524501270040888001620053
52460070006972600092007678800200008593000029010594900019010888800045011077000029
01152880006301181700002901244880005301273\x1e16012113\x1e20091209015332.0\x1e091
208s2008    is a          000 1 heb  \x1e  \x1fa(DLC)16012909\x1e  \x1fa(DLC)200
9664431\x1e  \x1fa0\x1fbibc\x1fcorignew\x1fd3\x1fencip\x1ff20\x1fgy-nonroman\x1e
0 \x1faacquire\x1fb1 shelf copies\x1fxpolicy default\x1e  \x1fbcd06 2009-12-08 I
BC\x1e  \x1fa  2009664431\x1e  \x1fa965511564X\x1e  \x1faDLC\x1fcDLC\x1e1 \x1fah
eb\x1fheng\x1e00\x1faPZ40.R685\x1fbH+\x1e1 \x1f6880-01\x1faRowling, J. K.\x1e1 \
x1f6100-01/(2/r&#x200f;\x1fa\x1b(2xelipb, b\x1b(B'\x1b(2i. wi.\x1b(B\x1e10\x1faH
arry Potter and ??.\x1flHebrew\x1e10\x1f6880-02\x1faHari Po\xf2ter \xf2ve-misdar
 \xb0of ha-\xf2hol ? /\x1fcG'e. \xf2Ke. Roling ; me-Anglit, Gili Bar-Hilel Samu
; iyurim, Mery Granpreh.\x1e10\x1f6245-02/(2/r&#x200f;\x1fa&#x200f;\x1b(2d`xi te
hx e........&#x200f; /\x1b(B\x1fc&#x200f;\x1b(2b\x1b(B'\x1b(2i. wi. xelipb ; n`p
bliz, bili ax\x1b(B-\x1b(2dll qne ; `iexim, nxi bx`ptxd.\x1b(B\x1e1 \x1fiTitle o
n t.p. verso:\x1faHarry Potter and the order of the phoenix ?\x1e  \x1f6880-03\x
1faTel-Aviv :\x1fbYedi\xb0ot a\xf2haronot :\x1fbSifre \xf2hemed :\x1fbSifre \xb0
Aliyat ha-gag,\x1fcc[2008]\x1e  \x1f6260-03/(2/r&#x200f;\x1fa&#x200f;\x1b(2zl\x1
b(B-\x1b(2`aia&#x200f; :\x1b(B\x1fb\x1b(2icirez `gxepez :&#x200f;\x1b(B\x1fb&#x2
00f;\x1b(2qtxi gnc :&#x200f;\x1b(B\x1fb&#x200f;\x1b(2qtxi rliiz dbb,&#x200f;\x1b
(B\x1fc&#x200f;&#x202a;[2008]&#x202c;\x1e  \x1fa887 p. :\x1fbill. ;\x1fc21 cm.\x
1e0 \x1f6880-04\x1faProzah\x1e0 \x1f6490-04/(2/r&#x200f;\x1fa&#x200f;\x1b(2txefd
\x1b(B\x1e1 \x1f6880-05\x1faBar-Hilel, Gili.\x1e1 \x1f6700-05/(2/r&#x200f;\x1fa&
#x200f;\x1b(2ax\x1b(B-\x1b(2dll qne, bili.\x1b(B\x1e1 \x1f6880-06\x1faGrandPr\xe
2e, Mary.\x1e1 \x1f6700-06/(2/r&#x200f;\x1fa&#x200f;\x1b(2bx`ptxd, nxi.\x1b(B\x1
e\x1d"
>>> print repr(res[200].data)
"01427cam  22003614a 45000010009000000050017000090080041000269060045000679250044
00112955017900156010001700335020001800352020001500370035002400385040001800409042
00140042705000220044110000280046324501160049126000760060730000200068344000350070
35040041007386500018007796500013007976500017008106500041008276000019008686000039
00887600004800926710005900974923003201033\x1e14882660\x1e20070925153312.0\x1e070
607s2007    ie       b    000 0 eng d\x1e  \x1fa7\x1fbcbc\x1fccopycat\x1fd3\x1fe
ncip\x1ff20\x1fgy-gencatlg\x1e0 \x1faacquire\x1fb2 shelf copies\x1fxpolicy defau
lt\x1e  \x1fanb05 2007-06-07 z-processor ; nb05 2007-06-07 to HLCD for processin
g;\x1falk21 2007-08-09 to sh00\x1fish21 2007/09-18 (telework)\x1fesh49 2007-09-2
0 to BCCD\x1fesh45 2007-09-25 (Revised)\x1e  \x1fa  2007390561\x1e  \x1fa9780955
492617\x1e  \x1fa0955492610\x1e  \x1fa(OCoLC)ocn129545188\x1e  \x1faVYF\x1fcVYF\
x1fdDLC\x1e  \x1falccopycat\x1e00\x1faBT1105\x1fb.H44 2007\x1e1 \x1faHederman, M
ark Patrick.\x1e10\x1faHarry Potter and the Da Vinci code :\x1fb'Thunder of a Ba
ttle fought in some other Star' /\x1fcMark Patrick Hederman.\x1e  \x1faDublin :\
x1fbDublin Centre for the Study of the Platonic Tradition,\x1fc2007.\x1e  \x1fa3
8 p. ;\x1fc21 cm.\x1e 0\x1faPlatonic Centre pamphlets ;\x1fv2\x1e  \x1faIncludes
 bibliographical references.\x1e 0\x1faChristianity.\x1e 0\x1faMystery.\x1e 0\x1
faImagination.\x1e 0\x1faPotter, Harry (Fictitious character)\x1e10\x1faRowling,
 J. K.\x1e10\x1faBrown, Dan,\x1fd1964-\x1ftDa Vinci code.\x1e10\x1faYeats, W. B.
\x1fq(William Butler),\x1fd1865-1939.\x1e2 \x1faDublin Centre for the Study of t
he Platonic Tradition.\x1e  \x1fd20070411\x1fn565079784\x1fsKennys\x1e\x1d"

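You will notice that in the "ASCII" part there are quite a few \x1e and \x1f bytes before the part where it blew up. There is also a \x1d at the end of each dump: (GROUP|UNIT|RECORD) SEPARATOR characters. You will also notice that the second output looks like gobbledegook too, but with no mention of Hebrew.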
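Those separators match the MARC transmission layout: a 24-byte leader, a directory of 12-byte entries (tag, length, start offset), then the fields, with \x1e ending each field, \x1f introducing each subfield, and \x1d ending the record. A minimal Python 3 sketch of splitting a record along those lines follows. The helper names build_record and split_record are invented for this example, and the sample record is a tiny hand-built one, not the Library of Congress data.

```python
FIELD_END = b"\x1e"   # ends the directory and every field
SUBFIELD = b"\x1f"    # precedes each subfield code
RECORD_END = b"\x1d"  # ends the whole record


def build_record(fields):
    """Assemble a tiny MARC-style record from (tag, body) pairs (demo helper)."""
    directory = b""
    data = b""
    for tag, body in fields:
        chunk = body + FIELD_END
        # Directory entry: 3-byte tag, 4-digit length, 5-digit start offset.
        directory += b"%s%04d%05d" % (tag, len(chunk), len(data))
        data += chunk
    base = 24 + len(directory) + 1          # leader + directory + terminator
    total = base + len(data) + 1            # plus the record terminator
    leader = b"%05dcam  22%05d4a 4500" % (total, base)
    return leader + directory + FIELD_END + data + RECORD_END


def split_record(raw):
    """Return (leader, [(tag, [indicators_or_value, subfield, ...]), ...])."""
    leader = raw[:24]
    base = int(leader[12:17])               # base address of the field data
    directory = raw[24:base - 1]            # drop the directory's \x1e
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        body = raw[base + start:base + start + length - 1]  # drop field \x1e
        fields.append((tag, body.split(SUBFIELD)))
    return leader, fields


record = build_record([
    (b"001", b"16012113"),
    (b"100", b"1 \x1faRowling, J. K."),
])
leader, fields = split_record(record)
for tag, parts in fields:
    print(tag, parts)
```

In practice a MARC library does this parsing for you; the sketch is only meant to show why the dump is full of \x1e and \x1f bytes.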

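Conclusion: forget Hebrew. Forget Unicode: this stuff is not the result of sensible_unicode_text.encode("any_known_encoding"). Z39.50 reeks of punched cards, magnetic drums, and tape. If it knows about Unicode, that is not evident in this data.

It looks like you need to read the ZOOM API documentation that comes with PyZ3950, which will lead you to the ZOOM docs ... good luck.

Update 2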

>>> r0 = res[0]
>>> dir(r0)
['__doc__', '__init__', '__module__', '__str__', '_rt', 'data', 'databaseName',
'get_field', 'get_fieldcount', 'is_surrogate_diag', 'syntax']
>>> r0.syntax
'USMARC'
>>>

It looks like you need to get into MARC.
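One concrete place to start is the 24-byte leader at the front of each dump. In MARC 21, leader positions 0-4 hold the record length, position 9 holds the character coding scheme (blank for MARC-8, "a" for UCS/Unicode), and positions 12-16 hold the base address of the data. A small Python 3 sketch, with describe_leader as a made-up helper name and the leader copied from the first dump:

```python
def describe_leader(leader):
    """Pull the record length, base address, and coding scheme out of a leader."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 bytes")
    schemes = {b" ": "MARC-8", b"a": "UCS/Unicode"}
    return {
        "record_length": int(leader[0:5]),
        "base_address": int(leader[12:17]),
        "encoding": schemes.get(leader[9:10], "unknown"),
    }


# Leader taken from the start of the first dump (hit 0).
info = describe_leader(b"01688cam  22003614a 4500")
print(info)
```

A blank in position 9 suggests the field bytes are MARC-8 rather than a Windows or ISO code page, which would be consistent with the \x1b(2 ... \x1b(B escape sequences visible in the Hebrew fields.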

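Update 3: I noticed BIDI stuff like &#x200f;&#x202a;[2008]&#x202c; in the first dump ... so you will end up with Unicode eventually, once you have worked through the layers of documentation and figured out what is in there ... again, good luck!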
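Those &#x....; sequences are hexadecimal numeric character references stored as plain ASCII inside the record. Once a field has been decoded to text, they can be expanded with a regular expression. This is a minimal Python 3 sketch; expand_ncrs is a made-up helper name, and it assumes the references always use the lowercase-x hexadecimal form seen in the dump.

```python
import re

# Matches references such as &#x200f; (4 to 6 hex digits).
NCR = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def expand_ncrs(text):
    """Replace each hexadecimal character reference with the character itself."""
    return NCR.sub(lambda match: chr(int(match.group(1), 16)), text)


sample = "&#x200f;&#x202a;[2008]&#x202c;"
expanded = expand_ncrs(sample)
print(repr(expanded))
```

Here U+200F is RIGHT-TO-LEFT MARK, U+202A is LEFT-TO-RIGHT EMBEDDING, and U+202C is POP DIRECTIONAL FORMATTING, so the bracketed year is wrapped in directional controls rather than visible characters.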

Votes: 1

Stack Overflow user

Answered 2016-02-16 16:58:29

You need to convert the MARC data for this. You can use the following code:

from pymarc import MARCReader

# Python 2, like the question's script; res is the result set from conn.search(query).
for i in range(0, 2):  # use len(res) here to process every result
    # Each res[i].data is one raw MARC record; MARCReader parses it into records.
    reader = MARCReader(res[i].data)
    for record in reader:  # separate name, so the outer index i is not shadowed
        print record.title(), record.author()

Votes: 1

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/1873754
