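This is my first time trying Whoosh for text search. I want to find documents that contain the word "XML". Since I'm a beginner, I have only written a small program that searches for a word in a single document, where the document is a text file (myRoko.txt).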
import os, os.path
from whoosh import index
from whoosh.index import open_dir
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser
from whoosh.query import *
if not os.path.exists("indexdir3"):
    os.mkdir("indexdir3")
schema = Schema(name=ID(stored=True), content=TEXT)
ix = index.create_in("indexdir3", schema)
writer = ix.writer()
path = "myRoko.txt"
with open(path, "r") as f:
    content = f.read()
f.close()
writer.add_document(name=path, content= content)
writer.commit()
ix = open_dir("indexdir3")
query_b = QueryParser('content', ix.schema).parse('XML')
with ix.searcher() as srch:
    res_b = srch.search(query_b)
    print res_b[0]
The code above is supposed to print the document that contains the word "XML". Instead, it fails with the following error:
raise ValueError("%r is not unicode or sequence" % value)
ValueError: 'A large number of documents are now represented and stored
as XML document on the web. Thus ................
What could be causing this error?
Posted on 2015-06-27 20:30:07
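You have a Unicode problem. Whoosh expects unicode strings to be passed to the indexer, but in Python 2 a file opened with the built-in open() returns byte strings. To fix this, open the text file so that it is decoded to unicode: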
import codecs
with codecs.open(path, "r", "utf-8") as f:
    content = f.read()
and use a unicode string for the file name as well:
path = u"myRoko.txt"
After these fixes, I get this result:
<Hit {'name': u'myRoko.txt'}>
Posted on 2016-09-08 01:33:16
writer.add_document(name=unicode(path), content=unicode(content))
Both values have to be unicode.
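One caveat with calling unicode(content) without an encoding: in Python 2 it decodes using the default ASCII codec, so it raises UnicodeDecodeError as soon as the file contains a non-ASCII byte. A rough sketch of decoding explicitly while reading is below. It uses only the standard library (no Whoosh calls), and the helper name read_text, the file sample.txt, and the UTF-8 encoding are all assumptions made for illustration; use whatever encoding your file is actually saved in.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: return the file contents as a text (unicode) string.

    io.open decodes while reading, so the result is `unicode` on Python 2
    and `str` on Python 3. The UTF-8 default is an assumption.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo with a throwaway file that contains a non-ASCII character.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, u"sample.txt")  # hypothetical file name
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(u"A large number of documents are stored as XML, caf\u00e9 included.")

content = read_text(sample_path)

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")
print(isinstance(content, text_type))  # True
```

The string returned by read_text can then be passed as the content value to writer.add_document in the question's code, with no extra unicode() call.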
https://stackoverflow.com/questions/30779027