
NotImplementedError: 'split_respect_sentence_boundary=True' is only compatible with split_by='word'

Stack Overflow user
Asked 2022-11-24 07:54:12
1 answer, 21 views, 0 followers, score 0

I have the following lines of code:

from haystack.document_stores import InMemoryDocumentStore, SQLDocumentStore
from haystack.nodes import TextConverter, PDFToTextConverter, PreProcessor
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers

doc_dir = "C:\\Users\\abcd\\Downloads\\PDF Files\\"

docs = convert_files_to_docs(dir_path=doc_dir, clean_func=None, split_paragraphs=True)


preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="passage",
    split_length=2)
doc = preprocessor.process(docs)

When I try to run it, I get the following error message:

NotImplementedError                       Traceback (most recent call last)
c:\Users\abcd\Downloads\solr9.ipynb Cell 27 in <cell line: 23>()
     16 print(type(docs))
     17 preprocessor = PreProcessor(
     18     clean_empty_lines=True,
     19     clean_whitespace=True,
     20     clean_header_footer=True,
     21     split_by="passage",
     22     split_length=2)
---> 23 doc = preprocessor.process(docs)

File ~\AppData\Roaming\Python\Python39\site-packages\haystack\nodes\preprocessor\preprocessor.py:167, in PreProcessor.process(self, documents, clean_whitespace, clean_header_footer, clean_empty_lines, remove_substrings, split_by, split_length, split_overlap, split_respect_sentence_boundary, id_hash_keys)
    165     ret = self._process_single(document=documents, id_hash_keys=id_hash_keys, **kwargs)  # type: ignore
    166 elif isinstance(documents, list):
--> 167     ret = self._process_batch(documents=list(documents), id_hash_keys=id_hash_keys, **kwargs)
    168 else:
    169     raise Exception("documents provided to PreProcessor.prepreprocess() is not of type list nor Document")


File ~\AppData\Roaming\Python\Python39\site-packages\haystack\nodes\preprocessor\preprocessor.py:225, in PreProcessor._process_batch(self, documents, id_hash_keys, **kwargs)
    222 def _process_batch(
    223     self, documents: List[Union[dict, Document]], id_hash_keys: Optional[List[str]] = None, **kwargs
    224 ) -> List[Document]:
--> 225     nested_docs = [
    226         self._process_single(d, id_hash_keys=id_hash_keys, **kwargs)
...
--> 324     raise NotImplementedError("'split_respect_sentence_boundary=True' is only compatible with split_by='word'.")
    326 if type(document.content) is not str:
    327     logger.error("Document content is not of type str. Nothing to split.")

NotImplementedError: 'split_respect_sentence_boundary=True' is only compatible with split_by='word'.

I am not even passing split_respect_sentence_boundary=True as an argument, and I am not using split_by='word' either: I set split_by="passage".

If I change it to split_by="sentence", I get the same error.

Please let me know if I am missing something here.


1 Answer

Stack Overflow user

Accepted answer

Answered 2022-11-24 09:38:38

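As you can see in the PreProcessor API documentation, the default value of split_respect_sentence_boundary is True. Because of that default, leaving the argument out has the same effect as passing True, which is why the error appears with split_by="passage" or split_by="sentence".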

To make your code work, you should specify split_respect_sentence_boundary=False:

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="passage",
    split_length=2,
    split_respect_sentence_boundary=False)
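
Why the error fires even though the argument was never passed can be sketched without Haystack installed. The function below is a hypothetical stand-in: the names `validate_split_options` and `fails` and their structure are assumptions for illustration, not Haystack's actual code. It only mirrors the default value named above and the message shown in the traceback.

```python
def validate_split_options(split_by, split_respect_sentence_boundary=True):
    """Hypothetical stand-in mirroring the check reported in the traceback."""
    # The default is True, so omitting the argument behaves like passing True.
    if split_respect_sentence_boundary and split_by != "word":
        raise NotImplementedError(
            "'split_respect_sentence_boundary=True' is only compatible with split_by='word'."
        )
    return split_by, split_respect_sentence_boundary


def fails(**kwargs):
    """Return True if the stand-in check rejects the given options."""
    try:
        validate_split_options(**kwargs)
    except NotImplementedError:
        return True
    return False


print(fails(split_by="passage"))   # True: default boundary flag conflicts
print(fails(split_by="sentence"))  # True: same conflict
print(fails(split_by="passage", split_respect_sentence_boundary=False))  # False
print(fails(split_by="word"))      # False: 'word' is compatible with the default
```

Under this sketch, any split_by value other than "word" is rejected unless the flag is explicitly set to False, which matches what the question observed.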

I agree that this behavior is unintuitive. The node is currently undergoing a major refactoring.
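
Since a default like this is easy to miss, one way to surface it is to list a constructor's default arguments with the standard-library `inspect` module. This is only a sketch: `FakePreProcessor` is a hypothetical stand-in so the snippet runs without Haystack, its `split_by` default is a placeholder, and only the `split_respect_sentence_boundary=True` default is taken from the answer above.

```python
import inspect


def list_defaults(func):
    """Map each parameter that has a default to that default value."""
    params = inspect.signature(func).parameters
    return {
        name: p.default
        for name, p in params.items()
        if p.default is not inspect.Parameter.empty
    }


class FakePreProcessor:
    """Hypothetical stand-in, not the real Haystack class."""

    def __init__(self, split_by="word", split_respect_sentence_boundary=True):
        # split_by="word" is a placeholder default chosen for illustration.
        self.split_by = split_by
        self.split_respect_sentence_boundary = split_respect_sentence_boundary


defaults = list_defaults(FakePreProcessor.__init__)
print(defaults["split_respect_sentence_boundary"])  # True
```

With Haystack installed, the same helper could presumably be pointed at `PreProcessor.__init__` instead of the stand-in to see the defaults that apply to the installed version.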

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/74557335
