
How to improve textacy.extract.semistructured_statements() results

Stack Overflow user
Asked on 2020-04-06 23:14:37
2 answers, 1.2K views, 0 followers, 1 vote

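For this project I am using the wikipedia, spaCy, and textacy.extract modules.

I use the wikipedia module to fetch the page for the subject I set. It returns the page content as a string.

Then I use textacy.extract.semistructured_statements() to filter out facts. It takes two required arguments: the first is the document and the second is the entity.

For testing, I tried setting the subject to Ubuntu and to Bill Gates.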

import spacy
import textacy.extract
import wikipedia

# The subject we are looking for
subject = 'Bill Gates'

# The Wikipedia page
wikiResults = wikipedia.search(subject)
wikiPage = wikipedia.page(wikiResults[0]).content

# spaCy
nlp = spacy.load("en_core_web_sm")
document = nlp(wikiPage)

# textacy.extract
statements = textacy.extract.semistructured_statements(document, subject)

for statement in statements:
    entity, verb, fact = statement

    print(fact)

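When I run this, I get multiple results when searching for Ubuntu, but not for Bill Gates. Why is that, and how can I improve my code to extract more facts from the Wikipedia page?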

Edit: here are the final results

Ubuntu:

Bill Gates:

2 Answers

Stack Overflow user

Accepted answer

Posted on 2020-04-22 19:06:11

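You need to process the document with different cues so that you pick up the common verbs used to describe the subject, and you also need to split the string if you are searching for more than one word. For example, for Bill Gates you need to search for the combinations "Bill", "Gates", and "Bill Gates", and you need to try different cue verbs, in base form, that are used to describe a person or object.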
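The combinations described above can be enumerated with a few lines of plain Python. This is only a sketch: `entity_variants` and `search_grid` are hypothetical helper names, not part of textacy, and the cue list is just an illustration. Each resulting pair would then be passed to semistructured_statements() as the entity and cue.

```python
from itertools import product


def entity_variants(subject):
    """Return the full subject followed by each of its distinct words.

    Hypothetical helper: "Bill Gates" -> ["Bill Gates", "Bill", "Gates"].
    """
    full = " ".join(subject.split())  # normalise whitespace
    if not full:
        return []
    variants = [full]
    for part in full.split(" "):
        if part not in variants:
            variants.append(part)
    return variants


def search_grid(subject, cues):
    """Pair every entity variant with every cue verb (base form)."""
    return list(product(entity_variants(subject), cues))


if __name__ == "__main__":
    # Illustrative cue list; choose verbs that suit your subject.
    for entity, cue in search_grid("Bill Gates", ["be", "have"]):
        print(entity, "/", cue)
```

A single-word subject such as "Ubuntu" yields just one variant, so the grid then only varies the cue.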

For example, searching for "Gates":

statements = textacy.extract.semistructured_statements(document, "Gates", cue="have", max_n_words=200)

will give you more results, such as:

* entity: Gates , cue: had , fact: primary responsibility for Microsoft's product strategy from the company's founding in 1975 until 2006
* entity: Gates , cue: is , fact: notorious for not being reachable by phone and for not returning phone calls
* entity: Gates , cue: was , fact: the second wealthiest person behind Carlos Slim, but regained the top position in 2013, according to the Bloomberg Billionaires List
* entity: Bill , cue: were , fact: the second-most generous philanthropists in America, having given over $28 billion to charity
* entity: Gates , cue: was , fact: seven years old
* entity: Gates , cue: was , fact: the guest on BBC Radio 4's Desert Island Discs on January 31, 2016, in which he talks about his relationships with his father and Steve Jobs, meeting Melinda Ann French, the start of Microsoft and some of his habits (for example reading The Economist "from cover to cover every week
* entity: Gates , cue: was , fact: the world's highest-earning billionaire in 2013, as his net worth increased by US$15.8 billion to US$78.5 billion

Note that the verb can be negated, as in result 2!

I also noticed that raising max_n_words above its default of 20 words can surface longer, more complex statements.
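If a large max_n_words returns facts that are too long to be useful, one option is to sort them by length afterwards. The sketch below assumes the facts have already been converted to plain strings (for example with str(fact)); `split_by_length` is a hypothetical helper, and it counts whitespace-separated words, which is not necessarily how textacy counts words.

```python
def split_by_length(facts, max_words=20):
    """Partition fact strings into (short, long) by whitespace word count.

    Hypothetical helper; the word count here is a rough approximation
    and may differ from the tokenizer-based count used by the library.
    """
    short, long_ = [], []
    for fact in facts:
        if len(fact.split()) <= max_words:
            short.append(fact)
        else:
            long_.append(fact)
    return short, long_


if __name__ == "__main__":
    # Two facts taken from the output listed above.
    facts = [
        "seven years old",
        "notorious for not being reachable by phone and for not returning phone calls",
    ]
    short, long_ = split_by_length(facts, max_words=5)
    print("short:", short)
    print("long:", long_)
```

This keeps the wide search while letting you review the short, easier-to-read facts first.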

Here is my full script:

import wikipedia
import spacy
import textacy
import en_core_web_sm

subject = 'Bill Gates'

# The Wikipedia page
wikiResults = wikipedia.search(subject)
#print("wikiResults:", wikiResults)
wikiPage = wikipedia.page(wikiResults[0]).content
print("\n\nwikiPage:", wikiPage, "\n")
nlp = en_core_web_sm.load()
document = nlp(wikiPage)
uniqueStatements = set()
# Try every name variant with every cue verb (cues are given in base form)
for word in ["Gates", "Bill", "Bill Gates"]:
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statements = textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200)
        for statement in statements:
            uniqueStatements.add(statement)

print("found", len(uniqueStatements), "statements.")
for statement in uniqueStatements:
    entity, cue, fact = statement
    print("* entity:", entity, ", cue:", cue, ", fact:", fact)

Varying the subject strings and the cue verbs gets me 23 results instead of one.
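When several entity strings are searched, the same fact may show up more than once under different entities. One way to collapse such repeats is to compare the text of the cue and fact rather than the statement objects themselves. This is a sketch under two assumptions: each statement unpacks into (entity, cue, fact) as in the script above, and str() on each part gives its text. `dedupe_statements` is a hypothetical helper, not a textacy function.

```python
def dedupe_statements(statements):
    """Keep the first statement for each distinct (cue, fact) text pair.

    Hypothetical helper: comparison is case-insensitive and ignores
    differences in whitespace; the entity is deliberately not part of
    the key, so "Gates" and "Bill Gates" variants collapse together.
    """
    seen = set()
    unique = []
    for entity, cue, fact in statements:
        key = (
            str(cue).strip().lower(),
            " ".join(str(fact).split()).lower(),
        )
        if key not in seen:
            seen.add(key)
            unique.append((entity, cue, fact))
    return unique


if __name__ == "__main__":
    # Plain strings stand in for the objects returned by the extractor.
    sample = [
        ("Gates", "was", "seven years old"),
        ("Bill Gates", "was", "seven  years old"),
        ("Gates", "had", "primary responsibility for Microsoft's product strategy"),
    ]
    for entity, cue, fact in dedupe_statements(sample):
        print("* entity:", entity, ", cue:", cue, ", fact:", fact)
```

Leaving the entity out of the key is a design choice; include it if statements about different entities with identical wording should be kept apart.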

Votes: 3

Stack Overflow user

Posted on 2020-04-21 21:11:04

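I would like to thank Gabriel for pointing me in the right direction.

I added "It", "he", "she", and "they", which I saw in the neuralcoref module examples.

The code below will do the job for you: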

import wikipedia
import spacy
import textacy
import en_core_web_sm

subject = 'Bill Gates'

# The Wikipedia page
wikiResults = wikipedia.search(subject)

wikiPage = wikipedia.page(wikiResults[0]).content

nlp = en_core_web_sm.load()
document = nlp(wikiPage)
uniqueStatements = set()

# Search pronouns as well as each word of the subject
for word in ["It", "he", "she", "they"] + subject.split(' '):
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statements = textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200)
        for statement in statements:
            uniqueStatements.add(statement)

for statement in uniqueStatements:
    entity, cue, fact = statement
    print(entity, cue, fact)
Votes: -1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/61070395
