Q: Parsing a sentence by lexical content (phrases) in Python

Stack Overflow user

Asked 2014-12-01 17:56:20

1 answer · 3.6K views · 0 followers · 11 votes

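Can Python take an input string and parse it not only on whitespace but also on its content? For example, in the case below, "computer system" becomes a single phrase. Can anyone provide sample code?

Input string: "A survey of user opinion of computer system response time"

Expected output: "A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"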

python

nltk

lexical

1 Answer

Stack Overflow user

Accepted answer

Answered 2014-12-02 00:50:36

The technique you're looking for goes by several names, coming from several subfields of linguistics and computing.

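Keyword Extraction
- From an information retrieval perspective; mainly used to improve indexing/querying for search.
- Read this recent survey paper: http://www.hlt.utdallas.edu/~saidul/acl14.pdf
- (Personally) I strongly recommend https://code.google.com/p/jatetoolkit/, and of course the famous https://code.google.com/p/kea-algorithm/ (from the people who brought you WEKA, http://www.cs.waikato.ac.nz/ml/weka/)
- For Python, possibly https://github.com/aneesha/RAKE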
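
To give a feel for what a RAKE-style keyword extractor does, here is a minimal sketch in plain Python. It is not the code from the repository linked above: the stop list is a tiny made-up one, and the names `candidate_phrases` and `rake_scores` are hypothetical. It only shows the general idea of splitting on stopwords and scoring each candidate phrase by the degree/frequency of its words.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is", "are"}

def candidate_phrases(text):
    """Split text into runs of non-stopwords, also breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text):
    """Score each candidate phrase by the sum of degree/frequency of its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, the word itself included.
            degree[word] += len(phrase)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

scores = rake_scores("A survey of user opinion of computer system response time")
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

On the OP's sentence the longest stopword-free run ranks first, which already hints that this family of methods favours long phrases over "computer system" on its own.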

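Chunking
- From natural language processing; it's also called shallow parsing.
- Read Steve Abney's work on how it came about: http://www.vinartus.net/spa/90e.pdf
- Major NLP frameworks and toolkits should have one (e.g. OpenNLP, GATE, NLTK*). (*Note that NLTK's default chunker only works for named entities.)
- Stanford NLP also has one: http://nlp.stanford.edu/projects/shallow-parsing.shtml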
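
For a rough idea of what a shallow parser does, the sketch below chunks noun phrases by matching a pattern over part-of-speech tags. It is an illustration only, under a few assumptions: the tags are supplied by hand (so no tagger model is needed), the pattern "optional determiner, any adjectives, one or more nouns" is just one possible noun-phrase rule, and `chunk_noun_phrases` is a hypothetical helper rather than any toolkit's API.

```python
import re

def chunk_noun_phrases(tagged):
    """Group tokens matching DT? JJ* NN+ into noun-phrase chunks.

    `tagged` is a list of (word, tag) pairs with Penn Treebank tags.
    The result mixes plain (word, tag) pairs with lists of pairs (the chunks).
    """
    # Encode the tag sequence as "<DT><NN><IN>..." so a regex can scan it.
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ[RS]?>)*(?:<NN[A-Z]*>)+")
    result = []
    position = 0  # index into `tagged`
    cursor = 0    # index into `tag_string`
    for match in pattern.finditer(tag_string):
        # Tokens between the previous match and this one stay unchunked.
        skipped = tag_string[cursor:match.start()].count("<")
        result.extend(tagged[position:position + skipped])
        position += skipped
        size = match.group().count("<")
        result.append(list(tagged[position:position + size]))
        position += size
        cursor = match.end()
    result.extend(tagged[position:])
    return result

# Hand-written tags (assumed, not produced by a tagger here).
tagged = [("A", "DT"), ("survey", "NN"), ("of", "IN"), ("user", "NN"),
          ("opinion", "NN"), ("of", "IN"), ("computer", "NN"),
          ("system", "NN"), ("response", "NN"), ("time", "NN")]
chunks = chunk_noun_phrases(tagged)
for item in chunks:
    print(item)
```

NLTK's `RegexpParser` lets you write tag-pattern grammars of this kind directly, if you would rather not roll your own.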

I'll give an example of the NE chunker in NLTK:

>>> from nltk import word_tokenize, ne_chunk, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
>>> for i in chunked:
...     print(i)
... 
('A', 'DT')
('survey', 'NN')
('of', 'IN')
('user', 'NN')
('opinion', 'NN')
('of', 'IN')
('computer', 'NN')
('system', 'NN')
('response', 'NN')
('time', 'NN')

With named entities:

>>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
>>> for i in chunked:
...     print(i)
... 
(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'NNS')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)

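You can see it's pretty flawed ("Barack Obama" is split into a PERSON and an ORGANIZATION), but it's better than nothing, I guess.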

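Multi-Word Expression (MWE) Extraction
- A hot topic in NLP; everyone wants to extract them for one reason or another.
- Most notable is Ivan Sag's work: http://lingo.stanford.edu/pubs/WP-2001-03.pdf, along with a jumble of extraction algorithms and uses of the extracted expressions in ACL papers.
- As mysterious as MWEs are, and since we don't know how to classify or extract them automatically and properly, there are no proper tools for this (strangely, the output that MWE researchers want can often be obtained with keyphrase extraction or chunking...).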
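
One common starting point for MWE candidates is an association score over adjacent word pairs. The sketch below uses pointwise mutual information (PMI), which is my choice for illustration and not something the papers above prescribe; the toy corpus and the `pmi_bigrams` helper are made up, and the `min_count` cutoff is there because PMI is known to overrate rare pairs.

```python
import math
from collections import Counter

def pmi_bigrams(sentences, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus, invented for this example.
corpus = [
    "the computer system is slow",
    "the computer system crashed",
    "the user restarted the computer system",
    "the user is angry",
]
scores = pmi_bigrams(corpus)
best = max(scores, key=scores.get)
print(best, scores[best])
```

On a corpus this small the numbers mean little; the point is only that pairs which co-occur more often than their individual frequencies predict float to the top.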

Terminology Extraction

- This comes from translation studies, where they want translators to use the correct technical word when translating a document.
- Do note that terminology comes with a cornucopia of ISO standards that one should follow, because of the convoluted translation industry that generates billions in income...
- Monolingually, I've no idea what makes them different from a terminology extractor: same algorithms, different interface... I guess the only special thing about some term extractors is the ability to do it bilingually and produce a dictionary automatically.

Here are some tools:

- [https://github.com/srijiths/jtopia](https://github.com/srijiths/jtopia)
- [http://fivefilters.org/term-extraction/](http://fivefilters.org/term-extraction/)
- [https://github.com/turian/topia.termextract](https://github.com/turian/topia.termextract)
- [https://www.airpair.com/nlp/keyword-extraction-tutorial](https://www.airpair.com/nlp/keyword-extraction-tutorial)
- [http://termcoord.wordpress.com/about/testing-of-term-extraction-tools/free-term-extractors/](http://termcoord.wordpress.com/about/testing-of-term-extraction-tools/free-term-extractors/)
- Note on tools: there's still no one tool that stands out for term extraction. And because of the big money involved, it's always some API calls, and most code is "semi-open", i.e. mostly closed. Then again, SEO is also big money, so possibly it's just a culture thing in the translation industry to be super secretive.

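Now back to the OP's question.

Q: Can "computer system" be extracted as a phrase?

A: Not really.

As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even then not all named entities are recognized well.

Perhaps the OP could try a more radical idea: let's assume that a sequence of nouns in a row always forms a phrase: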

>>> from nltk import word_tokenize, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> tagged = pos_tag(word_tokenize(sent))
>>> chunks = []
>>> current_chunk = []
>>> for word, pos in tagged:
...     if pos.startswith('N'):
...         # Extend the current run of consecutive nouns.
...         current_chunk.append((word, pos))
...     elif current_chunk:
...         # A non-noun ends the run; store it and start a new one.
...         chunks.append(current_chunk)
...         current_chunk = []
... 
>>> if current_chunk:  # flush the run that ends the sentence
...     chunks.append(current_chunk)
... 
>>> chunks
[[('survey', 'NN')], [('user', 'NN'), ('opinion', 'NN')], [('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]]
>>> for i in chunks:
...     print(i)
... 
[('survey', 'NN')]
[('user', 'NN'), ('opinion', 'NN')]
[('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]

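So even with this solution, getting "computer system" on its own seems hard. But if you think about it, "computer system response time" seems like a more valid phrase than "computer system".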
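
If what you want is the flat token list from the question rather than a list of chunks, the same noun-run assumption can be applied with `itertools.groupby`. This is just a sketch: the tags are typed in by hand to match the tagger output shown above, and `merge_noun_runs` is a name I made up.

```python
from itertools import groupby

def merge_noun_runs(tagged):
    """Join each run of consecutive noun-tagged words into a single token."""
    tokens = []
    for is_noun, group in groupby(tagged, key=lambda pair: pair[1].startswith("NN")):
        words = [word for word, _ in group]
        if is_noun:
            tokens.append(" ".join(words))  # one token per noun run
        else:
            tokens.extend(words)            # other words stay separate
    return tokens

# Hand-written tags, copied from the tagger output shown earlier.
tagged = [("A", "DT"), ("survey", "NN"), ("of", "IN"), ("user", "NN"),
          ("opinion", "NN"), ("of", "IN"), ("computer", "NN"),
          ("system", "NN"), ("response", "NN"), ("time", "NN")]
tokens = merge_noun_runs(tagged)
print(tokens)
# ['A', 'survey', 'of', 'user opinion', 'of', 'computer system response time']
```

Note that this also merges "user opinion", which the OP's expected output keeps as two tokens; the heuristic has no way of knowing which noun runs you consider phrases.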

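Note that all of these interpretations of "computer system response time" seem valid:

computer system response time
[computer [system [response time]]]
computer system
[computer system response time]

There are many more possible interpretations. So you have to ask yourself what you are using the phrases for, and then work out how to cut up long phrases like "computer system response time".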
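
One way to decide where to cut is to bring your own list of phrases and merge only those. Below is a minimal sketch of greedy longest-match merging against such a list; the one-entry `lexicon` is hypothetical (you would build it for your own application), and `merge_known_phrases` is an illustrative name, not a library function.

```python
def merge_known_phrases(words, lexicon, max_len=4):
    """Greedy left-to-right longest match of word sequences against a phrase lexicon."""
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest span first, down to two words.
        for size in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + size])
            if candidate.lower() in lexicon:
                tokens.append(candidate)
                i += size
                break
        else:
            # No known phrase starts here; keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens

lexicon = {"computer system"}  # hypothetical phrase list
words = "A survey of user opinion of computer system response time".split()
tokens = merge_known_phrases(words, lexicon)
print(tokens)
# ['A', 'survey', 'of', 'user', 'opinion', 'of', 'computer system', 'response', 'time']
```

With that lexicon you get exactly the output asked for in the question; add "response time" to it and you get a different, equally defensible cut. NLTK also ships an `MWETokenizer` that serves a similar purpose.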

Votes: 18

Page content originally provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/27234280
