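Is it possible for Python to recognize an input string and parse it not just on whitespace but also on its content? Say, in this case, "computer system" becomes a single phrase. Can anyone provide sample code?
Input string: "A survey of user opinion of computer system response time"
Expected output: "A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"
Posted on 2014-12-02 00:50:36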
The technique you are looking for goes by several names, coming from several subfields or sub-subfields of linguistics and computing.
I'll give an example using the NE chunker in NLTK:
>>> from nltk import word_tokenize, ne_chunk, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
>>> for i in chunked:
...     print(i)
...
('A', 'DT')
('survey', 'NN')
('of', 'IN')
('user', 'NN')
('opinion', 'NN')
('of', 'IN')
('computer', 'NN')
('system', 'NN')
('response', 'NN')
('time', 'NN')
With named entities:
>>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
>>> for i in chunked:
...     print(i)
...
(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'NNS')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)
You can see it's pretty much flawed ("Barack Obama" is split into a PERSON and an ORGANIZATION, and "meets" is tagged as a plural noun), but better something than nothing, I guess.
- Terminology extraction comes from translation studies, where they want translators to use the correct technical word when translating a document.
- Do note that terminology comes with a cornucopia of ISO standards that one should follow, because of the convoluted translation industry that generates billions in income...
- Monolingually, I've no idea what makes them different from a terminology extractor: same algorithms, different interface... I guess the only thing about some term extractors is the ability to do it bilingually and produce a dictionary automatically.
- [https://github.com/srijiths/jtopia](https://github.com/srijiths/jtopia)
- [http://fivefilters.org/term-extraction/](http://fivefilters.org/term-extraction/)
- [https://github.com/turian/topia.termextract](https://github.com/turian/topia.termextract)
- [https://www.airpair.com/nlp/keyword-extraction-tutorial](https://www.airpair.com/nlp/keyword-extraction-tutorial)
- [http://termcoord.wordpress.com/about/testing-of-term-extraction-tools/free-term-extractors/](http://termcoord.wordpress.com/about/testing-of-term-extraction-tools/free-term-extractors/)
- Note on tools: there's still no one tool that stands out for term extraction. And because of the big money involved, it's always some API calls, and most code is "semi-open", which mostly means closed. Then again, SEO is also big money, so possibly it's just a culture thing in the translation industry to be super secretive.
Now back to the OP's question.
Q: Can "computer system" be extracted as a phrase?
A: Not really.
As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even then not all named entities are recognized well.
Perhaps the OP could try a more radical idea: let's assume that a sequence of nouns together always forms a phrase:
>>> from nltk import word_tokenize, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> tagged = pos_tag(word_tokenize(sent))
>>> chunks = []
>>> current_chunk = []
>>> for word, pos in tagged:
...     if pos.startswith('N'):
...         current_chunk.append((word, pos))
...     else:
...         if current_chunk:
...             chunks.append(current_chunk)
...         current_chunk = []
...
>>> if current_chunk:  # flush a noun run that ends the sentence
...     chunks.append(current_chunk)
...
>>> chunks
[[('survey', 'NN')], [('user', 'NN'), ('opinion', 'NN')], [('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]]
>>> for i in chunks:
...     print(i)
...
[('survey', 'NN')]
[('user', 'NN'), ('opinion', 'NN')]
[('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]
So even with that solution, it seems hard to get "computer system" alone. But if you think about it, "computer system response time" seems like a more valid phrase than "computer system".
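If the goal is a flat token list like the one in the question, the same "a run of nouns is one phrase" assumption can be sketched with `itertools.groupby`. This is only an illustration: the helper name `merge_noun_runs` is made up here, and the tagged tokens are hard-coded from the `pos_tag` output shown earlier so that the snippet runs without NLTK or its models.

```python
from itertools import groupby

# POS-tagged tokens copied from the pos_tag output shown earlier;
# hard-coded (an assumption) so the sketch runs without NLTK.
tagged = [
    ('A', 'DT'), ('survey', 'NN'), ('of', 'IN'), ('user', 'NN'),
    ('opinion', 'NN'), ('of', 'IN'), ('computer', 'NN'),
    ('system', 'NN'), ('response', 'NN'), ('time', 'NN'),
]

def merge_noun_runs(tagged_tokens):
    """Join each maximal run of noun-tagged words into one phrase.

    Hypothetical helper: non-noun words are kept as single tokens.
    """
    merged = []
    # groupby splits the sentence into alternating noun / non-noun runs.
    for is_noun, group in groupby(tagged_tokens,
                                  key=lambda t: t[1].startswith('N')):
        words = [word for word, _ in group]
        if is_noun:
            merged.append(' '.join(words))
        else:
            merged.extend(words)
    return merged

tokens = merge_noun_runs(tagged)
print(tokens)
# ['A', 'survey', 'of', 'user opinion', 'of', 'computer system response time']
```

Note that this still yields "computer system response time" as one phrase, not "computer system" on its own.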
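Do note that "computer system response time" can be split up in several ways that all seem valid, and there are many more possible interpretations. So you have to ask what you are using the extracted phrases for, and then see how to go about cutting up long phrases like "computer system response time".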
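To get a feel for how many ways a long noun run can be cut, here is a small sketch that enumerates every split of the four words into contiguous phrases. It is not a method for choosing the right split; the function name `segmentations` is invented for this example, and it only shows the candidates you would have to pick from.

```python
from itertools import product

def segmentations(words):
    """Yield every way to split `words` into contiguous phrases.

    Hypothetical helper: each of the len(words) - 1 gaps between
    words is either a cut or not, giving 2 ** (len(words) - 1) splits.
    """
    if not words:
        return
    for cuts in product([False, True], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                # Close the current phrase and start a new one.
                phrases.append(' '.join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(' '.join(current))
        yield phrases

candidates = list(segmentations(['computer', 'system', 'response', 'time']))
for candidate in candidates:
    print(candidate)
```

For four words this prints eight candidates, including both `['computer system response time']` and the `['computer system', 'response', 'time']` split from the question. Which one is right depends on what the phrases are used for.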
https://stackoverflow.com/questions/27234280