Q: Forming bigrams of words in a list of sentences with Python

Stack Overflow user

Asked 2014-02-18 04:41:47

10 answers · 107.1K views · 0 followers · 33 votes

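I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get pairs of sentences instead of pairs of words. Here is what I did: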

import nltk

text2 = [[word for word in line.split()] for line in text]
bigrams = list(nltk.bigrams(text2))  # list() because recent NLTK versions return a generator
print(bigrams)

This yields:

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

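Here 'cant railway station' and 'citadel hotel' form a single bigram. What I want is:

[([cant],[railway]),([railway],[station]),([citadel],[hotel]), and so on...

The last word of the first sentence should not be paired with the first word of the second sentence. What should I do to make this work?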
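A minimal sketch of why the sentence pairs appear, assuming only that a bigram function pairs each item of its input with the item that follows it. Given a list of token lists, the items being paired are whole sentences. The helper `adjacent_pairs` is a hypothetical stand-in for `nltk.bigrams`, not NLTK's implementation.

```python
def adjacent_pairs(items):
    """Pair each item with its successor (hypothetical stand-in for nltk.bigrams)."""
    items = list(items)
    return list(zip(items, items[1:]))


text = ['cant railway station', 'citadel hotel', ' police stn']
text2 = [line.split() for line in text]

# Passing the list of token lists pairs up whole sentences.
sentence_pairs = adjacent_pairs(text2)

# Pairing within each sentence keeps bigrams from crossing sentence boundaries.
word_pairs = [pair for tokens in text2 for pair in adjacent_pairs(tokens)]
print(word_pairs)
```

Applying the pairing once per sentence, rather than once to the whole list, is what the answers below do in different ways.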

list

list-comprehension

nltk

collocation

python

10 Answers

Stack Overflow user

Accepted answer

Posted 2014-02-18 05:04:29

Use list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
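One caveat with `split(" ")`: a sentence with leading or repeated spaces, such as `' police stn'` from the question, produces empty-string tokens and therefore a bigram like `('', 'police')`. The following is a sketch of the same zip approach using `split()` with no argument, which splits on any run of whitespace. The function name `sentence_bigrams` is made up for illustration.

```python
def sentence_bigrams(sentences):
    """Return word bigrams for each sentence, never crossing sentence boundaries."""
    bigrams = []
    for sentence in sentences:
        words = sentence.split()  # no argument: splits on any run of whitespace
        bigrams.extend(zip(words, words[1:]))
    return bigrams


text = ['cant railway station', 'citadel hotel', ' police stn']
print(sentence_bigrams(text))
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

Splitting each sentence once and reusing `words` also avoids calling `split` twice per sentence.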

56 votes

Stack Overflow user

Posted 2018-02-19 18:30:32

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
bigrams = []
for line in text:
    tokens = word_tokenize(line)  # requires NLTK's Punkt tokenizer data
    # The 2 selects bigrams; change it to get n-grams of a different size.
    bigrams.extend(ngrams(tokens, 2))

print(bigrams)
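For readers without NLTK installed, the n-gram idea can be sketched in plain Python. The helper `simple_ngrams` below is a hypothetical illustration of the sliding-window technique, not `nltk.util.ngrams` itself, and it uses `str.split` in place of `word_tokenize`.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (illustrative helper)."""
    tokens = list(tokens)
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))


text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    print(simple_ngrams(line.split(), 2))
```

Changing the second argument to 3 gives trigrams; a sentence shorter than `n` words yields an empty list.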

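17 votes

Stack Overflow user

Posted 2014-02-18 04:55:32

Rather than turning your text into lists of strings, start with each sentence separately as a string. I have also removed punctuation and stopwords; if that is not relevant to you, just remove those portions: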

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    # Score candidate bigrams with the chi-squared measure and keep up to 500.
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    # Append each bigram to the token list as a single "word1 word2" string.
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    # Drop stopwords and entries of 8 characters or fewer, then stem and lowercase the rest.
    stop_words = set(stopwords.words('english'))
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stop_words and len(x) > 8]
    return result

To use it, do this:

for line in sentence:
    features = get_bigrams(line)
    # train set here

Note that this goes a little further and actually scores the bigrams statistically (which will come in handy when training a model).
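As a much simpler illustration of ranking bigrams, raw frequency counts across sentences can be sketched with `collections.Counter`. This is not the chi-squared measure used in this answer; `bigram_counts` and the sample `corpus` are assumptions made for the example.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count word bigrams across sentences without crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


# Hypothetical sample data, chosen so that one bigram repeats.
corpus = ['cant railway station', 'railway station road', 'citadel hotel']
counts = bigram_counts(corpus)
print(counts.most_common(1))
# [(('railway', 'station'), 2)]
```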

9 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/21844546
