
Appending corpus frequencies to each token in a tweet

Stack Overflow user
Asked on 2019-07-24 22:11:42
1 answer · 90 views · 0 followers · 0 votes

I am working with Twitter data that has been tagged with the NLTK POS-tagger. My tokens look like:

[['wasabi', 'NN'], 
['juice', 'NN']]

I also have frequencies from the American National Corpus (ANC): a word list with POS tags and their frequencies. I want to look up each token's word and POS tag, and if found, append the ANC frequency to the token.

Good advice from SO helped, but I found that several tokens had no frequency added (probably because the NLTK tagger is quite inaccurate, e.g. calling 'silent' a noun rather than an adjective), and when I tried to append a frequency anyway I kept getting a KeyError, because NLTK tagged 'jill' as NN rather than NNP.

In the end I decided to just take the first frequency if the word is found. The problem now is that I get all the frequencies listed for the word; I only want the first, so the output should be:

[['wasabi', 'NN', '5'], 
['juice', 'NN', '369']]
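The "first frequency" fallback can be sketched on its own: with the per-word frequencies stored in a dict (insertion-ordered in Python 3.7+), `next(iter(...))` returns the first value listed. A minimal sketch with made-up counts:

```python
# Hypothetical per-word frequency table: {word: {pos: count}}
freqs = {'juice': {'NN': '369', 'VB': '12'}}

word, pos = 'juice', 'NNP'  # suppose NLTK mis-tagged the word
if pos in freqs[word]:
    first = freqs[word][pos]
else:
    # POS not found: fall back to the first frequency listed for the word
    first = next(iter(freqs[word].values()))

print(first)  # '369'
```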

The code:

import csv

# Build {word: {pos: count}} from the tab-separated ANC counts file.
with open('ANC-all-count.txt', 'r', errors='ignore') as f:
    freq_list = csv.reader(f, delimiter='\t')
    freqs = {}
    for word, pos, count in freq_list:
        if word not in freqs:
            freqs[word] = {}
        freqs[word][pos] = count

# Annotate each [word, pos] token with a frequency.
for i, (word, pos) in enumerate(tokens):
    if word not in freqs:
        tokens[i].append(0)
    elif pos not in freqs[word]:
        # POS mismatch: fall back to the first frequency listed for the word.
        tokens[i].append(next(iter(freqs[word].values())))
    else:
        tokens[i].append(freqs[word][pos])

1 Answer

Stack Overflow user

Answered on 2019-07-25 13:54:48

TL;DR

>>> from itertools import chain
>>> from collections import Counter

>>> from nltk.corpus import brown
>>> from nltk import pos_tag, word_tokenize

# Access the first hundred tokenized sentences from the brown corpus
# and POS tag these sentences.
>>> tagged_sents = [pos_tag(tokenized_sent) for tokenized_sent in brown.sents()[:100]]

# Sanity check that the tagged_sents are what we want.
>>> list(chain(*tagged_sents))[:10]
[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN')]

# Use a collections.Counter to get the counts.
>>> freq = Counter(chain(*tagged_sents))

# Top 20 most common words.
>>> dict(freq.most_common(20))
{('the', 'DT'): 128, ('.', '.'): 89, (',', ','): 88, ('of', 'IN'): 67, ('to', 'TO'): 55, ('a', 'DT'): 50, ('and', 'CC'): 40, ('in', 'IN'): 39, ('``', '``'): 35, ("''", "''"): 34, ('The', 'DT'): 28, ('said', 'VBD'): 24, ('that', 'IN'): 24, ('for', 'IN'): 22, ('be', 'VB'): 21, ('was', 'VBD'): 18, ('jury', 'NN'): 17, ('Fulton', 'NNP'): 14, ('election', 'NN'): 14, ('will', 'MD'): 14}

# All the words from most to least common.
>>> dict(freq.most_common())


# To print out the word, pos and counts to file.
>>> with open('freq-counts', 'w') as fout:
...     for (word,pos), count in freq.most_common(20):
...         print('\t'.join([word, pos, str(count)]), file=fout)
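To connect this back to the question's token format: a `Counter` keyed by `(word, pos)` can annotate the `[word, pos]` lists directly, and missing pairs default to 0 instead of raising a KeyError. A minimal sketch with a made-up Counter standing in for `freq` above:

```python
from collections import Counter

# Hypothetical Counter keyed by (word, pos), standing in for `freq` above.
freq = Counter({('jury', 'NN'): 17, ('said', 'VBD'): 24})

# Tweet tokens in the question's [word, pos] list format.
tokens = [['jury', 'NN'], ['wasabi', 'NN']]
for tok in tokens:
    # Counter returns 0 for missing (word, pos) pairs, so no KeyError.
    tok.append(freq[(tok[0], tok[1])])

print(tokens)  # [['jury', 'NN', 17], ['wasabi', 'NN', 0]]
```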
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/57184986
