Q: Defaultdict(defaultdict) for text analysis

Stack Overflow user

Asked 2016-01-12 16:30:22

Answers: 3 | Views: 835 | Followers: 0 | Votes: 2

Text read from a file and cleaned:

['the', 'cat', 'chased', 'the', 'dog', 'fled']

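The challenge now is to return a dict with each word as a key, whose value is a dict of the words that can follow it, each mapped to the number of times it follows: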

{'the': {'cat': 1, 'dog': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}, 'dog': {'fled': 1}}

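collections.Counter will count the frequency of each unique value. However, my algorithm for this challenge is long and unwieldy. How can defaultdict be used to make the solution simpler?

Edit: here is my code for this problem. One flaw is that the values in the nested dicts are the total number of times a word appears in the text, not the number of times it actually follows that particular word.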

from collections import Counter, defaultdict

wordsFile = f.read()
words = [x.strip(string.punctuation).lower() for x in wordsFile.split()]    
counter = Counter(words)

# the dict of [unique word]:[index of appearance in 'words']
index = defaultdict(list) 

# Appends every position of 'term' to the 'term' key
for pos, term in enumerate(words):
    index[term].append(pos)  

# range ends at len(index) - 2 because last word in text has no follower
master = {}
for i in range(0,(len(index)-2)):

    # z will hold the [index of appearance in 'words'] values
    z = []
    z = index.values()[i] 
    try:

        # Because I am interested in follower words
        z = [words[a+1] for a in z]
        print z; print

        # To avoid value errors if a+1 exceeds range of list
    except Exception:
        pass

    # For each word, build r into the dict that contains each follower word and its frequency.

    r = {}
    for key in z:
        r.update({key: counter[key]})

    master.update({index.keys()[i]:r})


return  master
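
The flaw described in the edit can be seen directly on the sample list. The following is a minimal sketch, assuming Python 3 and the cleaned word list from the question; the names `total_counts` and `pair_counts` are illustrative and do not appear in the original code. It contrasts a word's total frequency with the frequency of an adjacent (word, follower) pair.

```python
from collections import Counter

words = ['the', 'cat', 'chased', 'the', 'dog', 'fled']

# Total occurrences of each word anywhere in the text.
# This is what the question's code stores by mistake.
total_counts = Counter(words)

# Occurrences of each (word, follower) pair, built from adjacent words.
# This is the quantity the challenge actually asks for.
pair_counts = Counter(zip(words, words[1:]))

# 'the' appears twice overall, but it follows 'chased' only once.
print(total_counts['the'])             # 2
print(pair_counts[('chased', 'the')])  # 1
```

Looking up `counter[key]` for a follower therefore overstates the count whenever that follower also appears elsewhere in the text.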

Tags: python, python-2.7, collections, defaultdict

3 Answers

Stack Overflow user

Accepted answer

Posted 2016-01-12 17:19:25

Using defaultdict:

import collections

words = ['the', 'cat','chased', 'the', 'dog', 'fled']
result = collections.defaultdict(dict)

for i in range(len(words) - 1):   # loop till second to last word
    occurs = result[words[i]]    # get the dict containing the words that follow and their freqs
    new_freq = occurs.get(words[i+1], 0) + 1  # update the freqs
    occurs[words[i+1]] = new_freq

print list(result.items())
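
Since the question also mentions collections.Counter, here is a possible variation on the same idea. It is a sketch rather than part of the original answer: it assumes Python 3, and the function name `follower_counts` is made up for illustration. Using `Counter` as the default factory removes the need for the manual `get(..., 0) + 1` update.

```python
from collections import Counter, defaultdict


def follower_counts(words):
    """Map each word to a Counter of the words that directly follow it."""
    result = defaultdict(Counter)
    # zip(words, words[1:]) yields each adjacent (word, follower) pair.
    for current, following in zip(words, words[1:]):
        result[current][following] += 1
    return result


words = ['the', 'cat', 'chased', 'the', 'dog', 'fled']
counts = follower_counts(words)

# Convert to plain dicts so the output looks like the one in the question.
plain = {word: dict(followers) for word, followers in counts.items()}
print(plain)
```

The last word, 'fled', has no follower, so it never becomes a key.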

Votes: 1

Stack Overflow user

Posted 2016-01-12 17:00:28

There is no need to use the collections module to implement a working solution:

Example 1

import itertools
import pprint


def main():
    array = 'the', 'cat', 'chased', 'the', 'dog', 'fled'
    frequency = {}
    add_frequency(frequency, array)
    pprint.pprint(frequency)


def add_frequency(frequency, array):
    for a, b in pairwise(array):
        if a in frequency:
            follower = frequency[a]
        else:
            follower = frequency[a] = {}
        if b in follower:
            follower[b] += 1
        else:
            follower[b] = 1


def pairwise(iterable):
    """s -> (s[0], s[1]), (s[1], s[2]), (s[2], s[3]), ..."""
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

if __name__ == '__main__':
    main()

The following code shows how to do what you asked using collections.defaultdict:

Example 2

import collections
import itertools
import pprint


def main():
    array = 'the', 'cat', 'chased', 'the', 'dog', 'fled'
    frequency = collections.defaultdict(lambda: collections.defaultdict(int))
    add_frequency(frequency, array)
    pprint.pprint(frequency)


def add_frequency(frequency, array):
    for a, b in pairwise(array):
        frequency[a][b] += 1


def pairwise(iterable):
    """s -> (s[0], s[1]), (s[1], s[2]), (s[2], s[3]), ..."""
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

if __name__ == '__main__':
    main()
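
One thing to keep in mind with the nested defaultdict in Example 2: indexing a missing key creates it. The sketch below, which assumes Python 3 and uses a made-up word 'zebra' that is not in the sample text, shows the difference between a membership test or `.get()` and plain indexing.

```python
from collections import defaultdict

frequency = defaultdict(lambda: defaultdict(int))
frequency['the']['cat'] += 1

# A membership test and .get() leave the mapping unchanged.
seen_before = 'zebra' in frequency
looked_up = frequency.get('zebra')

# Indexing a missing key calls the factory and stores the new inner dict.
_ = frequency['zebra']
seen_after = 'zebra' in frequency

print(seen_before, looked_up, seen_after)  # False None True
print(sorted(frequency))                   # ['the', 'zebra']
```

If the finished table is going to be queried for words that may not exist, it can be safer to look them up with `.get()` or to convert the result to plain dicts first.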

You can also use functools.partial instead of a lambda when creating the defaultdict.

Example 3

from collections import defaultdict
from functools import partial
from itertools import tee
from pprint import pprint


def main():
    array = 'the', 'cat', 'chased', 'the', 'dog', 'fled'
    frequency = defaultdict(partial(defaultdict, int))
    add_frequency(frequency, array)
    pprint(frequency)


def add_frequency(frequency, array):
    for a, b in pairwise(array):
        frequency[a][b] += 1


def pairwise(iterable):
    """s -> (s[0], s[1]), (s[1], s[2]), (s[2], s[3]), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

if __name__ == '__main__':
    main()
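
To check that the lambda and functools.partial factories behave the same way here, the two can be run side by side. This is a small sketch assuming Python 3; the helper name `count_followers` is illustrative, and it pairs adjacent words with `zip(words, words[1:])` instead of the `pairwise` helper above, which requires a sliceable sequence.

```python
from collections import defaultdict
from functools import partial


def count_followers(frequency, words):
    """Fill a nested mapping with follower counts and return it."""
    for a, b in zip(words, words[1:]):
        frequency[a][b] += 1
    return frequency


words = ['the', 'cat', 'chased', 'the', 'dog', 'fled']

with_lambda = count_followers(defaultdict(lambda: defaultdict(int)), words)
with_partial = count_followers(defaultdict(partial(defaultdict, int)), words)

# defaultdict compares like a regular dict, so the two results are equal.
same = with_lambda == with_partial
print(same)  # True
```

Both factories return a fresh `defaultdict(int)` on each call, so the choice between them is mostly a matter of style.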

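Votes: 2

Stack Overflow user

Posted 2016-01-12 17:19:11

I have a simple answer, although it does not use defaultdict, just a standard dictionary and setdefault. I may be missing your intent, but this is what I see: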

def word_analysis(input):
    from itertools import tee, izip
    i1, i2 = tee(input)
    i2.next()
    results = {}
    for w1,w2 in izip(i1,i2):           # Process works pairwise
        d = results.setdefault(w1,{})   # Establish/use the first word dict
        d[w2] = 1 + d.setdefault(w2,0)  # Increment the counter
    return results

print word_analysis(['the', 'cat', 'chased', 'the', 'dog', 'fled'])

For me, this gives the same result as the output you reported:

{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'dog': {'fled': 1}, 'cat': {'chased': 1}}

Am I missing something?
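
The code in this answer relies on Python 2 features (`izip`, `i2.next()`, and the `print` statement). A rough Python 3 equivalent might look like the sketch below. It is an adaptation rather than the original author's code, and it assumes the input is a list, since it uses slicing instead of `tee`.

```python
def word_analysis(words):
    """Count, for each word, how often each other word directly follows it."""
    results = {}
    for w1, w2 in zip(words, words[1:]):         # Walk adjacent pairs
        followers = results.setdefault(w1, {})   # Establish/use the first word's dict
        followers[w2] = followers.get(w2, 0) + 1  # Increment the counter
    return results


result = word_analysis(['the', 'cat', 'chased', 'the', 'dog', 'fled'])
print(result)
```

For an arbitrary iterable such as a file-backed generator, the `tee`-based pairing from the original answer would still be needed, with `next(i2, None)` in place of `i2.next()`.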

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/34748904
