Q: Python list normalization

Stack Overflow user

Asked 2013-05-14 21:42:29

4 answers, 787 views, 0 followers, score 1

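I am working on a continuously learning focused web crawler that finds news articles related to specific crisis and tragedy events happening around the world. I am currently trying to make the data model as lean and efficient as possible, given that it keeps growing as the crawl goes on.

I store the data model in a list (so I can run a TF-IDF comparison against the page being crawled), and I want to reduce the size of the list without losing the relative counts of the items in it.

Here is a sample model from 2 crawled web pages: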

[[u'remark', u'special', u'agent', u'richard', u'deslauri', u'press', u'investig', u'crime', u'terror', u'crime', u'inform', u'servic', u'inform', u'laboratori', u'servic', u'want', u'want', u'want', u'terror', u'crime', u'want', u'news', u'news', u'press', u'news', u'servic', u'crime', u'inform', u'servic', u'laboratori', u'servic', u'servic', u'crime', u'crime', u'crime', u'terror', u'boston', u'press', u'remark', u'special', u'agent', u'richard', u'deslauri', u'press', u'investig', u'remark', u'special', u'agent', u'richard', u'deslauri', u'press', u'investig', u'boston', u'special', u'agent', u'remark', u'richard', u'deslauri', u'boston', u'investig', u'time', u'time', u'investig', u'boston', u'terror', u'law', u'enforc', u'boston', u'polic', u'polic', u'alreadi', u'alreadi', u'law', u'enforc', u'around', u'evid', u'boston', u'polic', u'evid', u'laboratori', u'evid', u'laboratori', u'may', u'alreadi', u'laboratori', u'investig', u'boston', u'polic', u'law', u'enforc', u'investig', u'around', u'alreadi', u'around', u'investig', u'law', u'enforc', u'evid', u'may', u'time', u'may', u'may', u'investig', u'may', u'around', u'time', u'investig', u'investig', u'boston', u'boston', u'news', u'press', u'boston', u'want', u'boston', u'want', u'news', u'servic', u'inform'], [u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'crime', u'crime', u'inform', u'servic', u'inform', u'servic', u'nation', u'want', u'ten', u'want', u'want', u'crime', u'want', u'news', u'news', u'press', u'releas', u'news', u'stori', u'servic', u'crime', u'inform', u'servic', u'servic', u'servic', u'crime', u'crime', u'crime', u'news', u'press', u'press', u'releas', u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'2013', u'nation', u'press', u'tamerlan', u'tsarnaev', u'dzhokhar', u'tsarnaev', u'tamerlan', u'tsarnaev', u'dzhokhar', u'tsarnaev', 
u'dzhokhar', u'tsarnaev', u'tamerlan', u'tsarnaev', u'dzhokhar', u'tsarnaev', u'2011', u'foreign', u'govern', u'inform', u'tamerlan', u'tsarnaev', u'inform', u'2011', u'govern', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'foreign', u'govern', u'2011', u'inform', u'foreign', u'govern', u'nation', u'press', u'releas', u'crime', u'releas', u'ten', u'news', u'stori', u'2013', u'ten', u'news', u'stori', u'2013', u'ten', u'news', u'stori', u'2013', u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'nation', u'press', u'releas', u'want', u'news', u'servic', u'inform', u'govern']]

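I want to maintain the list of words and not embed the counts into the list itself. I would like to go from a list like this:

Boston, Bombings, Bombings, Tsarnaev, Tsarnaev, Boston, Time, Bombings, Tsarnaev

Basically, if I have a list [a, a, a, b, b, c], I want to reduce it to [a, a, b].

EDIT: Sorry for not being clear, I will try again. I do NOT want a set. The number of occurrences is very important because this is a weighted list, so "Boston" should appear more times than "time" or other similar terms. What I am trying to accomplish is to minimize the data model while removing some of the insignificant terms from it. So in the example above I purposely left out c, because it adds a lot of "fat" to the model. I want to keep the relative proportions: a appears 1 more time than b and 2 more times than c, but since c appears only once in the original model, it is removed from the lean model.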

python

list

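4 Answers

Stack Overflow user

Accepted answer

Answered 2013-05-15 03:43:54

To me this looks like a "normalization" (rather than a "reduction") task, though I am not sure that is the right term.

I think collections.Counter is indeed what you want to use here. It has several convenient methods that make it very easy to change item counts and get the result.

You can create an instance directly from the list, which counts the occurrences of each item. Counter.most_common() gives a list of key/count pairs, sorted from most frequent to least frequent. The lowest count is the second field of the last tuple in that list.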
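As a minimal sketch of those two calls, assuming a small made-up word list (the `words` variable and its contents are illustrative, not taken from the crawled model):

```python
from collections import Counter

# Hypothetical sample data, not the model from the question.
words = ['boston', 'boston', 'boston', 'crime', 'crime', 'time']

counts = Counter(words)        # counts the occurrences of each item
ranked = counts.most_common()  # (key, count) pairs, most frequent first
least_count = ranked[-1][1]    # second field of the last tuple

print(ranked)       # [('boston', 3), ('crime', 2), ('time', 1)]
print(least_count)  # 1
```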

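Counter.subtract() is the key here: pass it a list containing the same keys as the existing Counter instance, and it decrements each key's count by the number of times that key appears in the new list. To build this list, use a list comprehension that repeats each key as many times as the count of the least frequent key (adjusted for your requirement that a key should appear once in the final result if the count is above some threshold). A nested list comprehension is one of my favorite ways to flatten a list; the repetitions of each key are initially created as their own lists.

Finally, Counter.elements() gives an iterator that converts back into a list like the one you started with: each key appears as many times as its count, and keys whose count has dropped below one are omitted.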

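from collections import Counter

def normalize_list(L, threshold):
    cntr = Counter(L)
    # most_common() is sorted by descending count, so the last pair holds the lowest count
    least_count = cntr.most_common()[-1][1]
    # above the threshold, subtract one less so the rarest key still appears once
    if least_count > threshold:
        least_count -= 1
    # repeat every key least_count times (flattened), then subtract those repeats
    cntr.subtract([item for k in cntr.keys() for item in [k] * least_count])
    # elements() skips keys whose count has dropped to zero
    return list(cntr.elements())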

>>> a, b, c, d, e = 'abcde'
>>> normalize_list([a, a, a, a, a, b, b, b, b, c, c, c, d, d], 10)
['a', 'a', 'a', 'c', 'b', 'b']

>>> normalize_list(your_list, 6)
[u'laboratori', u'releas', u'want', u'want', u'want', u'want', u'want', u'want', u'want', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'boston', u'boston', u'boston', u'boston', u'boston', u'boston', u'boston', u'2011', u'2011', u'2011', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'investig', u'investig', u'investig', u'investig', u'investig', u'investig', u'investig', u'may', u'govern', u'govern', u'govern', u'govern', u'govern', u'press', u'press', u'press', u'press', u'press', u'press', u'press', u'press', u'news', u'news', u'news', u'news', u'news', u'news', u'news', u'news', u'news', u'tamerlan', u'tamerlan', u'tamerlan', u'tamerlan', u'tamerlan', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'foreign', u'foreign', u'foreign', u'foreign', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform']

Of course, this does not preserve the order of the original list.
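If order matters, one possible variant is to walk the original list and keep only the first occurrences of each key up to its reduced count. This is a sketch under that assumption; `normalize_list_ordered` is a hypothetical name, and it applies the same threshold rule as normalize_list above:

```python
from collections import Counter

def normalize_list_ordered(words, threshold):
    """Order-preserving variant of normalize_list (hypothetical helper)."""
    if not words:
        return []
    counts = Counter(words)
    least_count = min(counts.values())
    # Same threshold rule as normalize_list: subtract one less when the
    # lowest count is above the threshold.
    if least_count > threshold:
        least_count -= 1
    # Number of occurrences each key is still allowed to contribute.
    remaining = {k: n - least_count for k, n in counts.items()}
    result = []
    for w in words:
        if remaining[w] > 0:
            result.append(w)
            remaining[w] -= 1
    return result

print(normalize_list_ordered(list('aaaaabbbbcccdd'), 10))
# ['a', 'a', 'a', 'b', 'b', 'c']
```

Keeping the earliest occurrences is an arbitrary choice; keeping the latest ones would satisfy the same counts.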

Score: 1

Stack Overflow user

Answered 2013-05-14 21:54:18

from collections import defaultdict
d = defaultdict(int)
for w in words[0]:
    d[w] += 1
mmin = min(d[p] for p in d)

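You can then subtract this mmin from each word's count and build a new list. But perhaps the dict is already compact enough. To preserve order, you can use the information in the dict and devise some clever way to filter your initial word list.

For example, for the word list [a,a,a,b,b,c], the dictionary will contain {a:3, b:2, c:1} and mmin=1. You can use this information to get a slimmer dictionary, {a:2, b:1}, by subtracting 1 from every item; since c is then 0, it is dropped.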
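The dictionary-only version of that step might look like the sketch below. It uses collections.Counter in place of the defaultdict(int) loop purely for brevity, and `slim_counts` is an illustrative name:

```python
from collections import Counter

words = ['a', 'a', 'a', 'b', 'b', 'c']
d = Counter(words)      # Counter({'a': 3, 'b': 2, 'c': 1})
mmin = min(d.values())  # 1

# Subtract mmin from every count and drop the entries that reach zero.
slim_counts = {w: n - mmin for w, n in d.items() if n > mmin}
print(slim_counts)
```

Storing only the counts keeps the relative weights while avoiding the repeated words altogether.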

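Full code:

from collections import defaultdict

words = ['a', 'a', 'a', 'b', 'b', 'c']

# count the occurrences of each word
d = defaultdict(int)
for w in words:
    d[w] += 1
mmin = min(d[p] for p in d)  # lowest count of any word

# keep a word only while its remaining count is above mmin (order is preserved)
slim = []
for w in words:
    if d[w] > mmin:
        slim.append(w)
        d[w] -= 1
print(slim)  # ['a', 'a', 'b']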

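Score: 3

Stack Overflow user

Answered 2013-05-14 21:54:49

If you assign the sample model to a variable topics, you can use collections.Counter to maintain a dictionary-like object of all topics and their counts: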

from collections import Counter
topic_count = [Counter(topic) for topic in topics]
# [Counter({u'boston': 11, u'investig': 11, u'crime': 7, u'servic': 7, u'want': 6, u'press': 6, u'laboratori': 5, u'may': 5, u'news': 5, u'agent': 4, u'alreadi': 4, u'deslauri': 4, u'special': 4, u'richard': 4, u'polic': 4, u'terror': 4, u'around': 4, u'evid': 4, u'law': 4, u'remark': 4, u'inform': 4, u'enforc': 4, u'time': 4}),
#  Counter({u'tsarnaev': 13, u'inform': 12, u'govern': 9, u'tamerlan': 9, u'foreign': 8, u'news': 8, u'crime': 8, u'2011': 7, u'servic': 7, u'press': 6, u'releas': 5, u'want': 5, u'ten': 4, u'request': 4, u'stori': 4, u'nation': 4, u'2013': 4, u'dzhokhar': 4})]
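Building on the per-page counters, one way to trim each of them is Counter's `-` operator, which keeps only strictly positive counts. The sketch below assumes a small stand-in for `topics`, and the `slim_count` and `floor` names are hypothetical:

```python
from collections import Counter

# Hypothetical stand-in for the two crawled pages in the question.
topics = [
    ['boston', 'boston', 'boston', 'crime', 'crime', 'time'],
    ['tsarnaev', 'tsarnaev', 'inform', 'inform', 'inform', 'news'],
]
topic_count = [Counter(topic) for topic in topics]

slim_count = []
for c in topic_count:
    floor = min(c.values())  # lowest count on this page
    # Subtract the floor from every key; "-" discards keys that hit zero.
    slim_count.append(c - Counter(dict.fromkeys(c, floor)))

print(slim_count)
```

Unlike Counter.subtract(), the `-` operator returns a new Counter and leaves the original counts untouched.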

Score: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/16553340
