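I have a list of strings and need to find the combination of values that occurs most often in it.
A combination of values has the form value-value-value-value-value, so each combination must contain exactly 5 values, and values may repeat.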
all_values = ["CONJ", "NUM", "ADV", "PRT", "ADP", "PRON", "VERB", "DET", "ADJ", "NOUN"]
all_strings = ["DET-VERB-PRON-PRON-VERB-VERB-ADP",
"DET-NOUN-DET-NOUN-CONJ-PRON-NOUN-NOUN-VERB-NOUN-NOUN",
"PRON-VERB-VERB-DET-NOUN-ADP-NUM-ADP-NOUN",
"NOUN-VERB-NOUN-VERB-ADV-ADJ-ADP-PRON-NOUN-VERB-ADV-ADV-VERB-ADV-VERB",
"ADJ-NOUN-NOUN-PRT-VERB-VERB-DET-NOUN-VERB-ADP-DET-NOUN-NOUN",
"NOUN-VERB-PRT-ADP-PRON-NOUN-DET-VERB-NUM-NOUN-ADP-ADV-VERB",
"NOUN-NOUN-ADP-DET-NOUN-ADP-NOUN-NOUN",
"NOUN-ADV-VERB-ADP-DET-NOUN-VERB-VERB-ADP-NUM-ADJ-NOUN",
"PRON-ADV-VERB-DET-NOUN-PRON-VERB-VERB-NOUN-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-PRON-ADJ-ADJ-NOUN",
"PRON-VERB-DET-NOUN-NOUN-VERB-NOUN-VERB-PRT-VERB-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-DET-NOUN-DET-VERB-PRON-VERB-VERB-NOUN-ADP-PRON",
"NOUN-VERB-ADP-DET-NOUN-ADJ-VERB-ADV-VERB-ADJ-ADP-PRON-VERB-NOUN",
"ADJ-NOUN-NOUN-ADV-ADJ-NOUN-VERB",
"NOUN-ADV-VERB-ADJ-PRON-ADJ-NOUN-VERB-VERB-NUM",
"NOUN-DET-NOUN-ADV-VERB-NOUN-VERB-ADV-DET-ADJ-NOUN",
"ADV-PRON-VERB-ADV-NUM-ADP-DET-NOUN-NOUN-ADJ-NOUN",
"PRON-VERB-DET-NOUN-ADP-PRON-NOUN",
"ADJ-ADP-PRON-NOUN-VERB-VERB-ADP-NOUN-NOUN",
"ADJ-NOUN-NOUN-VERB-DET-PRON-VERB-DET-NOUN",
"NOUN-VERB-PRT-VERB-DET-NOUN-PRT-VERB-ADP-PRON-ADJ-ADV-ADJ-ADV",
"PRON-NOUN-VERB-ADV-VERB-ADP-DET-NOUN",
"ADV-NOUN-VERB-ADV-VERB-NOUN-ADP-PRON-NOUN-VERB-NOUN-ADP-NOUN",
"PRON-VERB-ADP-DET-NOUN-NOUN-ADV-DET-VERB-VERB-PRT-VERB-PRON",
"DET-VERB-ADJ-NOUN-NOUN-ADP-NOUN-ADP-NOUN-VERB-DET-NOUN",
"ADJ-NOUN-VERB-VERB-PRON-NOUN-NOUN",
"ADP-PRON-VERB-PRON-NOUN-PRT-NOUN-CONJ-DET-NOUN-VERB-VERB",
"ADV-DET-ADJ-NOUN-PRON-VERB-ADJ-NUM-ADP-NOUN-NOUN-NOUN",
"VERB-PRON-NOUN-ADV-VERB-PRT-VERB-NOUN",
"ADV-ADP-PRON-VERB-ADP-ADV-VERB-PRON-VERB-DET-NOUN-ADP-NOUN",
"VERB-PRON-VERB-NOUN-ADP-NOUN-NOUN",
"DET-ADV-ADJ-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADJ-ADJ-VERB-ADP-ADJ-ADJ-NOUN-NOUN-VERB",
"NOUN-NOUN-NOUN-ADP-DET-ADJ-NOUN-NOUN-PRT-VERB-PRON-ADP",
"PRON-VERB-DET-ADJ-NOUN-NOUN",
"ADJ-NOUN-NOUN-ADP-ADP-NOUN",
"DET-VERB-DET-NOUN-ADP-NOUN-PRT-NOUN",
"NOUN-NOUN-NOUN-CONJ-ADJ-NOUN-VERB-VERB-VERB",
"ADP-NOUN-PRON-VERB-VERB-PRON-NOUN-NOUN-CONJ-ADJ-PRT-ADJ-NOUN",
"PRON-VERB-PRON-PRON-VERB-ADV-ADV",
"NOUN-VERB-VERB-PRT-VERB-NOUN-ADP-NOUN-NOUN-ADP-DET-NOUN-NOUN",
"PRON-PRON-VERB-VERB-DET-NOUN-CONJ-PRON-VERB-VERB-ADP-DET",
"PRON-VERB-NOUN-ADP-DET-NOUN-CONJ-NOUN-CONJ-NOUN-PRON-VERB-ADP-VERB-PRON",
"NOUN-ADJ-NOUN-VERB-ADV-ADP-DET",
"DET-NOUN-VERB-ADP-ADP-NUM-ADJ-NOUN",
"PRON-VERB-VERB-PRT-VERB-VERB-PRT-ADP-DET-NOUN-ADP-DET-NOUN",
"CONJ-DET-VERB-DET-NOUN-NOUN-ADP-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-VERB-NOUN-ADP-ADJ-NOUN-ADP-NOUN",
"ADV-PRON-VERB-VERB-DET-ADJ-NOUN-ADP-NOUN-PRT-VERB-PRON-NOUN",
"VERB-PRON-VERB-DET-NOUN-ADP-DET-NOUN-PRT-VERB-VERB",
"PRON-VERB-ADP-DET-NOUN-NOUN-VERB-ADJ",
"ADV-VERB-DET-NOUN-ADP-DET-NOUN",
"ADV-ADP-VERB-ADV-PRON-VERB-VERB",
"NOUN-DET-NOUN-NOUN-NOUN-NOUN-VERB-ADJ-NOUN-ADP-DET-NOUN",
"PRON-VERB-NOUN-ADP-NOUN-ADP-NUM-NOUN-NOUN-PRT-VERB-NOUN",
"ADJ-NOUN-ADV-VERB-DET-VERB-VERB-ADJ-NOUN-ADP-PRON",
"NOUN-ADV-ADJ-ADV-ADJ-ADP-DET",
"NOUN-PRON-VERB-ADJ-VERB-ADV",
"VERB-PRT-DET-NOUN-VERB-ADP-DET-NOUN-VERB-ADV-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADV-ADJ-NOUN-NOUN-NOUN-NOUN-NOUN-PRT-VERB-DET-NOUN",
"PRON-VERB-DET-VERB-VERB-DET-NOUN-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-NOUN-NOUN-VERB-NUM-NUM",
"PRON-ADV-VERB-NUM-NOUN-ADV-ADJ-ADP-PRON-NOUN",
"NOUN-VERB-DET-NOUN-ADP-NOUN-ADV-NOUN-VERB-VERB-PRON-NOUN",
"NOUN-VERB-VERB-VERB-ADP-PRON",
"NOUN-VERB-DET-ADV-ADJ-NOUN",
"NOUN-ADV-VERB-PRON-DET-NOUN-NOUN-ADV",
"ADV-VERB-PRT-NUM-NOUN-NOUN-PRON-VERB-DET-NOUN-ADP-DET-NOUN",
"NOUN-VERB-ADV-NOUN-VERB-ADP-DET-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-DET-ADJ-NOUN",
"VERB-PRON-VERB-DET-ADJ-ADP-PRON",
"PRON-VERB-ADJ-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-ADJ-VERB-NOUN-VERB-NOUN-ADP-PRON-ADV",
"NOUN-NOUN-VERB-NOUN-VERB-ADV-ADP-DET-ADJ-NOUN",
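"NOUN-VERB-VERB-NOUN-NOUN-NOUN-ADV-VERB-PRON-PRT-VERB-NOUN"]
So I need to find the best combination, for example VERB-VERB-DET-NOUN-ADP or VERB-PRON-NOUN-ADV-VERB or any other combination.
I was thinking of generating every possible combination of values from the all_values list, but I'm sure there is a faster way. Of course, the full all_strings has more than 50k values.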
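To put a number on the brute-force idea, here is a minimal sketch. It assumes a combination means five consecutive values, which is how the answer below reads the question. The helper name window_count is hypothetical and used only for illustration.

```python
from itertools import product

all_values = ["CONJ", "NUM", "ADV", "PRT", "ADP", "PRON", "VERB", "DET", "ADJ", "NOUN"]

# Brute force: every ordered 5-value pattern, with repetition allowed.
n_candidates = sum(1 for _ in product(all_values, repeat=5))


def window_count(s, size=5):
    """Number of runs of `size` consecutive values in one string (hypothetical helper)."""
    return max(len(s.split("-")) - size + 1, 0)


sample = "DET-VERB-PRON-PRON-VERB-VERB-ADP"  # first string of all_strings
print(n_candidates)          # 100000
print(window_count(sample))  # 3
```

Each of the 100000 candidates would have to be checked against every string, whereas a string of k values contains only k - 4 windows of length 5, so counting the windows that actually occur should involve far less work.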
Posted on 2022-02-09 00:15:48
Here is a quick script that seems to do the job fairly fast.
from collections import defaultdict as dd
import time
all_values = ["CONJ", "NUM", "ADV", "PRT", "ADP", "PRON", "VERB", "DET", "ADJ", "NOUN"]
all_strings = ["DET-VERB-PRON-PRON-VERB-VERB-ADP",
"DET-NOUN-DET-NOUN-CONJ-PRON-NOUN-NOUN-VERB-NOUN-NOUN",
"PRON-VERB-VERB-DET-NOUN-ADP-NUM-ADP-NOUN",
"NOUN-VERB-NOUN-VERB-ADV-ADJ-ADP-PRON-NOUN-VERB-ADV-ADV-VERB-ADV-VERB",
"ADJ-NOUN-NOUN-PRT-VERB-VERB-DET-NOUN-VERB-ADP-DET-NOUN-NOUN",
"NOUN-VERB-PRT-ADP-PRON-NOUN-DET-VERB-NUM-NOUN-ADP-ADV-VERB",
"NOUN-NOUN-ADP-DET-NOUN-ADP-NOUN-NOUN",
"NOUN-ADV-VERB-ADP-DET-NOUN-VERB-VERB-ADP-NUM-ADJ-NOUN",
"PRON-ADV-VERB-DET-NOUN-PRON-VERB-VERB-NOUN-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-PRON-ADJ-ADJ-NOUN",
"PRON-VERB-DET-NOUN-NOUN-VERB-NOUN-VERB-PRT-VERB-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-DET-NOUN-DET-VERB-PRON-VERB-VERB-NOUN-ADP-PRON",
"NOUN-VERB-ADP-DET-NOUN-ADJ-VERB-ADV-VERB-ADJ-ADP-PRON-VERB-NOUN",
"ADJ-NOUN-NOUN-ADV-ADJ-NOUN-VERB",
"NOUN-ADV-VERB-ADJ-PRON-ADJ-NOUN-VERB-VERB-NUM",
"NOUN-DET-NOUN-ADV-VERB-NOUN-VERB-ADV-DET-ADJ-NOUN",
"ADV-PRON-VERB-ADV-NUM-ADP-DET-NOUN-NOUN-ADJ-NOUN",
"PRON-VERB-DET-NOUN-ADP-PRON-NOUN",
"ADJ-ADP-PRON-NOUN-VERB-VERB-ADP-NOUN-NOUN",
"ADJ-NOUN-NOUN-VERB-DET-PRON-VERB-DET-NOUN",
"NOUN-VERB-PRT-VERB-DET-NOUN-PRT-VERB-ADP-PRON-ADJ-ADV-ADJ-ADV",
"PRON-NOUN-VERB-ADV-VERB-ADP-DET-NOUN",
"ADV-NOUN-VERB-ADV-VERB-NOUN-ADP-PRON-NOUN-VERB-NOUN-ADP-NOUN",
"PRON-VERB-ADP-DET-NOUN-NOUN-ADV-DET-VERB-VERB-PRT-VERB-PRON",
"DET-VERB-ADJ-NOUN-NOUN-ADP-NOUN-ADP-NOUN-VERB-DET-NOUN",
"ADJ-NOUN-VERB-VERB-PRON-NOUN-NOUN",
"ADP-PRON-VERB-PRON-NOUN-PRT-NOUN-CONJ-DET-NOUN-VERB-VERB",
"ADV-DET-ADJ-NOUN-PRON-VERB-ADJ-NUM-ADP-NOUN-NOUN-NOUN",
"VERB-PRON-NOUN-ADV-VERB-PRT-VERB-NOUN",
"ADV-ADP-PRON-VERB-ADP-ADV-VERB-PRON-VERB-DET-NOUN-ADP-NOUN",
"VERB-PRON-VERB-NOUN-ADP-NOUN-NOUN",
"DET-ADV-ADJ-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADJ-ADJ-VERB-ADP-ADJ-ADJ-NOUN-NOUN-VERB",
"NOUN-NOUN-NOUN-ADP-DET-ADJ-NOUN-NOUN-PRT-VERB-PRON-ADP",
"PRON-VERB-DET-ADJ-NOUN-NOUN",
"ADJ-NOUN-NOUN-ADP-ADP-NOUN",
"DET-VERB-DET-NOUN-ADP-NOUN-PRT-NOUN",
"NOUN-NOUN-NOUN-CONJ-ADJ-NOUN-VERB-VERB-VERB",
"ADP-NOUN-PRON-VERB-VERB-PRON-NOUN-NOUN-CONJ-ADJ-PRT-ADJ-NOUN",
"PRON-VERB-PRON-PRON-VERB-ADV-ADV",
"NOUN-VERB-VERB-PRT-VERB-NOUN-ADP-NOUN-NOUN-ADP-DET-NOUN-NOUN",
"PRON-PRON-VERB-VERB-DET-NOUN-CONJ-PRON-VERB-VERB-ADP-DET",
"PRON-VERB-NOUN-ADP-DET-NOUN-CONJ-NOUN-CONJ-NOUN-PRON-VERB-ADP-VERB-PRON",
"NOUN-ADJ-NOUN-VERB-ADV-ADP-DET",
"DET-NOUN-VERB-ADP-ADP-NUM-ADJ-NOUN",
"PRON-VERB-VERB-PRT-VERB-VERB-PRT-ADP-DET-NOUN-ADP-DET-NOUN",
"CONJ-DET-VERB-DET-NOUN-NOUN-ADP-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-VERB-NOUN-ADP-ADJ-NOUN-ADP-NOUN",
"ADV-PRON-VERB-VERB-DET-ADJ-NOUN-ADP-NOUN-PRT-VERB-PRON-NOUN",
"VERB-PRON-VERB-DET-NOUN-ADP-DET-NOUN-PRT-VERB-VERB",
"PRON-VERB-ADP-DET-NOUN-NOUN-VERB-ADJ",
"ADV-VERB-DET-NOUN-ADP-DET-NOUN",
"ADV-ADP-VERB-ADV-PRON-VERB-VERB",
"NOUN-DET-NOUN-NOUN-NOUN-NOUN-VERB-ADJ-NOUN-ADP-DET-NOUN",
"PRON-VERB-NOUN-ADP-NOUN-ADP-NUM-NOUN-NOUN-PRT-VERB-NOUN",
"ADJ-NOUN-ADV-VERB-DET-VERB-VERB-ADJ-NOUN-ADP-PRON",
"NOUN-ADV-ADJ-ADV-ADJ-ADP-DET",
"NOUN-PRON-VERB-ADJ-VERB-ADV",
"VERB-PRT-DET-NOUN-VERB-ADP-DET-NOUN-VERB-ADV-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADV-ADJ-NOUN-NOUN-NOUN-NOUN-NOUN-PRT-VERB-DET-NOUN",
"PRON-VERB-DET-VERB-VERB-DET-NOUN-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-NOUN-NOUN-VERB-NUM-NUM",
"PRON-ADV-VERB-NUM-NOUN-ADV-ADJ-ADP-PRON-NOUN",
"NOUN-VERB-DET-NOUN-ADP-NOUN-ADV-NOUN-VERB-VERB-PRON-NOUN",
"NOUN-VERB-VERB-VERB-ADP-PRON",
"NOUN-VERB-DET-ADV-ADJ-NOUN",
"NOUN-ADV-VERB-PRON-DET-NOUN-NOUN-ADV",
"ADV-VERB-PRT-NUM-NOUN-NOUN-PRON-VERB-DET-NOUN-ADP-DET-NOUN",
"NOUN-VERB-ADV-NOUN-VERB-ADP-DET-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-DET-ADJ-NOUN",
"VERB-PRON-VERB-DET-ADJ-ADP-PRON",
"PRON-VERB-ADJ-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-ADJ-VERB-NOUN-VERB-NOUN-ADP-PRON-ADV",
"NOUN-NOUN-VERB-NOUN-VERB-ADV-ADP-DET-ADJ-NOUN",
"NOUN-VERB-VERB-NOUN-NOUN-NOUN-ADV-VERB-PRON-PRT-VERB-NOUN"]
start = time.time()
# Split each string into its list of values.
string_mat = [s.split('-') for s in all_strings]
# Count every run of 5 consecutive values.
freq = dd(int)
for row in string_mat:
    for i in range(len(row) - 4):
        freq[tuple(row[i:i+5])] += 1
# Ties on frequency are broken by the lexicographically largest pattern.
max_freq, pattern = max((f, p) for p, f in freq.items())
print("Best pattern:", '-'.join(pattern))
print("frequency:", max_freq)
print("Computation time: {:.3} seconds".format(time.time() - start))
Result:
Best pattern: VERB-ADP-DET-NOUN-NOUN
frequency: 4
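Computation time: 0.00149 seconds
Bonus: to get a list of all patterns and their frequencies in descending order, run the following after the script above.
for f, p in sorted([(f, p) for (p, f) in freq.items()], reverse=True):
    print('-'.join(p), f': {f} time(s)')
Here are the top five: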
VERB-ADP-DET-NOUN-NOUN : 4 time(s)
PRON-VERB-DET-NOUN-ADP : 4 time(s)
NOUN-VERB-ADP-DET-NOUN : 4 time(s)
DET-NOUN-ADP-DET-NOUN : 4 time(s)
VERB-DET-NOUN-ADP-NOUN : 3 time(s)
An implementation using a collections.Counter object:
from collections import Counter
start = time.time()
string_mat = [s.split('-') for s in all_strings]
# Build all 5-value windows and count them in one pass.
freq = Counter(tuple(row[i:i+5]) for row in string_mat for i in range(len(row) - 4))
print('5 most common:')
for p, f in freq.most_common(5):
    print('{}: {}'.format('-'.join(p), f))
print("Computation time: {:.3} seconds".format(time.time() - start))
Result:
5 most common:
NOUN-VERB-ADP-DET-NOUN: 4
VERB-ADP-DET-NOUN-NOUN: 4
PRON-VERB-DET-NOUN-ADP: 4
DET-NOUN-ADP-DET-NOUN: 4
VERB-DET-NOUN-ADP-NOUN: 3
Computation time: 0.00294 seconds
https://stackoverflow.com/questions/71042357
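Both scripts above hard-code the window length of 5 and report a single winner, although the output shows four patterns tied at a frequency of 4. The following is a rough sketch of how the same counting idea might be wrapped so that the window length is a parameter and all tied patterns are returned. The function names ngram_counts and most_frequent and the toy inputs are hypothetical, not part of the original answer.

```python
from collections import Counter


def ngram_counts(strings, n=5, sep="-"):
    """Count every run of n consecutive values across all strings."""
    counts = Counter()
    for s in strings:
        row = s.split(sep)
        # Strings with fewer than n values contribute no windows.
        counts.update(tuple(row[i:i + n]) for i in range(len(row) - n + 1))
    return counts


def most_frequent(counts, sep="-"):
    """Return (highest count, sorted list of every pattern with that count)."""
    if not counts:
        return 0, []
    best = max(counts.values())
    return best, sorted(sep.join(p) for p, f in counts.items() if f == best)


# Toy inputs, chosen only to keep the example short.
print(most_frequent(ngram_counts(["A-B-C-A-B-C", "A-B-C"], n=3)))  # (3, ['A-B-C'])
print(most_frequent(ngram_counts(["A-B", "B-A"], n=2)))            # (1, ['A-B', 'B-A'])
```

With the post's data, calling most_frequent(ngram_counts(all_strings)) would be expected to return all four patterns that the output above lists with a frequency of 4.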