
Find the combination of strings that occurs most often in given strings

Stack Overflow user
Asked on 2022-02-08 23:48:27
Answers: 1 · Views: 52 · Followers: 0 · Votes: 0

I need to find the combination of values that occurs most often in a given list of strings.

combination of values == value-value-value-value-value

So a combination must contain 5 values, and values may repeat.

all_values = ["CONJ", "NUM", "ADV", "PRT", "ADP", "PRON", "VERB", "DET", "ADJ", "NOUN"]


all_strings = ["DET-VERB-PRON-PRON-VERB-VERB-ADP",
"DET-NOUN-DET-NOUN-CONJ-PRON-NOUN-NOUN-VERB-NOUN-NOUN",
"PRON-VERB-VERB-DET-NOUN-ADP-NUM-ADP-NOUN",
"NOUN-VERB-NOUN-VERB-ADV-ADJ-ADP-PRON-NOUN-VERB-ADV-ADV-VERB-ADV-VERB",
"ADJ-NOUN-NOUN-PRT-VERB-VERB-DET-NOUN-VERB-ADP-DET-NOUN-NOUN",
"NOUN-VERB-PRT-ADP-PRON-NOUN-DET-VERB-NUM-NOUN-ADP-ADV-VERB",
"NOUN-NOUN-ADP-DET-NOUN-ADP-NOUN-NOUN",
"NOUN-ADV-VERB-ADP-DET-NOUN-VERB-VERB-ADP-NUM-ADJ-NOUN",
"PRON-ADV-VERB-DET-NOUN-PRON-VERB-VERB-NOUN-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-PRON-ADJ-ADJ-NOUN",
"PRON-VERB-DET-NOUN-NOUN-VERB-NOUN-VERB-PRT-VERB-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-DET-NOUN-DET-VERB-PRON-VERB-VERB-NOUN-ADP-PRON",
"NOUN-VERB-ADP-DET-NOUN-ADJ-VERB-ADV-VERB-ADJ-ADP-PRON-VERB-NOUN",
"ADJ-NOUN-NOUN-ADV-ADJ-NOUN-VERB",
"NOUN-ADV-VERB-ADJ-PRON-ADJ-NOUN-VERB-VERB-NUM",
"NOUN-DET-NOUN-ADV-VERB-NOUN-VERB-ADV-DET-ADJ-NOUN",
"ADV-PRON-VERB-ADV-NUM-ADP-DET-NOUN-NOUN-ADJ-NOUN",
"PRON-VERB-DET-NOUN-ADP-PRON-NOUN",
"ADJ-ADP-PRON-NOUN-VERB-VERB-ADP-NOUN-NOUN",
"ADJ-NOUN-NOUN-VERB-DET-PRON-VERB-DET-NOUN",
"NOUN-VERB-PRT-VERB-DET-NOUN-PRT-VERB-ADP-PRON-ADJ-ADV-ADJ-ADV",
"PRON-NOUN-VERB-ADV-VERB-ADP-DET-NOUN",
"ADV-NOUN-VERB-ADV-VERB-NOUN-ADP-PRON-NOUN-VERB-NOUN-ADP-NOUN",
"PRON-VERB-ADP-DET-NOUN-NOUN-ADV-DET-VERB-VERB-PRT-VERB-PRON",
"DET-VERB-ADJ-NOUN-NOUN-ADP-NOUN-ADP-NOUN-VERB-DET-NOUN",
"ADJ-NOUN-VERB-VERB-PRON-NOUN-NOUN",
"ADP-PRON-VERB-PRON-NOUN-PRT-NOUN-CONJ-DET-NOUN-VERB-VERB",
"ADV-DET-ADJ-NOUN-PRON-VERB-ADJ-NUM-ADP-NOUN-NOUN-NOUN",
"VERB-PRON-NOUN-ADV-VERB-PRT-VERB-NOUN",
"ADV-ADP-PRON-VERB-ADP-ADV-VERB-PRON-VERB-DET-NOUN-ADP-NOUN",
"VERB-PRON-VERB-NOUN-ADP-NOUN-NOUN",
"DET-ADV-ADJ-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADJ-ADJ-VERB-ADP-ADJ-ADJ-NOUN-NOUN-VERB",
"NOUN-NOUN-NOUN-ADP-DET-ADJ-NOUN-NOUN-PRT-VERB-PRON-ADP",
"PRON-VERB-DET-ADJ-NOUN-NOUN",
"ADJ-NOUN-NOUN-ADP-ADP-NOUN",
"DET-VERB-DET-NOUN-ADP-NOUN-PRT-NOUN",
"NOUN-NOUN-NOUN-CONJ-ADJ-NOUN-VERB-VERB-VERB",
"ADP-NOUN-PRON-VERB-VERB-PRON-NOUN-NOUN-CONJ-ADJ-PRT-ADJ-NOUN",
"PRON-VERB-PRON-PRON-VERB-ADV-ADV",
"NOUN-VERB-VERB-PRT-VERB-NOUN-ADP-NOUN-NOUN-ADP-DET-NOUN-NOUN",
"PRON-PRON-VERB-VERB-DET-NOUN-CONJ-PRON-VERB-VERB-ADP-DET",
"PRON-VERB-NOUN-ADP-DET-NOUN-CONJ-NOUN-CONJ-NOUN-PRON-VERB-ADP-VERB-PRON",
"NOUN-ADJ-NOUN-VERB-ADV-ADP-DET",
"DET-NOUN-VERB-ADP-ADP-NUM-ADJ-NOUN",
"PRON-VERB-VERB-PRT-VERB-VERB-PRT-ADP-DET-NOUN-ADP-DET-NOUN",
"CONJ-DET-VERB-DET-NOUN-NOUN-ADP-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-VERB-NOUN-ADP-ADJ-NOUN-ADP-NOUN",
"ADV-PRON-VERB-VERB-DET-ADJ-NOUN-ADP-NOUN-PRT-VERB-PRON-NOUN",
"VERB-PRON-VERB-DET-NOUN-ADP-DET-NOUN-PRT-VERB-VERB",
"PRON-VERB-ADP-DET-NOUN-NOUN-VERB-ADJ",
"ADV-VERB-DET-NOUN-ADP-DET-NOUN",
"ADV-ADP-VERB-ADV-PRON-VERB-VERB",
"NOUN-DET-NOUN-NOUN-NOUN-NOUN-VERB-ADJ-NOUN-ADP-DET-NOUN",
"PRON-VERB-NOUN-ADP-NOUN-ADP-NUM-NOUN-NOUN-PRT-VERB-NOUN",
"ADJ-NOUN-ADV-VERB-DET-VERB-VERB-ADJ-NOUN-ADP-PRON",
"NOUN-ADV-ADJ-ADV-ADJ-ADP-DET",
"NOUN-PRON-VERB-ADJ-VERB-ADV",
"VERB-PRT-DET-NOUN-VERB-ADP-DET-NOUN-VERB-ADV-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADV-ADJ-NOUN-NOUN-NOUN-NOUN-NOUN-PRT-VERB-DET-NOUN",
"PRON-VERB-DET-VERB-VERB-DET-NOUN-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-NOUN-NOUN-VERB-NUM-NUM",
"PRON-ADV-VERB-NUM-NOUN-ADV-ADJ-ADP-PRON-NOUN",
"NOUN-VERB-DET-NOUN-ADP-NOUN-ADV-NOUN-VERB-VERB-PRON-NOUN",
"NOUN-VERB-VERB-VERB-ADP-PRON",
"NOUN-VERB-DET-ADV-ADJ-NOUN",
"NOUN-ADV-VERB-PRON-DET-NOUN-NOUN-ADV",
"ADV-VERB-PRT-NUM-NOUN-NOUN-PRON-VERB-DET-NOUN-ADP-DET-NOUN",
"NOUN-VERB-ADV-NOUN-VERB-ADP-DET-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-DET-ADJ-NOUN",
"VERB-PRON-VERB-DET-ADJ-ADP-PRON",
"PRON-VERB-ADJ-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-ADJ-VERB-NOUN-VERB-NOUN-ADP-PRON-ADV",
"NOUN-NOUN-VERB-NOUN-VERB-ADV-ADP-DET-ADJ-NOUN",
"NOUN-VERB-VERB-NOUN-NOUN-NOUN-ADV-VERB-PRON-PRT-VERB-NOUN"]

So I need to find the best combination, for example VERB-VERB-DET-NOUN-ADP or VERB-PRON-NOUN-ADV-VERB, or any other combination.

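I was thinking of generating every possible combination of values from the all_values list, but I'm sure there is a faster way. Of course, the full all_strings list has more than 50k values.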
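For a sense of scale, here is a minimal sketch comparing the two approaches. It assumes a "combination" means 5 consecutive values inside a string (the interpretation the answer below takes); the `sample` string is the first entry of all_strings, and the variable names are illustrative only.

```python
from itertools import product

all_values = ["CONJ", "NUM", "ADV", "PRT", "ADP", "PRON", "VERB", "DET", "ADJ", "NOUN"]

# Brute force: every ordered 5-value combination with repetition allowed.
candidates = sum(1 for _ in product(all_values, repeat=5))
print(candidates)  # 100000, i.e. 10 ** 5

# Alternative: only look at the windows that actually occur in the data.
sample = "DET-VERB-PRON-PRON-VERB-VERB-ADP"  # first string from the question
tags = sample.split("-")
windows = [tuple(tags[i:i + 5]) for i in range(len(tags) - 4)]
print(len(windows))  # 3 windows for a string of 7 tags
```

Checking 100,000 candidates against every string is far more work than counting the handful of windows each string contains, which is why counting observed windows is the faster route.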


1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-02-09 00:15:48

Here is a quick script that seems to do the job fairly fast.

from collections import defaultdict as dd
import time

all_values = ["CONJ", "NUM", "ADV", "PRT", "ADP", "PRON", "VERB", "DET", "ADJ", "NOUN"]


all_strings = ["DET-VERB-PRON-PRON-VERB-VERB-ADP",
"DET-NOUN-DET-NOUN-CONJ-PRON-NOUN-NOUN-VERB-NOUN-NOUN",
"PRON-VERB-VERB-DET-NOUN-ADP-NUM-ADP-NOUN",
"NOUN-VERB-NOUN-VERB-ADV-ADJ-ADP-PRON-NOUN-VERB-ADV-ADV-VERB-ADV-VERB",
"ADJ-NOUN-NOUN-PRT-VERB-VERB-DET-NOUN-VERB-ADP-DET-NOUN-NOUN",
"NOUN-VERB-PRT-ADP-PRON-NOUN-DET-VERB-NUM-NOUN-ADP-ADV-VERB",
"NOUN-NOUN-ADP-DET-NOUN-ADP-NOUN-NOUN",
"NOUN-ADV-VERB-ADP-DET-NOUN-VERB-VERB-ADP-NUM-ADJ-NOUN",
"PRON-ADV-VERB-DET-NOUN-PRON-VERB-VERB-NOUN-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-PRON-ADJ-ADJ-NOUN",
"PRON-VERB-DET-NOUN-NOUN-VERB-NOUN-VERB-PRT-VERB-ADP-DET-NOUN",
"PRON-VERB-ADP-PRON-ADP-DET-NOUN-DET-VERB-PRON-VERB-VERB-NOUN-ADP-PRON",
"NOUN-VERB-ADP-DET-NOUN-ADJ-VERB-ADV-VERB-ADJ-ADP-PRON-VERB-NOUN",
"ADJ-NOUN-NOUN-ADV-ADJ-NOUN-VERB",
"NOUN-ADV-VERB-ADJ-PRON-ADJ-NOUN-VERB-VERB-NUM",
"NOUN-DET-NOUN-ADV-VERB-NOUN-VERB-ADV-DET-ADJ-NOUN",
"ADV-PRON-VERB-ADV-NUM-ADP-DET-NOUN-NOUN-ADJ-NOUN",
"PRON-VERB-DET-NOUN-ADP-PRON-NOUN",
"ADJ-ADP-PRON-NOUN-VERB-VERB-ADP-NOUN-NOUN",
"ADJ-NOUN-NOUN-VERB-DET-PRON-VERB-DET-NOUN",
"NOUN-VERB-PRT-VERB-DET-NOUN-PRT-VERB-ADP-PRON-ADJ-ADV-ADJ-ADV",
"PRON-NOUN-VERB-ADV-VERB-ADP-DET-NOUN",
"ADV-NOUN-VERB-ADV-VERB-NOUN-ADP-PRON-NOUN-VERB-NOUN-ADP-NOUN",
"PRON-VERB-ADP-DET-NOUN-NOUN-ADV-DET-VERB-VERB-PRT-VERB-PRON",
"DET-VERB-ADJ-NOUN-NOUN-ADP-NOUN-ADP-NOUN-VERB-DET-NOUN",
"ADJ-NOUN-VERB-VERB-PRON-NOUN-NOUN",
"ADP-PRON-VERB-PRON-NOUN-PRT-NOUN-CONJ-DET-NOUN-VERB-VERB",
"ADV-DET-ADJ-NOUN-PRON-VERB-ADJ-NUM-ADP-NOUN-NOUN-NOUN",
"VERB-PRON-NOUN-ADV-VERB-PRT-VERB-NOUN",
"ADV-ADP-PRON-VERB-ADP-ADV-VERB-PRON-VERB-DET-NOUN-ADP-NOUN",
"VERB-PRON-VERB-NOUN-ADP-NOUN-NOUN",
"DET-ADV-ADJ-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADJ-ADJ-VERB-ADP-ADJ-ADJ-NOUN-NOUN-VERB",
"NOUN-NOUN-NOUN-ADP-DET-ADJ-NOUN-NOUN-PRT-VERB-PRON-ADP",
"PRON-VERB-DET-ADJ-NOUN-NOUN",
"ADJ-NOUN-NOUN-ADP-ADP-NOUN",
"DET-VERB-DET-NOUN-ADP-NOUN-PRT-NOUN",
"NOUN-NOUN-NOUN-CONJ-ADJ-NOUN-VERB-VERB-VERB",
"ADP-NOUN-PRON-VERB-VERB-PRON-NOUN-NOUN-CONJ-ADJ-PRT-ADJ-NOUN",
"PRON-VERB-PRON-PRON-VERB-ADV-ADV",
"NOUN-VERB-VERB-PRT-VERB-NOUN-ADP-NOUN-NOUN-ADP-DET-NOUN-NOUN",
"PRON-PRON-VERB-VERB-DET-NOUN-CONJ-PRON-VERB-VERB-ADP-DET",
"PRON-VERB-NOUN-ADP-DET-NOUN-CONJ-NOUN-CONJ-NOUN-PRON-VERB-ADP-VERB-PRON",
"NOUN-ADJ-NOUN-VERB-ADV-ADP-DET",
"DET-NOUN-VERB-ADP-ADP-NUM-ADJ-NOUN",
"PRON-VERB-VERB-PRT-VERB-VERB-PRT-ADP-DET-NOUN-ADP-DET-NOUN",
"CONJ-DET-VERB-DET-NOUN-NOUN-ADP-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-VERB-NOUN-ADP-ADJ-NOUN-ADP-NOUN",
"ADV-PRON-VERB-VERB-DET-ADJ-NOUN-ADP-NOUN-PRT-VERB-PRON-NOUN",
"VERB-PRON-VERB-DET-NOUN-ADP-DET-NOUN-PRT-VERB-VERB",
"PRON-VERB-ADP-DET-NOUN-NOUN-VERB-ADJ",
"ADV-VERB-DET-NOUN-ADP-DET-NOUN",
"ADV-ADP-VERB-ADV-PRON-VERB-VERB",
"NOUN-DET-NOUN-NOUN-NOUN-NOUN-VERB-ADJ-NOUN-ADP-DET-NOUN",
"PRON-VERB-NOUN-ADP-NOUN-ADP-NUM-NOUN-NOUN-PRT-VERB-NOUN",
"ADJ-NOUN-ADV-VERB-DET-VERB-VERB-ADJ-NOUN-ADP-PRON",
"NOUN-ADV-ADJ-ADV-ADJ-ADP-DET",
"NOUN-PRON-VERB-ADJ-VERB-ADV",
"VERB-PRT-DET-NOUN-VERB-ADP-DET-NOUN-VERB-ADV-NOUN-VERB-VERB-ADP-PRON",
"PRON-VERB-ADV-ADJ-NOUN-NOUN-NOUN-NOUN-NOUN-PRT-VERB-DET-NOUN",
"PRON-VERB-DET-VERB-VERB-DET-NOUN-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-NOUN-NOUN-VERB-NUM-NUM",
"PRON-ADV-VERB-NUM-NOUN-ADV-ADJ-ADP-PRON-NOUN",
"NOUN-VERB-DET-NOUN-ADP-NOUN-ADV-NOUN-VERB-VERB-PRON-NOUN",
"NOUN-VERB-VERB-VERB-ADP-PRON",
"NOUN-VERB-DET-ADV-ADJ-NOUN",
"NOUN-ADV-VERB-PRON-DET-NOUN-NOUN-ADV",
"ADV-VERB-PRT-NUM-NOUN-NOUN-PRON-VERB-DET-NOUN-ADP-DET-NOUN",
"NOUN-VERB-ADV-NOUN-VERB-ADP-DET-NOUN-NOUN",
"PRON-VERB-ADJ-NOUN-DET-ADJ-NOUN",
"VERB-PRON-VERB-DET-ADJ-ADP-PRON",
"PRON-VERB-ADJ-NOUN-ADP-NOUN",
"NOUN-NOUN-NOUN-ADJ-VERB-NOUN-VERB-NOUN-ADP-PRON-ADV",
"NOUN-NOUN-VERB-NOUN-VERB-ADV-ADP-DET-ADJ-NOUN",
"NOUN-VERB-VERB-NOUN-NOUN-NOUN-ADV-VERB-PRON-PRT-VERB-NOUN"]

start = time.time()

# Split each string into its list of tags.
string_mat = [s.split('-') for s in all_strings]

# Count every window of 5 consecutive tags.
freq = dd(int)
for row in string_mat:
    for i in range(len(row) - 4):
        freq[tuple(row[i:i+5])] += 1

# Highest count wins; on a tie, the pattern that compares greatest is returned.
max_freq, pattern = max((f, p) for p, f in freq.items())

print("Best pattern:", '-'.join(pattern))
print("frequency:", max_freq)
print("Computation time: {:.3} seconds".format(time.time() - start))

Result:

Best pattern: VERB-ADP-DET-NOUN-NOUN
frequency: 4
Computation time: 0.00149 seconds

Bonus: to get a list of all patterns and their frequencies in descending order, run the following (after the script above).

for f, p in sorted(((f, p) for p, f in freq.items()), reverse=True):
    print('-'.join(p), f': {f} time(s)')

Here are the top five:

VERB-ADP-DET-NOUN-NOUN : 4 time(s)
PRON-VERB-DET-NOUN-ADP : 4 time(s)
NOUN-VERB-ADP-DET-NOUN : 4 time(s)
DET-NOUN-ADP-DET-NOUN : 4 time(s)
VERB-DET-NOUN-ADP-NOUN : 3 time(s)
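Note that four patterns tie at a frequency of 4 in the list above, while a `max` over `(frequency, pattern)` tuples returns only the one whose tuple compares greatest. A minimal sketch of collecting every tied pattern follows; `toy_freq` is a made-up stand-in for the real `freq` dictionary, used here only so the snippet runs on its own.

```python
# Hypothetical toy counts; in the real script, freq maps 5-tag tuples to counts.
toy_freq = {("VERB", "ADP"): 4, ("PRON", "VERB"): 4, ("DET", "NOUN"): 3}

# Find the highest count first.
best = max(toy_freq.values())

# Keep every pattern whose count equals the maximum, not just one of them.
tied = sorted(p for p, f in toy_freq.items() if f == best)

for p in tied:
    print("-".join(p), best)
```

Applied to the real data, the same two steps on `freq` would presumably return all four patterns listed with a count of 4.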

An implementation using a collections.Counter object:

# Reuses `time` and `all_strings` from the first script.
from collections import Counter

start = time.time()
string_mat = [s.split('-') for s in all_strings]
# Count every window of 5 consecutive tags across all strings.
freq = Counter(tuple(row[i:i+5]) for row in string_mat for i in range(len(row)-4))

print('5 most common:')
for p, f in freq.most_common(5):
    print('{}: {}'.format('-'.join(p), f))
print("Computation time: {:.3} seconds".format(time.time() - start))

Result:

5 most common:
NOUN-VERB-ADP-DET-NOUN: 4
VERB-ADP-DET-NOUN-NOUN: 4
PRON-VERB-DET-NOUN-ADP: 4
DET-NOUN-ADP-DET-NOUN: 4
VERB-DET-NOUN-ADP-NOUN: 3
Computation time: 0.00294 seconds
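The same idea can be wrapped in a small reusable function. This is only a sketch: the name `ngram_counts` and its parameters (`n`, `sep`, `once_per_string`) are hypothetical and not part of the original answer. The `once_per_string` flag covers the other reading of "occurs most often", where a pattern is counted at most once per string rather than once per occurrence. Updating the counter string by string also avoids building one large intermediate list, which may matter for the full 50k-string input.

```python
from collections import Counter

def ngram_counts(strings, n=5, sep="-", once_per_string=False):
    """Count every run of n consecutive tokens across all strings."""
    counts = Counter()
    for s in strings:
        tokens = s.split(sep)
        # Zipping n staggered slices yields each window of n consecutive tokens;
        # strings with fewer than n tokens yield no windows.
        windows = zip(*(tokens[i:] for i in range(n)))
        # With once_per_string, repeats inside one string collapse to one hit.
        counts.update(set(windows) if once_per_string else windows)
    return counts

demo = ["A-B-C-A-B-C", "A-B-C-D"]  # made-up demo data
total = ngram_counts(demo, n=3)
per_string = ngram_counts(demo, n=3, once_per_string=True)
print(total.most_common(1))       # [(('A', 'B', 'C'), 3)]
print(per_string.most_common(1))  # [(('A', 'B', 'C'), 2)]
```

For the question's data, `ngram_counts(all_strings).most_common(5)` should reproduce the counts shown above.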
Votes: 1
Original page content provided by Stack Overflow. The source page was machine-translated by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/71042357
