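I am trying to build a Python program similar to wordcounter.net (https://wordcounter.net/). I have an Excel file with one column of text to analyze. Using pandas and other functions, I created a single-word frequency counter.
But now I need to extend it to find patterns.
For example, a paragraph contains "Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow".
Here it should be able to track patterns such as two-word density.
……
Three-word density
……
I also tried:
for match in re.finditer(pattern, line):
But this again has to be done manually, and I want it to find the patterns automatically.
Can anyone help?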
Posted on 2021-06-23 06:56:14
text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'
# Count how often each word occurs.
d = {}
for s in text.split():
    d.setdefault(s, 0)
    d[s] += 1
# Group the words by their count.
out = {}
for k, v in d.items():
    out.setdefault(v, []).append(k)
# Print the groups, highest count first.
for i in sorted(out.keys(), reverse=True):
    print(f'{i} word density:')
    print(f'\t{out[i]}')
Output:
5 word density:
    ['face']
3 word density:
    ['mellow']
2 word density:
    ['Happy', 'sad']
1 word density:
    ['little', 'baby', 'sweet']
Edit2
from collections import Counter

def freq(lst, n):
    # Build every run of n consecutive words.
    lstn = []
    for i in range(len(lst) - (n - 1)):
        lstn.append(" ".join([lst[i + x] for x in range(n)]))
    # Count the runs and print them in first-seen order.
    out = Counter(lstn)
    print(f'{n} word density:')
    for k, v in out.items():
        print(f'\t"{k}" {v}')

text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'
lst = text.split()
freq(lst, 2)
freq(lst, 3)
Output:
2 word density:
    "Happy face" 2
    "face sad" 1
    "sad face" 2
    "face mellow" 3
    "mellow little" 1
    "little baby" 1
    "baby sweet" 1
    "sweet Happy" 1
    "face face" 1
    "mellow sad" 1
3 word density:
    "Happy face sad" 1
    "face sad face" 1
    "sad face mellow" 2
    "face mellow little" 1
    "mellow little baby" 1
    "little baby sweet" 1
    "baby sweet Happy" 1
    "sweet Happy face" 1
    "Happy face face" 1
    "face face mellow" 1
    "face mellow sad" 1
    "mellow sad face" 1
https://stackoverflow.com/questions/68094770
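A possible variation on the n-word counting above, as a rough sketch rather than a definitive implementation. It differs from the answer's code in three ways, all of which are my own assumptions: it lowercases the text and strips punctuation, so "Happy" and "happy" count as the same word; it returns the counts instead of printing them; and it sorts by frequency and shows a percentage, roughly in the spirit of wordcounter.net. The function name `ngram_counts` is hypothetical.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in text (case-insensitive)."""
    # Lowercase and keep only word characters, so "Face," and "face" match.
    words = re.findall(r"\w+", text.lower())
    # Zipping n staggered slices of the word list yields each n-word window.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(w) for w in windows)


text = 'Happy face sad face mellow little baby sweet Happy face face mellow sad face mellow'

for n in (1, 2, 3):
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    print(f'{n} word density:')
    # most_common() orders the phrases by count, highest first.
    for phrase, count in counts.most_common(3):
        print(f'\t"{phrase}" {count} ({count / total:.0%})')
```

Because the function takes a plain string, it could be called once per cell of the Excel column, or once on all cells joined together, depending on whether phrases should be counted per row or across the whole column.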