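I'm counting words in source code; for example, I want to know how many times for appears in a txt file. It works well so far, but some programmers write for( and others write for (. My code only counts the for ( that has a space, not the one without. How can I fix this? Also, in some cases programmers write for(xxx, for (xxx or for ( xxx run together; how can I count only the for?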

from collections import Counter

words_to_keep = {"for", "setup()", "loop()"}

def word_count(filename):
    # Count whitespace-delimited tokens that exactly match a kept word.
    with open(filename, 'r') as f:
        return Counter(w for w in f.read().split() if w in words_to_keep)

counter = word_count('hola.txt')
for i in counter:
    print(i, ":", counter[i])

Posted on 2020-10-30 00:58:57

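As you observed, the problem with split is that it needs each word to be surrounded by whitespace, which is not always the case in code. A regular expression is probably a better fit for this more general kind of string matching.

First escape the words you want to keep and join them with OR (|), then find all matches of that pattern in the file and count them: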
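
The difference is easy to see on a small in-memory sample. The sketch below is only an illustration: the sample text and the helper names count_with_split and count_with_regex are hypothetical, and the \b word boundaries are an added assumption so that for is not counted inside a longer identifier such as format.

```python
import re

# Hypothetical sample covering the three spellings from the question,
# plus an identifier that merely starts with "for".
sample = "for(i=0;i<3;i++)\nfor (j=0;j<3;j++)\nfor ( k=0;k<3;k++)\nformat(x)"

def count_with_split(text, word="for"):
    """Count whitespace-delimited tokens that are exactly `word`."""
    return sum(1 for token in text.split() if token == word)

def count_with_regex(text, word="for"):
    """Count `word` as a whole word, regardless of what follows it."""
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))

print(count_with_split(sample))  # misses the "for(" spelling
print(count_with_regex(sample))  # finds all three loops, skips "format"
```

Here split only sees a bare for token when a space follows the keyword, while the regular expression does not depend on the surrounding whitespace.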
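
import re
from collections import Counter

words_to_keep = {"for", "setup()", "loop()"}
pattern = re.compile('|'.join(re.escape(word) for word in words_to_keep))
# In this case the pattern is equivalent to r"for|setup\(\)|loop\(\)"
# (the order of the alternatives may vary, since words_to_keep is a set).
# Note that a bare "for" also matches inside longer names such as "format".

def word_count(filename):
    with open(filename, 'r') as f:
        words_found = pattern.findall(f.read())
    return Counter(words_found)

for word, count in word_count('test.txt').items():
    print(word, ":", count)

If the file is large and you don't want to read it all at once, you can take advantage of the fact that Counters can be added together:
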
def word_count(filename):
    counter = Counter()
    with open(filename, 'r') as f:
        for line in f:
            # Count the matches on each line and accumulate the totals.
            counter += Counter(pattern.findall(line))
    return counter

Posted on 2020-10-30 00:58:35

You can use re.sub to replace the optional whitespace and the parenthesis.
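
As a quick check of the idea on an in-memory string before applying it to a file, here is a minimal sketch. The helper name normalize_for, the sample text and the exact pattern \bfor\s*\( are assumptions made for illustration, not necessarily what the answer intended.

```python
import re
from collections import Counter

def normalize_for(text):
    """Rewrite "for(" / "for (" as "for " so split() yields a bare "for"."""
    return re.sub(r"\bfor\s*\(", "for ", text)

# Hypothetical sample with the three spellings from the question.
sample = "for(i=0;i<3;i++)\nfor (j=0;j<3;j++)\nfor ( k=0;k<3;k++)"

tokens = normalize_for(sample).split()
for_counts = Counter(t for t in tokens if t == "for")
print(for_counts)
```

Because the keyword is kept and only the parenthesis is replaced by a space, every spelling ends up as a separate for token that the split-based counter can see.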

from collections import Counter
import re

words_to_keep = {"for", "setup()", "loop()"}

def word_count(filename):
    with open(filename, 'r') as f:
        # Rewrite "for(" / "for (" as "for " so that split() yields a bare
        # "for" token whatever spacing the programmer used.
        text = re.sub(r'\bfor\s*\(', 'for ', f.read())
    return Counter(w for w in text.split() if w in words_to_keep)

counter = word_count('hola.txt')
for i in counter:
    print(i, ":", counter[i])

Posted on 2020-10-30 03:34:26

Without using Counter:

import re

def word_count(filename, words):
    # Start every word at zero so that words with no matches still appear.
    res = {x: 0 for x in words}
    pattern = re.compile('|'.join(re.escape(word) for word in words))
    with open(filename, 'r') as f:
        for a in pattern.finditer(f.read()):
            res[a.group(0)] += 1
    return res

words = ("for", "setup()", "loop()")
result = word_count('hola.txt', words)
print(result)

https://stackoverflow.com/questions/64595645