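import glob
import os

import pandas as pd
from bs4 import BeautifulSoup

# folder_path, dictionary (the search terms) and punctuations are defined earlier (not shown).
# Assumed from the output below: these accumulators are initialised once, before the loop.
no_punct = ''
wordlist = []
countlist = []

for filename in glob.glob(os.path.join(folder_path, '*.html')):
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    print(filename)
    # print(text)
    patent = BeautifulSoup(text, 'html.parser')
    cleantext = patent.get_text()
    clean_lower = cleantext.lower()
    # Strip punctuation; no_punct keeps growing from one file to the next
    for char in clean_lower:
        if char not in punctuations:
            no_punct = no_punct + char
    # Record every dictionary term found in the accumulated text
    for word in dictionary:
        if word in no_punct:
            wordlist.append(word)
            countlist.append(no_punct.count(word))

print(wordlist, countlist)
df = pd.DataFrame({'word': wordlist, 'count': countlist})
df = df.set_index('word')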
print(df)

Output:

['steam', 'heating', 'horizontal well', 'electromagnetic', 'single well', 'steam', 'foam', 'heating', 'horizontal well', 'solvent', 'hexane', 'electromagnetic', 'steam foam', 'surfactant', 'single well', 'miscible'] [84, 9, 4, 2, 1, 89, 2, 10, 4, 5, 7, 2, 1, 106, 1, 1]

                 count
word
steam               84
heating              9
horizontal well      4
electromagnetic      2
single well          1
steam               89
foam                 2
heating             10
horizontal well      4
solvent              5
hexane               7
electromagnetic      2
steam foam           1
surfactant         106
single well          1
miscible             1

I am not getting unique output. Can someone tell me where my loop goes wrong? The count for steam should be 89, but I want it printed only once.
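One way to get a single entry per term is to keep the totals in a mapping keyed by the term instead of appending to two parallel lists. The following is a minimal sketch, not the asker's code: `count_terms`, `docs`, and `terms` are hypothetical names, and `string.punctuation` is assumed to stand in for the `punctuations` variable, whose definition is not shown.

```python
import string
from collections import Counter


def count_terms(texts, terms):
    """Return the total occurrences of each term across all texts (one entry per term)."""
    totals = Counter()
    # Translation table that deletes ASCII punctuation (assumed equivalent of `punctuations`)
    strip_punct = str.maketrans('', '', string.punctuation)
    for text in texts:
        # A fresh cleaned string per document, so earlier documents are not recounted
        cleaned = text.lower().translate(strip_punct)
        for term in terms:
            n = cleaned.count(term)  # substring count, as in the original loop
            if n:
                totals[term] += n
    return totals


# Hypothetical stand-ins for the text extracted from two HTML files
docs = ["Steam heating. Steam foam!", "Steam, solvent and hexane."]
terms = ["steam", "heating", "foam", "solvent", "hexane", "steam foam"]

totals = count_terms(docs, terms)
print(totals)
```

Because each document is cleaned into its own string, a term's total is the sum of its per-file counts, and the `Counter` can be passed to `pd.Series(totals)` if a pandas object is needed.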
Posted on 2020-03-24 01:28:25
Use df.drop_duplicates() to remove the duplicate rows.
Posted on 2020-03-24 01:38:23
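df = pd.DataFrame({'word': wordlist, 'count': countlist})
df.columns = ['technology word', 'count']
df = df.set_index('technology word')
# drop_duplicates() returns a new DataFrame; the result is not assigned here, so df is unchanged
df.drop_duplicates()
print(df)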
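
                 count
word
steam               84
heating              9
horizontal well      4
electromagnetic      2
single well          1
steam               89
foam                 2
heating             10
horizontal well      4
solvent              5
hexane               7
electromagnetic      2
steam foam           1
surfactant         106
single well          1
miscible             1

The loop updates the counts correctly, but I only need the final word and count printed. I tried your suggestion, but it does not work; I do not want it to drop the row for steam with the highest count, 89.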
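If the goal is to keep only the final (running-total) row for each word, `drop_duplicates` needs to be told which column identifies a duplicate and which occurrence to keep, and its result has to be assigned. A small sketch on a subset of the rows above, assuming `word` is still an ordinary column rather than the index:

```python
import pandas as pd

# A few rows from the output above; 'word' is a regular column here
df = pd.DataFrame({
    'word': ['steam', 'heating', 'steam', 'heating', 'solvent'],
    'count': [84, 9, 89, 10, 5],
})

# Treat rows with the same word as duplicates and keep the last one seen,
# then assign the result (drop_duplicates does not modify df in place by default)
latest = df.drop_duplicates(subset='word', keep='last').set_index('word')
print(latest)
```

If the largest count per word is wanted rather than the last one, `df.groupby('word')['count'].max()` is an alternative.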
Posted on 2020-03-24 02:20:55
```
                 count
word
steam               84
heating              9
horizontal well      4
electromagnetic      2
single well          1
steam                5
foam                 2
heating              1
solvent              5
hexane               7
steam foam           1
surfactant         106
miscible             1
```
How can I add up the counts for repeated words in this DataFrame, for example steam = 84 + 5 = 89? How can I make the code count each word only once and sum its occurrences into a single row?

https://stackoverflow.com/questions/60818551
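Summing per-file counts into one row per word, as asked above, is a grouping operation. The sketch below uses a few of the rows shown above and assumes `word` is a regular column; it is one possible approach rather than the only one.

```python
import pandas as pd

# Per-file counts with repeated words, taken from the rows above
df = pd.DataFrame({
    'word': ['steam', 'heating', 'steam', 'foam', 'heating'],
    'count': [84, 9, 5, 2, 1],
})

# Group rows by word and add up their counts; sort=False keeps first-seen order
totals = df.groupby('word', sort=False)['count'].sum()
print(totals)
```

When `word` has already been made the index with `set_index('word')`, grouping on the index with `df.groupby(level=0)['count'].sum()` gives the same totals.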