
Count unique words in a dataframe

Stack Overflow user
Asked on 2020-03-24 01:22:07
3 answers · 20 views · 0 followers · 1 vote
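import glob
import os

import pandas as pd
from bs4 import BeautifulSoup

# folder_path, punctuations, dictionary, wordlist, countlist and no_punct
# are presumably defined earlier (not shown in the post).
for filename in glob.glob(os.path.join(folder_path, '*.html')):
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    print(filename)
    # print(text)
    patent = BeautifulSoup(text)
    cleantext = patent.get_text()
    clean_lower = cleantext.lower()
    # Strip punctuation character by character.
    for char in clean_lower:
        if char not in punctuations:
            no_punct = no_punct + char
    # Record every dictionary term found in the text, with its count.
    for word in dictionary:
        if word in no_punct:
            wordlist.append(word)
            countlist.append(no_punct.count(word))

print(wordlist, countlist)
df = pd.DataFrame({'word': wordlist, 'count': countlist})
df.columns = ['word', 'count']
df = df.set_index('word')
print(df)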
['steam', 'heating', 'horizontal well', 'electromagnetic', 'single well', 'steam', 'foam', 'heating', 'horizontal well', 'solvent', 'hexane', 'electromagnetic', 'steam foam', 'surfactant', 'single well', 'miscible'] [84, 9, 4, 2, 1, 89, 2, 10, 4, 5, 7, 2, 1, 106, 1, 1]
                 count
word                  
steam               84
heating              9
horizontal well      4
electromagnetic      2
single well          1
steam               89
foam                 2
heating             10
horizontal well      4
solvent              5
hexane               7
electromagnetic      2
steam foam           1
surfactant         106
single well          1
miscible             1

I'm not getting unique output. Can someone tell me where my loop goes wrong? The count for steam should be 89, but I want each word printed only once.

3 Answers

Stack Overflow user

Answered on 2020-03-24 01:28:25

Use df.drop_duplicates() to remove duplicate rows.
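As a rough illustration of this suggestion: with its default arguments, `drop_duplicates()` only drops rows whose column values all match, so rows for the same word with different counts would survive. The sketch below assumes `word` is still an ordinary column and that the last row per word is the one to keep; the sample values are a small subset of the question's output, used here only for illustration.

```python
import pandas as pd

# Sample rows taken from the question's output (illustrative subset).
df = pd.DataFrame({
    "word": ["steam", "heating", "steam", "heating"],
    "count": [84, 9, 89, 10],
})

# Compare rows on the 'word' column only and keep the last occurrence.
# drop_duplicates returns a new DataFrame, so the result is assigned.
deduped = df.drop_duplicates(subset="word", keep="last").set_index("word")
print(deduped)
```

Whether "last" is the right row to keep depends on how the counts were produced, so treat `keep="last"` as an assumption rather than a rule.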

Votes: 0

Stack Overflow user

Answered on 2020-03-24 01:38:23

df = pd.DataFrame({'word':wordlist, 'count':countlist})

df.columns=['technology word','count']
df=df.set_index('technology word')
df.drop_duplicates()
print(df)
                 count
word                  
steam               84
heating              9
horizontal well      4
electromagnetic      2
single well          1
steam               89
foam                 2
heating             10
horizontal well      4
solvent              5
hexane               7
electromagnetic      2
steam foam           1
surfactant         106
single well          1
miscible             1

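The loop updates the counts correctly, but I only need to print each word with its final count. I tried your suggestion, but it doesn't work, and I don't want it to drop the row for steam with the highest count, 89.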
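One possible reason the attempt above has no visible effect: `drop_duplicates()` returns a new DataFrame by default instead of modifying `df`, and it compares column values rather than the index, which is where `word` lives after `set_index`. A minimal sketch of keeping only the last row per index label, assuming the last row carries the final count (sample values are an illustrative subset of the output above):

```python
import pandas as pd

# Illustrative subset of the printed frame, with 'word' as the index.
df = pd.DataFrame(
    {"count": [84, 9, 89, 10, 7]},
    index=pd.Index(["steam", "heating", "steam", "heating", "hexane"], name="word"),
)

# Index.duplicated(keep="last") marks every occurrence except the last one.
# Negating the mask keeps exactly one row per word: the final one.
final = df[~df.index.duplicated(keep="last")]
print(final)
```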

Votes: 0

Stack Overflow user

Answered on 2020-03-24 02:20:55

                 count
word                  
steam               84
heating              9
horizontal well      4
electromagnetic      2
single well          1
steam                5
foam                 2
heating              1
solvent              5
hexane               7
steam foam           1
surfactant         106
miscible             1
How can I add up the counts for repeated words in this data frame, for example steam = 84 + 5 = 89? How can I make the code list each word once and add all of its occurrences to that single entry?
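A minimal sketch of one way to do this in pandas, assuming the per-file counts sit in a `count` column next to a `word` column: group by word and sum. The rows below are an illustrative subset of the frame above.

```python
import pandas as pd

# Illustrative subset of the per-file counts shown above.
df = pd.DataFrame({
    "word": ["steam", "heating", "steam", "heating", "hexane"],
    "count": [84, 9, 5, 1, 7],
})

# One row per word; sort=False keeps the order of first appearance.
totals = df.groupby("word", sort=False)["count"].sum()
print(totals)
```

If `word` has already been made the index with `set_index`, `df.groupby(level=0)["count"].sum()` should give the same totals.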
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/60818551
