Q: Finding the most common words in a Pandas dataframe

Code Review user

Asked 2020-09-13 21:29:35

1 answer, 14.3K views, 0 followers, 5 votes

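I am new to coding in Python. I think the code could be written in a better, more compact form. It runs quite slowly because of the way the stop words are removed.

I want to find the 10 most common words in a column, excluding URL links, special characters, punctuation and stop words.

Any criticism and suggestions to improve the efficiency and readability of my code would be greatly appreciated. I would also like to know whether there is a dedicated Python module that gets the desired result easily.

I have a dataframe df such that: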

print(df['text'])

0         If I smelled the scent of hand sanitizers toda...
1         Hey @Yankees @YankeesPR and @MLB - wouldn't it...
2         @diane3443 @wdunlap @realDonaldTrump Trump nev...
3         @brookbanktv The one gift #COVID19 has give me...
4         25 July : Media Bulletin on Novel #CoronaVirus...
                                ...                        
179103    Thanks @IamOhmai for nominating me for the @WH...
179104    2020! The year of insanity! Lol! #COVID19 http...
179105    @CTVNews A powerful painting by Juan Lucena. I...
179106    More than 1,200 students test positive for #CO...
179107    I stop when I see a Stop\n\n@SABCNews\n@Izinda...
Name: text, Length: 179108, dtype: object

The way I am doing it is as follows:

import pandas as pd
import nltk
import re
import string
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

stop_words = stopwords.words()

def cleaning(text):        
    # converting to lowercase, removing URL links, special characters, punctuations...
    text = text.lower()
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('[’“”…]', '', text)     

    # removing the emojies               # https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)   
    
    # removing the stop-words          
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if not word in stop_words]
    filtered_sentence = (" ").join(tokens_without_sw)
    text = filtered_sentence
    
    return text

dt = df['text'].apply(cleaning)

from collections import Counter
p = Counter(" ".join(dt).split()).most_common(10)
rslt = pd.DataFrame(p, columns=['Word', 'Frequency'])
print(rslt)

          Word  Frequency
0      covid19     104546
1        cases      18150
2          new      14585
3  coronavirus      14189
4          amp      12227
5       people       9079
6     pandemic       7944
7           us       7223
8       deaths       7088
9       health       5231

A sample input and output of the function cleaning():

inp = 'If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0'
outp = cleaning(inp)
print('Input:\n', inp)
print('Output:\n', outp)

Input:
 If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0
Output:
 smelled scent hand sanitizers today someone past would think intoxicated

python

beginner

python-3.x

pandas

1 Answer

Code Review user

Accepted answer

Posted 2021-04-28 18:06:05

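Note: the data you are going through is 370k+ lines. Since I tend to run different versions of the code many times during a review, I limited my version to 1000 rows.

Your code is all over the place. Imports, downloads, another import, a variable being loaded, a function definition, the function being called and, oh, another import. In that order. Do you agree it would help to sort these, so that we can easily find what we are looking for?

The revised head of the file would look like this: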

import re
import string
import nltk

import pandas as pd

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

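After that, we would usually put the function definitions. However, there is a part of the program that does not have to be inside the function itself. It only needs to be executed once, even when multiple files are processed.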

# removing the emojies
# https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
EMOJI_PATTERN = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)

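This variable is now in UPPER_CASE because it is a pseudo-constant (Python does not really have constants, but the naming reminds you and other developers that the variable should be set only once). It is customary to put pseudo-constants between the imports and the function definitions, so you know where to look for them.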
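The same reasoning applies to the stop-word list, and it touches on the speed concern raised in the question. As a possible further step (not part of the review itself), the list could be converted to a frozenset once at module level: a membership test against a list scans it element by element, whereas a set uses a hash lookup. The sketch below uses a short hypothetical STOP_WORD_LIST as a stand-in for stopwords.words(), and remove_stop_words is an illustrative helper name, not something from the original code.

```python
# Hypothetical stand-in for stopwords.words(); the real NLTK list is far longer.
STOP_WORD_LIST = ["i", "if", "in", "of", "on", "so", "that", "the", "they", "were"]

# Pseudo-constant built once at import time; a frozenset gives hash-based lookups.
STOP_WORDS = frozenset(STOP_WORD_LIST)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    sample = "if i smelled the scent of hand sanitizers today".split()
    print(remove_stop_words(sample))
```

With the real data, this would presumably amount to writing STOP_WORDS = frozenset(stopwords.words()) and leaving the list comprehension in cleaning() as it is.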

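Now, the rest of the program is mostly fine already. You could use more functions, but for a program like this that would mostly be an exercise. I would rename some variables, trim a few lines, use a proper docstring (the comment at the start of the cleaning function was already a good start) and prepare the program for re-use.

After all, it would be nice to simply import the code from this file instead of having to copy it into the next few projects that use it, wouldn't it? And we do not want to run the specifics of this program every time it is imported, so we explicitly run them only when the file is not being imported.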

import re
import string
import nltk

import pandas as pd

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = stopwords.words()

# removing the emojies
# https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
EMOJI_PATTERN = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)


def cleaning(text):
    """
    Convert to lowercase.
    Remove URL links, special characters and punctuation.
    Tokenize and remove stop words.
    """
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('[’“”…]', '', text)

    text = EMOJI_PATTERN.sub(r'', text)

    # removing the stop-words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [
        word for word in text_tokens if word not in STOP_WORDS]
    filtered_sentence = (" ").join(tokens_without_sw)
    text = filtered_sentence

    return text


if __name__ == "__main__":
    max_rows = 1000  # 'None' to read whole file
    input_file = 'covid19_tweets.csv'
    df = pd.read_csv(input_file,
                     delimiter=',',
                     nrows=max_rows,
                     engine="python")

    dt = df['text'].apply(cleaning)

    word_count = Counter(" ".join(dt).split()).most_common(10)
    word_frequency = pd.DataFrame(word_count, columns=['Word', 'Frequency'])
    print(word_frequency)

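Of course, if you want a more memory-efficient version, you could cut out all the intermediate variables in the last few lines. That would make the code a bit harder to read, though. As long as you are not reading multiple large files into memory in the same program, it will be fine.

Some of the suggestions I have given come from PEP8, the official Python style guide. I strongly recommend taking a look at it.

4 votes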
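One way to reduce memory use without giving up readability is sketched below, as an assumption-laden alternative rather than the method of the reviewed code: instead of joining every cleaned row into one large string, a Counter can be updated row by row. The function name most_common_words and its parameters are hypothetical and introduced only for illustration.

```python
from collections import Counter


def most_common_words(texts, n=10):
    """Count whitespace-separated words across an iterable of cleaned strings."""
    counts = Counter()
    for text in texts:
        # Update the running tally row by row instead of joining everything first.
        counts.update(text.split())
    return counts.most_common(n)


if __name__ == "__main__":
    cleaned = ["covid19 cases new", "covid19 cases", "covid19 pandemic"]
    print(most_common_words(cleaned, n=2))
```

Because a pandas Series iterates over its values, the cleaned column dt could be passed in directly, and the returned list of (word, count) pairs fits the same pd.DataFrame(..., columns=['Word', 'Frequency']) call used in the program.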

Original page content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/249329
