I am new to Python coding. I think the code could be written in a better, more compact form, and it runs quite slowly because of the way the stop-words are removed.
I want to find the 10 most common words in a column, excluding URL links, special characters, punctuation and the like.
Any criticism or suggestions to improve the efficiency and readability of my code would be greatly appreciated. I would also like to know whether there is any dedicated Python module that makes it easy to get the desired result.
I have a DataFrame df such that:
```python
print(df['text'])
```

```
0         If I smelled the scent of hand sanitizers toda...
1         Hey @Yankees @YankeesPR and @MLB - wouldn't it...
2         @diane3443 @wdunlap @realDonaldTrump Trump nev...
3         @brookbanktv The one gift #COVID19 has give me...
4         25 July : Media Bulletin on Novel #CoronaVirus...
                                ...
179103    Thanks @IamOhmai for nominating me for the @WH...
179104    2020! The year of insanity! Lol! #COVID19 http...
179105    @CTVNews A powerful painting by Juan Lucena. I...
179106    More than 1,200 students test positive for #CO...
179107    I stop when I see a Stop\n\n@SABCNews\n@Izinda...
Name: text, Length: 179108, dtype: object
```

This is how I do it:
```python
import pandas as pd
import nltk
import re
import string
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

stop_words = stopwords.words()

def cleaning(text):
    # converting to lowercase, removing URL links, special characters, punctuations...
    text = text.lower()
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('[’“”…]', '', text)

    # removing the emojies
    # https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # removing the stop-words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if not word in stop_words]
    filtered_sentence = (" ").join(tokens_without_sw)
    text = filtered_sentence

    return text

dt = df['text'].apply(cleaning)

from collections import Counter
p = Counter(" ".join(dt).split()).most_common(10)
rslt = pd.DataFrame(p, columns=['Word', 'Frequency'])
```
```python
print(rslt)
```

```
          Word  Frequency
0      covid19     104546
1        cases      18150
2          new      14585
3  coronavirus      14189
4          amp      12227
5       people       9079
6     pandemic       7944
7           us       7223
8       deaths       7088
9       health       5231
```

An example input/output of the cleaning() function:
```python
inp = 'If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0'
outp = cleaning(inp)
print('Input:\n', inp)
print('Output:\n', outp)
```

```
Input:
 If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0
Output:
 smelled scent hand sanitizers today someone past would think intoxicated
```

Posted on 2021-04-28 18:06:05
Note: the data you are working with is 370k+ rows. Since I ran different versions of the code many times during the review, I limited my version to 1000 rows.
Your code is all over the place: an import, a download, another import, a variable gets loaded, a function definition, that function gets called and, oh, another import. In that order. Do you agree that sorting these would be helpful, so that we can easily find what we are looking for?
The revised file header looks like this:

```python
import re
import string

import nltk
import pandas as pd

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
```

After that, we would usually put the function definitions. However, there is one part of the program that does not have to live inside the function itself: even when processing multiple files, it only needs to be executed once.
```python
# removing the emojies
# https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
EMOJI_PATTERN = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
```

This variable is now in UPPER_CASE because it is a pseudo-constant (Python does not actually have constants, but the naming reminds you and other developers that the variable should only be set once). Pseudo-constants are usually placed between the imports and the function definitions, so you know where to look for them.
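The same compile-once idea applies to the other regexes in cleaning() as well: re.sub() looks up its pattern on every call, and the function runs once per row. A minimal sketch of hoisting those patterns into module-level pseudo-constants (the names URL_PATTERN, HTML_PATTERN and PUNCT_PATTERN are my own, not from the original code):

```python
import re
import string

# Compiled once at import time, reused for every row.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCT_PATTERN = re.compile('[%s]' % re.escape(string.punctuation))
HTML_PATTERN = re.compile('<.*?>+')

def strip_noise(text):
    """Lowercase, then strip URLs, HTML tags and punctuation."""
    text = URL_PATTERN.sub('', text.lower())
    text = HTML_PATTERN.sub('', text)
    return PUNCT_PATTERN.sub('', text)

print(strip_noise('Check https://t.co/abc <b>now</b>!'))  # check  now
```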
The rest of the program is already quite good. You could use more functions, but for a program like this that would mostly be an exercise. I would rename a few variables, remove some lines, use a proper docstring (the comment at the start of cleaning() is already a good start for one) and prepare the program for reuse. After all, it would be nice to simply import the code from this file instead of copying it into the next few projects, wouldn't it? And we do not want the program's specifics to run every time it is imported, so we only run them explicitly when the file is not being imported.
```python
import re
import string

import nltk
import pandas as pd

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = stopwords.words()

# removing the emojies
# https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
EMOJI_PATTERN = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)


def cleaning(text):
    """
    Convert to lowercase.
    Remove URL links, special characters and punctuation.
    Tokenize and remove stop words.
    """
    text = text.lower()
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('[’“”…]', '', text)
    text = EMOJI_PATTERN.sub(r'', text)

    # removing the stop-words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [
        word for word in text_tokens if not word in STOP_WORDS]
    filtered_sentence = (" ").join(tokens_without_sw)
    text = filtered_sentence

    return text


if __name__ == "__main__":
    max_rows = 1000  # 'None' to read whole file
    input_file = 'covid19_tweets.csv'
    df = pd.read_csv(input_file,
                     delimiter=',',
                     nrows=max_rows,
                     engine="python")

    dt = df['text'].apply(cleaning)
    word_count = Counter(" ".join(dt).split()).most_common(10)
    word_frequency = pd.DataFrame(word_count, columns=['Word', 'Frequency'])
    print(word_frequency)
```

Of course, if you want a more memory-efficient version, you can remove all the intermediate variables in the last few lines. That would make the code harder to read, though. As long as you are not reading several large files into memory in the same program, you are fine.
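One thing the revision leaves open is the speed complaint from the question: stopwords.words() returns a list, so every `word in STOP_WORDS` test is a linear scan, repeated for each token of 179k+ tweets. Converting the list to a set makes each membership test a constant-time hash lookup. A minimal sketch of the idea, using a small stand-in list so the snippet is self-contained (with NLTK you would write `STOP_WORDS = set(stopwords.words())`):

```python
# Membership tests in a list scan every element; in a set they hash.
stop_list = ['the', 'of', 'i', 'a']  # stand-in for stopwords.words()
stop_set = set(stop_list)            # one-time conversion, O(1) lookups

tokens = ['the', 'scent', 'of', 'sanitizers']
filtered = [w for w in tokens if w not in stop_set]
print(filtered)  # ['scent', 'sanitizers']
```

The filtering result is identical to the list version; only the lookup cost changes, which matters because the check runs once per token across the whole corpus.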
Some of the advice above comes from PEP 8, the official Python style guide. I strongly recommend taking a look at it.
https://codereview.stackexchange.com/questions/249329