My dataset has 42,000 rows. This is the code I use to clean the text before vectorizing it. The problem is that it contains a nested for loop, which I guess makes it very slow, and I haven't been able to run it on more than 1,500 rows. Can anyone suggest a better way?
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# df is a DataFrame with a 'text' column
filtered = []
corpus = []
for i in range(len(df)):
    rev = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords.words("english"):
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)

Posted on 2021-05-11 12:50:23
I used a line profiler to measure the speed of the code you posted. The measurements are below.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def profile_nltk():
    10         1     435819.0 435819.0      0.3      df = pd.read_csv('IMDB_Dataset.csv')  # (50000, 2)
    11         1          1.0      1.0      0.0      filtered = []
    12         1        247.0    247.0      0.0      reviews = df['review'][:4000]
    13         1          0.0      0.0      0.0      corpus = []
    14      4001     216341.0     54.1      0.1      for i in range(len(reviews)):
    15      4000     221885.0     55.5      0.2          rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
    16      4000       3878.0      1.0      0.0          rev = rev.lower()
    17      4000      30209.0      7.6      0.0          rev = rev.split()
    18      4000       1097.0      0.3      0.0          filtered = []
    19    950808     235589.0      0.2      0.2          for word in rev:
    20    946808  115658060.0    122.2     78.2              if word not in stopwords.words("english"):
    21    486614   30898223.0     63.5     20.9                  word = PorterStemmer().stem(word)
    22    486614     149604.0      0.3      0.1                  filtered.append(word)
    23      4000      11290.0      2.8      0.0          filtered = " ".join(filtered)
    24      4000       1429.0      0.4      0.0          corpus.append(filtered)

As @parsa-abbasi pointed out, the stopword check accounts for about 80% of the total time.
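A line-by-line profile like the one above comes from the line_profiler package: decorate the function with @profile and run kernprof -l -v script.py. If per-function numbers are enough, the standard library's cProfile works without extra installs; a minimal sketch on a simplified stand-in for the loop (stopword filtering only, no pandas or NLTK):

```python
import cProfile
import io
import pstats

# Simplified stand-in for the loop being profiled: lowercase, split,
# and drop stopwords, without the stemming step.
def clean(texts, stopwords_set):
    corpus = []
    for t in texts:
        words = [w for w in t.lower().split() if w not in stopwords_set]
        corpus.append(" ".join(words))
    return corpus

texts = ["The movie was great", "The acting was bad"] * 500

pr = cProfile.Profile()
pr.enable()
corpus = clean(texts, {"the", "was"})
pr.disable()

# Print just the summary line ("N function calls in X seconds")
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
summary = next(line for line in buf.getvalue().splitlines() if "function calls" in line)
print(summary.strip())
```

cProfile only attributes time to whole functions, which is why line_profiler is the better fit for pinpointing a single slow line inside a loop.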
The measurements for the modified script are below. The stopword check now takes about 1/100 of its previous time.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def profile_nltk():
    10         1     441467.0 441467.0      1.4      df = pd.read_csv('IMDB_Dataset.csv')  # (50000, 2)
    11         1          1.0      1.0      0.0      filtered = []
    12         1        335.0    335.0      0.0      reviews = df['review'][:4000]
    13         1          1.0      1.0      0.0      corpus = []
    14         1       2696.0   2696.0      0.0      stopwords_set = stopwords.words('english')
    15      4001      59013.0     14.7      0.2      for i in range(len(reviews)):
    16      4000     186393.0     46.6      0.6          rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
    17      4000       3657.0      0.9      0.0          rev = rev.lower()
    18      4000      27357.0      6.8      0.1          rev = rev.split()
    19      4000        999.0      0.2      0.0          filtered = []
    20    950808     220673.0      0.2      0.7          for word in rev:
    21                                                       # if word not in stopwords.words("english"):
    22    946808    1201271.0      1.3      3.8              if word not in stopwords_set:
    23    486614   29479712.0     60.6     92.8                  word = PorterStemmer().stem(word)
    24    486614     141242.0      0.3      0.4                  filtered.append(word)
    25      4000      10412.0      2.6      0.0          filtered = " ".join(filtered)
    26      4000       1329.0      0.3      0.0          corpus.append(filtered)

I hope this helps.
Posted on 2021-05-01 21:50:00
The most time-consuming part of the code is the stopword check: every iteration of the inner loop calls the library again to build the stopword list. It is better to fetch the stopword collection once and reuse the same object on every iteration.

I rewrote the code as follows (with a few other changes for readability):
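To see why the lookup itself also matters, here is a small stand-alone timing sketch. It uses a stand-in word list so it runs without the NLTK data download; the real list from nltk.corpus.stopwords.words("english") has about 180 entries:

```python
import timeit

# Stand-in stopword list of comparable size to NLTK's English list (~180 entries)
STOPWORDS_LIST = ["the", "a", "an", "and", "or", "but", "of",
                  "to", "in", "on", "is", "was", "were", "it"] * 13
STOPWORDS_SET = set(STOPWORDS_LIST)

words = ["movie", "the", "acting", "was", "great", "plot"] * 2000

def filter_with_list():
    # list membership scans the list for every word: O(len(list)) per test
    return [w for w in words if w not in STOPWORDS_LIST]

def filter_with_set():
    # set membership is a hash lookup: O(1) per test
    return [w for w in words if w not in STOPWORDS_SET]

assert filter_with_list() == filter_with_set()
t_list = timeit.timeit(filter_with_list, number=20)
t_set = timeit.timeit(filter_with_set, number=20)
print(f"list: {t_list:.3f}s  set: {t_set:.3f}s")
```

The set version wins by a wide margin, and on top of that the original code rebuilt the list on every single word, which is the 100x factor seen in the profiles above.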
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

corpus = []
texts = df['text']
stopwords_set = set(stopwords.words("english"))  # a set makes membership tests O(1)
stemmer = PorterStemmer()  # create the stemmer once, not once per word
for i in range(len(texts)):
    rev = re.sub('[^a-zA-Z]', ' ', texts[i])
    rev = rev.lower()
    rev = rev.split()
    filtered = [stemmer.stem(word) for word in rev if word not in stopwords_set]
    filtered = " ".join(filtered)
    corpus.append(filtered)

https://stackoverflow.com/questions/67350459
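After the stopword fix, the second profile shows that stemming takes ~93% of the remaining time. Since reviews repeat many words, memoizing stem results can cut that down further. A minimal sketch of the pattern with functools.lru_cache, using a trivial suffix-stripping stand-in so it runs without NLTK; in the real pipeline you would wrap stemmer.stem the same way:

```python
from functools import lru_cache

# Trivial stand-in for PorterStemmer().stem so the sketch runs without
# NLTK; it only strips a few common suffixes.
def naive_stem(word: str) -> str:
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

@lru_cache(maxsize=None)
def cached_stem(word: str) -> str:
    # each distinct word is stemmed at most once; repeats hit the cache
    return naive_stem(word)

words = ["running", "runs", "running", "jumped", "runs"]
stems = [cached_stem(w) for w in words]
print(stems)
print(cached_stem.cache_info())
```

Because a 1,000-word review typically contains only a few hundred distinct words, the cache turns most stem calls into dictionary lookups.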