I'm working on an NLP project using Google (Python) with a text dataset of roughly 100,000 instances. For each instance I extract about 5-10 features, and each run of the code takes around 5-10 minutes. Because I'm experimenting with different kinds of features, I've run the extraction process quite a few times, and the total runtime has really added up.
I suspect this is because my code is inefficient; it currently relies on list comprehensions, map, and iteration. Given the size of the data and the way multiple copies of the text are stored, the code also uses a lot of memory.
So I'd like to know whether there is a better way to do the feature extraction that would speed up the process (and save space). I've heard NumPy has vectorized operations, but I'm not sure how to apply them here.
Here is a basic version of my code:
import nltk
import numpy as np
import pandas as pd
df = pd.DataFrame([["The quick brown fox jumps over the lazy dog.",
"Energy is sustainable if it meets the needs of the present without compromising the ability of future generations to meet their needs."],
["The scientific literature on limiting global warming describes pathways in which the world rapidly phases out coal-fired power plants, produces more electricity from clean sources such as wind and solar, shifts towards using electricity instead of fuels in sectors such as transport and heating buildings, and takes measures to conserve energy.",
"Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s"]], columns=['text1', 'text2'])
def process(text):
    tokens = nltk.word_tokenize(text)
    # Other techniques like stemming and lemmatization
    return tokens
def get_features(text1, text2):
    features = []
    feature1 = len(text1) + len(text2)
    features.append(feature1)
    feature2 = len([word1 for word1 in text1 if word1 in text2])
    features.append(feature2)
    # Continued for about 5-10 features. Some features involve multiple steps like doing named entity recognition and creating features from there
    return features
df.loc[:, 'text1_tokens'] = df.loc[:, 'text1'].apply(process)
df.loc[:, 'text2_tokens'] = df.loc[:, 'text2'].apply(process)
features = df.apply(lambda x: get_features(x['text1_tokens'], x['text2_tokens']), axis='columns')
df.loc[:, 'feature1'] = list(map(lambda x: x[0], features))
df.loc[:, 'feature2'] = list(map(lambda x: x[1], features))
Posted 2021-11-02 20:40:54
The line feature2 = len([word1 for word1 in text1 if word1 in text2]) has a runtime complexity of words_in_text1 * words_in_text2, because the "in" check scans all of text2 for every word of text1. Depending on the size of these texts, you may get a big speedup just by building a set of the words in text2.
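A minimal sketch of that set-based fix (the helper name count_shared_tokens is made up for illustration):

```python
def count_shared_tokens(text1_tokens, text2_tokens):
    # Build the set once: each membership test is then O(1) on average,
    # instead of scanning all of text2_tokens for every word of text1_tokens.
    text2_set = set(text2_tokens)
    return sum(1 for word in text1_tokens if word in text2_set)
```

This returns the same value as the original list comprehension, just without the quadratic scan or the throwaway list.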
You also create a list on that same line only to throw it away. If the order of words in the text never matters for the feature, using collections.Counter or a similar object may improve speed further.
For example:
from collections import Counter
text1_counts = Counter(text1)
text2_counts = Counter(text2)
feature2 = sum(count for word, count in text1_counts.items()
               if word in text2_counts)
If you have more features with similar problems, fixing them will speed up your feature extraction.
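Applied to the asker's get_features, the same idea might look like this (a sketch under the assumption that the inputs are token lists; the remaining features are elided, as in the original):

```python
from collections import Counter

def get_features(text1_tokens, text2_tokens):
    features = []
    # feature1: total number of tokens across both texts
    features.append(len(text1_tokens) + len(text2_tokens))
    # feature2: tokens of text1 that also occur in text2, counted via
    # Counter + hash-based membership instead of a nested linear scan
    text1_counts = Counter(text1_tokens)
    text2_counts = Counter(text2_tokens)
    features.append(sum(count for word, count in text1_counts.items()
                        if word in text2_counts))
    # ... remaining features unchanged
    return features
```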
https://stackoverflow.com/questions/69816635