I wrote a function that does some text processing on data of length 1045459. It seems to take longer than normal to run.
This is what I did:
import re

def clean_descriptions(text_list):
    # First get everything into lowercase
    text_list = str(text_list)
    for x in text_list:
        x = x.lower()
    # Remove all instances of 3 characters followed by a number
    for substr in re.findall(r'\W([A-Z][A-Z][A-Z]\d)\W', text_list):
        text_list = text_list.replace(substr, '')
    text_list = text_list.replace('[]','')
    # Remove NA
    text_list = text_list.replace('NA','')
    return text_list

This is how I use the function:

df['short_description'] = clean_descriptions(df['short_description'].tolist())

Is there a more efficient way to do this?
Here is a sample of short_description:
PRG2 - stelucie needs help with Radio
[VLR44] vlrd-fc-edg-fw-01-00-01:BGP Status - WARNING [DEEP-DIVE]
[LGB3] lgb3-ar-acc-sw172129.amazon.com:PSU Check
[BFI4] Walk Up Ticket - Other
[FC-OOB]-DMO3 is down [DEEP-DIVE]

Posted on 2020-10-08 14:30:48
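If you step outside pandas to do this, you set yourself up for poor performance.
Use pandas' own replace with regex syntax:

import pandas as pd

df = pd.DataFrame({"short": ["Some text " + a + b + c + d + " more text"
                             for a in "A"
                             for b in "DE"
                             for c in "1G"
                             for d in "2z"]})
print(df)

# Assign the result back to the column. The chained inplace=True form,
# df["short"].replace(..., inplace=True), does not update df under copy-on-write.
df["short"] = df["short"].replace(to_replace=r"(?i)(^|\W)([A-Z][A-Z][A-Z]\d)\W",
                                  value="", regex=True)
print(df)

Output: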
                      short
0  Some text AD12 more text
1  Some text AD1z more text
2  Some text ADG2 more text
3  Some text ADGz more text
4  Some text AE12 more text
5  Some text AE1z more text
6  Some text AEG2 more text
7  Some text AEGz more text
                      short
0  Some text AD12 more text
1  Some text AD1z more text
2        Some textmore text
3  Some text ADGz more text
4  Some text AE12 more text
5  Some text AE1z more text
6        Some textmore text
7  Some text AEGz more text

Posted on 2020-10-08 14:21:10
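Try replacing this part of your code:

for substr in re.findall(r'\W([A-Z][A-Z][A-Z]\d)\W', text_list):
    text_list = text_list.replace(substr, '')

Instead, build a set (rather than a list) of all possible substrings, so that it is created only once and not every time the function is called (1045459 times for your data).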
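The "create it only once" idea can be sketched as below. Note that this swaps the precomputed set for a precompiled regular expression, which is a different technique with the same aim: the matching machinery is built a single time and then reused for every string. The names `SITE_CODE` and `clean_description` are hypothetical, the `\b` word boundaries are an assumption standing in for the `\W` delimiters of the original pattern, and the final `.strip()` is an extra step not present in the question.

```python
import re

# Hypothetical name; compiled once at import time rather than on every call.
# \b is assumed as a stand-in for the \W delimiters in the original pattern,
# so a code at the very start or end of a string is matched as well.
SITE_CODE = re.compile(r'\b[A-Z]{3}\d\b')

def clean_description(text):
    """Clean a single description string (hypothetical helper)."""
    text = SITE_CODE.sub('', text)                   # drop codes such as PRG2 or BFI4
    text = text.replace('[]', '').replace('NA', '')  # chained literal replacements
    # Lowercase last so the uppercase-only pattern above can still match;
    # strip() is an addition that trims the whitespace left behind.
    return text.lower().strip()

samples = [
    "PRG2 - stelucie needs help with Radio",
    "[BFI4] Walk Up Ticket - Other",
]
cleaned = [clean_description(s) for s in samples]
```

With a DataFrame, a per-string function like this could presumably be applied with `df['short_description'].map(clean_description)`, although a vectorized pandas string operation is likely to be faster on a million rows.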
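Also, just replace the lowercasing loop (which, by the way, is not correct, because reassigning the loop variable x leaves text_list unchanged) with:

text_list = text_list.lower()

Finally, chain all the replacements into a single statement:

text_list = text_list.replace('[]','').replace('NA','')

https://stackoverflow.com/questions/64264448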