Q: Speeding up a text-processing step

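Stack Overflow user

Asked 2020-10-08 14:15:42

2 answers, 54 views, 0 followers, 1 vote

I created a function that does some text processing on data of length 1045459. It seems to take longer than normal to run.

This is how I did it: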

def clean_descriptions(text_list):
    
    # First get everything into lowercase
    text_list = str(text_list)
    for x in text_list:
        x = x.lower()
    
    # Remove all instances of 3 characters followed by a number
    for substr in re.findall(r'\W([A-Z][A-Z][A-Z]\d)\W', text_list):
        text_list = text_list.replace(substr, '')
    text_list = text_list.replace('[]','') 
        
    # Remove NA
    text_list = text_list.replace('NA','')
        
    return text_list

This is how I use the function:

df['short_description'] = clean_descriptions(df['short_description'].tolist())

Is there a more efficient way to do this?

Here is a sample of short_description:

PRG2 - stelucie needs help with Radio
[VLR44] vlrd-fc-edg-fw-01-00-01:BGP Status - WARNING [DEEP-DIVE]
[LGB3] lgb3-ar-acc-sw172129.amazon.com:PSU Check
[BFI4] Walk Up Ticket - Other
[FC-OOB]-DMO3 is down [DEEP-DIVE]

python

regex

pandas

2 Answers

Stack Overflow user

Accepted answer

Posted 2020-10-08 14:30:48

If you leave pandas to do this, you set yourself up for poor performance.

Use pandas' own replace with inplace and regex syntax:

import pandas as pd

df = pd.DataFrame({"short": ["Some text " + a + b + c + d + " more text" 
                             for a in "A"
                             for b in "DE"
                             for c in "1G"
                             for d in "2z"]})
print(df)

df["short"].replace(to_replace=r"(?i)(^|\W)([A-Z][A-Z][A-Z]\d)\W", value="", 
                    regex=True, inplace=True) # In Place - do not reassign else all None
print(df)

Output:

                      short
0  Some text AD12 more text
1  Some text AD1z more text
2  Some text ADG2 more text
3  Some text ADGz more text
4  Some text AE12 more text
5  Some text AE1z more text
6  Some text AEG2 more text
7  Some text AEGz more text

                      short
0  Some text AD12 more text
1  Some text AD1z more text
2        Some textmore text
3  Some text ADGz more text
4  Some text AE12 more text
5  Some text AE1z more text
6        Some textmore text
7  Some text AEGz more text
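
As a rough sketch of how the same idea might cover the whole cleaning step from the question, the block below chains pandas string methods and assigns the result back to the column instead of relying on `inplace=True` (recent pandas versions may warn about, or ignore, in-place changes made through `df["col"]`). The pattern `CODE` is an assumption based on the sample rows, not something given in the answer, and the code removal runs before lowercasing because the pattern depends on uppercase letters.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"short_description": [
    "PRG2 - stelucie needs help with Radio",
    "[VLR44] vlrd-fc-edg-fw-01-00-01:BGP Status - WARNING [DEEP-DIVE]",
    "[LGB3] lgb3-ar-acc-sw172129.amazon.com:PSU Check",
    "[BFI4] Walk Up Ticket - Other",
    "[FC-OOB]-DMO3 is down [DEEP-DIVE]",
]})

# Assumed pattern: a standalone code of three capitals plus one digit,
# optionally wrapped in square brackets.
CODE = r"\[?\b[A-Z]{3}\d\b\]?"

df["short_description"] = (
    df["short_description"]
    .str.replace(CODE, "", regex=True)       # strip the site codes first
    .str.replace(r"\bNA\b", "", regex=True)  # strip a standalone NA
    .str.lower()                             # lowercase only at the end
    .str.strip()                             # drop leftover edge whitespace
)

print(df["short_description"].tolist())
```

Note that `[VLR44]` is left alone by this pattern, since it is three letters followed by two digits; whether that is the desired behaviour depends on the data.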

2 votes

Stack Overflow user

Posted 2020-10-08 14:21:10

Try replacing this part of the code:

for substr in re.findall(r'\W([A-Z][A-Z][A-Z]\d)\W', text_list):
    text_list = text_list.replace(substr, '')

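Create a set (rather than a list) of all the possible substrings, so that it is built only once instead of every time the function is called (1045459 times for your data).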
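
One possible reading of this suggestion, as a minimal sketch: compile the pattern once, collect the distinct codes into a set in a single pass, then build one alternation so each row is scanned only once. The sample list and the names `CODE_RE`, `codes`, and `any_code` are hypothetical; in the question the input would be `df['short_description'].tolist()`.

```python
import re

# Hypothetical sample standing in for df['short_description'].tolist().
descriptions = [
    "[LGB3] lgb3-ar-acc-sw172129.amazon.com:PSU Check",
    "[BFI4] Walk Up Ticket - Other",
    "[BFI4] Badge reader offline",
]

# Compile once, outside any per-row function.
CODE_RE = re.compile(r"\b[A-Z]{3}\d\b")

# One pass over the data collects the distinct codes into a set.
codes = {code for text in descriptions for code in CODE_RE.findall(text)}

# A single alternation built from the set; sorted() keeps it deterministic
# and re.escape guards against regex metacharacters in the codes.
any_code = re.compile(r"\b(?:" + "|".join(map(re.escape, sorted(codes))) + r")\b")

stripped = [any_code.sub("", text) for text in descriptions]
print(codes)
print(stripped)
```

The set removes duplicates, so a code that appears in many rows contributes to the pattern only once.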

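Also, simply replace the lowercasing loop (which, by the way, is not correct: it rebinds x and never changes text_list) with:

text_list = text_list.lower()

Finally, chain all the replacements into a single statement: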

text_list = text_list.replace('[]','').replace('NA','')
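
Putting these suggestions together for a single string might look like the sketch below. The helper `clean_description` and the pattern `CODE_RE` are hypothetical names, not part of the answer. One ordering detail matters: lowercasing is done last, because both the `[A-Z]` pattern and the `'NA'` replacement are case-sensitive and would never match already-lowercased text.

```python
import re

# Assumed pattern for the codes, compiled once at import time.
CODE_RE = re.compile(r"\b[A-Z]{3}\d\b")

def clean_description(text: str) -> str:
    """Clean one description (hypothetical helper, not from the answer)."""
    text = CODE_RE.sub("", text)                     # needs the original uppercase
    text = text.replace("[]", "").replace("NA", "")  # chained, still case-sensitive
    return text.lower().strip()                      # lowercase last

samples = ["[BFI4] Walk Up Ticket - Other", "PRG2 - NA"]
cleaned = [clean_description(s) for s in samples]
print(cleaned)
```

As in the original function, the plain `replace('NA', '')` also removes `NA` inside longer uppercase words, so a word-boundary regex may be a safer choice if that matters for the data.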

1 vote

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/64264448
