
Applying string substitutions from a file to a pandas DataFrame

Stack Overflow user
Asked 2022-09-06 14:21:23

I have a file.txt with the following contents:

5th => fifth
av, ave, avn => avenue
91st => ninety first
ny => new york
91st => ninety first
nrth => north
91st => ninety first
nrth => northwest

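I have roughly 1,500 lines. There are duplicates, and the same word can have several different substitutions. I don't care which one is chosen, as long as the choice is consistent.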
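The consistency requirement can be met by always keeping the first substitution seen for each word. Below is a minimal sketch of that idea using only the standard library. The function name `parse_substitutions` and the in-memory `sample` list are hypothetical stand-ins for reading file.txt, and lines without `=>` are assumed to be comments or blanks that can be skipped.

```python
def parse_substitutions(lines):
    """Build {word: replacement}; the first occurrence of a word wins."""
    mapping = {}
    for line in lines:
        if "=>" not in line:
            continue  # assumed to be a blank line or a comment
        left, right = line.split("=>", 1)
        replacement = right.strip()
        for word in left.split(","):
            word = word.strip()
            if word:
                # setdefault keeps an existing entry, so later duplicates are ignored
                mapping.setdefault(word, replacement)
    return mapping


# Hypothetical sample mirroring the file contents shown in the question.
sample = [
    "5th => fifth",
    "av, ave, avn => avenue",
    "91st => ninety first",
    "ny => new york",
    "91st => ninety first",
    "nrth => north",
    "91st => ninety first",
    "nrth => northwest",
]
mapping = parse_substitutions(sample)
print(mapping["nrth"])
```

With this rule, `nrth` always maps to `north` because that line comes first. Keeping the last occurrence instead would be equally consistent.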

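My data frame has a column that contains strings. For each string in that column, the goal is to transform it using the information in the file above. For example: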

"5th with 91st av nrth, seattle, ny,united states"

should be transformed into

"fifth with ninety first avenue north, seattle, new york,united states"

Here is one way to create an MWE for the data frame:

import pandas as pd

size = 60000
df = pd.DataFrame({
    "address": [f"row: {i}  5th with 91st av nrth, seattle, ny,united states" for i in range(size)],
    "index": list(range(size)),
})

I have tried two solutions.

The first:

import re

import pandas as pd


def string_substitution(df: pd.Series):
    # Read the substitution rules once per call.
    with open('file.txt') as f:
        file_substitutions = f.read().splitlines()
    word_regex = re.compile(r'\w+')
    # Captures everything after "=>" on a rule line.
    substitution_regex = re.compile(r"(?==>(.*))")
    string_list = []
    for row in df.index:
        string = df.loc[row]
        words = [match.group() for match in word_regex.finditer(string)]
        substitution_set = set()
        # Look up each word of the row in the lines of the txt file.
        for word in words:
            df_regex = re.compile(r'\b' + word + r'\b')
            for line in file_substitutions:
                if df_regex.search(line) is not None:
                    substitution_string = substitution_regex.findall(line)
                    if substitution_string != []:
                        substitution_string = substitution_string[0]
                    else:
                        # The line from file_substitutions is a comment, so we break.
                        break
                    substitution_string = substitution_string.lstrip()
                    substitution_set.add((word, substitution_string))
                    # With this break we stop on the first match.
                    break
        for word, substitution in substitution_set:
            df_regex = re.compile(r'\b' + word + r'\b')
            string = re.sub(df_regex, repl=substitution, string=string)
        string_list.append(string)
    return string_list

The function is called as: df["address"] = string_substitution(df["address"])

For a 60,000-row data frame, this takes more than 1 minute.
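Much of that time likely goes into compiling a new regex for every word of every row and scanning all the file lines for each of them. A minimal sketch of a cheaper alternative is shown below. It assumes the rules are already loaded into a dictionary (the `MAPPING` name and its contents are hypothetical, taken from the sample file), and it swaps in a different technique: each row is tokenized once with a precompiled `\w+` pattern and every token is looked up in the dictionary.

```python
import re

# Hypothetical in-memory mapping; in practice it would be built from file.txt.
MAPPING = {
    "5th": "fifth",
    "av": "avenue",
    "ave": "avenue",
    "avn": "avenue",
    "91st": "ninety first",
    "ny": "new york",
    "nrth": "north",
}

WORD_RE = re.compile(r"\w+")  # compiled once, not once per word per row


def substitute_tokens(text, mapping=MAPPING):
    """Replace every whole word found in mapping; leave all other words untouched."""
    return WORD_RE.sub(lambda m: mapping.get(m.group(), m.group()), text)


rows = [f"row: {i}  5th with 91st av nrth, seattle, ny,united states" for i in range(3)]
converted = [substitute_tokens(row) for row in rows]
print(converted[0])
```

On a pandas column, the same function could be applied with `df["address"].map(substitute_tokens)`. Because whole tokens are looked up, a key such as `av` cannot match inside a longer word.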

In my second solution, I tried splitting the data frame into smaller subsets and passing them to string_substitution as follows:

from concurrent.futures import ThreadPoolExecutor

# Fragment from inside a class method: self.bucket_size, self.target_column and
# self._synonym_substitution_3 are not shown in the question.
with open('file.txt') as f:
    file_substitutions = f.read().splitlines()  # the file is opened only once
buckets = [
    df.iloc[
        pos:pos + self.bucket_size,
        df.columns.get_loc(self.target_column)
    ] for pos in range(0, df.shape[0], self.bucket_size)
]
df_list = []
with ThreadPoolExecutor() as pool:
    for results in pool.map(self._synonym_substitution_3, buckets):
        df_list.append(results)
df[self.target_column] = pd.concat(df_list, ignore_index=True)

This was even worse...

My goal is a solution that runs in a few seconds on the example data frame (under 10 seconds if possible), rather than the roughly 1 minute it takes now.


1 Answer

Stack Overflow user

Accepted answer

Answered 2022-09-06 14:33:33

Here is a regex solution that runs in about 800 ms for 60k rows and 7 substitution values:

import re

import pandas as pd

# Read the "word(s) => replacement" rules into two columns.
words = pd.read_csv('file.txt', sep=r'\s*=>\s*',
                    engine='python', names=['word', 'repl'])

# One row per word; for duplicated words, the first replacement is kept.
mapper = (words
   .assign(word=words['word'].str.split(r',\s*'))
   .explode('word')
   .drop_duplicates('word')
   .set_index('word')['repl']
)

regex = '|'.join(map(re.escape, mapper.index))
# '5th|av|ave|avn|91st|ny|nrth'

df['address'] = df['address'].str.replace(regex, lambda m: mapper.get(m.group()), regex=True)

Output:

                                             address  index
0  row: 0  fifth with ninety first avenue north, ...      0
1  row: 1  fifth with ninety first avenue north, ...      1
2  row: 2  fifth with ninety first avenue north, ...      2
3  row: 3  fifth with ninety first avenue north, ...      3
...
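One caveat with a plain alternation such as `5th|av|ave|avn|91st|ny|nrth`: it carries no word boundaries, so a short key can match inside a longer word. The sketch below illustrates the difference on a small hypothetical mapping and input string (neither comes from the question), and wraps the alternation in `\b(?:...)\b` as one possible fix. It assumes every key starts and ends with a word character.

```python
import re

# Hypothetical subset of the mapping, for illustration only.
mapping = {"av": "avenue", "ave": "avenue", "ny": "new york"}

alternation = "|".join(map(re.escape, mapping))

plain = re.compile(alternation)                       # av|ave|ny
bounded = re.compile(r"\b(?:" + alternation + r")\b")  # whole words only

text = "any ave, ny"
plain_result = plain.sub(lambda m: mapping[m.group()], text)
bounded_result = bounded.sub(lambda m: mapping[m.group()], text)

print(plain_result)    # anew york avenuee, new york
print(bounded_result)  # any avenue, new york
```

The bounded pattern string can be passed to `str.replace(..., regex=True)` in the same way as the original `regex` variable.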
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/73623625
