I have a file.txt whose contents look like this:

```
5th => fifth
av, ave, avn => avenue
91st => ninety first
ny => new york
91st => ninety first
nrth => north
91st => ninety first
nrth => northwest
```

The file has about 1,500 lines. The same word can appear more than once, mapped to different replacements. I don't care which replacement we pick, as long as we pick it consistently.
My dataframe has a column containing strings. For each string in that column, the goal is to transform the string using the information in the file above. For example:

"5th with 91st av nrth, seattle, ny,united states" becomes
"fifth with ninety first avenue north, seattle, new york,united states"

Here is one way to create an MWE dataframe:

```python
import pandas as pd

size = 60000
df = pd.DataFrame({
    "address": [f"row: {i} 5th with 91st av nrth, seattle, ny,united states" for i in range(size)],
    "index": [i for i in range(size)],
})
```

I have tried two solutions.
The first one:

```python
import re
import pandas as pd

def string_substitution(df: pd.Series):
    with open('file.txt') as f:
        file_substitutions = f.read().splitlines()
    word_regex = re.compile(r'\w+')
    string_list = []
    for row in df.index:
        string = df.loc[row]
        words = [match.group() for match in word_regex.finditer(string)]
        substitution_set = set()
        # look up each word in the txt file
        for word in words:
            df_regex = re.compile(r'\b' + word + r"\b")
            substitution_regex = re.compile(r"(?==>(.*))")
            for line in file_substitutions:
                if df_regex.search(line) is not None:
                    substitution_string = substitution_regex.findall(line)
                    if substitution_string != []:
                        substitution_string = substitution_string[0]
                    else:
                        # line from file_substitutions is a comment,
                        # so we break
                        break
                    substitution_string = substitution_string.lstrip()
                    substitution_set.add((word, substitution_string))
                    # with this break we stop on the first match
                    break
        for word, substitution in substitution_set:
            df_regex = re.compile(r'\b' + word + r"\b")
            string = re.sub(df_regex, repl=substitution, string=string)
        string_list.append(string)
    return string_list
```

This function is called as `df["address"] = string_substitution(df["address"])`.
For a 60,000-row dataframe, this takes more than 1 minute.
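Most of that minute goes into rescanning all ~1,500 file lines for every word of every row. A cheaper structure (a sketch, not the code above; `build_mapping` is a hypothetical helper) parses file.txt into a dict once up front, which also makes the "pick one replacement consistently" rule explicit:

```python
import re

def build_mapping(lines):
    """Parse 'a, b => repl' lines into {word: repl}; the first mapping wins."""
    mapping = {}
    for line in lines:
        if "=>" not in line:  # skip comments and blank lines
            continue
        left, right = line.split("=>", 1)
        for word in re.split(r",\s*", left.strip()):
            # setdefault keeps the first replacement seen for a duplicate word
            mapping.setdefault(word, right.strip())
    return mapping

# sample lines from the question's file.txt
lines = ["5th => fifth", "av, ave, avn => avenue", "nrth => north", "nrth => northwest"]
print(build_mapping(lines))
# {'5th': 'fifth', 'av': 'avenue', 'ave': 'avenue', 'avn': 'avenue', 'nrth': 'north'}
```

With the dict built once, each word lookup becomes O(1) instead of a scan over the whole file.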
In my second solution, I tried splitting the dataframe into smaller buckets and passing them to string_substitution like this:

```python
from concurrent.futures import ThreadPoolExecutor

with open('file.txt') as f:
    file_substitutions = f.read().splitlines()  # we only open the file once
buckets = [
    df.iloc[
        pos:pos + self.bucket_size,
        df.columns.get_loc(self.target_column)
    ] for pos in range(0, df.shape[0], self.bucket_size)
]
df_list = []
with ThreadPoolExecutor() as pool:
    for results in pool.map(self._synonym_substitution_3, buckets):
        df_list.append(results)
df[self.target_column] = pd.concat(df_list, ignore_index=True)
```

This was even worse...
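A likely reason the threaded version got slower rather than faster: the substitution loop is CPU-bound pure Python, and CPython's GIL lets only one thread execute Python bytecode at a time, so `ThreadPoolExecutor` adds chunking and scheduling overhead without real parallelism. The bucketing itself works as intended, as this small standalone sketch (with an assumed `bucket_size` of 4 standing in for `self.bucket_size`) shows:

```python
import pandas as pd

df = pd.DataFrame({"address": [f"addr {i}" for i in range(10)]})
bucket_size = 4  # stand-in for the poster's self.bucket_size

# slice the target column into consecutive chunks of bucket_size rows
buckets = [df.iloc[pos:pos + bucket_size, df.columns.get_loc("address")]
           for pos in range(0, df.shape[0], bucket_size)]

print([len(b) for b in buckets])                                  # [4, 4, 2]
print(pd.concat(buckets, ignore_index=True).equals(df["address"]))  # True
```

So the split-and-concat round-trips correctly; the bottleneck is the per-row work, not the slicing.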
My goal is a solution that runs on the example dataframe in a few seconds (under 10 seconds if possible), instead of over a minute as it does now.
Posted on 2022-09-06 14:33:33
Here is a regex solution that runs in about 800 ms on the 60k rows with the 7 replacement values:

```python
import re
import pandas as pd

words = pd.read_csv('file.txt', sep=r'\s*=>\s*',
                    engine='python', names=['word', 'repl'])
mapper = (words
          .assign(word=words['word'].str.split(r',\s*'))
          .explode('word')
          .drop_duplicates('word')
          .set_index('word')['repl']
          )

regex = '|'.join(map(re.escape, mapper.index))
# '5th|av|ave|avn|91st|ny|nrth'
df['address'] = df['address'].str.replace(regex, lambda m: mapper.get(m.group()), regex=True)
```

Output:
```
                                             address  index
0  row: 0 fifth with ninety first avenue north, ...      0
1  row: 1 fifth with ninety first avenue north, ...      1
2  row: 2 fifth with ninety first avenue north, ...      2
3  row: 3 fifth with ninety first avenue north, ...      3
...
```

Source: https://stackoverflow.com/questions/73623625
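The core trick in the answer, building one alternation regex and doing a single replacement pass with a dict lookup, also works standalone. A sketch using the question's mappings, hand-deduplicated (first occurrence wins):

```python
import re

# deduplicated mappings from the question's file.txt
mapper = {"5th": "fifth", "av": "avenue", "ave": "avenue", "avn": "avenue",
          "91st": "ninety first", "ny": "new york", "nrth": "north"}

# one alternation of all known tokens; re.escape keeps any special characters literal
regex = re.compile('|'.join(map(re.escape, mapper)))

s = "5th with 91st av nrth, seattle, ny,united states"
print(regex.sub(lambda m: mapper[m.group()], s))
# fifth with ninety first avenue north, seattle, new york,united states
```

One caveat: without word boundaries, a short entry like `av` can match inside a longer word, and since regex alternation is leftmost-first, `av` shadows `ave` and `avn` in the pattern. Wrapping the alternation as `r'\b(?:' + ... + r')\b'` would make matching word-exact if the data needs it.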