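I am trying to remove all Chinese characters from a CSV whose values contain a mix of Latin and Chinese characters. The data looks like this: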
address lat
1 农工商超市, Zhangjiang, Pudong New District, 203718 31.204024
2 欧尚, 3057号, Jinke Road, Pudong, 201203, China 31.181804
I need it to look like this:
address lat
1 , Zhangjiang, Pudong New District, 203718 31.204024
2 , 3057, Jinke Road, Pudong, 201203, China 31.181804
I tried df.replace(/[^\x00-\x7F]/g, "") and df.replace(/[\u{0080}-\u{FFFF}]/gu,""), but I get an error:
df1.replace([^\x00-\x7F],"");
^
SyntaxError: invalid syntax
Any help is appreciated. Thanks!
Posted on 2018-02-17 23:09:01
You were almost there:
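For context, the patterns in the question are written as JavaScript regex literals (/.../g), which Python cannot parse, hence the SyntaxError; in Python the pattern is passed as a string. As a minimal sketch of what the pattern does, here it is applied to a single string with the standard re module (strip_non_ascii is a hypothetical helper name, not something from pandas):

```python
import re

# Matches one or more consecutive characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def strip_non_ascii(text):
    """Return text with every run of non-ASCII characters removed."""
    return NON_ASCII.sub('', text)


sample = '欧尚, 3057号, Jinke Road, Pudong, 201203, China'
cleaned = strip_non_ascii(sample)
print(cleaned)
```

On the DataFrame column, the same pattern is applied through the .str accessor: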
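# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['address'] = df['address'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
Result: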
In [99]: df
Out[99]:
address lat
0 , Zhangjiang, Pudong New District, 203718 31.204024
1 , 3057, Jinke Road, Pudong, 201203, China 31.181804
Posted on 2018-02-17 23:58:27
Another option is to use filter together with string.printable, similar to this link:
import string
printable = set(string.printable)
df['address'] = df['address'].apply(lambda row: ''.join(filter(lambda x: x in printable, row)))
df
Result:
address lat
1 , Zhangjiang, Pudong New District, 203718 31.204024
2 , 3057, Jinke Road, Pudong, 201203, China 31.181804
Alternatively, use encode and decode with a lambda, similar to this link:
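Note that the two variants in this answer can differ slightly: string.printable contains only printable ASCII characters plus whitespace, so the filter also drops ASCII control characters, while encoding to ASCII with errors='ignore' keeps them. A small sketch on a plain string, assuming hypothetical helper names keep_printable and keep_ascii and a made-up sample that ends in a BEL control character:

```python
import string

PRINTABLE = set(string.printable)


def keep_printable(text):
    """Keep only characters listed in string.printable."""
    return ''.join(ch for ch in text if ch in PRINTABLE)


def keep_ascii(text):
    """Drop everything that cannot be encoded as ASCII."""
    return text.encode('ascii', errors='ignore').decode('ascii')


# Made-up sample: Chinese prefix, Latin text, then the BEL control character.
sample = '农工商超市, Zhangjiang\x07'
via_filter = keep_printable(sample)
via_encode = keep_ascii(sample)
print(repr(via_filter))
print(repr(via_encode))
```

Applied to the address column, the encode/decode variant is: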
df['address'] = df['address'].apply(lambda row: row.encode('ascii',errors='ignore').decode())
Posted on 2018-02-18 06:29:42
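If you want to restrict the character set, a more robust approach is to open the file with the encoding you want and ignore decoding errors: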
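To see what the parser actually receives, here is a minimal sketch that uses only the standard library. It assumes the file is stored as UTF-8, where every byte of a Chinese character is 0x80 or above and is therefore dropped; in an encoding such as GBK, a trailing byte can fall in the ASCII range and would survive as stray text. The file name and sample row are made up, and the stripping applies to every column, not only address.

```python
import csv
import os
import tempfile

# Made-up sample content, written to disk as UTF-8.
content = 'address,lat\n"欧尚, 3057号, Jinke Road",31.181804\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w', encoding='utf-8', newline='') as out:
        out.write(content)

    # Decoding as ASCII with errors='ignore' silently skips every byte
    # that is not valid ASCII, so the Chinese characters never reach the parser.
    with open(path, encoding='ascii', errors='ignore', newline='') as infile:
        parsed = list(csv.reader(infile))

print(parsed)
```

With pandas, a file object opened the same way is passed straight to read_csv: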
import pandas as pd
# Non-ASCII bytes are dropped while decoding, before pandas parses the text
with open('your_csv_file.csv', encoding='ascii', errors='ignore') as infile:
    df = pd.read_csv(infile)
https://stackoverflow.com/questions/48842639