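I have read the existing posts on how to remove non-ASCII characters from a string in Python. My problem is that it does not work when I apply it to data that I read from a CSV file. Any idea why?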
import pandas as pd
import numpy as np
import re
import string
import unicodedata
def preprocess(x):
    # Convert to unicode
    text = unicode(x, "utf8")
    # Convert back to ascii
    x = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
    return x
preprocess("Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG")
'Ludwig Maximilian University of Munich / Munchen (LMU) and Siemens AG'
df = pd.DataFrame(["Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"])
df.columns=['text']
df['text'] = df['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
df['text'][0]
'Ludwig Maximilian University of Munich / Munchen (LMU) and Siemens AG'
df1 = pd.read_csv('sample.csv')
df1['text'] = df1['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
df1['text'][0]
'Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG'
Note that df1 looks exactly the same as df when displayed.
Posted on 2018-11-20 21:36:45
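This happens because pandas reads the text from the file as a raw string: escape sequences such as \xc3 arrive as literal backslash characters, not as the bytes they stand for. It is essentially equivalent to:
df = pd.DataFrame({"text": [r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"]})
For the normalization to work, you have to process the escaped string first. Just modify the preprocess function: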
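def preprocess(x):
    # Turn literal escape sequences such as \xc3 into the bytes they denote (Python 2 only)
    decoded = x.decode('string_escape')
    text = unicode(decoded, 'utf8')
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
It should then work: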
>>> df = pd.DataFrame({"text": [r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"]})
>>> df
                                                text
0  Ludwig Maximilian University of Munich / M\xc3...
>>> df['text'] = df['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
>>> df
                                                text
0  Ludwig Maximilian University of Munich / Munch...
https://stackoverflow.com/questions/53401507
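The fix above relies on Python 2 only features (unicode() and the 'string_escape' codec). On Python 3, a rough equivalent might look like the sketch below. The names unescape_utf8 and preprocess_py3 are made up for illustration, and the sketch assumes that only literal \xNN escapes need decoding and that they encode UTF-8 bytes; other escapes such as \n are left untouched.

```python
import re
import unicodedata

# Matches one or more consecutive literal "\xNN" escapes (backslash, x, two hex digits).
_ESCAPED_BYTES = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def _decode_run(match):
    # "\xc3\xbc" -> "c3bc" -> b"\xc3\xbc" -> "ü"
    hex_digits = match.group(0).replace("\\x", "")
    return bytes.fromhex(hex_digits).decode("utf-8", errors="replace")


def unescape_utf8(text):
    # Hypothetical helper: replace each run of literal \xNN escapes
    # with the UTF-8 text those bytes represent.
    return _ESCAPED_BYTES.sub(_decode_run, text)


def preprocess_py3(text):
    # Hypothetical Python 3 counterpart of preprocess():
    # unescape, decompose accented characters, then drop anything non-ASCII.
    decoded = unescape_utf8(text)
    normalized = unicodedata.normalize("NFKD", decoded)
    return normalized.encode("ascii", "ignore").decode("ascii")


sample = r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"
print(preprocess_py3(sample))
# Ludwig Maximilian University of Munich / Munchen (LMU) and Siemens AG
```

It can be applied to the column in the same way as before, for example df1['text'].apply(lambda x: preprocess_py3(x) if pd.notnull(x) else x).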