我有一个非常大的CSV文件(million+行),我想对它执行一些操作。问题是,有些行有一些不必要的换行,如下所示:
New York City; Iron Man; no superpowers;
Metropolis; Superman; superpowers;
New York City;
Spider-Man;
superpowers;
Gotham; Batman; no superpowers;
New York City; Doctor Strange; superpowers;因此,该文件有三列(location、superhero、superpowers)。由于蜘蛛侠的条目是错误的,因为它的条目之间有行距,熊猫错误地认为这是三行分开的,NaNs在第二和第三栏。
我的想法是用regex在导入过程中修复这个问题。根据本网站,这个正则表达式正确地匹配所需的行,而不匹配错误的行(即蜘蛛侠)。
(.*[;].*[;].*)
它的逆项(然而,(?!(.*[;].*[;].*))不起作用,因为它不仅不匹配三个错误行,而且是每个正常行的第三个条目。
我的另一种方法是简单地设置列数,然后从整个文件中删除所有的换行符。不过,这也不管用。
superhero_df = pd.read_csv("superheroes.csv", sep=' *; *', skiprows=12, names=["location", "superhero", "superpower"], index_col=False, engine="python")
superhero_df = superhero_df.replace('\r\n','', regex=True)所需的输出应该如下所示:
New York City; Iron Man; no superpowers
Metropolis; Superman; superpowers;
New York City; Spider-Man; superpowers;
Gotham; Batman; no superpowers;
New York City; Doctor Strange; superpowers;发布于 2019-05-14 07:44:53
那又如何呢?
^([^;]+);[\r\n]*([^;]+);[\r\n]*([^;]+);并以下列案文取代:
\1;\2;\3;import re
regex = r"^([^;]+);[\r\n]*([^;]+);[\r\n]*([^;]+);"
test_str = ("New York City; Iron Man; no superpowers;\n"
"Metropolis; Superman; superpowers;\n"
"New York City;\n"
"Spider-Man;\n"
"superpowers;\n"
"Gotham; Batman; no superpowers;\n"
"New York City; Doctor Strange; superpowers;\n\n")
subst = "\\1;\\2;\\3;"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.DOTALL)
if result:
print (result)发布于 2019-05-14 07:45:22
下面的正则表达式在每三个字段之后消除不必要的换行和其他空白。它假定字段没有任何内部分号:
print(re.sub(r'([^;]*);\s*([^;]*);\s*([^;]*);\s+', r'\1;\2;\3\n',
line, flags=re.M))
#New York City; Iron Man;no superpowers
#Metropolis;Superman;superpowers
#New York City;Spider-Man;superpowers
#Gotham;Batman;no superpowers
#New York City;Doctor Strange;superpowers您可以在循环中运行它,在使用Pandas之前对文件进行预处理。
发布于 2019-05-14 11:21:10
如果我是您,我会在一个新的文本文件中重写整个数据,对源文件进行简单的迭代,并将结果文件加载到Pandas中,不需要re:
with open('source.txt') as fin, open('target.txt', 'w') as fout:
lc = 0
for line in fin:
lc += line.count(';')
if lc < 3:
fout.write(line[:-1])
else:
fout.write(line)
lc = 0结果:
# New York City; Iron Man; no superpowers;
# Metropolis; Superman; superpowers;
# New York City;Spider-Man;superpowers;
# Gotham; Batman; no superpowers;
# New York City; Doctor Strange; superpowers;读到潘达斯:
pd.read_csv('target.txt', header=None, sep=';', usecols=range(3))
# 0 1 2
# 0 New York City Iron Man no superpowers
# 1 Metropolis Superman superpowers
# 2 New York City Spider-Man superpowers
# 3 Gotham Batman no superpowers
# 4 New York City Doctor Strange superpowers注意:usecols只是因为尾随分号而需要的。可以通过使用
with open('source.txt') as fin, open('target.txt', 'w') as fout:
lc = 0
for line in fin:
lc += line.count(';')
if lc < 3:
fout.write(line.strip())
else:
fout.write(line.strip()[:-1] + '\n')
lc = 0读到潘达斯:
pd.read_csv('target.txt', header=None, sep=';')
# 0 1 2
# 0 New York City Iron Man no superpowers
# 1 Metropolis Superman superpowers
# 2 New York City Spider-Man superpowers
# 3 Gotham Batman no superpowers
# 4 New York City Doctor Strange superpowershttps://stackoverflow.com/questions/56125109
复制相似问题