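I have a .csv file whose rows have varying numbers of columns.
import pandas as pd
df = pd.read_csv(infile, header=None)
returns the error
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 8
I know I could pass the names=my_cols option in the read_csv call, but surely there is something more "pythonic" than that? Also, this is not a duplicate question, because error_bad_lines=False causes lines to be skipped (which is not wanted). The .csv looks like this: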
Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
Posted on 2017-10-09 11:18:08
OK, inspired by this related question: Pandas variable numbers of columns to binary matrix
So read in the csv, but override the separator to a tab so that it doesn't try to split the names:
In[7]:
import pandas as pd
import io
t="""Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George"""
df = pd.read_csv(io.StringIO(t), sep='\t', header=None)
df
Out[7]:
0
0 Anne,Beth,Caroline,Ernie,Frank,Hannah
1 Beth,Caroline,David,Ernie
2 Caroline,Hannah
3 David,,Anne,Beth,Caroline,Ernie
4 Ernie,Anne,Beth,Frank,George
5 Frank,Anne,Caroline,Hannah
6 George,
7 Hannah,Anne,Beth,Caroline,David,Ernie,Frank,Ge...
Now we can use str.split with expand=True to expand the names into their own columns:
In[8]:
df[0].str.split(',', expand=True)
Out[8]:
          0         1         2         3         4       5      6       7
0      Anne      Beth  Caroline     Ernie     Frank  Hannah   None    None
1      Beth  Caroline     David     Ernie      None    None   None    None
2  Caroline    Hannah      None      None      None    None   None    None
3     David                Anne      Beth  Caroline   Ernie   None    None
4     Ernie      Anne      Beth     Frank    George    None   None    None
5     Frank      Anne  Caroline    Hannah      None    None   None    None
6    George                None      None      None    None   None    None
7    Hannah      Anne      Beth  Caroline     David   Ernie  Frank  George
So, to be clear, modify your read_csv line as follows:
df = pd.read_csv(infile, header=None, sep='\t')
Then perform the str.split as shown above.
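For convenience, the two steps of this answer (read each line as a single field, then split it) could be wrapped in one function. This is only a minimal sketch: the name read_ragged_csv and the small sample data are hypothetical, and it assumes the file contains no tab characters, no quoted fields with embedded commas, and no blank lines that need to be preserved.

```python
import io

import pandas as pd


def read_ragged_csv(source, sep=","):
    """Hypothetical helper: read a file whose rows have differing field counts."""
    # Use a tab as the separator (assumed absent from the data) so that
    # every line is read as one single field in column 0.
    raw = pd.read_csv(source, sep="\t", header=None)
    # Split that single column on the real separator; rows with fewer
    # fields are padded on the right with missing values.
    return raw[0].str.split(sep, expand=True)


# Small made-up sample in the same shape as the question's file.
sample = io.StringIO("Anne,Beth,Caroline\nBeth\nCaroline,Hannah\n")
wide = read_ragged_csv(sample)
print(wide)
```

The same function should accept a file path in place of the StringIO object, since the argument is passed straight to read_csv.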
Posted on 2017-10-09 11:19:41
One can do some manipulation of the csv before handing it to pandas.
import pandas as pd

# load data into a list, one entry per line
with open('new_data.txt', 'r') as fil:
    data = fil.readlines()
# remove line breaks from string entries
data = [x.replace('\r\n', '') for x in data]
data = [x.replace('\n', '') for x in data]
# find the largest number of commas on any line
total_cols = max(x.count(',') for x in data)
# append ',' to each line until every line has the same number of commas
new_data = [x + ',' * (total_cols - x.count(',')) for x in data]
# save data
with open('save_data.txt', 'w') as outp:
    outp.write('\n'.join(new_data))
# read it in as you did.
pd.read_csv('save_data.txt', header=None)
This is some rough Python, but it should work. I'll clean this up when I have time.
Alternatively, use the other answer, which is already clean as it is.
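If writing an intermediate file is not wanted, the same padding idea could probably be done in memory. The following is a rough sketch, not a tested replacement for the code above: pad_rows is a hypothetical helper name, the sample lines are made up, and it assumes a non-empty input with no quoted fields that contain commas (such commas would be miscounted).

```python
import io

import pandas as pd


def pad_rows(lines, sep=","):
    """Hypothetical helper: right-pad each line with separators to a common count."""
    # Drop trailing line breaks so they do not end up in the last field.
    stripped = [line.rstrip("\r\n") for line in lines]
    # The line with the most separators determines the target width.
    most_seps = max(line.count(sep) for line in stripped)
    # Append as many separators as each line is missing.
    return [line + sep * (most_seps - line.count(sep)) for line in stripped]


# Made-up sample lines, as they would come back from readlines().
raw_lines = ["Anne,Beth,Caroline\n", "Beth\n", "Caroline,Hannah\n"]
padded = pad_rows(raw_lines)
# Feed the padded text to pandas through an in-memory buffer.
df_padded = pd.read_csv(io.StringIO("\n".join(padded)), header=None)
print(df_padded)
```

Unlike the str.split approach, read_csv parses the padded text itself here, so the empty fields come back as NaN.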
https://stackoverflow.com/questions/46644603