我在csv文件中有以下数据:
from StringIO import StringIO
import pandas as pd
the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//PurpleCar': [219L], '//YellowCar': [381L], '//BlueCar': [90L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//PurpleCar': [817L], '//YellowCar': [21L], '//BlueCar': [31L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//PurpleCar': [80L], '//YellowCar': [2011L], '//BlueCar': [8888L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//PurpleCar': [32L], '//YellowCar': [15L], '//BlueCar': [4L], '//WhiteCar-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-17 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-18 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-19 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-20 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
"""我将该文件读取到Pandas数据框架中,如下所示:
df = pd.read_csv(StringIO(the_data), sep=',')然后,我添加几个列标题,如下所示:
df.columns = ['Company',
'Date',
'Volume',
'Car1',
'Car2',
'Car3',
'Car4']我看到这些数据如下:
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]但是,我希望看到没有的数据,如下所示:
( a)字典开头的花括号("{")和结尾的花括号("}")
( b)数值后的"L“
( c)数值周围的方括号("["和"]")
d)键周围的撇号
理想情况下,数据将按以下方式转换:
ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0我试过这个:
df['Car1'] = df['Car1'].str.strip(['{', '}', '[', 'L]'])但是,这不管用。它会导致'Car1‘列变成NaN值。
是否有可能将数据帧转换成数据帧的每一行如下所示?
ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0谢谢!
更新
使用以下正则表达式:
df['Car1'] = df['Car1'].str.replace(r'\D+', '').astype('int')这方面的结果:
ABC,2016-6-9 0:00,95, 115 , //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0我们丢失了'//PurpleCar‘,剩下的数值只有115。这是一个很好的开始,但如果我们也能看到'//PurpleCar‘键的话,那就太好了。
有什么想法吗?
更新2:
基于piRSquared和HYRY的评论,我的目标是能够绘制数值结果。因此,我希望数据框架看起来如下:
Company Date PurpleCar YellowCar BlueCar WhiteCar
0 ABC 2016-6-9 0:00 115 403 16 0
1 ABC 2016-6-10 0:00 219 381 90 0
2 ABC 2016-6-11 0:00 817 21 31 0
3 ABC 2016-6-12 0:00 80 2011 8888 0
4 ABC 2016-6-13 0:00 32 15 4 0
5 DEF 2016-6-16 0:00 32 15 4 0
6 DEF 2016-6-17 0:00 32 15 4 0
7 DEF 2016-6-18 0:00 32 15 4 0
8 DEF 2016-6-19 0:00 32 15 4 0
9 DEF 2016-6-20 0:00 32 15 4 0*更新3:*
最初公布的数据有一个小错误。以下是数据:
the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""这些数据与原始数据之间的区别是,在开始大括号( (") )之前的撇号(("))和在结束大括号("}")之后的撇号。
发布于 2016-10-08 07:46:12
这应该能起作用
s = pd.read_csv(StringIO(the_data), sep='|', header=None, squeeze=True)
left = s.str.split(',').str[:3].apply(pd.Series)
left.columns = ['Company', 'Date', 'Volume']
right = s.str.split(',').str[3:].str.join(',') \
.str.replace(r'[\[\]\{\}\']', '') \
.str.replace(r'(:\s+\d+)L', r'\1') \
.str.split(',', expand=True)
right.columns = ['Car{}'.format(i) for i in range(1, 5)]
pd.concat([left, right], axis=1)

发布于 2016-10-08 11:30:58
我认为最好把字符串分成两列:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(the_data), sep=',', header=None)
df.columns = ['Company','Date','Volume','Car1','Car2','Car3','Car4']
cars = ["Car1", "Car2", "Car3", "Car4"]
pattern = r"//(?P<color>.+?)':.*?(?P<value>\d+)"
df2 = pd.concat([df[col].str
.extract(pattern)
.assign(value=lambda self: pd.to_numeric(self["value"]))
for col in cars],
axis=1, keys=cars)结果:
Car1 Car2 Car3 Car4
color value color value color value color value
0 PurpleCar 115 YellowCar 403 BlueCar 16 WhiteCar-XYZ 0
1 PurpleCar 219 YellowCar 381 BlueCar 90 WhiteCar-XYZ 0
2 PurpleCar 817 YellowCar 21 BlueCar 31 WhiteCar-XYZ 0
3 PurpleCar 80 YellowCar 2011 BlueCar 8888 WhiteCar-XYZ 0
4 PurpleCar 32 YellowCar 15 BlueCar 4 WhiteCar-XYZ 0
5 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
6 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
7 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
8 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
9 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0https://stackoverflow.com/questions/39928273
复制相似问题