我有一个dataframe,我需要根据少数列中相同的行值应用条件,并更新其他列。
输入Dataframe初始化:
df = pd.DataFrame({
'exp':['1y','1y','1y','1y','1y','1y','1y','1y','1y',],
'mat':['1y','1y','1y','2y','2y','2y','1y','1y','1y',],
'reg':['in', 'in', 'in', 'in', 'in', 'in','in','in','in'],
'con':['5w','5s','5c','5w','5s','5c','10w','10s','10c'],
'pcon':['w','s','c','w','s','c','w','s','c'],
'val':[2.5,0,-2.5,2.5,0,-2.5,5,0,-5],
'rs':[6, 10, 4, 12, 30, 6,30,25,10]
})
df输入DataFrame:
exp mat reg con pcon val rs
0 1y 1y in 5w w 2.5 6
1 1y 1y in 5s s 0.0 10
2 1y 1y in 5c c -2.5 4
3 1y 2y in 5w w 2.5 12
4 1y 2y in 5s s 0.0 30
5 1y 2y in 5c c -2.5 6
6 1y 1y in 10w w 5.0 30
7 1y 1y in 10s s 0.0 25
8 1y 1y in 10c c -5.0 10预期输出DataFrame
exp mat reg con pcon val rs
0 1y 1y in 5w w 2.5 5 # (6+4)/2 from input
1 1y 1y in 5s s 0.0 10 # Same as input
2 1y 1y in 5c c -2.5 1 # (6-4)/2 from input
3 1y 2y in 5w w 2.5 9
4 1y 2y in 5s s 0.0 30
5 1y 2y in 5c c -2.5 3
6 1y 1y in 10w w 5.0 20
7 1y 1y in 10s s 0.0 25
8 1y 1y in 10c c -5.0 10多个条件基于行和列:
列中相同值的
就像所有3行都有exp=1y、mat=1y、reg=in一样。
a.基于con更新列'rs‘。
如果'con‘== 5w,则rs = (current_rs + rs(con == 5c )/2 i.e (6+4)/2 if 'con’== 5s,则rs = (rs(con == 5w) -current_rs)/2 i.e (6-4)/2 (在上面的示例中,对于10w、10s、10c的组,数据集i的值为1000000,100000s,100000c)。
我的数据集相当大,还有许多其他区域,还有更多的'exp‘、'mat’和'reg‘列的组合。
当我尝试下面的代码时,如果我将数据解压,使'pcon‘成为列,我就会得到以下错误
"ValueError: Index contains duplicate entries, cannot reshape" error代码:
tmp = df.set_index(['exp','mat','reg', 'pcon']).unstack() ## Failing here
tmp[('rs','w')],tmp[('rs','c')] = ((tmp[('rs','w')] + tmp[('rs','c')])/2,
(tmp[('rs','w')] - tmp[('rs','c')])/2)
tmp = tmp.stack().reset_index()如果我打开数据堆栈,使'con‘成为列,我将得到以下错误
KeyError: ('rs', 'w')代码
tmp = df.set_index(['exp','mat','reg', 'con']).unstack() ## Failing here发布于 2021-03-29 10:37:39
每当我们想要打开数据堆栈时,它都应该有唯一的索引,否则解释器就会因为重复的索引而对数据进行解压不明确。在上述情况下,错误'ValueError: Index包含重复条目,无法重新构造‘是因为在set_index()期间索引被复制。
df.set_index(['exp','mat','reg', 'pcon'])为了保持索引的唯一性,我们可以创建一个新的列来保持索引的唯一性。
df['tem'] = df['con'].str.extract('(\d+)')这样就不会有重复的索引,可以对数据进行展开,并在'rs‘列上进行计算。
代码:
df = pd.DataFrame({
'exp':['1y','1y','1y','1y','1y','1y','1y','1y','1y',],
'mat':['1y','1y','1y','2y','2y','2y','1y','1y','1y',],
'reg':['in', 'in', 'in', 'in', 'in', 'in','in','in','in'],
'con':['5w','5s','5c','5w','5s','5c','10w','10s','10c'],
'pcon':['w','s','c','w','s','c','w','s','c'],
'val':[2.5,0,-2.5,2.5,0,-2.5,5,0,-5],
'rs':[6, 10, 4, 12, 30, 6,30,25,10]
})
df['tem'] = df['con'].str.extract('(\d+)')
tmp = df.set_index(['exp','mat','reg', 'tem', 'pcon']).unstack()
tmp[('rs','w')],tmp[('rs','c')] = ((tmp[('rs','w')] + tmp[('rs','c')])/2,
(tmp[('rs','w')] - tmp[('rs','c')])/2)
tmp = tmp.stack().reset_index()
tmp输出DataFrame:
exp mat reg tem pcon con val rs
0 1y 1y in 10 c 10c -5.0 10.0
1 1y 1y in 10 s 10s 0.0 25.0
2 1y 1y in 10 w 10w 5.0 20.0
3 1y 1y in 5 c 5c -2.5 1.0
4 1y 1y in 5 s 5s 0.0 10.0
5 1y 1y in 5 w 5w 2.5 5.0
6 1y 2y in 5 c 5c -2.5 3.0
7 1y 2y in 5 s 5s 0.0 30.0
8 1y 2y in 5 w 5w 2.5 9.0(可根据要求删除tem栏)
https://stackoverflow.com/questions/66783883
复制相似问题