I have a multi-index DataFrame that looks like this:
                 target_q_0  target_q_1  target_q_2  target_q_3  target_q_4
sample_nr event
1         0        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          1        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          2        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          3        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          4        0.086743   -1.085944    1.304110   -0.174707   -0.037001
2         0        0.092704   -1.123734    1.368322   -0.206030   -0.006364
          1        0.092376   -1.121655    1.364788   -0.204306   -0.008050
          2        0.092057   -1.119634    1.361355   -0.202632   -0.009688
          3        0.091748   -1.117672    1.358021   -0.201005   -0.011279
3         0        0.092704   -1.123734    1.368322   -0.206030   -0.006364
Each sample can have a different number of events.
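I want to find the longest sample, i.e. the sample with the most events, and pad all other samples with zeros to that length.
The expected result would be: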
                 target_q_0  target_q_1  target_q_2  target_q_3  target_q_4
sample_nr event
1         0        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          1        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          2        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          3        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          4        0.086743   -1.085944    1.304110   -0.174707   -0.037001
2         0        0.092704   -1.123734    1.368322   -0.206030   -0.006364
          1        0.092376   -1.121655    1.364788   -0.204306   -0.008050
          2        0.092057   -1.119634    1.361355   -0.202632   -0.009688
          3        0.091748   -1.117672    1.358021   -0.201005   -0.011279
          4               0           0           0           0           0
3         0        0.092704   -1.123734    1.368322   -0.206030   -0.006364
          1               0           0           0           0           0
          2               0           0           0           0           0
          3               0           0           0           0           0
          4               0           0           0           0           0
I have a working method to do this, but it is slow.
def pad_df(df):
    max_rows = df.index.get_level_values(1).max()
    for sample, new_df in df.groupby(level=0):
        new_df = (new_df.unstack(level=0).reindex(list(range(max_rows)),
                                                  fill_value=0))
        new_df = new_df.stack('sample_nr').swaplevel(0, 1).sort_index()
        df.loc[experiment_data.index.get_level_values(0) == sample] = new_df
This function is called with my full dataframe, experiment_data, as input:
experiment_data = load_some_stuff()
pad_df(experiment_data)
Posted on 2020-11-30 19:21:24
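If no magic method can be found, a fair strategy may be to preallocate the desired array and fill it with a for loop. This is usually much faster than operating directly on the DataFrame.
In your case, the MultiIndex required for the answer array can be generated with pd.MultiIndex.from_product(), because every level has a fixed length.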
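To make the index idea concrete, here is a minimal sketch of what pd.MultiIndex.from_product() builds. The values sample_nrs and max_size are hypothetical stand-ins chosen to mirror the example data (three samples, longest sample of five events); they are not taken from the answer's code.

```python
import pandas as pd

# Hypothetical stand-ins mirroring the example data:
# three samples, and the longest sample has five events.
sample_nrs = [1, 2, 3]
max_size = 5

# Cartesian product of the two levels: every sample_nr is paired with
# every event number 0..max_size-1, so each sample gets the same length.
full_index = pd.MultiIndex.from_product(
    [sample_nrs, range(max_size)],
    names=["sample_nr", "event"],
)

print(len(full_index))           # 15 rows: 3 samples * 5 events
print(full_index[:6].tolist())   # [(1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (2, 0)]
```

Because the product has a fixed number of entries per sample, row i * max_size is always the first row of the i-th sample, which is what makes filling a preallocated array by position possible.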
Data
import pandas as pd
from pandas import DataFrame
import io
import numpy as np
df = pd.read_csv(io.StringIO("""
sample_nr  event  target_q_0  target_q_1  target_q_2  target_q_3  target_q_4
1  0  0.086743  -1.085944  1.304110  -0.174707  -0.037001
1  1  0.086743  -1.085944  1.304110  -0.174707  -0.037001
1  2  0.086743  -1.085944  1.304110  -0.174707  -0.037001
1  3  0.086743  -1.085944  1.304110  -0.174707  -0.037001
1  4  0.086743  -1.085944  1.304110  -0.174707  -0.037001
2  0  0.092704  -1.123734  1.368322  -0.206030  -0.006364
2  1  0.092376  -1.121655  1.364788  -0.204306  -0.008050
2  2  0.092057  -1.119634  1.361355  -0.202632  -0.009688
2  3  0.091748  -1.117672  1.358021  -0.201005  -0.011279
3  0  0.092704  -1.123734  1.368322  -0.206030  -0.006364
"""), sep=r"\s{2,}", engine="python", index_col=["sample_nr", "event"])
Code
# 1. compute the number of events for each sample_nr
sr_sizes = df.groupby(df.index.get_level_values(0)).size()
# size of the longest sample and number of distinct sample_nr values
max_size = sr_sizes.max()
n_sample_nrs = len(sr_sizes)
# 2. preallocate the zero-filled output array and fill it sample by sample
arr = np.zeros((max_size * n_sample_nrs, df.shape[1]))
idx_lv0 = df.index.get_level_values(0)  # sample_nr of every row
for i in range(n_sample_nrs):
    row = i * max_size
    arr[row:row + sr_sizes.iloc[i], :] = \
        df[idx_lv0 == sr_sizes.index[i]].values
# 3. convert to dataframe
df_ans = pd.DataFrame(
    data=arr,
    index=pd.MultiIndex.from_product([sr_sizes.index, range(max_size)]),
    columns=df.columns
).rename_axis(df.index.names, axis=0)
Result
print(df_ans)
                 target_q_0  target_q_1  target_q_2  target_q_3  target_q_4
sample_nr event
1         0        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          1        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          2        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          3        0.086743   -1.085944    1.304110   -0.174707   -0.037001
          4        0.086743   -1.085944    1.304110   -0.174707   -0.037001
2         0        0.092704   -1.123734    1.368322   -0.206030   -0.006364
          1        0.092376   -1.121655    1.364788   -0.204306   -0.008050
          2        0.092057   -1.119634    1.361355   -0.202632   -0.009688
          3        0.091748   -1.117672    1.358021   -0.201005   -0.011279
          4        0.000000    0.000000    0.000000    0.000000    0.000000
3         0        0.092704   -1.123734    1.368322   -0.206030   -0.006364
          1        0.000000    0.000000    0.000000    0.000000    0.000000
          2        0.000000    0.000000    0.000000    0.000000    0.000000
          3        0.000000    0.000000    0.000000    0.000000    0.000000
          4        0.000000    0.000000    0.000000    0.000000    0.000000
https://stackoverflow.com/questions/65078739