Q: Fill in a DataFrame based on previous data

Stack Overflow user

Asked 2021-04-01 07:12:00

Answers: 2, Views: 55, Followers: 0, Votes: 2

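I'm working on a project with a retailer where we want to clean some data for reporting purposes. The retailer has multiple stores, and every week the staff in each store scan different items on different displays (they scan the display first, so we know which display the items belong to). They only scan displays that changed that week; if a display was not scanned, we assume it stayed the same.

Right now we are working with 2 data frames:

Hierarchy data frame example: this table basically holds weeks 1 to 52 for every end cap (display) in every store. Let's assume the company has only 2 stores, each with 3 end caps. Different stores may use different End Cap codes, but I don't think that matters for our purposes.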

    Week  Store End Cap
0      1      1       A
1      1      1       B
2      1      1       C
3      1      2       A
4      1      2       B
5      1      2       D
6      2      1       A
7      2      1       B
8      2      1       C
9      2      2       A
10     2      2       B
11     2      2       D

Next, we have a history file containing the actual changes that were scanned to update the End Caps.

    Week  Store End Cap     UPC
0      1      1       A  123456
1      1      1       B  789456
2      1      1       B  546879
3      1      1       C  423156
4      1      2       A  231567
5      1      2       B  456123
6      1      2       D  689741
7      2      1       A  321654
8      2      1       C  852634
9      2      1       C  979541
10     2      2       A  132645
11     2      2       B  787878
12     2      2       D  615432
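
For anyone who wants to reproduce the problem, the two tables above can be rebuilt roughly as follows. This is only a sketch: the variable names `hierarchy` and `hist` match the ones used in the code further down, and the values are copied from the printed tables.

```python
import pandas as pd

# Hierarchy: every (Week, Store, End Cap) combination that should be reported.
hierarchy = pd.DataFrame({
    "Week":    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Store":   [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "End Cap": ["A", "B", "C", "A", "B", "D", "A", "B", "C", "A", "B", "D"],
})

# History: the scans that were actually recorded.
# An end cap can hold several UPCs in the same week (e.g. Week 1, Store 1, End Cap B).
hist = pd.DataFrame({
    "Week":    [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Store":   [1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "End Cap": ["A", "B", "B", "C", "A", "B", "D", "A", "C", "C", "A", "B", "D"],
    "UPC":     [123456, 789456, 546879, 423156, 231567, 456123, 689741,
                321654, 852634, 979541, 132645, 787878, 615432],
})

print(hierarchy)
print(hist)
```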

To merge the two data frames I use:

merged_df = pd.merge(hierarchy, hist, how='left', on=['Week', 'Store', 'End Cap'])

This gives me:

    Week  Store End Cap       UPC
0      1      1       A  123456.0
1      1      1       B  789456.0
2      1      1       B  546879.0
3      1      1       C  423156.0
4      1      2       A  231567.0
5      1      2       B  456123.0
6      1      2       D  689741.0
7      2      1       A  321654.0
8      2      1       B       NaN
9      2      1       C  852634.0
10     2      1       C  979541.0
11     2      2       A  132645.0
12     2      2       B  787878.0
13     2      2       D  615432.0

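This is fine except for the one instance where it shows NaN. End Cap B in Store 1 did not change in Week 2, so it was not scanned and does not appear in the history data frame. In that case I want to see the latest items scanned for that end cap in that store (see the second and third rows of the history data frame). Technically, those could even have been scanned in week 52 of last year; I just want to fill the NaN with the most recent information to show that nothing changed. How would I go about doing that?

The desired output would look like this: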

    Week  Store End Cap       UPC
0      1      1       A  123456.0
1      1      1       B  789456.0
2      1      1       B  546879.0
3      1      1       C  423156.0
4      1      2       A  231567.0
5      1      2       B  456123.0
6      1      2       D  689741.0
7      2      1       A  321654.0
8      2      1       B  789456.0
9      2      1       B  546879.0
10     2      1       C  852634.0
11     2      1       C  979541.0
12     2      2       A  132645.0
13     2      2       B  787878.0
14     2      2       D  615432.0

Thanks!

Edit: Building on the above, I tried sorting the data and then forward filling, which only partially solved my problem:

sorted_df = merged_df.sort_values(['End Cap', 'Store'], ascending=[True, True])

    Week  Store End Cap       UPC
0      1      1       A  123456.0
7      2      1       A  321654.0
4      1      2       A  231567.0
11     2      2       A  132645.0
1      1      1       B  789456.0
2      1      1       B  546879.0
8      2      1       B       NaN
5      1      2       B  456123.0
12     2      2       B  787878.0
3      1      1       C  423156.0
9      2      1       C  852634.0
10     2      1       C  979541.0
6      1      2       D  689741.0
13     2      2       D  615432.0

sorted_filled = sorted_df.ffill()

which gave me:

    Week  Store End Cap       UPC
0      1      1       A  123456.0
7      2      1       A  321654.0
4      1      2       A  231567.0
11     2      2       A  132645.0
1      1      1       B  789456.0
2      1      1       B  546879.0
8      2      1       B  546879.0
5      1      2       B  456123.0
12     2      2       B  787878.0
3      1      1       C  423156.0
9      2      1       C  852634.0
10     2      1       C  979541.0
6      1      2       D  689741.0
13     2      2       D  615432.0

This output does add 546879 to Store 1's End Cap B for Week 2, but it does not add 789456, which I also need. I need it to add another row with that value as well.
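
A small illustration of why the forward fill stops at one value: `ffill` only overwrites NaN cells that already exist, it never creates rows. The sketch below uses a toy frame limited to Store 1, End Cap B (an assumption made to keep the example short), where Week 1 has two UPCs and Week 2 has none.

```python
import pandas as pd

keys = ["Week", "Store", "End Cap"]

# Toy data: one end cap, scanned with two UPCs in Week 1 and not scanned in Week 2.
hierarchy = pd.DataFrame({"Week": [1, 2], "Store": [1, 1], "End Cap": ["B", "B"]})
hist = pd.DataFrame({
    "Week": [1, 1],
    "Store": [1, 1],
    "End Cap": ["B", "B"],
    "UPC": [789456, 546879],
})

merged = hierarchy.merge(hist, how="left", on=keys)
filled = merged.ffill()

# The left merge yields a single NaN row for Week 2, and ffill keeps the row count.
print(len(merged), len(filled))   # 3 3
print(filled["UPC"].tolist())     # [789456.0, 546879.0, 546879.0]
```

Week 2 therefore ends up with only the last UPC seen above it; getting both UPCs requires an operation that can add rows, such as a reshape or a second merge.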

pandas

dataframe

2 Answers

Stack Overflow user

Accepted answer

Posted 2021-04-02 01:58:22

You can also do it this way, creating a helper column to handle the duplicate UPCs within each Store/Week/End Cap.

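idxcols = ['Week', 'Store', 'End Cap']

# Helper level: cumcount() numbers the repeated UPCs within each
# Week/Store/End Cap, which makes the index unique so it can be unstacked
hist_idx = hist.set_index(idxcols + [hist.groupby(idxcols).cumcount()])

hier_idx = hierarchy.set_index(idxcols + [hierarchy.groupby(idxcols).cumcount()])

# Pivot the weeks into columns, forward fill across weeks, then pivot back
hier_idx.join(hist_idx, how='right')\
        .unstack('Week')\
        .ffill(axis=1)\
        .stack('Week')\
        .reorder_levels([3,0,1,2])\
        .sort_index()\
        .reset_index()\
        .drop('level_3', axis=1)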

Output:

    Week  Store End Cap       UPC
0      1      1       A  123456.0
1      1      1       B  789456.0
2      1      1       B  546879.0
3      1      1       C  423156.0
4      1      2       A  231567.0
5      1      2       B  456123.0
6      1      2       D  689741.0
7      2      1       A  321654.0
8      2      1       B  789456.0
9      2      1       B  546879.0
10     2      1       C  852634.0
11     2      1       C  979541.0
12     2      2       A  132645.0
13     2      2       B  787878.0
14     2      2       D  615432.0
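
To see what the helper level in this answer contributes, here is a minimal sketch on the Week 1, Store 1 rows of the history table only (a reduced data set chosen for illustration; `plain`, `slot` and `helped` are hypothetical names). Without the counter, the key (1, 1, "B") occurs twice, and `unstack` refuses to reshape an index that contains duplicate entries.

```python
import pandas as pd

idxcols = ["Week", "Store", "End Cap"]

# Week 1, Store 1 rows from the history table in the question.
hist = pd.DataFrame({
    "Week": [1, 1, 1, 1],
    "Store": [1, 1, 1, 1],
    "End Cap": ["A", "B", "B", "C"],
    "UPC": [123456, 789456, 546879, 423156],
})

# Indexing on the three key columns alone leaves a duplicated key for End Cap B.
plain = hist.set_index(idxcols)
print(plain.index.is_unique)    # False

# cumcount() numbers the rows inside each Week/Store/End Cap group.
slot = hist.groupby(idxcols).cumcount()
print(slot.tolist())            # [0, 0, 1, 0]

# Adding the counter as a fourth index level makes every key unique.
helped = hist.set_index(idxcols + [slot])
print(helped.index.is_unique)   # True
```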

Votes: 1

Stack Overflow user

Posted 2021-04-02 01:16:24

You could try something like this:

# Rows that already have a UPC (index reset so len(df1) is always a new label)
df1 = merged_df[merged_df["UPC"].notna()].reset_index(drop=True)

# Rows whose UPC is missing
df2 = merged_df[merged_df["UPC"].isna()].copy()

# Look at the previous week
df2["Week"] = df2["Week"] - 1

# For each Week/Store/End Cap in df2, grab the matching UPC values in df1
# and append one new row per UPC (shifted back to the current week) to df1
for week, store, cap in zip(df2["Week"], df2["Store"], df2["End Cap"]):
    mask = (
        (df1["Week"] == week)
        & (df1["Store"] == store)
        & (df1["End Cap"] == cap)
    )
    for upc in df1.loc[mask, "UPC"].tolist():
        df1.loc[len(df1)] = [week + 1, store, cap, upc]

sorted_df = df1.sort_values(by=["Week", "Store", "End Cap"])
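
The same "look up the most recent scan" idea can also be written without loops. The following is a sketch, not either answerer's method: it first works out, for every row of the hierarchy, which week that end cap was last scanned in, then merges the history back in on that week. The column name `source_week` and the frame names `scanned` and `plan` are hypothetical helpers, and the data is copied from the question.

```python
import pandas as pd

keys = ["Week", "Store", "End Cap"]

hierarchy = pd.DataFrame({
    "Week":    [1] * 6 + [2] * 6,
    "Store":   [1, 1, 1, 2, 2, 2] * 2,
    "End Cap": ["A", "B", "C", "A", "B", "D"] * 2,
})
hist = pd.DataFrame({
    "Week":    [1] * 7 + [2] * 6,
    "Store":   [1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "End Cap": list("ABBCABD") + list("ACCABD"),
    "UPC":     [123456, 789456, 546879, 423156, 231567, 456123, 689741,
                321654, 852634, 979541, 132645, 787878, 615432],
})

# 1. One row per scanned Week/Store/End Cap, remembering the week it came from.
scanned = hist[keys].drop_duplicates()
scanned = scanned.assign(source_week=scanned["Week"])

# 2. Attach that week to the hierarchy; weeks without a scan get NaN, which is
#    then forward filled within each Store/End Cap in week order.
plan = hierarchy.merge(scanned, how="left", on=keys)
plan = plan.sort_values(["Store", "End Cap", "Week"])
plan["source_week"] = plan.groupby(["Store", "End Cap"])["source_week"].ffill()

# 3. Pull in every UPC recorded in the source week (this is what adds rows).
lookup = hist.rename(columns={"Week": "source_week"})
lookup = lookup.astype({"source_week": "float64"})
result = (
    plan.merge(lookup, how="left", on=["source_week", "Store", "End Cap"])
        .drop(columns="source_week")
        .sort_values(keys)
        .reset_index(drop=True)
)
print(result)
```

Because the forward fill runs over the week of the last scan rather than over the UPC values, this variant should also cope with gaps longer than one week and with end caps that hold a different number of UPCs from week to week. Rows with no earlier scan at all keep a NaN UPC.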

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/66896444
