首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何筛选值或通配符的多列?

如何筛选值或通配符的多列?
EN

Stack Overflow用户
提问于 2021-04-20 09:08:05
回答 1查看 187关注 0票数 2

我有一个带有模型信息的数据(df_model,~700行,~50种类型)和带有组件信息的数据(df_components,~400.000行)。

df_model:一个模型类型的列表,每个模型类型都有几个包含特定组件的行:

代码语言:javascript
复制
    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN
1   E2      W           W           IT      NaN     NaN     AC-8
2   E2      P           13.3        NaN     NaN     NaN     NaN
3   P1      C/C         C           IT      NaN     TU      NaN
4   P1      GR          G           NAI     XSM     NaN     NaN

df_component:可用组件列表+每个组件的交付时间:

代码语言:javascript
复制
    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9
2   W           W           IT      NaN     NaN     AC-8    CM3             3
3   P           13.3        NaN     NaN     NaN     NaN     CM4             13
4   P           15.3        NaN     NaN     NaN     NaN     CM5             20
5   B           BR          NaN     KRL     NaN     NaN     CM6             5

我想要编写一个函数--对于特定的模型类型--从model_components +最小交付时间返回每个匹配组件的传递时间。

例如:对于模型类型"E2“,有两个匹配组件用于"C/C C IT NaN GR NaN”。一个组件的交付时间为8,9.8<9 -> 9中的一个将被删除,因为我们需要每个模型组件的最佳交付时间。

问题:如果模型组件在列中包含"NaN“,则为通配符=对应组件列中的任何内容都应该匹配(如果这有帮助,则可以将通配符的np.nan更改为其他任何内容)。

例如,df_model的第一行:

代码语言:javascript
复制
    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN

列"param2“= "NaN”,因此这2行与df_component匹配,因为"param2“列并不重要:

代码语言:javascript
复制
    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9

因此,对于模型类型"E2“,最终结果应该是一个dict/JSON,包含:

代码语言:javascript
复制
{
    "leadtime_min": 13,
    "CM1": 8,
    "CM3": 3,
    "CM4": 13,
}

到目前为止,我对函数的处理方法是:

components

  • get

  • 选择给定的模型类型

  • 迭代每个模型组件,搜索匹配的每个组件的最小交付时间(如果有多个匹配组件)。如果几个匹配的组件具有相同的交付时间,那么它们都需要添加到结果JSON)

函数中的查询代码返回所有行,而不仅仅是匹配的行,我需要解决这个问题。

到目前为止我的代码是:

代码语言:javascript
复制
df_model = pd.DataFrame(
    {
        "type": ["E2", "E2", "E2", "P1", "P1"],
        "mainclass": ["C/C", "W", "P", "C/C", "GR"],
        "subclass": ["C", "W", "13.3", "C", "G"],
        "param1": ["IT", "IT", np.nan, "IT", "NAI"],
        "param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
        "param3": ["GR", np.nan, np.nan, "TU", np.nan],
        "param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
    }
)

df_components = pd.DataFrame(
    {
        "mainclass": [
            "C/C",
            "C/C",
            "W",
            "P",
            "P",
            "B",
        ],
        "subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
        "param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
        "param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
        "param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
        "param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
        "component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
        "delivery_time": [8, 9, 3, 13, 20, 5],
    }
)



def get_delivery_time_best(model_type: str, model: pd.DataFrame, component: pd.DataFrame)

    model = model.query("type == @model_type")
    fitting_components = {"delivery_time_min": sys.maxsize}
    delivery_times_all = []

    for model_component in df_model.itertuples():
        model_mainclass = model_component.mainclass
        model_subclass = model_component.subclass
        model_param1 = model_component.param1
        model_param2 = model_component.param2
        model_param3 = model_component.param3
        model_param4 = model_component.param4

        checked_components = components.query(
            """
            mainclass == @model_mainclass and \
            subclass == @model_subclass \
            and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
            and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
            and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
            and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
            """
        )

        component_descriptions = checked_components"component_name"    ].tolist()
        delivery_times = checked_components["delivery_time"].tolist()

        delivery_time_min = min(delivery_times )
        delivery_times_all.append(delivery_time_min )

        for component_description, delivery_time in zip(component_descriptions, delivery_times ):
            if delivery_time == delivery_time_min :
                fitting_components[component_description] = delivery_time 

    fitting_components["delivery_time_min "] = max(delivery_times_all )
    print(fitting_components)

get_delivery_time_best(model_type="E2", model=df_model, components=df_components)

但这会返回df_components中的所有行,而不是匹配的2行?

EN

回答 1

Stack Overflow用户

发布于 2021-04-20 10:10:03

Mergeboolean indexing

代码语言:javascript
复制
merged = df_components.merge(df_model, on=['mainclass', 'subclass'], suffixes=['', '_r'])

l = merged.filter(regex=r'^param\d+$')
r = merged[l.columns + '_r']

m = ((l.values == r.values) | r.isna().values).all(1)
merged[m].drop(r.columns, 1)

解释

将dataframe df_components与列上的df_model合并,并在右df中添加一个后缀_r,以唯一地标识右数据格式中的列。

代码语言:javascript
复制
>>> merged

  mainclass subclass   param1 param2 param3 param4 param1_r  param2_r param3_r  param4_r
0        1a       1b  vendor1  spec1    id1    NaN  vendor1       NaN      id1       NaN
1        1a       1b  vendor1    NaN    id1  rout3  vendor1       NaN      id1       NaN
2        1a       1b  vendor1    NaN    id2    NaN  vendor1       NaN      id1       NaN
3        1a       1b      NaN    NaN    NaN    NaN  vendor1       NaN      id1       NaN
4        1a       1b  vendor2    NaN    id1    NaN  vendor1       NaN      id1       NaN
5        1a       1b  vendor2    NaN    NaN    NaN  vendor1       NaN      id1       NaN

从左边和右边的dataframe中筛选param类似的列。

代码语言:javascript
复制
>>> l

    param1 param2 param3 param4
0  vendor1  spec1    id1    NaN
1  vendor1    NaN    id1  rout3
2  vendor1    NaN    id2    NaN
3      NaN    NaN    NaN    NaN
4  vendor2    NaN    id1    NaN
5  vendor2    NaN    NaN    NaN


>>> r

  param1_r  param2_r param3_r  param4_r
0  vendor1       NaN      id1       NaN
1  vendor1       NaN      id1       NaN
2  vendor1       NaN      id1       NaN
3  vendor1       NaN      id1       NaN
4  vendor1       NaN      id1       NaN
5  vendor1       NaN      id1       NaN

通过将左侧dataframe中的param列与右dataframe中相应的param列进行比较来创建一个布尔掩码,类似地,通过检查正确的NaN值来创建另一个布尔掩码。

代码语言:javascript
复制
>>> l.values == r.values

array([[ True, False,  True, False],
       [ True, False,  True, False],
       [ True, False, False, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False]])

>>> r.isna().values

array([[False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True]])

使用逻辑或上述掩码,并沿all沿axis=1减少结果掩码

代码语言:javascript
复制
>>> m

array([ True,  True, False, False, False, False])

现在,在掩码df_components的帮助下过滤m中的行

代码语言:javascript
复制
>>> merged[m].drop(r.columns, 1)

  mainclass subclass   param1 param2 param3 param4
0        1a       1b  vendor1  spec1    id1    NaN
1        1a       1b  vendor1    NaN    id1  rout3
票数 3
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/67175837

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档