我有一个带有模型信息的数据(df_model,~700行,~50种类型)和带有组件信息的数据(df_components,~400.000行)。
df_model:一个模型类型的列表,每个模型类型都有几个包含特定组件的行:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN
1 E2 W W IT NaN NaN AC-8
2 E2 P 13.3 NaN NaN NaN NaN
3 P1 C/C C IT NaN TU NaN
4 P1 GR G NAI XSM NaN NaNdf_component:可用组件列表+每个组件的交付时间:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9
2 W W IT NaN NaN AC-8 CM3 3
3 P 13.3 NaN NaN NaN NaN CM4 13
4 P 15.3 NaN NaN NaN NaN CM5 20
5 B BR NaN KRL NaN NaN CM6 5我想要编写一个函数--对于特定的模型类型--从model_components +最小交付时间返回每个匹配组件的传递时间。
例如:对于模型类型"E2“,有两个匹配组件用于"C/C C IT NaN GR NaN”。一个组件的交付时间为8,9.8<9 -> 9中的一个将被删除,因为我们需要每个模型组件的最佳交付时间。
问题:如果模型组件在列中包含"NaN“,则为通配符=对应组件列中的任何内容都应该匹配(如果这有帮助,则可以将通配符的np.nan更改为其他任何内容)。
例如,df_model的第一行:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN列"param2“= "NaN”,因此这2行与df_component匹配,因为"param2“列并不重要:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9因此,对于模型类型"E2“,最终结果应该是一个dict/JSON,包含:
{
"leadtime_min": 13,
"CM1": 8,
"CM3": 3,
"CM4": 13,
}到目前为止,我对函数的处理方法是:
components
中
函数中的查询代码返回所有行,而不仅仅是匹配的行,我需要解决这个问题。
到目前为止我的代码是:
df_model = pd.DataFrame(
{
"type": ["E2", "E2", "E2", "P1", "P1"],
"mainclass": ["C/C", "W", "P", "C/C", "GR"],
"subclass": ["C", "W", "13.3", "C", "G"],
"param1": ["IT", "IT", np.nan, "IT", "NAI"],
"param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
"param3": ["GR", np.nan, np.nan, "TU", np.nan],
"param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
}
)
df_components = pd.DataFrame(
{
"mainclass": [
"C/C",
"C/C",
"W",
"P",
"P",
"B",
],
"subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
"param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
"param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
"param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
"param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
"component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
"delivery_time": [8, 9, 3, 13, 20, 5],
}
)
def get_delivery_time_best(model_type: str, model: pd.DataFrame, component: pd.DataFrame)
model = model.query("type == @model_type")
fitting_components = {"delivery_time_min": sys.maxsize}
delivery_times_all = []
for model_component in df_model.itertuples():
model_mainclass = model_component.mainclass
model_subclass = model_component.subclass
model_param1 = model_component.param1
model_param2 = model_component.param2
model_param3 = model_component.param3
model_param4 = model_component.param4
checked_components = components.query(
"""
mainclass == @model_mainclass and \
subclass == @model_subclass \
and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
"""
)
component_descriptions = checked_components"component_name" ].tolist()
delivery_times = checked_components["delivery_time"].tolist()
delivery_time_min = min(delivery_times )
delivery_times_all.append(delivery_time_min )
for component_description, delivery_time in zip(component_descriptions, delivery_times ):
if delivery_time == delivery_time_min :
fitting_components[component_description] = delivery_time
fitting_components["delivery_time_min "] = max(delivery_times_all )
print(fitting_components)
get_delivery_time_best(model_type="E2", model=df_model, components=df_components)但这会返回df_components中的所有行,而不是匹配的2行?
发布于 2021-04-20 10:10:03
Merge和boolean indexing
merged = df_components.merge(df_model, on=['mainclass', 'subclass'], suffixes=['', '_r'])
l = merged.filter(regex=r'^param\d+$')
r = merged[l.columns + '_r']
m = ((l.values == r.values) | r.isna().values).all(1)
merged[m].drop(r.columns, 1)解释
将dataframe df_components与列上的df_model合并,并在右df中添加一个后缀_r,以唯一地标识右数据格式中的列。
>>> merged
mainclass subclass param1 param2 param3 param4 param1_r param2_r param3_r param4_r
0 1a 1b vendor1 spec1 id1 NaN vendor1 NaN id1 NaN
1 1a 1b vendor1 NaN id1 rout3 vendor1 NaN id1 NaN
2 1a 1b vendor1 NaN id2 NaN vendor1 NaN id1 NaN
3 1a 1b NaN NaN NaN NaN vendor1 NaN id1 NaN
4 1a 1b vendor2 NaN id1 NaN vendor1 NaN id1 NaN
5 1a 1b vendor2 NaN NaN NaN vendor1 NaN id1 NaN从左边和右边的dataframe中筛选param类似的列。
>>> l
param1 param2 param3 param4
0 vendor1 spec1 id1 NaN
1 vendor1 NaN id1 rout3
2 vendor1 NaN id2 NaN
3 NaN NaN NaN NaN
4 vendor2 NaN id1 NaN
5 vendor2 NaN NaN NaN
>>> r
param1_r param2_r param3_r param4_r
0 vendor1 NaN id1 NaN
1 vendor1 NaN id1 NaN
2 vendor1 NaN id1 NaN
3 vendor1 NaN id1 NaN
4 vendor1 NaN id1 NaN
5 vendor1 NaN id1 NaN通过将左侧dataframe中的param列与右dataframe中相应的param列进行比较来创建一个布尔掩码,类似地,通过检查正确的NaN值来创建另一个布尔掩码。
>>> l.values == r.values
array([[ True, False, True, False],
[ True, False, True, False],
[ True, False, False, False],
[False, False, False, False],
[False, False, True, False],
[False, False, False, False]])
>>> r.isna().values
array([[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True]])使用逻辑或上述掩码,并沿all沿axis=1减少结果掩码
>>> m
array([ True, True, False, False, False, False])现在,在掩码df_components的帮助下过滤m中的行
>>> merged[m].drop(r.columns, 1)
mainclass subclass param1 param2 param3 param4
0 1a 1b vendor1 spec1 id1 NaN
1 1a 1b vendor1 NaN id1 rout3https://stackoverflow.com/questions/67175837
复制相似问题