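I have purchase data with multiple columns, including the following three:
PURCHASE_ID (index of purchase)
WORKER_ID (index of worker)
ACCOUNT_ID (index of account)
A worker can have multiple accounts associated with them, and an account can have multiple workers.
If I create a worker entity and an account entity and add the relationships, I get an error:
KeyError: 'Variable: ACCOUNT_ID not found in entity'
Here is my code so far: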
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

d = {'PURCHASE_ID': [1, 2],
     'WORKER_ID': [0, 0],
     'ACCOUNT_ID': [1, 2],
     'COST': [5, 10],
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)

data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}

es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                              dataframe=df,
                              index='PURCHASE_ID',
                              time_index='PURCHASE_TIME',
                              variable_types=data_variable_types)

es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='workers',
                    index='WORKER_ID',
                    additional_variables=['ACCOUNT_ID'],
                    make_time_index=False)

es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='accounts',
                    index='ACCOUNT_ID',
                    additional_variables=['WORKER_ID'],
                    make_time_index=False)

fm, features = ft.dfs(entityset=es,
                      target_entity='purchases',
                      agg_primitives=['mean'],
                      trans_primitives=[],
                      verbose=True)
features
How can I split out the entities so that they capture the many-to-many relationship?
Posted on 2018-10-04 13:52:28
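Your approach is correct, but you don't need the additional_variables argument. If you omit it, your code runs without problems.
The purpose of the additional_variables argument to EntitySet.normalize_entity is to include other variables you want in the new parent entity you are creating. For example, if you had variables for hire date, salary, location, and so on, you would pass those as additional variables because they are static with respect to a worker. In this case, I don't think you have any variables like that. Variables listed in additional_variables are also moved out of the base entity, which is why the second normalize_entity call can no longer find ACCOUNT_ID in purchases.
Here is the code and the output I see: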
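To make the cause of the KeyError concrete, here is a minimal sketch in plain Python of what moving a column into a parent table does. The `normalize` helper is hypothetical and only models the behaviour described above; it is not featuretools' implementation, and keeping the first value seen per parent is an assumption of this model.

```python
# Simplified model of normalization; NOT featuretools' actual implementation.

def normalize(base_rows, index, additional_variables=()):
    """Build a parent table keyed by `index`, moving `additional_variables`
    out of `base_rows` (a list of dicts) and into the parent table."""
    parent = {}
    for row in base_rows:
        if index not in row:
            raise KeyError(f"Variable: {index} not found in entity")
        # Keep the first value seen for each moved column. This only makes
        # sense for columns that are static per parent (e.g. a hire date).
        parent.setdefault(row[index],
                          {col: row[col] for col in additional_variables})
    for row in base_rows:
        for col in additional_variables:
            del row[col]  # the column no longer lives in the base table
    return parent


purchases = [
    {"PURCHASE_ID": 1, "WORKER_ID": 0, "ACCOUNT_ID": 1, "COST": 5},
    {"PURCHASE_ID": 2, "WORKER_ID": 0, "ACCOUNT_ID": 2, "COST": 10},
]

# First call: ACCOUNT_ID is moved out of purchases and into workers.
workers = normalize(purchases, "WORKER_ID", additional_variables=["ACCOUNT_ID"])

# Second call: ACCOUNT_ID is gone from purchases, so the lookup fails.
try:
    normalize(purchases, "ACCOUNT_ID")
    second_call_failed = False
except KeyError:
    second_call_failed = True
```

Note that in this model worker 0 ends up with a single ACCOUNT_ID even though the purchases reference two accounts, which is another sign that ACCOUNT_ID is not a static property of a worker and does not belong in additional_variables.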
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

d = {'PURCHASE_ID': [1, 2],
     'WORKER_ID': [0, 0],
     'ACCOUNT_ID': [1, 2],
     'COST': [5, 10],
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)

data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}

es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                              dataframe=df,
                              index='PURCHASE_ID',
                              time_index='PURCHASE_TIME',
                              variable_types=data_variable_types)

# No additional_variables: WORKER_ID and ACCOUNT_ID both stay in purchases
es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='workers',
                    index='WORKER_ID',
                    make_time_index=False)

es.normalize_entity(base_entity_id='purchases',
                    new_entity_id='accounts',
                    index='ACCOUNT_ID',
                    make_time_index=False)

fm, features = ft.dfs(entityset=es,
                      target_entity='purchases',
                      agg_primitives=['mean'],
                      trans_primitives=[],
                      verbose=True)
features
This outputs:
[<Feature: WORKER_ID>,
<Feature: ACCOUNT_ID>,
<Feature: COST>,
<Feature: workers.MEAN(purchases.COST)>,
<Feature: accounts.MEAN(purchases.COST)>]
If we change the target entity and increase the depth:
fm, features = ft.dfs(entityset=es,
                      target_entity='workers',
                      agg_primitives=['mean', 'count'],
                      max_depth=3,
                      trans_primitives=[],
                      verbose=True)
features
The output is now features for the workers entity:
[<Feature: COUNT(purchases)>,
<Feature: MEAN(purchases.COST)>,
<Feature: MEAN(purchases.accounts.MEAN(purchases.COST))>,
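<Feature: MEAN(purchases.accounts.COUNT(purchases))>]
Let's explain the feature named MEAN(purchases.accounts.COUNT(purchases)).
Reading from the inside out: for each purchase made by a given worker, accounts.COUNT(purchases) is the total number of purchases made on the account involved in that purchase. MEAN then averages that count over all of the worker's purchases.
In other words, "what is the average number of purchases made by the accounts related to this worker's purchases?"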
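As a rough illustration of what that feature measures, the same quantity can be computed by hand in plain Python. This is only a sketch of the arithmetic, not how featuretools evaluates features, and the third purchase (worker 1 on account 1) is a hypothetical row added so that an account is shared between two workers.

```python
from collections import Counter, defaultdict
from statistics import mean

# (PURCHASE_ID, WORKER_ID, ACCOUNT_ID)
purchases = [
    (1, 0, 1),
    (2, 0, 2),
    (3, 1, 1),  # hypothetical extra row, not in the original data
]

# accounts.COUNT(purchases): total purchases made on each account
purchases_per_account = Counter(account for _, _, account in purchases)

# purchases.accounts.COUNT(purchases): for every purchase of a worker,
# look up the count belonging to that purchase's account
counts_by_worker = defaultdict(list)
for _, worker, account in purchases:
    counts_by_worker[worker].append(purchases_per_account[account])

# MEAN(...): average those counts over each worker's purchases
feature = {worker: mean(counts) for worker, counts in counts_by_worker.items()}
```

Here worker 0 bought on account 1 (two purchases in total) and account 2 (one purchase), so the feature value is the mean of 2 and 1, while worker 1 only touched account 1.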
https://stackoverflow.com/questions/52629549