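After preprocessing, my data is a single table in which several columns are features and one column is the label. I would like to use featuretools.dfs to help me predict the label. Can I do this directly, or do I need to split the single table into multiple tables?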
Posted on 2018-05-08 22:50:30
You can run DFS on a single table. For example, if you have a pandas dataframe df with index 'index', you would write:
import featuretools as ft
es = ft.EntitySet('Transactions')
es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')
fm, features = ft.dfs(entityset=es,
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])
The resulting feature matrix looks like this:
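In [1]: fm
Out[1]:
             location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
index
1         main street          3              4           12         29
2         main street          4              5           12         30
3         main street          5              6           12         31
4      arlington ave.         18              0            1          1
5      arlington ave.          1              1            1          2
This applies "transform" primitives to your data, which compute new values one row at a time. You will usually want to give ft.dfs more than one entity so that it can also use aggregation primitives. The difference between the two is covered in our documentation.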
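To make the idea of a transform primitive concrete, here is a minimal pandas-only sketch that computes the same three date parts by hand. It does not call featuretools; the variable name manual and the column labels are just illustrative choices that mimic the feature names in the output above, and the toy data is the same as in the answer's example.

```python
import pandas as pd

# Same toy data as in the answer's example.
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "location": ["main street"] * 3 + ["arlington ave."] * 2,
    "pies sold": [3, 4, 5, 18, 1],
    "date": pd.date_range("2017-12-29", periods=5, freq="D"),
}).set_index("index")

# Each new column depends only on the value in the same row,
# which is what makes these "transform" style features.
manual = df.assign(**{
    "WEEKDAY(date)": df["date"].dt.weekday,  # Monday=0 ... Sunday=6
    "MONTH(date)": df["date"].dt.month,
    "DAY(date)": df["date"].dt.day,
}).drop(columns="date")

print(manual)
```

Because every value comes from a single row, no second table is needed for features of this kind.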
A standard workflow is to normalize your single entity by an interesting categorical variable. Suppose your df is this single table:
| index | location       | pies sold | date       |
|-------+----------------+-----------+------------|
| 1     | main street    | 3         | 2017-12-29 |
| 2     | main street    | 4         | 2017-12-30 |
| 3     | main street    | 5         | 2017-12-31 |
| 4     | arlington ave. | 18        | 2018-01-01 |
| 5     | arlington ave. | 1         | 2018-01-02 |
You might be interested in normalizing by location:
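es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')
Your new entity locations would then contain the following table (the first_log_time column appears when log has a time index, as in the full example further down):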
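| location       | first_log_time |
|----------------+----------------|
| main street    | 2017-12-29     |
| arlington ave. | 2018-01-01     |
This makes features such as locations.SUM(log.pies sold) or locations.MEAN(log.pies sold) possible, which sum or average all the values for each location. You can see these features being created in the IPython session below.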
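For intuition about what normalizing and then aggregating amounts to, here is a rough pandas-only sketch under the same toy data. It is not how featuretools is implemented; the names locations, sum_pies, mean_pies and manual are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Same toy data as in the answer's example.
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "location": ["main street"] * 3 + ["arlington ave."] * 2,
    "pies sold": [3, 4, 5, 18, 1],
    "date": pd.date_range("2017-12-29", periods=5, freq="D"),
})

# One row per location: roughly what the normalized entity holds,
# plus two aggregations over the log rows belonging to each location.
locations = df.groupby("location", sort=False).agg(
    first_log_time=("date", "min"),
    sum_pies=("pies sold", "sum"),
    mean_pies=("pies sold", "mean"),
)

# Broadcast the per-location values back onto every log row.
manual = df.join(locations, on="location")

print(locations)
print(manual)
```

The per-location sums and means repeat on every row that shares a location, which is the pattern the locations.SUM(...) and locations.MEAN(...) columns follow in the feature matrix.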
In [1]: import pandas as pd
   ...: import featuretools as ft
   ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
   ...:                    'location': ['main street',
   ...:                                 'main street',
   ...:                                 'main street',
   ...:                                 'arlington ave.',
   ...:                                 'arlington ave.'],
   ...:                    'pies sold': [3, 4, 5, 18, 1]})
   ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
   ...: df
   ...:
Out[1]:
   index        location  pies sold       date
0      1     main street          3 2017-12-29
1      2     main street          4 2017-12-30
2      3     main street          5 2017-12-31
3      4  arlington ave.         18 2018-01-01
4      5  arlington ave.          1 2018-01-02
In [2]: es = ft.EntitySet('Transactions')
   ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index',
   ...:                          time_index='date')
   ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations',
   ...:                     index='location')
   ...:
Out[2]:
Entityset: Transactions
  Entities:
    log [Rows: 5, Columns: 4]
    locations [Rows: 2, Columns: 2]
  Relationships:
    log.location -> locations.location
In [3]: fm, features = ft.dfs(entityset=es,
   ...:                       target_entity='log',
   ...:                       agg_primitives=['sum', 'mean'],
   ...:                       trans_primitives=['day'])
   ...: fm
   ...:
Out[3]:
             location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
index
1         main street          3         29                             29                            4.0                            12
2         main street          4         30                             29                            4.0                            12
3         main street          5         31                             29                            4.0                            12
4      arlington ave.         18          1                              1                            9.5                            19
5      arlington ave.          1          2                              1                            9.5                            19
https://stackoverflow.com/questions/50145953