Q: Ensemble of machine learning models in scikit-learn

Stack Overflow user

Asked on 2022-07-13 17:17:38

Answers: 4 | Views: 266 | Followers: 0 | Votes: 2

group        feature_1        feature_2       year            dependent_variable
group_a         12               19           2010               0.4
group_a         11               13           2011               0.9
group_a         10               5            2012               1.2
group_a         16               9            2013               3.2
group_b         8               29            2010               0.6
group_b         9               33            2011               0.1 
group_b         111             15            2012               2.1 
group_b         16              19            2013               12.2

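Using the data above, I want to predict dependent_variable from feature_1 and feature_2. To do this, I want to build two models: in the first, a separate model is fitted for each group; in the second, a single model is fitted on all available data. In both cases, data from 2010 to 2012 is used for training and 2013 for testing.

How can I build an ensemble from these two models? This is a toy dataset; the real one has many more groups, years, and features. In particular, I am interested in approaches that work with scikit-learn compatible models.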

python

scikit-learn

ensemble-learning

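4 Answers

Stack Overflow user

Answered on 2022-07-17 07:06:23

Creating the ensemble model takes several steps.

First, create the two models separately. For the first, split the data by group, train a separate model on each group, and combine them behind one function. For the second, the data can be used as a whole (apart from removing the test rows). Then write another function that joins the two into an ensemble.

To demonstrate, I will start by importing the necessary modules and loading the data into a dataframe: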

import io

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data_str = """group,feature_1,feature_2,year,dependent_variable
group_a,12,19,2010,0.4
group_a,11,13,2011,0.9
group_a,10,5,2012,1.2
group_a,16,9,2013,3.2
group_b,8,29,2010,0.6
group_b,9,33,2011,0.1
group_b,111,15,2012,2.1
group_b,16,19,2013,12.2"""

# read_csv parses the numeric columns as numbers rather than strings
data = pd.read_csv(io.StringIO(data_str))

# 2010-2012 for training, 2013 for testing
train = data.loc[data["year"] != 2013]
test = data.loc[data["year"] == 2013]

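This uses RandomForestRegressor, itself an ensemble model, but any regression model would do. Note also that this dataframe differs from the one shown in the question: its rows are indexed from 0 rather than by group, and group is an ordinary column.

Building the first model takes three steps:

Split the data into group a and group b.
Train two separate models.
Join the models.

The first two steps look like this: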

# Splitting Data
train_a = train.loc[train["group"] == "group_a"]
train_b = train.loc[train["group"] == "group_b"]
test_a = test.loc[test["group"] == "group_a"]
test_b = test.loc[test["group"] == "group_b"]

# Training Two Models
model_a = RandomForestRegressor()
model_a.fit(train_a.drop(["dependent_variable", "year", "group"], axis = "columns"), train_a.dependent_variable)
model_b = RandomForestRegressor()
model_b.fit(train_b.drop(["dependent_variable", "year", "group"], axis = "columns"), train_b.dependent_variable)

Their predict methods can then be combined:

def individual_predictor(group, feature_1, feature_2):
    if group == "group_a":
        return model_a.predict([[feature_1, feature_2]])[0]
    if group == "group_b":
        return model_b.predict([[feature_1, feature_2]])[0]
    raise ValueError(f"unknown group: {group!r}")

This takes a group and two features and returns the prediction. It can be adapted to whatever input and output types are needed.
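With more than two groups, as in the real dataset the question mentions, the if/elif dispatch can be replaced by a dictionary of fitted models keyed by group. The following is only a sketch under assumptions: the names FEATURES, group_models and predict_for_group are illustrative and not part of the original answer, and the models are fitted on NumPy arrays so that plain lists can be passed to predict.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data from the question.
CSV = """group,feature_1,feature_2,year,dependent_variable
group_a,12,19,2010,0.4
group_a,11,13,2011,0.9
group_a,10,5,2012,1.2
group_a,16,9,2013,3.2
group_b,8,29,2010,0.6
group_b,9,33,2011,0.1
group_b,111,15,2012,2.1
group_b,16,19,2013,12.2
"""
FEATURES = ["feature_1", "feature_2"]  # assumed feature column names

data = pd.read_csv(io.StringIO(CSV))
train = data[data["year"] != 2013]  # 2013 is held out for testing

# One fitted model per group, keyed by the group label.
group_models = {
    name: RandomForestRegressor(random_state=0).fit(
        frame[FEATURES].to_numpy(), frame["dependent_variable"].to_numpy()
    )
    for name, frame in train.groupby("group")
}


def predict_for_group(group, feature_1, feature_2):
    """Predict with the model trained on `group`; an unknown group raises KeyError."""
    return float(group_models[group].predict([[feature_1, feature_2]])[0])


prediction_a = predict_for_group("group_a", 16, 9)
```

Adding a group then only requires more rows in the training data; the dispatch code stays the same.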

For the second model, take the data as a whole and train a single model, which also removes the need to join models:

model = RandomForestRegressor()
model.fit(train.drop(["dependent_variable", "year", "group"], axis = "columns"), train.dependent_variable)

Finally, join these models into an ensemble by averaging the results of their predict methods:

def ensemble_predict(group, feature_1, feature_2):
    return (individual_predictor(group, feature_1, feature_2) + model.predict([[feature_1, feature_2]])[0]) / 2

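Again, this takes a group and two features and returns the result. It may need adjusting to another format, such as taking a list of inputs and returning a list of predictions.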
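One possible way to get both the batch input format and the scikit-learn compatibility the question asks for is to wrap the same averaging idea in an estimator class. This is a rough sketch, not a definitive implementation: GroupAverageEnsemble is a hypothetical class (not part of scikit-learn), it assumes X is a dataframe holding the group column plus numeric features, and falling back to the global model for unseen groups is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import RandomForestRegressor


class GroupAverageEnsemble(RegressorMixin, BaseEstimator):
    """Average a per-group model with one model fitted on all rows.

    Hypothetical helper for illustration; not part of scikit-learn.
    """

    def __init__(self, estimator=None, group_col="group"):
        self.estimator = estimator
        self.group_col = group_col

    def fit(self, X, y):
        base = self.estimator
        if base is None:
            base = RandomForestRegressor(random_state=0)
        features = X.drop(columns=[self.group_col]).to_numpy()
        y = np.asarray(y, dtype=float)
        # Model 2: a single model on all rows.
        self.global_model_ = clone(base).fit(features, y)
        # Model 1: one model per group.
        self.group_models_ = {}
        for name in X[self.group_col].unique():
            mask = (X[self.group_col] == name).to_numpy()
            self.group_models_[name] = clone(base).fit(features[mask], y[mask])
        return self

    def predict(self, X):
        features = X.drop(columns=[self.group_col]).to_numpy()
        global_pred = self.global_model_.predict(features)
        # Rows from groups unseen in fit keep the global prediction.
        group_pred = global_pred.copy()
        for name, model in self.group_models_.items():
            mask = (X[self.group_col] == name).to_numpy()
            if mask.any():
                group_pred[mask] = model.predict(features[mask])
        return (group_pred + global_pred) / 2


# Training rows (2010-2012) and test rows (2013) from the question's toy data.
train = pd.DataFrame({
    "group": ["group_a"] * 3 + ["group_b"] * 3,
    "feature_1": [12, 11, 10, 8, 9, 111],
    "feature_2": [19, 13, 5, 29, 33, 15],
    "dependent_variable": [0.4, 0.9, 1.2, 0.6, 0.1, 2.1],
})
test = pd.DataFrame({
    "group": ["group_a", "group_b"],
    "feature_1": [16, 16],
    "feature_2": [9, 19],
})

X_train = train.drop(columns=["dependent_variable"])
ensemble = GroupAverageEnsemble().fit(X_train, train["dependent_variable"])
predictions = ensemble.predict(test)  # one averaged prediction per test row
```

Because the base estimator is cloned for every fit, any regressor with fit and predict could be passed as estimator in place of the random forest.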

Votes: 0

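Stack Overflow user

Answered on 2022-07-18 01:47:36

This version uses two regressors, RandomForestRegressor and GradientBoostingRegressor.

I added more 2013 rows because r2_score needs more than one sample, and added rows for the other years as well. Copy the text below and save it to a txt file.

First we read the data file, then split it into train and test sets with dataframe operations. For each regressor we then build models 1.1 and 1.2 for groups "a" and "b" respectively, followed by model 2 on all the data. Each model is saved to disk after it is created, for later use.

After creating each model, we predict on the whole test set and on a single row. The r2_score and mean absolute error (MAE) metrics are printed as well.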
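As a small worked check of the two metrics, the snippet below recomputes them by hand for the two 2013 group_a rows, using the rounded random forest predictions printed in this answer's output for model 1.1. It is only an illustration; the variable names are not from the original code.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Targets and (rounded) predictions for the two 2013 group_a rows.
y_true = np.array([3.2, 2.6])
y_pred = np.array([1.811, 2.186])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of the targets around their mean
r2_manual = 1 - ss_res / ss_tot                 # about -10.67
mae_manual = np.mean(np.abs(y_true - y_pred))   # about 0.9015

# The hand computation agrees with scikit-learn's implementations.
r2_sklearn = r2_score(y_true, y_pred)
mae_sklearn = mean_absolute_error(y_true, y_pred)
```

With a single test row ss_tot is zero, so the ratio is undefined, which is why more than one 2013 row per group is needed. The strongly negative value here mostly reflects how small ss_tot is with only two targets.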

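The last part tests a saved model file by loading it and predicting with it. Predictions from the in-memory model and from the model loaded from disk should be identical. There are also sample input types and how to use them with the custom prediction function.

See also the docstrings and comments in the code for how it works.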

data.txt

group        feature_1        feature_2       year            dependent_variable
group_a         12               19           2010               0.4
group_a          7               15           2010               1.5
group_a         11               13           2011               0.9
group_a          8               8            2011               2.1
group_a         10               5            2012               1.2
group_a         11               9            2012               2.6
group_a         16               9            2013               3.2
group_a         8               10            2013               2.6
group_b         8               29            2010               0.6
group_b         11              18            2010               1.5
group_b         9               33            2011               0.1 
group_b         20              15            2011               2.8 
group_b         111             15            2012               2.1 
group_b         99              10            2012               3.6
group_b         16              19            2013               12.2
group_b         4                8            2013               5.1

Code

myensemble.py

"""sklearn ensemble modeling.

Dependencies:
    * sklearn
    * pandas
    * numpy

References:
    * https://scikit-learn.org/stable/modules/classes.html?highlight=ensemble#module-sklearn.ensemble
    * https://pandas.pydata.org/docs/user_guide/indexing.html
"""


from typing import List, Union, Optional
import pickle  # for saving file to disk

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np


def make_model(regressor, regname: str, modelfn: str, dfX: pd.DataFrame, dfy: pd.Series):
    """Creates a model and saves it to disk.

    Args:
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: Regressor name, used as the prefix of the saved filename.
        modelfn: Filename suffix for the pickled model.
        dfX: The features in pandas dataframe.
        dfy: The target as a pandas series.

    Returns:
        The fitted model.
    """
    X = dfX.to_numpy()
    y = dfy.to_numpy()
    model = regressor(random_state=0)
    model.fit(X, y)

    # Save model.
    with open(f'{regname}_{modelfn}', 'wb') as f:
        pickle.dump(model, f)

    return model


def get_prediction(model, test: Union[List, pd.DataFrame, np.ndarray]) -> Optional[np.ndarray]:
    """Returns prediction based on model and test input or None.

    A list or 1-D array is treated as a single sample.
    """
    if isinstance(test, (list, np.ndarray)):
        return model.predict([test])
    if isinstance(test, pd.DataFrame):
        return model.predict(test.to_numpy())
    return None


def model_and_prediction(df: pd.DataFrame, regressor, regname: str, modelfn: str):
    """Build model and show prediction and metrics.

    To build a model we need a training data X with features
    and data y with target or dependent values.

    Args:
        df: A dataframe.
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: The regressor name.
        modelfn: The filename where model will be saved to disk.

    Returns:
        None
    """
    features = ['feature_1', 'feature_2']

    # 1. Get the train dataframe
    train = df.loc[df.year != 2013]  # exclude 2013 in training data
    train_feature = train[features]  # select the features column
    train_target = train.dependent_variable  # select the dependent column

    model = make_model(regressor, regname, modelfn, train_feature, train_target)

    # 2. Get the test dataframe
    test = df.loc[df.year == 2013]  # only include 2013 in test data
    test_feature = test[features]
    test_target = test.dependent_variable

    # 3. Get the prediction from all rows in test feature. See step 5
    # for single data prediction.
    prediction: np.ndarray = model.predict(np.array(test_feature))

    print(f'test feature:\n{np.array(test_feature)}')
    print(f'test prediction: {prediction}')  # prediction[0] ...
    print(f'test target: {np.array(test_target)}')

    # 4. metrics
    print(f'r2_score: {r2_score(test_target, prediction)}')
    print(f'mean_absolute_error: {mean_absolute_error(test_target, prediction)}\n')

    # 5. Get prediction from the first row of test features.
    prediction_1: np.ndarray = model.predict(np.array(test_feature.iloc[[0]]))
    print(f'1st row test:\n{test_feature.iloc[[0]]}')
    print(f'1st row test prediction array: {prediction_1}')
    print(f'1st row test prediction value: {prediction_1[0]}\n')  # get the element value


def main():
    datafn = 'data.txt'
    df = pd.read_fwf(datafn)
    print(df.to_string(index=False))

    # A. Create models for each type of regressor.
    regressors = [(RandomForestRegressor, 'RandomForrest'),
                  (GradientBoostingRegressor, 'GradientBoosting')]

    for (r, name) in regressors:
        print(f'::: Regressor: {name} :::\n')

        # Model 1 using group_a
        print(':: MODEL 1.1 ::')
        grp = 'group_a'
        modelfn = f'{grp}.pkl'  # filename of model to be save to disk
        dfa = df.loc[df.group == grp]  # select group
        model_and_prediction(dfa, r, name, modelfn)

        # Model 1 using group_b
        print(':: MODEL 1.2 ::')
        grp = 'group_b'
        modelfn = f'{grp}.pkl'
        dfb = df.loc[df.group == grp]
        model_and_prediction(dfb, r, name, modelfn)

        # Model 2 using group a and b
        print(':: MODEL 2 ::')
        grp = 'group_ab'
        modelfn = f'{grp}.pkl'
        dfab = df.loc[(df.group == 'group_a') | (df.group == 'group_b')]
        model_and_prediction(dfab, r, name, modelfn)

    # B. Test saved model file prediction.
    print('::: Prediction from loaded model :::')
    mfn = 'GradientBoosting_group_ab.pkl'
    print(f'model: gradient boosting model 2, {mfn}')

    with open(mfn, 'rb') as f:
        loaded_model = pickle.load(f)

    # test: group_b  4  8  2013  5.1    
    test = [4, 8]
    prediction = loaded_model.predict([test])
    print(f'test: {test}')
    print(f'prediction: {prediction[0]}\n')

    # C. Use get_prediction().

    # input from list
    test = [4, 8]
    prediction = get_prediction(loaded_model, test)
    print(f'test from list input:\n{test}')
    print(f'prediction from get_prediction() with list input: {prediction}\n')

    # input from dataframe
    testdata = {
        'feature_1': [8, 12],
        'feature_2': [19, 15],
    }
    testdf = pd.DataFrame(testdata)
    testrow = testdf.iloc[[0]]  # first row [8, 19]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    testrow = testdf.iloc[[1]]  # second row [12, 15]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    # input from numpy
    test = [8, 9]
    testnp = np.array(test)
    prediction = get_prediction(loaded_model, testnp)
    print(f'test from numpy input:\n{testnp}')
    print(f'prediction from get_prediction() with numpy input: {prediction}\n')


if __name__ == '__main__':
    main()

Output

  group  feature_1  feature_2  year  dependent_variable
group_a         12         19  2010                 0.4
group_a          7         15  2010                 1.5
group_a         11         13  2011                 0.9
group_a          8          8  2011                 2.1
group_a         10          5  2012                 1.2
group_a         11          9  2012                 2.6
group_a         16          9  2013                 3.2
group_a          8         10  2013                 2.6
group_b          8         29  2010                 0.6
group_b         11         18  2010                 1.5
group_b          9         33  2011                 0.1
group_b         20         15  2011                 2.8
group_b        111         15  2012                 2.1
group_b         99         10  2012                 3.6
group_b         16         19  2013                12.2
group_b          4          8  2013                 5.1

::: Regressor: RandomForrest :::

:: MODEL 1.1 ::
test feature:
[[16  9]
 [ 8 10]]
test prediction: [1.811 2.186]
test target: [3.2 2.6]
r2_score: -10.67065000000004
mean_absolute_error: 0.9015000000000026

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [1.811]
1st row test prediction value: 1.8109999999999986

:: MODEL 1.2 ::
test feature:
[[16 19]
 [ 4  8]]
test prediction: [2.116 2.408]
test target: [12.2  5.1]
r2_score: -3.3219170799444546
mean_absolute_error: 6.388

1st row test:
    feature_1  feature_2
14         16         19
1st row test prediction array: [2.116]
1st row test prediction value: 2.116000000000001

:: MODEL 2 ::
test feature:
[[16  9]
 [ 8 10]
 [16 19]
 [ 4  8]]
test prediction: [2.425 2.145 1.01  1.958]
test target: [ 3.2  2.6 12.2  5.1]
r2_score: -1.3250936994738867
mean_absolute_error: 3.8905000000000016

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [2.425]
1st row test prediction value: 2.4249999999999985

::: Regressor: GradientBoosting :::

:: MODEL 1.1 ::
test feature:
[[16  9]
 [ 8 10]]
test prediction: [2.59996945 2.21271005]
test target: [3.2 2.6]
r2_score: -1.8335008778823685
mean_absolute_error: 0.4936602458577084

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [2.59996945]
1st row test prediction value: 2.59996945439128

:: MODEL 1.2 ::
test feature:
[[16 19]
 [ 4  8]]
test prediction: [1.99807124 2.63511811]
test target: [12.2  5.1]
r2_score: -3.3703627491779713
mean_absolute_error: 6.333405322236132

1st row test:
    feature_1  feature_2
14         16         19
1st row test prediction array: [1.99807124]
1st row test prediction value: 1.9980712422931164

:: MODEL 2 ::
test feature:
[[16  9]
 [ 8 10]
 [16 19]
 [ 4  8]]
test prediction: [3.60257456 2.26208935 0.402739   2.10950224]
test target: [ 3.2  2.6 12.2  5.1]
r2_score: -1.538939968014979
mean_absolute_error: 3.882060991360607

1st row test:
   feature_1  feature_2
6         16          9
1st row test prediction array: [3.60257456]
1st row test prediction value: 3.6025745572622014

::: Prediction from loaded model :::
model: gradient boosting model 2, GradientBoosting_group_ab.pkl
test: [4, 8]
prediction: 2.1095022367629728

test from list input:
[4, 8]
prediction from get_prediction() with list input: [2.10950224]

test from df input:
   feature_1  feature_2
0          8         19
prediction from get_prediction() with df input: [0.50307204]

test from df input:
   feature_1  feature_2
1         12         15
prediction from get_prediction() with df input: [1.46058714]

test from numpy input:
[8 9]
prediction from get_prediction() with numpy input: [2.30007317]
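The script above fits and scores the two regressors separately. One way to combine them into a single ensemble is scikit-learn's VotingRegressor, which averages the predictions of its fitted estimators. The sketch below applies it to the group_a rows of data.txt; it is an illustration under that assumption, not part of the original script.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

# group_a rows from data.txt: 2010-2012 for training, 2013 for testing.
X_train = np.array([[12, 19], [7, 15], [11, 13], [8, 8], [10, 5], [11, 9]])
y_train = np.array([0.4, 1.5, 0.9, 2.1, 1.2, 2.6])
X_test = np.array([[16, 9], [8, 10]])

# VotingRegressor fits a clone of each estimator and averages their predictions.
voter = VotingRegressor([
    ("rf", RandomForestRegressor(random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
])
voter.fit(X_train, y_train)
averaged = voter.predict(X_test)

# The same average computed by hand from the fitted sub-models.
manual = np.mean([est.predict(X_test) for est in voter.estimators_], axis=0)
```

The optional weights parameter of VotingRegressor can be used if one regressor should count for more than the other.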

Votes: 0

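Stack Overflow user

Answered on 2022-07-21 04:52:49

First, build models with a time-series algorithm (using only the date variable and the dependent variable), fbprophet (features + date + dependent variable), and tree-based regressors such as CatBoost/XGBoost/LightGBM (features + date + dependent variable).

Fit each of these algorithms per group (a bottom-up approach). Different models suit different groups. Take a weighted mean based on model performance. Say the best predictor for group_a is CatBoost, followed by fbprophet, then an exponential moving average: use weights proportional to those models' accuracy.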
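The weighting step can be sketched as follows. The answer does not say how "accuracy" is measured, so this example assumes each model's weight is the inverse of its validation MAE, normalised to sum to one; the function name and all numbers are hypothetical.

```python
import numpy as np


def accuracy_weighted_mean(predictions, errors):
    """Blend model predictions with weights proportional to 1 / error.

    predictions: array-like of shape (n_models, n_samples)
    errors: array-like of shape (n_models,), strictly positive (e.g. validation MAE)
    """
    predictions = np.asarray(predictions, dtype=float)
    errors = np.asarray(errors, dtype=float)
    weights = 1.0 / errors        # lower error gives a higher weight
    weights /= weights.sum()      # normalise so the weights sum to 1
    return weights @ predictions  # weighted mean per sample


# Hypothetical predictions of three models for two samples of one group,
# with hypothetical validation MAEs (best model first).
preds = [[2.0, 4.0], [3.0, 5.0], [6.0, 8.0]]
maes = [0.5, 1.0, 2.0]
blended = accuracy_weighted_mean(preds, maes)
```

Here the weights come out as 4/7, 2/7 and 1/7, so the blend stays closest to the model with the lowest error.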

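You can aggregate the group-level model results to get an aggregate result. You can also build a separate model on the aggregated data (summed up by year).

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: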

https://stackoverflow.com/questions/72970228
