
Model prediction with a decision tree algorithm

Stack Overflow user
Asked 2019-10-26 11:31:18
Answers 1, Views 222, Followers 0, Votes 3

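I have been trying to make predictions on a DataFrame using a model I built with a decision tree algorithm.

My model scores 0.96. I then tried to use the model to predict on a DataFrame of the people who stayed, but I got an error. The goal is to predict, from that DataFrame, which people will leave the company in the future.

How can I achieve this?

Here is what I did:

  1. Read the DataFrame from my GitHub repository and split it into people who left and people who did not.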
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/bhaskoro-muthohar/DataScienceLearning/master/HR_comma_sep.csv')

leftdf = df[df['left']==1]
notleftdf = df[df['left']==0]
  2. Prepare the data for building the model
df.salary = df.salary.map({'low':0,'medium':1,'high':2})
df.salary
X = df.drop(['left','sales'],axis=1)
y = df['left']
  3. Split the data into training and test sets
import numpy as np
from sklearn.model_selection import train_test_split


#splitting the train and test sets
X_train, X_test, y_train, y_test= train_test_split(X,y,random_state=0, stratify=y)
  4. Train the model
from sklearn import tree
clftree = tree.DecisionTreeClassifier(max_depth=3)
clftree.fit(X_train,y_train)
  5. Evaluate the model
y_pred = clftree.predict(X_test)
print("Test set prediction:\n {}".format(y_pred))
print("Test set score: {:.2f}".format(clftree.score(X_test, y_test)))

The result is

Test set score: 0.96

  6. Then I tried to use the model to predict on the DataFrame of people who have not left the company yet
X_new = notleftdf.drop(['left','sales'],axis=1)

#Map salary to 0,1,2
X_new.salary = X_new.salary.map({'low':0,'medium':1,'high':2})
X_new.salary
prediction_will_left = clftree.predict(X_new)
print("Prediction: {}".format(prediction_will_left))
print("Predicted target name: {}".format(
    notleftdf['left'][prediction_will_left]
))

The error I got is:

KeyError: "None of [Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n            ...\n            0, 0, 0, 0, 0, 0, 1, 0, 0, 0],\n           dtype='int64', length=11428)] are in the [index]"

How can I fix this?

PS: The link to the full script is here.

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-10-26 12:11:32

Maybe you are looking for something like this. (The script is self-contained once you download the data file to the same directory.)

from sklearn import tree
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd


def process_df_for_ml(df):
    """
    Process a dataframe for model training/prediction use.

    Returns X/y tensors.
    """

    df = df.copy()
    # Map salary to 0,1,2
    df.salary = df.salary.map({"low": 0, "medium": 1, "high": 2})
    # dropping left and sales X for the df, y for the left
    X = df.drop(["left", "sales"], axis=1)
    y = df["left"]
    return (X, y)

# Read and reindex CSV.
df = pd.read_csv("HR_comma_sep.csv")
df = df.reindex()

# Train a decision tree.
X, y = process_df_for_ml(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
clftree = tree.DecisionTreeClassifier(max_depth=3)
clftree.fit(X_train, y_train)

# Test the decision tree on people who haven't left yet.
notleftdf = df[df["left"] == 0].copy()
X, y = process_df_for_ml(notleftdf)
# Plug in a new column with ones and zeroes from the prediction.
notleftdf["will_leave"] = clftree.predict(X)
# Print those with the will-leave flag on.
print(notleftdf[notleftdf["will_leave"] == 1])
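
The KeyError in the question seems to come from the expression notleftdf['left'][prediction_will_left]: indexing a Series with an integer array treats the values as row labels, and the error message indicates that the labels 0 and 1 are not in the index of notleftdf. Below is a minimal sketch of that behavior and of a boolean-mask alternative. The tiny DataFrame, its index values, and the predictions array are made up for illustration and stand in for notleftdf and the output of clftree.predict.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for notleftdf: the row labels start at 10, not 0,
# because a filtered DataFrame keeps the labels of the rows it came from.
notleft = pd.DataFrame(
    {"satisfaction_level": [0.9, 0.2, 0.8, 0.1], "left": [0, 0, 0, 0]},
    index=[10, 11, 12, 13],
)

# Hypothetical predictions, one per row, in the same order as the rows.
predictions = np.array([0, 1, 0, 1])

# Indexing a Series with an integer array looks the values up as row labels.
# Labels 0 and 1 do not exist in this index, so pandas raises KeyError.
try:
    notleft["left"][predictions]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A boolean mask selects rows by position instead of by label.
will_leave = notleft[predictions == 1]

print(lookup_failed)               # True
print(will_leave.index.tolist())   # [11, 13]
```

Storing the predictions in a new column, as the script above does, has the same effect as the mask, and it keeps each prediction next to the row it belongs to.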
Votes 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/58570355
