
Error when trying to retrieve column data for categorical_subset using pandas

Stack Overflow user
Asked 2020-07-14 19:22:50
1 answer · 202 views · 0 followers · 1 vote

I am trying to use scikit-learn together with pandas and NumPy. When I run my code I get the error shown below:

Traceback (most recent call last):
  File "C:/PycharmProjects/AISyiff/testingAi.py", line 129, in <module>
    categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
  File "C:\PycharmProjects\AISyiff\venv\lib\site-packages\pandas\core\indexes\base.py", line 5018, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['protocol'] not found in axis"

Please let me know where I went wrong and what I can do to fix it!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as preprocessing
from sklearn.metrics import accuracy_score
import matplotlib as mpl
mpl.use('TkAgg')
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

sns.set(style="white", context="talk")
mpl.rcParams['figure.dpi'] = 200
df = pd.read_csv("datasets_for_paper.csv", low_memory=False)

# firstPaint provides timing info about page rendering, as does rumSpeedIndex (avg page render)
print(df.dtypes)
df["nodeId"] = df["nodeId"].astype(int)
df["numObj"] = df["numObj"].astype(int)
df["rumSpeedIndex"] = df['rumSpeedIndex'].astype(int)
df["pageLoadTime"] = df['pageLoadTime'].astype(int)
df["firstPaint"] = df['firstPaint'].astype(int)


# convert from name into pure string
def changeProtName(value):
    if value == 'H1s':
        return str('Hs')
    else:
        return str('Hl')


df['protocol'] = df['protocol'].map(lambda x: changeProtName(x))

# encode these columns as categorical data
df['protocol'] = pd.Categorical(df["protocol"])
df['browser'] = pd.Categorical(df['browser'])
df['nodeType'] = pd.Categorical(df['nodeType'])
df['url'] = pd.Categorical(df['url'])


# list a bunch of details about each column
def summarize_data(df1):
    for column in df1.columns:
        print(column)
        if df1.dtypes[column] == object:
            print(df1[column].value_counts())
        else:
            print(df1[column].describe())

        print('\n')


summarize_data(df)


def hotEncodingCats(df1):
    results = df1.copy()
    encoders = {}
    for column in results.columns:
        encoders[column] = preprocessing.LabelEncoder()
        results[column] = encoders[column].fit_transform(results[column])
    return results, encoders


print(df.dtypes)

encoded_data, _ = hotEncodingCats(df)
sns.heatmap(encoded_data.corr(), square=True)


encoded_data.tail(5)

encoded_data, encoders = hotEncodingCats(df)
new_series = encoded_data["protocol"]

X_train, X_test, y_train, y_test = train_test_split(encoded_data[encoded_data.columns.drop("protocol")], new_series,
                                                    train_size=0.70)
scaler = preprocessing.StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)

cls = linear_model.LogisticRegression()

cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)

print(df.dtypes)
print(accuracy_score(y_test, y_pred))
print(df.dtypes)
print("cookieprint")

def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

print("cookie3")

def fit_and_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions and evaluate
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)

    # Return the performance metric
    return model_mae


print(fit_and_evaluate(cls))
print("cookie1")
random_forest = RandomForestRegressor(random_state=60)
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
print(X_train.columns)
print("cookie2")
coefs = coefs.sort_values()
plt.subplot(1, 1, 1)
plt.figure(figsize=(10,10))
coefs.plot(kind="bar", alpha=0.4)
plt.show()
print(coefs.sort_values(ascending=False))

features = df.copy()
numeric_subset = df.select_dtypes('number')
categorical_subset = df.select_dtypes('object')

categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
print(features.head())

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-07-15 12:50:12

I was able to reproduce your problem like this:

>>> df = pd.DataFrame()
>>> df['protocol'] = pd.Categorical(['A', 'B', 'C', 'D', 'A'])
>>> df.select_dtypes('object')
Empty DataFrame
Columns: []

You can see that the last line, which corresponds to

categorical_subset = df.select_dtypes('object')

apparently returns an empty DataFrame. (When in doubt, it is a good idea to check whether categorical_subset actually contains what you expect it to contain.)

This is because when you reassigned df['protocol'], which originally contained strings, to a pd.Categorical, its dtype (along with those of the other categorical columns) is no longer object but category:

>>> df.dtypes
protocol    category
dtype: object

(This output looks a bit confusing: it says the dtype of protocol is category, yet below it says dtype: object. The return value of DataFrame.dtypes is actually a Series whose index is the column names and whose values are the dtypes, so the deceptive dtype: object at the bottom refers to the dtype of that Series itself.)
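To make that concrete, here is a tiny sketch (toy dataframe invented for illustration) showing that DataFrame.dtypes is itself a Series of dtype object:

```python
import pandas as pd

# A toy frame with one categorical column.
df = pd.DataFrame({"protocol": pd.Categorical(["A", "B"])})

dtypes = df.dtypes
print(type(dtypes))        # a pandas Series, not a plain mapping
print(dtypes["protocol"])  # category  -> the dtype of the 'protocol' column
print(dtypes.dtype)        # object    -> the dtype of the dtypes Series itself
```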

This is probably what you actually wanted:

>>> df.select_dtypes('category')
  protocol
0        A
1        B
2        C
3        D
4        A

Indeed, the docs for select_dtypes say:

To select Pandas categorical dtypes, use 'category'
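Applied back to the code in the question, a minimal sketch of the fix (with a small invented dataframe standing in for the real CSV) might look like this:

```python
import pandas as pd

# Toy data mimicking the question's setup: 'protocol' and 'browser'
# have been converted to pd.Categorical, so their dtype is 'category'.
df = pd.DataFrame({"protocol": ["Hs", "Hl", "Hs"],
                   "browser": ["chrome", "firefox", "chrome"],
                   "pageLoadTime": [120, 340, 210]})
df["protocol"] = pd.Categorical(df["protocol"])
df["browser"] = pd.Categorical(df["browser"])

# Select the categorical columns by their actual dtype, 'category',
# then drop 'protocol' (the target) before one-hot encoding.
categorical_subset = df.select_dtypes("category")
categorical_subset = pd.get_dummies(
    categorical_subset[categorical_subset.columns.drop("protocol")])

numeric_subset = df.select_dtypes("number")
features = pd.concat([numeric_subset, categorical_subset], axis=1)
print(features.columns.tolist())
```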

The above is a good example of how to create a Minimal, Reproducible Example, and of how to debug small programs in general. We first narrowed things down to the problem area, the line

categorical_subset.columns.drop("protocol")

which evidently believes there should not be a column named 'protocol'. We then traced backwards through how categorical_subset was created (df.select_dtypes('object') called on the original dataframe, df). Beyond that, all we needed was a sample dataframe containing a few pd.Categorical columns.
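That debugging workflow can be sketched as a few quick checks (toy dataframe invented for illustration; plain strings default to object dtype here):

```python
import pandas as pd

df = pd.DataFrame({"protocol": pd.Categorical(["Hs", "Hl"]),
                   "url": ["a.com", "b.com"]})

# Step 1: inspect the dtypes -- 'protocol' is category, not object.
print(df.dtypes)

# Step 2: check what each selection actually returns before dropping columns.
print(df.select_dtypes("object").columns.tolist())    # no 'protocol' here
print(df.select_dtypes("category").columns.tolist())  # 'protocol' lives here

# Step 3: guard the drop so a missing label fails with a clear message.
subset = df.select_dtypes("category")
assert "protocol" in subset.columns, "expected 'protocol' in the categorical subset"
```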

1 vote
Page content originally provided by Stack Overflow; translation supported by Tencent Cloud.
Original link: https://stackoverflow.com/questions/62902565
