
Error when trying to retrieve column data for categorical_subset using pandas

Stack Overflow user
Asked 2020-07-14 19:22:50
1 answer · 202 views · 0 followers · 1 vote

I am trying to use scikit-learn together with pandas and NumPy. When I run my code I get the error shown below:

Traceback (most recent call last):
  File "C:/PycharmProjects/AISyiff/testingAi.py", line 129, in <module>
    categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
  File "C:\PycharmProjects\AISyiff\venv\lib\site-packages\pandas\core\indexes\base.py", line 5018, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['protocol'] not found in axis"

Please let me know where I went wrong and what I can do to fix it!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as preprocessing
from sklearn.metrics import accuracy_score
import matplotlib as mpl
mpl.use('TkAgg')
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

sns.set(style="white", context="talk")
mpl.rcParams['figure.dpi'] = 200
df = pd.read_csv("datasets_for_paper.csv", low_memory=False)

# firstPaint provides timing info about page rendering, as does rumSpeedIndex (avg page render)
print(df.dtypes)
df["nodeId"] = df["nodeId"].astype(int)
df["numObj"] = df["numObj"].astype(int)
df["rumSpeedIndex"] = df['rumSpeedIndex'].astype(int)
df["pageLoadTime"] = df['pageLoadTime'].astype(int)
df["firstPaint"] = df['firstPaint'].astype(int)


# convert from name into pure string
def changeProtName(value):
    if value == 'H1s':
        return str('Hs')
    else:
        return str('Hl')


df['protocol'] = df['protocol'].map(lambda x: changeProtName(x))

# encode these columns as categorical data
df['protocol'] = pd.Categorical(df["protocol"])
df['browser'] = pd.Categorical(df['browser'])
df['nodeType'] = pd.Categorical(df['nodeType'])
df['url'] = pd.Categorical(df['url'])


# list a bunch of details about each column
def summarize_data(df1):
    for column in df1.columns:
        print(column)
        if df1.dtypes[column] == object:
            print(df1[column].value_counts())
        else:
            print(df1[column].describe())

        print('\n')


summarize_data(df)


def hotEncodingCats(df1):
    results = df1.copy()
    encoders = {}
    for column in results.columns:
        encoders[column] = preprocessing.LabelEncoder()
        results[column] = encoders[column].fit_transform(results[column])
    return results, encoders


print(df.dtypes)

encoded_data, _ = hotEncodingCats(df)
sns.heatmap(encoded_data.corr(), square=True)


encoded_data.tail(5)

encoded_data, encoders = hotEncodingCats(df)
new_series = encoded_data["protocol"]

X_train, X_test, y_train, y_test = train_test_split(encoded_data[encoded_data.columns.drop("protocol")], new_series,
                                                    train_size=0.70)
scaler = preprocessing.StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)

cls = linear_model.LogisticRegression()

cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)

print(df.dtypes)
print(accuracy_score(y_test, y_pred))
print(df.dtypes)
print("cookieprint")

def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

print("cookie3")

def fit_and_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions and evaluate
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)

    # Return the performance metric
    return model_mae


print(fit_and_evaluate(cls))
print("cookie1")
random_forest = RandomForestRegressor(random_state=60)
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
print(X_train.columns)
print("cookie2")
coefs = coefs.sort_values()
plt.subplot(1, 1, 1)
plt.figure(figsize=(10,10))
coefs.plot(kind="bar", alpha=0.4)
plt.show()
print(coefs.sort_values(ascending=False))

features = df.copy()
numeric_subset = df.select_dtypes('number')
categorical_subset = df.select_dtypes('object')

categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
print(features.head())

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-07-15 12:50:12

I was able to reproduce your problem like this:

>>> df = pd.DataFrame()
>>> df['protocol'] = pd.Categorical(['A', 'B', 'C', 'D', 'A'])
>>> df.select_dtypes('object')
Empty DataFrame
Columns: []

You can see that the last line, which corresponds to

categorical_subset = df.select_dtypes('object')

apparently returns an empty DataFrame. (When in doubt, it is a good idea to check whether categorical_subset actually contains what you expect it to contain.)

This is because when you reassigned df['protocol'], which originally contained strings, to a pd.Categorical, its dtype (along with those of the other categorical columns) is no longer object but category:

>>> df.dtypes
protocol    category
dtype: object

(This output looks a bit confusing: it says the dtype of protocol is category, yet below it says dtype: object. The return value of DataFrame.dtypes is actually a Series whose index is the column names and whose values are the dtypes, so the deceptive dtype: object at the bottom refers to the dtype of that Series itself.)
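To make that concrete, here is a tiny sketch (toy dataframe invented for illustration) showing that DataFrame.dtypes is itself a Series of dtype object:

```python
import pandas as pd

# A toy frame with one categorical column.
df = pd.DataFrame({"protocol": pd.Categorical(["A", "B"])})

dtypes = df.dtypes
print(type(dtypes))        # a pandas Series, not a plain mapping
print(dtypes["protocol"])  # category  -> the dtype of the 'protocol' column
print(dtypes.dtype)        # object    -> the dtype of the dtypes Series itself
```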

This is probably what you actually wanted:

>>> df.select_dtypes('category')
  protocol
0        A
1        B
2        C
3        D
4        A

Indeed, the docs for select_dtypes say:

To select Pandas categorical dtypes, use 'category'
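Applied back to the code in the question, a minimal sketch of the fix (with a small invented dataframe standing in for the real CSV) might look like this:

```python
import pandas as pd

# Toy data mimicking the question's setup: 'protocol' and 'browser'
# have been converted to pd.Categorical, so their dtype is 'category'.
df = pd.DataFrame({"protocol": ["Hs", "Hl", "Hs"],
                   "browser": ["chrome", "firefox", "chrome"],
                   "pageLoadTime": [120, 340, 210]})
df["protocol"] = pd.Categorical(df["protocol"])
df["browser"] = pd.Categorical(df["browser"])

# Select the categorical columns by their actual dtype, 'category',
# then drop 'protocol' (the target) before one-hot encoding.
categorical_subset = df.select_dtypes("category")
categorical_subset = pd.get_dummies(
    categorical_subset[categorical_subset.columns.drop("protocol")])

numeric_subset = df.select_dtypes("number")
features = pd.concat([numeric_subset, categorical_subset], axis=1)
print(features.columns.tolist())
```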

The above is a good example of how to create a Minimal, Reproducible Example, and of how to debug small programs in general. We first narrowed things down to the problem area, the line

categorical_subset.columns.drop("protocol")

which evidently believes there should not be a column named 'protocol'. We then traced backwards through how categorical_subset was created (df.select_dtypes('object') called on the original dataframe, df). Beyond that, all we needed was a sample dataframe containing a few pd.Categorical columns.
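That debugging workflow can be sketched as a few quick checks (toy dataframe invented for illustration; plain strings default to object dtype here):

```python
import pandas as pd

df = pd.DataFrame({"protocol": pd.Categorical(["Hs", "Hl"]),
                   "url": ["a.com", "b.com"]})

# Step 1: inspect the dtypes -- 'protocol' is category, not object.
print(df.dtypes)

# Step 2: check what each selection actually returns before dropping columns.
print(df.select_dtypes("object").columns.tolist())    # no 'protocol' here
print(df.select_dtypes("category").columns.tolist())  # 'protocol' lives here

# Step 3: guard the drop so a missing label fails with a clear message.
subset = df.select_dtypes("category")
assert "protocol" in subset.columns, "expected 'protocol' in the categorical subset"
```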

1 vote
Page content originally provided by Stack Overflow; translation supported by Tencent Cloud.
Original link: https://stackoverflow.com/questions/62902565
