I'm trying to use scikit-learn together with pandas and NumPy. When I run my code (shown below), I get the following error:
Traceback (most recent call last):
  File "C:/PycharmProjects/AISyiff/testingAi.py", line 129, in <module>
    categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
  File "C:\PycharmProjects\AISyiff\venv\lib\site-packages\pandas\core\indexes\base.py", line 5018, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['protocol'] not found in axis"
Please let me know where I went wrong and what I can do to fix it!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as preprocessing
from sklearn.metrics import accuracy_score
import matplotlib as mpl
mpl.use('TkAgg')
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
sns.set(style="white", context="talk")
mpl.rcParams['figure.dpi'] = 200
df = pd.read_csv("datasets_for_paper.csv", low_memory=False)
# firstPaint gives timing info about page rendering; so does rumSpeedIndex (average page render)
print(df.dtypes)
df["nodeId"] = df["nodeId"].astype(int)
df["numObj"] = df["numObj"].astype(int)
df["rumSpeedIndex"] = df['rumSpeedIndex'].astype(int)
df["pageLoadTime"] = df['pageLoadTime'].astype(int)
df["firstPaint"] = df['firstPaint'].astype(int)
# convert from name into pure string
def changeProtName(value):
    if value == 'H1s':
        return str('Hs')
    else:
        return str('Hl')
df['protocol'] = df['protocol'].map(lambda x: changeProtName(x))
# encode categories as categorical data
df['protocol'] = pd.Categorical(df["protocol"])
df['browser'] = pd.Categorical(df['browser'])
df['nodeType'] = pd.Categorical(df['nodeType'])
df['url'] = pd.Categorical(df['url'])
# list a bunch of details about categorical data
def summerize_data(df1):
    for column in df1.columns:
        print(column)
        if df.dtypes[column] == np.object:
            print(df1[column].value_counts())
        else:
            print(df1[column].describe())
        print('\n')
summerize_data(df)
def hotEncodingCats(df1):
    results = df1.copy()
    encoders = {}
    for column in results.columns:
        encoders[column] = preprocessing.LabelEncoder()
        results[column] = encoders[column].fit_transform(results[column])
    return results, encoders
print(df.dtypes)
encoded_data, _ = hotEncodingCats(df)
sns.heatmap(encoded_data.corr(), square=True)
encoded_data.tail(5)
encoded_data, encoders = hotEncodingCats(df)
new_series = encoded_data["protocol"]
X_train, X_test, y_train, y_test = train_test_split(encoded_data[encoded_data.columns.drop("protocol")], new_series,
                                                    train_size=0.70)
scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)
cls = linear_model.LogisticRegression()
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
print(df.dtypes)
print(accuracy_score(y_test, y_pred))
print(df.dtypes)
print("cookieprint")
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
print("cookie3")
def fit_and_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions and evaluate
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)
    # Return the performance metric
    return model_mae
print(fit_and_evaluate(cls))
print("cookie1")
random_forest = RandomForestRegressor(random_state=60)
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
print(X_train.columns)
print("cookie2")
coefs = coefs.sort_values()
plt.subplot(1, 1, 1)
plt.figure(figsize=(10,10))
coefs.plot(kind="bar", alpha=0.4)
plt.show()
print(coefs.sort_values(ascending=False))
features = df.copy()
numeric_subset = df.select_dtypes('number')
categorical_subset = df.select_dtypes('object')
categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
features = pd.concat([numeric_subset, categorical_subset], axis = 1)
print(features.head())
Posted on 2020-07-15 12:50:12
I can reproduce your problem like this:
>>> df = pd.DataFrame()
>>> df['protocol'] = pd.Categorical(['A', 'B', 'C', 'D', 'A'])
>>> df.select_dtypes('object')
Empty DataFrame
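Columns: []
You can see that the last line, which corresponds to
categorical_subset = df.select_dtypes('object')
is probably returning an empty DataFrame (when in doubt, it is a good idea to check that categorical_subset contains what you expect it to contain).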
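This happens because df['protocol'] originally contained strings, and once you reassigned it to a pd.Categorical its dtype (and that of the other categorical columns) was no longer object but category: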
>>> df.dtypes
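protocol    category
dtype: object
(This output looks a bit confusing: it says the dtype of protocol is category, but underneath it says dtype: object. The return value of DataFrame.dtypes is actually a Series that maps column names to their dtypes, so the deceptive dtype: object at the bottom refers to the dtype of that Series itself.)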
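To make that distinction concrete, here is a minimal sketch, using the same kind of toy frame as above, that looks at the dtype of the dtypes Series separately from the dtype of the column it describes:

```python
import pandas as pd

df = pd.DataFrame({"protocol": pd.Categorical(["A", "B", "C", "D", "A"])})

dtypes = df.dtypes                  # a Series: index = column names, values = dtypes
series_dtype = dtypes.dtype         # dtype of that Series itself
column_dtype = dtypes["protocol"]   # dtype of the 'protocol' column

print(type(dtypes).__name__)  # Series
print(series_dtype)           # object
print(column_dtype)           # category
```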
This is probably what you actually want:
>>> df.select_dtypes('category')
  protocol
0        A
1        B
2        C
3        D
4        A
In fact, the docs for select_dtypes say so:
"To select Pandas categorical dtypes, use 'category'"
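Applied to the tail end of the code in the question, the fix might look like the sketch below. The small frame is made-up stand-in data, since the real datasets_for_paper.csv is not available here; only the column names are taken from the question.

```python
import pandas as pd

# Made-up stand-in for the question's data; only the column names come from the question.
df = pd.DataFrame({
    "protocol": pd.Categorical(["Hs", "Hl", "Hs", "Hl"]),
    "browser": pd.Categorical(["chrome", "firefox", "chrome", "chrome"]),
    "pageLoadTime": [120, 340, 95, 410],
})

numeric_subset = df.select_dtypes("number")
# Ask for 'category' rather than 'object', because the columns were
# converted with pd.Categorical earlier in the script.
categorical_subset = df.select_dtypes("category")

# 'protocol' is now present in the subset, so dropping it no longer raises KeyError.
categorical_subset = pd.get_dummies(categorical_subset.drop(columns="protocol"))

features = pd.concat([numeric_subset, categorical_subset], axis=1)
print(features.columns.tolist())
```

If some columns might still hold plain strings, select_dtypes(include=["object", "category"]) selects both kinds at once.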
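The above is a good example of how to create a Minimal, Reproducible Example and, more generally, of how to debug a small program. We first zeroed in on the problem area, this expression:
categorical_subset.columns.drop("protocol")
Evidently it believes there is no column named 'protocol'. We then traced backwards to see how categorical_subset was created (by calling df.select_dtypes('object') on the original data). Beyond that, all we needed was an example DataFrame containing a pd.Categorical column.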
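The "check that it contains what you expect" step can be sketched as a small guard. The helper name drop_existing is hypothetical (it is not part of pandas), and the frame is the same toy example used in the reproduction:

```python
import pandas as pd

def drop_existing(frame, column):
    """Drop `column`, failing with a descriptive message if it is absent."""
    if column not in frame.columns:
        raise KeyError(
            f"{column!r} not in columns {list(frame.columns)}; "
            f"check the dtypes used to build this frame"
        )
    return frame.drop(columns=column)

df = pd.DataFrame({"protocol": pd.Categorical(["A", "B", "C", "D", "A"])})

object_subset = df.select_dtypes("object")
print(object_subset.empty)  # True: no object-dtype columns are left

try:
    drop_existing(object_subset, "protocol")
    failed = False
except KeyError as exc:
    failed = True
    print(exc)
```

Printing the intermediate frame (or just its columns) right before the failing line would surface the same information without a helper.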
https://stackoverflow.com/questions/62902565