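I am using the sklearn-pandas DataFrameMapper inside an sklearn pipeline. To evaluate how much each feature contributes in a feature-union pipeline, I would like to measure the coefficients of the estimator (logistic regression). In the code example below, the three text columns a, b and c are vectorized and selected for X_train: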
import pandas as pd
import numpy as np
import pickle
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
np.random.seed(1)
data = pd.read_csv('https://pastebin.com/raw/WZHwqLWr')
#data.columns
X = data.copy()
y = data.result
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
mapper = DataFrameMapper([
    ('a', CountVectorizer()),
    ('b', CountVectorizer()),
    ('c', CountVectorizer())
])
pipeline = Pipeline([
    ('featurize', mapper),
    ('clf', LogisticRegression(random_state=1))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(abs(pipeline.named_steps['clf'].coef_))
#array([[0.3567311 , 0.3567311 , 0.46215153, 0.10542043, 0.3567311 ,
# 0.46215153, 0.46215153, 0.3567311 , 0.3567311 , 0.3567311 ,
# 0.3567311 , 0.46215153, 0.46215153, 0.3567311 , 0.46215153,
# 0.3567311 , 0.3567311 , 0.3567311 , 0.3567311 , 0.46215153,
# 0.46215153, 0.46215153, 0.3567311 , 0.3567311 ]])
print(len(pipeline.named_steps['clf'].coef_[0]))
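#24
Unlike a normal analysis with several features, where the coefficient array has one entry per feature, the pipeline with the DataFrameMapper returns a larger coefficient matrix.
a) How should the 24 coefficients in total be interpreted?
b) What is the best way to access the coef_ value for each feature ("a", "b", "c")?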
Desired output:
a: coef_score (float)
b: coef_score (float)
c: coef_score (float)
Thanks!
Answered 2019-01-30 08:26:22
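Although your initial data frame does contain only three feature columns, a, b and c, the sklearn-pandas DataFrameMapper() class applies sklearn's CountVectorizer() to the word corpus of each of those columns. This creates 24 features in total, which are then passed to your LogisticRegression() classifier. That is why you get an unlabeled array of 24 values when you access the classifier's .coef_ attribute.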
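To see where the 24 comes from, it can help to fit a CountVectorizer on each column separately and count the vocabulary it learns. The sketch below is only an illustration and does not use the original data: the first row of each toy column is taken from the data frame in the question, while the second row is a made-up sentence chosen so that the vocabularies match the feature names listed further down. The names `toy` and `widths` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy rows standing in for the three text columns.
# First sentence per column comes from the question; the second is invented.
toy = {
    "a": ["here we go", "another example is here"],
    "b": ["hello here we are", "every column has text content"],
    "c": ["this is a test", "how can feature union deal with this"],
}

widths = {}
for col, texts in toy.items():
    vec = CountVectorizer().fit(texts)
    # One output column is created per distinct token in the vocabulary.
    widths[col] = len(vec.vocabulary_)
    print(col, sorted(vec.vocabulary_))

# 6 + 9 + 9 = 24 columns reach the classifier, hence 24 coefficients.
print(widths, sum(widths.values()))
```

Note that the default tokenizer lowercases the text and drops single-character tokens, which is why the "a" in "this is a test" does not become a feature.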
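However, matching each of these 24 coef_ scores to the original column it came from (a, b or c), and then computing the mean coefficient score per column, is fairly simple. Here is what we will do:
The original data frame looks like this:
    a           b                  c               result
2   here we go  hello here we are  this is a test  0
73  here we go  hello here we are  this is a test  0
...
If we run the following line, we can see the list of all 24 features created by the DataFrameMapper/CountVectorizer() used in the mapper object: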
pipeline.named_steps['featurize'].transformed_names_
['a_another',
'a_example',
'a_go',
'a_here',
'a_is',
'a_we',
'b_are',
'b_column',
'b_content',
'b_every',
'b_has',
'b_hello',
'b_here',
'b_text',
'b_we',
'c_can',
'c_deal',
'c_feature',
'c_how',
'c_is',
'c_test',
'c_this',
'c_union',
'c_with']
len(pipeline.named_steps['featurize'].transformed_names_)
24
Now, here is how we compute the average coef score for the three groups of features that come from columns a, b and c:
col_names = list(data.drop(['result'], axis=1).columns.values)
vect_feats = pipeline.named_steps['featurize'].transformed_names_
clf_coef_scores = abs(pipeline.named_steps['clf'].coef_)
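def get_avg_coef_scores(col_names, vect_feats, clf_coef_scores):
    scores = {}
    start_pos = 0
    for n in col_names:
        # count the vectorized features of column n (named '<column>_<token>')
        num_vect_feats = len([i for i in vect_feats if i.startswith(n + '_')])
        end_pos = start_pos + num_vect_feats
        # the features of one column are contiguous, so a slice selects them
        scores[n + '_avg_coef_score'] = np.mean(clf_coef_scores[0][start_pos:end_pos])
        start_pos = end_pos
    return scores
If we call the function we just wrote, we get the following output: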
get_avg_coef_scores(col_names, vect_feats, clf_coef_scores)
{'a_avg_coef_score': 0.3499861323284858,
'b_avg_coef_score': 0.40358462487685853,
'c_avg_coef_score': 0.3918712435073411}
If we want to verify which of the 24 coef_ values belongs to each created feature, we can use the following dictionary comprehension:
{key:clf_coef_scores[0][i] for i, key in enumerate(vect_feats)}
{'a_another': 0.3567310993987888,
'a_example': 0.3567310993987888,
'a_go': 0.4621515317244458,
'a_here': 0.10542043232565701,
'a_is': 0.3567310993987888,
'a_we': 0.4621515317244458,
'b_are': 0.4621515317244458,
'b_column': 0.3567310993987888,
'b_content': 0.3567310993987888,
'b_every': 0.3567310993987888,
'b_has': 0.3567310993987888,
'b_hello': 0.4621515317244458,
'b_here': 0.4621515317244458,
'b_text': 0.3567310993987888,
'b_we': 0.4621515317244458,
'c_can': 0.3567310993987888,
'c_deal': 0.3567310993987888,
'c_feature': 0.3567310993987888,
'c_how': 0.3567310993987888,
'c_is': 0.4621515317244458,
'c_test': 0.4621515317244458,
'c_this': 0.4621515317244458,
'c_union': 0.3567310993987888,
'c_with': 0.3567310993987888}
Answered 2019-01-30 07:57:58
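After recovering the fitted DataFrameMapper from the Pipeline, you can access its contents through the .features attribute. This lets you iterate over the CountVectorizer objects that turn the strings into token-count columns. Each CountVectorizer has a .vocabulary_ attribute that tells you exactly which token each column represents.
So you can take each CountVectorizer in the DataFrameMapper in order and, for each one, extract in order the tokens that stand for the columns of the input matrix. This gives you a sequence that labels your coefficients exactly.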
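Sorting the vocabulary works because CountVectorizer, when it learns its own vocabulary, assigns column indices in alphabetical token order. A minimal sketch to check this, using CountVectorizer on its own with a hypothetical two-sentence corpus (`docs`, `by_index` and `by_name` are names made up for this illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, only for illustration.
docs = ["this is a test", "how can feature union deal with this"]
vec = CountVectorizer().fit(docs)

# vocabulary_ maps each token to its column index in the output matrix.
by_index = sorted(vec.vocabulary_, key=vec.vocabulary_.get)  # tokens in column order
by_name = sorted(vec.vocabulary_)                            # tokens in alphabetical order

print(by_index)
print(by_index == by_name)
```

If a fixed vocabulary were passed to the vectorizer instead, the two orders could differ, and sorting by index (as in `by_index`) would be the safer choice.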
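Based on your example, this snippet should do what you need, which is also what I described in detail above (if you run into any errors, let me know and I will correct it based on your feedback):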
# recover the fitted mapper
fitted_mapper = pipeline.named_steps['featurize']
mapped_labels = list()
# iterate through the CountVectorizers
for label, fun in fitted_mapper.features:
    # Iterate through the sorted vocabulary
    for level, _ in sorted(fun.vocabulary_.items()):
        mapped_labels.append(label+'_'+level)
# the ordered sequence of vectorized strings
print(mapped_labels)
# pick up the coefficients
coefs = pipeline.named_steps['clf'].coef_[0]
# pair mapped labels and coefs and print them
for label, coef in zip(mapped_labels, coefs):
    print("%s:%0.5f" % (label, coef))
https://stackoverflow.com/questions/54388370