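I want to run a regression model on a dataset with one text column, five binary variables, and a numeric target variable. I vectorize the text column with a CountVectorizer and try to combine everything into a sklearn Pipeline using make_column_transformer. The data has no missing values. However, when I run the script below, I get the following warning:
FitFailedWarning: Estimator fit failed. The score on this train-test
partition for these parameters will be set to nan.
and the following error message:
TypeError: All estimators should implement fit and transform, or can be
'drop' or 'passthrough' specifiers. 'Level1' (type <class 'str'>) doesn't.
I assume the problem is that I did not specify a second tuple in make_column_transformer and instead passed only sample_df[categorical_cols], but I am not sure how to include the already-processed, ready-to-use columns in make_column_transformer.
Full code: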
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
    ('transformer', make_column_transformer((CountVectorizer(), textual_col),
                                            sample_df[categorical_cols],
                                            remainder='passthrough')),
    ('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores
Sample dataset:
import io
import pandas as pd
data_string = """
Level1;Level2;Level3;Level4;Level5;Text;Value
0;0;1;0;0;Are you sure that the input;109.3
0;0;0;0;0;that the input text data for;87.2
0;0;1;0;0;text data for your model is;21.5
0;0;0;0;0;your model is in English? Well,;143.5
0;0;0;0;1;in English? Well, no one can;141.1
0;0;0;0;0;no one can be sure about;93.4
0;0;0;0;0;be sure about this, as no;29.5
0;0;0;0;0;this, as no one will read;17.9
0;0;1;0;0;one will read around 20k records;37.8
0;0;1;0;0;around 20k records of text data.;153.7
0;0;0;0;0;of text data. So, how non-English;99.5
0;0;0;1;0;So, how non-English text will affect;119.1
0;0;0;0;1;text will affect your English text;97.5
0;0;0;0;0;your English text trained model? Pick;49.2
0;0;0;0;0;trained model? Pick any non-English text;79.3
0;0;0;0;0;any non-English text and pass it;107.7
0;1;0;0;1;and pass it through as input;117.3
0;0;0;0;0;through as input to your English;151.1
0;0;0;0;0;to your English text trained classification;47.3
0;0;0;0;0;text trained classification model. You will;129.3
0;0;0;0;0;model. You will come to know;135.1
0;0;0;0;0;come to know that the category;145.8
0;0;0;0;1;that the category is assigned to;131.9
1;0;0;1;0;is assigned to non-English text by;43.7
1;0;0;0;0;non-English text by the model. If;67.1
1;0;0;0;0;the model. If your model is;105.3
0;0;0;1;0;your model is dependent on one;65.2
0;1;0;0;0;dependent on one language then, other;98.3
0;0;0;0;0;language then, other languages in your;130.5
0;0;0;0;0;languages in your textual data should;107.2
0;1;1;0;0;textual data should be considered as;66.5
0;0;0;1;0;be considered as noise. But why?;43.1
0;0;0;0;1;noise. But why? The job of;56.7
0;0;0;0;0;The job of the text classification;75.1
1;0;0;0;0;the text classification model is to;88.3
1;0;0;0;0;model is to classify. And, it;91.3
0;0;0;0;0;classify. And, it will do its;106.4
1;0;0;0;0;will do its job despite its;109.5
0;0;0;0;1;job despite its input text will;143.1
0;0;0;0;0;input text will be in English;54.1
1;0;0;0;0;be in English or not. What;96.4
0;0;0;1;0;or not. What can we do;133.8
0;0;0;0;0;can we do to avoid such;146.4
0;0;1;0;0;to avoid such a situation? Your;164.3
0;0;1;0;0;a situation? Your model will not;34.6
0;0;0;0;0;model will not stop classifying the;76.8
0;0;0;1;0;stop classifying the non-English text. So,;80.5
0;0;1;0;0;non-English text. So, you have to;90.3
0;0;0;0;0;you have to detect the non-English;68.3
0;0;0;0;0;detect the non-English text and remove;44.0
0;0;1;0;0;text and remove it from trained;100.4
0;0;0;0;0;it from trained data and prediction;117.4
0;0;0;0;1;data and prediction data. This process;85.4
0;1;0;0;0;data. This process comes under the;65.7
0;0;1;0;0;comes under the data cleaning part.;54.3
0;1;0;0;0;data cleaning part. Inconsistency in your;78.9
0;0;0;0;0;Inconsistency in your data will result;96.8
1;0;0;0;1;data will result in a decrease;108.1
0;0;0;0;0;in a decrease in the accuracy;145.7
1;0;0;0;0;in the accuracy of the model.;103.6
0;0;1;0;0;of the model. Sometimes, multiple languages;56.4
0;0;0;0;1;Sometimes, multiple languages present in text;90.5
0;0;0;0;0;present in text data could be;80.4
0;0;0;0;0;data could be one of the;90.7
1;0;0;0;0;one of the reasons your model;48.8
0;0;0;0;0;reasons your model behaves strangely. So,;65.4
0;0;1;0;0;behaves strangely. So, in this article,;107.5
0;0;0;0;0;in this article, we will discuss;143.2
0;0;0;0;0;we will discuss the different python;165.0
0;0;0;0;0;the different python libraries which detect;123.3
0;0;0;0;1;libraries which detect the language(s) of;85.3
0;0;0;0;0;the language(s) of the text data.;91.4
0;0;0;0;1;the text data. Let’s start with;49.5
0;0;0;0;0;Let’s start with the spaCy library.;76.3
0;0;0;0;0;the spaCy library.;49.5
"""
sample_df = pd.read_csv(io.StringIO(data_string), sep=';')
Posted on 2021-12-25 22:38:13
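You can use remainder='passthrough' to avoid transforming columns that are already processed. In your example, you can simply treat the binary columns as remainder columns: the ColumnTransformer will not transform them, but it will pass them through to the output. Passing sample_df[categorical_cols] as a positional argument does not work, because every positional argument of make_column_transformer must be a (transformer, columns) tuple. You should also be aware that CountVectorizer expects a 1D array as input, so the column passed to make_column_transformer must be specified as a string ('Text') rather than as a list (['Text']). See the following quote from the documentation:
columns : str, array-like of str, int, array-like of int, slice, array-like of bool or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.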
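To make the scalar-versus-list distinction from the quoted documentation concrete, here is a minimal sketch. The frame toy_df and its three rows are made up for illustration and are not the question's data. It first shows what CountVectorizer receives in each case, then builds the column transformer two ways: with remainder='passthrough' as in this answer, and with an explicit ('passthrough', columns) tuple, which is the kind of second tuple the question was asking about.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: one text column and one already-encoded binary column.
toy_df = pd.DataFrame({
    "Text": ["red apple", "green apple", "red pear"],
    "Level1": [0, 1, 0],
})

# A scalar column name selects a 1-D Series: three documents, four distinct words.
one_d = CountVectorizer().fit_transform(toy_df["Text"])
print(one_d.shape)  # (3, 4)

# A list of names selects a 2-D DataFrame. Iterating over a DataFrame yields its
# column labels, so the vectorizer only sees the single "document" "Text".
two_d = CountVectorizer().fit_transform(toy_df[["Text"]])
print(two_d.shape)  # (1, 1)

# Variant 1: untouched columns are forwarded via remainder='passthrough'.
remainder_ct = make_column_transformer(
    (CountVectorizer(), "Text"),
    remainder="passthrough",
)
remainder_out = remainder_ct.fit_transform(toy_df)
print(remainder_out.shape)  # (3, 5): four word counts plus Level1

# Variant 2: name the untouched columns explicitly in a second tuple.
explicit_ct = make_column_transformer(
    (CountVectorizer(), "Text"),
    ("passthrough", ["Level1"]),
)
explicit_out = explicit_ct.fit_transform(toy_df)
print(explicit_out.shape)  # (3, 5)
```

Both variants should produce the same feature matrix here. The explicit tuple may be preferable when the frame contains other columns that must not reach the model, since the default remainder drops them.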
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
    ('transformer', make_column_transformer((CountVectorizer(), 'Text'),
                                            remainder='passthrough')),
    ('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores
https://stackoverflow.com/questions/70482236