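I built a model that predicts the type of a website from its text.
It does not seem to work, though. I stored the model, the vectorizer, and the label encoder in pickle files and load them here.
Code: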
import pandas as pd
import sklearn.metrics as sm
import nltk
import string
from nltk.tokenize import word_tokenize
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import pickle
import os
def clean_text(text):
    #### cleaning the text
    ###1. Convert the text to lower case
    text = text.lower()
    ###2. Tokenize the sentences into words
    text_list = word_tokenize(text)
    ###3. Remove the special characters
    special_char_non_text = [re.sub(f'[{string.punctuation}]+', '', i) for i in text_list]
    ###4. Remove stopwords
    non_stopwords_text = [i for i in special_char_non_text if i not in stopwords.words('english')]
    ###5. Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(i) for i in non_stopwords_text]
    cleaned_text = ' '.join(lemmatized_words)
    return cleaned_text
text_input= input('Please enter the text: ')
cleaned_text= clean_text(text_input)
temp_df= pd.DataFrame({'input_text':[cleaned_text.strip()]})
vectorizer_filepath= 'tf_idf_vectorizer.pkl'
tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
temp_df_1= tf_idf_vectorizer.transform(temp_df)
input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
### load the model
model_path='multinomial_clf.pkl'
model_clf= pickle.load(open(model_path,'rb'))
y_pred= model_clf.predict(input_df)
#print(y_pred)
### load the label encoder
label_encoder_file= 'label_encoder.pkl'
label_encoder= pickle.load(open(label_encoder_file,'rb'))
label_class= label_encoder.inverse_transform(y_pred.ravel())
print(f'{label_class} is the predicted class')
I got an error:
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
65 try:
---> 66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in <listcomp>(.0)
65 try:
---> 66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
KeyError: 'website booking flight bus ticket'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-21-b92cbf8dfe74> in <module>
5 vectorizer_filepath= 'tf_idf_vectorizer.pkl'
6 tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
----> 7 temp_df_1= tf_idf_vectorizer.transform(temp_df)
8 input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
9
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
275 return np.array([])
276
--> 277 _, y = _encode(y, uniques=self.classes_, encode=True)
278 return y
279
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
111 if values.dtype == object:
112 try:
--> 113 res = _encode_python(values, uniques, encode)
114 except TypeError:
115 types = sorted(t.__qualname__
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
---> 68 raise ValueError("y contains previously unseen labels: %s"
69 % str(e))
70 return uniques, encoded
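ValueError: y contains previously unseen labels: 'website booking flight bus ticket'
The input text I entered was: This is the website for booking flight, bus tickets.
I don't know why this happens.
Can someone help me solve this problem?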
Posted on 2022-04-13 10:55:40
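It's hard to say exactly without your data and trained model, but I noticed a few things:
- In ###3, empty strings can be left behind (when a token consists only of punctuation), and nothing removes them afterwards.
- You call strip() on the whole text, but that only removes leading and trailing whitespace, not double or longer runs of spaces inside the text. You can see this in the error message as well.
- You pass a DataFrame to tf_idf_vectorizer.transform(), but it expects an iterable of documents. Iterating over a whole DataFrame like this iterates over the columns, not the rows. Try tf_idf_vectorizer.transform(temp_df['input_text']).
- You use transform() rather than fit_transform(), so the entire vocabulary must already be known to the model. Is that the case?
- It seems to look up a vector for 'website booking flight bus ticket' and fails. You should either let TfidfVectorizer do the preprocessing, or use its preprocessor parameter correctly and pass it (a modified version of) your cleaning method. See this thread: How to pass a preprocessor to TfidfVectorizer? - sklearn python.

https://stackoverflow.com/questions/71855308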
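
The first two points can be sketched roughly as follows. This is a minimal illustration, not the asker's exact pipeline: it uses only the standard library, the regex tokenizer stands in for word_tokenize, and the small STOPWORDS set is a hypothetical stand-in for stopwords.words('english'). Lemmatization is left out.

```python
import re
import string

# Hypothetical stand-in for nltk's stopwords.words('english'); tiny subset for illustration.
STOPWORDS = {"this", "is", "the", "for", "of", "a", "an", "and"}

# Escape the punctuation so characters such as ']' and '\' are safe inside the class.
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]+")


def clean_tokens(tokens):
    """Lower-case and strip punctuation, then drop empty strings and stopwords."""
    stripped = (PUNCT_RE.sub("", tok.lower()) for tok in tokens)
    # `if tok` removes the empty strings left by punctuation-only tokens.
    return [tok for tok in stripped if tok and tok not in STOPWORDS]


def clean_text(text):
    # Simplified tokenizer (assumption): runs of word characters or runs of
    # punctuation, so ',' and '.' become separate tokens as with word_tokenize.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # With no empty tokens left, joining cannot produce double spaces.
    return " ".join(clean_tokens(tokens))


if __name__ == "__main__":
    print(clean_text("This is the website for booking of flight, bus tickets."))
```

Because the empty tokens are filtered before the join, no strip() call is needed afterwards. Separately, the traceback shows the transform() call landing in sklearn/preprocessing/_label.py, so it may be worth checking type(tf_idf_vectorizer) to confirm that tf_idf_vectorizer.pkl really holds a TfidfVectorizer and not the label encoder.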