I am trying to classify with Logistic Regression. B is my target.
### 1-Import Libraries:
import numpy as np
import pandas as pd
### 2- Create data set:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"],
"B":["Yes","No","Yes","No","No"] }
df=pd.DataFrame(raw_data)
df
A B
0 This is yellow Yes
1 That is green No
2 These are orange Yes
3 This is a pen No
4 This is an Orange No
### 3- Count the words and characters
df['word_count'] = df['A'].agg(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].agg(lambda x:len(x))
df
A B word_count char_count
0 This is yellow Yes 3 14
1 That is green No 3 13
2 These are orange Yes 3 16
3 This is a pen No 4 13
4 This is an Orange No 4 17
### 4- Count the most repeated words in column "A"
# Split on spaces, count every word, and name the two columns "A" and "Count"
df_word_count=df.A.str.split(' ').explode().value_counts().rename_axis("A").reset_index(name="Count")
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)
A Count
0 is 4
1 This 3
2 yellow 1
3 These 1
4 orange 1
5 green 1
6 That 1
7 are 1
8 a 1
9 pen 1
10 Orange 1
11 an 1
### 5- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape
### 6- Create a data frame
for i in range(len(list_word_count)):
    allfeatures[:,i]=df['A'].agg(lambda x:x.split().count(list_word_count[i]))
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)
A B word_count char_count 0 1 2 3 4 5 6 7 8 9 10 11
0 This is yellow Yes 3 14 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 That is green No 3 13 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
2 These are orange Yes 3 16 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 This is a pen No 4 13 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
4 This is an Orange No 4 17 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
### 7- Change the column names from a list
#This creates a list of the words
l = list(df_word_count["A"])
l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"B")
l.insert(0,"A")
# Finally, I rename all the columns with the names that I have in the list l
Complete_data.columns = l
### 8- Define X and Y
x=Complete_data.drop(["A","B"],axis=1) # Features
y=Complete_data["B"] # Target
### 9- Encoding of Target
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
### 10- Train|Test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
### 11- Import Sklearn needed packages
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
### 12- Prediction and Regression with Cross-Validation
LogReg=LogisticRegression()
LogReg.fit(x_train,y_train)
cv_LogReg=cross_val_score(LogReg,x_train,y_train,cv=2)
cv_LogReg_pred=cross_val_predict(LogReg,x_train,y_train,cv=2)
print("Score: ",r2_score(y_train,cv_LogReg_pred))
Although I used LabelEncoder, the algorithm cannot find any classes (0, 1):
ValueError Traceback (most recent call last)
<ipython-input-127-2d7e54ebfd6c> in <module>
4 #LogReg_pred=LogReg.predict(x_test)
5 cv_LogReg=cross_val_score(LogReg,x_train,y_train,cv=2)
----> 6 cv_LogReg_pred=cross_val_predict(LogReg,x_train,y_train,cv=2)
7
8 print("Score: ",r2_score(y_train,cv_LogReg_pred))
.
.
.
This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
I don't know what I did wrong.
Posted on 2021-02-09 12:45:47
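Because you are running cross-validation on such a small sample, the data gets split in such a way that one subset contains only a single class, which is why you get that error. With more data you should not see it. I ran a simple logistic regression on these 5 records and was able to fit a model, so could you increase your data and check again? I increased the data like this: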
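2- Create data set:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange","This is an Orange","This is a green one","This is a bird","This is a bird","This is a pen","This is an Orange","This is a green one","This is a bird","This is a pen","This is an orange","This is a bird","This is an orangutan"],
"B":["Yes","No","Yes","No","No","Yes","No","Yes","No","No"] }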
One more thing: in the last line I changed r2_score to the accuracy score.
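A minimal sketch of that last change, scoring the cross-validated predictions with `accuracy_score` instead of `r2_score`. The data here is an assumption: the five rows from the question repeated twice as a stand-in for a larger data set, not the data used in this answer. `CountVectorizer` with default settings is also an assumed shortcut for the hand-built word-count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import LabelEncoder

# Assumption: the question's five rows repeated twice, standing in for "more data".
sentences = ["This is yellow", "That is green", "These are orange",
             "This is a pen", "This is an Orange"] * 2
labels = ["Yes", "No", "Yes", "No", "No"] * 2

# Default CountVectorizer lowercases the text and ignores one-letter tokens.
X = CountVectorizer().fit_transform(sentences)
y = LabelEncoder().fit_transform(labels)  # "No" -> 0, "Yes" -> 1

# Each row is predicted by a model that was trained on the other fold.
pred = cross_val_predict(LogisticRegression(), X, y, cv=2)

# Accuracy is the fraction of correctly predicted labels, which suits a
# classifier; r2_score is meant for regression targets.
acc = accuracy_score(y, pred)
print("Accuracy:", acc)
```

With ten rows, each training fold holds five rows and contains both classes, so the solver no longer sees a single-class subset.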
Posted on 2021-02-09 11:04:15
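This seems to happen because your y_train data set (or one of its folds) contains only one class; in this case, all examples belong to class 0. See also this stackexchange answer. You can either increase the number of samples (which lowers the chance of a subset holding samples from a single class) or use a stratified data split/cross-validation strategy, which keeps the number of samples with values 0 and 1 roughly the same across all subsets.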
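The difference between a plain and a stratified split can be checked by listing which classes land in the training part of each fold. The sketch below uses illustrative labels (six of class 0, four of class 1) and placeholder features, both of which are assumptions and not data from the question.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

# Illustrative labels (not from the question): six samples of class 0, four of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one column

def training_classes(splitter):
    """Return the set of classes present in the training part of each fold."""
    return [set(y[train_idx].tolist()) for train_idx, _ in splitter.split(X, y)]

plain = training_classes(KFold(n_splits=2))
stratified = training_classes(StratifiedKFold(n_splits=2))
print("KFold:          ", plain)       # one training fold holds only class 0
print("StratifiedKFold:", stratified)  # every training fold holds both classes

# stratify=y keeps the class ratio roughly equal in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print("classes in y_train:", np.unique(y_train))
```

As far as I know, scikit-learn already uses `StratifiedKFold` when `cv` is an integer and the estimator is a classifier, so stratifying the initial `train_test_split` and having enough samples of each class are the parts most likely to matter here.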
https://datascience.stackexchange.com/questions/89148