
How do I use cross-validation with LogisticRegression to solve a classification problem?

Data Science user
Asked 2021-02-09 10:39:31
2 answers · 49 views · 0 followers · 0 votes
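  • I want to build a data frame from the most frequently repeated words in the sentences and classify it with logistic regression.
  • I have tried to lay out the steps clearly in code.
  • B is my target.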

What I have (example):

raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"],
          "B":["Yes","No","Yes","No","No"]   }
df=pd.DataFrame(raw_data)
df
    A                   B
0   This is yellow      Yes
1   That is green       No
2   These are orange    Yes
3   This is a pen       No
4   This is an Orange   No

What I did:

### 1-Import Libraries:
import numpy as np 
import pandas as pd

### 2- Create data set:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange"],
          "B":["Yes","No","Yes","No","No"]   }
df=pd.DataFrame(raw_data)
df


          A              B
0   This is yellow      Yes
1   That is green       No
2   These are orange    Yes
3   This is a pen       No
4   This is an Orange   No

### 3- Count the words and characters
df['word_count'] = df['A'].apply(lambda x: len(x.split(" ")))
df['char_count'] = df['A'].apply(len)
df
             A         B    word_count  char_count
0   This is yellow     Yes  3           14
1   That is green      No   3           13
2   These are orange   Yes  3           16
3   This is a pen      No   4           13
4   This is an Orange  No   4           17

### 4- Count the most repeated words in column "A"
# Split each sentence on whitespace, put one word per row, then count occurrences
df_word_count=df["A"].str.split().explode().value_counts().rename_axis("A").reset_index(name="Count")
display(df_word_count)
list_word_count=list(df_word_count["A"])
len(list_word_count)

    A       Count
0   is      4
1   This    3
2   yellow  1
3   These   1
4   orange  1
5   green   1
6   That    1
7   are     1
8   a       1
9   pen     1
10  Orange  1
11  an      1

### 5- Make a ZERO-Matrix
allfeatures=np.zeros((df.shape[0],len(list_word_count)))
allfeatures.shape

### 6- Create a data frame
for i, word in enumerate(list_word_count):
  # Number of times `word` occurs in each sentence
  allfeatures[:,i]=df['A'].apply(lambda x: x.split().count(word))
Complete_data=pd.concat([df,pd.DataFrame(allfeatures)],axis=1)
display(Complete_data)

           A            B   word_count  char_count  0   1   2   3   4   5   6   7   8   9   10  11
0   This is yellow      Yes 3           14          1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1   That is green       No  3           13          1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
2   These are orange    Yes 3           16          0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3   This is a pen       No  4           13          1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
4   This is an Orange   No  4           17          1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0


### 7- change columns name from list
#This creates a list of the words 
l = list(df_word_count["A"])

l.insert(0,"char_count")
l.insert(0,"word_count")
l.insert(0,"B")
l.insert(0,"A")    
# Finally, I rename all the columns with the names that I have in the list l
Complete_data.columns = l

### 8- Define X and Y
x=Complete_data.drop(["A","B"],axis=1) # Features
y=Complete_data["B"] # Target


### 9- Encoding of Target
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)


### 10- Train|Test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

### 11- Import Sklearn needed packages
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

### 12- Prediction and Regression with Cross-Validation
LogReg=LogisticRegression()
LogReg.fit(x_train,y_train)

cv_LogReg=cross_val_score(LogReg,x_train,y_train,cv=2)
cv_LogReg_pred=cross_val_predict(LogReg,x_train,y_train,cv=2)

print("Score: ",r2_score(y_train,cv_LogReg_pred))

Error:

Although I used LabelEncoder, the algorithm cannot find both classes (0, 1).

ValueError                                Traceback (most recent call last)
<ipython-input-127-2d7e54ebfd6c> in <module>
      4 #LogReg_pred=LogReg.predict(x_test)
      5 cv_LogReg=cross_val_score(LogReg,x_train,y_train,cv=2)
----> 6 cv_LogReg_pred=cross_val_predict(LogReg,x_train,y_train,cv=2)
      7 
      8 print("Score: ",r2_score(y_train,cv_LogReg_pred))

.
.
.

This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

I don't know what I am doing wrong.

2 Answers

Data Science user

Accepted answer

Posted 2021-02-09 12:45:47

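Because you are cross-validating on such a small sample, the data ends up being split so that one of the subsets contains only a single class, which is why you get that error. With more data you should not see it. I ran a plain logistic regression on these 5 records and was able to fit a model, so could you increase your data and check again? I enlarged the data like this: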

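### 2- Create data set:
raw_data={"A":["This is yellow","That is green","These are orange","This is a pen","This is an Orange","This is an Orange","This is a green one","This is a bird","This is a bird","This is a pen","This is an Orange","This is a green one","This is a bird","This is a pen","This is an orange","This is a bird","This is an orangutan"],
          "B":["Yes","No","Yes","No","No","Yes","No","Yes","No","No"]   }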

One more thing: in the last line I replaced r2_score with the accuracy score, since this is a classification problem.
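The switch to an accuracy score can be sketched as follows. This is a minimal illustration, not the answerer's exact code: the data set is simply the question's five rows repeated twice (an assumption made only so that each fold contains both classes), and `CountVectorizer` in a pipeline stands in for the hand-built word-count matrix.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Illustrative data only: the question's five rows, repeated twice.
sentences = ["This is yellow", "That is green", "These are orange",
             "This is a pen", "This is an Orange"] * 2
labels = ["Yes", "No", "Yes", "No", "No"] * 2
df = pd.DataFrame({"A": sentences, "B": labels})

# Encode the target: "No" -> 0, "Yes" -> 1
y = LabelEncoder().fit_transform(df["B"])

# Word counts are learned inside each fold, then fed to logistic regression
model = make_pipeline(CountVectorizer(), LogisticRegression())

# Per-fold accuracy (classification metric) instead of r2_score (regression metric)
scores = cross_val_score(model, df["A"], y, cv=2, scoring="accuracy")

# Out-of-fold predictions, scored once against the true labels
pred = cross_val_predict(model, df["A"], y, cv=2)
acc = accuracy_score(y, pred)

print("Per-fold accuracy:", scores)
print("Accuracy of out-of-fold predictions:", acc)
```

With only ten rows the numbers themselves mean little; the point is that both calls now run without the single-class error and report a metric suited to classification.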

0 votes

Data Science user

Posted 2021-02-09 11:04:15

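This appears to happen because your y_train dataset (or one of its folds) contains only one class; in this case, all examples belong to class 0. See also this stackexchange answer. You can either increase the number of samples (which makes it less likely that every sample in a subset belongs to the same class) or use a stratified data split / cross-validation strategy, so that the proportion of samples with values 0 and 1 is roughly the same in every subset.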
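Both points can be checked with a small sketch. The label arrays below are hypothetical (they are not the asker's actual split), and `fit_classes_per_fold` is a helper name introduced here for illustration. The first case shows how a four-sample training set with a single class-1 example leaves one fold with only class 0 to fit on; the second shows a stratified split on a larger sample keeping both classes in every subset.

```python
import warnings
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def fit_classes_per_fold(y, n_splits=2):
    """Return the sorted classes present in the fitting part of each fold."""
    X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here
    skf = StratifiedKFold(n_splits=n_splits)
    with warnings.catch_warnings():
        # scikit-learn warns when a class has fewer members than n_splits
        warnings.simplefilter("ignore")
        return [sorted(set(y[fit_idx].tolist())) for fit_idx, _ in skf.split(X, y)]

# Case 1 (hypothetical): four training labels, only one of them class 1.
# Whichever fold holds that sample out has nothing but class 0 left to fit on.
y_small = np.array([0, 0, 0, 1])
small_folds = fit_classes_per_fold(y_small)
print("small sample, classes per fold:", small_folds)

# Case 2 (hypothetical): ten labels, split with stratify so that the
# class ratio is preserved in both the training and the test part.
y_big = np.array([1, 0, 1, 0, 0] * 2)
X_big = np.arange(len(y_big)).reshape(-1, 1)
X_fit, X_test, y_fit, y_test = train_test_split(
    X_big, y_big, test_size=0.2, stratify=y_big, random_state=0
)
big_folds = fit_classes_per_fold(y_fit)
print("larger sample, classes per fold:", big_folds)
print("classes in test part:", sorted(set(y_test.tolist())))
```

Note that for a classifier and an integer `cv`, `cross_val_score` and `cross_val_predict` already use stratified folds, so on a data set this small stratification alone cannot help; more samples per class are needed as well.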

0 votes

Original content provided by Data Science. Original link:

https://datascience.stackexchange.com/questions/89148
