
Q: How to standardize a data frame that contains both numeric and factor variables

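Stack Overflow user

Asked 2016-04-18 22:51:21

3 answers · 7.4K views · 0 followers · 9 votes

My data frame my.data contains both numeric and factor variables. I only want to standardize the numeric variables in this data frame.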

> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

Can the standardization be done like this? I want to standardize columns 8, 9, 10, 11 and 12, but I think my code is wrong.

mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))

Thanks in advance.

variables

standardized

3 Answers

Stack Overflow user

Answered 2016-04-18 22:56:22

Here is one option for standardizing:

 mydata[] <- lapply(mydata, function(x) if(is.numeric(x)){
                     scale(x, center=TRUE, scale=TRUE)
                      } else x)
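
If only specific column positions should be standardized, as the question asks for columns 8 to 12, a positional variant may be simpler. The following is a minimal sketch on a made-up data frame `toy` (the frame and its column names are assumptions for illustration, not the asker's data); for the question's data the analogous call would presumably be `flowdis3[8:12] <- scale(flowdis3[8:12])`.

```r
# Hypothetical toy data: one factor, two numeric columns to scale,
# and one integer column that should be left untouched.
toy <- data.frame(
  g = factor(c("a", "b", "a", "b", "a")),
  x = c(1, 2, 3, 4, 5),
  y = c(10, 20, 30, 40, 100),
  n = c(5L, 3L, 8L, 1L, 4L)
)

# Positions of the columns to standardize (8:12 in the question).
cols <- c(2, 3)

# scale() returns a matrix; assigning it back into toy[cols]
# replaces only those columns and keeps everything else as is.
toy[cols] <- scale(toy[cols], center = TRUE, scale = TRUE)

str(toy)
```

Assigning into `toy[cols]` rather than building a new data frame from the scaled matrix keeps the factor columns in the result.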

Votes: 9

Stack Overflow user

Answered 2018-04-23 11:20:01

You can use the dplyr package to do this:

library(dplyr)
mydata2 %>% mutate_if(is.numeric, scale)
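
In dplyr 1.0.0 and later, `mutate_if()` is superseded by `across()` combined with `where()`. A rough equivalent, assuming a recent dplyr is installed and using a made-up data frame `toy`, might look like this; wrapping `scale()` in `as.numeric()` is an extra step added here so that the columns stay plain numeric vectors.

```r
library(dplyr)

# Hypothetical toy data with one factor and two numeric columns.
toy <- data.frame(
  id = factor(c("a", "b", "c", "d")),
  x  = c(1, 2, 3, 4),
  y  = c(10, 20, 40, 80)
)

# Scale every numeric column; as.numeric() drops the matrix
# dimensions and the "scaled:*" attributes that scale() attaches.
scaled <- toy %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

str(scaled)
```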

Votes: 4

Stack Overflow user

Answered 2020-04-03 19:41:49

Here are some options to consider, although this answer comes late:

# Working environment and Memory management
rm(list = ls(all.names = TRUE))
gc()
memory.limit(size = 64935)  # Windows only; no longer supported from R 4.2.0 on

# Set working directory
setwd("path")

# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39), 
                 "Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
                 "Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
                 "Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
                 "Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
                 "Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
                 stringsAsFactors = TRUE)  # needed in R >= 4.0 to get the factors shown below

Let's check the structure of df:

str(df)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  21 19 25 34 45 63 39 28 50 39
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  2138 1516 2213 2500 2660 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num  60 70 88 48 71 51 65 44 53 91

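We see that Age, Salary, Height and Weight are numeric, while Name and Gender are categorical (factor variables).

Let's scale only the numeric variables, using base R alone:

Option 1 (a slight modification of what akrun suggested here):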

start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  (x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1

Time difference of 0.02717805 secs
str(df1)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...

Option 2 (akrun's approach):

start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2

Time difference of 0.02599907 secs
str(df2)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...
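
Options 1 and 2 print the same values because `scale()` with `center = TRUE` and `scale = TRUE` subtracts the column mean and divides by the sample standard deviation. A quick check on the Age values from the example:

```r
# Age column from the example data frame above.
age <- c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39)

# Option 1: manual z-score.
manual <- (age - mean(age)) / sd(age)

# Option 2: scale(), flattened from a one-column matrix to a vector.
via_scale <- as.vector(scale(age, center = TRUE, scale = TRUE))

# TRUE when both approaches agree up to floating-point tolerance.
all.equal(manual, via_scale)
```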

Option 3:

start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3  # the post subtracted from end_time2 by mistake, hence the negative value below

Time difference of -59.6766 secs
str(df3)
'data.frame':   10 obs. of  6 variables:
  $ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2

Option 4 (using tidyverse and calling dplyr):

library(tidyverse)
start_time4 <- Sys.time()
df4 <-df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4

Time difference of 0.012043 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
  $ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
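
A single `Sys.time()` difference on a 10-row data frame is mostly measurement noise, so the timings above are probably not a reliable basis for comparing the options. A slightly more robust sketch, assuming a larger made-up data frame `df_demo` and a hypothetical helper `scale_numeric()` that mirrors option 3, times repeated runs with `system.time()`:

```r
set.seed(1)
n <- 1e5

# Hypothetical larger data frame so the work is measurable.
df_demo <- data.frame(
  g = factor(sample(c("a", "b"), n, replace = TRUE)),
  x = rnorm(n),
  y = runif(n)
)

# Helper mirroring option 3: scale only the numeric columns.
scale_numeric <- function(d) {
  idx <- vapply(d, is.numeric, logical(1))
  d[idx] <- lapply(d[idx], scale)
  d
}

# Time 20 repetitions instead of a single call.
timing <- system.time(
  for (i in seq_len(20)) out <- scale_numeric(df_demo)
)
timing[["elapsed"]]
```

The same loop can be wrapped around any of the other options to compare them on equal terms.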

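Which option to choose depends on the output structure you need and on speed. If your data are imbalanced and you want to balance them, say because you plan to run a classification after scaling the numeric variables, then the one-column matrix structure of the scaled numeric variables (Age, Salary, Height and Weight) will cause problems. That is: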

str(df4$Age)
 num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
 - attr(*, "scaled:center")= num 36.3
 - attr(*, "scaled:scale")= num 13.8

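For example, the ROSE package (for balancing data) does not accept data structures other than int, factor and num, so it will throw an error.

To avoid this problem, the scaled numeric variables can be stored as vectors rather than as one-column matrices, like this: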

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, ~ scale(.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4

This gives:

Time difference of 0.01400399 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
 $ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
 $ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...
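
If a data frame has already been scaled and carries one-column matrix columns, the same flattening can be done afterwards in base R. This is a minimal sketch on a made-up frame `df_scaled` (the name and data are assumptions for illustration): it finds the matrix columns and converts each one back to a plain vector.

```r
# Hypothetical frame with one factor and one numeric column.
df_scaled <- data.frame(
  g = factor(c("a", "b", "a", "b")),
  x = c(1, 2, 3, 4)
)

# scale() turns the column into a 4 x 1 matrix with
# "scaled:center" and "scaled:scale" attributes.
df_scaled$x <- scale(df_scaled$x)

# Detect matrix columns and flatten them; as.vector() drops
# the dimensions and the scaling attributes.
is_mat <- vapply(df_scaled, is.matrix, logical(1))
df_scaled[is_mat] <- lapply(df_scaled[is_mat], as.vector)

str(df_scaled)
```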

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/36697424
