Q: Sparklyr: handling categorical variables

Stack Overflow user

Asked 2017-08-14 15:45:54

1 answer, 1.2K views, 0 followers, 3 votes

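I come from an R background, where categorical variables are handled behind the scenes (as factors). With sparklyr, I find using string_indexer or onehotencoder quite confusing.

For example, I have a number of variables that are encoded as numeric in the original dataset but are actually categorical. I want to use them as categorical variables, but I am not sure I am doing it correctly.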

library(sparklyr)
library(dplyr)
sessionInfo()
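# `version` omitted: unless a variable named spark_version was defined elsewhere, this passed sparklyr's function instead of a version string
sc <- spark_connect(master = "local")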
spark_version(sc)
set.seed(1)    
exampleDF <- data.frame (ID = 1:10, Resp = sample(c(100:205), 10, replace = TRUE), 
                     Numb = sample(1:10, 10))

example <- copy_to(sc, exampleDF) 
pred <- example %>% mutate(Resp = as.character(Resp)) %>%
                sdf_mutate(Resp_cat = ft_string_indexer(Resp)) %>%
                ml_decision_tree(response = "Resp_cat", features = "Numb") %>%
                sdf_predict()
pred

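The predictions from the model are not categorical; see below. Does this mean I also have to convert the prediction back to Resp_cat, and then back to Resp?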

R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

spark_version(sc)
[1] ‘2.1.1.2.6.1.0’

Source:   table<sparklyr_tmp_74e340c5607c> [?? x 6]
Database: spark_connection
      ID  Numb  Resp Resp_cat id74e35c6b2dbb prediction
     <int> <int> <chr>    <dbl>          <dbl>      <dbl>
 1     1    10   150        8              0   8.000000
 2     2     3   191        4              1   4.000000
 3     3     4   146        9              2   9.000000
 4     4     9   125        5              3   5.000000
 5     5     8   107        2              4   2.000000
 6     6     2   110        1              5   1.000000
 7     7     5   133        3              6   5.333333
 8     8     7   154        6              7   5.333333
 9     9     1   170        0              8   0.000000
10    10     6   143        7              9   5.333333

apache-spark

apache-spark-ml

sparklyr

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-08-14 16:30:23

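In general, Spark relies on column metadata when handling categorical data. In your pipeline this is handled by StringIndexer (ft_string_indexer). ML always predicts the encoded label, not the original string. Normally you would convert it back with the IndexToString transformer, which sparklyr exposes as ft_index_to_string.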
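To make the encoding concrete, here is a minimal base R sketch of the forward and inverse mappings. It does not call Spark; it only mimics the idea of StringIndexer's default behaviour (labels ordered by descending frequency, indices starting at 0). The `resp` vector is invented toy data, and how Spark breaks ties between equally frequent labels is not modelled here.

```r
# Toy label column (hypothetical data, not taken from the question)
resp <- c("150", "191", "150", "125", "191", "150")

# Distinct labels ordered by descending frequency
freq <- sort(table(resp), decreasing = TRUE)
labels <- names(freq)

# Forward mapping: label -> zero-based index (the role of Resp_cat)
resp_cat <- match(resp, labels) - 1

# Inverse mapping: index -> label (what IndexToString does once it knows the labels)
decoded <- labels[resp_cat + 1]

print(data.frame(resp, resp_cat, decoded))
```

The inverse step needs the `labels` vector, which is why it matters whether the fitted indexer or the column metadata is still available.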

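In Spark, IndexToString can use either an explicitly provided list of labels or the Column metadata. Unfortunately, the sparklyr implementation is limited in two ways:

It can use only the metadata, which is not set on the prediction column.
ft_string_indexer discards the fitted model, so it cannot be used to extract the labels.

I may be missing something, but it looks like you will have to map the predictions manually, for example by joining with the transformed data: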

pred %>% 
  select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
  distinct() %>% 
  right_join(pred)

Joining, by = "prediction"
# Source:   lazy query [?? x 9]
# Database: spark_connection
   prediction Resp_prediction    ID  Numb  Resp Resp_cat id777a79821e1e
        <dbl>           <chr> <int> <int> <chr>    <dbl>          <dbl>
 1          7             171     1     3   171        7              0
 2          0             153     2    10   153        0              1
 3          3             132     3     8   132        3              2
 4          5             122     4     7   122        5              3
 5          6             198     5     4   198        6              4
 6          2             164     6     9   164        2              5
 7          4             137     7     6   137        4              6
 8          1             184     8     5   184        1              7
 9          0             153     9     1   153        0              8
10          1             184    10     2   184        1              9
# ... with more rows, and 2 more variables: rawPrediction <list>,
#   probability <list>

Explanation

pred %>% 
  select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
  distinct()

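creates a mapping from the prediction (the encoded label) to the original label. We rename Resp_cat to prediction so that it can serve as the join key, and Resp to Resp_prediction to avoid a name clash with the actual Resp.

Finally, we apply a right equi-join: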

... %>%  right_join(pred)
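The same lookup-and-join pattern can be tried without a Spark connection. The sketch below runs it on a small local data frame with dplyr; `pred_local` and its values are hypothetical stand-ins for `pred`, and the join key is spelled out with `by` instead of relying on the automatic match.

```r
library(dplyr)

# Hypothetical local stand-in for `pred` (values invented for illustration)
pred_local <- data.frame(
  ID         = 1:4,
  Resp       = c("150", "191", "146", "125"),
  Resp_cat   = c(0, 1, 2, 3),
  prediction = c(0, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Lookup table: encoded label -> original label
lookup <- pred_local %>%
  select(prediction = Resp_cat, Resp_prediction = Resp) %>%
  distinct()

# Right join keeps every row of pred_local and attaches the decoded label
decoded <- lookup %>%
  right_join(pred_local, by = "prediction") %>%
  arrange(ID)

print(decoded)
```

Row 3 shows the point of the mapping: its true label is "146", but its predicted index 1 decodes to "191".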

Note

You should specify the type of the tree:

ml_decision_tree(
  response = "Resp_cat", features = "Numb", type = "classification")
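A quick arithmetic check suggests why this matters. In the question's output, IDs 7, 8 and 10 all receive the prediction 5.333333, and their Resp_cat values are 3, 6 and 7. That is what a regression tree would produce if those three rows fell into the same leaf (an assumption, since the fitted tree is not shown), because a regression leaf predicts the mean of the labels it contains:

```r
# Resp_cat values for IDs 7, 8 and 10, copied from the question's output
leaf_labels <- c(3, 6, 7)

# A regression leaf predicts the mean of its training labels
leaf_prediction <- mean(leaf_labels)

print(round(leaf_prediction, 6))  # 5.333333
```

A value such as 5.333333 matches no encoded label, so it cannot be mapped back to Resp. With type = "classification" the tree predicts one of the encoded class indices instead.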

3 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/45678282
