
Q: Scala Spark model transform returns all zeros

Stack Overflow user

Asked on 2017-07-18 18:53:53

1 answer, 368 views, 0 followers, 1 vote

Good day, everyone. I am using apache-spark ml (not mllib) with Scala for a simple machine learning task. My build.sbt is as follows:

name := "spark"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" %% "spark-core"  % "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
libraryDependencies += "com.databricks" %% "spark-csv" % "1.0.1"

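All stages run without errors, but the dataset that should contain the predictions is wrong. I am classifying into three classes with labels 1.0, 2.0, and 3.0, yet the prediction column contains only the label 0.0, even though no such label exists in the data. Here is the original DataFrame: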

+--------------------+--------+
|               tfIdf|estimate|
+--------------------+--------+
|(3000,[0,1,8,14,1...|     3.0|
|(3000,[0,1707,223...|     3.0|
|(3000,[1,24,33,64...|     3.0|
|(3000,[1,40,114,5...|     2.0|
|(3000,[1,363,743,...|     2.0|
|(3000,[2,20,65,88...|     3.0|
|(3000,[3,15,21,23...|     3.0|
|(3000,[3,45,53,14...|     3.0|
|(3000,[3,387,433,...|     1.0|
|(3000,[3,523,629,...|     3.0|
+--------------------+--------+

After classification, my predictions are:

+--------------------+--------+----------+
|               tfIdf|estimate|prediction|
+--------------------+--------+----------+
|(3000,[0,1,8,14,1...|     3.0|       0.0|
|(3000,[0,1707,223...|     3.0|       0.0|
|(3000,[1,24,33,64...|     3.0|       0.0|
|(3000,[1,40,114,5...|     2.0|       0.0|
|(3000,[1,363,743,...|     2.0|       0.0|
|(3000,[2,20,65,88...|     3.0|       0.0|
|(3000,[3,15,21,23...|     3.0|       0.0|
|(3000,[3,45,53,14...|     3.0|       0.0|
|(3000,[3,387,433,...|     1.0|       0.0|
|(3000,[3,523,629,...|     3.0|       0.0|
+--------------------+--------+----------+

My code is as follows:

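  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
  import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}
  import org.apache.spark.sql.functions.udf

  // krData (the raw DataFrame) and STOP_WORDS are defined elsewhere in the program;
  // the $"..." column syntax requires spark.implicits._ to be in scope.

  // Cast the label column from String to Double
  val toDouble = udf[Double, String](_.toDouble)
  val kribrumData = krData.withColumn("estimate", toDouble(krData("estimate")))
    .select($"text", $"estimate")

  kribrumData.cache()

  // Text preprocessing: tokenize -> remove stop words -> hashed TF -> IDF
  val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("tokens")
  val stopWordsRemover = new StopWordsRemover()
    .setInputCol("tokens")
    .setOutputCol("filtered")
    .setStopWords(STOP_WORDS)
  val hashingTF = new HashingTF()
    .setInputCol("filtered")
    .setNumFeatures(3000)
    .setOutputCol("tf")
  val idf = new IDF()
    .setInputCol("tf")
    .setOutputCol("tfIdf")
  val preprocessor = new Pipeline()
    .setStages(Array(tokenizer, stopWordsRemover, hashingTF, idf))
  val preprocessor_model = preprocessor.fit(kribrumData)

  val preprocessedKribrumData = preprocessor_model.transform(kribrumData)
    .select("tfIdf", "estimate")

  // 80/20 train/test split with a fixed seed
  val Array(train, test) = preprocessedKribrumData.randomSplit(Array(0.8, 0.2), seed = 7)

  test.show(10)

  // One-vs-rest multiclass classifier built on binary logistic regression;
  // the label column is the raw "estimate" column (values 1.0, 2.0, 3.0)
  val logisticRegressor = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.3)
    .setElasticNetParam(0.8)
    .setLabelCol("estimate")
    .setFeaturesCol("tfIdf")
  val classifier = new OneVsRest()
    .setLabelCol("estimate")
    .setFeaturesCol("tfIdf")
    .setClassifier(logisticRegressor)

  val model = classifier.fit(train)

  val predictions = model.transform(test)

  predictions.show(10)

  val evaluator = new MulticlassClassificationEvaluator()
    .setMetricName("accuracy")
    .setLabelCol("estimate")

  val accuracy = evaluator.evaluate(predictions)

  println("Classification accuracy: " + accuracy.toString)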

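This code ends up with a prediction accuracy of zero, because the target column "estimate" contains no "0.0" label. So what exactly am I doing wrong? Any ideas would be greatly appreciated.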
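The symptom can be checked by hand on the ten rows shown above. The following is a minimal plain-Scala sketch (no Spark needed) that copies the (estimate, prediction) pairs from the output table, computes the accuracy, and tests whether the labels form a zero-based contiguous range. The object name `LabelRangeCheck` and the helper `isZeroBasedContiguous` are hypothetical names introduced only for this illustration; the assumption behind the check is that Spark ML classifiers generally expect labels to be class indices 0, 1, ..., numClasses - 1.

```scala
object LabelRangeCheck {
  // (estimate, prediction) pairs copied from the ten rows shown in the question
  val rows: Seq[(Double, Double)] = Seq(
    (3.0, 0.0), (3.0, 0.0), (3.0, 0.0), (2.0, 0.0), (2.0, 0.0),
    (3.0, 0.0), (3.0, 0.0), (3.0, 0.0), (1.0, 0.0), (3.0, 0.0)
  )

  // Distinct values seen in each column
  val labels: Set[Double] = rows.map(_._1).toSet
  val predictions: Set[Double] = rows.map(_._2).toSet

  // Fraction of rows where the prediction equals the label
  val accuracy: Double =
    rows.count { case (label, prediction) => label == prediction }.toDouble / rows.size

  // Hypothetical helper: true only when the labels are exactly 0.0, 1.0, ..., k-1
  def isZeroBasedContiguous(ls: Set[Double]): Boolean =
    ls == (0 until ls.size).map(_.toDouble).toSet

  def main(args: Array[String]): Unit = {
    println(s"labels = $labels, predictions = $predictions, accuracy = $accuracy")
    println(s"labels are zero-based class indices: ${isZeroBasedContiguous(labels)}")
  }
}
```

On these rows no prediction matches its label, so the accuracy is zero, and the labels {1.0, 2.0, 3.0} do not pass the zero-based check.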

Tags: scala, apache-spark, machine-learning, apache-spark-ml

1 Answer

Stack Overflow user

Accepted answer

Posted on 2017-07-21 20:41:58

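In the end, I found the problem. Spark does not throw an error or exception when the label column is a double whose values fall outside the range the classifier treats as valid. To work around this, the labels need to be re-indexed with StringIndexer, so we only need to add the following to the pipeline: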

val labelIndexer = new StringIndexer()
  .setInputCol("estimate")
  .setOutputCol("indexedLabel")

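This step solved the problem, although it is rather inconvenient. Note that the classifier and the evaluator then have to use "indexedLabel" as their label column.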
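For reference, here is a minimal self-contained sketch of how the indexer might be wired into the rest of the pipeline. It is not the asker's actual code: the toy three-dimensional feature vectors, the object name `LabelIndexingSketch`, the output column `predictedEstimate`, and the lighter regularization setting are all assumptions made for illustration. It uses `StringIndexer` to map the labels 1.0, 2.0, 3.0 onto class indices, trains `OneVsRest` on the indexed column, and uses `IndexToString` to map predictions back to the original label values (as strings).

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LabelIndexingSketch {
  // Trains on toy data and returns the distinct predicted labels, mapped back
  // to the original label values (as strings, e.g. "1.0").
  def predictedLabels(spark: SparkSession): Set[String] = {
    // Toy stand-in for the preprocessed data (hypothetical values)
    val data = spark.createDataFrame(Seq(
      (Vectors.dense(1.0, 0.0, 0.0), 1.0),
      (Vectors.dense(0.9, 0.1, 0.0), 1.0),
      (Vectors.dense(0.0, 1.0, 0.0), 2.0),
      (Vectors.dense(0.1, 0.9, 0.0), 2.0),
      (Vectors.dense(0.0, 0.0, 1.0), 3.0),
      (Vectors.dense(0.0, 0.1, 0.9), 3.0)
    )).toDF("tfIdf", "estimate")

    // Map the original labels onto class indices 0.0, 1.0, 2.0
    val labelIndexer = new StringIndexer()
      .setInputCol("estimate")
      .setOutputCol("indexedLabel")
      .fit(data)

    // Assumed settings for the toy data, lighter than in the question
    val logisticRegressor = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    // The classifier now learns from the indexed label column
    val classifier = new OneVsRest()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("tfIdf")
      .setClassifier(logisticRegressor)

    // Convert predicted indices back to the original label strings
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedEstimate")
      .setLabels(labelIndexer.labels)

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, classifier, labelConverter))
    val output = pipeline.fit(data).transform(data)

    output.select("estimate", "indexedLabel", "prediction", "predictedEstimate").show()
    output.select("predictedEstimate").collect().map(_.getString(0)).toSet
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("label-indexing-sketch")
      .getOrCreate()
    try println(predictedLabels(spark)) finally spark.stop()
  }
}
```

In this arrangement the "prediction" column holds class indices rather than original labels, so an evaluator should compare "prediction" against "indexedLabel", not against "estimate".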

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/45164721
