Confusion matrix to get precision, recall, and F1 score

Stack Overflow user
Asked 2019-10-16 10:15:30
Answers: 2 · Views: 8.2K · Followers: 0 · Votes: 4

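I have a DataFrame df, and I have run the DecisionTree classification algorithm on it. Its two columns, label and features, were used when fitting the algorithm. The model is called dtc. How can I create a confusion matrix in PySpark?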

dtc = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label')
dtcModel = dtc.fit(train)
predictions = dtcModel.transform(test)

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics

preds = df.select(['label', 'features']) \
                            .df.map(lambda line: (line[1], line[0]))
metrics = MulticlassMetrics(preds)

# Confusion Matrix
print(metrics.confusionMatrix().toArray())

2 Answers

Stack Overflow user

Answered 2019-10-16 12:26:45

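Before calling metrics.confusionMatrix().toArray(), you need to convert the DataFrame to an RDD and map each row to a tuple.

From the official documentation:

class pyspark.mllib.evaluation.MulticlassMetrics(predictionAndLabels)

Evaluator for multiclass classification.

Parameters: predictionAndLabels – an RDD of (prediction, label) pairs.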
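To make the (prediction, label) input concrete, here is a minimal pure-Python sketch of how such pairs tally into a confusion matrix. It does not use Spark, and `confusion_from_pairs` and the sample pairs are hypothetical names and data made up for illustration. It follows the layout documented for `confusionMatrix()`: rows are true labels, columns are predicted labels, both in ascending label order.

```python
def confusion_from_pairs(pairs):
    """Tally (prediction, label) pairs into a confusion matrix.

    Rows are true labels and columns are predicted labels,
    both sorted in ascending order.
    """
    # Illustrative choice: take the class set from both predictions and labels.
    labels = sorted({value for pair in pairs for value in pair})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for prediction, label in pairs:
        matrix[index[label]][index[prediction]] += 1
    return labels, matrix


# Hypothetical (prediction, label) pairs: prediction first, label second.
sample_pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0),
                (2.0, 2.0), (0.0, 0.0), (1.0, 2.0)]
labels, matrix = confusion_from_pairs(sample_pairs)
for row in matrix:
    print(row)
```

Note the order inside each pair: swapping prediction and label would transpose the matrix, which is an easy mistake to make when building the RDD.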

Here is an example to guide you.

ML part

import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
#Note the differences between ml and mllib, they are two different libraries.

#create a sample data frame
data = [(1.54,3.45,2.56,0),(9.39,8.31,1.34,0),(1.25,3.31,9.87,1),(9.35,5.67,2.49,2),\
        (1.23,4.67,8.91,1),(3.56,9.08,7.45,2),(6.43,2.23,1.19,1),(7.89,5.32,9.08,2)]

cols = ('a','b','c','d')

df = spark.createDataFrame(data, cols)

assembler = VectorAssembler(inputCols=['a','b','c'], outputCol='features')

df_features = assembler.transform(df)

#df.show()

train_data, test_data = df_features.randomSplit([0.6,0.4])

dtc = DecisionTreeClassifier(featuresCol='features',labelCol='d')

dtcModel = dtc.fit(train_data)

predictions = dtcModel.transform(test_data)

Evaluation part

#important: need to cast the label to float type, and order by prediction, else it won't work
#note: the column produced by the model is named 'prediction' (singular)
preds_and_labels = predictions.select(['prediction','d']).withColumn('label', F.col('d').cast(FloatType())).orderBy('prediction')

#select only prediction and label columns
preds_and_labels = preds_and_labels.select(['prediction','label'])

#MulticlassMetrics expects an RDD of (prediction, label) tuples
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

print(metrics.confusionMatrix().toArray())
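Since the question asks for precision, recall, and F1, the following is a minimal sketch of how those per-class values can be derived from a confusion matrix by hand. It assumes rows are true labels and columns are predicted labels; `per_class_scores` and the `example` matrix are hypothetical, not part of PySpark.

```python
def per_class_scores(matrix):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix.

    Assumes rows are true labels and columns are predicted labels.
    """
    size = len(matrix)
    scores = {}
    for k in range(size):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(size))  # column sum
        actual_k = sum(matrix[k])  # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[k] = (precision, recall, f1)
    return scores


# Hypothetical 3-class matrix, standing in for
# metrics.confusionMatrix().toArray().tolist()
example = [[2, 0, 0],
           [0, 1, 1],
           [0, 1, 1]]
scores = per_class_scores(example)
for k, (p, r, f1) in scores.items():
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In practice the `metrics` object also exposes `precision(label)`, `recall(label)`, and `fMeasure(label)`, so the manual calculation is mainly useful for checking what those numbers mean.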
Votes: 3

Stack Overflow user

Answered 2020-03-25 08:17:35

Use the following:

from pyspark.ml.classification import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label', numTrees=500)
rfModel = rf.fit(train)
predictions_train = rfModel.transform(train)

# Collect both columns in one pass so labels and predictions stay aligned row by row
rows = predictions_train.select('label', 'prediction').collect()

# collect() returns Row objects, so unpack each one into a plain value
y_true = [row['label'] for row in rows]
y_pred = [row['prediction'] for row in rows]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))

Here, train is your training data.

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/58404845
