How to convert an RDD of dense vectors into a DataFrame in PySpark?

Stack Overflow user
Asked 2016-12-26 09:05:26
2 answers, 9.1K views, 0 followers, 11 votes

I have an RDD of DenseVector objects like this:

>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]

I want to convert it into a DataFrame. I tried this:

>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()

It produces this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 360, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 340, in _inferSchema
    schema = _infer_schema(first)
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 991, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 968, in _infer_type
    raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.ndarray'>

Old solution

frequencyVectors.map(lambda vector: DenseVector(vector.toArray()))

Edit 1 - reproducible code

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import split

from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.mllib.linalg import SparseVector, DenseVector

sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sc.setLogLevel('ERROR')

sentenceData = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData = sentenceData.withColumn("sentence", split("sentence", r"\s+"))
sentenceData.show()

vectorizer = CountVectorizer(inputCol="sentence", outputCol="rawfeatures").fit(sentenceData)
countVectors = vectorizer.transform(sentenceData).select("label", "rawfeatures")

idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(countVectors)
tfidf = idfModel.transform(countVectors).select("label", "features")
frequencyDenseVectors = tfidf.rdd.map(lambda vector: [vector[0],DenseVector(vector[1].toArray())])
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])

2 Answers

Stack Overflow user

Accepted answer

Posted 2016-12-26 11:50:40

You cannot convert an RDD[Vector] directly. It should be mapped to an RDD of objects that can be interpreted as structs, for example RDD[Tuple[Vector]]:

frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])

Otherwise, Spark will try to convert the object's __dict__ and will end up with an unsupported NumPy array as a field:

from pyspark.ml.linalg import DenseVector  
from pyspark.sql.types import _infer_schema

v = DenseVector([1, 2, 3])
_infer_schema(v)

which raises:

TypeError                                 Traceback (most recent call last)
... 
TypeError: not supported type: <class 'numpy.ndarray'>

versus wrapping the vector in a tuple:

_infer_schema((v, ))

which returns:

StructType(List(StructField(_1,VectorUDT,true)))

Notes

  • In Spark 2.0 you have to use the correct local types:

    - `pyspark.ml.linalg` when working with the `DataFrame`-based `pyspark.ml` API.
    - `pyspark.mllib.linalg` when working with the `RDD`-based `pyspark.mllib` API.

    These two namespaces are no longer compatible and require explicit conversions (see, for example, How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT).

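  • The code provided in the edit is not equivalent to the code in the original question. Be aware that tuple and list do not have the same semantics. If you map each vector to a pair, use a tuple and convert directly to a DataFrame:

        tfidf.rdd.map(
            lambda row: (row[0], DenseVector(row[1].toArray()))
        ).toDF()

    Using a tuple (product type) would work for a nested structure as well, but I doubt this is what you want:

        (tfidf.rdd
            .map(lambda row: (row[0], DenseVector(row[1].toArray())))
            .map(lambda x: (x, ))
            .toDF())

    A list anywhere other than the top-level row is interpreted as an ArrayType.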
  • Using a UDF for the conversion (see Spark Python: Standard scaler error "Do not support ... SparseVector") is much cleaner.
14 votes

Stack Overflow user

Posted 2016-12-26 10:35:30

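I think the problem here is that createDataFrame does not take a DenseVector as an argument. Try converting the DenseVector into a corresponding collection, i.e. an array or a list. In Scala and Java the toArray() method is available; you can use it to convert the DenseVector into an array or list and then try to create the DataFrame.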

1 vote

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/41328799
