Question: CountVectorizer feature extraction

Stack Overflow user

Asked 2017-09-05 21:37:13

1 answer, 287 views, 0 followers, 0 votes

I have the following data:

+------------------------------------------------+
|filtered                                        |
+------------------------------------------------+
|[human, interface, computer]                    |
|[survey, user, computer, system, response, time]|
|[eps, user, interface, system]                  |
|[system, human, system, eps]                    |
|[user, response, time]                          |
|[trees]                                         |
|[graph, trees]                                  |
|[graph, minors, trees]                          |
|[graph, minors, survey]                         |
+------------------------------------------------+

After running CountVectorizer on the column above, I get the following output:

+------------------------------------------------+---------------------------------------------+
|filtered                                        |features                                     |
+------------------------------------------------+---------------------------------------------+
|[human, interface, computer]                    |(12,[4,7,9],[1.0,1.0,1.0])                   |
|[survey, user, computer, system, response, time]|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0])|
|[eps, user, interface, system]                  |(12,[0,2,4,10],[1.0,1.0,1.0,1.0])            |
|[system, human, system, eps]                    |(12,[0,9,10],[2.0,1.0,1.0])                  |
|[user, response, time]                          |(12,[2,8,11],[1.0,1.0,1.0])                  |
|[trees]                                         |(12,[1],[1.0])                               |
|[graph, trees]                                  |(12,[1,3],[1.0,1.0])                         |
|[graph, minors, trees]                          |(12,[1,3,5],[1.0,1.0,1.0])                   |
|[graph, minors, survey]                         |(12,[3,5,6],[1.0,1.0,1.0])                   |
+------------------------------------------------+---------------------------------------------+
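Each entry in the features column reads as (size, [indices], [values]): a vector of length 12 in which only the listed positions are non-zero. As a rough illustration in plain Python (no Spark involved; `to_dense` is a hypothetical helper, not a PySpark function), the triple can be expanded like this:

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list."""
    dense = [0.0] * size
    for idx, val in zip(indices, values):
        dense[idx] = val
    return dense


# Row "[system, human, system, eps]" from the table above.
dense = to_dense(12, [0, 9, 10], [2.0, 1.0, 1.0])
print(dense)
```

Position 0 holds 2.0 because "system" occurs twice in that row; every position not listed in the index array stays at 0.0.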

Now I want to run a map function over the features column and transform it into the following:

+------------------------------------------------+--------------------------------------------------------+
|features                                        |transformed                                             |
+------------------------------------------------+--------------------------------------------------------+
|(12,[4,7,9],[1.0,1.0,1.0])                      |["1 4 1", "1 7 1", "1 9 1"]                             |
|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0])   |["2 0 1", "2 2 1", "2 6 1", "2 7 1", "2 8 1", "2 11 1"] |
|(12,[0,2,4,10],[1.0,1.0,1.0,1.0])               |["3 0 1", "3 2 1", "3 4 1", "3 10 1"]                   |
[TRUNCATED]

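The transformation takes the middle array of each feature vector and builds one sub-entry per element of that array. For example, take row 1 of the features column: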

(12,[4,7,9],[1.0,1.0,1.0])

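Now take its middle array, [4,7,9], pair each element with its frequency from the third array, [1.0,1.0,1.0], and prefix each pair with "1" because this is row 1, to get the following output: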

["1 4 1", "1 7 1", "1 9 1"]

In general, the format is:

["RowNumber MiddleFeatEl CorrespondingFreq", ....]
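The format above can be sketched on plain Python lists, independent of Spark. `format_entries` is a hypothetical name, and casting the frequency with `int()` is an assumption based on the counts being whole numbers in the desired output:

```python
def format_entries(row_number, indices, values):
    """Build "RowNumber MiddleFeatEl CorrespondingFreq" strings for one row."""
    return [
        "{} {} {}".format(row_number, index, int(freq))
        for index, freq in zip(indices, values)
    ]


# Row 1 of the features column: (12,[4,7,9],[1.0,1.0,1.0])
row1 = format_entries(1, [4, 7, 9], [1.0, 1.0, 1.0])
print(row1)
```

Without the `int()` cast the frequency would be rendered as "1.0" rather than "1".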

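Using a map function, I cannot extract the middle array and the last frequency list separately from the generated features column.

Here is the map code: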

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def corpus_create(feats):
    # Goal: return the index array [4,7,9]; feats[1] returns only a single feature score.
    return feats[1]

corpus_udf = udf(corpus_create, StringType())
df3 = df.withColumn("corpus", corpus_udf("features"))
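One likely reason the code above yields a single score: integer indexing on a sparse vector addresses a position in the full-length vector, not one of the three components of the printed triple. The sketch below uses a hypothetical stand-in class, `SparseVec`, written only for illustration; it is not the PySpark class, although the accepted answer relies on the real vector exposing `indices` and `values` attributes in the same way.

```python
class SparseVec:
    """Hypothetical stand-in for a sparse vector; not the PySpark class."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def __getitem__(self, position):
        # Indexing addresses the dense position, so absent positions read as 0.0.
        if position in self.indices:
            return self.values[self.indices.index(position)]
        return 0.0


vec = SparseVec(12, [4, 7, 9], [1.0, 1.0, 1.0])
single_score = vec[1]       # value stored at dense position 1
middle_array = vec.indices  # the index array the question is after
freqs = vec.values          # the matching frequencies
print(single_score, middle_array, freqs)
```

Reading the arrays through attributes rather than `feats[1]` is the design choice that makes both lists available at once.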

apache-spark

pyspark

spark-dataframe

pyspark-sql

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-09-05 21:50:28

Row numbers are essentially meaningless in Spark, but if you don't mind that:

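def f(x):
    # zipWithIndex yields (row, index) pairs; the index starts at 0.
    row, i = x
    jvs = (
        # SparseVector: pair each stored index with its value
        zip(row.features.indices, row.features.values) if hasattr(row.features, "indices")
        # DenseVector: enumerate every position
        else enumerate(row.features.toArray()))

    # Build "row index value" strings, skipping zero entries.
    s = ["{} {} {}".format(i, j, v)
        for j, v in jvs if v]
    return row + (s, )


df.rdd.zipWithIndex().map(f).toDF(df.columns + ["transformed"])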
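The same logic can be traced without a Spark session. The sketch below is a plain-Python analogue, not the answer's code: `transform_rows` is a hypothetical name, sparse vectors are modelled as (size, indices, values) tuples and dense vectors as lists, and `enumerate` stands in for `zipWithIndex`. It starts counting at 1 to match the numbering in the question, whereas `zipWithIndex` starts at 0.

```python
def transform_rows(rows, start=1):
    """Plain-Python analogue of numbering rows and mapping f over them."""
    out = []
    for row_number, features in enumerate(rows, start=start):
        if isinstance(features, tuple):
            # Sparse case: (size, indices, values)
            _size, indices, values = features
            pairs = zip(indices, values)
        else:
            # Dense case: every position paired with its value
            pairs = enumerate(features)
        # Skip zero entries, as the "if v" filter does in the answer.
        out.append(["{} {} {}".format(row_number, j, v) for j, v in pairs if v])
    return out


result = transform_rows([
    (12, [4, 7, 9], [1.0, 1.0, 1.0]),  # sparse row
    [0.0, 1.0, 0.0],                   # dense row
])
print(result)
```

Formatting the float directly produces "1.0" rather than "1", so an `int()` cast would be needed to reproduce the exact strings requested in the question.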

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/46063711
