
How to use a list comprehension with nested for loops in PySpark

Stack Overflow user
Asked on 2021-03-05 15:09:45
Answers: 1 · Views: 463 · Followers: 0 · Score: 1

I intend to use difflib.SequenceMatcher on the PySpark dataframes below.

tech.show()

+-----------------------------+----------------------+
|        concat_tech          |Vendor_product        |
+-----------------------------+----------------------+
|AWS Cloud Administration     |AWS Cloud Map         |
|Grad Portal                  |CA Identity Portal    |
|Html/php                     |HTML                  |
|UX Xpect                     |HP-UX                 |
|Debian-based                 |Debian                |
|Microsoft Office excel       |Microsoft Office      |
|Oracle EBusiness Suite 12.2.4|Oracle Primavera Suite|
|Solaris 10.XX                |Solaris               |
|CA7 Job Scheduler            |CA Scheduler          |
|Windows NT/XP/Vista          |Windows XP            |
+-----------------------------+----------------------+

techno.show()
+------------------------------+-------------------------------+--------------------------------+---------------------------------------------------+
|vendor                        |product                        |category                        |sub_category                                       |
+------------------------------+-------------------------------+--------------------------------+---------------------------------------------------+
|Notion Labs, Inc.             |Notion                         |Project Management              |Project Management                                 |
|Apptricity Corporation        |Apptricity                     |Enterprise Applications         |Enterprise Resource Planning (ERP)                 |
|Resolution Software, Ltd.     |Xcase                          |IT Governance                   |Application Development & Management               |
|The Apache Software Foundation|Apache Mynewt                  |IT Governance                   |Application Development & Management               |
|NetApp, Inc.                  |NetApp iSCSI SAN Storage System|Data Center Solutions           |Data Management & Storage (Hardware)               |
|HP, Inc.                      |HP Z820                        |Hardware (Basic)                |Consumer Electronics, Personal Computers & Software|
|Dell Technologies, Inc.       |Dell EMC FormWare              |Customer Relationship Management|Help Desk Management                               |
|ServiceMax, Inc.              |ServiceMax                     |Customer Relationship Management|Service & Field Support Management                 |
|MaxMind, Inc.                 |MaxMind GeoIP                  |Software (Basic)                |Server Technologies (Software)                     |
|Campus Management Corporation |Campus Management              |Vertical Markets                |Academic & Education Management Software           |
+------------------------------+-------------------------------+--------------------------------+---------------------------------------------------+

import pandas as pd
from difflib import SequenceMatcher

# Similarity helper: returns 0 for missing values, otherwise the
# SequenceMatcher ratio between the two strings.
def similar(a, b):
    if pd.isnull(a) or pd.isnull(b):
        return 0
    else:
        return SequenceMatcher(None, a, b).ratio()

The function similar(a, b) takes concat_tech from dataframe tech as 'a' and product from dataframe techno as 'b', and returns the ratio between a and b. The goal is to find the best match for each concat_tech across all products, i.e. for each concat_tech value, return the one product from the product column that yields the best (maximum) ratio with SequenceMatcher().

This would be a one-to-many operation that could be written as a list comprehension in pandas, but how can the same thing be achieved in PySpark? Both dataframes have millions of rows; the 10-row samples above are just for illustration.
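For reference, a minimal pandas sketch of the one-to-many matching described above could look like the following (illustrative only; tech_pd and techno_pd are assumed pandas versions of the two dataframes, and similar() is the helper defined earlier):

# Hypothetical pandas version of the best-match lookup (not from the original post).
products = techno_pd["product"].tolist()

# For each concat_tech value, score it against every product and keep the best one.
tech_pd["best_product"] = [
    max(products, key=lambda p: similar(c, p)) for c in tech_pd["concat_tech"]
]
tech_pd["best_ratio"] = [
    max(similar(c, p) for p in products) for c in tech_pd["concat_tech"]
]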


1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-03-05 17:49:58

You are trying to compare every element of dataframe tech with every element of dataframe techno. The result of such an operation is a crossJoin. Unless one side of the join is fairly small, or there is a way to reduce the number of candidate combinations (and thereby avoid the cross join), this will be a very expensive operation.
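One illustrative way to shrink the pair space is a simple blocking heuristic, sketched below. This is not part of the original answer and trades recall for cost; the column name blk is an assumption:

from pyspark.sql import functions as F

# Only compare rows whose first letter matches, instead of every possible pair.
tech_k = tech.withColumn("blk", F.lower(F.substring("concat_tech", 1, 1)))
techno_k = techno.withColumn("blk", F.lower(F.substring("product", 1, 1)))
candidates = tech_k.join(techno_k, "blk")  # far fewer pairs than a full crossJoin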

The actual code is straightforward: perform the join, compute the ratio for each pair of elements with the help of a UDF, and then take the maximum per element of tech.

import pandas as pd
from difflib import SequenceMatcher
from pyspark.sql import functions as F

# Row-wise UDF returning the SequenceMatcher similarity ratio as a double.
@F.udf("double")
def similar(a, b):
    if pd.isnull(a) or pd.isnull(b):
        return 0.0  # return a float so it matches the declared "double" return type
    else:
        return SequenceMatcher(None, a, b).ratio()

# Cross-join every concat_tech with every product, score each pair, then keep
# the best-scoring product per concat_tech (max_by requires Spark 3.0+).
df = tech.select("concat_tech").crossJoin(techno.select("product")) \
    .withColumn("ratio", similar("concat_tech", "product")) \
    .groupBy("concat_tech").agg(F.expr("max_by(product, ratio)"), F.max("ratio"))
df.show(truncate=False)

Output for the sample data:

+--------------------------+----------------------+-------------------+         
|concat_tech               |max_by(product, ratio)|max(ratio)         |
+--------------------------+----------------------+-------------------+
|UXXpect                   |Apptricity            |0.35294117647058826|
|GradPortal                |Notion                |0.25               |
|OracleEBusinessSuite12.2.4|ApacheMynewt          |0.3157894736842105 |
|MicrosoftOfficeexcel      |ServiceMax            |0.3333333333333333 |
|AWSCloudAdministration    |Notion                |0.35714285714285715|
|CA7JobScheduler           |ApacheMynewt          |0.37037037037037035|
|Html/php                  |HPZ820                |0.14285714285714285|
|WindowsNT/XP/Vista        |MaxMindGeoIP          |0.3333333333333333 |
|Debian-based              |Xcase                 |0.35294117647058826|
|Solaris10.XX              |Xcase                 |0.23529411764705882|
+--------------------------+----------------------+-------------------+

Using a vectorized (pandas) UDF may improve performance somewhat, but the cross join itself remains the problem at large data sizes.
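For example, a vectorized (pandas) UDF version of the ratio computation might look roughly like this (a sketch assuming Spark 3.x with PyArrow available; the name similar_vec is illustrative):

import pandas as pd
from difflib import SequenceMatcher
from pyspark.sql import functions as F

# Scalar pandas UDF: receives whole pandas Series at a time, which cuts
# serialization overhead compared to the row-at-a-time UDF above.
@F.pandas_udf("double")
def similar_vec(a: pd.Series, b: pd.Series) -> pd.Series:
    def ratio(x, y):
        if pd.isnull(x) or pd.isnull(y):
            return 0.0
        return SequenceMatcher(None, x, y).ratio()
    return pd.Series([ratio(x, y) for x, y in zip(a, b)])

# Drop-in replacement in the pipeline above:
# .withColumn("ratio", similar_vec("concat_tech", "product"))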

Score: 1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/66494976