Pyspark: Split and select part of the string column values
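How do I select the characters (the rest of the file path) after the 4th backslash, counting from the left, in a column of a Spark DataFrame?
Sample rows of the PySpark column: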
\\D\Dev\johnny\Desktop\TEST
\\D\Dev\matt\Desktop\TEST\NEW
\\D\Dev\matt\Desktop\TEST\OLD\TEST
\\E\dev\peter\Desktop\RUN\SUBFOLDER\New
\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New
\\LE345\jskx\rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse
Expected output:
johnny\Desktop\TEST
matt\Desktop\TEST\NEW
matt\Desktop\TEST\OLD\TEST
peter\Desktop\RUN\SUBFOLDER\New
ums\Desktop\RUN\SUBFOLDER\New
rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
vso\hsk\mwq\sdsf\kse
My previous question led to this new one. Thanks for your help.
Posted on 2021-09-02 05:44:21
You can use a regular expression with regexp_replace.
from pyspark.sql import functions as F
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\", ""))
You can also make this solution more flexible.
from pyspark.sql import functions as F
no_of_slashes=4 # number of slashes to consider here
# we build the regular expression by repeating `"[a-zA-Z0-9]+\\\\"`
# NB. We subtract 2 since we start with the first 2 slashes
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\" + ("[a-zA-Z0-9]+\\\\" * (no_of_slashes - 2)), ""))
Let me know if this works for you.
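If you want to sanity-check the pattern before running it on the cluster, here is a minimal sketch that builds the same regular expression and applies it to a few of the sample rows with Python's built-in re module. The helper names build_prefix_pattern and strip_prefix are hypothetical and exist only for this illustration. It also assumes that Python's re and the Java regex engine used by Spark treat this simple pattern the same way, which should hold for anchors, character classes and escaped backslashes.

```python
import re


def build_prefix_pattern(no_of_slashes: int) -> str:
    # Hypothetical helper: two leading backslashes, followed by
    # (no_of_slashes - 2) repetitions of "alphanumeric segment + backslash".
    # Raw strings keep the escaping readable: r"\\\\" matches two literal
    # backslashes, r"\\" matches one.
    return r"^\\\\" + r"[a-zA-Z0-9]+\\" * (no_of_slashes - 2)


def strip_prefix(path: str, no_of_slashes: int = 4) -> str:
    # Remove everything up to and including the n-th backslash from the left.
    return re.sub(build_prefix_pattern(no_of_slashes), "", path)


samples = [
    r"\\D\Dev\johnny\Desktop\TEST",
    r"\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New",
    r"\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse",
]

for p in samples:
    print(strip_prefix(p))
```

Note that [a-zA-Z0-9]+ only matches letters and digits, so a path whose first segments contain other characters (an underscore or a hyphen, for example) is left unchanged. If your data has such names, the character class would need to be widened.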
https://stackoverflow.com/questions/69024095