I'm trying to bucketize the columns whose names contain the word "road" in a ~5k-row dataset and create a new DataFrame.
I don't know how to do this; here is what I have tried:
from pyspark.ml.feature import Bucketizer

spike_cols = [col for col in df.columns if "road" in col]

for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    bucketedData = bucketizer.transform(df)

Posted on 2018-07-18 12:58:14
Either modify df in the loop:
from pyspark.ml.feature import Bucketizer

for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    df = bucketizer.transform(df)

or use a Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer

model = Pipeline(stages=[
    Bucketizer(
        splits=[-float("inf"), 10, 100, float("inf")],
        inputCol=x, outputCol=x + "bucket") for x in spike_cols
]).fit(df)
model.transform(df)

Posted on 2021-05-22 13:07:06
Since 3.0.0, Bucketizer can map multiple columns at once by setting the inputCols parameter.
So this becomes much easier:
from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), 10, 100, float("inf")]
params = [(col, col + 'bucket', splits) for col in df.columns if "road" in col]
input_cols, output_cols, splits_array = zip(*params)

bucketizer = Bucketizer(inputCols=list(input_cols), outputCols=list(output_cols),
                        splitsArray=list(splits_array))
bucketedData = bucketizer.transform(df)

https://stackoverflow.com/questions/51402369