My input Spark DataFrame, named df, is:
+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
|         725153|             3.0|                    2.0|
|         873008|             4.0|                    1.0|
|         625109|             1.0|                    0.0|
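+---------------+----------------+-----------------------+
I need to remove the special characters from df's column names, as follows:
remove +
replace space with underscore
replace dot with underscore
So my df should look like this: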
+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|         725153|            3.0|                    2.0|
|         873008|            4.0|                    1.0|
|         625109|            1.0|                    0.0|
+---------------+---------------+-----------------------+
I implemented this in Scala as follows:
var tableWithColumnsRenamed = df
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}
df = tableWithColumnsRenamed
When I instead chain the three renames in a single loop:
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}
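}
With this version I get the column name 126 Concentrate instead of 126_Concentrate.
But I don't like using three for loops for this replacement. Is there a better way to do it?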
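A likely explanation for the chained version: withColumnRenamed is documented as a no-op when the schema does not contain the existing name, and all three calls look up the original `field`. Once the second call has renamed the column, the third call can no longer find it. The sketch below imitates that behaviour on a plain list of names; `rename` is a hypothetical stand-in written for illustration, not Spark's implementation.

```scala
object ChainedRenameDemo {
  // Hypothetical stand-in for DataFrame.withColumnRenamed, acting on a plain
  // list of column names: if `existing` is absent, nothing changes.
  def rename(cols: List[String], existing: String, newName: String): List[String] =
    cols.map(c => if (c == existing) newName else c)

  def main(args: Array[String]): Unit = {
    val field = "126+ Concentrate"
    val cols = List("Main_CustomerID", field)

    // No "." in the name, so the column keeps its name.
    val step1 = rename(cols, field, field.replaceAll("\\.", "_"))
    // The column becomes "126 Concentrate".
    val step2 = rename(step1, field, field.replaceAll("\\+", ""))
    // `field` ("126+ Concentrate") no longer exists, so this is a no-op.
    val step3 = rename(step2, field, field.replaceAll(" ", "_"))

    println(step3) // List(Main_CustomerID, 126 Concentrate)
  }
}
```

Applying all replacements to the name first and renaming once per column avoids the problem.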
Posted on 2018-06-29 09:32:50
df
  .columns
  .foldLeft(df){(newdf, colname) =>
    // Drop "+" as well, since the question asks for it to be removed
    newdf.withColumnRenamed(colname, colname.replace("+", "").replace(" ", "_").replace(".", "_"))
  }
  .show
Posted on 2018-06-29 09:42:53
You can use withColumnRenamed together with the regex replaceAllIn and foldLeft, as follows:
val columns = df.columns
val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))
val resultDF = replacingColumns.zip(columns).foldLeft(df){(tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)}
resultDF.show(false)
This should give you:
+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153         |3.0            |2.0                    |
|873008         |4.0            |1.0                    |
|625109         |1.0            |0.0                    |
+---------------+---------------+-----------------------+
I hope the answer is helpful.
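For comparison, here is a minimal sketch that keeps the name cleaning in a pure function, so it can be checked without a Spark session. The object name and `sanitize` are made up for illustration. Unlike the regex above, it removes "+" outright and leaves existing underscores alone.

```scala
object SanitizeColumnNames {
  // Drop every "+", then turn each run of spaces or dots into a single "_".
  def sanitize(name: String): String =
    name.replace("+", "").replaceAll("[ .]+", "_")

  def main(args: Array[String]): Unit = {
    val columns = Array("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
    // Prints one cleaned name per line:
    // Main_CustomerID
    // 126_Concentrate
    // 2_5_Ethylhexyl_Acrylate
    columns.map(sanitize).foreach(println)
  }
}
```

With a real DataFrame, the cleaned names could then be applied in one call, for example df.toDF(df.columns.map(SanitizeColumnNames.sanitize): _*), assuming no two columns end up with the same name.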
Posted on 2019-10-04 07:21:14
In Java, you can iterate over the column names with df.columns() and clean each header string with String replaceAll(regexPattern, IntendedCharreplacement).
Then use withColumnRenamed(headerName, correctedHeaderName) to rename the df header.
For example:
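for (String headerName : dataset.columns()) {
    // "+" is a regex metacharacter and must be escaped in replaceAll;
    // an unescaped "+" throws PatternSyntaxException.
    String correctedHeaderName = headerName.replaceAll("\\+", "").replaceAll("[ .]", "_");
    dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}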
dataset.show();
https://stackoverflow.com/questions/51097818