
How can I read multiple CSV files and merge them together (they may not have the same columns)?

Stack Overflow user
Asked on 2022-04-12 16:21:21
1 answer · 543 views · 0 followers · 1 vote

My HDFS has a folder named "mapped_files" that contains several CSV files: "mapped_file_1.csv", "mapped_file_2.csv", and so on. How can I merge all of these files? They may not have exactly the same columns. For example, when I read "mapped_file_1.csv" and "mapped_file_2.csv" with PySpark, they look like this:

###mapped_file_1.csv

+----------+---------+---------+--------+-----+-----+-----------+----------+------------------+---------------------+-------------------+---------------+--------------------+
|chromosome|    start|      end|assembly|  ref|  alt|risk_allele|     genes|         phenotype|clinical_significance|polyphen_prediction|sift_prediction|                hgvs|
+----------+---------+---------+--------+-----+-----+-----------+----------+------------------+---------------------+-------------------+---------------+--------------------+
|         9| 96369762| 96369762|    null|  C/T|  C/T|          T|intergenic|Migraine with aura|                 null|               null|           null|          rs59270819|
|        10| 29075768| 29075768|    null|G/A/C|G/A/C|          A|intergenic|Migraine with aura|                 null|               null|           null|          rs59495588|
+----------+---------+---------+--------+-----+-----+-----------+----------+------------------+---------------------+-------------------+---------------+--------------------+

###mapped_file_2.csv

+------------+----------+----------+----------+--------------------+-----------+-------------------+---------------+--------------------+--------+-------------+--------+--------+---+---+-----------+--------------------+----------------------+--------------+---------------------+----------------+--------------------+--------------------+----------+--------------------+--------+-----+----+-------+----+-------+-------+--------------------+-------+--------------------+-----------------+
|variant_name|variant_id|chromosome|     genes|        variant_type|description|polyphen_prediction|sift_prediction|                hgvs|assembly|assembly.date|   start|     end|ref|alt|risk_allele|           phenotype|clinical_actionability|classification|clinical_significance|          method|  assertion_criteria|     level_certainty|      date|              author|  origin|title|year|authors|pmid|is_gwas|   name|                 url|version|databanks.variant_id|clinvar_accession|
+------------+----------+----------+----------+--------------------+-----------+-------------------+---------------+--------------------+--------+-------------+--------+--------+---+---+-----------+--------------------+----------------------+--------------+---------------------+----------------+--------------------+--------------------+----------+--------------------+--------+-----+----+-------+----+-------+-------+--------------------+-------+--------------------+-----------------+
|        null|      null|         1|['TARDBP']|single nucleotide...|       null|               null|           null|['Q13148:p.Ala90V...|  GRCh38|         null|11016874|11016874|  C|  T|          T|Amyotrophic later...|                  null|          null| Uncertain signifi...| literature only|                null|no assertion crit...|2019-07-02|         GeneReviews|germline| null|null|   null|null|   null|ClinVar|https://www.ncbi....|   null|                null|     VCV000021481|
|        null|      null|         1|['TARDBP']|single nucleotide...|       null|               null|           null|['Q13148:p.Ala90V...|  GRCh38|         null|11016874|11016874|  C|  T|          T|Amyotrophic later...|                  null|          null| Uncertain signifi...|clinical testing|Invitae Variant C...|criteria provided...|2019-08-15|             Invitae|germline| null|null|   null|null|   null|ClinVar|https://www.ncbi....|   null|                null|     VCV000021481|
|        null|      null|         1|['TARDBP']|single nucleotide...|       null|               null|           null|['Q13148:p.Ala90V...|  GRCh38|         null|11016874|11016874|  C|  T|          T|Amyotrophic later...|                  null|          null| Uncertain signifi...| literature only|                null|no assertion crit...|2019-07-02|         GeneReviews|germline| null|null|   null|null|   null|ClinVar|https://www.ncbi....|   null|                null|     VCV000021481|
+------------+----------+----------+----------+--------------------+-----------+-------------------+---------------+--------------------+--------+-------------+--------+--------+---+---+-----------+--------------------+----------------------+--------------+---------------------+----------------+--------------------+--------------------+----------+--------------------+--------+-----+----+-------+----+-------+-------+--------------------+-------+--------------------+-----------------+

As the dataframes/files above show, some columns exist in only one of the two dataframes/files. I did this:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from functools import reduce
import pyspark.sql.functions as F

warehouse_location = 'hdfs://hdfs-nn:9000'

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("csv") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .getOrCreate()

all_data = spark.read.options(header='True', delimiter=';').csv("hdfs://hdfs-nn:9000/mapped_files/*")

all_data.show()

+------------+----------+----------+----------+--------------------+-----------+-------------------+---------------+--------------------+--------+-------------+--------+----------+----+----+-----------+--------------------+----------------------+--------------+---------------------+----------------+--------------------+--------------------+----------+--------------------+--------+-----+----+-------+----+-------+-------+--------------------+-------+--------------------+-----------------+
|variant_name|variant_id|chromosome|     genes|        variant_type|description|polyphen_prediction|sift_prediction|                hgvs|assembly|assembly.date|   start|       end| ref| alt|risk_allele|           phenotype|clinical_actionability|classification|clinical_significance|          method|  assertion_criteria|     level_certainty|      date|              author|  origin|title|year|authors|pmid|is_gwas|   name|                 url|version|databanks.variant_id|clinvar_accession|
+------------+----------+----------+----------+--------------------+-----------+-------------------+---------------+--------------------+--------+-------------+--------+----------+----+----+-----------+--------------------+----------------------+--------------+---------------------+----------------+--------------------+--------------------+----------+--------------------+--------+-----+----+-------+----+-------+-------+--------------------+-------+--------------------+-----------------+
|        null|      null|         1|['TARDBP']|single nucleotide...|       null|               null|           null|['Q13148:p.Ala90V...|  GRCh38|         null|11016874|  11016874|   C|   T|          T|Amyotrophic later...|                  null|          null| Uncertain signifi...| literature only|                null|no assertion crit...|2019-07-02|         GeneReviews|germline| null|null|   null|null|   null|ClinVar|https://www.ncbi....|   null|                null|     VCV000021481|
|        null|      null|         1|['TARDBP']|single nucleotide...|       null|               null|           null|['Q13148:p.Ala90V...|  GRCh38|         null|11016874|  11016874|   C|   T|          T|Amyotrophic later...|                  null|          null| Uncertain signifi...|clinical testing|Invitae Variant C...|criteria provided...|2019-08-15|             Invitae|germline| null|null|   null|null|   null|ClinVar|https://www.ncbi....|   null|                null|     VCV000021481|
|        null|      null|         1|['TARDBP']|single nucleotide...|       null|               null|           null|['Q13148:p.Ala90V...|  GRCh38|         null|11016874|  11016874|   C|   T|          T|Amyotrophic later...|                  null|          null| Uncertain signifi...| literature only|                null|no assertion crit...|2019-07-02|         GeneReviews|germline| null|null|   null|null|   null|ClinVar|https://www.ncbi....|   null|                null|     VCV000021481|
|           9|  96369762|  96369762|      null|                 C/T|        C/T|                  T|     intergenic|  Migraine with aura|    null|         null|    null|rs59270819|null|null|       null|                null|                  null|          null|                 null|            null|                null|                null|      null|                null|    null| null|null|   null|null|   null|   null|                null|   null|                null|             null|
|          10|  29075768|  29075768|      null|               G/A/C|      G/A/C|                  A|     intergenic|  Migraine with aura|    null|         null|    null|rs59495588|null|null|       null|                null|                  null|          null|                 null|            null|                null|                null|      null|                null|    null| null|null|   null|null|   null|   null|                null|   null|                null|             null|
+------------+----------+----------+----------+--------------------+-----------+-------------------+---------------+--------------------+--------+-------------+--------+----------+----+----+-----------+--------------------+----------------------+--------------+---------------------+----------------+--------------------+--------------------+----------+--------------------+--------+-----+----+-------+----+-------+-------+--------------------+-------+--------------------+-----------------+
only showing top 20 rows

When I use the code above, the values of the shared columns do not end up in the right place (in the last two rows, the values are under the wrong columns). This is presumably because Spark derives a single schema for the whole glob and then parses the other files' rows by position rather than by column name.
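The misalignment can be illustrated locally with pandas (used here only as a stand-in, with hypothetical column names and values): forcing a row from one file into another file's schema by position puts values under the wrong headers, whereas matching by name keeps them in place.

```python
import pandas as pd

# file 1's actual header and one data row (hypothetical values)
file1_header = ["chromosome", "risk_allele"]
file1_row = ["9", "T"]

# the schema derived from the OTHER file
derived_schema = ["chromosome", "assembly"]

# Positional parsing: the row is matched to the schema column-by-column,
# so the risk allele "T" lands under "assembly"
misaligned = pd.DataFrame([file1_row], columns=derived_schema)
print(misaligned.loc[0, "assembly"])  # T

# Matching by NAME instead: keep file 1's own header, then reindex to the
# union of columns; "T" stays under "risk_allele" and "assembly" becomes NaN
aligned = pd.DataFrame([file1_row], columns=file1_header).reindex(
    columns=["chromosome", "assembly", "risk_allele"]
)
print(aligned.loc[0, "risk_allele"])  # T
```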

So my question is: how can I read multiple CSV files and merge them together (they may not have the same columns)?


1 Answer

Stack Overflow user

Accepted answer

Answered on 2022-04-12 16:29:42

The easy way is to add the missing columns to the dataframes and use a union function. I prefer unionByName, so I will use it in my example:

df1 = spark.read.options(header='True', delimiter=';').csv("mapped_file_1.csv")
df2 = spark.read.options(header='True', delimiter=';').csv("mapped_file_2.csv")

united_df = df1.unionByName(df2, allowMissingColumns=True)

allowMissingColumns fills the columns missing from either dataframe with NULL.
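For readers without a Spark session at hand, the same by-name alignment can be sketched with pandas, whose pd.concat matches columns by name and fills the gaps with NaN (the column names and data below are made up for illustration):

```python
import pandas as pd

# Two frames with only partially overlapping columns,
# standing in for mapped_file_1.csv and mapped_file_2.csv
df1 = pd.DataFrame({"chromosome": [9, 10], "risk_allele": ["T", "A"]})
df2 = pd.DataFrame({"chromosome": [1], "assembly": ["GRCh38"]})

# concat aligns columns by NAME, like unionByName(allowMissingColumns=True);
# a column absent from one frame is filled with NaN for that frame's rows
united = pd.concat([df1, df2], ignore_index=True)
print(sorted(united.columns))  # ['assembly', 'chromosome', 'risk_allele']
```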

If you have more than two files, you can just define a function and use reduce to merge all the dataframes:

def unite_dfs(df1, df2):
  return df1.unionByName(df2, allowMissingColumns=True)


list_of_dfs = [df1, df2, df3, df4, df5, df6]
united_df = reduce(unite_dfs, list_of_dfs)
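Applied to the original problem, the key step is to read each file separately (so each keeps its own schema) and only then fold the list. A runnable local sketch of that workflow, again using pandas as a stand-in for Spark and tiny hypothetical CSVs in a temp directory:

```python
import csv
import glob
import os
import tempfile
from functools import reduce

import pandas as pd

# Build a small stand-in for the HDFS folder (hypothetical contents)
workdir = tempfile.mkdtemp()
files = [
    ("mapped_file_1.csv", ["chromosome", "risk_allele"], [["9", "T"]]),
    ("mapped_file_2.csv", ["chromosome", "assembly"], [["1", "GRCh38"]]),
]
for fname, header, data in files:
    with open(os.path.join(workdir, fname), "w", newline="") as f:
        w = csv.writer(f, delimiter=";")
        w.writerow(header)
        w.writerows(data)

# Read each file SEPARATELY, so each dataframe keeps its own header...
frames = [
    pd.read_csv(path, sep=";")
    for path in sorted(glob.glob(os.path.join(workdir, "*.csv")))
]

# ...then fold the list with a by-name union, mirroring
# reduce(unite_dfs, list_of_dfs) from the answer above
united = reduce(lambda a, b: pd.concat([a, b], ignore_index=True), frames)
print(sorted(united.columns))  # ['assembly', 'chromosome', 'risk_allele']
```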

Let me know if this is clear. I did not include the imports, since I only used libraries already present in your code snippet. I can edit the answer if anything is unclear.

2 votes
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/71846107