
Question: Spark DataFrames - processing 2 CSV files

Stack Overflow user

Asked 2020-04-16 18:22:21

Answers: 2 | Views: 132 | Followers: 0 | Votes: 1

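I am new to Spark and have a question about reading CSV files.

I am trying to read 2 CSV files into 2 separate DataFrames and take 5 rows from each. However, only the output of the last DataFrame operation is printed. Am I missing something? Why is the data from the first CSV not printed?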

# Read first CSV

file_location1 = "/FileStore/tables/airports.csv"
file_type1 = "csv"

# CSV options
infer_schema1 = "true"
first_row_is_header1 = "false"
delimiter1 = ","

# Load File 1 
df1 = spark.read.format(file_type1) \
  .option("inferSchema", infer_schema1) \
  .option("header", first_row_is_header1) \
  .option("sep", delimiter1) \
  .load(file_location1)

df1.take(5)

# Read second CSV 

file_location2 = "/FileStore/tables/Report.csv"
file_type2 = "csv"

# CSV options
infer_schema2 = "true"
first_row_is_header2 = "true"
delimiter2 = ","

# Load File 2 
df2 = spark.read.format(file_type2) \
  .option("inferSchema", infer_schema2) \
  .option("header", first_row_is_header2) \
  .option("sep", delimiter2) \
  .load(file_location2)

df2.take(5)

Output: only the second DataFrame's output is visible (https://i.stack.imgur.com/bO7GO.png)

df1:pyspark.sql.dataframe.DataFrame
_c0:integer
_c1:string
_c2:string
_c3:string
_c4:string
_c5:string
_c6:double
_c7:double
_c8:integer
_c9:string
_c10:string
_c11:string
_c12:string
_c13:string
df2:pyspark.sql.dataframe.DataFrame = [Parcel(s): string, Building Name: string ... 100 more fields]
Out[1]: [Row(Parcel(s)='0022/012', Building Name='580 NORTH POINT ST', Building Address='580 NORTH POINT ST', Postal Code=94133, Full.Address='POINT (-122.416746 37.806186)', Floor Area=24022, Property Type='Commercial', Property Type - Self Selected='Hotel', PIM Link='http://propertymap.sfplanning.org/?&search=0022/012', Year Built=1900, Energy Audit Due Date=datetime.datetime(2013, 4, 1, 0, 0), Energy Audit Status='Did Not Comply', Benchmark 2018 Status='Violation - Did Not Report', 2018 Reason for Exemption=None, Benchmark 2017 Status='Violation - Did Not Report', 2017 Reason for Exemption=None, Benchmark 2016 Status='Violation - Did Not Report', 2016 Reason for Exemption=None, Benchmark 2015 Status='Violation - Did Not Report', 2015 Reason for Exemption=None, Benchmark 2014 Status='Violation - Did Not Report', 2014 Reason for Exemption=None, Benchmark 2013 Status='Violation - Did Not Report', 2013 Reason for Exemption=None, Benchmark 2012 Status='Violation - Did Not Report', 2012 Reason for Exemption=None, Benchmark 2011 Status='Exempt', 2011 Reason for Exemption='SqFt Not Subject This Year', Benchmark 2010 Status='Exempt', 2010 Reason for Exemption='SqFt Not Subject This Year', 2018 ENERGY STAR Score=None, 2018 Site EUI (kBtu/ft2)=None, 2018 Source EUI (kBtu/ft2)=None, 2018 Percent Better than National Median Site EUI=None, 2018 Percent Better than National Median Source EUI=None, 2018 Total GHG Emissions (Metric Tons CO2e)=None, 2018 Total GHG Emissions Intensity (kgCO2e/ft2)=None, 2018 Weather Normalized Site EUI (kBtu/ft2)=None, 2018 Weather Normalized

csv

apache-spark

pyspark

apache-spark-sql

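2 Answers

Stack Overflow user

Answered 2020-04-16 22:29:02

Why does this happen?

The take operation does not print its results to standard output.
Spark operations are lazy, so all transformations are evaluated only when you ask for a final result (for example, printing the result to the screen or saving it to a file).
In the last line, Spark guesses that you actually want to see the result, so it computes df2 by reading five rows from the file. df1, on the other hand, is never computed, and only its schema is known because it was inferred.

Try using show, as in df1.show(n=5); it should trigger evaluation of the data.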
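As a rough sketch of that suggestion, the helper below calls show so that the rows are printed wherever the call sits in a cell, and it also returns the rows from take for later use. The name `preview` and its `label` parameter are hypothetical and do not come from the question. The code assumes only that the object passed in offers the `show(n=...)` and `take(n)` methods that PySpark DataFrames provide.

```python
def preview(df, label, n=5):
    """Print the first n rows of a Spark DataFrame and return them.

    `preview` and `label` are illustrative names, not part of PySpark.
    """
    print(f"--- {label} ---")
    # show() writes a formatted table to standard output.
    df.show(n=n)
    # take() returns a list of Row objects to the caller; it prints nothing.
    return df.take(n)
```

With the DataFrames from the question, this would be called as preview(df1, "airports") followed by preview(df2, "report"), and both tables should then appear in the output of a single cell.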

Votes: 0

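Stack Overflow user

Answered 2020-04-17 20:23:49

I think this is how Databricks is supposed to work. Every notebook environment behaves this way: it displays only the last result. So if you want to see the results for both DataFrames, you should probably write them in two different cells. You wrote the complete code in one cell, and it is not meant to be used that way. You write a single line, verify that it succeeds, and then add the next one. So, to try it out, you can put the following into two separate cells, and you will see the results of both.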

# Read first CSV

file_location1 = "/FileStore/tables/airports.csv"
file_type1 = "csv"

# CSV options
infer_schema1 = "true"
first_row_is_header1 = "false"
delimiter1 = ","

# Load File 1 
df1 = spark.read.format(file_type1) \
  .option("inferSchema", infer_schema1) \
  .option("header", first_row_is_header1) \
  .option("sep", delimiter1) \
  .load(file_location1)

df1.take(5)

# Read second CSV 

file_location2 = "/FileStore/tables/Report.csv"
file_type2 = "csv"

# CSV options
infer_schema2 = "true"
first_row_is_header2 = "true"
delimiter2 = ","

# Load File 2 
df2 = spark.read.format(file_type2) \
  .option("inferSchema", infer_schema2) \
  .option("header", first_row_is_header2) \
  .option("sep", delimiter2) \
  .load(file_location2)

df2.take(5)

If you want to print it, you have to use print() explicitly. I hope this helps.
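The "only the last result is displayed" behaviour can be illustrated without Spark. The sketch below is a simplified model of how a notebook cell is commonly handled, not Databricks' actual implementation, and `run_cell` is a hypothetical name. Every statement in the cell is executed, but only the value of a trailing expression is handed back for display.

```python
import ast


def run_cell(source, namespace=None):
    """Execute a cell and return the value of its last expression, if any."""
    namespace = {} if namespace is None else namespace
    tree = ast.parse(source)
    last_expr = None
    # Split off a trailing bare expression so its value can be captured.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(tree.body.pop().value)
    # Run everything else; values of earlier bare expressions are discarded.
    exec(compile(tree, "<cell>", "exec"), namespace)
    if last_expr is None:
        return None
    return eval(compile(last_expr, "<cell>", "eval"), namespace)


cell = "a = [1, 2, 3]\na[:2]\nb = [4, 5, 6]\nb[:2]"
print(run_cell(cell))  # prints [4, 5]; a[:2] was evaluated but not shown
```

In this model `a[:2]` plays the role of df1.take(5): it runs, but its value is dropped because it is not the last expression. Wrapping it in print(), or moving it to its own cell, makes it visible.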

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/61257291
