
Spark DataFrames - processing 2 CSV files

Stack Overflow user
Asked 2020-04-16 18:22:21
2 answers · 132 views · 0 followers · score 1

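I'm new to Spark and have a question about reading CSV files.

I'm trying to read two CSV files into two separate DataFrames and take 5 rows from each. However, only the result of the last DataFrame operation is printed. Am I missing something? Why isn't the data from the first CSV printed?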

# Read first CSV

file_location1 = "/FileStore/tables/airports.csv"
file_type1 = "csv"

# CSV options
infer_schema1 = "true"
first_row_is_header1 = "false"
delimiter1 = ","

# Load File 1 
df1 = spark.read.format(file_type1) \
  .option("inferSchema", infer_schema1) \
  .option("header", first_row_is_header1) \
  .option("sep", delimiter1) \
  .load(file_location1)

df1.take(5)

# Read second CSV 

file_location2 = "/FileStore/tables/Report.csv"
file_type2 = "csv"

# CSV options
infer_schema2 = "true"
first_row_is_header2 = "true"
delimiter2 = ","

# Load File 2 
df2 = spark.read.format(file_type2) \
  .option("inferSchema", infer_schema2) \
  .option("header", first_row_is_header2) \
  .option("sep", delimiter2) \
  .load(file_location2)

df2.take(5)

Output: only the second DataFrame's output is shown (https://i.stack.imgur.com/bO7GO.png)

df1:pyspark.sql.dataframe.DataFrame
_c0:integer
_c1:string
_c2:string
_c3:string
_c4:string
_c5:string
_c6:double
_c7:double
_c8:integer
_c9:string
_c10:string
_c11:string
_c12:string
_c13:string
df2:pyspark.sql.dataframe.DataFrame = [Parcel(s): string, Building Name: string ... 100 more fields]
Out[1]: [Row(Parcel(s)='0022/012', Building Name='580 NORTH POINT ST', Building Address='580 NORTH POINT ST', Postal Code=94133, Full.Address='POINT (-122.416746 37.806186)', Floor Area=24022, Property Type='Commercial', Property Type - Self Selected='Hotel', PIM Link='http://propertymap.sfplanning.org/?&search=0022/012', Year Built=1900, Energy Audit Due Date=datetime.datetime(2013, 4, 1, 0, 0), Energy Audit Status='Did Not Comply', Benchmark 2018 Status='Violation - Did Not Report', 2018 Reason for Exemption=None, Benchmark 2017 Status='Violation - Did Not Report', 2017 Reason for Exemption=None, Benchmark 2016 Status='Violation - Did Not Report', 2016 Reason for Exemption=None, Benchmark 2015 Status='Violation - Did Not Report', 2015 Reason for Exemption=None, Benchmark 2014 Status='Violation - Did Not Report', 2014 Reason for Exemption=None, Benchmark 2013 Status='Violation - Did Not Report', 2013 Reason for Exemption=None, Benchmark 2012 Status='Violation - Did Not Report', 2012 Reason for Exemption=None, Benchmark 2011 Status='Exempt', 2011 Reason for Exemption='SqFt Not Subject This Year', Benchmark 2010 Status='Exempt', 2010 Reason for Exemption='SqFt Not Subject This Year', 2018 ENERGY STAR Score=None, 2018 Site EUI (kBtu/ft2)=None, 2018 Source EUI (kBtu/ft2)=None, 2018 Percent Better than National Median Site EUI=None, 2018 Percent Better than National Median Source EUI=None, 2018 Total GHG Emissions (Metric Tons CO2e)=None, 2018 Total GHG Emissions Intensity (kgCO2e/ft2)=None, 2018 Weather Normalized Site EUI (kBtu/ft2)=None, 2018 Weather Normalized 

2 Answers

Stack Overflow user

Answered 2020-04-16 22:29:02

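Why does this happen?

  • take does not print its result to standard output; it returns a list of Row objects.
  • Spark transformations are lazy, so they are evaluated only when you request a final result (for example, printing it to the screen or saving it to a file).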
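  • The last line of the cell is a bare expression, so the notebook displays its value, which is why the five rows of df2 appear. df1.take(5) also runs, since take is an action, but its return value is discarded and never displayed; only the inferred schema of df1 is shown.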

Try show, for example df1.show(n=5); it evaluates the data and prints the rows itself.
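A minimal sketch of that suggestion for a single cell, assuming df1 and df2 have been loaded as in the question. `show_all` is a hypothetical helper name, not part of PySpark; it relies only on `DataFrame.show`, which prints rows itself instead of returning them.

```python
def show_all(*dataframes, n=5):
    """Print the first n rows of every DataFrame passed in, in order.

    Hypothetical helper: it only assumes each argument has a show(n)
    method, as pyspark.sql.DataFrame does.
    """
    for df in dataframes:
        # show() writes a formatted table to stdout and returns None,
        # so the output does not depend on being the last line of a cell.
        df.show(n)

# In the notebook cell from the question this would be called as:
#   show_all(df1, df2)
```

Because `show` prints as a side effect, both tables appear even when the calls sit in the middle of a cell.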

Score: 0

Stack Overflow user

Answered 2020-04-17 20:23:49

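I think this is how Databricks is meant to work, and every notebook environment behaves the same way: a cell displays only its last result. If you want to see the results for both DataFrames, put them in two different cells. You wrote all of the code in one cell, which is not how notebooks are meant to be used; the usual workflow is to write a line, check that it succeeds, and then add the next one. To try it, put the following in two separate cells and you will see the result of each.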

Cell 1:
# Read first CSV

file_location1 = "/FileStore/tables/airports.csv"
file_type1 = "csv"

# CSV options
infer_schema1 = "true"
first_row_is_header1 = "false"
delimiter1 = ","

# Load File 1 
df1 = spark.read.format(file_type1) \
  .option("inferSchema", infer_schema1) \
  .option("header", first_row_is_header1) \
  .option("sep", delimiter1) \
  .load(file_location1)

df1.take(5)

Cell 2:
# Read second CSV 

file_location2 = "/FileStore/tables/Report.csv"
file_type2 = "csv"

# CSV options
infer_schema2 = "true"
first_row_is_header2 = "true"
delimiter2 = ","

# Load File 2 
df2 = spark.read.format(file_type2) \
  .option("inferSchema", infer_schema2) \
  .option("header", first_row_is_header2) \
  .option("sep", delimiter2) \
  .load(file_location2)

df2.take(5)

If you want to print a result from the middle of a cell, you have to call print() explicitly. I hope this helps.
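The display behaviour described in this answer can be imitated in plain Python. The sketch below is a rough illustration, not how Databricks is actually implemented: `run_cell` is a hypothetical function that executes every statement of a cell but hands back only the value of a trailing bare expression, and plain lists stand in for the `take(5)` results.

```python
import ast

def run_cell(source, namespace=None):
    """Execute a cell; return the value of its last bare expression, if any.

    Hypothetical stand-in for a notebook front end (an assumption, not
    the real Databricks code path).
    """
    namespace = {} if namespace is None else namespace
    tree = ast.parse(source)
    last_expr = None
    # Split off a trailing bare expression so it can be evaluated separately.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(tree.body.pop().value)
    # Everything else runs for its side effects only; values of bare
    # expressions in the middle of the cell are computed and discarded.
    exec(compile(tree, "<cell>", "exec"), namespace)
    if last_expr is None:
        return None
    return eval(compile(last_expr, "<cell>", "eval"), namespace)

cell = """
first = [1, 2, 3][:2]
first
second = [4, 5, 6][:2]
second
"""
shown = run_cell(cell)
print(shown)  # only the value of the last expression comes back: [4, 5]
```

The bare `first` line is evaluated but its value goes nowhere, which mirrors what happens to `df1.take(5)` in the question; wrapping it in `print()` or moving it to its own cell makes it visible.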

Score: 0
Page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/61257291
