我是星火的新手。读取与CSV文件相关联的查询。
我试图读取2个CSV文件中的2个单独的数据文件,并‘从每个5行。然而,我看到的只是最后的数据操作正在打印。我是不是遗漏了什么?为什么第一次CSV数据不被打印?
# Read first CSV
file_location1 = "/FileStore/tables/airports.csv"
file_type1 = "csv"
# CSV options
infer_schema1 = "true"
first_row_is_header1 = "false"
delimiter1 = ","
# Load File 1
df1 = spark.read.format(file_type1) \
.option("inferSchema", infer_schema1) \
.option("header", first_row_is_header1) \
.option("sep", delimiter1) \
.load(file_location1)
**df1.take(5)**
# Read second CSV
file_location2 = "/FileStore/tables/Report.csv"
file_type2 = "csv"
# CSV options
infer_schema2 = "true"
first_row_is_header2 = "true"
delimiter2 = ","
# Load File 2
df2 = spark.read.format(file_type2) \
.option("inferSchema", infer_schema2) \
.option("header", first_row_is_header2) \
.option("sep", delimiter2) \
.load(file_location2)
**df2.take(5)** 输出:只有第二个数据输出可以看到 (https://i.stack.imgur.com/bO7GO.png)
df1:pyspark.sql.dataframe.DataFrame
_c0:integer
_c1:string
_c2:string
_c3:string
_c4:string
_c5:string
_c6:double
_c7:double
_c8:integer
_c9:string
_c10:string
_c11:string
_c12:string
_c13:string
df2:pyspark.sql.dataframe.DataFrame = [Parcel(s): string, Building Name: string ... 100 more fields]
Out[1]: [Row(Parcel(s)='0022/012', Building Name='580 NORTH POINT ST', Building Address='580 NORTH POINT ST', Postal Code=94133, Full.Address='POINT (-122.416746 37.806186)', Floor Area=24022, Property Type='Commercial', Property Type - Self Selected='Hotel', PIM Link='http://propertymap.sfplanning.org/?&search=0022/012', Year Built=1900, Energy Audit Due Date=datetime.datetime(2013, 4, 1, 0, 0), Energy Audit Status='Did Not Comply', Benchmark 2018 Status='Violation - Did Not Report', 2018 Reason for Exemption=None, Benchmark 2017 Status='Violation - Did Not Report', 2017 Reason for Exemption=None, Benchmark 2016 Status='Violation - Did Not Report', 2016 Reason for Exemption=None, Benchmark 2015 Status='Violation - Did Not Report', 2015 Reason for Exemption=None, Benchmark 2014 Status='Violation - Did Not Report', 2014 Reason for Exemption=None, Benchmark 2013 Status='Violation - Did Not Report', 2013 Reason for Exemption=None, Benchmark 2012 Status='Violation - Did Not Report', 2012 Reason for Exemption=None, Benchmark 2011 Status='Exempt', 2011 Reason for Exemption='SqFt Not Subject This Year', Benchmark 2010 Status='Exempt', 2010 Reason for Exemption='SqFt Not Subject This Year', 2018 ENERGY STAR Score=None, 2018 Site EUI (kBtu/ft2)=None, 2018 Source EUI (kBtu/ft2)=None, 2018 Percent Better than National Median Site EUI=None, 2018 Percent Better than National Median Source EUI=None, 2018 Total GHG Emissions (Metric Tons CO2e)=None, 2018 Total GHG Emissions Intensity (kgCO2e/ft2)=None, 2018 Weather Normalized Site EUI (kBtu/ft2)=None, 2018 Weather Normalized 发布于 2020-04-16 22:29:02
为什么会这样?
take操作不会将结果打印到标准输出。df2。另一方面,df1永远不会被计算,并且只知道模式是被推断出来的。尝试使用显示,就像df1.show(n=5)一样,它应该触发对数据的评估。
发布于 2020-04-17 20:23:49
我认为Databricks应该是这样工作的。每个笔记本环境都是这样做的。它只显示最后的结果。因此,如果要显示这两个数据格式的结果,可能应该将它们写入两个不同的单元格中。你在一个单元格里写了完整的代码。它并不意味着要以这种方式使用。您编写单行并验证它是否成功,然后添加新行。因此,为了尝试,您可以在两个单独的单元格中添加以下内容,您将看到这两个单元格的结果。
# Read first CSV
file_location1 = "/FileStore/tables/airports.csv"
file_type1 = "csv"
# CSV options
infer_schema1 = "true"
first_row_is_header1 = "false"
delimiter1 = ","
# Load File 1
df1 = spark.read.format(file_type1) \
.option("inferSchema", infer_schema1) \
.option("header", first_row_is_header1) \
.option("sep", delimiter1) \
.load(file_location1)
df1.take(5)# Read second CSV
file_location2 = "/FileStore/tables/Report.csv"
file_type2 = "csv"
# CSV options
infer_schema2 = "true"
first_row_is_header2 = "true"
delimiter2 = ","
# Load File 2
df2 = spark.read.format(file_type2) \
.option("inferSchema", infer_schema2) \
.option("header", first_row_is_header2) \
.option("sep", delimiter2) \
.load(file_location2)
df2.take(5)如果要打印它,就必须显式地使用print()。我希望这能帮到你。
https://stackoverflow.com/questions/61257291
复制相似问题