首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >读取奇怪的格式化程序CSV文件

读取奇怪的格式化程序CSV文件
EN

Stack Overflow用户
提问于 2016-03-15 10:58:32
回答 3查看 125关注 0票数 0

我正在从statistics.gov.scot网站下载一些数据。例如,我想提供一些关于住院率的数据。对我感兴趣的数据表进行源查询的格式如下:

代码语言:javascript
复制
http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall

并通过http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall访问,供那些想尝试的人使用。查询生成一个包含相关信息的*.CSV文件,但是该文件的格式带来了一些挑战。

文件示例

文件内容如下所示:

代码语言:javascript
复制
Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"

,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"

导入Excel时:

但是,当通过R导入read.csv时,如下所示:

代码语言:javascript
复制
> head(problematicFile)
                                                   V1                        V2
1             Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions       Hospital Admissions
3                                        measure type                          
4                                      Admission Type                          
5                                                 Age                          
6                                              Gender  

问题

read.csv导入只返回两列。我猜这个问题与一些初始列是空的有关。我想以类似于Excel实现的图解导入的方式读取这个文件。要点是,我打算在列AB中使用来自行7的vales,当然还有下面的数据表。就生成data.frame而言,我很乐意在有空单元格的情况下包含NA值,但是包含与Excel中相同的维度。我试过:

代码语言:javascript
复制
read.csv(file = link, header = FALSE, na.strings = "",
                               fill = TRUE)

但我总是遇到同样的问题。

期望的结果

所需的结果应该类似于(手工生成的提取):

代码语言:javascript
复制
Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00   NA  NA  NA  NA  NA  NA  NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA  NA  NA  NA  NA  NA  NA
measure type    NA  NA  NA  NA  NA  NA  NA  NA
Admission Type  NA  NA  NA  NA  NA  NA  NA  NA
Age NA  NA  NA  NA  NA  NA  NA  NA
Gender  NA  NA  NA  NA  NA  NA  NA  NA
Measure (cell values):  Ratio (Rate Per 100,000 Population)         NA  NA  NA  NA  NA
NA  NA  NA  NA  NA  NA  NA  NA  NA
NA  NA  http://reference.data.gov.uk/id/year/2002   http://reference.data.gov.uk/id/year/2003   http://reference.data.gov.uk/id/year/2004   http://reference.data.gov.uk/id/year/2005   http://reference.data.gov.uk/id/year/2006   http://reference.data.gov.uk/id/year/2007   http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area  2002    2003    2004    2005    2006    2007    2008
http://statistics.gov.scot/id/statistical-geography/S92000003   Scotland    9,351   9,262   9,261   9,347   9,723   10,517  10,293
http://statistics.gov.scot/id/statistical-geography/S16000082   Angus South 8,236   8,500   8,523   8,371   8,616   8,978   9,325
http://statistics.gov.scot/id/statistical-geography/S16000106   Edinburgh Northern and Leith    9,040   8,040   7,925   9,042   10,355  11,833  8,916
http://statistics.gov.scot/id/statistical-geography/S16000140   Renfrewshire South  9,391   9,122   9,491   9,586   10,425  10,900  11,065
http://statistics.gov.scot/id/statistical-geography/S16000108   Edinburgh Southern  5,878   5,910   6,101   6,035   7,426   9,343   6,766
http://statistics.gov.scot/id/statistical-geography/S16000075   Aberdeen Donside    10,047  10,963  10,629  10,512  10,383  10,787  10,685
http://statistics.gov.scot/id/statistical-geography/S16000137   Perthshire North    9,388   9,524   7,799   9,350   9,543   9,791   9,991
http://statistics.gov.scot/id/statistical-geography/S16000077   Aberdeenshire East  7,211   7,300   7,153   7,411   7,435   7,268   7,547
http://statistics.gov.scot/id/statistical-geography/S16000114   Galloway and West Dumfries  9,861   9,165   8,143   9,258   7,508   10,213  10,399
http://statistics.gov.scot/id/statistical-geography/S16000096   Dumbarton   8,703   8,570   8,727   9,310   9,389   9,885   10,237

屏幕截图

为了进一步说明,我想维护维度,并使用NAs填充缺失的值:

EN

回答 3

Stack Overflow用户

回答已采纳

发布于 2017-02-01 17:35:28

从头解析元数据有点棘手。您可能更喜欢下载整个规范化数据集,而不是跨表切片。

代码语言:javascript
复制
> reconv <- read.csv("http://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions")

> head(reconv)

  GeographyCode DateCode Measurement                              Units Value Gender Age
1     S92000003     2003        Mean Average reconvictions per offender  0.62    All All
2     S92000003     2004        Mean Average reconvictions per offender  0.33    All All
3     S92000003     2004        Mean Average reconvictions per offender  0.61    All All
4     S92000003     2005        Mean Average reconvictions per offender  0.60    All All
5     S92000003     2006        Mean Average reconvictions per offender  0.60    All All
6     S92000003     2007        Mean Average reconvictions per offender  0.11    All All

这将使所有元数据都处于因子级别(因此不必解析它):

代码语言:javascript
复制
> str(reconv)

'data.frame':   10119 obs. of  7 variables:
 $ GeographyCode: Factor w/ 26 levels "S12000005","S12000006",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ DateCode     : int  2003 2004 2004 2005 2006 2007 2007 2008 2008 2009 ...
 $ Measurement  : Factor w/ 2 levels "Mean","Ratio": 1 1 1 1 1 1 1 1 1 1 ...
 $ Units        : Factor w/ 2 levels "Average reconvictions per offender",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Value        : num  0.62 0.33 0.61 0.6 0.6 0.11 0.57 0.6 0.33 0.33 ...
 $ Gender       : Factor w/ 3 levels "All","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Age          : Factor w/ 6 levels "21-25","26-30",..: 4 4 4 4 4 4 4 4 4 4 ...

您可以选择您感兴趣的切片:

代码语言:javascript
复制
> slice <- subset(reconv, Measurement=="Ratio" & Gender=="All" & Age=="All")

如果您想要的话,返回到原来的交叉表切片:

代码语言:javascript
复制
> library(reshape2)
> dcast(slice, GeographyCode ~ DateCode, value.var="Value", fun.aggregate = first)

   GeographyCode 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1      S12000005 41.4 34.3 41.0 40.7 37.4 37.2 33.3 34.6 35.8 33.0 32.8
2      S12000006 34.9 36.0 31.9 34.2 31.1 28.7 27.9 29.6 27.5 26.8 27.0
3      S12000008 33.7 33.2 33.7 33.2 31.7 32.8 30.4 31.5 29.1 28.1 28.7
4      S12000010 26.7 24.5 25.7 26.9 26.7 27.8 29.3 25.1 22.4 29.0 28.2
5      S12000013 31.7 26.1 30.6 35.4 31.6 25.9 24.0 18.9 30.5 22.8 18.6
...
票数 2
EN

Stack Overflow用户

发布于 2016-03-15 11:13:49

您需要手动指定col.names,以强制read.csv读取多个列。另外,将na.strings指定为空字符串将使NA值保持在空列中。

代码语言:javascript
复制
read.csv(<parameters>, col.names=c("Col1","Col2".....), na.strings="")
票数 1
EN

Stack Overflow用户

发布于 2016-03-15 11:27:50

可以通过使用read.table和提供列名来指定列数:

代码语言:javascript
复制
read.table(file = link, 
           fill = TRUE,
           sep = ",",
           na.strings = "",
           col.names = paste("c", 1:12, sep = ""))

但是,我不知道这是否是一个好的解决方案,因为您需要事先知道列的数量。

另一种方法是将整个csv读入字符串。然后,您可以通过将标头存储在另一个对象(例如列表)中进行预处理,并且您只需使用“表部分”作为数据帧。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/36009317

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档