我是一个向空间数据分析迁移的城市规划师。我并没有忘记R和编程的一般,但由于我没有适当的培训,我的技能有时是有限的。
目前,我正试图分析大约50个CSV文件,其中包含公开拍卖的财务数据,这些数据有60000到300000行,有39个字段。这些文件是从罗马尼亚国家公共拍卖系统导出的,这是一个类似表单的平台.
问题是,其中一些行被地址字段中间的CRLF行尾所打断。我怀疑当人们以这种形式输入他们的地址时,他们会从多行的其他文件中复制/粘贴地址。
这个问题不能通过查找和替换来解决,因为这也将替换行尾的正确CRLF。
例如,数据是这样格式化的,每一行后面都有一个CRLF (它们使用^作为分隔符):
Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15;
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^为了正确处理数据,我需要像这样读取CSV,方法是只删除断线的CRLF --查找和替换不能这样做:
Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1 Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^我已经找到了一个可能的解决方案(Is there a way in R to join broken lines of csv file?),但它需要一些调整来满足我的需求。最终的结果是下面的代码挂起,甚至在小的示例文件上也不会到达进程的结束。
我从上面提到的职位中修改了接受的解决方案代码:
dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "^", fixed = TRUE)) # extract variable names
nvar <- length(varnames)
k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))
while(k <= length(dat)){
k <- k + 1
if(dat[k] == "") {k <- k + 1
print(paste("data line", k, "is an empty string"))
if(k > length(dat)) {break}
}
temp <- dat[k]
# checks if there are enough commas or if the line was broken
while(length(gregexpr("^", temp)[[1]]) < nvar-1){
k <- k + 1
temp <- paste0(temp, dat[k])
}
temp <- unlist(strsplit(temp, "^"))
message(k)
dat1 <- rbind(dat1, temp)
}
dat1 = dat1[-1,] # delete the empty initial row 在分隔符之间计算字段似乎是一个很好的解决方案,但我无法找到一种很好的方法来做到这一点,而且我的R编程技能显然还不够。
那么,有没有办法修复R中这种损坏的CSV文件呢?
工作文件示例可以在这里访问:http://data.gv.ro/dataset/4a4903c4-b1e3-46d1-82a5-238287f9496c/resource/c6abc0ef-3efb-4aef-bc0a-411f8cab2a28/download/contracte-2007.csv
感谢您的任何帮助!
发布于 2016-07-31 04:05:08
问题似乎是^是一个特殊的角色。如果您遍历代码,您将看到您有627个变量,而不是39个变量。它使每个字符都成为一个变量。试试这个:
dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "\\^")) # extract variable names
nvar <- length(varnames)
k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))
while(k <= length(dat)){
k <- k + 1
#if(dat[k] == "") {k <- k + 1
#print(paste("data line", k, "is an empty string"))
if(k > length(dat)) {break}
#}
temp <- dat[k]
# checks if there are enough commas or if the line was broken
while(length(gregexpr("\\^", temp)[[1]]) < nvar-1){
k <- k + 1
temp <- paste0(temp, dat[k])
}
temp <- unlist(strsplit(temp, "\\^"))
message(k)
dat1 <- rbind(dat1, temp)
}
dat1 = dat1[-1,] # delete the empty initial row 抱歉错过了你和我的代码上的差异。你不想要fixed=true。将其更改为上面的内容如下:
> varnames
[1] "Castigator" "CastigatorCUI" "CastigatorTara"
[4] "CastigatorLocalitate" "CastigatorAdresa" "Tip"
[7] "TipContract" "TipProcedura" "AutoritateContractanta"
[10] "AutoritateContractantaCUI" "TipAC" "TipActivitateAC"
[13] "NumarAnuntAtribuire" "DataAnuntAtribuire" "TipIncheiereContract"
[16] "TipCriteriiAtribuire" "CuLicitatieElectronica" "NumarOfertePrimite"
[19] "Subcontractat" "NumarContract" "DataContract"
[22] "TitluContract" "Valoare" "Moneda"
[25] "ValoareRON" "ValoareEUR" "CPVCodeID"
[28] "CPVCode" "NumarAnuntParticipare" "DataAnuntParticipare"
[31] "ValoareEstimataParticipare" "MonedaValoareEstimataParticipare" "FonduriComunitare"
[34] "TipFinantare" "TipLegislatieID" "FondEuropean"
[37] "ContractPeriodic" "DepoziteGarantii" "ModalitatiFinantare" 发布于 2016-07-29 13:17:46
我们可以通过检查记录是否以数字字段结尾来确定每条记录的最后一行。然后使用累积和,我们可以使用1,2,3,.标记同一记录中的行。最后把它们粘在一起。
# test data
Lines <- "Name^FiscCode^Country^Adress^SomeData^
SomeCompany^235356^Romania^Adress1
Adress2^ 565863
SomeCompany^235356^Romania^Adress1^ 565863"
# for real problem use readLines("myfile")[-1]
L <- readLines(textConnection(Lines))[-1]
g <- rev(cumsum(rev(grepl("\\^ *\\d+$", L)))) ##
g <- max(g) - g + 1
L2 <- tapply(L, g, paste, collapse = " ")
read.table(text = L2, sep = "^")上述方法适用于问题中显示的数据,但如果实际数据与所显示的数据有差异,则可能需要根据这些差异进行一些修改。
备注:如果每个记录中总是有四个^字符,请尝试将标记为##的行替换为:
cnt <- count.fields(textConnection(L), sep = "^") - 1
g <- rev(cumsum(rev(cumsum(cnt) %% 4 == 0)))更新问题已更改为提供新的示例数据。请注意,发布的答案适用于它,但是您当然需要将4替换为38,因为新数据每个记录有38个分隔符,而旧数据有4个。此外,旧数据有一个标头,而新数据没有,因此我们删除了那些用于删除标头的-1。下面是一个可以复制并粘贴到R中的自包含示例。
Lines <- "Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15;
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^"
L <- readLines(textConnection(Lines))
cnt <- count.fields(textConnection(L), sep = "^") - 1 # 38 4 34 4 34
g <- rev(cumsum(rev(cumsum(cnt) %% 38 == 0)))
g <- max(g) - g + 1 # 1 2 2 3 3
L2 <- tapply(L, g, paste, collapse = " ")
DF <- read.table(text = L2, sep = "^")
dim(DF)
## [1] 3 39示例数据不包含注释字符(#)或单引号或双引号,但如果其中包含了它们的部分数据,则需要将comment.char = "", quote = ""添加到count.fields和read.table调用中。
https://stackoverflow.com/questions/38658151
复制相似问题