Split text by "x" and populate a table

Stack Overflow user
Asked 2022-09-09 11:51:00
Answers: 3 · Views: 76 · Followers: 0 · Votes: 1

I have some data that looks like this:

                          cardCharacteristics
1        2 habs.|2 baños|72 m²|Bajos|Ascensor
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción
3                     3 habs.|70 m²|2ª Planta
4       2 habs.|2 baños|160 m²|Terraza|Balcón
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor
6            3 habs.|2 baños|187 m²|4ª Planta

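I am trying to split this column on | into an unspecified number of columns. Using cSplit_e(., split.col = "cardCharacteristics", sep = "|", type = "character") does not give me the result I want, because it splits on every unique value and returns binary (indicator) output.

The expected output would be:
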
tibble(
  "habs" = c(2, 3, 3, 2, 5, 3),
  "baños" = c(2, 2, NA, 2, 2, 2),
  "m^2" = c(72, 110, 70, 160, 176, 187),
  "Floor" = c("Bajos", NA, "2ª Planta", NA, "7ª Planta", "4ª Planta"),
  "Lift" = c("Ascensor", "Ascensor", NA, NA, "Ascensor", NA),
  "Heating" = c(NA, "Calefacción", NA, NA, NA, NA),
  "Terraza" = c(NA, NA, NA, "Terraza", NA, NA),
  "Balcón" = c(NA, NA, NA, "Balcón", NA, NA)
)

Or:

   habs baños `m^2` Floor     Lift     Heating     Terraza Balcón
  <dbl> <dbl> <dbl> <chr>     <chr>    <chr>       <chr>   <chr> 
1     2     2    72 Bajos     Ascensor NA          NA      NA    
2     3     2   110 NA        Ascensor Calefacción NA      NA    
3     3    NA    70 2ª Planta NA       NA          NA      NA    
4     2     2   160 NA        NA       NA          Terraza Balcón
5     5     2   176 7ª Planta Ascensor NA          NA      NA    
6     3     2   187 4ª Planta NA       NA          NA      NA

Data:

data = structure(list(cardCharacteristics = c("2 habs.|2 baños|72 m²|Bajos|Ascensor", 
"3 habs.|2 baños|110 m²|Ascensor|Calefacción", "3 habs.|70 m²|2ª Planta", 
"2 habs.|2 baños|160 m²|Terraza|Balcón", "5 habs.|2 baños|176 m²|7ª Planta|Ascensor", 
"3 habs.|2 baños|187 m²|4ª Planta")), row.names = c(NA, 6L
), class = "data.frame")

Edit:

My progress so far is the following:

data %>%
mutate(
    habs = str_extract(cardCharacteristics, "(\\d)+(?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d)+(?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d)+(?= m²)"),
    floor = str_extract(cardCharacteristics, "(\\d)+(?= 4ª Planta)")
  )

Edit 2:

The following:

  mutate(
    habs = str_extract(cardCharacteristics, "(\\d)+(?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d)+(?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d)+(?= m²)"),
    Terraza = str_extract(cardCharacteristics, "Terraza"),
    Calefacción = str_extract(cardCharacteristics, "Calefacción"),
    Floor = str_extract(cardCharacteristics, "(\\d)+(?=ª Planta)|Bajos"),
  )

gives me:

                          cardCharacteristics habs baños mts2 Terraza Calefacción Floor
1        2 habs.|2 baños|72 m²|Bajos|Ascensor    2     2   72    <NA>        <NA> Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción    3     2  110    <NA> Calefacción  <NA>
3                     3 habs.|70 m²|2ª Planta    3  <NA>   70    <NA>        <NA>     2
4       2 habs.|2 baños|160 m²|Terraza|Balcón    2     2  160 Terraza        <NA>  <NA>
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor    5     2  176    <NA>        <NA>     7
6            3 habs.|2 baños|187 m²|4ª Planta    3     2  187    <NA>        <NA>     4

This is almost what I need.

3 Answers

Stack Overflow user

Accepted answer

Posted on 2022-09-10 09:55:36

You are very close. In fact, the way you extract Floor works against your goal (the lookahead prevents the ª Planta substring from being extracted!):

data %>% 
  mutate(
    habs = str_extract(cardCharacteristics, "(\\d)+(?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d)+(?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d)+(?= m²)"),
    Terraza = str_extract(cardCharacteristics, "Terraza"),
    Calefacción = str_extract(cardCharacteristics, "Calefacción"),
    Floor = str_extract(cardCharacteristics, "Bajos|\\d+ª Planta"),  # <--- Corrected here
  )
                          cardCharacteristics habs baños mts2 Terraza Calefacción     Floor
1        2 habs.|2 baños|72 m²|Bajos|Ascensor    2     2   72    <NA>        <NA>     Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción    3     2  110    <NA> Calefacción      <NA>
3                     3 habs.|70 m²|2ª Planta    3  <NA>   70    <NA>        <NA> 2ª Planta
4       2 habs.|2 baños|160 m²|Terraza|Balcón    2     2  160 Terraza        <NA>      <NA>
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor    5     2  176    <NA>        <NA> 7ª Planta
6            3 habs.|2 baños|187 m²|4ª Planta    3     2  187    <NA>        <NA> 4ª Planta
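
To see the difference in isolation, here is a minimal sketch on two of the sample strings (the names `x`, `with_lookahead` and `without_lookahead` are only for illustration). A lookahead such as `(?=ª Planta)` requires that text to follow but leaves it out of the match, so only the digits come back; dropping the lookahead returns the whole floor label.

```r
library(stringr)

# Two of the sample strings from the question
x <- c("3 habs.|70 m²|2ª Planta",
       "2 habs.|2 baños|72 m²|Bajos|Ascensor")

# With a lookahead: "ª Planta" must follow the digits,
# but it is not part of what str_extract() returns
with_lookahead <- str_extract(x, "\\d+(?=ª Planta)|Bajos")
with_lookahead
# [1] "2"     "Bajos"

# Without the lookahead: the whole label is matched and returned
without_lookahead <- str_extract(x, "Bajos|\\d+ª Planta")
without_lookahead
# [1] "2ª Planta" "Bajos"
```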
Votes: 2

Stack Overflow user

Posted on 2022-09-10 08:39:36

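Going back to your dictionary idea, you may want a tag:value approach like the one write.dcf offers. My system uses a different encoding from yours, which makes some things hard to test, or produces unwanted results with your encoding even though they would probably work on your system. I assume that 'habs.', 'banos' and 'm2' are the basics, that any further entries are extra amenities, and that every "record" starts with 'habs.'. Using your data:
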
# split1 <- strsplit(...
strsplit(unname(unlist(data)), '|', fixed = TRUE)
[[1]]
[1] "2 habs."  "2 baños"  "72 m²"    "Bajos"    "Ascensor"

[[2]]
[1] "3 habs."     "2 baños"     "110 m²"      "Ascensor"    "Calefacción"

These are our records, and we want to append \n\n to each one as the .dcf record separator.

# split1 <- lapply(...
lapply(split1, function(x) c(x, '\n\n'))
[[1]]
[1] "2 habs."  "2 baños"  "72 m²"    "Bajos"    "Ascensor" "\n\n"    

[[2]]
[1] "3 habs."     "2 baños"     "110 m²"      "Ascensor"    "Calefacción"
[6] "\n\n"       

At the moment the "tag" comes after the "value", so we have to swap them:

# split1 <- sub(...
sub('(.*) (.*)', '\\2 \\1', unlist(split1))
 [1] "habs. 2"     "baños 2"     "m² 72"       "Bajos"       "Ascensor"   
 [6] "\n\n"        "habs. 3"     "baños 2"     "m² 110"      "Ascensor"   
[11] "Calefacción" "\n\n"        "habs. 3"     "m² 70"       "Planta 2ª"  
[16] "\n\n"

.dcf expects tag:value, so replace ' ' with ':':

# split1 <- gsub(
gsub(' ', ':', split1)
 [1] "habs.:2"     "baños:2"     "m²:72"       "Bajos"       "Ascensor"   
 [6] "\n\n"        "habs.:3"     "baños:2"     "m²:110"      "Ascensor"   
[11] "Calefacción" "\n\n"

We are close now, but I am fairly sure our unterminated amenities (Bajos, Ascensor, etc.) will be treated as "malformed", even though everything appears to work with cat.

cat(split1)
habs.:2
baños:2
m²:72
Bajos
Ascensor
# but
c(read.dcf(textConnection(split1)))
Error in read.dcf(textConnection(split1)) : 
  Line starting 'Bajos ...' is malformed! # first non ':' terminated

Append ':'. This behaves badly with my encoding but should work for you; below is my bad result.

cat(gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", '\\1:', split1, perl = TRUE))
habs.:2
 bañ:os:2
 m:²:72
 Bajos:
 Ascensor:

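Bad things happen with 'banos' and 'm2'. Maybe this is just a bad idea... But after changing the eñe in banos and the exponent in m2, things start to look better, although the amenities need either a number or a colon (Ascensor:).
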
data$cardCharacteristics[6] <- "3 habs.|2 banos|187 m2|4ª Planta"
split3 <- strsplit(unname(unlist(data)), '|', fixed = TRUE)
split3 <- lapply(split3, function(x) c(x, '\n\n'))
split3 <- sub('(.*) (.*)', '\\2 \\1', unlist(split3))
split3 <- gsub(' ', ':', split3)
split3 <- gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", '\\1:', split3, perl = TRUE)
split3
 [1] "habs.:2"      "banos:2"      "m2:72"        "Bajos:"       "Ascensor:"   
 [6] "\n\n"         "habs.:3"      "banos:2"      "m2:110"       "Ascensor:"   
[11] "Calefacción:" "\n\n"       

Another gsub/regex variant (also for the Append ':' step) that avoids the mid-string matches caused by the encoding mismatch:

# data3 is not defined anywhere in this answer; presumably a copy of the original data
split5 <- strsplit(unname(unlist(data3)), '|', fixed = TRUE)
split5 <- lapply(split5, function(x) c(x, '\n\n'))
split5 <- sub('(.*) (.*)', '\\2 \\1', unlist(split5))
split5 <- gsub(' ', ':', split5)
nocolon <- !grepl(':', split5)
split5[nocolon] <- paste0(split5[nocolon], ':')
split5
 [1] "habs.:2"      "baños:2"      "m²:72"        "Bajos:"       "Ascensor:"   
 [6] "\n\n:"

We just need to turn '\n\n:' back into '\n\n':

split5 <- gsub('\n\n:', '\n\n', split5)
split5_df <- data.frame(read.dcf(textConnection(split5)))
split5_df
  habs. baños  m. Bajos Ascensor Calefacción Planta Terraza Balcón
1     2     2  72                       <NA>   <NA>    <NA>   <NA>
2     3     2 110  <NA>                        <NA>    <NA>   <NA>
3     3  <NA>  70  <NA>     <NA>        <NA>     2ª    <NA>   <NA>
4     2     2 160  <NA>     <NA>        <NA>   <NA>               
5     5     2 176  <NA>                 <NA>     7ª    <NA>   <NA>
6     3     2 187  <NA>     <NA>        <NA>     4ª    <NA>   <NA>

split5_df$Ascensor[which(split5_df$Ascensor == '')] <- c('old','new','scary')

split5_df$Calefacción[which(split5_df$Calefacción == '')] <- 'elec'

We have not seen a piscina yet, but .dcf will pick it up when one appears.
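
For convenience, the steps above can be collected into one runnable sketch. This is not exactly the code from the answer: the helper `to_dcf_lines` is a hypothetical name, the swap and the colon replacement are folded into a single `sub()`, and an empty string is used as the record separator instead of '\n\n'. It assumes a UTF-8 locale and the original `data` from the question.

```r
data <- data.frame(cardCharacteristics = c(
  "2 habs.|2 baños|72 m²|Bajos|Ascensor",
  "3 habs.|2 baños|110 m²|Ascensor|Calefacción",
  "3 habs.|70 m²|2ª Planta",
  "2 habs.|2 baños|160 m²|Terraza|Balcón",
  "5 habs.|2 baños|176 m²|7ª Planta|Ascensor",
  "3 habs.|2 baños|187 m²|4ª Planta"
))

# Hypothetical helper: turn one "a|b|c" string into the lines of one DCF record
to_dcf_lines <- function(x) {
  parts <- strsplit(x, "|", fixed = TRUE)[[1]]
  # Swap "value tag" into "tag:value", e.g. "2 habs." becomes "habs.:2"
  parts <- sub("^([^ ]+) (.*)$", "\\2:\\1", parts)
  # Bare amenities (no colon yet) get a trailing colon so read.dcf accepts them
  bare <- !grepl(":", parts, fixed = TRUE)
  parts[bare] <- paste0(parts[bare], ":")
  # A blank line ends the record
  c(parts, "")
}

dcf_lines <- unlist(lapply(data$cardCharacteristics, to_dcf_lines))

con <- textConnection(dcf_lines)
dcf_matrix <- read.dcf(con)   # character matrix, NA where a tag is absent
close(con)

dcf_df <- as.data.frame(dcf_matrix, stringsAsFactors = FALSE)
dcf_df
```

As in the answer, amenities that are present come back as empty strings rather than NA, so they can be told apart from amenities that are missing.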

Votes: 2

Stack Overflow user

Posted on 2022-09-09 12:10:59

The tidyr package can probably solve this. The following could be a starting point. sep accepts a regex, so you can make use of the regexes from your str_extract calls above.

  library(dplyr)
  library(tidyr)
  data %>% separate(cardCharacteristics,
   sep = "\\|",
   into = c(
     "habs", "baños", "m^2", "Floor",
     "Lift"
   )
 )  

which results in:

     habs   baños       m^2     Floor        Lift
1 2 habs. 2 baños     72 m²     Bajos    Ascensor
2 3 habs. 2 baños    110 m²  Ascensor Calefacción
3 3 habs.   70 m² 2ª Planta      <NA>        <NA>
4 2 habs. 2 baños    160 m²   Terraza      Balcón
5 5 habs. 2 baños    176 m² 7ª Planta    Ascensor
6 3 habs. 2 baños    187 m² 4ª Planta        <NA>
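
Note that in this output the columns no longer line up (row 3 has 70 m² under baños), because separate() fills positions from left to right. One possible way to get closer to the expected output, sketched here under the assumption that each piece can be recognised by its suffix, is to go long with separate_rows(), label each piece, and then pivot back to wide. The `id`, `key` and `value` column names and the suffix rules are assumptions for illustration, and the amenity columns keep their Spanish names rather than Lift and Heating.

```r
library(dplyr)
library(tidyr)
library(stringr)

data <- data.frame(cardCharacteristics = c(
  "2 habs.|2 baños|72 m²|Bajos|Ascensor",
  "3 habs.|2 baños|110 m²|Ascensor|Calefacción",
  "3 habs.|70 m²|2ª Planta",
  "2 habs.|2 baños|160 m²|Terraza|Balcón",
  "5 habs.|2 baños|176 m²|7ª Planta|Ascensor",
  "3 habs.|2 baños|187 m²|4ª Planta"
))

wide <- data %>%
  mutate(id = row_number()) %>%                       # remember the source row
  separate_rows(cardCharacteristics, sep = "\\|") %>% # one piece per row
  rename(piece = cardCharacteristics) %>%
  mutate(
    # Decide which column each piece belongs to (assumed suffix rules)
    key = case_when(
      str_detect(piece, " habs\\.$")        ~ "habs",
      str_detect(piece, " baños$")          ~ "baños",
      str_detect(piece, " m²$")             ~ "m^2",
      str_detect(piece, "^Bajos$| Planta$") ~ "Floor",
      TRUE                                  ~ piece   # amenity: column named after itself
    ),
    # Numeric fields keep only the leading digits; the rest keep the text
    value = if_else(key %in% c("habs", "baños", "m^2"),
                    str_extract(piece, "^\\d+"), piece)
  ) %>%
  select(id, key, value) %>%
  pivot_wider(names_from = key, values_from = value) %>%
  mutate(across(all_of(c("habs", "baños", "m^2")), as.numeric))

wide
```

Any amenity not seen before (a piscina, say) simply becomes a new column, with NA in the rows that lack it.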
Votes: 1

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/73661728
