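I have a locality string:
locality <- "NEAR HAENA BEACH PARK, N 22 13 W 159 34 500 meters from coast"
I want to extract only the coordinates from this string, excluding the "500", because it is not part of the coordinates; it refers to the distance from the coast.
I wrote this more general regex to extract coordinates that follow similar patterns. At the end I have a negative lookahead so that numbers that are actually distances are not included.
This works: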
> capture <- gregexpr("([0-9]*\\.?[0-9]+)?(\\$O)?(\\s)?[0-9]*\\.?[0-9]+(\\')?(\\$O)?\\s([0-9]*\\.?[0-9]+(\\')?)(?!\\sMI\\b|KM\\b|M\\b|MILES|KILOMETERS|METERS)", locality, ignore.case = TRUE, perl = TRUE)
> regmatches(locality, capture)
[[1]]
[1] " 22 13" " 159 34"
But I am missing the N and W direction letters. If I try to pick up these N and W letters as well, it no longer works properly:
> capture <- gregexpr("(N(\\s|\\b)|S(\\s|\\b)|E(\\s|\\b)|W(\\s|\\b))([0-9]*\\.?[0-9]+)?(\\$O)?(\\s)?[0-9]*\\.?[0-9]+(\\')?(\\$O)?\\s([0-9]*\\.?[0-9]+(\\')?)(?!\\sMI\\b|KM\\b|M\\b|MILES|KILOMETERS|METERS)", locality, ignore.case = TRUE, perl = TRUE)
> regmatches(locality, capture)
[[1]]
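[1] "N 22 13" "W 159 34 500"
In other words, just by adding (N(\\s|\\b)|S(\\s|\\b)|E(\\s|\\b)|W(\\s|\\b)) at the start of the regex, the lookahead no longer works. My guess is that the lookahead only applies to the parenthesized fragment immediately before it.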
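One way to see what the lookahead is really testing is to try it on a smaller pattern. The sketch below is not from the original post; the shortened patterns and the variable names (`as_written`, `grouped`, `backed_off`, `guarded`) are illustrative assumptions. It suggests two separate effects: inside the lookahead, `|` separates whole alternatives, so the `\\s` belongs only to `MI`; and a greedy `[0-9]+` can give back digits until the lookahead is satisfied.

```r
x <- "W 159 34 500 meters from coast"

# As written in the question, "\\s" is part of the first alternative only,
# so METERS would have to follow the digits with no space in between.
as_written <- grepl("500(?!\\sMI\\b|KM\\b|M\\b|MILES|KILOMETERS|METERS)", x,
                    ignore.case = TRUE, perl = TRUE)

# Grouping the alternatives makes "\\s" apply to every unit.
grouped <- grepl("500(?!\\s(?:MI\\b|KM\\b|M\\b|MILES|KILOMETERS|METERS))", x,
                 ignore.case = TRUE, perl = TRUE)

# Even with grouping, a greedy [0-9]+ can back off by one digit,
# because the shorter number is followed by a digit rather than a unit.
backed_off <- regmatches(x, regexpr("34 [0-9]+(?!\\s(?:METERS))", x,
                                    ignore.case = TRUE, perl = TRUE))

# An extra guard against a following digit or dot blocks that partial match.
guarded <- regmatches(x, regexpr("34 [0-9]+(?![0-9.])(?!\\s(?:METERS))", x,
                                 ignore.case = TRUE, perl = TRUE))
```

Here `as_written` is TRUE (the lookahead does not reject "500"), `grouped` is FALSE, `backed_off` is "34 50", and `guarded` is empty. Under this reading, the shorter regex excluded "500" only because its match happened to end before the distance, not because the lookahead rejected it.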
To expand on this, I have taken a suggestion from the comments and included several variations of this locality that I would like the regex to handle.
locality <- c(
"NEAR HAENA BEACH PARK, N 22 13 W 159 34 500 meters from coast",
"NEAR HAENA BEACH PARK, N 22 13 45 W 159 34 23 500 meters from coast",
"NEAR HAENA BEACH PARK, N 22 13 12.32 W 159 34 500.4 meters from coast",
"NEAR HAENA BEACH PARK, E 22 13 S 159 34 500 meters from coast",
"NEAR HAENA BEACH PARK, N 22 13' W 159 34' 500 meters from coast",
"NEAR HAENA BEACH PARK, N 22 13' W 159 34' 500 km from coast",
"NEAR HAENA BEACH PARK, N 22 13' W 159 34' 500 distance from coast"
)
regex <- "[NSEW]\\b([0-9]*\\.?[0-9]+)?(\\$O)?(\\s)?[0-9]*\\.?[0-9]+(\\')?(\\$O)?\\s([0-9]*\\.?[0-9]+(\\')?)(?!\\sMI\\b|KM\\b|M\\b|MILES|KILOMETERS|METERS)"
> capture <- gregexpr(regex, locality[1], ignore.case = TRUE, perl = TRUE)
> regmatches(locality[1], capture)
[[1]]
[1] "N 22 13" "W 159 34"
>
> capture <- gregexpr(regex, locality[2], ignore.case = TRUE, perl = TRUE)
> regmatches(locality[2], capture)
[[1]]
[1] "N 22 13" "W 159 34"
>
> capture <- gregexpr(regex, locality[3], ignore.case = TRUE, perl = TRUE)
> regmatches(locality[3], capture)
[[1]]
[1] "N 22 13" "W 159 34"
>
> capture <- gregexpr(regex, locality[4], ignore.case = TRUE, perl = TRUE)
> regmatches(locality[4], capture)
[[1]]
[1] "E 22 13" "S 159 34"
>
> capture <- gregexpr(regex, locality[5], ignore.case = TRUE, perl = TRUE)
> regmatches(locality[5], capture)
[[1]]
[1] "N 22 13'" "W 159 34'"
>
> capture <- gregexpr(regex, locality[6], ignore.case = TRUE, perl = TRUE)
> regmatches(locality[6], capture)
[[1]]
[1] "N 22 13'" "W 159 34'"
>
> capture <- gregexpr(regex, locality[7], ignore.case = TRUE, perl = TRUE)
> regmatches(locality[7], capture)
[[1]]
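[1] "N 22 13'" "W 159 34'"
It looks like a few things are not working. In the second locality, the seconds are not picked up. Also, the lookahead should not affect the last locality, but it does (although this may be the same problem as with the seconds).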
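A possible pattern that covers the seven variants is sketched below. It is an illustration under stated assumptions, not the original author's solution: the names `locality_examples`, `num`, `units` and `pattern` are made up here, the `\\$O` degree placeholder from the question is left out, and a coordinate is assumed to be a direction letter followed by two or three numbers, each optionally ending in `'`. A trailing number is rejected only when a unit follows it.

```r
locality_examples <- c(
  "NEAR HAENA BEACH PARK, N 22 13 W 159 34 500 meters from coast",
  "NEAR HAENA BEACH PARK, N 22 13 45 W 159 34 23 500 meters from coast",
  "NEAR HAENA BEACH PARK, N 22 13 12.32 W 159 34 500.4 meters from coast",
  "NEAR HAENA BEACH PARK, E 22 13 S 159 34 500 meters from coast",
  "NEAR HAENA BEACH PARK, N 22 13' W 159 34' 500 meters from coast",
  "NEAR HAENA BEACH PARK, N 22 13' W 159 34' 500 km from coast",
  "NEAR HAENA BEACH PARK, N 22 13' W 159 34' 500 distance from coast"
)

# One numeric component: integer or decimal, optionally followed by '
num <- "[0-9]+(?:\\.[0-9]+)?'?"
# Distance units that mark a number as "not a coordinate"
units <- "(?:MI|KM|M|MILES|KILOMETERS|METERS)"

pattern <- paste0(
  "\\b[NSEW]\\s*", num,            # direction letter and first number
  "(?:\\s+", num,                  # one or two further numbers...
  "(?![0-9.])",                    # ...that are not cut short by backtracking
  "(?!\\s*", units, "\\b)){1,2}"   # ...and are not followed by a unit
)

matches <- regmatches(
  locality_examples,
  gregexpr(pattern, locality_examples, ignore.case = TRUE, perl = TRUE)
)
```

With this sketch the second locality keeps its seconds ("N 22 13 45", "W 159 34 23"), the third yields "N 22 13 12.32" and "W 159 34", and the last locality keeps "W 159 34' 500" because "distance" is not in the unit list. Whether that last result is desirable depends on how trailing numbers without a unit should be treated.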
Posted on 2016-06-06 21:21:32
Maybe you can try this:
location_N <- which(strsplit(locality, "\\s")[[1]]=="N")
stringr::word(locality,location_N,location_N+5)
https://stackoverflow.com/questions/37657171