Q: Regex to extract dates in various formats from URLs

Stack Overflow user

Asked 2018-10-17 19:17:18

2 answers, 306 views, 0 followers, 1 vote

I need to extract dates from the following URL data:

id | url
1 | https://www.infobae.com/tecno/2018/08/22/una-plataforma-argentina-entre-las-10-soluciones-de-big-data-mas-destacadas-del-ano/
2 | https://www.infobae.com/2014/08/03/1584584-que-es-data-lake-y-como-transforma-el-almacenamiento-datos/
3 | http://www.ellitoral.com/index.php/diarios/2018/01/09/economia1/ECON-02.html
4 | http://www.ellitoral.com/index.php/diarios/2017/12/01/economia1/ECON-01.html
5 | https://www.cronista.com/contenidos/2017/08/16/noticia_0089.html
6 | https://www.cronista.com/contenidos/2017/04/20/noticia_0090.html
7 | https://www.perfil.com/noticias/economia/supercomputadoras-para-sacarles-provecho-a-los-datos-20160409-0023.phtml
8 | https://www.mdzol.com/sociedad/100-cursos-online-gratuitos-sobre-profesiones-del-futuro-20170816-0035.html
9 | https://www.eldia.com/nota/2018-8-26-7-33-54--pueden-nuestros-datos-ponernos-en-serio-riesgo--revista-domingo
10 | https://www.letrap.com.ar/nota/2018-8-6-13-34-0-lula-eligio-a-su-vice-o-a-su-reemplazante
11 | https://www.telam.com.ar/notas/201804/270831-los-pacientes-deben-conocer-que-tipo-de-datos-usan-sus-medicos-coinciden-especialistas.html
12 | http://www.telam.com.ar/notas/201804/271299-invierten-100-millones-en-plataforma-de-internet-de-las-cosas.html
13 | http://www.telam.com.ar/notas/201308/30404-realizan-jornadas-sobre-tecnologia-para-gestion-de-datos.php
14 | http://www.telam.com.ar/notas/201701/176163-inteligencia-artificial-lectura-de-diarios.html

These URLs contain dates in different formats:

Links 1-6 use /yyyy/mm/dd/

Links 7-8 use -yyyymmdd

Links 9-10 use /yyyy-m-d

Links 11-14 use /yyyymm/

Fortunately, they are all numeric (no "Jan" instead of 1).

Is there a single regular expression that can extract all of them, or at least most?

regex

date

date-formatting

2 Answers

Stack Overflow user

Accepted answer

Answered 2018-10-17 19:32:34

I believe the following regex does what you want.

regex <- "\\d{8}|\\d{6}|\\d{4}[^\\d]{1}\\d{2}|\\d{4}[^\\d]{1}\\d{1,2}[^\\d]{1}\\d{1,2}"
regmatches(URLData$url, regexpr(regex, URLData$url))
# [1] "2018/08/22" "2014/08/03" "2018/01/09" "2017/12/01" "2017/08/16"
# [6] "2017/04/20" "20160409"   "20170816"   "2018-8-26"  "2018-8-6"  
#[11] "201804"     "201804"     "201308"     "201701"
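
A note on engines: the result above appears to rely on the default regexpr() engine preferring the longest alternative at the leftmost match position. With perl = TRUE the first alternative that matches wins, so the same pattern stops at the year and month. The sketch below is only an illustration of that difference; the reordered pattern is an assumption of mine, not part of the answer.

```r
url <- "https://www.infobae.com/tecno/2018/08/22/una-plataforma-argentina-entre-las-10-soluciones-de-big-data-mas-destacadas-del-ano/"

# The alternatives from the answer, in their original order.
original <- "\\d{8}|\\d{6}|\\d{4}[^\\d]{1}\\d{2}|\\d{4}[^\\d]{1}\\d{1,2}[^\\d]{1}\\d{1,2}"

# Same alternatives with the most specific one first (illustrative reordering).
reordered <- "\\d{4}\\D\\d{1,2}\\D\\d{1,2}|\\d{8}|\\d{6}|\\d{4}\\D\\d{2}"

# Under PCRE, "yyyy?mm" is tried before "yyyy?m?d" in the original order.
pcre_original  <- regmatches(url, regexpr(original,  url, perl = TRUE))
pcre_reordered <- regmatches(url, regexpr(reordered, url, perl = TRUE))

pcre_original   # "2018/08"
pcre_reordered  # "2018/08/22"
```

Putting the longest form first makes the pattern less sensitive to which engine evaluates it.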

Edit

After reading @hrbrmstr's answer, I realized it is better to coerce the result to class Date. I will use the external package lubridate to do it.

d <- regmatches(URLData$url, regexpr(regex, URLData$url))
d[nchar(d) < 7] <- paste0(d[nchar(d) < 7], "01")
d <- lubridate::ymd(d)
d
# [1] "2018-08-22" "2014-08-03" "2018-01-09" "2017-12-01" "2017-08-16"
# [6] "2017-04-20" "2016-04-09" "2017-08-16" "2018-08-26" "2018-08-06"
#[11] "2018-04-01" "2018-04-01" "2013-08-01" "2017-01-01"
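
If adding a package is not an option, the same coercion can probably be done in base R. This is a minimal sketch under my own assumptions: the vector d below is a hand-picked sample with one string per shape, and a missing day is taken to be the first of the month, as in the code above.

```r
# One string for each of the four shapes returned by the regex.
d <- c("2018/08/22", "20160409", "2018-8-26", "201804")

# yyyymm has no day, so assume the first of the month.
is_month <- grepl("^[0-9]{6}$", d)
d[is_month] <- paste0(d[is_month], "01")

# Compact yyyymmdd strings and separated strings need different formats.
is_compact <- grepl("^[0-9]{8}$", d)
dates <- rep(as.Date(NA), length(d))
dates[is_compact]  <- as.Date(d[is_compact], format = "%Y%m%d")
dates[!is_compact] <- as.Date(gsub("/", "-", d[!is_compact], fixed = TRUE),
                              format = "%Y-%m-%d")

format(dates)
# [1] "2018-08-22" "2016-04-09" "2018-08-26" "2018-04-01"
```

Testing for digits only, rather than for string length, keeps "2018-8-6" (also 8 characters) out of the yyyymmdd branch.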

Data in dput format.

URLData <-
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14), url = structure(c(10L, 9L, 2L, 1L, 7L, 6L, 13L, 12L, 
8L, 11L, 14L, 5L, 3L, 4L), .Label = c(" http://www.ellitoral.com/index.php/diarios/2017/12/01/economia1/ECON-01.html", 
" http://www.ellitoral.com/index.php/diarios/2018/01/09/economia1/ECON-02.html", 
" http://www.telam.com.ar/notas/201308/30404-realizan-jornadas-sobre-tecnologia-para-gestion-de-datos.php", 
" http://www.telam.com.ar/notas/201701/176163-inteligencia-artificial-lectura-de-diarios.html                      ", 
" http://www.telam.com.ar/notas/201804/271299-invierten-100-millones-en-plataforma-de-internet-de-las-cosas.html", 
" https://www.cronista.com/contenidos/2017/04/20/noticia_0090.html", 
" https://www.cronista.com/contenidos/2017/08/16/noticia_0089.html", 
" https://www.eldia.com/nota/2018-8-26-7-33-54--pueden-nuestros-datos-ponernos-en-serio-riesgo--revista-domingo", 
" https://www.infobae.com/2014/08/03/1584584-que-es-data-lake-y-como-transforma-el-almacenamiento-datos/", 
" https://www.infobae.com/tecno/2018/08/22/una-plataforma-argentina-entre-las-10-soluciones-de-big-data-mas-destacadas-del-ano/", 
" https://www.letrap.com.ar/nota/2018-8-6-13-34-0-lula-eligio-a-su-vice-o-a-su-reemplazante", 
" https://www.mdzol.com/sociedad/100-cursos-online-gratuitos-sobre-profesiones-del-futuro-20170816-0035.html", 
" https://www.perfil.com/noticias/economia/supercomputadoras-para-sacarles-provecho-a-los-datos-20160409-0023.phtml", 
" https://www.telam.com.ar/notas/201804/270831-los-pacientes-deben-conocer-que-tipo-de-datos-usan-sus-medicos-coinciden-especialistas.html"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-14L))

Votes: 2

Stack Overflow user

Answered 2018-10-17 19:48:13

It is best to follow lucas_7_94's advice if you know the date prefixes will be "uniform", but this is what you originally asked for:

library(stringi)
library(tidyverse)

Your data:

readLines(textConnection("https://www.infobae.com/tecno/2018/08/22/una-plataforma-argentina-entre-las-10-soluciones-de-big-data-mas-destacadas-del-ano/
https://www.infobae.com/2014/08/03/1584584-que-es-data-lake-y-como-transforma-el-almacenamiento-datos/
http://www.ellitoral.com/index.php/diarios/2018/01/09/economia1/ECON-02.html
http://www.ellitoral.com/index.php/diarios/2017/12/01/economia1/ECON-01.html
https://www.cronista.com/contenidos/2017/08/16/noticia_0089.html
https://www.cronista.com/contenidos/2017/04/20/noticia_0090.html
https://www.perfil.com/noticias/economia/supercomputadoras-para-sacarles-provecho-a-los-datos-20160409-0023.phtml
https://www.mdzol.com/sociedad/100-cursos-online-gratuitos-sobre-profesiones-del-futuro-20170816-0035.html
https://www.eldia.com/nota/2018-8-26-7-33-54--pueden-nuestros-datos-ponernos-en-serio-riesgo--revista-domingo
https://www.letrap.com.ar/nota/2018-8-6-13-34-0-lula-eligio-a-su-vice-o-a-su-reemplazante
https://www.telam.com.ar/notas/201804/270831-los-pacientes-deben-conocer-que-tipo-de-datos-usan-sus-medicos-coinciden-especialistas.html
http://www.telam.com.ar/notas/201804/271299-invierten-100-millones-en-plataforma-de-internet-de-las-cosas.html
http://www.telam.com.ar/notas/201308/30404-realizan-jornadas-sobre-tecnologia-para-gestion-de-datos.php
http://www.telam.com.ar/notas/201701/176163-inteligencia-artificial-lectura-de-diarios.html")) -> urls

A readable, documented regex for your cases:

regex <- "
([[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2})|                 # 1st case - yyyy/mm/dd
([[:digit:]]{8})-[[:digit:]]|                                   # 2nd case - yyyymmdd-#
([[:digit:]]{4}-[[:digit:]]{1,2}-[[:digit:]]{1,2})-[[:digit:]]| # 3rd case - yyyy-m-d-#
([[:digit:]]{6})/[[:digit:]]                                    # 4th case - yyyymm/#"

Then:

stri_match_first_regex(urls, regex, cg_missing = "", opts_regex = stri_opts_regex(comments = TRUE)) %>% 
  as.data.frame(stringsAsFactors = FALSE) %>%  # columns V1 (full match) and V2..V5 (one per case)
  as_tibble() %>% 
  select(-V1) %>% 
  unite(date, starts_with("V"), sep="") %>%    # only one capture group is non-empty per row
  mutate(date = case_when(
    (nchar(date) == 10) & grepl("/", date) ~ as.Date(date, format = "%Y/%m/%d"),
    (nchar(date) == 8) & grepl("^[[:digit:]]+$", date) ~ as.Date(date, format = "%Y%m%d"),
    (nchar(date) == 6) & grepl("^[[:digit:]]+$", date) ~ as.Date(sprintf("%s01", date), format = "%Y%m%d"),
    # vectorised over all rows; %m and %d accept single-digit input
    (grepl("-", date)) ~ as.Date(date, format = "%Y-%m-%d")
  ))
## # A tibble: 14 x 1
##    date      
##    <date>    
##  1 2018-08-22
##  2 2014-08-03
##  3 2018-01-09
##  4 2017-12-01
##  5 2017-08-16
##  6 2017-04-20
##  7 2016-04-09
##  8 2017-08-16
##  9 2018-08-26
## 10 2018-08-06
## 11 2018-04-01
## 12 2018-04-01
## 13 2013-08-01
## 14 2017-01-01
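
The same "one case per format" idea can also be sketched without stringi or the tidyverse by pairing each pattern with the format used to parse it. The helper name extract_url_date, the rule table, and the digit lookarounds are assumptions made for illustration, not part of the answer above; a day of 01 is again assumed for yyyymm.

```r
# Hypothetical helper: try (pattern, format) rules in order, first hit wins per URL.
extract_url_date <- function(urls) {
  rules <- data.frame(
    pattern = c("[0-9]{4}/[0-9]{2}/[0-9]{2}",      # yyyy/mm/dd
                "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}",  # yyyy-m-d
                "[0-9]{8}",                        # yyyymmdd
                "[0-9]{6}"),                       # yyyymm
    format  = c("%Y/%m/%d", "%Y-%m-%d", "%Y%m%d", "%Y%m%d"),
    suffix  = c("", "", "", "01"),                 # assumed day for yyyymm
    stringsAsFactors = FALSE
  )
  out <- rep(as.Date(NA), length(urls))
  for (i in seq_len(nrow(rules))) {
    # Lookarounds stop a rule from matching inside a longer run of digits.
    pattern <- paste0("(?<![0-9])", rules$pattern[i], "(?![0-9])")
    m <- regexpr(pattern, urls, perl = TRUE)
    use <- is.na(out) & m > 0
    if (!any(use)) next
    found <- substring(urls[use], m[use], m[use] + attr(m, "match.length")[use] - 1L)
    out[use] <- as.Date(paste0(found, rules$suffix[i]), format = rules$format[i])
  }
  out
}

sample_urls <- c(
  "https://www.cronista.com/contenidos/2017/08/16/noticia_0089.html",
  "https://www.mdzol.com/sociedad/100-cursos-online-gratuitos-sobre-profesiones-del-futuro-20170816-0035.html",
  "https://www.letrap.com.ar/nota/2018-8-6-13-34-0-lula-eligio-a-su-vice-o-a-su-reemplazante",
  "http://www.telam.com.ar/notas/201701/176163-inteligencia-artificial-lectura-de-diarios.html"
)
extract_url_date(sample_urls)
# [1] "2017-08-16" "2017-08-16" "2018-08-06" "2017-01-01"
```

One caveat with this sketch: a six-digit article id that appears before the date would be read as yyyymm, so the rule order and patterns may need adjusting for other sites.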

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/52862119
