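Regex to extract all the content inside the <article> tag

Stack Overflow user
Asked 2020-03-29 00:24:48
2 answers, 189 views, 0 followers, 1 vote

I am trying to get all the content of an article, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01975-8, and I found that the information is all inside this tag: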

<article><div...><..> information.... <></article>

I am trying something like this:

art_sections<-regexpr("<article (.*)?>(.[0-9]*)</article>",thepage)

But it does not find the content.

Does anyone know how I can solve this?

2 Answers

Stack Overflow user

Accepted answer

Answered 2020-03-29 00:43:09

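This is not really a regex problem; it is a web-scraping problem in R, which is better handled with a library such as rvest.

Here is some sample code to help you get started: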

# Load the rvest package
library(rvest)
# URL of the page to scrape
url <- 'https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01975-8'
# Read the HTML from the website
webpage <- read_html(url)
# Select the <article> element(s)
article_html <- html_nodes(webpage, 'article')
# Convert the selected nodes to text
html_text(article_html)
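
The same selection step can be tried offline before hitting the real site. The following is a minimal sketch, assuming that `read_html()` is given a literal HTML string rather than a URL; `page_source` and its contents are made up for illustration and are not the real page.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page (not the real article)
page_source <- paste0(
  "<html><body><nav>Menu</nav>",
  "<article lang=\"en\"><h1>Title</h1><p>First paragraph.</p></article>",
  "</body></html>"
)

# read_html() also accepts literal HTML, so no network access is needed
doc <- read_html(page_source)

# Select every <article> element with a CSS selector
article_nodes <- html_nodes(doc, "article")

# html_text() concatenates the text nodes without adding separators
article_text <- html_text(article_nodes)
print(article_text)
```

Because the text nodes are simply concatenated, a heading and the paragraph that follows it run together in the result, and the navigation text outside `<article>` is left out.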

Finally, to clean up your text, take a look at stringr:

library(stringr)
# x holds the article text from the previous step
x <- html_text(article_html)
str_replace_all(x, "[\r\n]", "")
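
Removing line breaks alone still leaves the long runs of spaces that come from the page's indentation. As a hedged sketch, `str_squish()` from stringr may be a better fit when the goal is readable text; `raw_text` below is a made-up sample, not the real scraped output.

```r
library(stringr)

# Hypothetical sample imitating the indentation of scraped text
raw_text <- "Method\n    \n   Open Access\n\n   Published: 19 March 2020\n"

# Deleting only CR/LF keeps the indentation spaces
no_breaks <- str_replace_all(raw_text, "[\r\n]", "")

# str_squish() trims both ends and collapses inner whitespace to one space
squished <- str_squish(raw_text)
print(squished)
```

Which of the two is appropriate depends on whether the spacing in the text matters for later processing.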
Score 0

Stack Overflow user

Answered 2020-03-29 00:43:28

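Try the rvest package to extract all the text in the article (text only). Note, however, that all HTML markup (including links, images, etc.) is stripped.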

# install.packages("rvest")

library(rvest)
url <- "https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01975-8"
article <- url %>% 
  read_html %>%
  html_node(css = 'article') %>%
  html_text
article

# Method\n    \n        \n            \n                Open Access\n            \n        \n    \n    \n\n                            Published: 19 March 2020\n                        VALOR2: characterization of large-scale structural variants using linked-reads\n                        Fatih Karaoğlanoğlu1 na1, Camir Ricketts2,4, Ezgi Ebren1, Marzieh Eslami Rasekh3, Iman Hajirasouliha4,5 & Can Alkan1,6 \n                            \n    Genome Biology\n\n                            volume 21, Article number: 72 (2020)\n            Cite this article\n                        \n                        \n    \n        \n            \n                        576 Accesses\n                    \n                \n                \n                    \n                        1 Citations\n                    \n                \n                \n                    \n                        \n                            12 Altmetric\n                        \n                    \n                \n                \n                    Metrics details\n                \n            \n    \n\n                        \n                        \n                        \n                    \n\n                    AbstractMost existing methods for structural variant detection focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced structural variants with no gain or loss of genomic segments, for example, inversions and translocations, is a particularly challenging task. Furthermore, there are very few algorithms to predict the insertion locus of large interspersed segmental duplications and characterize translocations. Here, we propose novel algorithms to characterize large interspersed segmental duplications, inversions, deletions, and translocations using linked-read sequencing data. 
We redesign our earlier algorithm, VALOR, and implement our new algorithms in a new software package, called VALOR2.BackgroundAlterations of DNA content and organization larger than 50 bp, commonly referred to as genomic structural variations (SVs) [1], are among the major drivers of evolution [2, 3] and diseases of genomic origin [4]. Despite decades of research, they remain difficult to accurately characterize contributing to our lack of full understanding of the etiology of complex diseases, termed missing heritability [5].High-throughput sequencing
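
If the markup is needed as well, a parsed node can be converted back to a string instead of going through `html_text()`. This is a sketch under the assumption that `as.character()` on the selected node is acceptable for the task; `page_source` is a hypothetical stand-in for the real page.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page
page_source <- paste0(
  "<html><body><article>",
  "<p>See <a href=\"#ref\">link</a></p>",
  "</article></body></html>"
)

node <- html_node(read_html(page_source), "article")

# Text only: the tags are dropped
text_only <- html_text(node)

# Serialized node: the tags, including the link, are kept
with_markup <- as.character(node)
print(with_markup)
```

The serialized string comes from the parsed document, so its whitespace may differ slightly from the raw page source.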

Regex solution, for extracting everything between the <article> tags (including the text and the other HTML tags inside):

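# Read the page source and collapse it into a single string
html <- paste(readLines(url), collapse = " ")
# Keep only the <article>...</article> element
article <- sub(".*(<article.*?>.*</article>).*", "\\1", html)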
article

/*
* # <article itemscope itemtype=\"http://schema.org/ScholarlyArticle\" lang=\"en\">                     <div class=\"c-article-header\">                                                   <ul class=\"c-article-identifiers\" data-test=\"article-identifier\">                                  <li class=\"c-article-identifiers__item\" data-test=\"article-category\">Method</li>                           <li class=\"c-article-identifiers__item\">                 <span class=\"c-article-identifiers__open\" data-test=\"open-access\">Open Access</span>             </li>                                                 <li class=\"c-article-identifiers__item\"><a href=\"#article-info\" data-track=\"click\" data-track-action=\"publication date\" data-track-category=\"article body\" data-track-label=\"link\">Published: <time datetime=\"2020-03-19\" itemprop=\"datePublished\">19 March 2020</time></a></li>                         </ul>                          <h1 class=\"c-article-title u-h1\" data-test=\"article-title\" data-article-title=\"\" itemprop=\"name headline\">VALOR2: characterization of large-scale structural variants using linked-reads</h1>                         <ul class=\"c-author-list js-list-authors js-etal-collapsed\" data-etal=\"25\" data-etal-small=\"3\" data-test=\"authors-list\"><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-1\">Fatih KaraoÄŸlanoÄŸlu</a></span><sup class=\"u-js-hide\"><a href=\"#Aff1\">1</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Bilkent University\" /><meta itemprop=\"address\" content=\"grid.18376.3b, 0000 0001 0723 2427, Department of Computer Engineering, Bilkent 
University, Ankara, 06800, Turkey\" /></span></sup><sup class=\"u-js-hide\">Â <a href=\"#na1\">na1</a></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-2\">Camir Ricketts</a></span><sup class=\"u-js-hide\"><a href=\"#Aff2\">2</a>,<a href=\"#Aff4\">4</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Cornell University\" /><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Tri-Institutional Computational Biology &amp; Medicine Program, Cornell University, 1300 York Ave, New York, 10065, NY, USA\" /></span><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Weill Cornell Medicine\" /><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA\" /></span></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-3\">Ezgi Ebren</a></span><sup class=\"u-js-hide\"><a href=\"#Aff1\">1</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Bilkent University\" /><meta itemprop=\"address\" content=\"grid.18376.3b, 0000 0001 0723 2427, Department of Computer Engineering, 
Bilkent University, Ankara, 06800, Turkey\" /></span></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-4\">Marzieh Eslami Rasekh</a></span><sup class=\"u-js-hide\"><a href=\"#Aff3\">3</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Boston University\" /><meta itemprop=\"address\" content=\"grid.189504.1, 0000 0004 1936 7558, Graduate Program in Bioinformatics, Boston University, 24 Cummington Mall, Boston, 02215, MA, USA\" /></span></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-5\" data-corresp-id=\"c1\">Iman Hajirasouliha<svg width=\"16\" height=\"16\" class=\"u-icon\"><use xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"#global-icon-email\"></use></svg></a></span><sup class=\"u-js-hide\"><a href=\"#Aff4\">4</a>,<a href=\"#Aff5\">5</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Weill Cornell Medicine\" /><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA\" /></span><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Weill Cornell Medicine\" 
/><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA\" /></span></sup> &amp; </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-6\" data-corresp-id=\"c2\">Can Alkan<svg width=\"16\" height=\"16\" class=\"u-icon\"><use xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"#global-icon-email\"></use></svg></a>
*/
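
For comparison, here is a sketch of why the pattern in the question finds nothing, and of a `regexpr()` plus `regmatches()` variant that returns the matched text. The `html` string is a made-up, single-line stand-in for the collapsed page source, and the second pattern is only one possible choice.

```r
# Hypothetical single-line stand-in for the collapsed page source
html <- paste0(
  "<html><body><header>Menu</header>",
  "<article lang=\"en\"><div><p>Text 123</p></div></article>",
  "<footer>End</footer></body></html>"
)

# The pattern from the question: (.[0-9]*) only allows one character
# followed by digits between the opening and closing tags, so it fails here
pos <- regexpr("<article (.*)?>(.[0-9]*)</article>", html)

# regexpr() reports a position (-1 when nothing matches), not the text itself
print(as.integer(pos))

# A lazy match from the opening tag to the first closing tag
m <- regexpr("<article\\b[^>]*>.*?</article>", html, perl = TRUE)

# regmatches() turns the match position into the matched substring
article <- regmatches(html, m)
print(article)
```

With `perl = TRUE`, `.` does not match a newline by default, so this relies on the source having been collapsed into one line first, as the `paste(readLines(url), collapse = " ")` step does.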
Score 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/60908722
