Removing HTML from the body of a text file

Stack Overflow user
Asked 2020-03-04 03:34:08
1 answer · 256 views · 0 followers · 1 vote

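I am currently writing a function that fetches an album's review and rating from Pitchfork and strips out the HTML. The result should be a list with two elements: the album's review and its score. So far I am still working out what to return, and how to write the regular expressions for the HTML that I pass to paste0. Thank you for your time!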

pitchfork = function(url){
  save = getURL(url)
  cat(save,file = "review.txt")
  a1 = '<div class="contents dropcap"><p>'
  b1 = str_replace(save, paste0("^.*",a1),"")
  a2 = '</div><a class="end-mark-container" href="/">'
  b2 = str_replace(b1, paste0(a2,".*$"),"")
}

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-03-04 04:48:41

How about something like this?

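library(xml2)
library(rvest)      # read_html(), html_nodes(), html_text()
library(tidyverse)  # enframe() comes from tibble

url <- "http://pitchfork.com/reviews/albums/grimes-miss-anthropocene"
html <- read_html(url)

# Extract the text of every <p> element, one row per paragraph
review <- html %>%
    html_nodes("p") %>%
    html_text() %>%
    enframe("paragraph_no", "text")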
review
## A tibble: 14 x 2
#   paragraph_no text
#          <int> <chr>
# 1            1 Best new music
# 2            2 Grimes’ first project as a bona fide pop star is more morose th…
# 3            3 In 2011, Grimes was eager to say in an interview that she had “…
# 4            4 Miss Anthropocene is Grimes’ fifth album and her first as that …
# 5            5 The result is a record that’s more morose than her previous wor…
# 6            6 In November 2018, Grimes released “We Appreciate Power,” a coll…
# 7            7 When Grimes veers away from high concept toward examining intim…
# 8            8 Miss Anthropocene thrills when it reveals a refined, linear evo…
# 9            9 So much about the actual music of Miss Anthropocene succeeds th…
#10           10 And that’s the obstacle, the slimy mouthfeel, standing in the w…
#11           11 Correction: An earlier version of this review erroneously state…
#12           12 Listen to our Best New Music playlist on Spotify and Apple Musi…
#13           13 Buy: Rough Trade
#14           14 (Pitchfork may earn a commission from purchases made through af…

review is a tibble containing the review split by paragraph; it may need some additional cleaning (such as removing the first and last rows).
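
As a rough sketch of that cleanup, assuming the unwanted rows are exactly the first and the last one, they can be dropped with dplyr's slice(). On the Grimes page above several trailing rows are boilerplate, so the indices would need adjusting. The review tibble below is made-up stand-in data, not scraped output, and review_clean is a hypothetical name.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for the scraped `review` tibble
review <- enframe(
    c("Best new music",
      "First paragraph of the review.",
      "Second paragraph of the review.",
      "Buy: Rough Trade"),
    name = "paragraph_no", value = "text")

# Drop the first and last rows, then renumber the remaining paragraphs
review_clean <- review %>%
    slice(-c(1, n())) %>%
    mutate(paragraph_no = row_number())

review_clean
```

Filtering on the text itself (for example, dropping rows that start with "Buy:") may be more robust than fixed positions, since the number of boilerplate rows differs between pages.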

For the score, we can use a class attribute selector.

# [class='score'] matches elements whose class attribute is exactly "score"
score <- html %>% html_nodes("[class='score']") %>% html_text() %>% as.numeric()
score
#[1] 8.2
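
To illustrate what the attribute selector does, here is a small sketch on made-up markup (it is not Pitchfork's actual page source). `[class='score']` requires the class attribute to be exactly "score", whereas the class selector `.score` also matches elements that carry additional classes. The variable names exact and any_score are illustrative only.

```r
library(rvest)

# Hypothetical markup used only to compare the two selectors
page <- read_html('<html><body>
  <span class="score">8.2</span>
  <span class="score bnm">9.0</span>
</body></html>')

# Attribute selector: the class attribute must equal "score" exactly
exact <- page %>%
    html_nodes("[class='score']") %>%
    html_text() %>%
    as.numeric()

# Class selector: "score" only has to be one of the element's classes
any_score <- page %>%
    html_nodes(".score") %>%
    html_text() %>%
    as.numeric()

exact
any_score
```

Which form is appropriate depends on the page: the exact match avoids picking up unrelated elements, but it returns nothing if the site adds a second class to the score element.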

Wrapping things up (in a function)

Let's wrap everything in a function that returns a list containing the review tibble and the numeric score.

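get_pitchfork_data <- function(url) {
    # Download and parse the page once, then query it twice
    html <- read_html(url)
    list(
        # One row per <p> element, with surrounding whitespace trimmed
        review = html %>%
            html_nodes("p") %>%
            html_text() %>%
            trimws() %>%
            enframe("paragraph_no", "text"),
        # Numeric score taken from the element with class "score"
        score = html %>%
            html_nodes("[class='score']") %>%
            html_text() %>%
            as.numeric())
}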
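
If the review is wanted as a single string rather than a tibble (the question asks for a list with two elements, the review and the score), a variant along the following lines might work. This is only a sketch: extract_review is a hypothetical helper that is not part of the answer above, it takes an already-parsed document so it can be tried without network access, and the markup is made up.

```r
library(rvest)

# Hypothetical helper: returns the review as one string plus the score
extract_review <- function(html) {
    paragraphs <- html %>% html_nodes("p") %>% html_text() %>% trimws()
    score <- html %>%
        html_nodes("[class='score']") %>%
        html_text() %>%
        as.numeric()
    list(
        # Join the non-empty paragraphs with a blank line between them
        review = paste(paragraphs[nzchar(paragraphs)], collapse = "\n\n"),
        # Fall back to NA when no score element is found
        score = if (length(score) > 0) score[1] else NA_real_)
}

# Made-up markup standing in for a review page
page <- read_html('<html><body>
  <span class="score">8.2</span>
  <p>First paragraph.</p>
  <p> Second paragraph. </p>
</body></html>')

result <- extract_review(page)
result
```

For a live page, the same helper could be called as extract_review(read_html(url)).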

Test 1:

Grimes - Miss Anthropocene

get_pitchfork_data("http://pitchfork.com/reviews/albums/grimes-miss-anthropocene")
#$review
## A tibble: 14 x 2
#   paragraph_no text
#          <int> <chr>
# 1            1 Best new music
# 2            2 Grimes’ first project as a bona fide pop star is more morose th…
# 3            3 In 2011, Grimes was eager to say in an interview that she had “…
# 4            4 Miss Anthropocene is Grimes’ fifth album and her first as that …
# 5            5 The result is a record that’s more morose than her previous wor…
# 6            6 In November 2018, Grimes released “We Appreciate Power,” a coll…
# 7            7 When Grimes veers away from high concept toward examining intim…
# 8            8 Miss Anthropocene thrills when it reveals a refined, linear evo…
# 9            9 So much about the actual music of Miss Anthropocene succeeds th…
#10           10 And that’s the obstacle, the slimy mouthfeel, standing in the w…
#11           11 Correction: An earlier version of this review erroneously state…
#12           12 Listen to our Best New Music playlist on Spotify and Apple Musi…
#13           13 Buy: Rough Trade
#14           14 (Pitchfork may earn a commission from purchases made through af…
#
#$score
#[1] 8.2

Test 2:

Radiohead - OK Computer (reissue)

get_pitchfork_data("https://pitchfork.com/reviews/albums/radiohead-ok-computer-oknotok-1997-2017/")
#$review
## A tibble: 12 x 2
#   paragraph_no text
#          <int> <chr>
# 1            1 Best new reissue
# 2            2 Twenty years on, Radiohead revisit their 1997 masterpiece with …
# 3            3 As they regrouped to figure out what their third album might be…
# 4            4 It’s still funny to think, two decades later, that Thom Yorke’s…
# 5            5 It’s unclear what happened to that album. OK Computer obviously…
# 6            6 OKNOTOK is something a little more interesting than a remaster …
# 7            7 But “Lift’s” reputation for positivity might be a little confus…
# 8            8 The most fun to be had with OKNOTOK is in these line-blurring m…
# 9            9 This fondness for camp and schlock has always been latent in Ra…
#10           10 The ghost of Bond followed them once they decamped from their s…
#11           11 Radiohead have been at least as brilliant at packaging and posi…
#12           12 Now that they have arrived at an autumnal, valedictory stage in…
#
#$score
#[1] 10
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/60518679
