I have a dataset in which each row contains a text string of this kind:
1)list(text = \"incredible hosts\", relevance = 0.87518, count = 1), list(text = \"Japan\", relevance = 0.675236, count = 1), list(text = \"support\", relevance = 0.625663, count = 1), list(text = \"result\", relevance = 0.359757, count = 1)
2)list(text = \"British fleet\", relevance = 0.912888, count = 1), list(text = \"worst maritime disasters\", relevance = 0.904047, count = 1), list(text = \"British history\", relevance = 0.755491, count = 1), list(text = \"Scilly Isles\", relevance = 0.716508, count = 1), list(text = \"sailors\", relevance = 0.691141, count = 1), list(text = \"evening\", relevance = 0.597375, count = 1), list(text = \"Tragedy\", relevance = 0.577141, count = 1), list(text = \"prize\", relevance = 0.565035, count = 1), list(text = \"rocks\", relevance = 0.543257, count = 1), list(text = \"innovation\", relevance = 0.529463, count = 1), list(text = \"longitude\", relevance = 0.335207, count = 1)
Basically, I just want to extract the text strings that sit between each \" and \".
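The result should look something like this:
1) "incredible hosts, Japan, support, result"
2) "British fleet, worst maritime disasters, British history, Scilly Isles, sailors, evening, etc..."
In addition, I would like to create a data frame that helps me keep track of the relevance score of each piece of text (bearing in mind that different rows may contain different numbers of text pieces), so that I end up with something like this:
col1              col2         col3     col4    col5  col6  ...  colA1    colA2     ...
incredible hosts  Japan        support  result  NA    NA    ...  0.87518  0.675236  ...
British fleet     worst marit...
Basically, the number of text columns equals the maximum number of text pieces found in any row, and the same holds for the score columns (each relevance score refers to one piece of text, so there are equally many of them).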
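If I could find a way to first extract the pieces of text and separate them with commas, and then do the same for the relevance scores, I think I could easily merge the two into one data frame. So the problem is mainly how to extract these two things from the text.
Thanks for your help,
Carlo
Posted on 2019-11-21 18:27:42
The strings shown above are almost valid R code. So, with only minimal changes, we can read the data straight into R: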
txt1 <- 'list(text = \"incredible hosts\", relevance = 0.87518, count = 1), list(text = \"Japan\", relevance = 0.675236, count = 1), list(text = \"support\", relevance = 0.625663, count = 1), list(text = \"result\", relevance = 0.359757, count = 1)'
txt2 <- 'list(text = \"British fleet\", relevance = 0.912888, count = 1), list(text = \"worst maritime disasters\", relevance = 0.904047, count = 1), list(text = \"British history\", relevance = 0.755491, count = 1), list(text = \"Scilly Isles\", relevance = 0.716508, count = 1), list(text = \"sailors\", relevance = 0.691141, count = 1), list(text = \"evening\", relevance = 0.597375, count = 1), list(text = \"Tragedy\", relevance = 0.577141, count = 1), list(text = \"prize\", relevance = 0.565035, count = 1), list(text = \"rocks\", relevance = 0.543257, count = 1), list(text = \"innovation\", relevance = 0.529463, count = 1), list(text = \"longitude\", relevance = 0.335207, count = 1)'
txt1 <- gsub("text = ", "id = 1, text = ", txt1) # this is just if you want to have an ID later on
txt2 <- gsub("text = ", "id = 2, text = ", txt2)
list1 <- eval(parse(text = paste0("list(", txt1, ")")))
list2 <- eval(parse(text = paste0("list(", txt2, ")")))
df <- dplyr::bind_rows(list1, list2)
df
#> # A tibble: 15 x 4
#>       id text                     relevance count
#>    <dbl> <chr>                        <dbl> <dbl>
#>  1     1 incredible hosts             0.875     1
#>  2     1 Japan                        0.675     1
#>  3     1 support                      0.626     1
#>  4     1 result                       0.360     1
#>  5     2 British fleet                0.913     1
#>  6     2 worst maritime disasters     0.904     1
#>  7     2 British history              0.755     1
#>  8     2 Scilly Isles                 0.717     1
#>  9     2 sailors                      0.691     1
#> 10     2 evening                      0.597     1
#> 11     2 Tragedy                      0.577     1
#> 12     2 prize                        0.565     1
#> 13     2 rocks                        0.543     1
#> 14     2 innovation                   0.529     1
#> 15     2 longitude                    0.335     1
From here, it is much easier to reshape the data into whatever form you want.
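As an illustration of that last step, here is a minimal base R sketch that turns the long table into the two shapes asked for in the question: one comma-separated string per row, and a wide data frame padded with NA. It is not part of the original answer. The small `df` below is rebuilt by hand (and truncated) so the block runs on its own, and the names `texts`, `pos`, `text_mat`, `rel_mat` and `wide` are made up for this example.

```r
# Long-format data shaped like the tibble above, rebuilt by hand (and truncated)
# so that this block is self-contained
df <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2),
  text = c("incredible hosts", "Japan", "support", "result",
           "British fleet", "worst maritime disasters", "British history"),
  relevance = c(0.87518, 0.675236, 0.625663, 0.359757,
                0.912888, 0.904047, 0.755491),
  stringsAsFactors = FALSE
)

# One comma-separated string per original row
texts <- vapply(split(df$text, df$id), paste, character(1), collapse = ", ")

# Position of each text piece within its row (1, 2, 3, ... per id)
df$pos <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Empty matrices with one row per id and one column per position, filled with NA
ids <- unique(df$id)
n_cols <- max(df$pos)
text_mat <- matrix(NA_character_, length(ids), n_cols,
                   dimnames = list(NULL, paste0("text", seq_len(n_cols))))
rel_mat <- matrix(NA_real_, length(ids), n_cols,
                  dimnames = list(NULL, paste0("rel", seq_len(n_cols))))

# Fill both matrices through (row, column) index pairs
idx <- cbind(match(df$id, ids), df$pos)
text_mat[idx] <- df$text
rel_mat[idx] <- df$relevance

# Wide data frame: id, then all text columns, then all relevance columns
wide <- data.frame(id = ids, text_mat, rel_mat, stringsAsFactors = FALSE)
```

Rows with fewer text pieces than the longest row simply keep NA in the trailing `text*` and `rel*` columns, which matches the layout sketched in the question.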
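Update
Following your comment, I have changed my answer to show how to use this approach on a larger dataset and how to get the result into quanteda:
Suppose you have read in your data, and each text is now one value in a vector: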
txt <- c('list(text = \"incredible hosts\", relevance = 0.87518, count = 1), list(text = \"Japan\", relevance = 0.675236, count = 1), list(text = \"support\", relevance = 0.625663, count = 1), list(text = \"result\", relevance = 0.359757, count = 1)',
'list(text = \"British fleet\", relevance = 0.912888, count = 1), list(text = \"worst maritime disasters\", relevance = 0.904047, count = 1), list(text = \"British history\", relevance = 0.755491, count = 1), list(text = \"Scilly Isles\", relevance = 0.716508, count = 1), list(text = \"sailors\", relevance = 0.691141, count = 1), list(text = \"evening\", relevance = 0.597375, count = 1), list(text = \"Tragedy\", relevance = 0.577141, count = 1), list(text = \"prize\", relevance = 0.565035, count = 1), list(text = \"rocks\", relevance = 0.543257, count = 1), list(text = \"innovation\", relevance = 0.529463, count = 1), list(text = \"longitude\", relevance = 0.335207, count = 1)')
Instead of changing each object separately, loop over the elements:
txt <- lapply(seq_along(txt), function(i) { # this is just if you want to have an ID later on
  gsub("text = ", paste0("id = ", i, ", text = "), txt[i])
})
# named df_list rather than list, to avoid masking base::list()
df_list <- lapply(txt, function(x) {
  dplyr::bind_rows(eval(parse(text = paste0("list(", x, ")"))))
})
df <- dplyr::bind_rows(df_list)
Once you have a data.frame, only a little data wrangling is needed before you can work with it in quanteda:
library(dplyr)
df_wide <- df %>%
  group_by(id) %>%
  summarise(text = paste(text, collapse = " "), relevance = list(relevance))
library(quanteda)
corp <- corpus(df_wide, docid_field = "id", text_field = "text")
corp
#> Corpus consisting of 2 documents and 1 docvar.
corp$documents$relevance # works in quanteda < 2.0; newer versions may need docvars(corp, "relevance") instead
#> [[1]]
#> [1] 0.875180 0.675236 0.625663 0.359757
#>
#> [[2]]
#> [1] 0.912888 0.904047 0.755491 0.716508 0.691141 0.597375 0.577141
#> [8] 0.565035 0.543257 0.529463 0.335207
Posted on 2019-11-21 18:22:23
Here is a base R approach that at least returns all of the quoted matches:
x <- "list(text = \"incredible hosts\", relevance = 0.87518, count = 1), list(text = \"Japan\", relevance = 0.675236, count = 1), list(text = \"support\", relevance = 0.625663, count = 1), list(text = \"result\", relevance = 0.359757, count = 1)"
m <- gregexpr("\"(.*?)\"", x)
regmatches(x, m)[[1]]
[1] "\"incredible hosts\"" "\"Japan\"" "\"support\""
[4] "\"result\""
https://stackoverflow.com/questions/58981254
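The `gregexpr()`/`regmatches()` approach from the base R answer can be pushed a little further to drop the surrounding quotes and to pull out the relevance scores as well. The following is only a sketch and not part of the original answer. It assumes the quotes in the data are plain double quotes and that every score appears as `relevance = <number>`. The names `quoted`, `texts`, `rel_raw`, `relevance` and `joined` are illustrative.

```r
x <- "list(text = \"incredible hosts\", relevance = 0.87518, count = 1), list(text = \"Japan\", relevance = 0.675236, count = 1), list(text = \"support\", relevance = 0.625663, count = 1), list(text = \"result\", relevance = 0.359757, count = 1)"

# Quoted pieces, then strip the double quotes themselves
quoted <- regmatches(x, gregexpr("\"(.*?)\"", x))[[1]]
texts <- gsub("\"", "", quoted, fixed = TRUE)

# Relevance scores: match "relevance = <number>", then drop the label
rel_raw <- regmatches(x, gregexpr("relevance = [0-9.]+", x))[[1]]
relevance <- as.numeric(sub("relevance = ", "", rel_raw, fixed = TRUE))

# One comma-separated string, as requested in the question
joined <- paste(texts, collapse = ", ")
joined
#> [1] "incredible hosts, Japan, support, result"
```

Because `texts` and `relevance` come out in the same order, they can be placed side by side, for example with `data.frame(text = texts, relevance = relevance)`.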