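I have a dataset of tweets, some of which are original and some of which are retweets. For some reason, retweets are truncated with "...", so the full text is not shown. In my dataset the original tweet is (hopefully) always present, so I would like to find the original tweet and use it to replace the truncated one.
For example: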
library(tibble)

my_data <- tribble(
  ~user, ~text,
  "Peter", "Hello, this is Peter, I like ice cream!",
  "John", "RT @Peter: Hello, this is Peter, I like ...",
  "Martha", "RT @Peter: Hello, this is Peter, I like ...",
  "Julia", "Hi, I really like apples!",
  "Bjorn", "RT @Julia: I really like ..."
)

# A tibble: 5 x 2
  user   text
  <chr>  <chr>
1 Peter  Hello, this is Peter, I like ice cream!
2 John   RT @Peter: Hello, this is Peter, I like ...
3 Martha RT @Peter: Hello, this is Peter, I like ...
4 Julia  Hi, I really like apples!
5 Bjorn  RT @Julia: I really like ...

I would like to find every instance of RT @username: some text... and replace it with the full tweet. Basically:
# A tibble: 5 x 2
  user   text
  <chr>  <chr>
1 Peter  Hello, this is Peter, I like ice cream!
2 John   RT @Peter: Hello, this is Peter, I like ice cream!
3 Martha RT @Peter: Hello, this is Peter, I like ice cream!
4 Julia  Hi, I really like apples!
5 Bjorn  RT @Julia: Hi, I really like apples!

I have already extracted the handle being retweeted and split the match into groups:
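library(stringr)

retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"
str_match(my_data$text, retweet_pattern)

However, I am not sure how to proceed from here. Since user/text pairs are not necessarily unique (i.e. one user may have several tweets that were retweeted), simply looking up the retweeted handle and swapping in that user's text will not work. Maybe I need a string metric such as Levenshtein?
Thanks.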
Posted on 2021-07-28 20:19:49
Since the truncated retweet text appears verbatim in the non-retweet data, you can try this.
library(dplyr)
library(tidyr)

# Retweets only: split the "RT @user" prefix from the truncated text.
# extra = 'merge' keeps any later colons inside the tweet text.
rt_data <- my_data %>%
  filter(grepl('^RT @', text)) %>%
  separate(text, c('name', 'text'), sep = ':\\s*', extra = 'merge')

# Tweets that are not retweets.
no_rt_data <- my_data %>% filter(!grepl('^RT @', text))

# Strip the trailing "..." and look up the original tweet that contains the
# remaining fragment. fixed = TRUE matches the fragment literally, not as a regex;
# [1] keeps the first match (NA if there is none).
rt_data$text <- sapply(sub('\\s*\\.+$', '', rt_data$text),
                       function(x) no_rt_data$text[grepl(x, no_rt_data$text, fixed = TRUE)][1],
                       USE.NAMES = FALSE)

# Recombine the "RT @user" prefix and the full tweet.
rt_data <- rt_data %>% unite(text, name, text, sep = ': ')

# Combine the two data frames.
bind_rows(no_rt_data, rt_data)

#   user   text
#   <chr>  <chr>
# 1 Peter  Hello, this is Peter, I like ice cream!
# 2 Julia  Hi, I really like apples!
# 3 John   RT @Peter: Hello, this is Peter, I like ice cream!
# 4 Martha RT @Peter: Hello, this is Peter, I like ice cream!
# 5 Bjorn  RT @Julia: Hi, I really like apples!

https://stackoverflow.com/questions/68559087
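
The question also raises the case where one user has several tweets that were retweeted. A possible variation, as a minimal sketch only: reuse the `retweet_pattern` from the question, restrict candidates to tweets written by the retweeted handle, and then look for the truncated fragment inside those. `restore_retweets()` is a hypothetical helper name chosen for this example, and the sketch assumes the fragment before the "..." appears verbatim in the original tweet.

```r
library(tibble)
library(stringr)

# Sample data copied from the question.
my_data <- tribble(
  ~user, ~text,
  "Peter", "Hello, this is Peter, I like ice cream!",
  "John", "RT @Peter: Hello, this is Peter, I like ...",
  "Martha", "RT @Peter: Hello, this is Peter, I like ...",
  "Julia", "Hi, I really like apples!",
  "Bjorn", "RT @Julia: I really like ..."
)

retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"

# Hypothetical helper: replaces truncated retweets in place, keeping row order.
restore_retweets <- function(data) {
  parts <- str_match(data$text, retweet_pattern)
  handle <- parts[, 2]                        # NA for rows that are not retweets
  fragment <- sub("\\s*\\.+$", "", parts[, 3]) # drop the trailing "..."
  is_rt <- !is.na(handle)
  originals <- data[!is_rt, ]

  for (i in which(is_rt)) {
    # Candidates: tweets by the retweeted user that contain the fragment literally.
    hit <- originals$text[originals$user == handle[i] &
                            grepl(fragment[i], originals$text, fixed = TRUE)]
    # Leave the row untouched when no original is found; take the first hit otherwise.
    if (length(hit) >= 1) {
      data$text[i] <- paste0("RT @", handle[i], ": ", hit[1])
    }
  }
  data
}

result <- restore_retweets(my_data)
result
```

Filtering by handle first means two users who happen to post similar text are not confused with each other. If the fragment does not match any original exactly, a string distance such as base R's `adist()` could serve as a fallback for ranking that user's tweets, though that is not needed for the sample data.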