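I am learning about text mining and rtweet, and I am currently brainstorming the easiest way to clean the text obtained from tweets. I have been using the method recommended at this link to remove URLs, remove anything other than English letters or spaces, remove stop words, remove extra whitespace, remove numbers, and remove punctuation.
That method uses both gsub and tm_map(), and I was wondering whether the cleaning process could be streamlined by using stringr and simply adding each step to a cleaning pipeline. I saw an answer on the site that suggested the following function, but for some reason I cannot get it to run.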
library(dplyr)    # provides %>%
library(stringr)  # provides the str_* functions

clean_tweets <- function(x) {
  x %>%
    str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%  # URLs
    str_replace_all("&", "and") %>%
    str_remove_all("[[:punct:]]") %>%                          # punctuation
    str_remove_all("^RT:? ") %>%                               # retweet marker
    str_remove_all("@[[:alnum:]]+") %>%                        # mentions
    str_remove_all("#[[:alnum:]]+") %>%                        # hashtags
    str_replace_all("\\\n", " ") %>%                           # newlines
    str_to_lower() %>%
    str_trim("both")
}

Cleaning solution:
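tweetsClean <- df %>%
  mutate(clean = clean_tweets(text))

Finally, is it possible to keep the emoji so that I can count how often each one is used, and possibly assign a custom sentiment to each emoji?

Emoji solution: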
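library(emo)
library(tidyr)  # provides unnest()

TopEmoji <- tweetsClean %>%
  mutate(emoji = ji_extract_all(text)) %>%
  unnest(cols = c(emoji)) %>%
  count(emoji, sort = TRUE) %>%
  top_n(5)

Once the text values have been cleaned, my process is to select the relevant columns, add a row number to keep track of which tweet each word belongs to, and unnest the tokens.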
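library(tidytext)  # provides unnest_tokens() and stop_words

tweetsClean <- tweets %>%
  select(created_at, text) %>%
  mutate(linenumber = row_number()) %>%
  select(linenumber, everything()) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

After this, I attach the desired sentiment lexicons and assign each tweet a value based on the sum of its AFINN sentiment scores: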
sentiment_bing <- get_sentiments("bing")
sentiment_AFINN <- get_sentiments("afinn")

tweetsValue <- tweetsClean %>%
  inner_join(sentiment_bing) %>%
  inner_join(sentiment_AFINN) %>%
  group_by(linenumber, created_at) %>%
  mutate(TweetValue = sum(value))

Thank you for any pointers!
TestData:
df <- structure(list(created_at = structure(c(1622854597, 1622853904,
1622853716, 1622778852, 1622448379, 1622450951, 1622777623, 1622853561,
1622466544, 1622853192), tzone = "UTC", class = c("POSIXct",
"POSIXt")), text = c("@elonmusk can the dogefather ride @CumRocketCrypto into the night. #SpaceX @dogecoin https://twitter.com/",
"@CryptoCrunchApp @CumRocketCrypto @vergecurrency @InuSanshu @Mettalex @UniLend_Finance @NuCypher @Chiliz @JulSwap @CurveFinance @PolyDoge Wrong this twitt shansu",
"9am AEST Sunday morning!!!\nI will be hosting on the @CumRocketCrypto twitch channel!\n\nSo cum say Hi! https://twitter.com/",
"@SamInCrypt1 @IamMars34147875 @DylanMcKitten @elonmusk @CumRocketCrypto Cumrocket <U+0001F4A6> https://twitter.com/",
"@DK19663019 @CumRocketCrypto Oh hey, that's me! Did you grab one?",
"@DK19663019 @CumRocketCrypto Thank you! <U+2764><U+FE0F>", "@CumRocketInfo @elonmusk @CumRocketCrypto Maybe he'd like to meet the CUMrocket models? https://twitter.com/",
"@AerotyneToken @CumRocketCrypto Is there a way to make sure ones wallet ID is on the list?",
"@AerotyneToken @CumRocketCrypto Does one have to attend the giveaway stream, or just hold 0.2 BNB of #CUMMIES and #ATYNE?\nAnd what happens if I bought about 0.2BNB each and the BNB price rises? Do I have to check every day if they're still worth at least 0.2?",
"@Don_Santino1 @brandank_cr @PAWGcoinbsc @Tyga @CumRocketCrypto Massive bull flag. 10x is imminent!"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

Posted on 2021-06-05 02:52:06
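To answer your main question: the clean_tweets() function is not working in the line "Clean <- tweets %>% clean_tweets", presumably because you are feeding it a data frame. The function's internals (that is, the str_ functions) require character vectors (strings).
Cleaning issue
I say "presumably" here because I am not sure what your tweets object looks like, so I cannot be certain. However, at least on your test data, the following solves the problem.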
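The distinction between the data frame and its text column can be checked directly. The following is a minimal sketch, assuming a toy tibble with made-up text and a simplified, hypothetical cleaner called clean_text (neither is from the question); it only illustrates that piping the whole object passes a data frame, while pulling the column out first passes a character vector.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the tweets object (made-up text, not the real data)
tweets <- tibble(text = c("Hello World  ", "  RT this"))

# Simplified, hypothetical cleaner that expects a character vector
clean_text <- function(x) {
  x %>%
    str_to_lower() %>%
    str_trim("both")
}

# Piping the whole object hands the function a data frame
whole_is_character <- tweets %>% is.character()

# Piping the column hands it a character vector
column_is_character <- tweets$text %>% is.character()

# pull() extracts the column inside a pipeline before cleaning
cleaned_vec <- tweets %>%
  pull(text) %>%
  clean_text()
```

Here whole_is_character is FALSE and column_is_character is TRUE, which is the mismatch described above.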
df %>%
  mutate(clean = clean_tweets(text))

If you only want the character vector back, you can also do

clean_tweets(df$text)

Emoji issue
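Regarding the possibility of keeping the emoji and assigning them a sentiment: yes, I think you would proceed the same way as you do for the rest of the text, that is, tokenize them, assign a numeric value to each one, and then aggregate.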
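The tokenize, score, and aggregate steps could be sketched roughly as follows. This is only an illustration under stated assumptions: the emoji_lexicon table and its scores are made up, the sample tweets are toy data, and the pattern \p{So} (Unicode "Symbol, other") is used as a rough stand-in for "emoji" that will also match some non-emoji symbols. A dedicated extractor such as ji_extract_all() from the emo package, as used in the question, could be substituted.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# Hypothetical emoji lexicon: these scores are invented for illustration
emoji_lexicon <- tibble(
  emoji = c("\U0001F4A6", "\u2764"),
  value = c(1, 3)
)

# Toy tweets, one row per tweet
tweets <- tibble(
  linenumber = 1:3,
  text = c("Cumrocket \U0001F4A6", "Thank you! \u2764\uFE0F", "No emoji here")
)

emoji_scores <- tweets %>%
  # Tokenize: extract symbol characters (approximation of emoji)
  mutate(emoji = str_extract_all(text, "\\p{So}")) %>%
  # One row per extracted symbol; tweets without any are dropped
  unnest(cols = c(emoji)) %>%
  # Score: attach the custom value for each known emoji
  inner_join(emoji_lexicon, by = "emoji") %>%
  # Aggregate: sum the values per tweet
  group_by(linenumber) %>%
  summarise(EmojiValue = sum(value), .groups = "drop")
```

The result has one row per tweet that contains at least one emoji from the lexicon, so it can be joined back to the word-level scores by linenumber.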
https://stackoverflow.com/questions/67845605