I have the following data in R:
Id titles
1 emami paper mills slips 10% on dismal q4 numbers
2 jsw steel q4fy17 standalone net profit rises 173.33%
3 fmcg major hul q4fy17 standalone net profit rises 6.2
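4 chennai petroleum, allsec tech slip 6-7% on poor q4
I also have names in a vector:
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
I want to match the titles column of the dataframe against the strings in the vector and put the matching string in a new column. The output I want is: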
Id titles names
1 emami paper mills slips 10% on dismal q4 numbers emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai petroleum corp ltd
I tried the following code, but it does not give me what I want:
df[grepl(paste(names, collapse="|"), df$titles),]
How can I do this in R?
Posted on 2017-05-17 13:01:06
If I understood you correctly, you can do this with base R's gregexpr together with regmatches and gsub.
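To make the mechanism concrete, here is a minimal sketch of how those three functions fit together. The vectors x and keys are made-up toy inputs, not the data from the question, and the variable names are purely illustrative.

```r
# Toy inputs (hypothetical, not the asker's data)
x <- c("alpha beta gamma", "delta epsilon", "no match here")
keys <- c("beta corp", "delta inc")

# gsub keeps only the first word of each key
first_words <- gsub("^(\\w+).*", "\\1", keys)

# gregexpr locates any of the first words; regmatches extracts the matched text
pattern <- paste0(first_words, collapse = "|")
hits <- regmatches(x, gregexpr(pattern, x))

# Elements without a match come back as character(0); mark them as NA
hits[lengths(hits) == 0] <- NA
found <- unlist(hits)
print(found)
```

This sketch assumes each string contains at most one of the keys; with several matches per string, unlist would return more elements than there are inputs.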
Data (edited after the OP changed the question):
options(stringsAsFactors = F)
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
"jsw steel q4fy17 standalone net profit rises 173.33%",
"fmcg major hul q4fy17 standalone net profit rises 6.2",
"chennai petroleum, allsec tech slip 6-7% on poor q4"),stringsAsFactors = F)
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
Regex:
library(dplyr)
library(stringr)
newnames <- gsub("^(\\w+).*","\\1",names)
regmat <- regmatches(df$titles,gregexpr(paste0(newnames,collapse="|"),df$titles))
regmat[lapply(regmat,length) == 0] <- NA
df <- data.frame(cbind(df,newnames =do.call("rbind",regmat)),stringsAsFactors = F)
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
You can also use the stringr library, as follows:
library(dplyr)  # provides left_join
library(stringr)
newnames <- str_replace(names,"^(\\w+).*","\\1")
df$newnames <- str_extract(df$titles,paste0(newnames,collapse="|"))
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
Output:
> left_join(df,df1,by="newnames")
titles newnames names
1 emami paper mills slips 10% on dismal q4 numbers emami emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai chennai petroleum corp ltd
Posted on 2017-05-17 13:01:03
Remove " ltd" from your names:
names <- gsub(" ltd","",names)
Posted on 2017-05-17 13:09:27
sqldf can also be used for this kind of "fuzzy" merge.
Build the lookup:
names <- data.frame(name = c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs"))
names$lookup <- gsub("(\\w+).*", "\\1", names$name)
Perform the merge:
library(sqldf)
res <- sqldf("SELECT l.*, r.name
FROM df as l
LEFT JOIN names as r
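ON l.titles LIKE '%'||r.lookup||'%'")
Note: I extract the first word for the lookup because you said you only need "hul", not "hul india". Also, in SQL, || means concatenation and % is a wildcard that matches anything, so this condition matches any lookup found in the text, regardless of what comes before or after it.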
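As a rough illustration of what that join condition tests, the sketch below reproduces the substring check in base R without sqldf. It is an approximation under stated assumptions, not what sqldf runs internally: it keeps only the first matching name per title (a real LEFT JOIN would return one row per matching pair), and grepl with fixed = TRUE is case-sensitive, whereas SQLite's LIKE ignores ASCII case by default. The names contains and matched are introduced here for illustration only.

```r
# Same titles and names as in the answer above
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2",
            "chennai petroleum, allsec tech slip 6-7% on poor q4")
name <- c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs")
lookup <- gsub("(\\w+).*", "\\1", name)

# contains[i, j] is TRUE when title i contains lookup j as a literal substring,
# which is the check expressed by: titles LIKE '%' || lookup || '%'
contains <- sapply(lookup, function(l) grepl(l, titles, fixed = TRUE))

# Keep the first matching name per title; NA stands in for unmatched LEFT JOIN rows
matched <- apply(contains, 1, function(row) {
  if (any(row)) name[which(row)[1]] else NA_character_
})
print(data.frame(titles = titles, name = matched))
```

With this lookup table the fourth title stays unmatched, because "chennai petroleum corp ltd" is not among the names used in this answer.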
Another option is to use Reduce and then merge:
df$lookup <- Reduce( function(x, y) {x[grepl(y,x)] <- y; x}, c(list(df$titles), names$lookup))
merge(df, names)
https://stackoverflow.com/questions/44025181