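I have a dataset with a column that holds a set of countries in each row. Sometimes a country is repeated more than once, and I want to count the number of unique countries in each row of my dataset, as follows: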
> class(address_countries2$address_countries)
[1] "character"
> head(address_countries2)
address_countries
1 China China
2 China China China
3 China China
4 China China
5 China China China China China China
6 China China Uk China
The desired output would be a new column, as shown below:
address_countries n_countries
1 China China 1
2 China China China 1
3 China China 1
4 China China 1
5 China China China China China China 1
6 China China Uk China 2
This code gives the number of words in each row:
address_countries2 <- address_countries2 %>%
  select(address_countries) %>%
  mutate(n_countries = str_count(address_countries, boundary("word")))
> head(address_countries2)
address_countries n_countries
1 China China 2
2 China China China 3
3 China China 2
4 China China 2
5 China China China China China China 6
6 China China Uk China 4
I tried combining n_distinct(), as well as unique() and distinct(), with str_count(), but I got the following error:
Error in mutate_impl(.data, dots) :
Column `n_countries` must be length 34760 (the number of rows) or one, not 39
Any suggestions?
Answered 2018-02-05 11:34:51
Try this:
Your data.frame:
address_countries2<-data.frame(address_countries=c("Chian","China China","China UK"))
Number of countries:
list_country<-strsplit(as.character(address_countries2$address_countries)," ")
list_country
[[1]]
[1] "Chian"
[[2]]
[1] "China" "China"
[[3]]
[1] "China" "UK"
Add the "n_countries" column:
address_countries2$n_countries<-unlist(lapply(lapply(list_country, unique),length))
Output:
address_countries2
address_countries n_countries
1 Chian 1
2 China China 1
3 China UK 2
Answered 2018-02-05 11:41:12
You can split address_countries into a list and then use n_distinct.
library(purrr)
library(dplyr)
library(stringr)
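df %>%
  mutate(n_countries = map_int(address_countries, ~
    .x %>%
      str_trim %>%
      str_split(" ") %>%
      unlist() %>%
      n_distinct))
map_int applies the function given after the comma to each element of address_countries and returns an integer.
str_trim removes whitespace from the start and end of the string.
str_split splits the string, using " " as the split pattern.
unlist turns the result of str_split (a list) into a vector.
n_distinct counts the unique elements of the resulting vector.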
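To make the steps above concrete, here is a minimal sketch that runs the same pipeline on a single string, one step at a time. The sample string is row 6 of the question's data with some padding added, and the names x, trimmed, pieces, flat and n_unique are only illustrative.

```r
library(stringr)
library(dplyr)

# One illustrative value: row 6 of the question, with padding added (assumption)
x <- "  China China Uk China "

trimmed  <- str_trim(x)              # "China China Uk China"
pieces   <- str_split(trimmed, " ")  # a list holding one character vector
flat     <- unlist(pieces)           # c("China", "China", "Uk", "China")
n_unique <- n_distinct(flat)         # 2

n_unique
```

Note that splitting on a single " " would produce empty strings if countries were separated by more than one space; in that case a pattern such as "\\s+" may be a safer choice.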
Data:
df <- tibble(address_countries = c("China China", "China China China", "China China",
                                   "China China", "China China China China China China",
                                   "China China Uk China"))
Answered 2018-02-05 11:45:35
This should give you what you want:
ac$n_countries <- lengths(lapply(strsplit(ac$countries, split = ' '), unique))
Result:
> ac
countries n_countries
1 Chian 1
2 China China 1
3 China UK 2
Data:
ac <- data.frame(countries = c("Chian","China China","China UK"), stringsAsFactors = FALSE)
https://stackoverflow.com/questions/48621428