I have a long data frame:

mydf <- data.frame(
  date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
  value=c(1,2,3,4,5,1,2,3,4,5),
  country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
  indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population"))
         date value country  indicator
1  2016-01-01     1      US        gdp
2  2016-02-01     2      US        gdp
3  2016-03-01     3      US        gdp
4  2016-04-01     4      US        gdp
5  2016-05-01     5      US        gdp
6  2016-02-01     1      US population
7  2016-03-01     2      US population
8  2016-04-01     3      US population
9  2016-05-01     4      US population
10 2016-06-01     5      US population

I want to create specific new indicators from ratios, for example gdp / population * 1000.
The result would look like this. Each ratio must be computed from values of the two indicators that share the same date.

mydf <- data.frame(
  date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"),
  value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2),
  country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
  indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita"))
         date value country      indicator
1  2016-01-01  1.00      US            gdp
2  2016-02-01  2.00      US            gdp
3  2016-03-01  3.00      US            gdp
4  2016-04-01  4.00      US            gdp
5  2016-05-01  5.00      US            gdp
6  2016-02-01  1.00      US     population
7  2016-03-01  2.00      US     population
8  2016-04-01  3.00      US     population
9  2016-05-01  4.00      US     population
10 2016-06-01  5.00      US     population
11 2016-02-01  2.00      US gdp per capita
12 2016-03-01  1.50      US gdp per capita
13 2016-04-01  1.33      US gdp per capita
14 2016-05-01  1.20      US gdp per capita

Is there a simple way to do this in R?
Posted on 2017-12-06 05:55:26
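Yes. I think a tidy approach with tidyr and dplyr makes this kind of change easier: spread the indicators into columns, compute the ratio, then gather back into long format.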
library(dplyr)
library(tidyr)
df <- tribble(
~date, ~value, ~country, ~indicator,
"2016-01-01", 1, "US", "gdp",
"2016-02-01", 2, "US", "gdp",
"2016-03-01", 3, "AU", "gdp",
"2016-04-01", 4, "US", "gdp",
"2016-05-01", 5, "US", "gdp",
"2016-02-01", 1, "US", "population",
"2016-03-01", 2, "AU", "population",
"2016-04-01", 3, "US", "population",
"2016-05-01", 4, "US", "population",
"2016-06-01", 5, "US", "population"
)
df %>%
  group_by(country) %>%
  spread(indicator, value) %>%
  mutate(`gdp per capita` = gdp / population) %>%
  gather(indicator, value, -c(date, country)) %>%
  drop_na(value)
# # A tibble: 14 x 4
# # Groups:   country [2]
#    date       country indicator         value
#    <chr>      <chr>   <chr>             <dbl>
#  1 2016-01-01 US      gdp            1.000000
#  2 2016-02-01 US      gdp            2.000000
#  3 2016-03-01 AU      gdp            3.000000
#  4 2016-04-01 US      gdp            4.000000
#  5 2016-05-01 US      gdp            5.000000
#  6 2016-02-01 US      population     1.000000
#  7 2016-03-01 AU      population     2.000000
#  8 2016-04-01 US      population     3.000000
#  9 2016-05-01 US      population     4.000000
# 10 2016-06-01 US      population     5.000000
# 11 2016-02-01 US      gdp per capita 2.000000
# 12 2016-03-01 AU      gdp per capita 1.500000
# 13 2016-04-01 US      gdp per capita 1.333333
# 14 2016-05-01 US      gdp per capita 1.250000

N.B. I modified the data and added a group_by statement to demonstrate the solution with multiple values of country.
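
For comparison, the same date matching can be sketched in base R with a join instead of a reshape. This is only an illustration, not part of the original answer: the variable names (gdp, pop, both, ratio, result) are made up here, and the data is the modified data from the answer above. merge() with its default inner join keeps only the (date, country) pairs that have both a gdp and a population value, which is what aligns the dates.

```r
# Same modified data as in the answer above, as a plain data frame.
df <- data.frame(
  date = c("2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
           "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
  value = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  country = c("US", "US", "AU", "US", "US", "US", "AU", "US", "US", "US"),
  indicator = rep(c("gdp", "population"), each = 5),
  stringsAsFactors = FALSE
)

# One table per indicator, keeping only the key columns and the value.
gdp <- df[df$indicator == "gdp", c("date", "country", "value")]
pop <- df[df$indicator == "population", c("date", "country", "value")]

# Inner join on date and country; the two value columns become
# value_gdp and value_pop.
both <- merge(gdp, pop, by = c("date", "country"), suffixes = c("_gdp", "_pop"))

# Build the new indicator rows in the same column layout as df.
ratio <- data.frame(
  date = both$date,
  value = both$value_gdp / both$value_pop,
  country = both$country,
  indicator = "gdp per capita",
  stringsAsFactors = FALSE
)

# Append the new indicator to the original long data.
result <- rbind(df, ratio)
print(result)
```

Dates that exist for only one of the two indicators (2016-01-01 and 2016-06-01 here) drop out of the join, so no drop_na step is needed.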
Posted on 2017-12-06 09:36:56
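Personally, I find the reshape package easier to use, and it automatically handles multiple countries, however many kinds of labels or data types there are.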
library(reshape)
mydf <- data.frame(
date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
"2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01","2016-05-01"),
value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2, 2),
country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", 'AU'),
indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population",
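"population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita", 'gdp'))

To get the new indicator, first put the data in wide format so that the relevant columns sit side by side. That lets you do simple column-wise operations.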
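df_wide = cast(mydf, date+country~indicator, sum)

You want country and date as the columns that uniquely identify a row (the left-hand side of the formula) and the different indicators as columns (the right-hand side of the formula).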
        date country gdp gdp per capita population
1 2016-01-01      US   1           0.00          0
2 2016-02-01      US   2           2.00          1
3 2016-03-01      US   3           1.50          2
4 2016-04-01      US   4           1.33          3
5 2016-05-01      AU   2           0.00          0
6 2016-05-01      US   5           1.20          4
7 2016-06-01      US   0           0.00          5

Now create a new column and set it to whatever you want:
df_wide['g_p_ratio'] = df_wide['gdp'] / df_wide['population']

Then use melt to return it to your long format.
df_new = melt(df_wide, id=c('date', 'country'))

Voilà!
                       date country value      indicator
gdp              2016-01-01      US  1.00            gdp
gdp.1            2016-02-01      US  2.00            gdp
gdp.2            2016-03-01      US  3.00            gdp
gdp.3            2016-04-01      US  4.00            gdp
gdp.4            2016-05-01      AU  2.00            gdp
gdp.5            2016-05-01      US  5.00            gdp
gdp.6            2016-06-01      US  0.00            gdp
gdp.per.capita   2016-01-01      US  0.00 gdp per capita
gdp.per.capita.1 2016-02-01      US  2.00 gdp per capita
gdp.per.capita.2 2016-03-01      US  1.50 gdp per capita
gdp.per.capita.3 2016-04-01      US  1.33 gdp per capita
gdp.per.capita.4 2016-05-01      AU  0.00 gdp per capita
gdp.per.capita.5 2016-05-01      US  1.20 gdp per capita
gdp.per.capita.6 2016-06-01      US  0.00 gdp per capita
population       2016-01-01      US  0.00     population
population.1     2016-02-01      US  1.00     population
population.2     2016-03-01      US  2.00     population
population.3     2016-04-01      US  3.00     population
population.4     2016-05-01      AU  0.00     population
population.5     2016-05-01      US  4.00     population
population.6     2016-06-01      US  5.00     population

You may or may not want new row labels; if you do, you can reset them:
rownames(df_new) <- 1:nrow(df_new)

https://stackoverflow.com/questions/47667022
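
One thing to watch in the wide table from the reshape answer: because sum is the aggregating function, date and country combinations with no data show up as 0, so a ratio computed on them comes out as Inf or NaN rather than missing. A minimal base R sketch of the same wide-then-long idea that leaves those cells as NA is shown below. It is an assumption-laden illustration rather than the answerer's method: it uses stats::reshape() instead of the reshape package, the names wide, ok, ratio and result are invented for the example, and the data is the question's original ten rows plus the extra AU row from that answer.

```r
# The question's ten rows plus one AU gdp row with no matching population.
mydf <- data.frame(
  date = c("2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
           "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01",
           "2016-05-01"),
  value = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2),
  country = c(rep("US", 10), "AU"),
  indicator = c(rep("gdp", 5), rep("population", 5), "gdp"),
  stringsAsFactors = FALSE
)

# Long -> wide with base R. A (date, country) pair that lacks an indicator
# gets NA in that column instead of 0.
wide <- reshape(mydf, idvar = c("date", "country"), timevar = "indicator",
                direction = "wide")

# The value columns are named value.gdp and value.population.
wide$g_p_ratio <- wide$value.gdp / wide$value.population

# For contrast, dividing by a zero-filled cell gives Inf or NaN:
print(c(1 / 0, 0 / 0))

# Keep only the rows where the ratio is defined and stack them onto the
# original long data as a new indicator.
ok <- !is.na(wide$g_p_ratio)
ratio <- data.frame(
  date = wide$date[ok],
  value = wide$g_p_ratio[ok],
  country = wide$country[ok],
  indicator = "g_p_ratio",
  stringsAsFactors = FALSE
)
result <- rbind(mydf, ratio)
print(result)
```

With NA as the fill value, the rows that cannot have a ratio are easy to filter out with is.na(); with a zero fill you would instead have to test the result with is.finite().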