Creating a new ratio indicator in long data

Stack Overflow user
Asked on 2017-12-06 04:54:29
Answers: 2, Views: 55, Followers: 0, Votes: 3

I have a long-format data frame:

mydf <- data.frame(
    date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
    value=c(1,2,3,4,5,1,2,3,4,5),
    country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
    indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population"))

         date value country  indicator
1  2016-01-01     1      US        gdp
2  2016-02-01     2      US        gdp
3  2016-03-01     3      US        gdp
4  2016-04-01     4      US        gdp
5  2016-05-01     5      US        gdp
6  2016-02-01     1      US population
7  2016-03-01     2      US population
8  2016-04-01     3      US population
9  2016-05-01     4      US population
10 2016-06-01     5      US population

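I want to create specific new indicators from ratios, for example gdp/population*1000.

The result would look like this. The values must be matched on the correct date for each indicator.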

mydf <- data.frame(
    date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"),
    value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2),
    country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
    indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita"))

         date value country      indicator
1  2016-01-01  1.00      US            gdp
2  2016-02-01  2.00      US            gdp
3  2016-03-01  3.00      US            gdp
4  2016-04-01  4.00      US            gdp
5  2016-05-01  5.00      US            gdp
6  2016-02-01  1.00      US     population
7  2016-03-01  2.00      US     population
8  2016-04-01  3.00      US     population
9  2016-05-01  4.00      US     population
10 2016-06-01  5.00      US     population
11 2016-02-01  2.00      US gdp per capita
12 2016-03-01  1.50      US gdp per capita
13 2016-04-01  1.33      US gdp per capita
14 2016-05-01  1.20      US gdp per capita

Is there a simple way to do this in R?

2 Answers

Stack Overflow user

Accepted answer

Posted on 2017-12-06 05:55:26

Yes, I think the tidy approach with tidyr and dplyr makes the required changes easier.

library(dplyr)
library(tidyr)

df <- tribble(
         ~date, ~value, ~country,   ~indicator,
  "2016-01-01",      1,     "US",        "gdp",
  "2016-02-01",      2,     "US",        "gdp",
  "2016-03-01",      3,     "AU",        "gdp",
  "2016-04-01",      4,     "US",        "gdp",
  "2016-05-01",      5,     "US",        "gdp",
  "2016-02-01",      1,     "US", "population",
  "2016-03-01",      2,     "AU", "population",
  "2016-04-01",      3,     "US", "population",
  "2016-05-01",      4,     "US", "population",
  "2016-06-01",      5,     "US", "population"
)

df %>%
  group_by(country) %>%
  spread(indicator, value) %>%
  mutate(`gdp per capita` = gdp / population) %>%
  gather(indicator, value, -c(date, country)) %>%
  drop_na(value)

# # A tibble: 14 x 4
# # Groups:   country [2]
#          date country      indicator    value
#         <chr>   <chr>          <chr>    <dbl>
#  1 2016-01-01      US            gdp 1.000000
#  2 2016-02-01      US            gdp 2.000000
#  3 2016-03-01      AU            gdp 3.000000
#  4 2016-04-01      US            gdp 4.000000
#  5 2016-05-01      US            gdp 5.000000
#  6 2016-02-01      US     population 1.000000
#  7 2016-03-01      AU     population 2.000000
#  8 2016-04-01      US     population 3.000000
#  9 2016-05-01      US     population 4.000000
# 10 2016-06-01      US     population 5.000000
# 11 2016-02-01      US gdp per capita 2.000000
# 12 2016-03-01      AU gdp per capita 1.500000
# 13 2016-04-01      US gdp per capita 1.333333
# 14 2016-05-01      US gdp per capita 1.250000

N.B. I modified the data and added a group_by statement to demonstrate the solution with multiple values of country.
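
In tidyr 1.0.0 and later, spread() and gather() are superseded by pivot_wider() and pivot_longer(). The following is a sketch of the same pipeline with the newer functions, assuming tidyr >= 1.0.0 is installed (the name result is illustrative and not part of the original answer). The row order differs from the output above because pivot_longer() emits rows date by date rather than indicator by indicator.

```r
library(dplyr)
library(tidyr)

# Same modified data as in the answer above (one AU date among US dates).
df <- data.frame(
  date = c("2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
           "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
  value = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  country = c("US", "US", "AU", "US", "US", "US", "AU", "US", "US", "US"),
  indicator = rep(c("gdp", "population"), each = 5),
  stringsAsFactors = FALSE
)

result <- df %>%
  # One row per date/country, one column per indicator.
  pivot_wider(names_from = indicator, values_from = value) %>%
  # The ratio is NA wherever gdp or population is missing for that date.
  mutate(`gdp per capita` = gdp / population) %>%
  # Back to long format, dropping the NA rows created by the widening step.
  pivot_longer(c(gdp, population, `gdp per capita`),
               names_to = "indicator", values_to = "value",
               values_drop_na = TRUE)
```

Because date and country are the only columns left after names_from and values_from are taken out, pivot_wider() treats them as row identifiers, so this sketch does not need group_by.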

Votes: 1

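Stack Overflow user

Posted on 2017-12-06 09:36:56

Personally, I find the reshape package easier to use. It automatically handles multiple countries, as well as many kinds of labels and data types.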

library(reshape)
mydf <- data.frame(
date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", 
       "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01","2016-05-01"),
value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2, 2),
country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", 'AU'),
indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population",
            "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita", 'gdp'))

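To get the new indicator, first cast the data to wide format so that the relevant columns sit side by side. That lets you do a simple column-wise operation.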

df_wide = cast(mydf, date+country~indicator, sum)

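You want country and date to be the columns that uniquely define a row (the left-hand side of the formula) and the different indicators to become columns (the right-hand side of the formula).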

        date country gdp gdp per capita population
1 2016-01-01      US   1           0.00          0
2 2016-02-01      US   2           2.00          1
3 2016-03-01      US   3           1.50          2
4 2016-04-01      US   4           1.33          3
5 2016-05-01      AU   2           0.00          0
6 2016-05-01      US   5           1.20          4
7 2016-06-01      US   0           0.00          5

Now create a new column and set it to whatever you want:

df_wide['g_p_ratio'] = df_wide['gdp'] / df_wide['population'] 
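
One caveat is visible in the wide table above: because sum is used as the aggregation function, date/country combinations with no data show up as 0, so the division produces Inf where population is 0 and a spurious 0 where gdp is 0. Below is a minimal base R sketch that sidesteps this by keeping only the dates present for both indicators. It assumes the question's original mydf, and the names gdp, pop, both, ratio, and result are illustrative rather than part of this answer.

```r
# The question's original long data frame (US only, no ratio rows yet).
mydf <- data.frame(
  date = c("2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
           "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
  value = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  country = "US",
  indicator = rep(c("gdp", "population"), each = 5),
  stringsAsFactors = FALSE
)

# Split the two indicators into separate tables.
gdp <- mydf[mydf$indicator == "gdp", c("date", "country", "value")]
pop <- mydf[mydf$indicator == "population", c("date", "country", "value")]

# Inner join: only date/country pairs present in both tables survive,
# so no missing value is ever replaced by 0.
both <- merge(gdp, pop, by = c("date", "country"), suffixes = c("_gdp", "_pop"))

# Build the new indicator rows in the same column layout as mydf.
ratio <- data.frame(
  date = both$date,
  value = both$value_gdp / both$value_pop,
  country = both$country,
  indicator = "gdp per capita",
  stringsAsFactors = FALSE
)

# Append the new rows to the original long data.
result <- rbind(mydf, ratio)
```

Since the join key includes country, the same sketch should extend to several countries without changes.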

Then use melt to return it to long format.

df_new = melt(df_wide, id=c('date'))

Voilà!

                       date country value      indicator
gdp              2016-01-01      US  1.00            gdp
gdp.1            2016-02-01      US  2.00            gdp
gdp.2            2016-03-01      US  3.00            gdp
gdp.3            2016-04-01      US  4.00            gdp
gdp.4            2016-05-01      AU  2.00            gdp
gdp.5            2016-05-01      US  5.00            gdp
gdp.6            2016-06-01      US  0.00            gdp
gdp.per.capita   2016-01-01      US  0.00 gdp per capita
gdp.per.capita.1 2016-02-01      US  2.00 gdp per capita
gdp.per.capita.2 2016-03-01      US  1.50 gdp per capita
gdp.per.capita.3 2016-04-01      US  1.33 gdp per capita
gdp.per.capita.4 2016-05-01      AU  0.00 gdp per capita
gdp.per.capita.5 2016-05-01      US  1.20 gdp per capita
gdp.per.capita.6 2016-06-01      US  0.00 gdp per capita
population       2016-01-01      US  0.00     population
population.1     2016-02-01      US  1.00     population
population.2     2016-03-01      US  2.00     population
population.3     2016-04-01      US  3.00     population
population.4     2016-05-01      AU  0.00     population
population.5     2016-05-01      US  4.00     population
population.6     2016-06-01      US  5.00     population

You may or may not want the new row labels, but you can reset them:

rownames(df_new) <- 1:nrow(df_new)

Votes: 2

Page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/47667022