Question: How to apply a window case-when function on pandas groups

Stack Overflow user

Asked on 2020-04-10 18:51:25

3 answers, 153 views, 0 followers, 1 vote

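I have a table with two identifier columns, w and x. Column y gives the order of column z, which is the target variable. Each sequence identified by w and x has at least two rows, and all values of z within it are distinct. I want to count how often each value of z occurs at which position in a sequence. As for position, I only care whether it is the first, the last, or any other. My approach in R with dplyr is as follows (for those unfamiliar with %>%: it is the pipe operator, which takes the output of the left-hand side and passes it as the first argument to the function on the right-hand side; you can read it as "and then"):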

library(tidyverse)
library(reticulate)

data <- tribble(
  ~w,   ~x,  ~y,   ~z,
   1,    1,    1,  "a",
   1,    1,    2,  "b",
   1,    2,    1,  "b",
   1,    2,    2,  "a",
   1,    2,    3,  "c",
   1,    2,    4,  "d",
   2,    1,    1,  "b",
   2,    1,    2,  "a",
   2,    1,    3,  "c",
   2,    1,    4,  "d"
)

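First, I add a sequence index within each ordered group of w and x, and derive from it a marker (position_in_sequence) that classifies the position of z within the sequence.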

data_with_markers <- 
  data %>%
  group_by(w, x) %>%
  arrange(y, .by_group = TRUE) %>%
  mutate(
    sequence_id = row_number(),
    position_in_sequence = case_when(
      sequence_id == first(sequence_id) ~ "first",
      sequence_id == last(sequence_id) ~ "last",
      TRUE ~ "other" # this is the "else"
    )
  ) %>%
  ungroup()


>data_with_markers
# A tibble: 10 x 6
       w     x     y z     sequence_id position_in_sequence 
   <dbl> <dbl> <dbl> <chr>       <int> <chr>
 1     1     1     1 a               1 first
 2     1     1     2 b               2 last 
 3     1     2     1 b               1 first
 4     1     2     2 a               2 other
 5     1     2     3 c               3 other
 6     1     2     4 d               4 last 
 7     2     1     1 b               1 first
 8     2     1     2 a               2 other
 9     2     1     3 c               3 other
10     2     1     4 d               4 last

Then I do a simple count by position_in_sequence and z.

data_summary <- data_with_markers %>% 
  group_by(position_in_sequence, z) %>% 
  count() %>% 
  ungroup()

> data_summary
# A tibble: 6 x 3
  position_in_sequence z         n
  <chr>                <chr> <int>
1 first                a         1
2 first                b         2
3 last                 b         1
4 last                 d         2
5 other                a         2
6 other                c         2

With pandas, I only got as far as the sequence_id column and could not derive position_in_sequence (I am using the reticulate package here):

import pandas as pd
data = r.data
data

     w    x    y  z
0  1.0  1.0  1.0  a
1  1.0  1.0  2.0  b
2  1.0  2.0  1.0  b
3  1.0  2.0  2.0  a
4  1.0  2.0  3.0  c
5  1.0  2.0  4.0  d
6  2.0  1.0  1.0  b
7  2.0  1.0  2.0  a
8  2.0  1.0  3.0  c
9  2.0  1.0  4.0  d

data_sorted = data.sort_values(['w', 'x', 'y'])

data_sorted['sequence_id'] = data_sorted.groupby(['w', 'x']).cumcount() + 1
data_sorted

     w    x    y  z  sequence_id
0  1.0  1.0  1.0  a            1
1  1.0  1.0  2.0  b            2
2  1.0  2.0  1.0  b            1
3  1.0  2.0  2.0  a            2
4  1.0  2.0  3.0  c            3
5  1.0  2.0  4.0  d            4
6  2.0  1.0  1.0  b            1
7  2.0  1.0  2.0  a            2
8  2.0  1.0  3.0  c            3
9  2.0  1.0  4.0  d            4

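I played around with .apply, but I need to access a row's sequence_id value and, at the same time, the minimum and maximum of that column for comparison, and I could not get it to work.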

python

pandas

dplyr

3 Answers

Stack Overflow user

Answered on 2021-06-09 04:34:55

You can do this in Python exactly as you would in R by using datar:

>>> from datar.all import (
...     f, tribble, group_by, arrange, row_number, 
...     case_when, first, last, ungroup, mutate, count
... )
>>> 
>>> data = tribble(
...    f.w,  f.x,  f.y, f.z,
...    1,    1,    1,   "a",
...    1,    1,    2,   "b",
...    1,    2,    1,   "b",
...    1,    2,    2,   "a",
...    1,    2,    3,   "c",
...    1,    2,    4,   "d",
...    2,    1,    1,   "b",
...    2,    1,    2,   "a",
...    2,    1,    3,   "c",
...    2,    1,    4,   "d"
... )
>>> 
>>> data_with_markers = (
...     data >>
...     group_by(f.w, f.x) >>
...     arrange(f.y, _by_group=True) >>
...     mutate(
...         sequence_id=row_number(),
...         position_in_sequence=case_when(
...             f.sequence_id == first(f.sequence_id), "first",
...             f.sequence_id == last(f.sequence_id),  "last",
...             True, "other"
...         )
...     ) >>
...     ungroup()
... )
>>> 
>>> data_with_markers
        w       x       y        z  sequence_id position_in_sequence
  <int64> <int64> <int64> <object>      <int64>             <object>
0       1       1       1        a            1                first
1       1       1       2        b            2                 last
2       1       2       1        b            1                first
3       1       2       2        a            2                other
4       1       2       3        c            3                other
5       1       2       4        d            4                 last
6       2       1       1        b            1                first
7       2       1       2        a            2                other
8       2       1       3        c            3                other
9       2       1       4        d            4                 last
>>> 
>>> data_summary = (
...   data_with_markers >>
...   group_by(f.position_in_sequence, f.z) >>
...   count() >>
...   ungroup()
... )
>>>   
>>> data_summary
  position_in_sequence        z       n
              <object> <object> <int64>
0                first        a       1
1                first        b       2
2                 last        b       1
3                 last        d       2
4                other        a       2
5                other        c       2

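I am the author of this package. Feel free to submit issues if you have any questions.

Votes: 1

Stack Overflow user

Answered on 2020-04-10 19:48:11

I am not sure whether this will help, but I see you are using a variable called "type". This code might be giving you errors because "type" is a keyword and may not be the best choice for a variable or column name. Try "type1" or "typ"? Assigning to keywords as variables often causes no problems at first but later leaves you with baffling errors (I am just brainstorming here; I have not used reticulate).

Votes: 0

Stack Overflow user

Answered on 2021-10-01 00:07:30

Let's get the grouper as a sequence of arrays so that it can be reused: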

grouper = [df[col] for col in ['w', 'x']]
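
As a small sketch of why the grouper is built from Series rather than from column names: a list of Series can group any object that shares the same index, including a standalone Series that no longer carries the w and x columns. The cut-down frame and the name last_pos below are illustrative assumptions, not part of the answer.

```python
import pandas as pd

# Illustrative subset of the question's data (assumption: already ordered by y).
df = pd.DataFrame({
    "w": [1, 1, 1, 1, 1, 1],
    "x": [1, 1, 2, 2, 2, 2],
    "y": [1, 2, 1, 2, 3, 4],
})

# Group keys as Series, aligned with df by index.
grouper = [df[col] for col in ["w", "x"]]

# Zero-based position of each row within its (w, x) group.
pos = df.groupby(grouper).cumcount()

# pos is a plain Series without w and x, so grouping it by the names
# ['w', 'x'] would not work; the aligned Series in grouper still do.
last_pos = pos.groupby(grouper).transform("max")

print(pos.tolist())
print(last_pos.tolist())
```

A row is the last of its group exactly where pos equals last_pos, which is the comparison the chained code uses.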

From here we can use method chaining, with some help from pyjanitor's case_when (currently in development):

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

(df.assign(pos = df.groupby(grouper, sort = True).cumcount(), 
           pos_in_seq = lambda df: df.pos.groupby(grouper).transform('max'))
  .case_when(lambda df: df.pos.eq(0), 'first', # 1st condition, result
             lambda df: df.pos.eq(df.pos_in_seq), 'last', # 2nd cond, result
             'other', # default, similar to else
             column_name = 'pos_in_seq')
   #.groupby(['pos_in_seq', 'z'], as_index = False).size()  to get a DataFrame
  .groupby(['pos_in_seq', 'z']).size()

)

pos_in_seq  z
first       a    1
            b    2
last        b    1
            d    2
other       a    2
            c    2
dtype: int64

You can avoid pyjanitor and stick entirely to pandas' own tools (pyjanitor is just a wrapper around pandas functions):

grouper = [df[col] for col in ['w', 'x']]
temp = df.assign(pos = df.groupby(grouper, sort = True).cumcount(), 
                 pos_in_seq = lambda df: df.pos.groupby(grouper).transform('max'))

temp = temp.assign(pos_seq = temp.pos.map({0:'first'}))

temp['pos_seq'] = temp.pos_seq.mask(temp.pos.eq(temp.pos_in_seq), 'last')

temp['pos_seq'] = temp['pos_seq'].fillna('other')

temp.groupby(['pos_seq', 'z'], as_index = False).size()

  pos_seq  z  size
0   first  a     1
1   first  b     2
2    last  b     1
3    last  d     2
4   other  a     2
5   other  c     2
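
Another way to write the same case-when step, offered as a sketch and not as part of the answer above, is numpy.select, which takes a list of conditions and a list of results and picks the first match per row. The frame below rebuilds the question's data with integer columns, and the column names position_in_sequence and n follow the R output; both choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# The question's data, rebuilt by hand.
df = pd.DataFrame({
    "w": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "x": [1, 1, 2, 2, 2, 2, 1, 1, 1, 1],
    "y": [1, 2, 1, 2, 3, 4, 1, 2, 3, 4],
    "z": list("abbacdbacd"),
})

# Order rows by y inside each (w, x) sequence before numbering them.
df = df.sort_values(["w", "x", "y"])
groups = df.groupby(["w", "x"])

pos = groups.cumcount()                  # 0-based position within the group
size = groups["y"].transform("size")     # group length, repeated on every row

# Conditions are checked in order, like case_when; default is the "else".
df["position_in_sequence"] = np.select(
    [pos.eq(0), pos.eq(size - 1)],
    ["first", "last"],
    default="other",
)

data_summary = (
    df.groupby(["position_in_sequence", "z"])
      .size()
      .reset_index(name="n")
)
print(data_summary)
```

Because the conditions are evaluated in order, a one-row group would be labelled "first"; the question states that every sequence has at least two rows, so that case does not arise here.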

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/61138760
