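I have a table with two identifier columns, w and x. Column y defines an order for column z, which is the target variable. Each sequence identified by w and x has at least two rows, and all of its values in z are distinct. I want to count how often each z value occurs in the sequences, and at which position. As for position, I only care whether it is the first, the last, or any other. My approach in R with dplyr is as follows (for those unfamiliar with %>%: it is the pipe operator, which takes the output of its left-hand side and passes it as the first argument to the function on its right-hand side; you can read it as "and then"):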
library(tidyverse)
library(reticulate)
data <- tribble(
  ~w, ~x, ~y, ~z,
  1, 1, 1, "a",
  1, 1, 2, "b",
  1, 2, 1, "b",
  1, 2, 2, "a",
  1, 2, 3, "c",
  1, 2, 4, "d",
  2, 1, 1, "b",
  2, 1, 2, "a",
  2, 1, 3, "c",
  2, 1, 4, "d"
)
First, I add a sequence index to every ordered group of w and x, plus a marker (position_in_sequence) that uses this index to classify the position of z within the sequence.
data_with_markers <-
  data %>%
  group_by(w, x) %>%
  arrange(y, .by_group = TRUE) %>%
  mutate(
    sequence_id = row_number(),
    position_in_sequence = case_when(
      sequence_id == first(sequence_id) ~ "first",
      sequence_id == last(sequence_id) ~ "last",
      TRUE ~ "other" # this is the "else"
    )
  ) %>%
  ungroup()
> data_with_markers
# A tibble: 10 x 6
w x y z sequence_id position_in_sequence
<dbl> <dbl> <dbl> <chr> <int> <chr>
1 1 1 1 a 1 first
2 1 1 2 b 2 last
3 1 2 1 b 1 first
4 1 2 2 a 2 other
5 1 2 3 c 3 other
6 1 2 4 d 4 last
7 2 1 1 b 1 first
8 2 1 2 a 2 other
9 2 1 3 c 3 other
10 2 1 4 d 4 last
Then I do a simple count by position_in_sequence and z.
data_summary <- data_with_markers %>%
  group_by(position_in_sequence, z) %>%
  count() %>%
  ungroup()
> data_summary
# A tibble: 6 x 3
position_in_sequence z n
<chr> <chr> <int>
1 first a 1
2 first b 2
3 last b 1
4 last d 2
5 other a 2
6 other c 2
With pandas, I only got as far as the sequence_id variable (I am using the reticulate package here):
import pandas as pd
data = r.data
data
w x y z
0 1.0 1.0 1.0 a
1 1.0 1.0 2.0 b
2 1.0 2.0 1.0 b
3 1.0 2.0 2.0 a
4 1.0 2.0 3.0 c
5 1.0 2.0 4.0 d
6 2.0 1.0 1.0 b
7 2.0 1.0 2.0 a
8 2.0 1.0 3.0 c
9 2.0 1.0 4.0 d
data_sorted = data.sort_values(['w', 'x', 'y'])
data_sorted['sequence_id'] = data_sorted.groupby(['w', 'x']).cumcount() + 1
data_sorted
w x y z sequence_id
0 1.0 1.0 1.0 a 1
1 1.0 1.0 2.0 b 2
2 1.0 2.0 1.0 b 1
3 1.0 2.0 2.0 a 2
4 1.0 2.0 3.0 c 3
5 1.0 2.0 4.0 d 4
6 2.0 1.0 1.0 b 1
7 2.0 1.0 2.0 a 2
8 2.0 1.0 3.0 c 3
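9 2.0 1.0 4.0 d 4
I fiddled around with .apply, but I need to access the sequence_id of a given row and, at the same time, the minimum and maximum of that column for comparison, and I could not get that to work.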
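For readers following along in pandas, the comparison described above (a row's own sequence_id against a value computed over its whole group) can be sketched with groupby(...).transform, which broadcasts a group-level result back onto every row. This is only a minimal sketch and not part of the original question: the helper name seq_max and the use of numpy.select as a stand-in for case_when are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same example data as in the question, built directly in pandas
data = pd.DataFrame({
    "w": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "x": [1, 1, 2, 2, 2, 2, 1, 1, 1, 1],
    "y": [1, 2, 1, 2, 3, 4, 1, 2, 3, 4],
    "z": ["a", "b", "b", "a", "c", "d", "b", "a", "c", "d"],
})

# Order rows within each (w, x) group by y, then number them starting at 1
data_sorted = data.sort_values(["w", "x", "y"]).reset_index(drop=True)
data_sorted["sequence_id"] = data_sorted.groupby(["w", "x"]).cumcount() + 1

# transform("max") returns one value per row: the largest sequence_id
# in that row's group (seq_max is a hypothetical helper name)
seq_max = data_sorted.groupby(["w", "x"])["sequence_id"].transform("max")

# np.select takes the first condition that matches, much like case_when
data_sorted["position_in_sequence"] = np.select(
    [data_sorted["sequence_id"] == 1, data_sorted["sequence_id"] == seq_max],
    ["first", "last"],
    default="other",
)

# Count each z value per position
data_summary = (
    data_sorted.groupby(["position_in_sequence", "z"])
    .size()
    .reset_index(name="n")
)
```

Because every sequence has at least two rows, the "first" and "last" conditions never apply to the same row, so the order of the conditions does not change the result here.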
Answered 2021-06-09 04:34:55
You can do this in Python exactly as you would in R by using datar:
>>> from datar.all import (
... f, tribble, group_by, arrange, row_number,
... case_when, first, last, ungroup, mutate, count
... )
>>>
>>> data = tribble(
... f.w, f.x, f.y, f.z,
... 1, 1, 1, "a",
... 1, 1, 2, "b",
... 1, 2, 1, "b",
... 1, 2, 2, "a",
... 1, 2, 3, "c",
... 1, 2, 4, "d",
... 2, 1, 1, "b",
... 2, 1, 2, "a",
... 2, 1, 3, "c",
... 2, 1, 4, "d"
... )
>>>
>>> data_with_markers = (
...     data >>
...     group_by(f.w, f.x) >>
...     arrange(f.y, _by_group=True) >>
...     mutate(
...         sequence_id=row_number(),
...         position_in_sequence=case_when(
...             f.sequence_id == first(f.sequence_id), "first",
...             f.sequence_id == last(f.sequence_id), "last",
...             True, "other"
...         )
...     ) >>
...     ungroup()
... )
>>>
>>> data_with_markers
w x y z sequence_id position_in_sequence
<int64> <int64> <int64> <object> <int64> <object>
0 1 1 1 a 1 first
1 1 1 2 b 2 last
2 1 2 1 b 1 first
3 1 2 2 a 2 other
4 1 2 3 c 3 other
5 1 2 4 d 4 last
6 2 1 1 b 1 first
7 2 1 2 a 2 other
8 2 1 3 c 3 other
9 2 1 4 d 4 last
>>>
>>> data_summary = (
...     data_with_markers >>
...     group_by(f.position_in_sequence, f.z) >>
...     count() >>
...     ungroup()
... )
>>>
>>> data_summary
position_in_sequence z n
<object> <object> <int64>
0 first a 1
1 first b 2
2 last b 1
3 last d 2
4 other a 2
5 other c 2
I am the author of this package. Feel free to submit an issue if you have any questions.
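Answered 2020-04-10 19:48:11
I'm not sure whether this helps, but I see you use a variable called "type". That code may give you errors, because "type" is a built-in name in Python and may not be the best choice for a variable or column name. Try "type1" or "typ"? Reusing such a name as a variable usually causes no problems at first, but can leave you with baffling errors later. (I'm just brainstorming here; I haven't used reticulate.)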
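To make the concern concrete, here is a small sketch of what shadowing the built-in looks like in plain Python. The function name shadowing_demo and the value "numeric" are made up for illustration; nothing here comes from the asker's code.

```python
def shadowing_demo():
    # Inside this function, the local name `type` now hides the built-in type()
    type = "numeric"
    try:
        # A later call that expects the built-in fails, because `type` is a str
        type(3)
    except TypeError as exc:
        return str(exc)
    return None


error_message = shadowing_demo()
print(error_message)
```

The assignment itself succeeds silently; the error only appears at the later call, which is why this kind of problem can be hard to trace. Note that it is the Python variable name that does the shadowing: a column label such as "type" is just a string.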
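Answered 2021-10-01 00:07:30
Let's build the grouper as a sequence of arrays so that it can be reused:
grouper = [df[col] for col in ['w', 'x']]
From here, we can use method chaining with some help from pyjanitor's case_when (in development):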
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
(df.assign(pos = df.groupby(grouper, sort = True).cumcount(),
           pos_in_seq = lambda df: df.pos.groupby(grouper).transform('max'))
   .case_when(lambda df: df.pos.eq(0), 'first', # 1st condition, result
              lambda df: df.pos.eq(df.pos_in_seq), 'last', # 2nd cond, result
              'other', # default, similar to else
              column_name = 'pos_in_seq')
   #.groupby(['pos_in_seq', 'z'], as_index = False).size() to get a DataFrame
   .groupby(['pos_in_seq', 'z']).size()
)
pos_in_seq z
first a 1
b 2
last b 1
d 2
other a 2
c 2
dtype: int64
You can avoid pyjanitor and rely entirely on pandas' own tools (pyjanitor is just a wrapper around pandas functions):
grouper = [df[col] for col in ['w', 'x']]
temp = df.assign(pos = df.groupby(grouper, sort = True).cumcount(),
                 pos_in_seq = lambda df: df.pos.groupby(grouper).transform('max'))
temp = temp.assign(pos_seq = temp.pos.map({0:'first'}))
temp['pos_seq'] = temp.pos_seq.mask(temp.pos.eq(temp.pos_in_seq), 'last')
temp['pos_seq'] = temp['pos_seq'].fillna('other')
temp.groupby(['pos_seq', 'z'], as_index = False).size()
pos_seq z size
0 first a 1
1 first b 2
2 last b 1
3 last d 2
4 other a 2
5 other c 2
https://stackoverflow.com/questions/61138760