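I have a dataset with a single column containing every click made on a website. I want to find the patterns that repeat across the whole dataset; it has more than 1 million rows and 17,000 distinct patterns. I also want to know the average time per click for each pattern. I wrote SAS code that groups each pattern and computes the time difference between clicks, but I am not getting the output I want. In addition, if a pattern contains consecutive occurrences of the keyword "one", I want them merged into a single "one".
For example, my code produces the following output: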
Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 4
one 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 7
one 3 6
one 3 0
H 4 8
I 4 9
J 4 0

Expected output:
Clicks Average Time Count
ABCone A-7.5,B-4,C-0,one-2 2
DEF D-12,E-5,F-0 1
HIJ H-8,I-9,J-0 1

Answered 2018-07-02 22:33:33
The following reproduces your expected output.
library(dplyr)

df %>%
group_by(Clicks) %>%
mutate(`Average Time` = paste(sprintf("%s-%2.1f", Clicks, mean(Time.Seconds.)))) %>%
group_by(Group) %>%
mutate(
Clicks = paste(Clicks, collapse = ""),
`Average Time` = paste(`Average Time`, collapse = ",")) %>%
slice(1) %>%
ungroup() %>%
select(-Group, -Time.Seconds.) %>%
count(Clicks, `Average Time`)
## A tibble: 3 x 3
# Clicks `Average Time` n
# <chr> <chr> <int>
#1 ABC A-7.5,B-4.0,C-0.0 2
#2 DEF D-12.0,E-5.0,F-0.0 1
#3 HIJ H-8.0,I-9.0,J-0.0 1

This is a fairly straightforward matter of (re)grouping in different ways and pasting the entries together.
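The regrouping described above, together with the question's request to merge consecutive "one" clicks, can also be sketched in base R. This is only a minimal sketch and not the dplyr pipeline from this answer: the names `clicks`, `collapse_runs`, `patterns`, `pattern_counts` and `mean_time` are invented here for illustration, the data is the question's sample, and it assumes that only adjacent repeats within a group should be merged.

```r
# Sample data from the question (column names shortened for this sketch).
clicks <- data.frame(
  Clicks = c("A", "B", "C", "one", "D", "E", "F",
             "A", "B", "C", "one", "one", "H", "I", "J"),
  Group  = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4),
  Time   = c(6, 2, 4, 0, 12, 5, 0, 9, 6, 7, 6, 0, 8, 9, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one value per run of identical adjacent values,
# so c("C", "one", "one") becomes c("C", "one").
collapse_runs <- function(x) rle(x)$values

# One pattern string per group, with adjacent repeats merged.
patterns <- vapply(
  split(clicks$Clicks, clicks$Group),
  function(x) paste(collapse_runs(x), collapse = ""),
  character(1)
)

# How often each pattern occurs across groups.
pattern_counts <- table(patterns)

# Mean time per click label over all rows, as in the group_by(Clicks) step.
mean_time <- tapply(clicks$Time, clicks$Clicks, mean)

print(pattern_counts)
print(mean_time)
```

Unlike removing duplicates per click and group, `rle()` leaves non-adjacent repeats in place, so a group such as A, one, B, one would keep both "one" entries. Which behaviour is wanted depends on how "consecutive" is meant in the question.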
Sample data
df <- read.table(text =
"Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 0
H 4 8
I 4 9
J 4 0 ", header = T)

Update
For your updated data (note that the average for C in your expected output is wrong):
df %>%
group_by(Clicks) %>% # Do the averaging
mutate(`Average Time` = paste(sprintf("%s-%2.1f", Clicks, mean(Time.Seconds.)))) %>%
group_by(Clicks, Group) %>% # Deal with duplicates per Clicks+Group
slice(1) %>%
group_by(Group) %>% # Paste entries
mutate(
Clicks = paste(Clicks, collapse = ""),
`Average Time` = paste(`Average Time`, collapse = ",")) %>%
slice(1) %>%
ungroup() %>% # Ungroup to prepare for counting
select(-Group, -Time.Seconds.) %>%
count(Clicks, `Average Time`)
## A tibble: 3 x 3
# Clicks `Average Time` n
# <chr> <chr> <int>
#1 ABCone A-7.5,B-4.0,C-5.5,one-2.0 2
#2 DEF D-12.0,E-5.0,F-0.0 1
#3 HIJ H-8.0,I-9.0,J-0.0 1

And the updated data:
df <- read.table(text =
"Clicks Group Time(Seconds)
A 1 6
B 1 2
C 1 4
one 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 7
one 3 6
one 3 0
H 4 8
I 4 9
J 4 0 ", header = T)

Answered 2018-07-02 21:37:05
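You will get more help if you post your data in a form we can copy and paste. I think dplyr will help here.
Edit: someone edited the question to make it easier to understand. I can get you close, but the average time column is not what you want.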
test %>%
group_by(Group) %>%
mutate(Click_Order = paste0(Clicks, collapse = "")) %>%
group_by(Click_Order) %>%
summarise(Average_Time = mean(Time), Count = n()/3) %>% # n()/3 assumes every pattern has exactly 3 clicks
arrange(desc(Count))
# A tibble: 3 x 3
Click_Order Average_Time Count
<chr> <dbl> <dbl>
1 ABC 3.83 2.
2 DEF 5.67 1.
3 HIJ 5.67 1.

Answered 2018-07-03 00:44:49
In SAS, this can be done with a Proc MEANS step and a CLASS statement. Sample code:
data have;
input Clicks $ Group Time;
datalines;
A 1 6
B 1 2
C 1 0
D 2 12
E 2 5
F 2 0
A 3 9
B 3 6
C 3 0
H 4 8
I 4 9
J 4 0
run;
* presume no clicks value contains pipe (|) character;
data have2 / view=have2;
length pattern $30;
pattern = '|'; * prepare for bounded token search via INDEX();
do _n_ = 1 by 1 until (last.group);
set have;
by group;
* use this line if all items in group are known to be distinct ;
* pattern = cats(pattern,clicks);
* track observed clicks by searching the growing pattern of the group;
bounded_token = cats( '|', clicks, '|' );
if index (pattern, trim(bounded_token) ) = 0 then
pattern = cats (pattern, clicks, '|');
end;
if length (pattern) = lengthc(pattern) then do;
put 'WARNING: pattern needs more length';
stop;
end;
* remove token bounders;
pattern = compress(pattern,'|');
do _n_ = 1 to _n_;
set have;
output;
end;
run;
proc means noprint data=have2;
class pattern clicks;
var time;
ways 2;
output out=have_means mean=mean ;
run;
data want (keep=pattern time_summary _freq_);
do until (last.pattern);
set have_means;
by pattern;
length time_summary $100;
time_summary = catx(',',time_summary,catx('-',clicks,mean));
end;
run;

https://stackoverflow.com/questions/51143908