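Say I have a relation Students with fields grade and teacher. I'd like to group by both grade and teacher together, but keep, in every group, a count of all students in that grade. Something like: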
classes = GROUP Students BY (grade,teacher);
classes = FOREACH classes {
GENERATE
(### COUNT OF ALL STUDENTS IN GRADE ###) as grade_size,
Students as students,
teacher as teacher;
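}
But I can't figure out how to filter from within a group statement. I need some sort of filter, but I don't know how to refer to the grades of students outside the group versus inside it.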
Posted on 2014-09-07 14:43:28
There are two ways to do it:
1) Group by grade and teacher, then count, then flatten, then group by grade.
classes = GROUP Students BY (grade,teacher);
teachers = FOREACH classes GENERATE FLATTEN(group) as (grade,teacher), COUNT(Students) as perTeacher;
grade = GROUP teachers BY grade;
result = FOREACH grade GENERATE FLATTEN(teachers), SUM(teachers.perTeacher) as perGrade;
describe result;
dump result;
2) Group by grade, then use the BagGroup UDF from the DataFu library to group by teacher in memory. This is faster, but it is prone to heap memory exceptions.
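The second approach is only described in words, so here is a rough, untested sketch of how it might look. It assumes the DataFu jar is available under the hypothetical path 'datafu.jar' and that BagGroup takes the bag plus a projection of that bag onto the group key, as described in the DataFu documentation. The aliases by_grade, with_size and by_teacher are made up for illustration.

```pig
-- Assumption: the jar path is a placeholder, adjust it to your installation.
REGISTER 'datafu.jar';
DEFINE BagGroup datafu.pig.bags.BagGroup();

-- One group per grade, so COUNT gives the size of the whole grade.
by_grade = GROUP Students BY grade;

-- Within each grade, regroup the bag by teacher in memory.
with_size = FOREACH by_grade GENERATE
    group AS grade,
    COUNT(Students) AS grade_size,
    BagGroup(Students, Students.(teacher)) AS by_teacher;

-- One row per (grade, teacher), each carrying the grade-wide count.
classes = FOREACH with_size GENERATE grade, grade_size, FLATTEN(by_teacher);

-- Check the resulting field names before relying on them.
DESCRIBE classes;
```

Because each grade's bag is regrouped inside a single FOREACH, all students of one grade must fit in memory at once, which is where the heap memory risk comes from.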
https://stackoverflow.com/questions/25701716