首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Sparksql -在带有Sparksql的时间窗口中搜索事件

Sparksql -在带有Sparksql的时间窗口中搜索事件
EN

Stack Overflow用户
提问于 2019-01-03 17:05:22
回答 1查看 104关注 0票数 0

我有CSV文件,事件如下表所示。

代码语言:javascript
复制
+-------------------+-------+
|Created            |Name   |
++------------------+-------+
|2018-09-30 21:00:08|EVENT A|
|2018-09-30 21:03:11|Event C|
|2018-09-30 21:04:17|Event 3|
|2018-09-30 21:05:27|Event Y| <<<
|2018-09-30 21:06:11|Event 5|
|2018-09-30 21:07:17|Event P|
|2018-09-30 21:08:25|Event X| <<<
|2018-09-30 21:09:26|Event B|
|2018-09-30 21:10:39|Event O|
-----------------------------

我需要按时间戳对事件进行分区,在Windows中持续5分钟,并在此窗口内搜索事件x的发生,如果发生此事件,我需要在同一窗口中搜索Y事件,在此之前,事件x被找到,直到窗口开始。

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2019-01-03 21:19:30

下面是一种方法,它首先创建5分钟的时间窗,收集每个时间窗口分区的事件列表,然后应用一个udf来标记想要的事件:

代码语言:javascript
复制
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import java.sql.Timestamp

val df = Seq(
  (Timestamp.valueOf("2018-09-30 21:00:08"), "Event A"),
  (Timestamp.valueOf("2018-09-30 21:03:11"), "Event C"),
  (Timestamp.valueOf("2018-09-30 21:04:17"), "Event 3"),
  (Timestamp.valueOf("2018-09-30 21:05:27"), "Event Y"),
  (Timestamp.valueOf("2018-09-30 21:06:11"), "Event 5"),
  (Timestamp.valueOf("2018-09-30 21:07:17"), "Event P"),
  (Timestamp.valueOf("2018-09-30 21:08:25"), "Event X"),
  (Timestamp.valueOf("2018-09-30 21:09:26"), "Event B"),
  (Timestamp.valueOf("2018-09-30 21:10:39"), "Event O")
).toDF("Created", "Name")

val winSpec = Window.partitionBy($"Win5m")

def checkEvents(e1: String, e2: String) = udf(
  (currEvent: String, events: Seq[String]) =>
    events.contains(e1) && events.contains(e2) &&
      events.indexOf(e1) < events.indexOf(e2) &&
      (currEvent == e1 || currEvent == e2)
)

df.
  withColumn("Win5m", window($"Created", "5 minutes")).
  withColumn("Events", collect_list($"Name").over(winSpec)).
  withColumn("marked", checkEvents("Event Y", "Event X")($"Name", $"Events")).
  select($"Created", $"Name").
  where($"marked").
  show(false)
// +-------------------+-------+
// |Created            |Name   |
// +-------------------+-------+
// |2018-09-30 21:05:27|Event Y|
// |2018-09-30 21:08:25|Event X|
// +-------------------+-------+

下面是包含从上述最终结果中排除的中间列的数据集:

代码语言:javascript
复制
// +-------------------+-------+---------------------------------------------+---------------------------------------------+------+
// |Created            |Name   |Win5m                                        |Events                                       |marked|
// +-------------------+-------+---------------------------------------------+---------------------------------------------+------+
// |2018-09-30 21:00:08|Event A|[2018-09-30 21:00:00.0,2018-09-30 21:05:00.0]|[Event A, Event C, Event 3]                  |false |
// |2018-09-30 21:03:11|Event C|[2018-09-30 21:00:00.0,2018-09-30 21:05:00.0]|[Event A, Event C, Event 3]                  |false |
// |2018-09-30 21:04:17|Event 3|[2018-09-30 21:00:00.0,2018-09-30 21:05:00.0]|[Event A, Event C, Event 3]                  |false |
// |2018-09-30 21:05:27|Event Y|[2018-09-30 21:05:00.0,2018-09-30 21:10:00.0]|[Event Y, Event 5, Event P, Event X, Event B]|true  |
// |2018-09-30 21:06:11|Event 5|[2018-09-30 21:05:00.0,2018-09-30 21:10:00.0]|[Event Y, Event 5, Event P, Event X, Event B]|false |
// |2018-09-30 21:07:17|Event P|[2018-09-30 21:05:00.0,2018-09-30 21:10:00.0]|[Event Y, Event 5, Event P, Event X, Event B]|false |
// |2018-09-30 21:08:25|Event X|[2018-09-30 21:05:00.0,2018-09-30 21:10:00.0]|[Event Y, Event 5, Event P, Event X, Event B]|true  |
// |2018-09-30 21:09:26|Event B|[2018-09-30 21:05:00.0,2018-09-30 21:10:00.0]|[Event Y, Event 5, Event P, Event X, Event B]|false |
// |2018-09-30 21:10:39|Event O|[2018-09-30 21:10:00.0,2018-09-30 21:15:00.0]|[Event O]                                    |false |
// +-------------------+-------+---------------------------------------------+---------------------------------------------+------+
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/54026716

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档