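I'm trying to come up with a regular expression so that I can import a weblog into Hive. I'm no regex expert, and I'm running into problems handling the two different transaction types in the log, which carry different metadata.
The log contains Request and Reply transactions. The regex I have works for Request transactions but fails on Reply transactions, because a Reply includes a "heartbeat" in its data. The log can contain twenty thousand or more entries per day.
My regex so far: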
(^[0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*)(?:[:]) ([0-9]+.[0-9]+(?:\ms)) ([^ ]*)(?:[:]) (.*)\n
Expected result:
Weblog sample:
Request transaction:
Sample data
0 20140625095038864 20140625095038864 0000 006 00/0000/000 LSEMPPRD Nestra 000 000000 00 Request: fafe805e-614e-41dd-9b0f-8ec5fbc8c882
Structure:
Version = 0
EventTime = 20140625095038864
LogTime = 20140625095038864
SequenceNumber = 0000
Level = 006
InternalArea = 00/0000/000
MachineName = LSEMPPRD
ApplicationName = Nestra
DiscardedMessageCount = 000 000000 00
TranType = Request:
SessionKey = fafe805e-614e-41dd-9b0f-8ec5fbc8c882
Message = <GetAccountInfo xmlns="http://tempuri.org/"><accountSession xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><b:SessionKey>8c737876-fccb-4352-8d06-9f19ccb34fd6</b:SessionKey><b:SessionToken i:nil="true"></b:SessionToken></accountSession></GetAccountInfo>
Reply transaction:
Sample data
0 20140625095038615 20140625095038646 0000 006 00/0000/000 LSEMPPRD Nestra 000 000000 00 Reply: 296.4057ms d594891a-6d9c-4b5c-913a-9dc302277534:
Structure:
Version = 0
EventTime = 20140625095038615
LogTime = 20140625095038646
SequenceNumber = 0000
Level = 006
InternalArea = 00/0000/000
MachineName = LSEMPPRD
ApplicationName = Nestra
DiscardedMessageCount = 000 000000 00
TranType = Reply:
Heartbeat = 296.4057ms
SessionKey = d594891a-6d9c-4b5c-913a-9dc302277534:
Message = <GetAccountInfo xmlns="http://tempuri.org/"><accountSession xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><b:SessionKey>8c737876-fccb-4352-8d06-9f19ccb34fd6</b:SessionKey><b:SessionToken i:nil="true"></b:SessionToken></accountSession></GetAccountInfo>
Posted on 2014-06-27 07:01:53
Found the answer here:
Note: Be careful with the spaces used. Test in Rubular before use.
Regex:
(^[0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*)(?:[:]) ([0-9]+.[0-9]+|)(?:ms |)([^ ]*)(?:[:])
Data: 2 lines
0 20140625095038864 20140625095038864 0000 006 00/0000/000 LSEMPPRD Nestra 000 000000 00 Request: fafe805e-614e-41dd-9b0f-8ec5fbc8c882:
0 20140625095038615 20140625095038646 0000 006 00/0000/000 LSEMPPRD Nestra 000 000000 00 Reply: 296.4057ms d594891a-6d9c-4b5c-913a-9dc302277534:
Regex (complete, also handles the message part): just append (.*$), preceded by a space, to the end:
(^[0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*)(?:[:]) ([0-9]+.[0-9]+|)(?:ms |)([^ ]*)(?:[:]) (.*$)
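For a quick check outside a regex tester, here is a minimal Python sketch that runs the complete regex above against both sample lines. It assumes Python's `re` module treats this pattern the same way as the Java regex engine Hive uses, which seems to hold for the constructs involved but is worth confirming in Hive itself. `MESSAGE` is an abbreviated stand-in for the XML payload, and `parse` is just a helper name made up for the example.

```python
import re

# The complete regex from the answer, copied unchanged (split only for line length).
PATTERN = re.compile(
    r"(^[0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) "
    r"([0-9]*) ([0-9]*) ([0-9]*) ([^ ]*)(?:[:]) ([0-9]+.[0-9]+|)(?:ms |)"
    r"([^ ]*)(?:[:]) (.*$)"
)

# Abbreviated stand-in for the XML message (assumption: the real payload is longer).
MESSAGE = '<GetAccountInfo xmlns="http://tempuri.org/"></GetAccountInfo>'

REQUEST = (
    "0 20140625095038864 20140625095038864 0000 006 00/0000/000 LSEMPPRD Nestra "
    "000 000000 00 Request: fafe805e-614e-41dd-9b0f-8ec5fbc8c882: " + MESSAGE
)
REPLY = (
    "0 20140625095038615 20140625095038646 0000 006 00/0000/000 LSEMPPRD Nestra "
    "000 000000 00 Reply: 296.4057ms d594891a-6d9c-4b5c-913a-9dc302277534: " + MESSAGE
)


def parse(line):
    """Return the 15 captured groups as a tuple, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None


req = parse(REQUEST)
rep = parse(REPLY)

# Groups 12-14 are TranType, Heartbeat and SessionKey; group 15 is the message.
print(req[11:14])  # ('Request', '', 'fafe805e-614e-41dd-9b0f-8ec5fbc8c882')
print(rep[11:14])  # ('Reply', '296.4057', 'd594891a-6d9c-4b5c-913a-9dc302277534')
```

For a Request line the heartbeat group comes back as an empty string rather than being missing, so both transaction types yield the same number of columns.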

Validated with: Rubular
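One case the two sample lines do not cover and that may be worth testing: a Request whose session key starts with a digit. Because the `.` in the heartbeat group is unescaped and the group is optional, the first characters of such a key can be captured as the heartbeat. The Python sketch below shows this on the transaction-type, heartbeat and session-key portion of the regex only. `STRICT_TAIL` is a hypothetical tightened variant, not part of the original answer, and the test line reuses the session key that appears inside the sample message.

```python
import re

# Transaction-type / heartbeat / session-key portion of the answer's regex, unchanged.
ORIGINAL_TAIL = re.compile(r"([^ ]*)(?:[:]) ([0-9]+.[0-9]+|)(?:ms |)([^ ]*)(?:[:])")

# Hypothetical variant: escape the dot and make "<number>ms " optional as one unit.
STRICT_TAIL = re.compile(r"([^ ]*): (?:([0-9]+\.[0-9]+)ms )?([^ ]*):")

# A Request whose session key happens to start with a digit.
digit_key = "Request: 8c737876-fccb-4352-8d06-9f19ccb34fd6:"
reply = "Reply: 296.4057ms d594891a-6d9c-4b5c-913a-9dc302277534:"

loose = ORIGINAL_TAIL.match(digit_key).groups()
strict = STRICT_TAIL.match(digit_key).groups()
strict_reply = STRICT_TAIL.match(reply).groups()

print(loose)   # ('Request', '8c737876', '-fccb-4352-8d06-9f19ccb34fd6')
print(strict)  # ('Request', None, '8c737876-fccb-4352-8d06-9f19ccb34fd6')
print(strict_reply)  # ('Reply', '296.4057', 'd594891a-6d9c-4b5c-913a-9dc302277534')
```

With the variant, a missing heartbeat shows up as `None` in Python instead of an empty string. How Hive surfaces an unmatched optional group is something I have not verified, so check the resulting column values before relying on it.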


https://stackoverflow.com/questions/24443488