I have server log data in the following format that I need to parse.
Here are the first two lines:
test <- c(
  "5638052581 \"Norway|Oslo County|Oslo|3163036322|503858711|160449504|y|\" n - - [31/Oct/2019:13:00:01 +0000] \"GET /P04_AL?args=app_01&distributor=p4&player=app&playeros=ios&referrer=1&station=1&codec=aac&quality=low&deviceid=1D6A84DA-92A6-4AD1-A2A3-1AB20D2263B2&listenerid=61D1F2EB-7B35-4434-9D8B-A6D074BE28F0&userid=fczUdjf5yEU8j4JlZHG4JXABgiZ2&aw_0_1st.audience=%5B%22P7ActiveListeners%22,%20%22p5hitsactive%22,%20%22P6ActiveListeners%22,%20%22P4ActiveListeners%22,%20%22AppInstalledP4%22%5D HTTP/1.1\" 200 4305805 \"-\" \"AppleCoreMedia//1.0.0.17B84 (iPhone; U; CPU OS 13_2 like Mac OS X; nb_no)\" 702",
  "616118387 \"Netherlands|North Holland|Haarlem|631068861|616118387|862817723||\" n - - [31/Oct/2019:13:00:01 +0000] \"GET /P04_MH HTTP/1.1\" 200 519546 \"-\" \"MultiRoomAudioPlayer//5.1\" 6"
)

I tried to use the rex package as shown below, but I keep getting "unexpected input" errors. What am I doing wrong? Can anyone help me with this? Here is my attempt for a single record (the first element of the vector):
library(rex)
re_logic <- rex(
capture(name = "process_id", digits),
"`\´",
capture(name = "country", non_spaces),
"|",
capture(name = "county", non_spaces),
"|",
capture(name = "city", non_spaces),
"|",
capture(name = "x1", digits),
"|",
capture(name = "x2", digits),
"|",
capture(name = "x3", digits),
"|",
capture(name = "process_name", alpha),
"`n - -´",
spaces,
"[",
capture(name = "accept_date", except_some_of("]")),
"]",
spaces,
"`\´",
capture(name = "http_request", non_quotes),
"`\´",
spaces,
capture(name = "status_code", digits),
spaces,
capture(name = "bytes_read", some_of("+", digit)),
"`" \"´",
capture(name = "actconn", digits),
"`//´",
spaces,
"(",
capture(name = "Tr", non_quotes),
";" )
# sample view
re_matches(test, re_logic) %>% as_tibble()

Answered on 2021-04-28 10:20:06

You can use
re_logic <- rex(
capture(name = "process_id", digits),
spaces, quote,
capture(name = "country", except_some_of("|")),
"|",
capture(name = "county", except_some_of("|")),
"|",
capture(name = "city", except_some_of("|")),
"|",
capture(name = "x1", digits),
"|",
capture(name = "x2", digits),
"|",
capture(name = "x3", digits),
"|",
capture(name = "process_name", zero_or_more(alpha)),
"|", quote, spaces, "n", spaces, "-", spaces, "-",spaces,
"[",
capture(name = "accept_date", except_some_of("]","[")),
"]",
spaces, quote,
capture(name = "http_request", non_quotes),
quote, spaces,
capture(name = "status_code", digits),
spaces,
capture(name = "bytes_read", some_of("+", digit)),
spaces, quote, non_quotes, quote, spaces, quote,
capture(name = "actconn", except_some_of(quote, "/")),
"/", non_spaces,
maybe(
spaces, "(",
capture(name = "Tr", except_some_of(";"))
)
)
re_matches(test, re_logic)

See the regex demo.
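Notes:
- You need to use quote to match any ' or " character.
- I did not use non_spaces to match the geographic names, because names such as "Oslo County" contain spaces. Instead, I matched any characters other than |, with except_some_of("|").
- The Tr part is optional (the second record has no parenthesized device section), so the pattern chain related to that group needs to be wrapped in a maybe clause.

https://stackoverflow.com/questions/67296836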
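The three points in the notes can be tried out on a much smaller input. The following is only a minimal sketch: the two toy log lines and the names toy, toy_re, res, agent and device are made up for illustration and are not part of the original data, and it assumes the rex package is installed.

```r
library(rex)

# Hypothetical, shortened log lines (not the real records from the question)
toy <- c(
  "42 \"Norway|Oslo County|Oslo|\" \"Agent/1.0 (iPhone; U)\"",
  "43 \"Netherlands|North Holland|Haarlem|\" \"Player/5.1\""
)

toy_re <- rex(
  capture(name = "id", digits),
  spaces, quote,                                       # quote matches ' or "
  capture(name = "country", except_some_of("|")), "|",
  capture(name = "county", except_some_of("|")), "|",  # value may contain spaces
  capture(name = "city", except_some_of("|")), "|",
  quote, spaces, quote,
  capture(name = "agent", except_some_of(quote, "/")),
  "/", non_spaces,
  # optional tail: only the first toy line has a "(device; ...)" part
  maybe(spaces, "(", capture(name = "device", except_some_of(";")))
)

res <- re_matches(toy, toy_re)
print(res)
```

Because the device part is wrapped in maybe, the second line still matches even though it has no parenthesized section. With non_spaces in place of except_some_of("|"), the first line would fail at "Oslo County".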