Extracting a directory from a string

Stack Overflow user
Asked 2016-03-10 17:37:44
3 answers, 49 views, 0 followers, 2 votes

I need to extract the directory from each string. Sample input:

222.77.201.211 - - [20/Sep/2013:00:10:23 +0800] "GET /mapreduce-nextgen/hadoop-internals-mapreduce-reference/ HTTP/1.1" 200 28664 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
220.181.89.164 - - [20/Sep/2013:00:10:25 +0800] "GET /mapreduce/hadoop-capacity-scheduler HTTP/1.1" 301 390 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
175.44.54.185 - - [20/Sep/2013:00:10:25 +0800] "GET /mapreduce-nextgen/apache-hadoop-2-0-3-published HTTP/1.1" 301 439 "http://dongxicheng.org/mapreduce-nextgen/apache-hadoop-2-0-3-published/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
175.44.54.185 - - [20/Sep/2013:00:10:25 +0800] "GET /search-engine/scribe-intro/ HTTP/1.1" 200 21578 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
112.111.174.38 - - [20/Sep/2013:00:10:30 +0800] "GET /structure/segment-tree HTTP/1.1" 301 414 "http://dongxicheng.org/structure/segment-tree/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
112.111.174.38 - - [20/Sep/2013:00:10:30 +0800] "GET /structure/segment-tree HTTP/1.1" 301 414 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"
222.77.201.211 - - [20/Sep/2013:00:10:31 +0800] "GET /mapreduce-nextgen/apache-hadoop-2-0-3-published/ HTTP/1.1" 200 23438 "http://dongxicheng.org/mapreduce-nextgen/apache-hadoop-2-0-3-published/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

The expected output would be:

  • /mapreduce-nextgen/hadoop-internals-mapreduce-reference/
  • /mapreduce/hadoop-capacity-scheduler
  • /mapreduce-nextgen/apache-hadoop-2-0-3-published
  • and so on.

I think a regular expression may be needed. Thanks in advance!


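3 Answers

Stack Overflow user

Accepted answer

Posted 2016-03-11 15:14:59

OK, the other answers work and are probably better, but I did it with .indexOf(). The first line, which hard-codes the log entry, is not how the value arrived when Hadoop handled it; it is written that way only for brevity.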

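// Hard-coded for brevity; in the Hadoop job this value is the record handed to the mapper.
Text value = new Text("112.111.186.210 - - [20/Sep/2013:00:10:22 +0800] \"GET /structure/segment-tree HTTP/1.1\" 301 414 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)\"");

String line = value.toString();
int idx = line.indexOf("GET ");
int idy = line.indexOf(" HTTP/1");
// Start after "GET " so that only the path is kept.
// ip is assumed to be a Text declared elsewhere in the job (not shown here).
ip.set(line.substring(idx + "GET ".length(), idy).trim());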
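
For readers who want to try the .indexOf() approach outside Hadoop, here is a minimal sketch using plain String values. The class name IndexOfPathExtractor and the method extractPath are hypothetical names chosen for illustration, and returning null for lines without a GET request is an assumption about the desired behavior, not something stated in the answer.

```java
class IndexOfPathExtractor {

    /**
     * Returns the text between "GET " and " HTTP/", or null when the line
     * does not contain both markers in that order.
     */
    static String extractPath(String line) {
        final String startMarker = "GET ";
        final String endMarker = " HTTP/";

        int start = line.indexOf(startMarker);
        if (start < 0) {
            return null; // no GET request on this line
        }
        start += startMarker.length(); // skip past "GET " itself

        // Search for the end marker only after the start position.
        int end = line.indexOf(endMarker, start);
        if (end < 0) {
            return null; // request field is malformed or truncated
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "112.111.174.38 - - [20/Sep/2013:00:10:30 +0800] "
                + "\"GET /structure/segment-tree HTTP/1.1\" 301 414 \"-\" "
                + "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)\"";
        System.out.println(extractPath(line)); // prints /structure/segment-tree
    }
}
```

Checking both indexOf results before calling substring avoids a StringIndexOutOfBoundsException on lines that are not GET requests.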
Votes: 1

Stack Overflow user

Posted 2016-03-10 17:41:47

If the directory always sits between GET and HTTP, the simplest regex would be:

GET (.*?) HTTP

Proof: Regex101

In Java, the code would look like this:

Pattern p = Pattern.compile("GET (.*?) HTTP");
Matcher m = p.matcher(string);

Edit: Don't forget to put a \ before every " inside the string literal; otherwise it will be interpreted as the end of the string.

String str = "222.77.201.211 - - [20/Sep/2013:00:10:23 +0800] \"GET /mapreduce-nextgen/hadoop-internals-mapreduce-reference/ HTTP/1.1\" 200 28664 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)\"";

For the string above, the output would be /mapreduce-nextgen/hadoop-internals-mapreduce-reference/
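
The Pattern and Matcher lines in this answer stop before the match is read out. As a rough sketch of the remaining step, the example below calls find() and then group(1) on each line. The class name RegexPathExtractor and the method extractPaths are hypothetical, and skipping lines that do not match is an assumption rather than part of the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPathExtractor {

    // Compiled once and reused; group 1 lazily captures the text
    // between "GET " and the first following " HTTP".
    private static final Pattern REQUEST_PATH = Pattern.compile("GET (.*?) HTTP");

    /** Collects the captured path from every line that contains a GET request. */
    static List<String> extractPaths(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            Matcher m = REQUEST_PATH.matcher(line);
            if (m.find()) {          // find() scans for the pattern anywhere in the line
                paths.add(m.group(1));
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "222.77.201.211 - - [20/Sep/2013:00:10:23 +0800] \"GET /mapreduce-nextgen/hadoop-internals-mapreduce-reference/ HTTP/1.1\" 200 28664 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)\"",
            "220.181.89.164 - - [20/Sep/2013:00:10:25 +0800] \"GET /mapreduce/hadoop-capacity-scheduler HTTP/1.1\" 301 390 \"-\" \"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)\"");

        for (String path : extractPaths(lines)) {
            System.out.println(path);
        }
        // prints:
        // /mapreduce-nextgen/hadoop-internals-mapreduce-reference/
        // /mapreduce/hadoop-capacity-scheduler
    }
}
```

Note that find() is used rather than matches(), because the pattern describes only a fragment of each log line, not the whole line.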

Votes: 2

Stack Overflow user

Posted 2016-03-10 17:42:52

// StringUtils here is the Apache Commons Lang class; substringBetween returns null if either marker is missing.
String toInspect = "112.111.186.210 - - [20/Sep/2013:00:10:22 +0800] \"GET /structure/segment-tree HTTP/1.1\" 301 414 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)\"";
String directory = StringUtils.substringBetween(toInspect, "GET ", " HTTP");
Votes: 2
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/35923477
