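I'm trying to write a regular expression so that I can use LucidWorks to crawl and index only certain URLs on my website.
Example URL: http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/
Example URL: http://www.example.com/reviews/super-mario-3d-land/64303/reviews/
Basically, I want LucidWorks to crawl my entire site but index only the URLs that end with /reviews/.
Can someone help me construct an expression? :)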
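Update:
URL: http://www.example.com/
Include paths: //*/reviews/*
This sort of works, but it only crawls the first page; it doesn't go on to the following pages with more reviews (1, 2, 3, etc.).
If I also add: ///reviews/.*
I get a lot of pages I don't want, such as http://www.example.com/?page=2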
Posted on 2013-02-14 10:25:09
You can check a URL with this function:
import java.util.regex.PatternSyntaxException;

public boolean canAcceptURL(String url, String endsWith) {
    boolean canAccept = false;
    try {
        // Fall back to the default suffix when none is supplied.
        if (endsWith == null || endsWith.isEmpty()) {
            endsWith = "/reviews/";
        }
        // Any run of printable ASCII followed by the suffix at the end of the string.
        // Note: endsWith is inserted as-is, so regex metacharacters in it are not escaped.
        String regex = "[\\x20-\\x7E]*" + endsWith + "$";
        canAccept = url.matches(regex);
    } catch (PatternSyntaxException pe) {
        // The suffix did not form a valid regular expression.
        pe.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("String matches : " + canAccept);
    return canAccept;
}
Sample output:
Calling the function: canAcceptURL("http://www.example.com/reviews/super-mario-3d-land/64303/reviews/","/reviews/");
String matches : true
If you want to accept any URL that contains '/reviews/' rather than only those that end with it, change the regex string to:
String regex = "[\\x20-\\x7E]*/reviews/[\\x20-\\x7E]*"; // accepts any printable-ASCII string (spaces and punctuation included) that contains /reviews/
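To compare the two variants side by side, here is a minimal sketch that precompiles both patterns and uses Matcher.find() instead of String.matches(), so the character-class padding is not needed. The class name ReviewUrlFilter and its method names are made up for illustration, and this only tests URL strings in plain Java; it does not show how LucidWorks itself evaluates include paths.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative and not part of LucidWorks.
final class ReviewUrlFilter {
    // "/reviews/" anchored at the end of the input.
    private static final Pattern ENDS_WITH_REVIEWS = Pattern.compile("/reviews/$");
    // "/reviews/" anywhere in the input.
    private static final Pattern CONTAINS_REVIEWS = Pattern.compile("/reviews/");

    // True when the URL ends with "/reviews/".
    static boolean endsWithReviews(String url) {
        return ENDS_WITH_REVIEWS.matcher(url).find();
    }

    // True when the URL contains "/reviews/" at any position.
    static boolean containsReviews(String url) {
        return CONTAINS_REVIEWS.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/reviews/super-mario-3d-land/64303/reviews/",
            "http://www.example.com/reviews/super-mario-3d-land/64303/",
            "http://www.example.com/?page=2"
        };
        for (String url : urls) {
            System.out.println(url
                + " endsWith=" + endsWithReviews(url)
                + " contains=" + containsReviews(url));
        }
    }
}
```

With find(), the pattern only has to describe the part you care about, whereas matches() requires the pattern to cover the whole string.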