I have a list of websites that represent spam links:
List<String> bannedSites = ["spam1.com", "spam2.com", "spam3.com"];
Is there a regex way to remove links that match these banned sites from text such as:
Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer
Note that the link may appear in several URL formats, and the solution below does a good job of identifying them:
String input = "Dear Arezzo,\n"
        + "Please check out my website at spam1.com or http://www.spam1.com\n"
        + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
        + "Thank you.";
List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");
StringBuilder re = new StringBuilder();
for (String bannedSite : bannedSites) {
    if (re.length() > 0)
        re.append("|");
    // %1$s reuses the quoted site name for the bare form without a scheme
    re.append(String.format("http://(www\\.)?%s\\S*|%1$s",
            Pattern.quote(bannedSite)));
}
System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
While the code above works very well for URL formats such as spam1.com, http://www.spam1.com, and http://spam1.com, **it misses several textual formats.**
How can the regex be modified to also target textual formats such as these?
spam1 dot com
spam1[.com]
spam1 .com
spam1 . com
The goal is to produce a result like this:
Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer
As I noted in a comment below, I may not need to ban the entire string spam1 dot com. If I could wipe out just the spam1 part, so that it becomes [LINK REMOVED] dot com, that would do the job.
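The relaxed requirement in the last paragraph (removing only the site name) can be sketched as follows. This is a minimal illustration, not part of the original thread: the class `NameOnlyRedactor` and the method `redactNames` are hypothetical names, and the banned list is assumed to hold bare names such as "spam1" rather than full domains.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class NameOnlyRedactor {
    // Hypothetical helper: blanks out only the bare site name (e.g. "spam1"),
    // leaving whatever follows it ("dot com", ".com", "[.com]") in place.
    static String redactNames(String text, List<String> bannedNames) {
        String alternation = bannedNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b keeps "spam1" from matching inside a longer word such as "spam10".
        Pattern p = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("spam1", "spam2", "spam3");
        System.out.println(redactNames("Visit spam1 dot com or spam2[.net] today", names));
    }
}
```

With this approach a full URL such as http://www.spam1.com is only partly redacted (the scheme and the ".com" suffix remain), so it suits the case where breaking the link is enough.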
Posted on 2011-10-10 13:24:25
Here's a start for you.
import java.util.*;
import java.util.regex.Pattern;
class Test {
    public static void main(String[] args) {
        String input = "Dear Arezzo,\n"
                + "Please check out my website at spam1.com "
                + "or http://www.spam1.com or http://spam1.com or "
                + "spam1 dot com to win millions of dollars in prizes.\n"
                + "Thank you.";
        List<String> bannedSites = Arrays.asList("spam1", "spam2", "spam3");
        StringBuilder re = new StringBuilder();
        for (String bannedSite : bannedSites) {
            if (re.length() > 0)
                re.append("|");
            String quotedSite = Pattern.quote(bannedSite);
            // Full URLs with an optional "www." prefix
            re.append("https?://(www\\.)?" + quotedSite + "\\S*");
            // Bare names followed by ".com", " dot com", and similar
            re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
            //re.append("|" ... your variation here);
        }
        System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
    }
}
Output:
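Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.
Extend the regular expression as needed.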
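As one possible extension along those lines, the sketch below also covers the bracketed and spaced forms from the question (spam1[.com], spam1 .com, spam1 . com). The class `ObfuscatedLinkFilter`, the method `buildPattern`, and the fixed list of top-level domains (com, net, org) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ObfuscatedLinkFilter {
    // Hypothetical helper: one pattern for URL forms and obfuscated text forms.
    static Pattern buildPattern(List<String> bannedNames) {
        String names = bannedNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        String regex = "(?:https?://)?(?:www\\.)?"   // optional scheme and www.
                + "(?:" + names + ")"                 // banned site name
                + "\\s*\\[?\\s*(?:\\.|dot)\\s*"       // ".", " dot ", "[." with spaces
                + "(?:com|net|org)\\b"                // assumed TLD list
                + "\\]?(?:/\\S*)?";                   // closing bracket, optional path
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    static String filter(String text, List<String> bannedNames) {
        return buildPattern(bannedNames).matcher(text).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("spam1", "spam2", "spam3");
        String input = "Try spam1.com, http://www.spam1.com, spam1 dot com, "
                + "spam1[.com], spam1 .com and spam1 . com";
        System.out.println(filter(input, names));
    }
}
```

The trailing `\b` after the TLD group is there so that a word like "spam1.community" is not cut off after "com"; whether that is the desired behavior depends on the spam you actually see.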
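Posted on 2011-10-10 13:20:24
I suggest using a trie (http://en.wikipedia.org/wiki/Trie) data structure to store the site blacklist. Then, while reading the text, you can compare against it and remove the banned sites. It will be more efficient than a regex, because with a regex you search the input text for each spam-site string.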
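A minimal sketch of that idea is shown below, assuming the blacklist holds literal strings such as "spam1.com". The class `TrieLinkFilter` and its methods are hypothetical names; the scan walks the trie from each text position and replaces the longest banned entry found there.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TrieLinkFilter {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a banned site ends at this node
    }

    private final Node root = new Node();

    TrieLinkFilter(List<String> bannedSites) {
        for (String site : bannedSites) {
            Node node = root;
            for (char c : site.toLowerCase(Locale.ROOT).toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
    }

    // Length of the longest banned entry starting at text[start], or 0 if none.
    private int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(Character.toLowerCase(text.charAt(i)));
            if (node == null) break;
            if (node.terminal) best = i - start + 1;
        }
        return best;
    }

    String filter(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int len = longestMatch(text, i);
            if (len > 0) {
                out.append("[LINK REMOVED]");
                i += len;
            } else {
                out.append(text.charAt(i++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieLinkFilter f = new TrieLinkFilter(Arrays.asList("spam1.com", "spam2.com"));
        System.out.println(f.filter("see spam1.com or http://www.SPAM2.com now"));
    }
}
```

The work per text position is bounded by the longest blacklist entry rather than by the number of entries. Note that this sketch only matches literal forms; obfuscated variants such as "spam1 dot com" would need normalization first or a separate pass.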
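Posted on 2011-10-10 18:41:36
Using regular expressions for this purpose may become a performance bottleneck as the list of spam sites, the total number of messages processed, and the message sizes all grow.
The regex in the test code below works, but I would use it only after thorough testing and after making every possible performance improvement.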
final String[] spam = new String[] {"spam1.com", "spam2.net"};
System.out.println("***** SPAM SITES *****\n" + Arrays.toString(spam)
        + "\n");
final StringBuilder patternBuilder = new StringBuilder();
// Optional scheme, then any leading non-space prefix such as "www."
patternBuilder.append("(?i)(?:(?:f|ht)tps?://)?(?:\\S*?)(");
for (final String s : spam) {
    // Each "." in the site name may appear as "." or "dot", optionally
    // surrounded by whitespace and square brackets
    patternBuilder
            .append("(?:\\[|\\])?"
                    + s.replaceAll("\\.",
                            "\\\\s*(?:\\\\[|\\\\])?\\\\s*(?:\\\\.|dot)\\\\s*(?:\\\\[|\\\\])?\\\\s*")
                    + "\\s*(?:\\[|\\])?").append("|");
}
// Drop the trailing "|"
patternBuilder.setLength(patternBuilder.length() - 1);
// Optional path; the match must end at whitespace or end of input
patternBuilder.append(")(?:/\\S*)?(?=\\s|$)");
final String ps = patternBuilder.toString();
final String psLong = ps;
System.out.println("***** PATTERN *****\n" + psLong + "\n");
final Pattern p = Pattern.compile(ps);
for (String s : new String[] {"http://www.spam1.com",
        "http://spam2.net", "www.spam1.com", "spam1 dot com",
        "spam1[.com]", "spam1 .com", "spam2 . net", "no links here"})
{
    final Matcher m = p.matcher(s);
    if (m.matches()) {
        System.out.println("Success: " + s);
    } else {
        System.out.println("Fail: " + s);
    }
}
final String message =
        "Dear Arezzo,\nPlease check out my website at spam1.com or http://www.spam1.com \nor http://spam1.com or spam1 dot com to win millions of dollars in prizes.\nThank you.\nBig Spammer\n";
final Matcher m = p.matcher(message);
System.out.println("\n\n***** ORIGINAL MESSAGE *****\n" + message
        + "\n\n***** REPLACED LINKS *****\n"
        + m.replaceAll("[LINK REMOVED]"));
Which outputs:
***** SPAM SITES *****
[spam1.com, spam2.net]
***** PATTERN *****
(?i)(?:(?:f|ht)tps?://)?(?:\S*?)((?:\[|\])?spam1\s*(?:\[|\])?\s*(?:\.|dot)\s*(?:\[|\])?\s*com\s*(?:\[|\])?|(?:\[|\])?spam2\s*(?:\[|\])?\s*(?:\.|dot)\s*(?:\[|\])?\s*net\s*(?:\[|\])?)(?:/\S*)?(?=\s|$)
Success: http://www.spam1.com
Success: http://spam2.net
Success: www.spam1.com
Success: spam1 dot com
Success: spam1[.com]
Success: spam1 .com
Success: spam2 . net
Fail: no links here
***** ORIGINAL MESSAGE *****
Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer
***** REPLACED LINKS *****
Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer
https://stackoverflow.com/questions/7713281