How do I define a regex to remove text-obfuscated spam links ("spam1 dot com") from a Java string?

Stack Overflow user
Asked 2011-10-10 13:13:07
Answers 3 · Views 487 · Followers 0 · Votes 5

I have a list of websites that represent spam links:

List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

Is there a regex-based way to remove links matching these banned sites from text such as:

Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com 
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer

Note that the link can appear in several URL formats, which the following solution does a good job of identifying:

    String input = "Dear Arezzo,\n"
        + "Please check out my website at spam1.com or http://www.spam1.com \n"
        + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
        + "Thank you.";

    List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

    StringBuilder re = new StringBuilder();
    for (String bannedSite : bannedSites) {
        if (re.length() > 0)
            re.append("|");
        re.append(String.format("http://(www\\.)?%s\\S*|%1$s",
                                Pattern.quote(bannedSite)));
    }

    System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));

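While the code above works well for the URL formats spam1.com, http://www.spam1.com and http://spam1.com, it misses the various obfuscated text formats.

How can the regex be modified to also target text formats such as these?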

spam1 dot com
spam1[.com]
spam1 .com
spam1 . com

The goal is to produce a result like this:

Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED] 
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer

As I noted in a comment below, I may not need to ban the entire string spam1 dot com. If I can scrub just the spam1 part so that it becomes [LINK REMOVED] dot com, that would do the job.

3 Answers

Stack Overflow user

Accepted answer

Posted 2011-10-10 13:24:25

Here's a start for you.

import java.util.*;
import java.util.regex.Pattern;

class Test {
    public static void main(String[] args) {

        String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com "
            + "or http://www.spam1.com or http://spam1.com or " 
            + "spam1 dot com to win millions of dollars in prizes.\n"
            + "Thank you.";

        List<String> bannedSites = Arrays.asList("spam1", "spam2", "spam3");

        StringBuilder re = new StringBuilder();
        for (String bannedSite : bannedSites) {
            if (re.length() > 0)
                re.append("|");
            String quotedSite = Pattern.quote(bannedSite);
            re.append("https?://(www\\.)?" + quotedSite + "\\S*");
            re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
            //re.append("|" ... your variation here);
        }

        System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
    }
}

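Output:

Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.

Extend the regex as needed.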
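
One possible extension along these lines, offered as a sketch rather than a definitive filter: it adds the bracketed and spaced variants from the question (spam1[.com], spam1 .com, spam1 . com) and uses the bracketed [LINK REMOVED] marker. The class name `ObfuscatedLinkFilter`, the `SEPARATOR` constant, the `scrub` helper, and the fixed TLD list (com, net, org) are assumptions made for illustration and do not come from the answer above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ObfuscatedLinkFilter {

    // Assumed separator: "." or the word "dot", with optional whitespace
    // and square brackets on either side.
    private static final String SEPARATOR =
            "[\\s\\[\\]]*(?:\\.|\\bdot\\b)[\\s\\[\\]]*";

    // Builds one case-insensitive pattern covering URL and obfuscated text forms.
    static Pattern buildPattern(List<String> bannedNames) {
        String names = bannedNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Form 1: scheme, optional "www.", banned name, then the rest of the token.
        String url = "https?://(?:www\\.)?(?:" + names + ")\\S*";
        // Form 2: banned name, separator, TLD, and an optional closing bracket.
        String text = "\\b(?:" + names + ")" + SEPARATOR + "(?:com|net|org)\\b\\]?";
        return Pattern.compile(url + "|" + text, Pattern.CASE_INSENSITIVE);
    }

    // Replaces every match with the marker requested in the question.
    static String scrub(String input, List<String> bannedNames) {
        return buildPattern(bannedNames).matcher(input).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> banned = Arrays.asList("spam1", "spam2", "spam3");
        String[] samples = {
            "spam1 dot com", "spam1[.com]", "spam1 .com", "spam1 . com",
            "http://www.spam1.com/win"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + scrub(sample, banned));
        }
    }
}
```

The word boundary after the TLD is a design choice to avoid matching a word that merely starts with com, net, or org. A bare www.spam1.com without a scheme is only partly covered here (the www. prefix is left in place), so that form would need its own variation.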

Votes 2

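Stack Overflow user

Posted 2011-10-10 13:20:24

I suggest storing the site blacklist in a trie (http://en.wikipedia.org/wiki/Trie) data structure. Then, as you read through the text, you can compare against the trie and remove any banned site. This will be more efficient than a regex, because with a regex you search the input text for every spam site string.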
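
A minimal sketch of what this could look like, assuming the blacklist holds bare site names such as spam1 and that matching is case-insensitive. The class `BannedSiteTrie` and its methods are hypothetical names chosen for illustration, and the sketch does not measure whether it is actually faster than a regex.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class BannedSiteTrie {

    // One trie node: children keyed by character, plus an end-of-entry flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    // Inserts every banned name into the trie, lowercased.
    BannedSiteTrie(List<String> bannedSites) {
        for (String site : bannedSites) {
            Node node = root;
            for (char c : site.toLowerCase(Locale.ROOT).toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
    }

    // Returns the end index (exclusive) of the longest banned entry that
    // starts at 'start', or -1 if none starts there.
    private int longestMatchEnd(String text, int start) {
        Node node = root;
        int end = -1;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(Character.toLowerCase(text.charAt(i)));
            if (node == null) {
                break;
            }
            if (node.terminal) {
                end = i + 1;
            }
        }
        return end;
    }

    // Single left-to-right pass that replaces each banned entry with the marker.
    String scrub(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int end = longestMatchEnd(text, i);
            if (end > 0) {
                out.append("[LINK REMOVED]");
                i = end;
            } else {
                out.append(text.charAt(i++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        BannedSiteTrie trie = new BannedSiteTrie(Arrays.asList("spam1", "spam2", "spam3"));
        System.out.println(trie.scrub("Visit spam1 dot com or http://www.spam2.com today"));
    }
}
```

Because only the site name is scrubbed, spam1 dot com becomes [LINK REMOVED] dot com, which is the reduced result the question says would be acceptable.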

Votes 0

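Stack Overflow user

Posted 2011-10-10 18:41:36

Using regular expressions for this purpose may become a performance bottleneck as the list of spam sites, the total number of messages processed, and the message sizes all grow.

The regex in the test code below works, but I would only use it after testing it thoroughly and making every possible performance improvement.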

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpamPatternTest {
    public static void main(String[] args) {
        final String[] spam = new String[] {"spam1.com", "spam2.net"};
        System.out.println("***** SPAM SITES *****\n" + Arrays.toString(spam)
                + "\n");
        // Optional scheme, then any leading non-whitespace (e.g. "www."),
        // then one of the banned sites.
        final StringBuilder patternBuilder = new StringBuilder();
        patternBuilder.append("(?i)(?:(?:f|ht)tps?://)?(?:\\S*?)(");
        for (final String s : spam) {
            // Each "." in the site name may appear as "." or "dot", with
            // optional whitespace and square brackets around it.
            patternBuilder
                    .append("(?:\\[|\\])?"
                            + s.replaceAll("\\.",
                                    "\\\\s*(?:\\\\[|\\\\])?\\\\s*(?:\\\\.|dot)\\\\s*(?:\\\\[|\\\\])?\\\\s*")
                            + "\\s*(?:\\[|\\])?").append("|");
        }
        // Drop the trailing "|" left by the loop.
        patternBuilder.setLength(patternBuilder.length() - 1);
        // Optional path; the match must end at whitespace or end of input.
        patternBuilder.append(")(?:/\\S*)?(?=\\s|$)");
        final String ps = patternBuilder.toString();
        System.out.println("***** PATTERN *****\n" + ps + "\n");
        final Pattern p = Pattern.compile(ps);
        for (String s : new String[] {"http://www.spam1.com",
                "http://spam2.net", "www.spam1.com", "spam1 dot com",
                "spam1[.com]", "spam1 .com", "spam2 . net", "no links here"})
        {
            final Matcher m = p.matcher(s);
            if (m.matches()) {
                System.out.println("Success: " + s);
            } else {
                System.out.println("Fail: " + s);
            }
        }
        final String message =
                "Dear Arezzo,\nPlease check out my website at spam1.com or http://www.spam1.com \nor http://spam1.com or spam1 dot com to win millions of dollars in prizes.\nThank you.\nBig Spammer\n";
        final Matcher m = p.matcher(message);
        System.out.println("\n\n***** ORIGINAL MESSAGE *****\n" + message
                + "\n\n***** REPLACED LINKS *****\n"
                + m.replaceAll("[LINK REMOVED]"));
    }
}

Which outputs:

***** SPAM SITES *****
[spam1.com, spam2.net]

***** PATTERN *****
(?i)(?:(?:f|ht)tps?://)?(?:\S*?)((?:\[|\])?spam1\s*(?:\[|\])?\s*(?:\.|dot)\s*(?:\[|\])?\s*com\s*(?:\[|\])?|(?:\[|\])?spam2\s*(?:\[|\])?\s*(?:\.|dot)\s*(?:\[|\])?\s*net\s*(?:\[|\])?)(?:/\S*)?(?=\s|$)

Success: http://www.spam1.com
Success: http://spam2.net
Success: www.spam1.com
Success: spam1 dot com
Success: spam1[.com]
Success: spam1 .com
Success: spam2 . net
Fail: no links here


***** ORIGINAL MESSAGE *****
Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com 
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer


***** REPLACED LINKS *****
Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer
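
On the performance concern raised at the start of this answer, one commonly used improvement is to compile the pattern once and reuse it for every message, since `String.replaceAll` compiles its regex on each call. The sketch below shows that idea only; the class name `PrecompiledScrubber` and the simplified stand-in regex are assumptions for illustration, not the full pattern printed above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledScrubber {

    // Compiled once in the constructor. Pattern instances are immutable and
    // safe to share between threads; Matcher instances are not.
    private final Pattern pattern;

    PrecompiledScrubber(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Creates a fresh Matcher per message; only the Pattern is shared.
    String scrub(String message) {
        return pattern.matcher(message).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        // Simplified stand-in regex (an assumption, not the pattern from the answer).
        PrecompiledScrubber scrubber =
                new PrecompiledScrubber("(?i)spam[12]\\s*(?:\\.|dot)\\s*(?:com|net)\\b");
        List<String> messages =
                Arrays.asList("Visit spam1 dot com", "Try spam2.net now", "no links here");
        for (String message : messages) {
            System.out.println(scrubber.scrub(message));
        }
    }
}
```

Whether this is enough depends on the workload, so it would still need to be measured against the real site list and message volume.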
Votes 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/7713281
