Q: Extracting email addresses and phone numbers using Stanford CoreNLP

Stack Overflow user

Asked 2015-11-03 18:41:04

Answers: 1 · Views: 1.9K · Followers: 0 · Votes: 0

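I have been looking for a way to extract email addresses, phone numbers, etc. from text using Stanford CoreNLP (RegexNERAnnotator). Can anyone give an example?

Update (04/11/2015): Actually, I should have asked whether the Stanford RegexNERAnnotator has a way to support Java regular expressions.

Example usage: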

       final String EMAIL_PATTERN = 
            "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
            + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";

       List<CoreLabel> tokens = ...;
       TokenSequencePattern pattern = TokenSequencePattern.compile(EMAIL_PATTERN);
       TokenSequenceMatcher matcher = pattern.getMatcher(tokens);

       while (matcher.find()) {
         String matchedString = matcher.group();
         List<CoreMap> matchedTokens = matcher.groupNodes();
         ...
       }

It appears that Java regular expressions are not supported:

Exception in thread "main" edu.stanford.nlp.ling.tokensregex.parser.TokenMgrError: Lexical error at line 1, column 1.  Encountered: "^" (94), after : ""
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParserTokenManager.getNextToken(TokenSequenceParserTokenManager.java:1029)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.jj_ntk(TokenSequenceParser.java:3228)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexBasic(TokenSequenceParser.java:784)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexDisjConj(TokenSequenceParser.java:973)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegex(TokenSequenceParser.java:743)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.SeqRegexWithAction(TokenSequenceParser.java:1596)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.parseSequenceWithAction(TokenSequenceParser.java:37)
    at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.compile(TokenSequencePattern.java:186)
    at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.compile(TokenSequencePattern.java:169)

nlp

stanford-nlp

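1 Answer

Stack Overflow user

Accepted answer

Posted 2015-11-03 19:50:33

Stack Overflow isn't the place for tutorials, or even examples. However, ordinary regular expressions should work even without RegexNER. From a bit of Googling, see "Using a regular expression to validate an email address" for emails. Phone numbers should be as simple as the following long but straightforward regex: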

(\+[0-9]{1,2}(\s*|-)?)?(\(?[0-9]{3}\)?)?(\s*|-)[0-9]{3}(\s*|-)[0-9]{4}
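As a minimal sketch of how the regex above could be applied with plain `java.util.regex` on raw text (no CoreNLP involved): the class name `PhoneNumberExtractor` and the sample sentence are hypothetical, made up for illustration. The pattern itself is the one from the answer, only escaped for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of CoreNLP.
class PhoneNumberExtractor {
    // The phone-number regex from the answer, escaped for a Java string literal.
    private static final Pattern PHONE = Pattern.compile(
        "(\\+[0-9]{1,2}(\\s*|-)?)?(\\(?[0-9]{3}\\)?)?(\\s*|-)[0-9]{3}(\\s*|-)[0-9]{4}");

    // Returns every substring of the text that matches the pattern, in order.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            // The separator group can absorb leading whitespace, so trim it.
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample input is an assumption, chosen only to exercise the pattern.
        System.out.println(extract("Call (555) 123-4567 or +1 650-555-0199."));
    }
}
```

Note that `find()` is used rather than `matches()`, so the pattern is searched for anywhere in the text instead of having to cover the whole string.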

My guess is that the tokenization from the Stanford tokenizer will make this harder, not easier.
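One way to sidestep tokenization entirely is to run a Java regex over the raw text before (or instead of) handing it to CoreNLP. The sketch below is an assumption about how that might look, not a CoreNLP feature: `EmailExtractor` is a hypothetical class, and the pattern is adapted from the one in the question with the `^` and `$` anchors dropped so that it can match inside a longer sentence (the first character class is also written as `[_A-Za-z0-9+-]`, which accepts the same characters).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of CoreNLP.
class EmailExtractor {
    // Adapted from the question's EMAIL_PATTERN, without the ^ and $ anchors.
    private static final Pattern EMAIL = Pattern.compile(
        "[_A-Za-z0-9+-]+(\\.[_A-Za-z0-9-]+)*@"
        + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})");

    // Returns every substring of the raw text that matches the pattern, in order.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample input is an assumption, chosen only to exercise the pattern.
        System.out.println(extract("Contact john.doe@example.com or sales@my-company.co.uk."));
    }
}
```

This pattern is a pragmatic approximation rather than a full validator of email syntax; it is meant for pulling likely addresses out of free text.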

Votes: 5

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/33506672
