
For TokensRegex, do rules need to be of type tokens in order to use Annotate?

Stack Overflow user

Asked 2017-09-18 22:18:54

1 answer · 308 views · 0 followers · 0 votes

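I'm working on some old TokensRegex code, and I've run into a case where some characters are not tokenized by PTBTokenizer. In particular, I'm looking at currency symbols. For example, ₱ will not come out as a token, whereas others such as $ will.

So I thought I'd try writing a rule of type text rather than type tokens, capture the symbol in a capture group, and then do something like Annotate($0, ner, "MONEY") to capture strings such as ₱240.

When I try this, I get: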

... 49 more
Caused by: java.lang.ClassCastException: edu.stanford.nlp.ling.tokensregex.TokenSequencePattern cannot be cast to java.lang.String
    at edu.stanford.nlp.ling.tokensregex.SequenceMatchRules$TextPatternExtractRuleCreator.create(SequenceMatchRules.java:666)
    at edu.stanford.nlp.ling.tokensregex.SequenceMatchRules.createExtractionRule(SequenceMatchRules.java:331)
    at edu.stanford.nlp.ling.tokensregex.SequenceMatchRules.createRule(SequenceMatchRules.java:321)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.Rule(TokenSequenceParser.java:141)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.RuleList(TokenSequenceParser.java:125)
    at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.updateExpressionExtractor(TokenSequenceParser.java:32)
    at edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor.createExtractorFromFiles(CoreMapExpressionExtractor.java:292)
... 52 more

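So can I do the above and create a MONEY ner annotation when the currency symbol has been missed by the tokenizer?

Example

A text rule that attempts to do what I want (create an ner annotation called CURRENCY for strings containing a peso amount):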

ENV.defaults["ruleType"] = "text"
{ text: /(₱\d+)/ => Annotate($0, ner, "CURRENCY")}
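For reference, here is a minimal sketch of what the character-level regex in that rule is meant to pick up from raw text, independent of any tokenizer. It uses plain java.util.regex rather than TokensRegex, so it says nothing about why the rule above fails to load. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: plain java.util.regex, not the TokensRegex engine.
class PesoTextMatch {
    // Same regex as in the text rule: a peso sign (U+20B1) followed by digits.
    private static final Pattern PESO_AMOUNT = Pattern.compile("(\u20B1\\d+)");

    /** Returns every substring of the raw text that the regex matches. */
    static List<String> findAmounts(String text) {
        List<String> amounts = new ArrayList<>();
        Matcher m = PESO_AMOUNT.matcher(text);
        while (m.find()) {
            amounts.add(m.group(1)); // group 1 is the whole "peso sign + digits" span
        }
        return amounts;
    }

    public static void main(String[] args) {
        System.out.println(findAmounts("The fare is \u20B1240 and lunch was \u20B195."));
    }
}
```

Because this works on characters, it finds the amount whether or not a tokenizer would have kept the symbol.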

A tokens rule that successfully does what I want (because the yen sign is a recognized token). This picks up a yen amount and adds a new CURRENCY annotation to it:

ENV.defaults["ruleType"] = "tokens"
ENV.defaults["matchWithResults"] = TRUE

# Set default string pattern flags (to case-insensitive)
ENV.defaultStringPatternFlags = 2

ENV.defaults["stage"] = 0

# Ex: ¥3000
{   
pattern:  ([{ word: "¥" }] $NUMBER_COMMA_SEP $LARGE_NUMBERS?),
action: (Annotate($0, ner, "CURRENCY"))
}

where ner is defined as:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

and:

$NUMBER_COMMA_SEP = "$NUMBER_NON_CD | ([{ tag: /CD/ } & $VALID_NUMERIC_CHARS] [{ tag: /CD/; word: /,\d+(\.\d+)?/ }]*)"
$LARGE_NUMBERS = "/thousand|million|mil|mn|billion|bil|bn|trillion/"

stanford-nlp

1 answer

Stack Overflow user

Accepted answer

Answered 2017-09-21 05:33:14

You need to make sure the tokenizer is not deleting the untokenizable characters.

Command:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,tokensregex -tokensregex.rules example-rules.txt -props StanfordCoreNLP-spanish.properties -tokenize.options "untokenizable=allKeep" -file example.txt -outputFormat text

example-rules.txt

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ pattern: ( /₱/ /[0-9]+/ ) , action: (Annotate($0, ner, "CURRENCY") ) }

If you run this on text containing that symbol, and the tokenizer is configured correctly, it will produce a separate token for the symbol.
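The effect of keeping the symbol as its own token can be imitated without CoreNLP. The sketch below mimics the two-token pattern /₱/ /[0-9]+/ over hand-written token lists. Those lists are assumptions about what a tokenizer might emit with and without the symbol, not actual PTBTokenizer output, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two-token rule; it does not call CoreNLP.
class PesoTokenMatch {
    /** Joins each peso-sign token (U+20B1) with the all-digit token right after it. */
    static List<String> findAmounts(List<String> tokens) {
        List<String> amounts = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            boolean isPesoSign = tokens.get(i).equals("\u20B1");
            boolean isNumber = tokens.get(i + 1).matches("[0-9]+");
            if (isPesoSign && isNumber) {
                amounts.add(tokens.get(i) + tokens.get(i + 1));
            }
        }
        return amounts;
    }

    public static void main(String[] args) {
        // Assumed token stream when the symbol is kept as a token.
        System.out.println(findAmounts(List.of("It", "costs", "\u20B1", "240", ".")));
        // Assumed token stream when the symbol has been dropped.
        System.out.println(findAmounts(List.of("It", "costs", "240", ".")));
    }
}
```

With the symbol present as a token the pair is found; once it has been dropped there is nothing left for a token-level pattern to match, which is why the tokenizer option matters.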

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/46288962
