Text mining cleanup with Ruby & Regex

Stack Overflow user
Asked on 2015-05-14 15:58:19
4 answers, 186 views, 0 followers, 0 votes

I have a word-count hash that looks like this:

words = {
  "love"   => 10,
  "hate"   => 12,
  "lovely" => 3,
  "loving" => 2,
  "loved"  => 1, 
  "peace"  => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
  # there are more but you get the idea
}

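I want to make sure that "love", "loving", and "loved" are all counted as "love". So I add their counts together as the count for "love" and remove the other variants. At the same time, I don't want "lovely" to be counted as "love", so I leave it as it is.

So I would end up with something like this:
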
words = {
  "love"   => 13,
  "hate"   => 12,
  "lovely" => 3,
  "peace"  => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
  # there are more but you get the idea
}

I have some code that works, but I think the logic on the last line is wrong. Could you help me fix it, or suggest a better approach?

words.select { |k| /\Alov[a-z]*/.match(k) }
words["love"] = words.select { |k| /\Alov[a-z]*/.match(k) }.map(&:last).reduce(:+) - 1 # that 1 is for the 1 for "lovely"; I tried not to hard code it by using words["lovely"], but it messed things up completely, so I had to do this.
words.delete_if { |k| /\Alov[a-z]*/.match(k) && k != "love" && k != "lovely" }

Thanks!

4 Answers

Stack Overflow user

Accepted answer

Answered on 2015-05-14 16:27:59

words = {
  "love"   => 10,
  "hate"   => 12,
  "lovely" => 3,
  "loving" => 2,
  "loved"  => 1,
  "peace"  => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
  # there are more but you get the idea
}

aggregated_words = words.inject({}) do |memo, (word, count)|
  key = word =~ /\Alov.+/ && word != "lovely" ? "love" : word
  memo[key] = memo[key].to_i + count
  memo
end

# => {"love"=>13, "hate"=>12, "lovely"=>3, "peace"=>14, "thanks"=>3, "wonderful"=>10, "grateful"=>10}
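
If the variants are known up front, an explicit lookup table is one way to avoid a pattern plus a hard-coded exception. This is only a sketch: the `CANONICAL` table, the `aggregate` method, and the shortened `sample` hash are illustrative names and data, not part of the answer above.

```ruby
# Hypothetical lookup table: each variant maps to the word it is counted under.
# Words that are not listed (including "lovely") keep their own key.
CANONICAL = { "loving" => "love", "loved" => "love" }.freeze

# Returns a new hash of totals; the input hash is not modified.
def aggregate(words, canonical = CANONICAL)
  words.each_with_object(Hash.new(0)) do |(word, count), totals|
    totals[canonical.fetch(word, word)] += count
  end
end

sample = { "love" => 10, "hate" => 12, "lovely" => 3, "loving" => 2, "loved" => 1 }
result = aggregate(sample)
```

`Hash.new(0)` plays the same role as `memo[key].to_i` above: a key that has not been seen yet starts from zero.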
Votes: 0

Stack Overflow user

Answered on 2015-05-14 16:59:33

I suggest the following:

# The x flag enables extended mode (whitespace and comments are ignored);
# the i flag makes the match case-insensitive.
r = /
    lov     # match 'lov'
    (?!ely) # negative lookahead to not match 'ely'
    [a-z]+  # match one or more letters
    /xi

words.each_with_object(Hash.new(0)) { |(k,v),h| (k=~r) ? h["love"]+=v : h[k]=v }
  #=> {"love"=>13, "hate"=>12, "lovely"=>3, "peace"=>14, "thanks"=>3,
  #    "wonderful"=>10, "grateful"=>10} 
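
One thing to watch: a pattern of the form `lov(?!ely)[a-z]+` is not anchored, so a word such as "glove" would also be folded into "love" if it appeared in the hash. A possible anchored variant is sketched below; the constant `ANCHORED_LOVE`, the method `fold_love`, and the sample data (including "glove") are assumptions made for illustration.

```ruby
# Anchored variant (an assumption, not the pattern from the answer itself).
ANCHORED_LOVE = /
  \A          # start of the string
  lov         # the shared stem
  (?!ely\z)   # reject the exact word "lovely"
  [a-z]+      # one or more letters
  \z          # end of the string
/xi

# Sums counts into a new hash, folding matching words into "love".
def fold_love(words, pattern = ANCHORED_LOVE)
  words.each_with_object(Hash.new(0)) do |(word, count), totals|
    key = word.match?(pattern) ? "love" : word
    totals[key] += count
  end
end

sample = { "love" => 10, "lovely" => 3, "loving" => 2, "glove" => 4 }
folded = fold_love(sample)
```

With the anchors in place, "glove" keeps its own entry while "love" and "loving" are summed together.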
Votes: 1

Stack Overflow user

Answered on 2015-05-14 16:17:51

Here is a functional, non-destructive version:

words = {
  "love"   => 10,
  "hate"   => 12,
  "lovely" => 3,
  "loving" => 2,
  "loved"  => 1, 
  "peace"  => 14,
  "thanks" => 3,
  "wonderful" => 10,
  "grateful" => 10
}

to_love_or_not_to_love = words.partition {|w| w.first =~ /^lov/ && w.first != "lovely"}

{"love" => to_love_or_not_to_love.first.map(&:last).sum}.merge(to_love_or_not_to_love.last.reduce({}) {|m, e| m[e.first] = e.last; m})

=> {"love"=>13, "hate"=>12, "lovely"=>3, "peace"=>14, "thanks"=>3, "wonderful"=>10, "grateful"=>10}
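
The same non-destructive idea can probably be written more compactly with `group_by` and `transform_values` (available from Ruby 2.4). This is a sketch under that assumption; the method name `merge_love` is made up for illustration.

```ruby
# Groups the [word, count] pairs by the key they should be counted under,
# then sums the counts in each group. The input hash is left untouched.
def merge_love(words)
  words
    .group_by { |word, _count| word.match?(/\Alov/) && word != "lovely" ? "love" : word }
    .transform_values { |pairs| pairs.sum { |_word, count| count } }
end

words = {
  "love" => 10, "hate" => 12, "lovely" => 3, "loving" => 2, "loved" => 1,
  "peace" => 14, "thanks" => 3, "wonderful" => 10, "grateful" => 10
}.freeze # frozen to show that nothing below mutates it

merged = merge_love(words)
```

Because `group_by` builds a new hash, the call works on a frozen input, which is a cheap way to confirm that nothing is mutated.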

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/30241836
