Spell checker in Pandas

Stack Overflow user
Asked on 2017-09-25 16:00:00
2 answers · 3.1K views · 0 followers · 0 votes

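I am trying to implement Peter Norvig's spell checker in a pandas-backed class, using words pulled from a SQL database. The data consists of user queries, which often contain many spelling mistakes, and I would like the class to return the most probable (correctly spelled) query.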

The class is initialized with a database query that returns a pandas DataFrame. For example:

  query     count
0 foo bar       1864
1 super foo      73
2 bar of foos    1629
3 crazy foos     940

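Most of what follows is taken directly from Peter Norvig's work, but the modifications I made for the class do not seem to work correctly. My guess is that it has to do with removing the Counter functionality (WORDS = Counter(words(open('big.txt').read()))), but I am not sure of the best way to get the same functionality from a DataFrame.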
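One possible way to recover the role of Norvig's WORDS counter from (query, count) rows is to split each query into words and credit every word with the count of the query it appears in. The following is a minimal standard-library sketch under that assumption; the helper name word_counts is hypothetical, the rows are copied from the sample table above, and with a real DataFrame the pairs could come from zip(df['query'], df['count']).

```python
from collections import Counter


def word_counts(rows):
    """Build a word -> weighted count mapping from (query, count) pairs."""
    counts = Counter()
    for query, count in rows:
        # Every word in the query is credited with the query's count.
        for word in query.lower().split():
            counts[word] += count
    return counts


# Sample rows copied from the example table above.
rows = [
    ("foo bar", 1864),
    ("super foo", 73),
    ("bar of foos", 1629),
    ("crazy foos", 940),
]

WORDS = word_counts(rows)
print(WORDS.most_common(3))  # [('bar', 3493), ('foos', 2569), ('foo', 1937)]
```

Because a Counter returns 0 for missing keys, WORDS can be used for both the membership test and the probability lookup, much like in the original script.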

The current class is below:

class _SpellCheckClient(object):
  """Wraps functionality to check the spelling of a query."""

  def __init__(self, team, table, database_connection):
    self.df = database_connection.ExecuteQuery(
        'SELECT query, COUNT(query) AS count FROM table GROUP BY 1;')

  def expected_word(self, word):
    """Most probable spelling correction for word."""
    return max(self._candidates(word), key=self._probability)

  def _probability(self, query):
    """Probability of a given word within a query."""
    query_count = self.df.loc[self.df['query'] == query]['count'].values
    return query_count / self.df['count'].sum()

  def _candidates(self, word):
    """Generate possible spelling corrections for word."""
    return (self._known([word])
            or self._known(self._one_edits_from_word(word))
            or self._known(self._two_edits_from_word(word))
            or [word])

  def _known(self, query):
    """The subset of `words` that appear in the dictionary of WORDS."""
    # return set(w for w in query if w in WORDS)
    return set(w for w in query if w in self.df['query'].value_counts)

  def _one_edits_from_word(self, word):
    """All edits that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits
                  if len(right) > 1]
    replaces = [left + center + right[1:]
                for left, right in splits
                if right for center in LETTERS]
    inserts = [left + center + right
               for left, right in splits
               for center in LETTERS]
    return set(deletes + transposes + replaces + inserts)

  def _two_edits_from_word(self, word):
    """All edits that are two edits away from `word`."""
    return (e2 for e1 in self._one_edits_from_word(word)
            for e2 in self._one_edits_from_word(e1))

Thanks in advance!
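A likely reason the modified _known never matches: the in operator applied to a pandas Series tests the index labels rather than the values, and value_counts written without parentheses is a bound method, which does not support in at all. The sketch below illustrates both points on the sample data; the DataFrame is rebuilt by hand and the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["foo bar", "super foo", "bar of foos", "crazy foos"],
    "count": [1864, 73, 1629, 940],
})

# `in` on a Series checks the index labels (0..3 here), not the values.
print("foo bar" in df["query"])    # False
print(0 in df["query"])            # True

# Checking against the values requires an explicit container.
known_queries = set(df["query"])
print("foo bar" in known_queries)  # True
# Whole queries are stored, so a single word is still not "known".
print("foo" in known_queries)      # False

# Without parentheses, value_counts is a method object, and `in` raises.
try:
    "foo bar" in df["query"].value_counts
except TypeError as error:
    print("TypeError:", error)
```

Since the stored values are whole queries, a per-word lookup also needs the queries split into words first.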

2 Answers

Stack Overflow user

Accepted answer

Posted on 2017-10-10 23:40:56

For anyone looking for an answer, here is what worked for me:

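import string

# Assumption: lowercase ASCII alphabet, as in Peter Norvig's original script;
# the code below uses LETTERS without defining it.
LETTERS = string.ascii_lowercase


def _words(df):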
  """Returns the total count of each word within a dataframe."""
  return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])


class _SpellCheckClient(object):
  """Wraps functionality to check the spelling of a query."""

  def __init__(self, team, table, database_connection):
    self.df = database_connection
    self.words = _words(self.df)

  def expected_word(self, query):
    """Most probable spelling correction for word."""
    return max(self._candidates(query), key=self._probability)

  def _probability(self, query):
    """Probability of a given word within a query."""
    return self.words.pipe(lambda x: x / x.sum()).get(query, 0.0)

  def _candidates(self, query):
    """Generate possible spelling corrections for word."""
    return (self._known(self._one_edits_from_query(query))
            or self._known(self._two_edits_from_query(query))
            or [query])

  def _known(self, query):
    """The subset of `query` that appear in the search console database."""
    return set(w for w in query if self.words.get(w))

  def _one_edits_from_query(self, query):
    """All edits that are one edit away from `query`."""
    splits = [(query[:i], query[i:]) for i in range(len(query) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits
                  if len(right) > 1]
    replaces = [left + center + right[1:]
                for left, right in splits
                if right for center in LETTERS]
    inserts = [left + center + right
               for left, right in splits
               for center in LETTERS]
    return set(deletes + transposes + replaces + inserts)

  def _two_edits_from_query(self, query):
    """All edits that are two edits away from `query`."""
    return (e2 for e1 in self._one_edits_from_query(query)
            for e2 in self._one_edits_from_query(e1))
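To see what _words produces, the sketch below runs the same get_dummies expression on the sample data from the question and cross-checks it against an alternative formulation based on str.split and explode. The hand-built DataFrame and the names words_by_dummies and words_by_explode are assumptions for illustration; explode requires pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "query": ["foo bar", "super foo", "bar of foos", "crazy foos"],
    "count": [1864, 73, 1629, 940],
})

# Expression from the answer: one indicator column per word, weighted by count.
words_by_dummies = df["query"].str.get_dummies(sep=" ").T.dot(df["count"])

# Alternative sketch: one row per (word, count) pair, then sum per word.
words_by_explode = (
    df.assign(word=df["query"].str.split())
      .explode("word")
      .groupby("word")["count"]
      .sum()
)

print(words_by_dummies.to_dict())
print(words_by_explode.to_dict())

# Relative frequency of a word, as in _probability; unknown words map to 0.0.
probabilities = words_by_dummies / words_by_dummies.sum()
print(probabilities.get("foo", 0.0), probabilities.get("fooo", 0.0))
```

The two formulations agree on this sample. They would differ for a query that repeats a word: the indicator columns count the word once per query, while the exploded rows count every occurrence.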
Votes: 0

Stack Overflow user

Posted on 2020-02-05 18:10:02

import pandas as pd
from spellchecker import SpellChecker
df = pd.Series(['Customir','Tast','Hlp'])
spell = SpellChecker(distance=1)
def Correct(x):
    return spell.correction(x)
df = df.apply(Correct)
df

0    customer
1        last
2        help
dtype: object
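The result for 'Tast' hints at how frequency drives the choice: several real words ('last', 'test', 'task') are one edit away, and presumably the most frequent candidate wins. The standard-library sketch below mimics that behaviour with Norvig-style one-edit candidates. The function names and the tiny frequency table are invented for illustration; they are not the data or internals of the spellchecker package.

```python
import string
from collections import Counter

LETTERS = string.ascii_lowercase

# Hypothetical word frequencies, invented for this example.
FREQUENCIES = Counter(
    {"last": 500, "test": 300, "task": 120, "help": 400, "customer": 250}
)


def one_edits(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    word = word.lower()
    if word in FREQUENCIES:
        return word
    candidates = [w for w in one_edits(word) if w in FREQUENCIES]
    return max(candidates, key=FREQUENCIES.get) if candidates else word


print([correct(w) for w in ["Customir", "Tast", "Hlp"]])
# ['customer', 'last', 'help']
```

With a frequency table built from the query counts in the question, the same ranking would favour the words users actually search for.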
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/46409475
