
Fastest way to replace spaces with underscores for a list of terms in a text

Stack Overflow user
Asked 2016-01-16 14:35:33
2 answers · 2.2K views · 0 followers · 2 votes

Given 10,000,000,000 lines of around 20-50 words each, e.g.:

Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .
However , others argue that while anti-statism is central , it is inadequate to define anarchism .
Therefore , they argue instead that anarchism entails opposing authority or hierarchical organization in the conduct of human relations , including , but not limited to , the state system .
Proponents of anarchism , known as " anarchists " , advocate stateless societies based on non - hierarchical free association s. As a subtle and anti-dogmatic philosophy , anarchism draws on many currents of thought and strategy .
Anarchism does not offer a fixed body of doctrine from a single particular world view , instead fluxing and flowing as a philosophy .
There are many types and traditions of anarchism , not all of which are mutually exclusive .
Anarchist schools of thought can differ fundamentally , supporting anything from extreme individualism to complete collectivism .
Strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications .
Anarchism is often considered a radical left-wing ideology , and much of anarchist economics and anarchist legal philosophy reflect anti-authoritarian interpretations of communism , collectivism , syndicalism , mutualism , or participatory economics .
Anarchism as a mass social movement has regularly endured fluctuations in popularity .
The central tendency of anarchism as a social movement has been represented by anarcho-communism and anarcho-syndicalism , with individualist anarchism being primarily a literary phenomenon which nevertheless did have an impact on the bigger currents and individualists have also participated in large anarchist organizations .
Many anarchists oppose all forms of aggression , supporting self-defense or non-violence ( anarcho-pacifism ) , while others have supported the use of some coercive measures , including violent revolution and propaganda of the deed , on the path to an anarchist society .
Etymology and terminology The term derives from the ancient Greek ἄναρχος , anarchos , meaning " without rulers " , from the prefix ἀν - ( an - , " without " ) + ἀρχός ( arkhos , " leader " , from ἀρχή arkhē , " authority , sovereignty , realm , magistracy " ) + - ισμός ( - ismos , from the suffix - ιζειν , - izein " - izing " ) . "
Anarchists " was the term adopted by Maximilien de Robespierre to attack those on the left whom he had used for his own ends during the French Revolution but was determined to get rid of , though among these " anarchists " there were few who exhibited the social revolt characteristics of later anarchists .
There would be many revolutionaries of the early nineteenth century who contributed to the anarchist doctrines of the next generation , such as William Godwin and Wilhelm Weitling , but they did not use the word " anarchist " or " anarchism " in describing themselves or their beliefs .
Pierre-Joseph Proudhon was the first political philosopher to call himself an anarchist , making the formal birth of anarchism the mid-nineteenth century .
Since the 1890s from France , the term " libertarianism " has often been used as a synonym for anarchism and was used almost exclusively in this sense until the 1950s in the United States ; its use as a synonym is still common outside the United States .
On the other hand , some use " libertarianism " to refer to individualistic free-market philosophy only , referring to free-market anarchism as " libertarian anarchism " .

Let's say I have a list of dictionary terms, each made up of one or more words, e.g.:

clinical anatomy
clinical psychology
cognitive neuroscience
cognitive psychology
cognitive science
comparative anatomy
comparative psychology
compound morphology
computational linguistics
correlation
cosmetic dentistry
cosmography
cosmology
craniology
craniometry
criminology
cryobiology
cryogenics
cryonics
cryptanalysis
crystallography
curvilinear correlation
cybernetics
cytogenetics
cytology
deixis
demography
dental anatomy
dental surgery
dentistry
philosophy
political philosophy

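I need to find every sentence that contains one of these terms and replace the spaces between the words of the term with underscores.

For example, the text contains this sentence: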

Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .

The sentence contains the dictionary term political philosophy, so the output for this sentence should be:

Anarchism is often defined as a political_philosophy which holds the state to be undesirable , unnecessary , or harmful .

I could do this:

dictionary = sorted(dictionary, key=len, reverse=True)  # replace the longest terms first
for line in text:
    for term in dictionary:
        if term in line:
            line = line.replace(term, term.replace(' ', '_'))

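Assuming I have 10,000 dictionary terms (D) and 10,000,000,000 sentences (S), the complexity of this simple approach would be O(D*S), right? Is there a faster, less brute-force way to achieve the same result?

Is there a way to replace all the terms in a line with their underscored forms in one go? That would help avoid the inner loop.

Would it be more efficient to first index the text with something like whoosh, then query the index and replace the terms? I would still need something like O(1*S) to do the replacement, right?

The solution does not have to be in Python. Even a Unix command trick with grep/sed/awk would do, as long as it is subprocess.Popen-able.

Please correct me if my complexity assumptions are wrong, and forgive my inexperience.

Given a sentence:

This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social political philosophy with political philosophy under the branch of philosophy and some computational linguistics where the cognitive linguistics and psycho cognitive linguistics appears with linguistics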

And let's say I have the dictionary:

cognitive linguistics
psycho cognitive linguistics
socio political philosophy
political philosophy
computational linguistics
linguistics
philosophy
social political philosophy 

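The output should look like this:

This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social_political_philosophy with political_philosophy under the branch of philosophy and some computational_linguistics where the cognitive_linguistics and psycho_cognitive_linguistics appears with linguistics

The aim is to do this for a text file of 10 billion lines with a dictionary of 10-100k phrases.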

2 Answers

Stack Overflow user

Accepted answer

Answered 2016-01-16 17:07:40

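Instead of checking every item in the dictionary against every line, you would be better off splitting each line into words and mapping the first word of each phrase to the full phrases that start with it. If you want the longest phrase to win, you then only need to sort the phrases that can appear in the line by length: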

from collections import defaultdict
from itertools import chain


def get_phrases(fle):
    # Map the first word of each phrase to every phrase that starts with it.
    phrase_dict = defaultdict(list)
    with open(fle) as ph:
        for line in map(str.rstrip, ph):
            k = line.partition(" ")[0]
            phrase_dict[k].append(line)
    return phrase_dict


def replace(fle, dct):
    with open(fle) as f:
        for line in f:
            # Only phrases whose first word occurs in this line are candidates;
            # the longest are replaced first so they win over shorter phrases they contain.
            phrases = sorted(chain.from_iterable(dct[word] for word in set(line.split())
                                                 if word in dct), reverse=True, key=len)
            for phr in phrases:
                line = line.replace(phr, phr.replace(" ", "_"))
            yield line

Output:

In [10]: cat out.txt
This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social political philosophy with political philosophy under the branch of philosophy and some computational linguistics where the cognitive linguistics and psycho cognitive linguistics appears with linguistics
In [11]: cat phrases.txt
cognitive linguistics
psycho cognitive linguistics
socio political philosophy
political philosophy
computational linguistics
linguistics
philosophy
social political philosophy
In [12]: list(replace("out.txt",get_phrases("phrases.txt")))
Out[12]: ['This is a sentence that contains multiple phrases that I need to replace with phrases with underscores, e.g. social_political_philosophy with political_philosophy under the branch of philosophy and some computational_linguistics where the cognitive_linguistics and psycho_cognitive_linguistics appears with linguistics']
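Because `replace` is a generator, the rewritten lines can be written out one at a time instead of being collected with `list(...)`, which matters at 10 billion lines. The following is only a minimal sketch of that idea on in-memory streams; `build_index` and `underscore_stream` are hypothetical names, not part of the answer, and the `io.StringIO` objects stand in for real input and output files.

```python
import io
from collections import defaultdict


def build_index(phrases):
    """Map the first word of each phrase to the phrases starting with it."""
    index = defaultdict(list)
    for phrase in phrases:
        phrase = phrase.strip()
        if phrase:
            index[phrase.split(" ", 1)[0]].append(phrase)
    return index


def underscore_stream(src, dst, index):
    """Read lines from src and write the rewritten lines to dst one by one."""
    for line in src:
        # Candidate phrases: those whose first word occurs in this line.
        candidates = {p for w in set(line.split()) if w in index for p in index[w]}
        # Longest first, so a long phrase is rewritten before its sub-phrases.
        for phrase in sorted(candidates, key=len, reverse=True):
            line = line.replace(phrase, phrase.replace(" ", "_"))
        dst.write(line)


index = build_index(["political philosophy", "social political philosophy", "philosophy"])
src = io.StringIO("a political philosophy and social political philosophy\nno terms here\n")
dst = io.StringIO()
underscore_stream(src, dst, index)
result = dst.getvalue()
```

With real files, `src` and `dst` would be the objects returned by `open(...)`, so memory use stays at roughly one line at a time.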

A couple of other versions:

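import re
from itertools import chain


def repl(match):
    # re.sub callback: return the matched phrase with spaces turned into underscores.
    return match.group().replace(" ", "_")


def candidates(line, dct):
    # Phrases whose first word occurs in the line, longest first so that the
    # alternation tries longer phrases before shorter ones.
    found = chain.from_iterable(dct[word] for word in set(line.split()) if word in dct)
    return tuple(sorted(found, key=lambda p: (-len(p), p)))


def replace_re(fle, dct):
    with open(fle) as f:
        for line in f:
            phrases = candidates(line, dct)
            if phrases:  # an empty pattern would match at every position
                line = re.sub("|".join(map(re.escape, phrases)), repl, line)
            yield line


def replace_re2(fle, dct):
    cached = {}  # compiled patterns, keyed by the tuple of candidate phrases
    with open(fle) as f:
        for line in f:
            phrases = candidates(line, dct)
            if phrases:
                if phrases not in cached:
                    cached[phrases] = re.compile("|".join(map(re.escape, phrases)))
                line = cached[phrases].sub(repl, line)
            yield line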
1 vote

Stack Overflow user

Answered 2016-01-16 18:56:00

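I would build a regex from your dictionary to match the data.

Then, on the replace side, use a callback to replace the spaces with _.

I estimate this could do the entire job in under 3 hours.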
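In Python, the regex-plus-callback idea might look like the sketch below. It uses a plain alternation of escaped terms rather than the generated, prefix-compressed regex this answer goes on to show, and `build_pattern` and `underscore` are hypothetical names chosen for illustration.

```python
import re


def build_pattern(terms):
    """Compile one alternation over all terms, wrapped in word boundaries."""
    # Longest first: among alternatives that start at the same position,
    # the regex engine takes the first one that matches.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")


def underscore(match):
    """Callback for re.sub: rewrite the matched term with underscores."""
    return match.group().replace(" ", "_")


terms = [
    "philosophy",
    "political philosophy",
    "social political philosophy",
    "computational linguistics",
    "linguistics",
]
pattern = build_pattern(terms)  # compiled once, reused for every line
line = "social political philosophy meets computational linguistics and linguistics"
result = pattern.sub(underscore, line)
```

Each line is scanned once by a single compiled pattern, so the inner loop over dictionary terms disappears from the Python code; how well a very large alternation performs depends on the regex engine.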

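Fortunately, there is a Ternary Tool (Dictionary) regex generator.

To generate the regex and what is shown below, you will need the trial version of RegexFormat 7.

Some links:

Screenshot of the tool

TernaryTool (Dictionary) - text version, dictionary samples

A 175,000-word dictionary regex

You basically generate your own dictionary by dropping in the strings you want to find and then pressing the Generate button.

Then all you have to do is read the input in 5 MB chunks, do a find/replace using the regex, and append the result to a new file. Rinse and repeat.

Pretty simple, really.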
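The chunked read, replace, append loop could be sketched as follows. This is an assumption-laden illustration, not the answerer's code: `process_in_chunks` is a hypothetical helper, the tiny pattern stands in for the generated regex, and the sketch assumes no phrase spans a line break, so each chunk is extended to the end of its last line.

```python
import io
import re


def process_in_chunks(src, dst, pattern, chunk_size=5 * 1024 * 1024):
    """Rewrite src into dst chunk by chunk, cutting chunks only at line ends."""
    def underscore(match):
        return match.group().replace(" ", "_")

    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Finish the current line so that no phrase is split across two chunks.
        chunk += src.readline()
        dst.write(pattern.sub(underscore, chunk))


# Small stand-in pattern and a tiny chunk size, just to exercise the loop.
pattern = re.compile(r"\b(?:political philosophy|dental surgery)\b")
src = io.StringIO("a political philosophy text\nabout dental surgery\n")
dst = io.StringIO()
process_in_chunks(src, dst, pattern, chunk_size=10)
result = dst.getvalue()
```

For text-mode files, `read(chunk_size)` counts characters rather than bytes, so the 5 MB figure is approximate here.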

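Based on your sample (above), this is an estimate of the time it would take to complete 10 billion lines.

The analysis is based on a benchmark that ran the generated regex (below) on the sample input.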

19 lines  (@ 3600 chars)

Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   5
Elapsed Time:    4.03 s,   4034.28 ms,   4034278 µs

////////////////////////////
3606 chars
x 50,000
------------
180,300,000  (chars)

or 

20 lines
x 50,000
------------
1,000,000  (lines)
=========================
10,000,000,000 lines
/
1,000,000  (lines) per 4 seconds
-----------------------------------------
40,000 seconds
/
3600 secs per hour
-------------------------
11 hours
////////////////////////////
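The same estimate, written out as arithmetic. The variable names are illustrative, and the 4.03 s measurement is rounded to 4 s as in the figures above.

```python
# Benchmark: 50 x 1000 iterations over a sample counted as 20 lines.
lines_per_run = 20 * 50_000       # 1,000,000 lines per benchmark run
seconds_per_run = 4               # 4.03 s measured, rounded as in the answer
total_lines = 10_000_000_000

total_seconds = total_lines / lines_per_run * seconds_per_run
hours = total_seconds / 3600
print(f"{total_seconds:,.0f} seconds, about {hours:.1f} hours")
```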

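However, if you read and process 5-megabyte chunks (each as a single string), it will reduce the engine overhead and bring the time down to 1-3 hours.

This is the regex generated for your sample dictionary (compressed):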

\b(?:c(?:linical[ ](?:anatomy|psychology)|o(?:gnitive[ ](?:neuroscience|psychology|science)|mp(?:arative[ ](?:anatomy|psychology)|ound[ ]morphology|utational[ ]linguistics)|rrelation|sm(?:etic[ ]dentistry|o(?:graphy|logy)))|r(?:anio(?:logy|metry)|iminology|y(?:o(?:biology|genics|nics)|ptanalysis|stallography))|urvilinear[ ]correlation|y(?:bernetics|to(?:genetics|logy)))|de(?:ixis|mography|nt(?:al[ ](?:anatomy|surgery)|istry))|p(?:hilosophy|olitical[ ]philosophy))\b

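(Note that the space separator is generated as [ ] for each space. If you want to change it to a quantified class, just run a find for (?:\[ \])+ and replace it with whatever you want, for example \s+ or [ ]+.)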
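That find-and-replace on the regex text can itself be done with a regex. A small sketch, where `generated` is a shortened excerpt in the style of the regex above rather than the full pattern:

```python
import re

# Shortened excerpt of a generated pattern; each space is written as [ ].
generated = r"\bc(?:linical[ ](?:anatomy|psychology)|urvilinear[ ]correlation)\b"

# Replace every run of "[ ]" with \s+ so any whitespace run is accepted.
# A function is used as the replacement so the backslash is inserted literally.
relaxed = re.sub(r"(?:\[ \])+", lambda m: r"\s+", generated)
```

After this, `relaxed` also matches terms whose words are separated by several spaces or by tabs.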

Here it is formatted:

 \b 
 (?:
      c
      (?:
           linical [ ] 
           (?: anatomy | psychology )
        |  o
           (?:
                gnitive [ ] 
                (?: neuroscience | psychology | science )
             |  mp
                (?:
                     arative [ ] 
                     (?: anatomy | psychology )
                  |  ound [ ] morphology
                  |  utational [ ] linguistics
                )
             |  rrelation
             |  sm
                (?:
                     etic [ ] dentistry
                  |  o
                     (?: graphy | logy )
                )
           )
        |  r
           (?:
                anio
                (?: logy | metry )
             |  iminology
             |  y
                (?:
                     o
                     (?: biology | genics | nics )
                  |  ptanalysis
                  |  stallography
                )
           )
        |  urvilinear [ ] correlation
        |  y
           (?:
                bernetics
             |  to
                (?: genetics | logy )
           )
      )
   |  de
      (?:
           ixis
        |  mography
        |  nt
           (?:
                al [ ] 
                (?: anatomy | surgery )
             |  istry
           )
      )
   |  p
      (?: hilosophy | olitical [ ] philosophy )
 )
 \b 

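Adding 10,000 phrases is really simple, and the regex is no bigger than the number of bytes in the phrases plus a little overhead for interleaving the regex.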
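For readers without the tool, a prefix-compressed regex of this general shape can be produced with a small character trie. The sketch below is a swapped-in technique, not RegexFormat's algorithm; `build_trie` and `trie_to_regex` are hypothetical helpers, and the output will not be character-for-character identical to the generated regex above.

```python
import re


def build_trie(phrases):
    """Build a character trie; the empty-string key marks the end of a phrase."""
    trie = {}
    for phrase in phrases:
        node = trie
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie


def trie_to_regex(node):
    """Turn a trie node into a regex fragment that shares common prefixes."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(k for k in node if k)]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a phrase ends here but longer ones continue, the rest is optional.
    return group + "?" if ends_here else group


phrases = ["philosophy", "political philosophy", "dentistry", "dental surgery"]
pattern = re.compile(r"\b" + trie_to_regex(build_trie(phrases)) + r"\b")
text = "political philosophy and dental surgery"
result = pattern.sub(lambda m: m.group().replace(" ", "_"), text)
```

Because shared prefixes are stored once, the pattern grows roughly with the total length of the phrases, which is in line with the size claim above.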

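A final note. You can cut the time even further by generating the regex from the phrases only, that is, from the entries made up of words separated by horizontal whitespace.

And be sure to precompile the regex. You only need to do that once.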

1 vote

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/34828174
