
Extracting surrounding words from a string position in Python

Stack Overflow user
Asked on 2015-05-07 23:55:07
2 answers · 2.7K views · 0 followers · 3 votes

Suppose I have a string:

string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""

I have the positions of a phrase in this string, for example:

>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]

For each position, I need to extract a few words before it and a few words after it. How can I do this with Python and regular expressions?

For example:

def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r

>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p>""", 
    "name": "directory", 
    "decendent": [
         {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""", 
            "name": "subdirectory", 
            "decendent": None
        }, 
        {
            "content": """It tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""", 
            "name": "subdirectory_two", 
            "decendent": [
                {
                    "content": "Name 4", 
                    "name": "subsubdirectory", 
                    "decendent": None
                }
            ]
        }
    ]
}

So that:

>>> look_through(dict, "tells you")
[
    { "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
    { "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]

Thanks!


2 Answers

Stack Overflow user

Answered on 2015-05-08 01:11:06

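It sounds like you want a "concordance" of your regexp hits: say, two words before and two words after the position of each match. The simplest way is to split your string at that position and anchor a search to the ends of the two pieces. For example, to get the two words before and after index 263 (your first m.start()), you could do: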

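import re

# `text` is the string being searched; 263 is the first m.start() from the question
m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:263])
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[263:])
print(text[m_left.start():263 + m_right.end()])  # m_right.end() is relative to text[263:]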

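The first expression is best read backwards, from the end of the string: it is anchored at the end ($), it skips a partial word if the split falls mid-word (\S*), it skips some whitespace (\s+), and then it matches up to two ({,2}) whitespace-plus-word sequences \s+\S+. The count is "at most 2" rather than exactly 2 because we still want a shorter match when we run into the beginning of the string.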

The second regexp does the same thing in the opposite direction.

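For a concordance, you probably want the right-hand context to start right after the end of the regexp match rather than at its beginning. In that case, use m.end() as the start of the second string.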

I think it is fairly obvious how to use this with a list of regexp matches.
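As a rough sketch of how that could look for a list of matches, assuming the two patterns above are used unchanged apart from a configurable word count: the helper name `concordance`, its parameter `n`, and the sample sentence are made up for illustration. The right-hand context starts at m.end(), as suggested above.

```python
import re

def concordance(pattern, text, n=2):
    """Return each match of `pattern` with up to `n` words of context on each side."""
    # Same patterns as in the answer, with the word count made configurable.
    left_re = re.compile(r"(?:\s+\S+){0,%d}\s+\S*$" % n)
    right_re = re.compile(r"^\S*\s+(?:\S+\s+){0,%d}" % n)
    hits = []
    for m in re.finditer(pattern, text):
        left = left_re.search(text[:m.start()])   # context before the match
        right = right_re.search(text[m.end():])   # context after the match
        # Fall back to the match boundaries when a side has no whitespace at all.
        start = left.start() if left else m.start()
        end = m.end() + right.end() if right else m.end()
        # Collapse runs of whitespace so each hit prints on one line.
        hits.append(" ".join(text[start:end].split()))
    return hits

sample = ("In short, this article tells you how to write HTML. "
          "A companion article tells you how to use markup.")
print(concordance("tells you", sample))
# ['this article tells you how to', 'companion article tells you how to']
```

Because both patterns require whitespace next to every word they pick up, a word sitting at the very start or very end of the text with no whitespace beside it is left out of the context.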

Votes: 1

Stack Overflow user

Answered on 2015-05-08 00:22:35

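I originally suggested using the word-boundary metacharacters, but that is not quite right: they do not consume any of the string, and \B does not really match what I wanted.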

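Instead, I suggest using the underlying definition of a word boundary, that is, the boundary between \w and \W. On either side of the substring you are searching for, look for one or more word characters (\w) together with one or more non-word characters (\W), in the appropriate order, repeated as many times as you want.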

For example: (?:\w+\W+){,3}some string(?:\W+\w+){,3}

This looks for up to three words before "some string" and up to three words after "some string".
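A minimal sketch of this pattern in Python, under a few assumptions: the helper name `words_around` and the sample sentence are invented for illustration, and the search phrase is passed through re.escape so that it is matched literally.

```python
import re

def words_around(phrase, text, n=3):
    """Return each occurrence of `phrase` with up to `n` words before and after it."""
    # (?:\w+\W+){0,n} takes up to n words before the phrase,
    # (?:\W+\w+){0,n} takes up to n words after it.
    pattern = r"(?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d}" % (n, re.escape(phrase), n)
    return [m.group(0) for m in re.finditer(pattern, text)]

sample = "In short, this article tells you how to write HTML."
print(words_around("tells you", sample))
# ['short, this article tells you how to write']
```

Since re.finditer returns non-overlapping matches, two occurrences that sit within a few words of each other may come back as a single hit, or the second one may be missed. Note also that \w and \W split HTML tags into separate "words", so this probably works best on text whose markup has already been stripped.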

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/30106082
