
Q: Extracting surrounding words in Python from a string position

Stack Overflow user

Asked 2015-05-07 23:55:07

2 answers · 2.7K views · 0 followers · 3 votes

Suppose I have a string:

string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""

I have the positions of a phrase in this string, for example:

>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]

I need to extract a few words before and a few words after each position. How can I do this with Python and regular expressions?

For example:

def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r

>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p>""", 
    "name": "directory", 
    "decendent": [
         {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""", 
            "name": "subdirectory", 
            "decendent": None
        }, 
        {
            "content": """It tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""", 
            "name": "subdirectory_two", 
            "decendent": [
                {
                    "content": "Name 4", 
                    "name": "subsubdirectory", 
                    "decendent": None
                }
            ]
        }
    ]
}

So that:

>>> look_through(dict, "tells you")
[
    { "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
    { "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]

Thanks!

python

regex

string

2 Answers

Stack Overflow user

Answered 2015-05-08 01:11:06

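It sounds like you want a "concordance" of your regexp hits: say, two words before and two words after the position where the regexp matched. The simplest way is to break your string at that position and anchor your searches to the ends of the two pieces. For example, to get the two words before and after index 263 (your first m.start()), you could do: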

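import re

pos = 263  # your first m.start(); text is the string being searched
m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:pos])
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[pos:])
print(text[m_left.start():pos + m_right.end()])  # m_right is relative to pos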

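The first expression should be read backwards from the end of the string: it anchors at the end ($), possibly skips a partial word if the match ended mid-word (\S*), skips some whitespace (\s+), and then matches at most two ({,2}) space-plus-word sequences (\s+\S+). It is "at most two" rather than "exactly two" because if we reach the beginning of the string, we want to return a shorter match.

The second regexp does the same thing in the opposite direction.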
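To illustrate the "at most two" behaviour, here is a minimal sketch that wraps the two searches in a helper. The name `context_around`, the parameter `n`, and the `sample` sentence are assumptions made for this illustration and are not part of the original answer.

```python
import re

def context_around(text, pos, n=2):
    """Return up to n whitespace-delimited words on each side of index pos."""
    # Left side: read backwards from the end of text[:pos].
    m_left = re.search(r"(?:\s+\S+){,%d}\s+\S*$" % n, text[:pos])
    # Right side: the same idea, reading forwards from pos.
    m_right = re.search(r"^\S*\s+(?:\S+\s+){,%d}" % n, text[pos:])
    # Fallbacks (an assumption of this sketch): no left context if there is
    # no whitespace before pos, and the rest of the text if none follows.
    start = m_left.start() if m_left else pos
    end = pos + m_right.end() if m_right else len(text)
    return text[start:end]

sample = "This article tells you how to write HTML"  # made-up example text
pos = sample.index("tells you")
print(repr(context_around(sample, pos)))
# prints ' article tells you how '
```

Only one word comes back on the left here: the left pattern requires whitespace in front of every word it collects, so a word sitting at the very start of the string ("This") is not picked up. Because the right-hand search starts at `pos`, the first word of the hit itself also counts toward the words on the right.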

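For a concordance, you will probably want to start reading immediately after the end of the regexp match rather than from its beginning. In that case, use m.end() as the start of the second string.

I think it is fairly obvious how to use this with a list of regexp matches.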
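One possible way to apply this to every match is sketched below. It assumes the same two patterns, starts the right-hand search at m.end() as suggested above, and strips surrounding whitespace from each result. The function name `concordance`, the constants `LEFT` and `RIGHT`, and the sample text are hypothetical names chosen for this example.

```python
import re

# The two anchored patterns from the answer, with the word count left open.
LEFT = r"(?:\s+\S+){,%d}\s+\S*$"
RIGHT = r"^\S*\s+(?:\S+\s+){,%d}"

def concordance(text, pattern, n=2):
    """Return each hit of pattern with up to n words of context per side."""
    hits = []
    for m in re.finditer(pattern, text):
        m_left = re.search(LEFT % n, text[:m.start()])
        # Start the right-hand search at m.end(), i.e. just after the hit.
        m_right = re.search(RIGHT % n, text[m.end():])
        # If a side has no whitespace at all, return no context for it.
        start = m_left.start() if m_left else m.start()
        end = m.end() + m_right.end() if m_right else m.end()
        hits.append(text[start:end].strip())
    return hits

text = ("This article tells you how to write HTML. "
        "A companion article tells you how to use markup.")
print(concordance(text, "tells you"))
# prints ['article tells you how to', 'companion article tells you how to']
```

If the input is HTML, as in the question, it is probably easier to strip the tags first (for instance with the BeautifulSoup call the question already uses), since these patterns treat anything between whitespace as a word.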

Votes: 1

Stack Overflow user

Answered 2015-05-08 00:22:35

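I originally suggested using the word-boundary metacharacters, but that is not quite right, since they do not consume any of the string, and \B does not really match what I wanted it to.

Instead, I suggest using the underlying definition of a word boundary, i.e. the boundary between \w and \W. On either side of the substring you are searching for, look for one or more word characters (\w) together with one or more non-word characters (\W), in the appropriate order, repeated as many times as you want words.

For example: (?:\w+\W+){,3}some string(?:\W+\w+){,3}

This will look for up to three words before "some string" and up to three words after "some string".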
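As a rough illustration of this pattern, the sketch below builds it for an arbitrary phrase and runs it on a plain-text sentence. The helper name `words_around`, the parameter `n`, and the sample text are assumptions for this example. The phrase is passed through re.escape so that it is matched literally.

```python
import re

def words_around(text, phrase, n=3):
    """Find phrase together with up to n words before and after it."""
    # Up to n "word + separator" runs, the literal phrase, then up to n
    # "separator + word" runs, following the pattern given above.
    pattern = r"(?:\w+\W+){,%d}%s(?:\W+\w+){,%d}" % (n, re.escape(phrase), n)
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "This article tells you how to write HTML where text is mixed"
print(words_around(text, "tells you"))
# prints ['This article tells you how to write']
```

Only two words precede the phrase in this sample, so the left side returns a shorter context, while the right side returns the full three words. Since re.finditer yields non-overlapping matches, hits that lie within a few words of each other may end up sharing a single result.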

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/30106082
