
Q: Using regex to search for HTML links near a keyword

Stack Overflow user

Asked 2012-01-23 09:05:22

Answers: 4 · Views: 360 · Followers: 0 · Votes: 4

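Suppose I am looking for the keyword "sales" and want to get the nearest link, e.g. "http://www.somewebsite.com", even when the file contains multiple links. I want the nearest link, not the first one. That means I need to search for the link that appears just before the keyword match.

This doesn't work...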

regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales sales

What is the best way to find the link closest to a keyword?

python

regex

negative-lookahead

4 Answers

Stack Overflow user

Accepted answer

Posted 2012-01-30 05:23:29

I tested this code and it seems to work...

import re

def closesturl(keyword, website):
    """Return, for each keyword match, the URL whose start offset is nearest to it."""
    urlregex = r"(http|https)://[-A-Za-z0-9./]+"
    # Record the (start, end) span of every URL and every keyword occurrence.
    urllist = [m.span() for m in re.finditer(urlregex, website, re.IGNORECASE)]
    keylist = [n.span() for n in re.finditer(keyword, website, re.IGNORECASE)]
    if not keylist or not urllist:
        return ""
    urls = []
    for keystart, _ in keylist:
        # Start from the first URL, then scan the remaining ones in document order.
        best = urllist[0]
        closest = abs(best[0] - keystart)
        for span in urllist[1:]:
            distance = abs(span[0] - keystart)
            if distance < closest:
                closest, best = distance, span
            elif distance > closest:
                break  # past the local minimum: later URLs are only farther away
        urls.append(website[best[0]:best[1]])
    return urls

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
keyword = "mykeyword"
print(closesturl(keyword, somestring))

When run, the above displays ... http://www.secondlink.com.

If anyone has ideas on how to speed this code up, that would be great!
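One possible speedup, offered as a sketch rather than a measured optimization: re.finditer yields URL matches from left to right, so the list of URL start offsets is already sorted, and the standard-library bisect module can find the neighbours of each keyword position without scanning the whole list. The name closest_urls_bisect is hypothetical, and distance is measured between start offsets, as in the code above.

```python
import bisect
import re

URL_REGEX = r"(http|https)://[-A-Za-z0-9./]+"

def closest_urls_bisect(keyword, text):
    """For each keyword match, return the URL whose start offset is nearest."""
    url_spans = [m.span() for m in re.finditer(URL_REGEX, text, re.IGNORECASE)]
    if not url_spans:
        return []
    starts = [s for s, _ in url_spans]  # sorted: finditer scans left to right
    results = []
    for kw in re.finditer(keyword, text, re.IGNORECASE):
        pos = bisect.bisect_left(starts, kw.start())
        # Only the URLs on either side of the insertion point can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(starts)]
        best = min(candidates, key=lambda i: abs(starts[i] - kw.start()))
        start, end = url_spans[best]
        results.append(text[start:end])
    return results
```

Each keyword lookup then costs O(log n) in the number of URLs instead of a linear scan; on a tie, the URL that comes earlier in the text wins.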

Thanks, V$H.

Votes: -1

Stack Overflow user

Posted 2012-01-23 10:43:34

In general, using an HTML parser is much simpler and more robust than using regular expressions.

Using the third-party module lxml:

import lxml.html as LH

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

doc = LH.fromstring(content)    
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)

yields

http://www.somewebsite.com

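I find lxml (and XPath) a convenient way to express which elements I'm looking for. However, if installing a third-party module is not an option, you can also do this particular job with HTMLParser from the standard library: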

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
import contextlib

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

idx = content.find('sales')

with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
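A possible variation on the parser above, sketched under assumptions: instead of slicing content at the index returned by find (which only covers the first occurrence and may cut through a tag), the parser can watch text nodes itself through handle_data and record the last link seen whenever the keyword appears. The class name KeywordLinkParser and its hits attribute are hypothetical, and the code targets Python 3's html.parser.

```python
from html.parser import HTMLParser

class KeywordLinkParser(HTMLParser):
    """Record the most recent href seen before each text node containing the keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.last_link = None
        self.hits = []  # one entry per text node that mentions the keyword

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is not None:
            self.last_link = href

    def handle_data(self, data):
        # Case-insensitive substring test on the text between tags.
        if self.keyword in data.lower():
            self.hits.append(self.last_link)

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

parser = KeywordLinkParser("sales")
parser.feed(content)
parser.close()
print(parser.hits)
```

For the sample content this collects a single hit, http://www.somewebsite.com. An entry is None when the keyword appears before any link.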

Regarding the XPath used in the lxml solution, it has the following meaning:

 //*                              # Find all elements
   [contains(text(),"sales")]     # whose text content contains "sales"
   /preceding::*                  # search the preceding elements 
     [starts-with(@href,"http")]  # such that it has an href attribute that starts with "http"
       [1]                        # select only the nearest such element (preceding counts backwards)
         /@href                   # return the value of the href attribute

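Votes: 3

Stack Overflow user

Posted 2012-01-23 10:21:32

I don't think you can accomplish this with a regular expression alone (in particular, looking backwards from the keyword match), because a regex has no notion of comparing distances.

I think you're better off doing it this way: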

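1. Find all occurrences of sales and get their substring indices; call this list salesIndex.

2. Find all occurrences of the URL pattern (http...) and get their substring indices; call this list urlIndex.

3. Loop through salesIndex. For each position i in salesIndex, find the closest entry in urlIndex.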

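Depending on how you want to judge "closest", you may need the start and end indices of both the sales and the http... matches for the comparison. That is, find the URL end index closest to the start index of the current sales occurrence, find the URL start index closest to the end index of the current sales occurrence, and take whichever is closer.

You can get the matches with matches = re.finditer(pattern, string, re.IGNORECASE), then call match.span() on each match in matches to get its start/end substring indices.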
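The steps above could be sketched roughly as follows. This is a minimal illustration under assumptions, not a definitive implementation: the URL pattern is borrowed from the question, the names span_gap and nearest_urls are hypothetical, and "closest" is taken to mean the smallest number of characters between the two spans, which covers both the end-to-start and start-to-end comparisons described above.

```python
import re

# Assumed URL pattern, adapted from the one in the question.
URL_PATTERN = r"https?://[-A-Za-z0-9./]+"

def span_gap(a, b):
    """Number of characters between two (start, end) spans; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0)

def nearest_urls(keyword, text):
    """Pair each keyword occurrence with the URL separated from it by the smallest gap."""
    url_matches = list(re.finditer(URL_PATTERN, text, re.IGNORECASE))
    pairs = []
    # re.escape treats the keyword as literal text rather than as a pattern.
    for hit in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        if not url_matches:
            pairs.append((hit.span(), None))
            continue
        best = min(url_matches, key=lambda m: span_gap(m.span(), hit.span()))
        pairs.append((hit.span(), best.group(0)))
    return pairs
```

Calling nearest_urls("sales", text) returns one (keyword span, URL) pair per occurrence of the keyword, so a URL that follows the keyword closely can win over one that precedes it at a greater distance.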

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/8966244
