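My goal is to scrape some auction IDs from a page on an auction site. The page is here
For the page I'm interested in, there are about 60 auction IDs. Each auction ID is preceded by a dash, consists of 10 digits, and ends just before .htm. For example, in the link below the ID is 1033346952
<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">
I have already extracted all the links by looking for the "a" tags. That code is at the bottom of this post.
With my limited knowledge, I think a regex is the right way to solve this. I was thinking of a regex like this:
-...........htm
However, I have not managed to integrate the regex into the code. I thought that
for links in soup.find_all('-...........htm'):
would do it, but obviously it doesn't.
How can I fix this code?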
import bs4
import requests
import re
res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for links in soup.find_all('-...........htm'):
    print(links.get('href'))
Answered 2016-02-20 08:37:03
Here is code that works:
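for links in soup.find_all(href=re.compile(r"auction-[0-9]{10}\.htm")):
    h = links.get('href')
    # the parentheses capture just the 10-digit ID; the dot is escaped so it matches a literal "."
    m = re.search(r"auction-([0-9]{10})\.htm", h)
    if m:
        print(m.group(1))
First, you need a regex to select the matching hrefs. Then you need a capturing regex to extract the ID from each one.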
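The capturing step can also be tried on its own, without the network or BeautifulSoup. The following is only a minimal sketch: the helper name `extract_auction_ids` and the sample hrefs are made up for illustration, and it assumes every ID is exactly 10 digits, as described in the question.

```python
import re

# Capturing pattern: the parentheses isolate the 10-digit ID, and the dot is escaped.
AUCTION_ID = re.compile(r"auction-([0-9]{10})\.htm")

def extract_auction_ids(hrefs):
    """Return the auction IDs found in hrefs, in order, without duplicates."""
    ids = []
    for href in hrefs:
        # links.get('href') can return None, so fall back to an empty string
        match = AUCTION_ID.search(href or "")
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids

# Hypothetical sample input: one auction link (twice), one unrelated link, one missing href
sample_hrefs = [
    "/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm",
    "/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm",
    "/browse/categorylistings.aspx?page=3",
    None,
]
print(extract_auction_ids(sample_hrefs))
```

In the loop above, the hrefs would come from `links.get('href')`; collecting them into a list first keeps the regex logic easy to test separately.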
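Answered 2016-02-20 08:29:48
You have to pass a regular expression object to find_all(); you are only passing a string that you intend as the regex pattern.
To learn and debug this kind of thing, it is useful to cache the data from the site until things work: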
import bs4
import requests
import re
import os
# avoid downloading on every run while experimenting
tmp_file = 'data.html'
if True and os.path.exists(tmp_file):  # switch True to False in production
    with open(tmp_file) as fp:
        data = fp.read()
else:
    res = requests.get('http://www.trademe.co.nz/browse/categorylistings.aspx?mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=2&sort_order=default&rptpath=5-380-50-7145-')
    res.raise_for_status()
    data = res.text
    with open(tmp_file, 'w') as fp:
        fp.write(data)
soup = bs4.BeautifulSoup(data, 'html.parser')
# and start experimenting with your regular expressions
regex = re.compile('...........htm')
for links in soup.find_all(regex):
    print(links.get('href'))
# the above doesn't find anything, you need to search the hrefs
print('try again')
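for links in soup.find_all(href=regex):
    print(links.get('href'))
Once you get some matches, you can refine the regex pattern with more sophisticated techniques, but that matters less than starting with the right "framework" for trying things quickly (without waiting for a download on every code change you test).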
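The caching idea can be wrapped in a small helper so the experiment code stays short. This is a sketch rather than part of the original answer: `load_cached` and its `fetch` parameter are hypothetical names, and the demo uses a fake fetch function and a temporary directory instead of a real download.

```python
import os
import tempfile

def load_cached(path, fetch):
    """Return the text stored at path; call fetch() and save its result if path is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fp:
            return fp.read()
    data = fetch()
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(data)
    return data

# Demo with a stand-in for the real download
def fake_fetch():
    print("downloading...")
    return '<a href="/x/auction-1033346952.htm">bike</a>'

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "data.html")
    first = load_cached(cache_path, fake_fetch)   # calls fake_fetch and writes the file
    second = load_cached(cache_path, fake_fetch)  # reads the file, no second "download"
    print(first == second)
```

With requests, the `fetch` argument could be something like `lambda: requests.get(url).text`, where `url` is the listing page address.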
Answered 2016-02-20 08:35:20
In Python, you can do:
import re
text = """<a href="/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm" class="tile-2">"""
p = re.compile(r'(?<=<a\shref=").*?(?=")')
re.findall(p, text)  ## ['/sports/cycling/mountain-bikes/full-suspension/auction-1033346952.htm']
Source: https://stackoverflow.com/questions/35520738