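I have isolated a line of HTML from BeautifulSoup that I want to run a regex on, but I keep getting AttributeError: 'NoneType' object has no attribute 'groups'
I read another Stack Overflow question (using regex on beautiful soup tags), but I can't see what I need to do to fix my version of the problem.
Here is the relevant part of the code (url is provided). Updated with Rob's corrected regex, which still throws the attribute error: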
soup = BeautifulSoup(urlopen(url).read()).find("div",{"id":"page"})
addy = soup.find("p","addy").em.encode_contents()
extracted_entities = re.match(r'\$([\d.]+)\. ([^,]+), ([\d-]+)', addy)
extracted_entities.groups()
price = extracted_entities[0]
location = extracted_entities[1]
phone = extracted_entities[2]

addy seems to be what I want, and prints:
$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a>
$9. 800 W. Randolph St., 312-929-4580, <a href="http://aucheval.tumblr.com/">aucheval.tumblr.com</a>
$9.50. 445 N. Clark St., 312-334-3688, <a href="http://www.rickbayless.com/">rickbayless.com</a>

and so on, when I print it out.
What is going on here? Thanks in advance, everyone.
Posted on 2015-05-20 00:18:07
The problem seems to be a stray element in your RegEx pattern that I don't see in your sample output.

match = re.match(r'\$([\d.]+)\. ([^,]+), ([\d-]+)', addy)
if match:
    extracted_entities = match.groups()
else:
    raise Exception("RegEx didn't match '%s'" % addy)

should work:
>>> f = """$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a>
... $9. 800 W. Randolph St., 312-929-4580, <a href="http://aucheval.tumblr.com/">aucheval.tumblr.com</a>
... $9.50. 445 N. Clark St., 312-334-3688, <a href="http://www.rickbayless.com/">rickbayless.com</a>"""
>>> l = f.splitlines()
>>> for i in l:
...     r = re.match(r'\$([\d.]+)\. ([^,]+), ([\d-]+)', i)
...     if r:
...         print "GOT IT", r.groups()
...     else:
...         print "NO GOT IT", i
...
GOT IT ('10', '2109 W. Chicago Ave.', '773-772-0406')
GOT IT ('9', '800 W. Randolph St.', '312-929-4580')
GOT IT ('9.50', '445 N. Clark St.', '312-334-3688')

https://stackoverflow.com/questions/30338123
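The guard shown in the answer can also be wrapped in a small helper, so that a failed match gives None instead of an AttributeError. This is only a minimal sketch in Python 3 syntax: the names parse_entry and ENTRY_RE are hypothetical and do not appear in the thread, and the sample string is one of the lines printed in the question.

```python
import re

# Pattern taken from the thread: price, street address, phone number.
ENTRY_RE = re.compile(r'\$([\d.]+)\. ([^,]+), ([\d-]+)')


def parse_entry(text):
    """Return (price, location, phone), or None if the text does not match.

    parse_entry is a hypothetical helper name used for illustration only.
    """
    match = ENTRY_RE.match(text)  # re.match only matches at the start of text
    if match is None:
        return None
    return match.groups()


sample = ('$9.50. 445 N. Clark St., 312-334-3688, '
          '<a href="http://www.rickbayless.com/">rickbayless.com</a>')

result = parse_entry(sample)
if result is not None:
    # Unpack the captured groups rather than indexing the match object.
    price, location, phone = result
    print(price, location, phone)
```

Because re.match anchors at the beginning of the string, any text before the dollar sign (a leading newline, for example) makes the helper return None, which is easier to inspect than an exception raised later on .groups().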