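I am trying to store the email addresses I collect from a set of links, and I have run into two problems. First, for some reason the variable email holds two copies of the same address. Second, the if statement detects that email has a value, but the value does not end up stored in the emails list. Thanks for your help!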
import re
import time

import requests

emails = []
comment = []
with open('comment.txt', 'r') as filehandle:
    for line in filehandle:
        currentPlace = line[:-1]
        comment.append(currentPlace)
print(emails)
i = 0
while i < len(comment):
    url = str(comment[i]) + '/about'
    print("Crawling URL %s" % url)
    response = requests.get(url)
    email = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I)
    print(email)
    if email:
        emails.append(email)
    email.clear()
    i += 1
    time.sleep(0.2)
print(emails)

Output:
[]
Crawling URL ...
['email@gmail.com', 'email@gmail.com']
Crawling URL ...
[]
Crawling URL ...
['email@yahoo.com', 'email@yahoo.com']
Crawling URL ...
[]
Crawling URL ...
[]
[[], []]

The old code produced the correct output:
emails = set()
print("Crawling URL %s" % starting_url)
response = requests.get(starting_url)
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)
print(emails)
# create a BeautifulSoup object for the HTML document
soup = BeautifulSoup(response.text, 'lxml')

Posted on 2020-11-27 23:43:19
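re.findall (https://docs.python.org/3/library/re.html#re.findall) returns a list of all matches of the regular expression. So the regex is finding 2 matches for your email pattern on each page.
You then call emails.append(email). But email is itself a list of addresses, so your emails list ends up looking like [["1@1.com","1@1.com"], ["2@2.com","2@2.com"], ... ]. The following email.clear() then empties those same inner list objects, which is why the final print shows [[], []].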
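A minimal sketch of both effects and one possible fix, not necessarily how you will want to structure your crawler. The names collect_emails and sample_pages are made up for illustration, and the sample strings stand in for response.text so the snippet runs without any network access. The first part shows that appending a list and then clearing it leaves an empty inner list behind; the second part appends each address string individually and skips repeats.

```python
import re

# Same pattern as in the question, compiled once for reuse.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.I)

# Part 1: why the question's final print shows empty inner lists.
nested = []
found = ["email@gmail.com", "email@gmail.com"]
nested.append(found)   # stores a reference to the same list object
found.clear()          # empties that object, so nested[0] is empty too
print(nested)


# Part 2: a flat list without repeats (hypothetical helper).
def collect_emails(pages):
    """Return a flat list of unique addresses found in the given page texts."""
    emails = []
    for text in pages:
        for address in EMAIL_RE.findall(text):
            if address not in emails:   # skip repeats, keep first-seen order
                emails.append(address)  # append the string, not the list
    return emails


# Hypothetical page bodies standing in for response.text.
sample_pages = [
    '<a href="mailto:email@gmail.com">email@gmail.com</a>',
    "<p>no address here</p>",
    '<a href="mailto:email@yahoo.com">email@yahoo.com</a>',
]
result = collect_emails(sample_pages)
print(result)
```

The duplicates in your output most likely come from the address appearing twice in the page source (for example once in a mailto: link and once as visible text), which is an assumption on my part since the pages are not shown. If order does not matter, the set-based approach from your old code works as well: emails.update(re.findall(...)).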
https://stackoverflow.com/questions/65039973