Question: Web scraper

Stack Overflow user

Asked 2019-05-29 12:35:23

Answers: 1, Views: 358, Followers: 0, Votes: 0

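The script should find the addresses of the subpages that contain articles and collect the required data from them. The data should be gathered by processing the HTML documents and then stored in a database.

Specifically, it should:

1. Find the 10 most common words and their counts.
2. Find the 10 most common words for each author and their counts.
3. List the authors' names.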
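A minimal sketch of how requirements 1 and 2 could be computed once each article's text and author name are available. The function names `tokenize` and `top_words` are hypothetical, and the tokenizer assumes English text made of ASCII letters; it is an illustration under those assumptions, not the asker's implementation.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its words (ASCII letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def top_words(articles, n=10):
    """Count words overall and per author.

    articles is an iterable of (author, text) pairs.
    Returns (overall_top, per_author_top); each entry is a list of
    (word, count) pairs as produced by Counter.most_common.
    """
    overall = Counter()
    per_author = defaultdict(Counter)
    for author, text in articles:
        words = tokenize(text)
        overall.update(words)             # requirement 1
        per_author[author].update(words)  # requirement 2
    overall_top = overall.most_common(n)
    per_author_top = {a: c.most_common(n) for a, c in per_author.items()}
    return overall_top, per_author_top
```

The keys of the per-author dictionary also give the list of author names for requirement 3.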

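I haven't finished the word counter yet, but for now I have two loops (2.1, 2.2) that are supposed to go into each article and get its content and the author's name.

I get this error: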

UserWarning: "link/" looks like a URL. Beautiful Soup is not an HTTP client. 
You should probably use an HTTP client like requests to get the 
document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup

Here is my script:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])

    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

    all_links = [item for i in all_links for item in i]

    d = webdriver.Chrome()

    for article in all_links:
        d.get(article)
        soup = bs(d.page_source, 'lxml')
        [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()

        try:
            print(soup.select_one('.post-title').text)
        except:
            print(article)
            print(soup.select_one('h1').text)
            break
    d = webdriver.Chrome()

    for article_links in all_links:
        d.get(article)
        soup = bs(d.page_source, 'lxml')

    # not mine !!!!!!

    # 2.2. Post contents
    contents = []
    for all_links in article_links:
        soup = bs((article), 'html.parser')
        content = soup.find('section', attrs={'class': 'post-content'})
        contents.append(content)


    # 2.1. Authors

    authors = []
    for all_links in article:
        soup = bs(article, 'html.parser')
        author = soup.find('span', attrs={'class': 'author-content'})
        authors.append(author)



    # POSTGRESQL CONNECTION
    # 1. Connect to local database using psycopg2

    import psycopg2

    hostname = 'balarama.db.elephantsql.com'
    username = 'yagoiucf'
    password = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    database = 'yagoiucf'

    conn = psycopg2.connect(host='balarama.db.elephantsql.com', user='yagoiucf',
                            password='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', dbname='yagoiucf')
    conn.close()

web-scraping

python

postgresql

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-05-29 15:16:31

In several places you are using a string produced in a previous loop instead of the current loop variable.

for all_links in article_links:
        soup = bs((article), 'html.parser')

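article is a URL left over from the earlier loop (most likely the source of the error).

Also, all_links is the list of links (flattened from a list of lists), and it is now being overwritten as the loop variable over article_links.

At this point: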

for all_links in article:
    soup = bs(article, 'html.parser')

I believe you are passing a URL string to bs rather than HTML.
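To illustrate the difference, here is a minimal sketch that first parses an HTML string and then parses a bare URL string. The markup and the author name are made up for the example and are not taken from the real blog; only the `author-content` class comes from the question.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded article page.
html = '<html><body><span class="author-content">Jane Doe</span></body></html>'

# Parsing markup: the element is found.
soup = BeautifulSoup(html, 'html.parser')
author_from_html = soup.find('span', attrs={'class': 'author-content'})

# Parsing a URL string: Beautiful Soup treats it as plain text, so there is
# nothing to find. The warning is silenced here only to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    soup = BeautifulSoup('https://teonite.com/blog/', 'html.parser')
author_from_url = soup.find('span', attrs={'class': 'author-content'})

print(author_from_html.get_text())  # Jane Doe
print(author_from_url)              # None
```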

You also create a new webdriver instance when you could keep using the existing one.

I think you only need selenium in the original loop:

for article in all_links:

This visits every blog post. Inside the loop, you can extract whatever you need from the page.
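As a rough sketch of that idea, the loop below visits each URL once and extracts both the author and the content in the same pass. The selectors are the ones from the question; the helper names `parse_article` and `scrape_articles` and the `get_html` parameter are hypothetical, introduced here so the page-fetching step can be swapped out.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return (author, content_text) extracted from one article page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    author_tag = soup.find('span', attrs={'class': 'author-content'})
    content_tag = soup.find('section', attrs={'class': 'post-content'})
    # Fall back to None / empty string when a selector matches nothing.
    author = author_tag.get_text(strip=True) if author_tag else None
    content = content_tag.get_text(' ', strip=True) if content_tag else ''
    return author, content


def scrape_articles(urls, get_html):
    """Visit each URL once and collect author and content together.

    get_html is any callable that maps a URL to that page's HTML.
    """
    results = []
    for url in urls:
        html = get_html(url)  # pass markup to the parser, not the URL itself
        author, content = parse_article(html)
        results.append({'url': url, 'author': author, 'content': content})
    return results
```

With selenium, `get_html` could be a small wrapper that calls `d.get(url)` and returns `d.page_source`, reusing the single webdriver instance `d` from the question instead of creating a second one.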

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/56360800
