
Q: Scrolling through Yahoo Finance news

Stack Overflow user

Asked 2020-10-12 16:59:23

2 answers · 224 views · 0 followers · score 1

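I'm working on a small project in which I collect Yahoo Finance news for a particular company and run some data analysis on it to see how news sentiment affects stock performance. I'm trying to keep scrolling and scraping until the page stops loading new results, but I'm already running into trouble at the first scroll.

I'm using selenium to help with this. I've looked everywhere for help, but it seems that because the news results load incrementally each time you scroll down, things get more complicated.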

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup


# Web scraper for an infinite-scrolling page
url = "https://finance.yahoo.com/quote/company/press-releases?p=company"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
time.sleep(2)  # Allow 2 seconds for the web page to open

SCROLL_PAUSE_TIME = 0.5  # Seconds to wait after each scroll
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

##### Extract Article Titles #####
titles = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for t in soup.find_all(class_="Cf"):
    a_tag = t.find("a", class_="Fw(b)")
    if a_tag:
        text = a_tag.text
        titles.append(text)
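
With a pause of 0.5 seconds, the height check can run before the next batch of news items has loaded, so the loop may see an unchanged height and stop after the first scroll. The sketch below is one possible way to make the loop more tolerant: it only gives up after the height has stayed the same for several consecutive checks. The function name `scroll_until_stable` and its parameters `pause`, `stable_rounds`, and `max_scrolls` are illustrative assumptions, not part of Selenium; the function relies only on `driver.execute_script`, which the code above already uses.

```python
import time


def scroll_until_stable(driver, pause=2.0, stable_rounds=3, max_scrolls=200):
    """Scroll to the bottom until the page height stops growing.

    The height must stay unchanged for `stable_rounds` consecutive checks
    before the loop gives up, so a single slow load does not end it early.
    `max_scrolls` is a safety cap for pages that never stop growing.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    unchanged = 0
    scrolls = 0
    while unchanged < stable_rounds and scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            unchanged += 1          # nothing new yet, count the quiet round
        else:
            unchanged = 0           # new content arrived, reset the counter
            last_height = new_height
    return scrolls
```

Calling `scroll_until_stable(driver)` in place of the `while True` loop, and parsing `driver.page_source` afterwards, keeps the rest of the script unchanged. The right values for `pause` and `stable_rounds` depend on how quickly the page loads and would need tuning.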

Tags: python, selenium, web-scraping, sentiment-analysis, yahoo-finance

2 Answers

Stack Overflow user

Answered 2020-10-12 17:02:06

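This is not a best practice for automation in selenium.

For multiple reasons, logging in to sites like Gmail and Facebook using WebDriver is not recommended. Aside from being against the usage terms of these sites (where you risk having the account shut down), it is slow and unreliable.

The ideal practice is to use the APIs that email providers offer or, in the case of Facebook, the developer tools service, which exposes an API for creating test accounts, friends, and so forth. Although using an API might seem like extra work, you will be paid back in speed, reliability, and stability. The API is also unlikely to change, whereas web pages and HTML locators change often and require you to update your test framework.

Logging in to third-party sites using WebDriver at any point of your test increases the risk of the test failing, because it makes the test longer. A general rule of thumb is that longer tests are more fragile and unreliable.

WebDriver implementations that conform to the W3C specification also annotate the navigator object with a WebDriver property so that denial-of-service attacks can be mitigated.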

Score: 0

Stack Overflow user

Answered 2021-02-22 12:49:32

This sample code comes from a project I worked on a while ago. Hopefully it helps point you in the right direction.

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
# VADER needs its lexicon once: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))
substring = 'https://www.cnbc.com/'

# Collect every link that starts with the CNBC prefix.
# The list is seeded with 'review', which is why that string appears as row 0 in the result.
links = ['review']
for link in soup.find_all('a', href=True):
    if link['href'].startswith(substring):
        links.append(link['href'])

# Convert the list to a one-column data frame
df = pd.DataFrame(links, columns=['review'])

# Score each entry with VADER
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))

def convert(x):
    # Map a compound score to a label
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"

# Label each row from its compound score
df['result'] = df['sentiment'].apply(lambda x: convert(x['compound']))

# Keep only the link and its label
df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
print(df_final)

Result:

                                               review   result
0                                              review  neutral
1                      https://www.cnbc.com/business/  neutral
2   https://www.cnbc.com/2021/02/22/chinas-foreign...  neutral
3   https://www.cnbc.com/2021/02/22/chinas-foreign...  neutral
4                  https://www.cnbc.com/evelyn-cheng/  neutral
..                                                ...      ...
89                        https://www.cnbc.com/banks/  neutral
90  https://www.cnbc.com/2021/02/17/wells-fargo-sh...  neutral
91                   https://www.cnbc.com/technology/  neutral
92  https://www.cnbc.com/2021/02/17/lakestar-found...  neutral
93               https://www.cnbc.com/finance/?page=2  neutral

[94 rows x 2 columns]
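
Every row above is labeled neutral, most likely because the scored text is the raw URL, which VADER treats as a single unknown token. One way to give the analyzer real words is to turn the last path segment of each URL into text before scoring it. The following is a minimal sketch under that assumption: the helper names `slug_to_text` and `label_compound` and the example URL are hypothetical, and the thresholds are the ones used in `convert` above.

```python
import re
from urllib.parse import urlparse


def slug_to_text(url):
    """Turn the last path segment of an article URL into plain words."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    slug = re.sub(r"\.html?$", "", slug)        # drop a trailing .html or .htm
    return re.sub(r"[-_]+", " ", slug).strip()  # hyphens/underscores to spaces


def label_compound(score, negative_below=0.0, positive_above=0.2):
    """Map a compound score to a label, using the thresholds from convert()."""
    if score < negative_below:
        return "negative"
    if score > positive_above:
        return "positive"
    return "neutral"


# Hypothetical URL, used only for illustration.
url = "https://www.example.com/2021/02/22/markets-rally-on-strong-earnings.html"
print(slug_to_text(url))
print(label_compound(0.5), label_compound(-0.3), label_compound(0.1))
```

The returned text could then be passed to `sid.polarity_scores(...)` in place of the URL, with `label_compound` applied to the resulting `compound` value. Scoring the actual headline text from the page would probably work better still, since URL slugs are often shortened.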

Score: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64314680
