Q: Instagram scraping

Stack Overflow user

Asked 2020-06-16 12:32:30

2 answers, 4.4K views, 0 followers, 3 votes

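The code below works on my own computer, where it scrapes data from an Instagram account. When I run it on a VPS, I am redirected to the Instagram login page, so the script fails.

Why does Instagram respond differently to a computer and to a server?

The same happens with wget: on my computer I get the profile page, while on the server I am redirected to the login page.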

import requests
import re


class InstagramScraper:
    """
    Scraper for Instagram profile information.
    """

    def __init__(self, session: requests.Session, instagram_account_name: str):
        self.session = session
        self._account_name = self.clean_account_name(instagram_account_name)
        self.load_data()

    def load_data(self):
        """
        Download the profile page and extract the counters and the display name.
        """
        url = "https://www.instagram.com/{account_name}/".format(account_name=self._account_name)
        response = self.session.get(url)

        publications_regex = r'"edge_owner_to_timeline_media":{"count":(\d+),'
        self._publications = int(self._extract(publications_regex, response.text))

        followers_regex = r'"edge_followed_by":{"count":(\d+)'
        self._followers = int(self._extract(followers_regex, response.text))

        # Non-greedy match, so the capture stops at the first closing quote.
        title_regex = r'"full_name":"(.*?)"'
        self._title = self._extract(title_regex, response.text)

        following_regex = r'"edge_follow":{"count":(\d+)}'
        self._following = int(self._extract(following_regex, response.text))

    @staticmethod
    def _extract(pattern: str, text: str) -> str:
        """
        Return the first capture group of pattern in text, or fail with a clear error.
        """
        match = re.search(pattern, text)
        if match is None:
            # Typically happens when the page is not a profile page at all.
            raise ValueError("Pattern not found in page: " + pattern)
        return match.group(1)

    def clean_account_name(self, value) -> str:
        """
        Return the account name without the url address.
        """
        # re.search returns a Match object or None, not a str.
        found = re.search(r"https://www\.instagram\.com/([^/?#]+)", value)
        if found:
            return found.group(1)
        return value

    @property
    def publications(self) -> int:
        """
        Number of publications by this account.
        """
        return self._publications

    @property
    def followers(self) -> int:
        """
        Number of followers of this account.
        """
        return self._followers

    @property
    def title(self) -> str:
        """
        Name of the Instagram profile.
        """
        return self._title

    @property
    def account(self) -> str:
        """
        Account name used on Instagram.
        """
        return self._account_name

    @property
    def following(self) -> int:
        """
        Number of accounts this profile is following.
        """
        return self._following

    def __str__(self) -> str:
        return str({
            'Account': self.account,
            'Followers': self.followers,
            'Publications': self.publications,
            'Following': self.following,
            'Title': self.title,
        })


if __name__ == "__main__":
    with requests.Session() as session:
        scraper = InstagramScraper(session, "https://www.instagram.com/ksc_lokeren/")
        print(scraper)
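
On a login redirect the profile regexes match nothing, so it can help to inspect the final URL before parsing. The following is a minimal sketch, not part of the original script: `is_login_redirect` is a hypothetical helper name, and the `/accounts/login` path is taken from the login URL quoted in the answers below.

```python
from urllib.parse import urlparse


def is_login_redirect(final_url: str) -> bool:
    """Return True if final_url points at Instagram's login page."""
    # Only the path is inspected, so query strings such as ?next=... are ignored.
    path = urlparse(final_url).path
    return path.startswith("/accounts/login")


print(is_login_redirect("https://www.instagram.com/accounts/login/?next=/ksc_lokeren/"))  # True
print(is_login_redirect("https://www.instagram.com/ksc_lokeren/"))  # False
```

With requests, `response.url` holds the URL reached after redirects were followed, so `is_login_redirect(response.url)` can be checked right after `session.get(...)`.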

Tags: wget, python, web-scraping, instagram

2 Answers

Stack Overflow user

Answered 2020-06-25 20:55:59

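Maybe it is because you are logged in with your own credentials on your computer? furas mentioned a blacklist, but if you have never run this on that server before, I doubt it.

What I did to get around this was to use a headless browser, which simulates a normal browser and lets you navigate the website. You can simulate a login with your credentials, then retrieve the csrftoken and sessionid from the cookies and close the browser.

I did it in JavaScript, so I cannot really show you the code, but the logic is this: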

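1. Create a headless browser.

2. Set the "Accept-Language" header of the requests to "en-US".

3. Navigate to https://www.instagram.com/accounts/login/ and wait until the network is idle.

4. Simulate the login with your credentials. Look for:

   'input[name="username"]' // for the username

   'input[name="password"]' // for the password

   'button[type="submit"]' // for the login button

5. Wait until the network is idle.

6. Get all cookies and retrieve csrftoken and sessionid.

7. Close the headless browser.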
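
The steps above can be sketched in Python. This is only an illustration under stated assumptions: the answer does not name its browser tool, so Playwright is swapped in here, the function names are hypothetical, and the selectors are the ones quoted in the answer, which may no longer match the live page. Cookie banners and two-factor prompts are not handled.

```python
from typing import Dict, List, Tuple


def extract_auth_cookies(cookies: List[Dict]) -> Tuple[str, str]:
    """Pick csrftoken and sessionid out of a list of cookie dicts."""
    by_name = {c["name"]: c["value"] for c in cookies}
    return by_name["csrftoken"], by_name["sessionid"]


def login_and_get_cookies(username: str, password: str) -> Tuple[str, str]:
    """Log in through a headless browser and return (csrftoken, sessionid)."""
    # Imported here so the helper above stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Send Accept-Language: en-US with every request made by this context.
        context = browser.new_context(extra_http_headers={"Accept-Language": "en-US"})
        page = context.new_page()
        page.goto("https://www.instagram.com/accounts/login/", wait_until="networkidle")
        page.fill('input[name="username"]', username)
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')
        page.wait_for_load_state("networkidle")
        cookies = context.cookies()  # list of dicts with "name" and "value" keys
        browser.close()
    return extract_auth_cookies(cookies)
```

Calling `login_and_get_cookies("my_user", "my_password")` with real credentials would return the two cookie values, provided the login succeeds; `extract_auth_cookies` raises `KeyError` if either cookie is missing.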

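Then, whenever you send a request to https://www.instagram.com/{account_name}/, do not forget to set csrftoken and sessionid in the request headers. After a while they expire, and you will need to repeat the login steps.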
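
One way to attach the two values is a plain `Cookie` header. The sketch below assumes that this is what "in the request headers" means; `build_auth_headers` is a hypothetical helper, and "abc" and "123" are placeholder values.

```python
from typing import Dict


def build_auth_headers(csrftoken: str, sessionid: str) -> Dict[str, str]:
    """Build request headers carrying the two cookies taken from the browser."""
    return {
        "Accept-Language": "en-US",
        "Cookie": "csrftoken={}; sessionid={}".format(csrftoken, sessionid),
    }


headers = build_auth_headers("abc", "123")  # placeholder values
print(headers["Cookie"])  # csrftoken=abc; sessionid=123
```

With requests, the result can be passed as `session.get(url, headers=headers)`. Storing the values with `session.cookies.set(name, value)` instead should have a similar effect and lets the session manage the cookies itself.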

Votes: 1

Stack Overflow user

Answered 2022-03-31 17:12:28

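You see the login prompt from Instagram because you are blocked. Instagram has detected that you are not browsing the site manually.

If you want to extract information from Instagram profiles, you have to rely on a scraping API, because Instagram will block you quickly.

Here is a good tutorial on using a scraping API to fetch user profile data and posts, including pagination: https://scrapingfish.com/blog/scraping-instagram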

Votes: 1

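Original page content provided by Stack Overflow. The Chinese translation of this page was supported by Tencent Cloud's Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/62408588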
