Question: Abot: omitting CrawledPage HttpWebRequest/Response

Stack Overflow user

Asked 2015-04-03 14:46:54

Answers: 1, Views: 935, Followers: 0, Votes: 1

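The way I use Abot is that I have a WPF application that displays a browser control (CefSharp). The user logs in, and whatever custom authentication the site uses then works during the crawl, just as if the user were actually browsing the site.

So when I crawl, I want to use this browser control to make the requests and simply return the page data. To that end I have implemented a custom PageRequester; the complete listing is below.

The problem is that with CefSharp, as with other browser controls, there is no way to get the HttpWebRequest/HttpWebResponse associated with a CrawledPage. Without those two properties set, Abot does not continue the crawl.

Is there anything I can do to work around this?

Code listing: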

using Abot.Core;
using Abot.Poco;
using CefSharp.Wpf;
using System;
using System.Net;
using System.Text;
using System.Threading;

public class CefPageRequester : IPageRequester
{
    private MainWindowDataContext DataContext;
    private ChromiumWebBrowser ChromiumWebBrowser;
    private CrawlConfiguration CrawlConfig;

    private volatile bool _navigationCompleted;
    private string _pageSource;

    public CefPageRequester(MainWindowDataContext dataContext, ChromiumWebBrowser chromiumWebBrowser, CrawlConfiguration crawlConfig)
    {
        this.DataContext = dataContext;
        this.ChromiumWebBrowser = chromiumWebBrowser;
        this.CrawlConfig = crawlConfig;

        this.ChromiumWebBrowser.FrameLoadEnd += ChromiumWebBrowser_FrameLoadEnd;
    }

    public CrawledPage MakeRequest(Uri uri)
    {
        return this.MakeRequest(uri, cp => new CrawlDecision() { Allow = true });
    }

    public CrawledPage MakeRequest(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent)
    {
        if (uri == null)
            throw new ArgumentNullException("uri");

        CrawledPage crawledPage = new CrawledPage(uri);

        try
        {
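            //the browser control is bound to the address of the data context, 
            //if we set the address directly it breaks for some reason, although it's a two way binding.
            //use AbsoluteUri (the full URL); AbsolutePath is only the path part and drops scheme, host and query.
            this.DataContext.Address = uri.AbsoluteUri;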

            crawledPage.RequestStarted = DateTime.Now;
            crawledPage.DownloadContentStarted = crawledPage.RequestStarted;

            while (!_navigationCompleted)
                Thread.CurrentThread.Join(10);
        }
        catch (WebException e)
        {
            crawledPage.WebException = e;
        }
        catch
        {
            //bad luck, we should log this.
        }
        finally
        {
            //TODO must add these properties!!
            //crawledPage.HttpWebRequest = request;
            //crawledPage.HttpWebResponse = response;
            crawledPage.RequestCompleted = DateTime.Now;
            crawledPage.DownloadContentCompleted = crawledPage.RequestCompleted;
            if (!String.IsNullOrWhiteSpace(_pageSource))
                crawledPage.Content = this.GetContent("UTF-8", _pageSource);

            _navigationCompleted = false;
            _pageSource = null;
        }

        return crawledPage;
    }

    private void ChromiumWebBrowser_FrameLoadEnd(object sender, CefSharp.FrameLoadEndEventArgs e)
    {
        if (!e.IsMainFrame)
            return;

        this.ChromiumWebBrowser.Dispatcher.BeginInvoke(
            (Action)(() =>
            {
                _pageSource = this.ChromiumWebBrowser.GetSourceAsync().Result;
                _navigationCompleted = true;
            }));
    }

    private PageContent GetContent(string charset, string html)
    {
        PageContent pageContent = new PageContent();
        pageContent.Charset = charset;
        pageContent.Encoding = this.GetEncoding(charset);
        pageContent.Text = html;
        pageContent.Bytes = pageContent.Encoding.GetBytes(html);

        return pageContent;
    }

    private Encoding GetEncoding(string charset)
    {
        Encoding e = Encoding.UTF8;
        if (charset != null)
        {
            try
            {
                e = Encoding.GetEncoding(charset);
            }
            catch { }
        }

        return e;
    }
}
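
One detail of the listing above worth illustrating: the polling loop in MakeRequest never gives up, so a navigation that fails to raise FrameLoadEnd would block the crawl indefinitely. Below is a minimal sketch of a bounded wait, assuming the wait runs on a crawl thread rather than the UI thread. NavigationWaiter and its members are hypothetical names, not part of Abot or CefSharp.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not an Abot or CefSharp type): hands the page source
// from the load-finished event handler to the waiting crawl thread.
public sealed class NavigationWaiter
{
    private volatile TaskCompletionSource<string> _pending;

    // Call right before starting a navigation.
    public void Begin()
    {
        _pending = new TaskCompletionSource<string>();
    }

    // Call from the load-finished handler once the page source is available.
    public void Complete(string pageSource)
    {
        TaskCompletionSource<string> pending = _pending;
        if (pending != null)
            pending.TrySetResult(pageSource);
    }

    // Blocks until Complete is called or the timeout elapses.
    // Returns the page source, or null if the timeout elapsed first.
    public string Wait(TimeSpan timeout)
    {
        TaskCompletionSource<string> pending = _pending;
        if (pending == null)
            throw new InvalidOperationException("Begin was not called.");

        return pending.Task.Wait(timeout) ? pending.Task.Result : null;
    }
}
```

In the listing, Begin would be called before setting DataContext.Address, Complete from the FrameLoadEnd handler, and Wait in place of the while loop; a null result could then be treated as a failed request.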

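The question can also be phrased as: how do I avoid having to create an HttpWebResponse from a stream? Going by what MSDN says, that seems to be impossible:

You should never directly create an instance of the HttpWebResponse class. Instead, use the instance returned by a call to HttpWebRequest.GetResponse.

In effect, I would have to issue the request myself to get a response, which is exactly what I want to avoid by using a web browser control.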

web-crawler

httprequest

httpresponse

cefsharp

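1 Answer

Stack Overflow user

Accepted answer

Answered 2015-04-03 15:49:49

As you know, a lot of functionality depends on HttpWebRequest and HttpWebResponse being set. I have listed a few options for you, in order of preference.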

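1) Refactor Abot to use POCO abstractions instead of those classes. Then you only need one converter that turns the real HttpWebRequest and HttpWebResponse into those POCO types, and another converter that turns the browser control's response into the same POCOs.

2) Create a CustomHttpWebRequest and a CustomHttpWebResponse that inherit from the .NET classes, so that you can access/override the public/protected properties and manually create an instance that models the request/response the browser component gives back to you. I know this can be tricky, but it might work (I have never done it, so I can't say for sure).

3) I hate this idea, and it should be your last resort: create a real instance of those classes and use reflection to set whatever properties/values need to be set to satisfy all of Abot's usages.

4) I hate this idea even more: use Microsoft Fakes to create shims/stubs/fakes of the properties and methods of HttpWebRequest and HttpWebResponse. You can then configure them to return your values. This tool is normally used only for testing, but I believe it could be used in production code if you are desperate, don't care about performance, and/or are insane.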
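
To make option 1 more concrete, here is a minimal sketch of what such a POCO and its two converters might look like. Everything here is an assumption for illustration: PageResponseInfo and PageResponseInfoConverter are hypothetical names, the set of properties is a guess at what a crawler would need, and the status code and content type passed to FromBrowser would have to come from whatever the browser control exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical POCO standing in for the parts of HttpWebResponse a crawler reads.
public sealed class PageResponseInfo
{
    public Uri ResponseUri { get; set; }
    public HttpStatusCode StatusCode { get; set; }
    public string ContentType { get; set; }
    public IDictionary<string, string> Headers { get; set; }
}

public static class PageResponseInfoConverter
{
    // Converter for the default HTTP path: copies values out of a real response.
    public static PageResponseInfo FromHttpWebResponse(HttpWebResponse response)
    {
        if (response == null)
            throw new ArgumentNullException("response");

        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string key in response.Headers.AllKeys)
            headers[key] = response.Headers[key];

        return new PageResponseInfo
        {
            ResponseUri = response.ResponseUri,
            StatusCode = response.StatusCode,
            ContentType = response.ContentType,
            Headers = headers
        };
    }

    // Converter for the browser path: built from values reported by the control.
    public static PageResponseInfo FromBrowser(Uri uri, int statusCode, string contentType)
    {
        if (uri == null)
            throw new ArgumentNullException("uri");

        return new PageResponseInfo
        {
            ResponseUri = uri,
            StatusCode = (HttpStatusCode)statusCode,
            ContentType = contentType,
            Headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        };
    }
}
```

With a type like this, the crawler would depend only on the POCO, and a browser-backed requester would no longer need to produce an HttpWebResponse at all.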

I included the terrible ideas as well, just in case they spark some ideas of your own. Hope that helps...

Votes: 1
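
For reference, the reflection mechanics behind option 3 look roughly like the sketch below. It deliberately uses a stand-in class, LockedResponse, because the private members of the real HttpWebResponse are implementation details that differ between runtimes; LockedResponse, ReflectionFactory and the field name _statusCode are all hypothetical.

```csharp
using System;
using System.Reflection;

// Stand-in for a type with no public constructor and no setters (hypothetical).
public sealed class LockedResponse
{
    private int _statusCode;

    private LockedResponse() { }

    public int StatusCode { get { return _statusCode; } }
}

public static class ReflectionFactory
{
    // Creates an instance through a non-public parameterless constructor.
    public static object Create(Type type)
    {
        return Activator.CreateInstance(type, true);
    }

    // Sets a private instance field by name; throws if the field does not exist.
    public static void SetField(object target, string fieldName, object value)
    {
        FieldInfo field = target.GetType().GetField(
            fieldName, BindingFlags.Instance | BindingFlags.NonPublic);
        if (field == null)
            throw new MissingFieldException(target.GetType().FullName, fieldName);

        field.SetValue(target, value);
    }
}
```

Usage would be along the lines of calling Create for the type and then SetField for each value. Against a real framework type this breaks whenever the private field names change, which is part of why it is a last resort.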

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/29434396
