Question: goquery loads an empty document from a clearly non-empty response

Stack Overflow user

Asked 2019-10-25 19:13:15

Answers: 2, Views: 189, Followers: 0, Votes: 0

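I have been trying to load a response into a goquery document, but it seems to fail (even though it does not return an error).

The response I am trying to load comes from: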

https://www.bbcgoodfood.com/search_api_ajax/search/recipes?sort=created&order=desc&page=4

Although no error is returned, when I call fmt.Println(goquery.OuterHtml(doc.Contents())) I get this output:

<html><head></head><body></body></html>

Meanwhile, if I skip loading it into a goquery document and instead call

s, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(s))

I get:

<!doctype html>
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 no-touch" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9 no-touch" lang="en"> <![endif]-->
<!--[if gt IE 8]> <html class="no-js gt-ie-8 no-touch" lang="en"> <![endif]-->
<!--[if !IE]><!-->
<html class="no-js no-touch" lang="en">
<!--<![endif]-->

<head>
    <meta charset="utf-8">
    <title>Search | BBC Good Food</title>
    <!--[if IE]><![endif]-->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="prev" href="https://www.bbcgoodfood.com/search/recipes?page=3&amp;sort=created&amp;order=desc" />
    <link rel="next" href="https://www.bbcgoodfood.com/search/recipes?page=5&amp;sort=created&amp;order=desc" />
    <meta name="robots" content="noindex" />
    <style>
        .async-hide {
            opacity: 0 !important
        }
    ... etc

The basic logic of what I am doing is as follows:

package main

import (
    "fmt"
    "net/http"
    "github.com/PuerkitoBio/goquery"
    "io/ioutil"
)

func main() {
    baseUrl := "https://www.bbcgoodfood.com/search_api_ajax/search/recipes?sort=created&order=desc&page="
    i := 4

    // Make a request
    req, _ := http.NewRequest(http.MethodGet, fmt.Sprintf("%s%d", baseUrl, i), nil)

    // Create a new HTTP client and execute the request
    client := &http.Client{}
    resp, _ := client.Do(req)

    // Print out response
    s, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(s))

    // Load into goquery doc
    doc, _ := goquery.NewDocumentFromReader(resp.Body)
    fmt.Println(goquery.OuterHtml(doc.Contents()))
}

The full response can be found here. Is there any particular reason why this will not load?

goquery

2 Answers

Stack Overflow user

Accepted answer

Answered 2019-10-26 01:17:31

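Go's HTML parser does not seem to like the HTML you are getting: the <html> tags are all inside comments, so I think it never gets going with the parsing.

If you prepend <html> to the document, everything works. One way to do that is a reader wrapper like the one below, which writes the html tag on the first call to Read and delegates to resp.Body on subsequent calls.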

import "io"

var htmlTag string = "<html>\n"

type htmlAddingReader struct {
    sentHtml bool
    source io.Reader
}

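func (r *htmlAddingReader) Read(b []byte) (n int, err error) {
    if !r.sentHtml {
        // Emit the tag once; assumes len(b) >= len(htmlTag), as with typical read buffers.
        r.sentHtml = true
        return copy(b, htmlTag), nil
    }
    // Afterwards, delegate to the wrapped reader.
    return r.source.Read(b)
}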

To use this in your example code, change the last section as follows:

    // Load into goquery doc
    wrapped := &htmlAddingReader{}
    wrapped.source = resp.Body
    doc, _ := goquery.NewDocumentFromReader(wrapped)
    fmt.Println(goquery.OuterHtml(doc.Contents()))

Votes: 1

Stack Overflow user

Answered 2021-10-28 01:49:38

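There are two problems with the code:

(1) resp.Body is an io.ReadCloser stream.

ioutil.ReadAll(resp.Body) reads the entire stream, so nothing is left for goquery.NewDocumentFromReader(resp.Body) to read, and it returns an empty document.

Instead, you can create a new stream from the saved body string with strings.NewReader(s).

(2) doc.Contents() returns the children of the top element, which is just <!DOCTYPE html>. If you want the whole document, you probably want doc.Selection.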

Something like this should work:

    // Read entire resp.Body into raw
    raw, _ := io.ReadAll(resp.Body)
    s := string(raw)

    // Print out response
    fmt.Println(s)

    // Create a new readable stream with NewReader(s)
    doc, _ := goquery.NewDocumentFromReader(strings.NewReader(s))
    
    // Use doc.Selection to get the whole doc
    fmt.Println(doc.Selection.Html())

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/58557464
