Q: Golang: parse HTML and extract all content within the <body> </body> tags

Stack Overflow user

Asked 2015-05-07 18:34:04

4 answers · 76.8K views · 0 followers · 37 votes

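As the title says: I need to return everything inside the body tag of an HTML document, including any nested HTML tags. I'd like to know the best way to do this. I had a working solution using the Gokogiri package, but I'm trying to stay away from any package that depends on C libraries. Is there a way to do this with the Go standard library, or with a package that is 100% Go?

Since posting my original question I have tried the following packages, and neither solved the problem:

pkg/encoding/xml/ (standard library xml package)
golang.org/x/net/html

Neither seems to return the child or nested tags inside the body. For example, given:

<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>

they return only "body content" and ignore the <p> tag that follows and the text it wraps.

The overall goal is to get a string or content that looks like this: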

<body>
    body content 
    <p>more content</p>
</body>

html

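4 Answers

Stack Overflow user

Posted 2016-08-09 15:48:52

This can be solved by recursively searching for the body node (using the html package) and then rendering the HTML starting from that node.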

package main

import (
    "bytes"
    "errors"
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html"
)

// Body walks the node tree depth-first and returns the first <body> element.
func Body(doc *html.Node) (*html.Node, error) {
    var body *html.Node
    var crawler func(*html.Node)
    crawler = func(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == "body" {
            body = node
            return
        }
        // Stop visiting further children once a match has been found.
        for child := node.FirstChild; child != nil && body == nil; child = child.NextSibling {
            crawler(child)
        }
    }
    crawler(doc)
    if body != nil {
        return body, nil
    }
    return nil, errors.New("missing <body> in the node tree")
}

// renderNode serializes n, including its own tag and all of its descendants.
func renderNode(n *html.Node) (string, error) {
    var buf bytes.Buffer
    if err := html.Render(&buf, n); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    doc, err := html.Parse(strings.NewReader(htm))
    if err != nil {
        log.Fatal(err)
    }
    bn, err := Body(doc)
    if err != nil {
        log.Fatal(err)
    }
    body, err := renderNode(bn)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(body)
}

const htm = `<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    body content
    <p>more content</p>
</body>
</html>`

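64 votes

Stack Overflow user

Posted 2015-05-07 22:11:58

This can be done with the standard encoding/xml package, but it is a bit cumbersome. One caveat with this example: the output does not include the enclosing body tags, though it does contain all of their children.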

package main

import (
    "bytes"
    "encoding/xml"
    "fmt"
)

type html struct {
    Body body `xml:"body"`
}
type body struct {
    Content string `xml:",innerxml"`
}

func main() {
    b := []byte(`<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>`)

    h := html{}
    err := xml.NewDecoder(bytes.NewBuffer(b)).Decode(&h)
    if err != nil {
        fmt.Println("error", err)
        return
    }

    fmt.Println(h.Body.Content)
}

Runnable example:

http://play.golang.org/p/ZH5iKyjRQp

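11 votes

Stack Overflow user

Posted 2015-05-07 22:35:36

Since you didn't show the source code of your attempt with the html package, I have to guess what you were doing, but I suspect you were using the tokenizer rather than the parser. Here is a program that uses the parser and does what you need: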

package main

import (
    "log"
    "os"
    "strings"

    "github.com/andybalholm/cascadia"
    "golang.org/x/net/html"
)

func main() {
    r := strings.NewReader(`<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>`)
    doc, err := html.Parse(r)
    if err != nil {
        log.Fatal(err)
    }

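    // MatchFirst returns the first node matching the selector, or nil if none does.
    body := cascadia.MustCompile("body").MatchFirst(doc)
    if body == nil {
        log.Fatal("no <body> element found")
    }
    if err := html.Render(os.Stdout, body); err != nil {
        log.Fatal(err)
    }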
}

7 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/30109061
