My task is to find image URLs in HTML.
Problem
The HTML parsers golang.org/x/net/html and github.com/PuerkitoBio/goquery ignore the largest image on the page http://www.ozon.ru/context/detail/id/34498204/
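Question
Why is the img tag whose src is //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg ignored?
Notes:
The URL //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg is present in the page source.
go get succeeds, but go build hangs forever.
Parsed HTML page source: this HTML is the result of resp, _ := http.Get(url)
Code: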
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/html"
)

func main() {
	url := "http://www.ozon.ru/context/detail/id/34498204/"
	resp, err := http.Get(url)
	if err != nil {
		// Report the request error instead of silently doing nothing.
		log.Fatalln("Load page error", err)
	}
	defer resp.Body.Close()
	log.Println("Load page complete")

	document, err := html.Parse(resp.Body)
	if err != nil {
		log.Panicln("Parse html error", err)
	}

	// Walk the node tree and log src and data-original of every img element.
	var parser func(*html.Node)
	parser = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "img" {
			var imgSrcUrl, imgDataOriginal string
			for _, element := range n.Attr {
				if element.Key == "src" {
					imgSrcUrl = element.Val
				}
				if element.Key == "data-original" {
					imgDataOriginal = element.Val
				}
			}
			log.Println(imgSrcUrl, imgDataOriginal)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			parser(c)
		}
	}
	parser(document)
}

Posted on 2016-07-11 19:10:23
This is not a bug but the expected behavior of x/net/html, and it affects every parser built on x/net/html.
There are four possible solutions:
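Remove <noscript> and </noscript> from the page, so that x/net/html parses their content as expected. Something like:

package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	url := "http://www.ozon.ru/context/detail/id/34498204/"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatalln("Load page error", err)
	}
	log.Println("Load page complete")

	// ---- changed part: read the body and drop the noscript tags ----
	data, err := ioutil.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		log.Fatalln("Read page error", err)
	}
	hdata := strings.Replace(string(data), "<noscript>", "", -1)
	hdata = strings.Replace(hdata, "</noscript>", "", -1)
	// -----------------------------------------------------------------

	document, err := html.Parse(strings.NewReader(hdata))
	if err != nil {
		log.Panicln("Parse html error", err)
	}

	var parser func(*html.Node)
	parser = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "img" {
			var imgSrcUrl, imgDataOriginal string
			for _, element := range n.Attr {
				if element.Key == "src" {
					imgSrcUrl = element.Val
				}
				if element.Key == "data-original" {
					imgDataOriginal = element.Val
				}
			}
			log.Println(imgSrcUrl, imgDataOriginal)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			parser(c)
		}
	}
	parser(document)
}

Patch for x/net/html: https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec

https://stackoverflow.com/questions/38293657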
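The tag-stripping step from the first solution can be isolated and checked without any network access. The sketch below is only an illustration: stripNoscriptTags is a hypothetical helper name, not part of x/net/html, and it assumes the page may contain noscript tags in mixed case or with attributes, which a plain strings.Replace on "<noscript>" would miss.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTag matches opening and closing noscript tags in any letter case,
// with or without attributes. The tag content itself is not matched.
var noscriptTag = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// stripNoscriptTags is a hypothetical helper: it removes the noscript tags
// but keeps whatever was between them, so a later HTML parse sees the
// inner markup as ordinary elements.
func stripNoscriptTags(page string) string {
	return noscriptTag.ReplaceAllString(page, "")
}

func main() {
	// example.com is a placeholder URL, not taken from the original page.
	page := `<div><noscript><img src="//example.com/cover.jpg"></noscript></div>`
	fmt.Println(stripNoscriptTags(page))
}
```

The cleaned string can then be passed to html.Parse through strings.NewReader, as in the answer's code. Note that this is a text-level rewrite, so it would also alter the literal text "<noscript>" if it appeared inside a script or a comment.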