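EDIT: Well, with the generous help of @Ismael VC, it got fast. The solution was to first wipe my Julia v0.4, reinstall it from the most recent nightly, and then do a certain amount of package juggling: Pkg.init(), Pkg.add("Gumbo"). Adding Gumbo first produces a build error: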
INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo
WARNING: deprecated syntax "[a=>b, ...]" at /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================
LoadError: failed process: Process(`curl -f -o /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19
================================================================================
================================[ BUILD ERRORS ]================================
WARNING: Gumbo had build errors.
- packages with build errors remain installed in /Users/szalmaf/.julia/v0.4
- build the package(s) and all dependencies with `Pkg.build("Gumbo")`
- build a single package by running its `deps/build.jl` script
================================================================================
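INFO: Package database updated
So one needs to check out the latest Gumbo from the master branch (Pkg.update(), Pkg.build("Gumbo")), which in turn produces a Gumbo whose parsehtml runs fast.
Note: the problem is not what a commenter (who did not read the earlier comments carefully) mentioned, namely the claim that the JIT compiler makes "it" slow. If you read the back-and-forth between me and @Ismael VC, you will see that I ran his test code exactly as he did and posted the results in my first two comments; my original installation really was too slow. Anyway, the important thing is that, with Ismael's help in our private chat, parsehtml got up to full speed. Thanks again!
Original post:
Julia 0.4.0-dev+7053: HTML parsing very slow?
Although the Julia language is sold as being fast at many things, it looks slow at basic everyday tasks such as parsing a web page.
Parsing the http://julialang.org web page, which shows how fast Julia is compared to C, Fortran, R, Matlab, etc.: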
# using HTTPClient, Gumbo
julia_url = "http://julialang.org"
println(" scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println(" scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println(" parsed: ", Dates.unix2datetime(time()))
gives
scrape start: 2015-09-05T16:47:03.843
scrape end: 2015-09-05T16:47:04.044
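parsed: 2015-09-05T16:47:04.41
This shows that fetching the page takes about 200 ms, which is reasonable on my wifi connection, whereas parsing this simple page takes close to 400 ms, which sounds prohibitive by today's standards.
Doing the same test on a slightly more complex web page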
julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
println(" scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println(" scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println(" parsed: ", Dates.unix2datetime(time()))
gives
scrape start: 2015-09-05T16:57:52.054
scrape end: 2015-09-05T16:57:52.736
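parsed: 2015-09-05T16:57:53.699
Here parsing takes almost a full second.
I may be missing something, but is there a better/faster way in Julia to parse a web page or get HTML elements from it? If so, how?
Posted on 2015-09-06 15:14:27
First of all, have you read the performance tips in the manual? Which Julia version are you using? (versioninfo())
You could start by reading them and, as the documentation recommends, put your code inside functions. There is also the @time macro, which additionally reports memory allocation, like this:
Julia v0.3.11
Tested at: https://juliabox.org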
using HTTPClient, Gumbo
function test(url::String)
    @show url
    print("Scraping: ")
    @time page = get(url)
    print("Parsing: ")
    @time page = parsehtml(bytestring(page.body))
end
let
    gc_disable()
    url = "http://julialang.org"
    println("First run:")
    test(url) # first run JITed
    println("\nSecond run:")
    test(url)
    url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
    println("\nThird run:")
    test(url)
    println("\nFourth run:")
    test(url)
    gc_enable()
end
First run:
url => "http://julialang.org"
Scraping: elapsed time: 0.248092469 seconds (3971912 bytes allocated)
Parsing: elapsed time: 0.850927483 seconds (27207516 bytes allocated)
Second run:
url => "http://julialang.org"
Scraping: elapsed time: 0.055722638 seconds (73952 bytes allocated)
Parsing: elapsed time: 0.005446998 seconds (821800 bytes allocated)
Third run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.282382774 seconds (619324 bytes allocated)
Parsing: elapsed time: 0.227427243 seconds (9728620 bytes allocated)
Fourth run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.288903961 seconds (400272 bytes allocated)
Parsing: elapsed time: 0.017787089 seconds (1516560 bytes allocated)
Here are the timings for your code, using @time:
julia_url = "http://julialang.org"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
First run:
elapsed time: 0.361194892 seconds (11108960 bytes allocated)
elapsed time: 0.996812988 seconds (34546156 bytes allocated, 4.04% gc time)
Second run:
elapsed time: 0.018920084 seconds (77952 bytes allocated)
elapsed time: 0.006632215 seconds (823256 bytes allocated)
julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
First run:
elapsed time: 0.33795947 seconds (535916 bytes allocated)
elapsed time: 0.224386491 seconds (9729852 bytes allocated)
Second run:
elapsed time: 0.276848452 seconds (584944 bytes allocated)
elapsed time: 0.018806686 seconds (1517856 bytes allocated)
Edit: v0.4.0-dev+7053
On version 0.4+, make sure to run Pkg.checkout("Gumbo") first in order to get the latest commits, and then Pkg.build("Gumbo"). Tested at JuliaBox:
http://nbviewer.ipython.org/gist/Ismael-VC/4c241228f04ed54c70e2
First run:
url = "http://julialang.org"
Scraping: 0.227681 seconds (85.11 k allocations: 3.585 MB)
Parsing: 0.696063 seconds (799.12 k allocations: 29.450 MB)
Second run:
url = "http://julialang.org"
Scraping: 0.018953 seconds (571 allocations: 69.344 KB)
Parsing: 0.007132 seconds (15.91 k allocations: 916.313 KB)
Third run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: 0.313128 seconds (4.86 k allocations: 608.850 KB)
Parsing: 0.196110 seconds (270.17 k allocations: 10.356 MB)
Fourth run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: 0.307949 seconds (1.41 k allocations: 470.953 KB)
Parsing: 0.019801 seconds (23.82 k allocations: 1.627 MB)
https://stackoverflow.com/questions/32415689
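The gap between the first run and the later runs in the timings above can be reproduced without any packages. The following is only a minimal sketch: `count_tags` is a hypothetical stand-in workload (it is not part of Gumbo or HTTPClient) used in place of parsehtml, and the HTML string is made up for illustration. The first call is timed together with the compilation of the function, the second call runs already-compiled code.

```julia
# Hypothetical stand-in workload; not part of Gumbo or HTTPClient.
# Counts '<' characters as a rough proxy for doing some work on HTML text.
function count_tags(html)
    n = 0
    for c in html
        if c == '<'
            n += 1
        end
    end
    return n
end

html = "<html><body><p>hello</p></body></html>"

# First call: the measured time includes JIT compilation of count_tags.
t_first = @elapsed count_tags(html)
# Second call: the method is already compiled, so only the work itself is measured.
t_second = @elapsed count_tags(html)

println("first call:  ", t_first, " seconds")
println("second call: ", t_second, " seconds")
```

In a fresh session the first number is usually much larger than the second, which is why a single timestamped run can overstate the steady-state cost. It does not rule out a genuinely slow installation, which is what the outdated Gumbo build turned out to be in this thread.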