
Julia 0.4.0-dev+7053 HTML parsing is very fast

Stack Overflow user
Asked 2015-09-05 17:11:19
1 answer, 645 views, 0 followers, 3 votes

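Edit: Well, it got fast with the generous help of @Ismael VC. The solution was to first wipe my Julia v0.4 installation, reinstall it from a recent nightly, and then do a certain amount of package juggling: Pkg.init() followed by Pkg.add("Gumbo"). Adding Gumbo first produces a build error:
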
INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo

WARNING: deprecated syntax "[a=>b, ...]" at /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================

LoadError: failed process: Process(`curl -f -o /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19

================================================================================

================================[ BUILD ERRORS ]================================

WARNING: Gumbo had build errors.

 - packages with build errors remain installed in /Users/szalmaf/.julia/v0.4
 - build the package(s) and all dependencies with `Pkg.build("Gumbo")`
 - build a single package by running its `deps/build.jl` script

================================================================================
INFO: Package database updated

So we then need to get the latest Gumbo from the master branch with Pkg.update() and Pkg.build("Gumbo"), which in turn yields a Gumbo whose parsehtml runs fast.
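
For readers on a current Julia release, the same idea (track the package's development branch instead of the registered release, then rebuild) might look roughly like the sketch below. It is not what was run in this post: it uses the current `Pkg` API rather than the 0.4-era calls quoted above, `rebuild_from_branch` is a hypothetical helper name, and the branch name `master` is simply taken from the text above.

```julia
using Pkg

# Hypothetical helper (not from the original post): install a package from a
# given branch and rebuild it, using the current Pkg API.
# Assumption: the development branch is called "master", as the text says.
function rebuild_from_branch(pkg::String = "Gumbo"; rev::String = "master")
    Pkg.add(Pkg.PackageSpec(name = pkg, rev = rev))  # install from the branch
    Pkg.build(pkg)                                   # run the package's build step, if it has one
    return nothing
end

# Calling it needs network access, so it is left commented out here:
# rebuild_from_branch("Gumbo")
```

Wrapping the two calls in a function keeps the sketch free of side effects until it is called explicitly.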

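Note: the problem was not what one commenter (who did not read the earlier comments carefully) suggested, namely that the JIT compiler made it slow. If you read the back-and-forth between me and @Ismael VC, you will see that I ran his test code exactly as he did and posted the results in my first two comments; my original installation really was too slow. Anyway, the important thing is that parsehtml got up to speed with Ismael's help in our private chat. Thanks again!

Original post:

Julia 0.4.0-dev+7053 HTML parsing is very slow?

Although the Julia language is sold as being fast at many things, it looks slow at basic everyday tasks such as parsing a web page.

Take the http://julialang.org web page, which shows how fast Julia is compared with C, Fortran, R, Matlab, and so on.
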
# using HTTPClient, Gumbo
julia_url = "http://julialang.org"
println("  scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println("  scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println("  parsed: ", Dates.unix2datetime(time()))

gives

scrape start: 2015-09-05T16:47:03.843
scrape end: 2015-09-05T16:47:04.044
parsed: 2015-09-05T16:47:04.41

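This shows that fetching the page takes a couple of hundred milliseconds, which is reasonable on my wifi connection. Parsing this simple page, however, takes close to 400 ms, which sounds prohibitively slow by today's standards.

Running the same test on a slightly more complex web page
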
julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
println("  scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println("  scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println("  parsed: ", Dates.unix2datetime(time()))

gives

scrape start: 2015-09-05T16:57:52.054
scrape end: 2015-09-05T16:57:52.736
parsed: 2015-09-05T16:57:53.699

Parsing takes almost a full second.
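
The durations quoted above follow directly from the printed timestamps. As a quick check, the sketch below subtracts them. It is written in current Julia (1.x) syntax with the `Dates` standard library; `elapsed_ms` is a helper name introduced here, and the printed `04.41` is read as 04.410 seconds.

```julia
using Dates

# Timestamps copied from the two outputs above: (label, start, fetched, parsed).
runs = [
    ("julialang.org",
     DateTime(2015, 9, 5, 16, 47, 3, 843),
     DateTime(2015, 9, 5, 16, 47, 4, 44),
     DateTime(2015, 9, 5, 16, 47, 4, 410)),
    ("quora.com",
     DateTime(2015, 9, 5, 16, 57, 52, 54),
     DateTime(2015, 9, 5, 16, 57, 52, 736),
     DateTime(2015, 9, 5, 16, 57, 53, 699)),
]

# Difference between two DateTime values as an integer number of milliseconds.
elapsed_ms(a::DateTime, b::DateTime) = Dates.value(b - a)

for (label, t_start, t_fetched, t_parsed) in runs
    println(label, ": fetch ", elapsed_ms(t_start, t_fetched),
            " ms, parse ", elapsed_ms(t_fetched, t_parsed), " ms")
end
```

This prints a fetch of 201 ms and a parse of 366 ms for julialang.org, and a fetch of 682 ms and a parse of 963 ms for the Quora page.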

I may be missing something, but is there a better or faster way in Julia to parse a web page or get HTML elements out of it? If so, how?

1 Answer

Stack Overflow user

Accepted answer

Posted 2015-09-06 15:14:27

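First of all, have you gone through the performance tips in the manual? Which Julia version are you using? (versioninfo())

  • http://julia.readthedocs.org/en/latest/manual/performance-tips/

You could start by reading it and, as the documentation suggests, putting your code inside a function. There is also a @time macro, which additionally gives you a hint about memory allocation, like this:

Julia v0.3.11

Tested at: https://juliabox.org
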
using HTTPClient, Gumbo

function test(url::String)
    @show url

    print("Scraping: ")
    @time page = get(url)

    print("Parsing: ")
    @time page = parsehtml(bytestring(page.body))
end

let
    gc_disable()

    url =  "http://julialang.org"

    println("First run:")
    test(url)    # first run JITed

    println("\nSecond run:")
    test(url)

    url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"    

    println("\nThird run:")
    test(url)

    println("\nFourth run:")
    test(url)

    gc_enable()
end

Output:

First run:
url => "http://julialang.org"
Scraping: elapsed time: 0.248092469 seconds (3971912 bytes allocated)
Parsing: elapsed time: 0.850927483 seconds (27207516 bytes allocated)

Second run:
url => "http://julialang.org"
Scraping: elapsed time: 0.055722638 seconds (73952 bytes allocated)
Parsing: elapsed time: 0.005446998 seconds (821800 bytes allocated)

Third run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.282382774 seconds (619324 bytes allocated)
Parsing: elapsed time: 0.227427243 seconds (9728620 bytes allocated)

Fourth run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.288903961 seconds (400272 bytes allocated)
Parsing: elapsed time: 0.017787089 seconds (1516560 bytes allocated)
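
The gap between first and second runs above can be reproduced without any network access. The sketch below is written for current Julia (1.x) rather than the 0.3/0.4 syntax used in this answer; `count_tags` is a hypothetical stand-in for `parsehtml`, and `GC.enable(false)` plays the role of `gc_disable()`. The first call includes compiling `count_tags` for a `String` argument, so it is normally slower than the second. Exact numbers depend on the machine, so none are shown.

```julia
# Hypothetical stand-in for parsehtml: counts '<' characters in a string.
function count_tags(html::AbstractString)
    n = 0
    for c in html
        c == '<' && (n += 1)
    end
    return n
end

html = repeat("<p>hello</p>", 10_000)

# Turn the garbage collector off while timing, as the answer does with
# gc_disable(); GC.enable returns the previous state so it can be restored.
was_enabled = GC.enable(false)
first_run  = @elapsed count_tags(html)   # includes compilation of count_tags
second_run = @elapsed count_tags(html)   # reuses the compiled method
GC.enable(was_enabled)

println("first run:  ", round(first_run * 1000; digits = 3), " ms")
println("second run: ", round(second_run * 1000; digits = 3), " ms")
```

Timing at top level, rather than inside another function, keeps the first measurement from being compiled away together with its caller.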

Here are the timings for your code using @time:

julia_url = "http://julialang.org"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml

First run:

elapsed time: 0.361194892 seconds (11108960 bytes allocated)
elapsed time: 0.996812988 seconds (34546156 bytes allocated, 4.04% gc time)

Second run:

elapsed time: 0.018920084 seconds (77952 bytes allocated)
elapsed time: 0.006632215 seconds (823256 bytes allocated)

For the second URL:

julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml

First run:

elapsed time: 0.33795947 seconds (535916 bytes allocated)
elapsed time: 0.224386491 seconds (9729852 bytes allocated)

Second run:

elapsed time: 0.276848452 seconds (584944 bytes allocated)
elapsed time: 0.018806686 seconds (1517856 bytes allocated)

Edit: v0.4.0-dev+7053

In version 0.4+, make sure to run Pkg.checkout("Gumbo") first to get the latest commits, and then Pkg.build("Gumbo"). Doing this in JuliaBox:

http://nbviewer.ipython.org/gist/Ismael-VC/4c241228f04ed54c70e2

First run:
url = "http://julialang.org"
Scraping:   0.227681 seconds (85.11 k allocations: 3.585 MB)
Parsing:   0.696063 seconds (799.12 k allocations: 29.450 MB)

Second run:
url = "http://julialang.org"
Scraping:   0.018953 seconds (571 allocations: 69.344 KB)
Parsing:   0.007132 seconds (15.91 k allocations: 916.313 KB)

Third run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping:   0.313128 seconds (4.86 k allocations: 608.850 KB)
Parsing:   0.196110 seconds (270.17 k allocations: 10.356 MB)

Fourth run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping:  0.307949 seconds (1.41 k allocations: 470.953 KB)
Parsing:   0.019801 seconds (23.82 k allocations: 1.627 MB)

Votes: 5

Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/32415689
