
Clojure core.async web crawler

Code Review user
Asked on 2014-09-06 18:38:34
1 answer · 1.1K views · 0 following · 6 votes

I'm a beginner with Clojure, and I thought I'd try building a web crawler with core.async.

It works, but I'm looking for feedback on the following points:

  • How can I avoid using huge buffers when I don't want to drop values?
  • Am I using go blocks effectively? Are there places where thread would be a better fit?
  • How can I better determine when the crawl is finished? At the moment I race a take from urls-chan against a 3-second timeout, and if the timeout wins I assume we're done. That doesn't seem very robust.

Here is the main part of the code:

(def visited-urls (atom #{}))
(def site-map (atom {}))

;; I've given massive buffers my two channels here because I don't want to drop
;; values. I'm not quite sure why they need to be so big, but anything smaller gives me:
;; Exception in thread "async-dispatch-1626" java.lang.AssertionError:
;;   Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
;;   (< (.size puts) impl/MAX-QUEUE-SIZE)
(def urls-chan (chan 102400))
(def log-chan (chan 102400))

(def exit-chan (chan 1))

(defn get-doc
  "Fetches a parsed html page from the given url and places onto a channel"
  [url]
  (go (let [{:keys [error body opts headers]} (<! (async-get url))
            content-type (:content-type headers)]
        (if (or error (not (.startsWith content-type "text/html")))
          (do (log "error fetching" url)
              false)
          (Jsoup/parse body (base-url (:url opts)))))))

;; Main event loop
(defn start-consumers
  "Spins up n go blocks to take a url from urls-chan, store its assets and then
  puts its links onto urls-chan, repeating until there are no more urls to take"
  [n domain]
  (dotimes [_ n]
    (go-loop [url (<! urls-chan)]
             (when-not (@visited-urls url)
               (log "crawling" url)
               (swap! visited-urls conj url)
               (when-let [doc (<! (get-doc url))]
                 (swap! site-map assoc url (get-assets doc))
                 (doseq [url (get-links doc domain)]
                   (go (>! urls-chan url)))))
             ;; Take the next url off the q, if 3 secs go by assume no more are coming
             (let [[value channel] (alts! [urls-chan (timeout 3000)])]
               (if (= channel urls-chan)
                 (recur value)
                 (>! exit-chan true))))))

(defn -main
  "Crawls [domain] for links to assets"
  [domain]
  (let [start-time (System/currentTimeMillis)]
    (start-logger)
    (log "Begining crawl of" domain)
    (start-consumers 40 domain)
    ;; Kick off with the first url
    (>!! urls-chan domain)
    (<!! exit-chan)
    (println (json/write-str @site-map))
    (<!! (log "Completed after" (seconds-since start-time) "seconds"))))

1 Answer

Code Review user

Answered on 2014-12-31 10:41:08

At the moment, with (when-not (@visited-urls url)), more than one of the consumers may look at the same unvisited url at the same time. They will then crawl the same url, which is unintended, though it doesn't seem to break anything.

I can't see an obviously nicer way to structure that. In fact, the atom doesn't buy you anything here, since all you do with it is mutate global state. I think a java.util.concurrent.ConcurrentHashMap would work better: visited-urls can be a map from URL to a boolean indicating whether that URL has been visited, and the condition becomes checking that .putIfAbsent(url, true) returns null.
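A minimal sketch of that suggestion via Clojure's Java interop (the claim-url! helper name is my own, not from the answer):

```clojure
(import 'java.util.concurrent.ConcurrentHashMap)

;; Replaces the visited-urls atom: check-and-mark happens in one
;; atomic step inside the map, so concurrent go blocks can't race.
(def visited-urls (ConcurrentHashMap.))

(defn claim-url!
  "Returns true exactly once per url. putIfAbsent returns nil only
  for the first caller to insert the key; every later caller sees
  the existing value and gets false."
  [url]
  (nil? (.putIfAbsent visited-urls url true)))
```

The consumer loop would then use (when (claim-url! url) ...) in place of the (when-not (@visited-urls url) ...) check plus the separate (swap! visited-urls conj url).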

Score: 1
Original page content provided by Code Review.
Original link:

https://codereview.stackexchange.com/questions/62147
