
Clojure core.async web crawler

Code Review user
Asked on 2014-09-06 18:38:34
1 answer · 1.1K views · 0 following · 6 votes

I'm a beginner with Clojure, and I thought I'd try building a web crawler with core.async.

It works, but I'm looking for feedback on the following points:

  • How can I avoid using huge buffers when I don't want to drop values?
  • Am I using go blocks effectively? Are there places where thread would be a better fit?
  • How can I better determine when the crawl is finished? At the moment I race a take from urls-chan against a 3-second timeout, and if the timeout wins I assume we're done. That doesn't seem very robust.

Here is the main part of the code:

(def visited-urls (atom #{}))
(def site-map (atom {}))

;; I've given massive buffers my two channels here because I don't want to drop
;; values. I'm not quite sure why they need to be so big, but anything smaller gives me:
;; Exception in thread "async-dispatch-1626" java.lang.AssertionError:
;;   Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
;;   (< (.size puts) impl/MAX-QUEUE-SIZE)
(def urls-chan (chan 102400))
(def log-chan (chan 102400))

(def exit-chan (chan 1))

(defn get-doc
  "Fetches a parsed html page from the given url and places onto a channel"
  [url]
  (go (let [{:keys [error body opts headers]} (<! (async-get url))
            content-type (:content-type headers)]
        (if (or error (not (.startsWith content-type "text/html")))
          (do (log "error fetching" url)
              false)
          (Jsoup/parse body (base-url (:url opts)))))))

;; Main event loop
(defn start-consumers
  "Spins up n go blocks to take a url from urls-chan, store its assets and then
  puts its links onto urls-chan, repeating until there are no more urls to take"
  [n domain]
  (dotimes [_ n]
    (go-loop [url (<! urls-chan)]
             (when-not (@visited-urls url)
               (log "crawling" url)
               (swap! visited-urls conj url)
               (when-let [doc (<! (get-doc url))]
                 (swap! site-map assoc url (get-assets doc))
                 (doseq [url (get-links doc domain)]
                   (go (>! urls-chan url)))))
             ;; Take the next url off the q, if 3 secs go by assume no more are coming
             (let [[value channel] (alts! [urls-chan (timeout 3000)])]
               (if (= channel urls-chan)
                 (recur value)
                 (>! exit-chan true))))))

(defn -main
  "Crawls [domain] for links to assets"
  [domain]
  (let [start-time (System/currentTimeMillis)]
    (start-logger)
    (log "Begining crawl of" domain)
    (start-consumers 40 domain)
    ;; Kick off with the first url
    (>!! urls-chan domain)
    (<!! exit-chan)
    (println (json/write-str @site-map))
    (<!! (log "Completed after" (seconds-since start-time) "seconds"))))

1 Answer

Code Review user

Answered on 2014-12-31 10:41:08

At the moment, with (when-not (@visited-urls url)), more than one of the consumers may look at the same unvisited url at the same time. They will then crawl the same url, which is unintended, though it doesn't seem to break anything.

I can't see an obviously nicer way to structure that. In fact, the atom doesn't buy you anything here, since all you do with it is mutate global state. I think a java.util.concurrent.ConcurrentHashMap would work better: visited-urls can be a map from URL to a boolean indicating whether that URL has been visited, and the condition becomes checking that .putIfAbsent(url, true) returns null.
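A minimal sketch of that suggestion via Clojure's Java interop (the claim-url! helper name is my own, not from the answer):

```clojure
(import 'java.util.concurrent.ConcurrentHashMap)

;; Replaces the visited-urls atom: check-and-mark happens in one
;; atomic step inside the map, so concurrent go blocks can't race.
(def visited-urls (ConcurrentHashMap.))

(defn claim-url!
  "Returns true exactly once per url. putIfAbsent returns nil only
  for the first caller to insert the key; every later caller sees
  the existing value and gets false."
  [url]
  (nil? (.putIfAbsent visited-urls url true)))
```

The consumer loop would then use (when (claim-url! url) ...) in place of the (when-not (@visited-urls url) ...) check plus the separate (swap! visited-urls conj url).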

Score: 1
Original page content provided by Code Review.
Original link:

https://codereview.stackexchange.com/questions/62147
