I'm a beginner with Clojure, and I thought I'd try building a web crawler with core.async.
I have it working, but I'm looking for feedback on the following points:
Go blocks: are there places where `thread` would be a better fit? The take from urls-chan has a 3-second timeout, and if the timeout wins I assume we're done crawling. That doesn't seem very robust. Here's the main part of the code:
(def visited-urls (atom #{}))
(def site-map (atom {}))
;; I've given my two channels massive buffers here because I don't want to drop
;; values. I'm not quite sure why they need to be so big; anything smaller gives me:
;; Exception in thread "async-dispatch-1626" java.lang.AssertionError:
;; Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
;; (< (.size puts) impl/MAX-QUEUE-SIZE)
(def urls-chan (chan 102400))
(def log-chan (chan 102400))
(def exit-chan (chan 1))
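(For context on that AssertionError: core.async caps the number of *pending* puts on a channel at 1024, and every `(go (>! urls-chan url))` that can't complete immediately parks as a pending put. A minimal sketch, not from the original post, that reproduces the limit with `put!`, which trips the same assert synchronously in the caller:)

```clojure
(require '[clojure.core.async :refer [chan put!]])

;; An unbuffered channel with no takers: every put! parks as a pending put.
;; core.async allows at most 1024 of these per channel; the 1025th throws
;; the same "No more than 1024 pending puts" AssertionError seen above.
(let [ch (chan)]
  (dotimes [i 1024] (put! ch i))          ; fills the pending-put queue
  (try
    (put! ch 1024)                        ; one too many
    :no-error
    (catch AssertionError e :too-many-pending-puts)))
```

This is why a large buffer "fixes" it: buffered puts complete immediately instead of parking, so the pending-put queue never fills.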
(defn get-doc
"Fetches a parsed html page from the given url and places onto a channel"
[url]
(go (let [{:keys [error body opts headers]} (<! (async-get url))
content-type (:content-type headers)]
(if (or error (not (.startsWith content-type "text/html")))
(do (log "error fetching" url)
false)
(Jsoup/parse body (base-url (:url opts)))))))
;; Main event loop
(defn start-consumers
"Spins up n go blocks to take a url from urls-chan, store its assets and then
puts its links onto urls-chan, repeating until there are no more urls to take"
[n domain]
(dotimes [_ n]
(go-loop [url (<! urls-chan)]
(when-not (@visited-urls url)
(log "crawling" url)
(swap! visited-urls conj url)
(when-let [doc (<! (get-doc url))]
(swap! site-map assoc url (get-assets doc))
(doseq [url (get-links doc domain)]
(go (>! urls-chan url)))))
;; Take the next url off the q, if 3 secs go by assume no more are coming
(let [[value channel] (alts! [urls-chan (timeout 3000)])]
(if (= channel urls-chan)
(recur value)
(>! exit-chan true))))))
(defn -main
"Crawls [domain] for links to assets"
[domain]
(let [start-time (System/currentTimeMillis)]
(start-logger)
(log "Beginning crawl of" domain)
(start-consumers 40 domain)
;; Kick off with the first url
(>!! urls-chan domain)
(<!! exit-chan)
(println (json/write-str @site-map))
(<!! (log "Completed after" (seconds-since start-time) "seconds"))))

Posted on 2014-12-31 10:41:08
Currently, at `(when-not (@visited-urls url))`, more than one consumer can observe the same unvisited url at the same time. They will then crawl that url more than once, which is unintended, though it doesn't appear to break anything.
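(One way to close that check-then-act race while keeping the atom: make the membership test and the `conj` a single atomic step. A minimal sketch, not from the original post, using `clojure.core/swap-vals!` from Clojure 1.9+; `claim-url!` is a name invented here:)

```clojure
(def visited-urls (atom #{}))

;; swap-vals! returns [old-value new-value], so the membership test and the
;; conj happen in one atomic swap -- exactly one consumer "wins" each url.
(defn claim-url! [url]
  (let [[old _new] (swap-vals! visited-urls conj url)]
    (not (contains? old url))))

;; usage in the consumer loop: (when (claim-url! url) ... crawl it ...)
```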
I can't see a much better way within this design. In fact, the atom doesn't buy you anything here, since all it does is mutate a piece of global state; I think a java.util.concurrent.ConcurrentHashMap would do the job better. visited-urls could be a map from URL to a boolean marking the URL as visited, and the condition becomes checking that .putIfAbsent(url, true) returns null.
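(Sketched out in Clojure interop, that condition might look like this; `visited` and `claim-url!` are names invented for the example:)

```clojure
(import 'java.util.concurrent.ConcurrentHashMap)

(def visited (ConcurrentHashMap.))

;; putIfAbsent returns nil only for the first thread to insert the key,
;; so the test-and-set is a single atomic operation on the map.
(defn claim-url! [url]
  (nil? (.putIfAbsent visited url true)))
```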
https://codereview.stackexchange.com/questions/62147