Craigslist search-across-regions script
Code Review user
Asked 2013-06-27 06:35:18
Answers: 1 · Views: 513 · Followers: 0 · Votes: 5

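I'm a JavaScript developer. I'm pretty sure that will be immediately apparent in the code below, if for no other reason than the level/depth of chaining I'm comfortable with. However, I'm learning Ruby, so I'd love to write beautiful Ruby code too. My first simple project is a Craigslist search-across-regions script.

The full code is on GitHub, but it is broken down into question snippets below.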

def get_us_regions()
  # Accumulator for building up the returned object.
  results = {}

  # Important URLs
  sitelist = URI('http://www.craigslist.org/about/sites')
  geospatial = URI('http://www.craigslist.org/about/areas.json')

  # Get a collection of nodes for the US regions out of craigslist's site list.
  usregions =  Nokogiri::HTML(Net::HTTP.get(sitelist)).search("a[name=US]").first().parent().next_element().search('a')

  # Parse out the information to build a usable representation.
  usregions.each { |usregion|
    hostname = usregion.attr('href').gsub('http://','').gsub('.craigslist.org','')
    results[hostname] = { name: usregion.content, state: usregion.parent().parent().previous_element().content }
  }

  # Merge that information with craigslist's geographic information.
  areas = JSON.parse(Net::HTTP.get(geospatial))
  areas.each { |area|
    if results[area["hostname"]]
      results[area["hostname"]][:stateabbrev] = area["region"]
      results[area["hostname"]][:latitude] = area["lat"]
      results[area["hostname"]][:longitude] = area["lon"]
    end
  }

  # This is a complete list of the US regions, keyed off of their hostname.
  return results
end

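Bootstrapping

  • How should I go about fetching the information the program needs to get started?
  • Would that change if I were doing this at the bootstrap of a long-running application and wanted to refresh it, say, monthly?
  • Should I dump this into a class to really abstract it?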

This isn't JS

  • What about chaining calls onto class methods?
  • Why would I use string keys rather than those odd named keys (such as `name:`)?

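Object-ions

  • Should I create an object for each region and feed the parts I parse out of the document into its constructor?
  • If I did, should that constructor just accept a DOM node and be smart about figuring out what I passed it?
  • What is the "right" way to reopen an object, given that I have to collate across two sources?

# Perform a search in a particular region.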
def search_region(regionhostname, query)
  # In case there are multiple pages of results from a search
  pages = []
  pagecount = false

  # An accumulator for storing what we need to return.
  result = []

  # Make requests for every page.
  while (pages.length != pagecount)
    # End up with a start of "0" on the first time, 100 is craigslist's page length.
    page = pages.length * 100    

    # Here is the URL we'll be making the request of.
    url = URI("http://#{regionhostname}.craigslist.org/search/cto?query=#{query}&srchType=T&s=#{page}")

    # Get the response and parse it.
    pages << Nokogiri::HTML(Net::HTTP.get(url))

    # If this is the first time through
    if (pagecount == false)

      #check to make sure there are results.
      if pages.last().search('.resulttotal').length() != 0
        # There are results, and we need to see if additional requests are necessary.
        pagecount = (pages.last().search('.resulttotal').first().content().gsub(/[^0-9]/,'').to_i / 100.0).ceil
      else
        # There are no results, we're done here.
        return []
      end
    end
  end

  # Go through each of the pages of results and process the listings
  pages.each { |page|
    # Go through all of the listings on each page
    page.search('.row').each { |listing|
      # Skip listings from other regions in case there are any ("FEW LOCAL RESULTS FOUND").
      if listing.search('a[href^=http]').length() != 0
        next
      end

      # Parse information out of the listing.
      car = {}
      car["id"] = listing["data-pid"]
      car["date"] = listing.search(".date").length() == 1 ? Date.parse(listing.search(".date").first().content) : nil
      # When Craigslist wraps at the end of the year it doesn't add a year field.
      # Fortunately Craigslist has an approximately one month time limit that makes it easy to know which year is being referred to.
      # Overshooting by incrementing the month to make sure that timezone differences between this and CL servers don't result in weirdness
      if car["date"].month > Date.today.month + 1
        car["date"] = car["date"].prev_year
      end
      car["link"] = "http://#{regionhostname}.craigslist.org/cto/#{car['id']}.html"
      car["description"] = listing.search(".pl > a").length() == 1 ? listing.search(".pl > a").first().content : nil
      car["price"] = listing.search("span.price").length() == 1 ? listing.search("span.price").first().content : nil
      car["location"] = listing.search(".l2 small").length() == 1 ? listing.search(".l2 small").first().content.gsub(/[\(\)]/,'').strip : nil
      car["longitude"] = listing["data-longitude"]
      car["latitude"] = listing["data-latitude"]

      # Pull car model year from description
      # Can be wrong, but likely to be accurate.
      if /(?:\b19[0-9]{2}\b|\b20[0-9]{2}\b|\b[0-9]{2}\b)/.match(car["description"]) { |result|

        # Two digit year
        if result[0].length == 2
          # Not an arbitrary wrapping point like it is in MySQL, etc.
          # Cars have known manufacture dates and can't be too far in the future.
          if result[0].to_i <= Date.today.strftime("%y").to_i + 1
            car["year"] = "20#{result[0]}"
          else
            car["year"] = "19#{result[0]}"
          end
        # Four digit year is easy.
        elsif result[0].length == 4
          car["year"] = result[0]
        end
      }
      else
        car["year"] = nil
      end

      # Store the region lookup key.
      car["regionhostname"] = regionhostname

      result << car
    }
  }

  return result
end

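Car vs. Listing

  • I now have two possibly "competing" objects if I were to move this into classes. A listing describes a car, but I care about information from both. Should I store both and link them? A listing that "has a" car?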

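Results pages

  • Should each page be an object? The first thing I need to find out is how many pages I need to request.
  • How do I keep it from running serially? Should I bubble these up by returning functions? Is that really possible to do cleanly in Ruby?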
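On the page-count point, one possible sketch: once the total number of results is known from the first response, every page offset can be computed up front, which also makes the remaining requests independent of each other. The helper name `page_offsets` is hypothetical; the page size of 100 is taken from the comment in the code above.

```ruby
# Compute the `s=` offsets for every page of results.
# `total` is the result count parsed from the first page.
def page_offsets(total, page_size = 100)
  (0...total).step(page_size).to_a
end

page_offsets(250)  # offsets for three pages
page_offsets(0)    # no results, so no pages
```

Offset 0 has already been fetched by the time the total is known, so only the remaining offsets would still need requests.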

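If the code looks like this...

  • Checking in an if statement whether something exists (and calling the method twice to do so) is bad. But if I try to access something that doesn't exist, it throws ugly errors.
  • The ternary is the best I've found. Are there other tricks?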
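One possible trick, sketched under assumptions: `first` on an empty collection returns nil, and the safe navigation operator `&.` (Ruby 2.3 and later, so newer than this question) skips the call when the receiver is nil. `FakeNode` below is a hypothetical stand-in for a parsed node so the example runs without Nokogiri; with Nokogiri itself, `at_css` returns the first match or nil and can play the role of `search(...).first`.

```ruby
# Hypothetical stand-in for a parsed node; real nodes also respond to #content.
FakeNode = Struct.new(:content)

# Return the content of the first node, or nil when nothing matched.
def first_content(nodes)
  nodes.first&.content
end

# Same idea, with the clean-up applied only when a value is present.
def cleaned_location(nodes)
  first_content(nodes)&.gsub(/[()]/, '')&.strip
end

first_content([FakeNode.new("$4500")])
first_content([])
cleaned_location([FakeNode.new(" (Portland) ")])
```

Note one behavioral difference: the ternaries above yield nil unless exactly one node matched, whereas this sketch takes the first of several matches.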

Misc.

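  • Is `next` well regarded?
  • Is there an idiom for pulling information out of a match object?
  • What about building relationships? Am I doing that right?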
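On the match-object point, a commonly used idiom is `String#[]` with a Regexp, which returns the matched text or nil and so avoids the block form of `match`. The sketch below applies it to the model-year extraction; the method name `model_year` and the `today` parameter (added so the result can be checked against a fixed date) are assumptions, not part of the original script.

```ruby
require 'date'

# Extract a model year from free text; returns a String such as "2003", or nil.
def model_year(description, today = Date.today)
  return nil if description.nil?

  # String#[] with a Regexp returns the matched text, or nil when nothing matches.
  four_digit = description[/\b(?:19|20)\d{2}\b/]
  return four_digit if four_digit

  two_digit = description[/\b\d{2}\b/]
  return nil unless two_digit

  # Two-digit years up to next year's are read as 20xx, anything later as 19xx.
  cutoff = (today.year % 100) + 1
  two_digit.to_i <= cutoff ? "20#{two_digit}" : "19#{two_digit}"
end

model_year("2003 VW Jetta TDI")
model_year("85 Rabbit diesel", Date.new(2013, 6, 27))
```

Unlike the single alternation in the script, this version prefers a four-digit year anywhere in the text over an earlier two-digit number, and it tolerates a nil description.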
def search(query)
  results = []

  # Get a copy of the regions we're going to search.
  regions = get_us_regions()

  # Divide the requests to each region across the "right" number of threads.
  iterations = 5
  count = (regions.length/iterations.to_f).ceil

  # Spin up the threads!
  (0..(iterations-1)).each { |iteration|
    threads = []

    # Protect against source exhaustion
    if iteration * count > regions.length()
      next
    end

    # Split the requests by region.
    regions.keys.slice(iteration*count,count).each { |regionhostname|
      threads << Thread.new(regionhostname) { |activeregionhostname|
        # New block for proper scoping of regionhostname
        results << search_region(activeregionhostname, query)
      }
    }

    # Wait until all threads are complete before kicking off the next set.
    threads.each { |thread| thread.join }
  }

  # From search_region we return an array, which means we need to flatten(1) to pull everything up to the top level.
  results = results.flatten(1)

  # Sort the search results by date, descending.
  results.sort! { |a,b|
    if a["date"] == b["date"]
      b["id"].to_i <=> a["id"].to_i
    else
      b["date"] <=> a["date"]
    end
  }

  return results
end

puts search("TDI").to_json

public static void main

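  • Threading! Asynchronous code makes sense to me, but if I create too many threads at once, my (Ruby) threads crash. Is there an idiom for queueing up work for a set of worker threads?
  • `for` statements? Or (0..5).each {x=0 }?
  • Globals for collections of objects? Only in "main"?
  • Am I getting the naming conventions wrong?
  • Is there anything else I should be asking?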
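For the worker-thread point, one common pattern is a fixed number of threads pulling jobs from a `Queue`. This is a minimal sketch, assuming Ruby 2.3 or later for `Queue#close`; the helper name `each_in_pool` is hypothetical.

```ruby
# Run the block for every job on a fixed number of worker threads.
# Results come back in completion order, not input order.
def each_in_pool(jobs, workers: 5)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # once closed and drained, pop returns nil instead of blocking

  results = Queue.new # thread-safe collector, unlike a shared Array
  threads = Array.new(workers) do
    Thread.new do
      # A nil or false job would also end this loop, so jobs must be truthy.
      while (job = queue.pop)
        results << yield(job)
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }
end

each_in_pool(%w[portland seattle boise], workers: 2) { |host| "searched #{host}" }
```

With something like this, the search could hand all region hostnames to the pool in one call instead of slicing them into batches and joining each batch in turn.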

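Conclusion

The code works, and you can use it to get results from every region when searching for cars on Craigslist. That is handy for rare or hard-to-find vehicles. I'd like the threading to be better, and to run the multiple requests that come from pagination on different threads, but I would need some kind of pooling to handle that. Eventually I'm thinking of integrating Rack to turn this into a simple vehicle search API. Or it could get smarter and store results in a database to track prices over time, creating better-educated sellers and consumers, or flagging good deals.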

1 Answer

Code Review user

Accepted answer

Answered 2013-06-27 20:55:48

Long, long question! I'll take your first snippet and leave the rest for others. First, some comments on your code:

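  • def get_us_regions(): putting those () on a method without arguments is not idiomatic.
  • first(): writing them on calls without arguments is not idiomatic either.
  • results = {}: I've written a lot about this subject on CR, so I'll just give the link: Functional programming with Ruby
  • # Important URLs: not sure they're important enough to deserve a comment :-)
  • Nokogiri::HTML(...) ... long expression: expressions can be chained without end, so you have to decide when to break them up and give the parts meaningful names. I'd break this one into at least two subexpressions.
  • gsub('http://','').gsub('.craigslist.org',''): use the URI module instead of manipulating URLs by hand.
  • results[area["hostname"]][:stateabbrev]: again, a functional approach makes this kind of expression more concise and clear.
  • return results: explicit returns are not idiomatic.
  • def get_us_regions: when a method can so trivially be made configurable, do it. Here, pass the country as an argument -> def get_regions(country_code)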
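A few of these points can be illustrated with the standard library alone. This is only a sketch: the helper names `region_key` and `index_by_region` and the sample link data are hypothetical, and `Array#to_h` needs Ruby 2.1 or later.

```ruby
require 'uri'

# Let URI do the parsing instead of chained gsub calls,
# e.g. "http://sfbay.craigslist.org" -> "sfbay".
def region_key(href)
  URI.parse(href).host.split('.').first
end

# Build the hash in one expression rather than filling an accumulator.
# The last expression is the return value, so no explicit `return` is needed.
def index_by_region(links)
  links.map { |href, name| [region_key(href), { name: name }] }.to_h
end

index_by_region([["http://sfbay.craigslist.org", "SF bay area"]])
```

`region_key` assumes the href is an absolute URL with a host; a link without one would make `host` nil and raise.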

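Now, here is how I'd write the method. First of all, I'd use Facets, an excellent library with lots of cool abstractions that are not available in the core: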

require 'uri'
require 'nokogiri'
require 'net/http'
require 'json'
require 'facets'

module CraigsList
  SitelistUrl = 'http://www.craigslist.org/about/sites'
  GeospatialUrl = 'http://www.craigslist.org/about/areas.json'

  # Return hash of pairs (hostname, {:name, :state, :stateabbrev, :latitude, :longitude})
  # for the regions of the given country in craigslist.
  def self.get_regions(country_code)
    doc = Nokogiri::HTML(Net::HTTP.get(URI(SitelistUrl)))
    usregions = doc.search("a[name=#{country_code}]").first.parent.next_element.search('a')
    state_info = usregions.mash do |usregion|
      hostname = URI.parse(usregion.attr('href')).host.split(".").first
      state = usregion.parent.parent.previous_element.content
      info = {name: usregion.content, state: state}
      [hostname, info] 
    end

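    areas = JSON.parse(Net::HTTP.get(URI(GeospatialUrl)))
    # The question's code iterates areas as an Array of hashes, so filter by
    # hostname here; Array#slice cannot select by hostname.
    geo_info = areas.select { |area| state_info.key?(area["hostname"]) }.mash do |area|
      info = {stateabbrev: area["region"], latitude: area["lat"], longitude: area["lon"]}
      [area["hostname"], info]
    end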

    state_info.deep_merge(geo_info)
  end
end
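For readers without Facets, roughly the same shape can be sketched with core Ruby only. This assumes `mash` builds a hash from [key, value] pairs, as the code above implies; the method name `merge_regions` and the sample values are made up for illustration.

```ruby
# Combine per-region info from the site list with geographic data.
# `state_info` is a Hash keyed by hostname; `areas` is an Array of Hashes.
def merge_regions(state_info, areas)
  # map + to_h plays the role of `mash` here.
  geo_info = areas
    .select { |area| state_info.key?(area["hostname"]) }
    .map { |area| [area["hostname"], { stateabbrev: area["region"], latitude: area["lat"], longitude: area["lon"] }] }
    .to_h

  # A one-level merge with a block is enough for this two-level shape.
  state_info.merge(geo_info) { |_hostname, info, geo| info.merge(geo) }
end

# Made-up sample data standing in for the two parsed sources.
state_info = {
  "sfbay"   => { name: "SF bay area", state: "California" },
  "seattle" => { name: "seattle-tacoma", state: "Washington" }
}
areas = [
  { "hostname" => "sfbay",   "region" => "CA", "lat" => 37.5,  "lon" => -122.25 },
  { "hostname" => "toronto", "region" => "ON", "lat" => 43.65, "lon" => -79.38 }
]

merge_regions(state_info, areas)
```

Regions missing from the geographic data keep only their name and state, and areas outside the selected country are dropped.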

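You mention that you write JavaScript code. The good thing about a functional approach is that the code is the same in any language (syntactic differences aside), provided the language has minimal functional capabilities. In JS (although the FP style is friendlier in CoffeeScript) with Underscore (plus custom abstractions as mixins), you can write the same code.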

Votes: 2

Original page content provided by Code Review.
Original link:

https://codereview.stackexchange.com/questions/27832
