Scraping Google URLs with Mechanize

Stack Overflow user
Asked 2016-03-08 18:36:23
2 answers, 310 views, 0 followers, 0 votes

I have a program that searches Google using one or more keywords, which are passed as arguments when the program is run:

Example: pull_sites.rb "testing" returns these sites >>>

https://en.wikipedia.org/wiki/Software_testing
http://en.wikipedia.org/wiki/Test_automation
http://www.istqb.org/about-istqb.html
http://softwaretestingfundamentals.com/test-plan/
https://en.wikipedia.org/wiki/Software_testing
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:9qU2GDLzZzEJ:https://en.wikipedia.org/wiki/Software_testing%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://en.wikipedia.org/wiki/Test_strategy
https://en.wikipedia.org/wiki/Category:Software_testing
https://en.wikipedia.org/wiki/Test_automation
https://en.wikipedia.org/wiki/Portal:Software_testing
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ:https://en.wikipedia.org/wiki/Test%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://en.wikipedia.org/wiki/Unit_testing
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:G9V8uRLkPjIJ:https://en.wikipedia.org/wiki/Unit_testing%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://testing.byu.edu/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:d9bGrCHr9fsJ:https://testing.byu.edu/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J:https://www.test.com/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://ddce.utexas.edu/disability/using-testing-accommodations/
http://blogs.vmware.com/virtualblocks/2015/07/06/vsan-vs-nutanix-head-to-head-performance-testing-part-4-exchange/
http://www.networkforgood.com/nonprofitblog/testing-101-4-steps-optimizing-your-fundraising-approach/
http://www.auslea.com/software-testing-training.html
http://academy.littletonpublicschools.net/Default.aspx%3Ftabid%3D12807%26articleType%3DArticleView%26articleId%3D2400
https://golang.org/pkg/testing/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:EALG7Jlm9eoJ:https://golang.org/pkg/testing/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J:http://www.speedtest.net/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydoJ:https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://www.act.org/content/act/en/products-and-services/the-act/test-preparation.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:pAzlNJl3YY4J:http://www.act.org/content/act/en/products-and-services/the-act/test-preparation.html%252Btesting%26gbv%3D1%26%26ct%3Dclnk

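It works as expected, but it only scrapes the first page of Google results. Is it possible to search, say, pages 1 through 5?

Here is the source that does the scraping: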

  def get_urls
    puts "Searching...".green
    agent = Mechanize.new
    page = agent.get('http://www.google.com/')
    google_form = page.form('f')
    google_form.q = "#{SEARCH}" #SEARCH is the parameter given when program is run
    page = agent.submit(google_form, google_form.buttons.first)
    page.links.each do |link|
      if link.href.to_s =~/url.q/
        str=link.href.to_s
        strList=str.split(%r{=|&}) 
        url=strList[1] 
        File.open("links.txt", "a+"){ |s| s.puts(url) }
      end
    end 
  end

2 Answers

Stack Overflow user

Accepted answer

Answered 2016-03-08 23:50:19

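OK, if you are using Google Chrome or Firefox, open the developer tools. They will help you identify the links you want to click automatically. When you run a Google search and scroll to the bottom, you will see the page-number links. Using the browser's developer tools, identify the class or id that Google assigns to those page-number links. Then use Mechanize's click method to follow them. For example, if the link is labeled "next", you can use something as simple as: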

page2 = page1.link_with(:text => "next").click

I'm answering from my phone, so to save yourself some time, google "click link with Mechanize" for more details.
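The click-through approach from this answer can be sketched as a small loop. This is a minimal sketch, not a definitive implementation: the helper name `collect_result_pages`, the `max_pages` limit, and the link text "Next" are assumptions for illustration, and the actual link text should be confirmed with the browser's developer tools as described above. It relies only on `link_with` and `click`, which Mechanize pages and links provide.

```ruby
# Hypothetical helper: follows the "next page" link until max_pages pages
# have been collected or no further link is found.
# `first_page` is expected to respond to link_with (e.g. a Mechanize::Page).
# The default link text 'Next' is an assumption; check the real markup.
def collect_result_pages(first_page, max_pages: 5, next_text: 'Next')
  pages = [first_page]
  while pages.size < max_pages
    next_link = pages.last.link_with(text: next_text)
    break if next_link.nil? # no further result pages

    pages << next_link.click # click returns the page the link points to
  end
  pages
end

# Possible usage with Mechanize (needs the mechanize gem and network access):
#   require 'mechanize'
#   agent = Mechanize.new
#   first_page = agent.get('http://www.google.com/search?q=testing')
#   collect_result_pages(first_page, max_pages: 5).each do |page|
#     page.links.each { |link| puts link.href if link.href.to_s =~ /url.q/ }
#   end
```

Collecting the pages first keeps the link-extraction code from the question unchanged; it simply runs once per page instead of once overall.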

Votes: 1

Stack Overflow user

Answered 2016-03-09 01:08:53

It's a simple GET form, so it is much easier to just make the requests yourself:

https://www.google.com/search?q=foo
https://www.google.com/search?q=foo&start=10
https://www.google.com/search?q=foo&start=20
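The URL pattern above can be generated rather than typed by hand. The following is a rough sketch under stated assumptions: the helper name `search_page_urls` and the constant `RESULTS_PER_PAGE` are made up for illustration, and the step of 10 is inferred from the `start=10` and `start=20` examples rather than from any documented guarantee.

```ruby
require 'uri'

# Assumption: each results page advances `start` by 10, as in the URLs above.
RESULTS_PER_PAGE = 10

# Hypothetical helper: builds one search URL per results page.
# The first page carries no `start` parameter, matching the first URL above.
def search_page_urls(query, pages: 5)
  (0...pages).map do |index|
    params = { q: query }
    params[:start] = index * RESULTS_PER_PAGE if index.positive?
    # encode_www_form escapes the query, e.g. spaces become "+"
    "https://www.google.com/search?#{URI.encode_www_form(params)}"
  end
end

puts search_page_urls('foo', pages: 3)
```

Each generated URL could then be passed to `agent.get` in place of submitting the search form, with the link-extraction loop from the question applied to every returned page.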
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/35875304
