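This is a fairly long question and I may have left something out, so please ask if you need more information.
I have been scraping data from Google Scholar. Until recently I simply kept all the URLs in a string like this: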
elec_urls = """http://1.hidemyass.com/ip-5/encoded/Oi8vc2Nob2xhci5nb29nbGUuY29tL2NpdGF0aW9ucz91c2VyPWo0YnRpeXNBQUFBSiZobD1lbg%3D%3D&f=norefer
http://4.hidemyass.com/ip-1/encoded/Oi8vc2Nob2xhci5nb29nbGUuY29tL2NpdGF0aW9ucz91c2VyPVZXaFJiZEFBQUFBSiZobD1lbg%3D%3D&f=norefer
http://4.hidemyass.com/ip-2/encoded/Oi8vc2Nob2xhci5nb29nbGUuY29tL2NpdGF0aW9ucz91c2VyPV84X09JSWNBQUFBSiZobD1lbg%3D%3D&f=norefer
http://1.hidemyass.com/ip-4/encoded/Oi8vc2Nob2xhci5nb29nbGUuY29tL2NpdGF0aW9ucz91c2VyPUh3WHdmTGtBQUFBSiZobD1lbg%3D%3D&f=norefer
http://4.hidemyass.com/ip-1/encoded/Oi8vc2Nob2xhci5nb29nbGUuY29tL2NpdGF0aW9ucz91c2VyPXU1NWFWZEFBQUFBSiZobD1lbg%3D%3D&f=norefer
""".strip()
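elec_urls = elec_urls.splitlines()

I then scrape each page, put the information I want into a list of dictionaries, sort it once, remove duplicates, sort it again by a different key, and export the result to a Google Docs spreadsheet. This works 100%.
I am trying to change it so that I can keep all the URLs in a second Google Docs spreadsheet and have the program read them from there and do the same thing. Here is what I have so far: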
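To make the sort, dedupe, sort step concrete, here is a minimal sketch of that pipeline. It is not my actual scraper code: the "Title" and "Citations" fields, the sample records, and the helper dedupe_sorted are made up for illustration, and only the "CitationURL" key comes from the real program.

```python
def dedupe_sorted(records, key):
    """Drop records whose `key` value repeats the previous record's.

    Assumes `records` is already sorted by `key`, so duplicates are adjacent.
    """
    unique = []
    for record in records:
        if not unique or unique[-1][key] != record[key]:
            unique.append(record)
    return unique


# Hypothetical scraped records: a flat list of dictionaries.
records = [
    {"CitationURL": "u2", "Title": "B", "Citations": 5},
    {"CitationURL": "u1", "Title": "A", "Citations": 9},
    {"CitationURL": "u2", "Title": "B", "Citations": 5},
]

# First sort by URL so that duplicates end up next to each other.
by_url = sorted(records, reverse=True, key=lambda r: r["CitationURL"])
unique = dedupe_sorted(by_url, "CitationURL")

# Then sort again by a different key before exporting.
by_citations = sorted(unique, reverse=True, key=lambda r: r["Citations"])
```

The important point is that every element handed to sorted() is a dictionary, which is what the key function relies on.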
import gdata.spreadsheet.service

def InputUrls(Entered_doc, EnteredURL):
    username = 'myemail'
    password = 'mypassword'
    doc_name = Entered_doc
    spreadsheet_id = Entered_doc
    worksheet_id = 'od6'
    # Connect to Google
    gd_client = gdata.spreadsheet.service.SpreadsheetsService()
    gd_client.email = username
    gd_client.password = password
    gd_client.source = EnteredURL
    gd_client.ProgrammaticLogin()
    # Now that we're connected, fetch the worksheet rows using the spreadsheet and worksheet IDs.
    rows = gd_client.GetListFeed(spreadsheet_id, worksheet_id).entry
    # rows yields one entry per spreadsheet row; collect the text of every cell, keyed by column name.
    urlslist = []
    for row in rows:
        for key in row.custom:
            urlslist.append(row.custom[key].text)
    return urlslist

def URLStoScrape(ToScrape):
    Dep = []
    for i in range(0,len(ToScrape)):
        Department_urls = ToScrape[i].strip()
        Department_urls = Department_urls.splitlines()
        Done = MainScraper(Department_urls)
        Dep.append(Done)
    return Dep

ElectricalDoc = '0AkGb10ekJtfQdG9EOHN0VzRDdVhWaG1kNVEtdVpyRlE'
ElectricalUrl = 'https://docs.google.com/spreadsheet/ccc? '
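ToScrape_Elec = InputUrls(ElectricalDoc, ElectricalUrl)

This seems to work fine, but when the program gets to the sorting step I get the following error:

Traceback (most recent call last):
  File "./code/scraper", line 230, in <module>
    Total_and_Hindex_Electrical = GetTotalCitations(Electrical)
  File "./code/scraper", line 89, in GetTotalCitations
    Wrt_CitationURL = Sorting(Department, "CitationURL")
  File "./code/scraper", line 15, in Sorting
    SortedData = sorted(Unsorted, reverse=True, key=lambda k: k[pivot])
  File "./code/scraper", line 15, in <lambda>
    SortedData = sorted(Unsorted, reverse=True, key=lambda k: k[pivot])
TypeError: list indices must be integers, not str
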
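I am almost certain it has something to do with the URLStoScrape function, but I don't know how to fix it. Any help would be great.
Thanks, and let me know if you need more information.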
Posted on 2013-08-21 12:27:59
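I think the problem is at line 89, in GetTotalCitations:

Wrt_CitationURL = Sorting(Department, "CitationURL")

The TypeError says that the elements the key function receives are lists, so indexing them with the string "CitationURL" fails. Either "CitationURL" should be an integer index, or each value passed to the key function in sorted() should be a dictionary.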
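A minimal sketch of what I mean, assuming (this is a guess, since MainScraper is not shown) that each MainScraper call returns a list of dictionaries, so that the list being sorted is a list of lists. The sample data and the sort_by helper are hypothetical. Flattening the nested list before sorting is one possible fix, not necessarily the one that fits your program best.

```python
from itertools import chain


def sort_by(records, pivot):
    """Sort a list of dicts by the key `pivot`, largest first."""
    return sorted(records, reverse=True, key=lambda k: k[pivot])


# Hypothetical data: one list of dicts per scraped group of URLs.
nested = [
    [{"CitationURL": "b", "Citations": 3}],
    [{"CitationURL": "a", "Citations": 7}, {"CitationURL": "c", "Citations": 1}],
]

# Sorting the nested structure fails: each element is a list,
# and a list cannot be indexed with the string "CitationURL".
try:
    sort_by(nested, "CitationURL")
    nested_error = None
except TypeError as exc:
    nested_error = exc

# Flattening first gives sorted() the dictionaries it expects.
flat = list(chain.from_iterable(nested))
by_url = sort_by(flat, "CitationURL")
```

If you need to keep the departments separate instead, call the sorting function once per inner list rather than on the outer list.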
https://stackoverflow.com/questions/18338708