Q: Python script optimization for App Engine

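Stack Overflow user

Asked 2009-11-27 23:20:39

Answers: 3, Views: 346, Followers: 0, Votes: 2

I use the following script to scrape data from my university's website and insert it into the GAE datastore: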

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    # Without the timetable page there is nothing to parse, so stop here.
    raise SystemExit(str(err))
soup = BeautifulSoup(html)
select = soup.find('select')
# Each <option> inside the <select> holds one intake code.
for option in select.findAll('option'):
    intake = option.string
    if not intake:
        continue
    try:
        # First HTTP request: ask the local app whether the intake exists.
        page = mech.open(viewurl + intake)
        html = page.read()
        print html
        if html == "Exist in database":
            print intake, "exists in the database, skipping"
        else:
            # Second HTTP request: insert the intake.
            page = mech.open(inserturl + intake)
            html = page.read()
            print html
            if html == "Ok":
                print intake, "added to the database"
            else:
                print "Error adding", intake, "to database"
    except Exception, err:
        print str(err)

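I would like to know the best way to optimize this script so that I can run it on the App Engine servers. As it stands, it scrapes over 300 entries and takes more than 10 minutes to insert all the data on my local machine.

The model used to store the data is: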

from google.appengine.ext import db

class Intake(db.Model):
    intake = db.StringProperty(multiline=False, required=True)

    # @permalink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake

    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']

python

google-app-engine

3 Answers

Stack Overflow user

Accepted answer

Posted 2009-11-27 23:27:53

Divide and conquer.

Create a list of tasks (for example, URLs).
Put the tasks into a queue (App Engine Task Queue API, Amazon SQS, …).
Process the queue.
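
As a rough illustration of the steps above, the sketch below splits a list of intake codes into fixed-size chunks, puts one task per chunk into a queue, and then drains the queue. The `chunked`, `build_tasks`, and `process` helpers, the chunk size of 50, and the sample codes are all assumptions made for illustration. A local `deque` stands in for a real queue service; on App Engine each task would be handed to the Task Queue API instead.

```python
from collections import deque


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_tasks(intake_codes, size=50):
    """Turn a flat list of intake codes into a list of task payloads."""
    return [{"codes": chunk} for chunk in chunked(intake_codes, size)]


def process(task, handler):
    """Process one task by calling `handler` on each code in its payload."""
    for code in task["codes"]:
        handler(code)


# Hypothetical sample data standing in for the scraped intake codes.
sample_codes = ["INTAKE-%03d" % n for n in range(300)]

# Step 1 and 2: build the tasks and put them into a (local) queue.
queue = deque(build_tasks(sample_codes, size=50))
task_count = len(queue)

# Step 3: process the queue until it is empty.
processed = []
while queue:
    process(queue.popleft(), processed.append)
```

Each task is small enough to finish quickly on its own, which is the point of splitting the work: no single request has to handle all of the entries.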

Votes: 4

Stack Overflow user

Posted 2009-11-27 23:47:06

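The first thing you should do is rewrite your script to use the App Engine datastore directly. Most of the time you are spending is undoubtedly due to the HTTP requests (two per entry!) you use to insert data into your datastore. Using the datastore directly with batch puts should cut a couple of orders of magnitude off your runtime.

If your parsing code is still too slow, you can split the work into chunks and use the task queue API to do the work across multiple requests.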
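
The batch-put idea can be sketched without any App Engine imports. In the sketch below, `put_batch` stands in for a call such as `db.put(list_of_entities)`, and `existing` for the set of intake codes already stored. Both names, the `insert_new` helper, and the batch sizes are assumptions made for illustration, not part of the original scripts.

```python
def insert_new(codes, existing, put_batch, batch_size=100):
    """Write codes not in `existing` via `put_batch`, one call per batch.

    Returns the number of codes written.
    """
    # Drop blanks and duplicates while keeping the scraped order.
    seen = set(existing)
    fresh = []
    for code in codes:
        code = (code or "").strip()
        if code and code not in seen:
            seen.add(code)
            fresh.append(code)
    # One write call per batch instead of one call per code.
    for start in range(0, len(fresh), batch_size):
        put_batch(fresh[start:start + batch_size])
    return len(fresh)


# In-memory stand-in for the datastore that records each batch call.
calls = []
written = insert_new(
    ["A1", "A2", " ", None, "A2", "B1", "B2"],
    existing={"A1"},
    put_batch=calls.append,
    batch_size=2,
)
```

For the roughly 300 entries mentioned in the question and a batch size of 100, this would mean at most three write calls, compared with up to two HTTP requests per entry in the original script.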

Votes: 2

Stack Overflow user

Posted 2009-11-28 00:57:26

Hi, following tosh's and nick's advice, I have modified the script as follows:

from google.appengine.api import urlfetch
from google.appengine.ext import db
from BeautifulSoup import BeautifulSoup
from timkeeper.models import Intake

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    soup = BeautifulSoup(page.content)
    select = soup.find('select')
    models = []
    # Iterate over the <option> tags only, skipping whitespace text nodes.
    for option in select.findAll('option'):
        intake_code = option.string
        # Note: this still runs one datastore query per intake code.
        if intake_code and Intake.all().filter('intake =', intake_code).count() < 1:
            models.append(Intake(intake=intake_code))
    if models:
        # One batch put for all new entities instead of one write each.
        db.put(models)
except Exception, err:
    # Report failures instead of silently swallowing them.
    print str(err)

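Am I on the right track? Also, I am not sure how to have this invoked on a schedule (once a week). What would be the best way to do that?

And thanks for the prompt answers.

Votes: 1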

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/1809335
