In the book "Programming Collective Intelligence" I found the following function to compute the PageRank:
def calculatepagerank(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")
    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()
    for i in range(iterations):
        print "Iteration %d" % i
        for (urlid,) in self.con.execute("select rowid from urllist"):
            pr=0.15
            # Loop through all the pages that link to this one
            for (linker,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
                # Get the PageRank of the linker
                linkingpr=self.con.execute("select score from pagerank where urlid=%d" % linker).fetchone()[0]
                # Get the total number of links from the linker
                linkingcount=self.con.execute("select count(*) from link where fromid=%d" % linker).fetchone()[0]
                pr+=0.85*(linkingpr/linkingcount)
            self.con.execute("update pagerank set score=%f where urlid=%d" % (pr,urlid))
        self.dbcommit()

However, this function is very slow, because of all the SQL queries in every iteration:
>>> import cProfile
>>> cProfile.run("crawler.calculatepagerank()")
2262510 function calls in 136.006 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 136.006 136.006 <string>:1(<module>)
1 20.826 20.826 136.006 136.006 searchengine.py:179(calculatepagerank)
21 0.000 0.000 0.528 0.025 searchengine.py:27(dbcommit)
21 0.528 0.025 0.528 0.025 {method 'commit' of 'sqlite3.Connecti
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler
1339864 112.602 0.000 112.602 0.000 {method 'execute' of 'sqlite3.Connec
922600 2.050 0.000 2.050 0.000 {method 'fetchone' of 'sqlite3.Cursor'
1 0.000 0.000 0.000 0.000 {range}

So I optimized the function and came up with the following:
def calculatepagerank2(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")
    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()
    inlinks={}
    numoutlinks={}
    pagerank={}
    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0
        # Loop through all the pages that link to this one
        for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
            inlinks[urlid].append(inlink)
        # get number of outgoing links from a page
        numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0]
    for i in range(iterations):
        print "Iteration %d" % i
        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr
    for urlid in pagerank:
        self.con.execute("update pagerank set score=%f where urlid=%d" % (pagerank[urlid],urlid))
    self.dbcommit()

This function is many times faster (but uses a lot more memory for all the temporary dictionaries), because it avoids the unnecessary SQL queries in every iteration:
>>> cProfile.run("crawler.calculatepagerank2()")
90070 function calls in 3.527 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 3.527 3.527 <string>:1(<module>)
1 1.154 1.154 3.523 3.523 searchengine.py:207(calculatepagerank2
2 0.000 0.000 0.058 0.029 searchengine.py:27(dbcommit)
23065 0.013 0.000 0.013 0.000 {method 'append' of 'list' objects}
2 0.058 0.029 0.058 0.029 {method 'commit' of 'sqlite3.Connectio
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler
43932 2.261 0.000 2.261 0.000 {method 'execute' of 'sqlite3.Connecti
23065 0.037 0.000 0.037 0.000 {method 'fetchone' of 'sqlite3.Cursor'
1 0.000 0.000 0.000 0.000 {range}

But is it possible to further reduce the number of SQL queries to speed up this function even more?

Update: Fixed the indentation in calculatepagerank2().
Posted on 2010-03-24 22:25:33
I'm answering my own question, since in the end it turned out that a mixture of all the answers worked best for me:
def calculatepagerank4(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")
    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()
    inlinks={}
    numoutlinks={}
    pagerank={}
    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0
    for src,dest in self.con.execute("select distinct fromid, toid from link"):
        inlinks[dest].append(src)
        numoutlinks[src]+=1
    for i in range(iterations):
        print "Iteration %d" % i
        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr
    args=((pagerank[urlid],urlid) for urlid in pagerank)
    self.con.executemany("update pagerank set score=? where urlid=?", args)
    self.dbcommit()

So I replaced the first two loops as suggested by allyourcode, but in addition also used executemany() from ~unutbu's solution. But unlike ~unutbu I use a generator expression for args, to avoid wasting too much memory, although using a list comprehension was a little bit faster. In the end the routine was 100 times faster than the one suggested in the book:
>>> cProfile.run("crawler.calculatepagerank4()")
33512 function calls in 1.377 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 1.377 1.377 <string>:1(<module>)
2 0.000 0.000 0.073 0.036 searchengine.py:27(dbcommit)
1 0.693 0.693 1.373 1.373 searchengine.py:286(calculatepagerank4
10432 0.011 0.000 0.011 0.000 searchengine.py:321(<genexpr>)
23065 0.009 0.000 0.009 0.000 {method 'append' of 'list' objects}
2 0.073 0.036 0.073 0.036 {method 'commit' of 'sqlite3.Connectio
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler
6 0.379 0.063 0.379 0.063 {method 'execute' of 'sqlite3.Connecti
1 0.209 0.209 0.220 0.220 {method 'executemany' of 'sqlite3.Conn
1 0.000 0.000 0.000 0.000 {range}

One should also be aware of the following issue:
If you construct your SQL statements using %f string formatting instead of the ? placeholder, you will lose precision (e.g. I got 2.9796095721920315 using ?, but 2.9796100000000001 using %f).

Posted on 2010-03-21 01:23:30
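The precision loss noted above comes from %f formatting rounding floats to six decimal places before SQLite ever sees them, whereas a bound ? parameter passes the full double through. A quick demonstration against a throwaway in-memory database (the table contents here are invented):

```python
import sqlite3

score = 2.9796095721920315

# %f string formatting rounds the float to six decimal places
assert "%f" % score == "2.979610"

# A bound ? parameter stores the full double and round-trips exactly
con = sqlite3.connect(":memory:")
con.execute("create table pagerank(urlid primary key,score)")
con.execute("insert into pagerank values (?,?)", (1, score))
(stored,) = con.execute("select score from pagerank where urlid=1").fetchone()
assert stored == score
```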
If you have a really large database (e.g. # records ~ # pages in the WWW) then using the database in a manner similar to what is suggested in the book makes sense, because you won't be able to keep all that data in memory.
If your dataset is small enough, you can (probably) improve the second version by not doing so many queries. Try replacing your first loop with something like this:
for urlid, in self.con.execute('select rowid from urllist'):
    inlinks[urlid] = []
    numoutlinks[urlid] = 0
    pagerank[urlid] = 1.0
for src, dest in self.con.execute('select fromid, toid from link'):
    inlinks[dest].append(src)
    numoutlinks[src] += 1

This version does exactly 2 queries instead of O(n^2) queries.
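The two-query load can be checked end to end against a throwaway in-memory database; the urllist/link schema below is a minimal stand-in for the crawler's real tables, with invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table urllist(url)")
con.execute("create table link(fromid,toid)")
con.executemany("insert into urllist values (?)",
                [("a.html",), ("b.html",), ("c.html",)])
# invented graph: page 1 links to pages 2 and 3, page 2 links to page 3
con.executemany("insert into link values (?,?)", [(1, 2), (1, 3), (2, 3)])

inlinks, numoutlinks, pagerank = {}, {}, {}
# query 1: one row per url
for urlid, in con.execute("select rowid from urllist"):
    inlinks[urlid] = []
    numoutlinks[urlid] = 0
    pagerank[urlid] = 1.0
# query 2: a single pass over the whole link table
for src, dest in con.execute("select fromid, toid from link"):
    inlinks[dest].append(src)
    numoutlinks[src] += 1

assert inlinks == {1: [], 2: [1], 3: [1, 2]}
assert numoutlinks == {1: 2, 2: 1, 3: 0}
```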
Posted on 2010-03-21 01:06:48
Do you have enough memory to hold the sparse matrix (fromid, toid) in some form? That would allow big optimizations (with big algorithmic changes). At the very least, caching in memory the (fromid, numlinks) for the select count(*) you currently do in your innermost loop should help (I'd imagine that cache would be O(N) in space if you're dealing with N URLs).
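Both suggestions amount to holding the sparse link structure in plain dicts, after which each PageRank iteration touches no SQL at all. A sketch on a tiny invented graph, using the same 0.15/0.85 damping as the book's code:

```python
# Invented sparse graph: inlinks[p] lists the pages linking to p,
# numoutlinks caches the per-page select count(*) result
inlinks = {1: [], 2: [1], 3: [1, 2]}
numoutlinks = {1: 2, 2: 1, 3: 0}
pagerank = dict((urlid, 1.0) for urlid in inlinks)

for _ in range(20):
    for urlid in pagerank:
        pr = 0.15
        for linker in inlinks[urlid]:
            # cached count replaces the innermost-loop SQL query
            pr += 0.85 * pagerank[linker] / numoutlinks[linker]
        pagerank[urlid] = pr

assert pagerank[1] == 0.15                      # no inbound links
assert pagerank[3] > pagerank[2] > pagerank[1]  # more inbound rank, higher score
```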
https://stackoverflow.com/questions/2484445