文章/答案/技术大牛

发布

社区首页 >问答首页 >从文件中读取URL字符串的列表，并找到阅读率最高的前10个URL

问从文件中读取URL字符串的列表，并找到阅读率最高的前10个URL
EN

Stack Overflow用户

提问于 2013-09-22 02:57:46

回答 3查看 2.7K关注 0票数 1

好的,

我已经在大量的面试中得到了这个问题，我想我需要一些帮助来解决这个问题。

你有一吨的URL表示为字符串数组或从文件中读取。

我的方法是：

         Read them into a String Array, 
         Iterate through each String/URL,
             At every Iteration, put them into Hashtable, incrementing the count.
         Iterate again and find feed the scores into an array
         Sort and find the top 10 scores OR use max-heap to get the top 10.
         Iterate again and remove the URL's with the top 10 scores.

这是一个非常糟糕的答案吗？有人能帮我做进一步的工作吗？

string

algorithm

url

data-structures

回答 3

Stack Overflow用户

发布于 2013-09-22 08:10:39

你可以用最小的内存和一个基本上无限大的文件来做到这一点：

Use the OS-supplied sort utility to sort the URLs on disk
Create a priority queue (binary heap, for example)
For each URL in the file
    if the URL is the same as the previous URL
        increase count
    else
        AddToQueue(previous_url, count)
        previous_url = current_url
        count = 1
EndFor
AddToQueue(previous_url, count)

此时，访问次数最多的前10个URL将位于优先级队列中。

AddToQueue函数如下所示：

AddToQueue(url, count)
    if (queue.Count < 10)
        add new url with count to queue
    else if (count > top_of_queue.count)
        remove top URL from queue
        add new url with count to queue

如果您有足够的内存来加载所有的URL，您可以将它们加载到一个数组中并对其进行排序。但是如果你有足够的内存来存储所有的URL，那么基于字典的方法可能会更快。

票数 3

Stack Overflow用户

发布于 2013-09-22 03:06:24

运行时并不差，总体上类似于O(nlogn)。

Read them into a String Array - O(n)
Iterate through each String/URL - O(n)
    At every Iteration, put them into Hashtable, incrementing the count - O(1)
Iterate again and find feed the scores into an array - O(n)
Sort and find the top 10 scores OR use max-heap to get the top 10 - O(nlogn)
Iterate again and remove the URL's with the top 10 scores. - O(1)

但是，您可能会跳过最后两个步骤。相反，迭代哈希表(或URL)的条目，并在遍历条目时维护一个包含前10个分数的10个条目的数组；这样，您跳过了排序步骤，总体运行时间将为O(n)。

票数 0

Stack Overflow用户

发布于 2013-09-22 03:26:39

首先，请注意，您正在不必要地使用额外的内存。为什么要将所有内容读入数组，然后遍历该数组以将所有内容插入哈希表？除非您有很强的理由这样做，否则您应该在阅读时将其放入哈希表中。

这减少了对数组的需求，并将内存使用量降低了O(n)。其他步骤听起来很合理。维护一个包含前10个分数的10个条目的数组的想法是一种很好的方法，但它提出了一个问题，即如何有效地做到这一点。

此外，使用哈希表可能会引发实现问题，您过于依赖通常在内置库中的东西。为了便于采访，也许更好的方法是将所有内容读取到二进制搜索树中，其中每个节点都包含一个带有字符串的结构，以及该字符串出现的次数(以及指向左右节点的指针)。使用二进制搜索树可以查看字符串是否在O(log(n))时间内出现。在读取所有内容并将其放入树中后，您可以使用shell sort对树进行排序。Shell排序对于这个练习来说是一个很好的选择，因为它倾向于快速消除大量的无序。该解决方案的运行时间为O(nlog(n))。

如果你的面试官同意使用哈希表，也许不值得麻烦地实现一个树，但如果你说“我将使用哈希表”，你可能会搬起石头砸自己的脚，当然，除非你实现了哈希表。这真的取决于上下文。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/18936381

复制

相似问题

问从文件中读取URL字符串的列表，并找到阅读率最高的前10个URL
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问从文件中读取URL字符串的列表，并找到阅读率最高的前10个URLEN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问从文件中读取URL字符串的列表，并找到阅读率最高的前10个URL
EN