Question: Searching Word documents

Code Review user

Asked 2017-06-26 14:29:01


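I have a website that offers a search over documents from locally held hearings, stored on a network file server. I need to accept a search term and search a set of .docx files (about 4,500). They are not very large, most are under 150 KB, but downloading each file into a stream is very slow. I'm sure there is a better way to write the search (perhaps with multiprocessing), but I don't know how to tune it and speed it up. The search itself takes more than 3 minutes.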

bool found = false;
Hearing h = new Hearing();
Stream str = null;
MemoryStream str2 = new MemoryStream();
HttpWebRequest fileRequest = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse fileResponse = (HttpWebResponse)fileRequest.GetResponse();
str = fileResponse.GetResponseStream();
str.CopyTo(str2);
str2.Position = 0;
using (WordprocessingDocument wpd = WordprocessingDocument.Open(str2, true))
{
    string docText = null;
    using (StreamReader sr = new StreamReader(wpd.MainDocumentPart.GetStream()))
    {
        docText = sr.ReadToEnd();
        found = docText.ToUpper().Contains(txtBasicSearch.Text.ToUpper());
        if (found)
        {
            hearingArrayList.Add(h);
            foundCount++;
        }
    }
}

performance

asp.net

1 Answer

Code Review user

Accepted answer

Posted 2017-06-27 04:14:12

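This really is the exact use case for an indexed full-text search engine.

Since you are running this code server-side on a website, I'd recommend you seriously consider writing a simple worker that polls your FS for new documents and adds them to a database with full-text search enabled.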

If you're using SQL Server: https://docs.microsoft.com/en-us/sql/relational-databases/search/get-started-with-full-text-search

If you're using MySQL: http://www.w3resource.com/mysql/mysql-full-text-search-functions.php

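That way, not only can you return results much faster than by manually scanning every document, you also avoid the heavy network traffic involved in streaming every file off the FS for every request.

To do this, you could write a page in your site or, better, a new console application that is invoked periodically by a cron job (Linux) or a scheduled task on the server (Windows). The interval should match how often files are added to the FS, or whatever your tolerance for stale data is.

On each run, the page/app would pull the list of documents already cached in the database, query the FS for what it currently holds, and compare the two lists by file name or file date to see what needs to be added or updated. At that point you are only streaming the files that actually need to be added, and you don't really care how long it takes.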
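The comparison step could be sketched roughly as follows. This is only an illustration under assumptions: the `IndexSync` class, the `FindFilesToIndex` method, and the idea of keeping a file-name-to-timestamp dictionary loaded from the database are all hypothetical and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class IndexSync
{
    // indexed: file name -> last-write time (UTC) recorded when the file was
    // last indexed (hypothetical shape; load it from your own database).
    // Returns the paths in sourceDirectory that are new or modified since then.
    public static List<string> FindFilesToIndex(
        string sourceDirectory, IDictionary<string, DateTime> indexed)
    {
        var toIndex = new List<string>();
        foreach (string path in Directory.EnumerateFiles(sourceDirectory, "*.docx"))
        {
            string name = Path.GetFileName(path);
            DateTime modified = File.GetLastWriteTimeUtc(path);

            DateTime indexedAt;
            bool known = indexed.TryGetValue(name, out indexedAt);
            if (!known || modified > indexedAt)
            {
                toIndex.Add(path);
            }
        }
        return toIndex;
    }
}
```

Only the paths returned here would then need to be opened and sent to the database, so a run with no new hearings does almost no work.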

The database then takes care of indexing the new documents, and your web page becomes a dumb pipe for searching those indexed documents.
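On the web page side, the search could then reduce to a single parameterized query. The sketch below assumes SQL Server with a full-text index already created; the table `dbo.HearingDocuments`, its columns `FileName` and `DocText`, and the `HearingSearch` class are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

public static class HearingSearch
{
    // Wraps the user's input in double quotes so CONTAINS treats it as one
    // phrase; embedded double quotes are dropped rather than escaped.
    public static string ToContainsPhrase(string userInput)
    {
        return "\"" + userInput.Replace("\"", "").Trim() + "\"";
    }

    public static List<string> Search(string connectionString, string userInput)
    {
        var matches = new List<string>();
        string phrase = ToContainsPhrase(userInput ?? string.Empty);
        if (phrase.Length == 2)
        {
            // Nothing left to search for; CONTAINS rejects an empty condition.
            return matches;
        }

        // Hypothetical table and column names.
        const string sql =
            "SELECT FileName FROM dbo.HearingDocuments WHERE CONTAINS(DocText, @term)";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@term", phrase);
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    matches.Add(reader.GetString(0));
                }
            }
        }
        return matches;
    }
}
```

Passing the term as a parameter keeps user input out of the SQL text, and the per-request cost no longer depends on streaming 4,500 files.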

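If storing the text in a database isn't an option, consider mirroring the files on your own server. That still removes the slowest part of your algorithm (the network traffic).

You would still need your cron/scheduled-task worker to do the mirroring, but copying new files from the FS to local disk is a simple matter.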
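A minimal sketch of such a mirroring step, assuming the file server is reachable as a directory path (for example a UNC share). The `HearingMirror` class and both directory parameters are hypothetical.

```csharp
using System;
using System.IO;

public static class HearingMirror
{
    // Copies .docx files that are missing locally or newer on the share.
    // Returns the number of files copied.
    public static int MirrorNewFiles(string shareDirectory, string localDirectory)
    {
        Directory.CreateDirectory(localDirectory);
        int copied = 0;
        foreach (string source in Directory.EnumerateFiles(shareDirectory, "*.docx"))
        {
            string target = Path.Combine(localDirectory, Path.GetFileName(source));
            bool stale = !File.Exists(target)
                || File.GetLastWriteTimeUtc(source) > File.GetLastWriteTimeUtc(target);
            if (stale)
            {
                // Overwrite so that updated hearings replace the old copy.
                File.Copy(source, target, true);
                copied++;
            }
        }
        return copied;
    }
}
```

This sketch does not remove local copies of files deleted from the share; whether that matters depends on how the hearings archive is maintained.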

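Whether you mirror locally or can do neither of the above, your next best option is parallelization. You could do some refactoring as well, but the local operations are not the real bottleneck.

For example, if you can mirror locally, you could use this in place of your existing code: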

// Assumes: using System; using System.IO; using System.Threading.Tasks;
//          using DocumentFormat.OpenXml.Packaging;

// ToUpper() your search string outside of the loop,
// rather than in each pass.
string txtBasicSearch = "My Search String".ToUpper();

// Parallel.ForEach runs the body on several threads at once, so guard the
// shared result list with a lock.
object listLock = new object();

// Use Parallel.ForEach over every docx file in our directory.
Parallel.ForEach(Directory.EnumerateFiles(directoryPath, "*.docx"), (string file) =>
{
    string docText = string.Empty;

    try
    {
        // Try to dispose of our streams as soon as possible to avoid
        // holding memory unnecessarily. Also, avoid copying Streams
        // to different types. A generic Stream works just fine.
        //
        // As well, only open with read perms to avoid unnecessary locks and
        // any delays that may cause.
        using (Stream str = File.OpenRead(file))
        {
            using (WordprocessingDocument wpd = WordprocessingDocument.Open(str, false))
            {
                using (StreamReader sr = new StreamReader(wpd.MainDocumentPart.GetStream()))
                {
                    docText = sr.ReadToEnd();
                }
            }
        }

        // Search the haystack for the needle.
        if (docText.ToUpper().Contains(txtBasicSearch))
        {
            // No need for a counter variable. Just use
            // hearingArrayList.Count at the end.
            lock (listLock)
            {
                hearingArrayList.Add(file);
            }
        }
    }
    catch (Exception)
    {
        // Do whatever error handling here.
        return;
    }
});

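Timing this parallel version against the same logic in a regular foreach loop, using a small directory on a local NAS, showed the parallel version to be typically 3-6 times faster.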
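One caveat about the snippets in this thread: `MainDocumentPart.GetStream()` returns the raw WordprocessingML XML, so a search term can match markup (element names, attributes) and can miss text that Word has split across several runs. A hedged alternative, assuming the Open XML SDK's `Document.Body.InnerText` is acceptable for your documents, is to search the extracted text instead. The `DocxText` class name is hypothetical.

```csharp
using System;
using System.IO;
using DocumentFormat.OpenXml.Packaging;

public static class DocxText
{
    // Returns true when the document's body text contains searchTerm,
    // ignoring case. Markup such as element names is not searched.
    public static bool ContainsText(Stream docxStream, string searchTerm)
    {
        using (WordprocessingDocument wpd = WordprocessingDocument.Open(docxStream, false))
        {
            var body = wpd.MainDocumentPart.Document.Body;
            string text = body == null ? string.Empty : body.InnerText;
            return text.IndexOf(searchTerm, StringComparison.OrdinalIgnoreCase) >= 0;
        }
    }
}
```

`InnerText` concatenates the text of all child elements without inserting separators, so the last word of one paragraph runs directly into the first word of the next. Using `IndexOf` with `StringComparison.OrdinalIgnoreCase` also avoids allocating an upper-cased copy of every document.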

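If you can't mirror locally, you can still process the file streams in parallel, but you need to be aware of the server's limit on the number of connections that can be open at the same time.

Here, the HttpClient class may serve you better than the WebRequest class: https://msdn.microsoft.com/en-us/library/hh696703(v=vs.110).aspx

From there, you can query the remote directory for its files and then loop over them, making asynchronous calls with HttpClient.

So this might look like: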

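// Assumes: using System; using System.Collections.Generic; using System.IO;
//          using System.Net.Http; using System.Threading.Tasks;
//          using DocumentFormat.OpenXml.Packaging;

string txtBasicSearch = "My Search String".ToUpper();

HttpClient client = new HttpClient();

// The continuations run on thread-pool threads, so guard the shared list.
object listLock = new object();
List<Task> pending = new List<Task>();

// Use client to populate myFileList with the remote files.

foreach (string file in myFileList)
{
    pending.Add(client.GetStreamAsync(file).ContinueWith((Task<Stream> result) =>
    {
        if (result.Status != TaskStatus.RanToCompletion)
        {
            // Error handling.
            return;
        }

        string docText = string.Empty;

        try
        {
            // If Open rejects the response stream because it cannot seek,
            // copy it into a MemoryStream first.
            using (Stream str = result.Result)
            using (WordprocessingDocument wpd = WordprocessingDocument.Open(str, false))
            {
                using (StreamReader sr = new StreamReader(wpd.MainDocumentPart.GetStream()))
                {
                    docText = sr.ReadToEnd();
                }
            }

            if (docText.ToUpper().Contains(txtBasicSearch))
            {
                lock (listLock)
                {
                    hearingArrayList.Add(file);
                }
            }
        }
        catch (Exception)
        {
            // Do whatever error handling here.
            return;
        }
    }));
}

// Wait for every download to finish before reading hearingArrayList.
Task.WaitAll(pending.ToArray());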

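The HttpClient class will handle rate limiting for you. By default, I believe it allows only a small number of simultaneous connections per host (on .NET Framework this is governed by ServicePointManager.DefaultConnectionLimit, documented as 2, or 10 for ASP.NET-hosted applications), but you can easily change it to suit your needs.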
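If you would rather cap the concurrency explicitly than depend on the client's defaults, one option is a small semaphore-based helper. This is a sketch, not something from the original answer: the `Throttle` class and `ForEachAsync` method are hypothetical names, and the right value for `maxConcurrency` depends on what your file server tolerates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Runs body for every item, with at most maxConcurrency bodies in flight.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> body)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            // ToList() starts every task; each one waits at the gate for a slot.
            var tasks = items.Select(async item =>
            {
                await gate.WaitAsync();
                try
                {
                    await body(item);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            await Task.WhenAll(tasks);
        }
    }
}
```

A call might look like `await Throttle.ForEachAsync(myFileList, 4, async url => { byte[] bytes = await client.GetByteArrayAsync(url); /* open and search the document */ });`, where downloading to a byte array also gives you a seekable `MemoryStream` to hand to `WordprocessingDocument.Open`.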

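Enumerating the files on the remote server is a separate topic, and it depends on how that remote is accessed. I'd suggest searching for other answers, such as https://stackoverflow.com/questions/124492/c-sharp-httpwebrequest-command-to-get-directory-listing

(If your file server is just a NAS on your intranet, save yourself the trouble and query the files with the System.IO.Directory and File classes.)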

Votes: 3

Original page content provided by Code Review. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://codereview.stackexchange.com/questions/166678
