
Q: Java utility to convert WARC files to XML files

Code Review user

Asked 2015-11-27 05:53:57

Answers: 2 | Views: 581 | Followers: 0 | Score: 1

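I have written a program to perform a task, but I suspect it is not optimized. I would like to know whether there is any way to improve its efficiency and performance.

The program reads a set of .gz files from a directory, parses each file, and inserts the filtered content into a separate .xml file in a result directory.

For example, the content of 1.gz looks like this: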

URL:http://www.samplePage1.com

HTTP/1.1 200 OK
Content-Type: application/vnd.ms-excel
Content-Length: 46592
Last-Modified: Mon, 08 Mar 2010 18:48:10 GMT

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="title" content="Internet Infrastructure Vendors (Vendor SIG)" />
<meta name="description" content="Sample page1" />
<title>My title1</title>
</head>
<body class="home">
 <p> body content of this sample page 1 </p>
</body>
</html>

This is read, parsed, and written out as an XML file (one .xml file is created for each .gz file in the input folder), as follows:

<docHead>
<doc>
<field name="url">http://www.samplePage1.com</field>
<field name="meta">Sample page1</field>
<field name="title">My title1</field>
<field name="body">body content of this sample page 1 </field>
<field name="lastmodified">Mon, 08 Mar 2010 18:48:10 GMT</field>
</doc>
<doc>
...another doc
</doc>
...
...
</docHead>

The Java code is as follows:

import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WarcToXML {

    static Pattern lstModPattern = Pattern.compile("last-modified:.*?\r?\n", Pattern.CASE_INSENSITIVE);


    public static void main(String[] args) throws IOException {

        String in_directory=args[0];
        String result_dir=args[1];
        String resFileName=null;

        //Path for the newly created xml files
        Path outPath = Paths.get(result_dir);

        //Create new directory if it does not exist
        if (!Files.exists(outPath)) {
            try {
                Files.createDirectory(outPath);
            } catch (IOException e) {
                 System.err.println(e);
            }
        }

        int fileCount=1;
        Path dir = FileSystems.getDefault().getPath(in_directory);
        DirectoryStream<Path> stream=null;
        try {
            stream = Files.newDirectoryStream( dir );

            for (Path path : stream) {
                if((path.getFileName().toString()).endsWith(".gz"))

                    resFileName=result_dir+"\\"+fileCount+".xml";

                try {
                    parseFile(path.toFile(), resFileName);
                } catch (Exception e) {
                    e.printStackTrace();
                    }
            }
            stream.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }

    }


    public static void parseFile(File inputFile, String resFileName) throws IOException {

        // open the gzip input stream
        GZIPInputStream gzStream=new GZIPInputStream(new FileInputStream(inputFile));
        DataInputStream inStream=new DataInputStream(gzStream);

        int i=0;
        String pageContent;

        String thisTargetURI=null;
        BufferedWriter writer=null;

        try{
            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName), "utf-8"));

            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            writer.newLine();
            writer.write("<docHead>");
            writer.newLine();
            writer.close();

            // PageRepository iterates through the inStream and returns each WARC Record as a String
            while ((pageContent=PageRepository.readPage(inStream))!=null) {

                int startOfHtmlContent=0;
                if(pageContent.toLowerCase().indexOf("<!doctype html")!=-1)
                    startOfHtmlContent=pageContent.toLowerCase().indexOf("<!doctype html");
                else
                    startOfHtmlContent=pageContent.toLowerCase().indexOf("<html");


                //Start-Get value of last-modified header

                // Take the header block before pageContent is trimmed down to the HTML part
                int endOfHeader=startOfHtmlContent;
                String headersBlock=pageContent.substring(0, endOfHeader);
                pageContent=pageContent.substring(startOfHtmlContent, pageContent.length()-1);
                String lastModified=null;

                Pattern pattern = Pattern.compile("last-modified:.*?\r?\n", Pattern.CASE_INSENSITIVE);
                Matcher matcher = pattern.matcher(headersBlock);
                if (matcher.find())
                {
                    lastModified=(matcher.group(0).substring(14)).trim();
                }

                //end-get last-modified header

                String pageTitle=null;
                String h1=null;
                Element firstH1=null;

                //Parsing the html content using Jsoup
                Document doc=Jsoup.parse(pageContent);

                /**Extracting document title, if no title is present, select the text inside 1st h1 or h2 tag as the title. 
                 * If that too is not found, take the url as title
                 * */
                if(doc.title()!=null && !doc.title().isEmpty()){
                    pageTitle=doc.title();
                }
                else{
                    if(doc.select("h1").first()!=null)
                        firstH1= doc.select("h1").first();
                    else if(doc.select("h2").first()!=null)
                        firstH1= doc.select("h2").first();

                    if(firstH1!=null)
                        h1=firstH1.text();
                    else
                        h1=thisTargetURI;

                    pageTitle=h1;
                }
                /**
                    End of extracting Title
                 */


                //getting meta data
                String metaInfo="";
                Elements metalinks = doc.select("meta");

                for (Element ele : metalinks) {
                    if(ele.attr("name").equalsIgnoreCase("keywords") || ele.attr("name").equalsIgnoreCase("description"))
                        metaInfo=metaInfo+" "+ele.attr("content");
                }

                writeToXml(thisTargetURI, metaInfo, pageTitle, doc.text(), lastModified, resFileName);
            }

            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName,true), "utf-8"));
            writer.write("</docHead>");
            writer.close();

        }catch(Exception e){
            e.printStackTrace();
        }
        finally {
            ...
        }
    }

    public static void writeToXml(String url, String metaKeywords, String title, String content, 
            String lastModified, String resFileName){

        BufferedWriter writer = null;

        try {
            url=url.replace("<", "&lt;");
            url=url.replace(">", "&gt;");
            url=url.replace("'", "&apos;");

            if(metaKeywords!=null){
                metaKeywords=metaKeywords.replace("<", "&lt;");
                metaKeywords=metaKeywords.replace(">", "&gt;");
                metaKeywords=metaKeywords.replace("'", "&apos;");
            }


            if(title!=null){
                title=title.replace("<", "&lt;");
                title=title.replace(">", "&gt;");
                title=title.replace("'", "&apos;");
            }

            if(content!=null){
                content=content.replace("<", "&lt;");
                content=content.replace(">", "&gt;");
                content=content.replace("'", "&apos;");
            }

            writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(resFileName,true), "utf-8"));


            writer.write("<doc>");
            writer.newLine();
            writer.write("<field name=\"url\">"+url+"</field>");
            writer.newLine();
            writer.write("<field name=\"meta\">"+metaKeywords+"</field>");
            writer.newLine();
            writer.write("<field name=\"title\">"+title+"</field>");
            writer.newLine();
            writer.write("<field name=\"body\">"+content+"</field>");
            writer.newLine();
            writer.write("<field name=\"lastmodified\">"+lastModified+"</field>");
            writer.newLine();
            writer.write("</doc>");
            writer.newLine();

            writer.close();
        }
        catch (Exception ex) {
            ex.printStackTrace();
        }
        finally {
            ...
        }
    }
}

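Is there a better way to accomplish this task? I guess threads that read and process several files at once would help, but I am not sure how to use them.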

java

html

file

xml

2 Answers

Code Review user

Answered 2015-11-27 06:44:54

Final modifier

static Pattern lstModPattern = Pattern.compile("last-modified:.*?\r?\n", Pattern.CASE_INSENSITIVE);

never changes, so it can be declared final, which tells both the developer and the application that it will never change.
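A minimal sketch of what that could look like, with the pattern declared once as a constant and reused for every record. The class name `LastModifiedHeader` and the method `extract` are made up for illustration, and the regex is a slightly simplified variant of the one in the question (it captures the value in a group instead of using `substring(14)`).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastModifiedHeader {

    // Compiled once; 'final' documents that the reference is never reassigned.
    private static final Pattern LAST_MODIFIED =
            Pattern.compile("last-modified:(.*)", Pattern.CASE_INSENSITIVE);

    /** Returns the Last-Modified value from a header block, or null if absent. */
    static String extract(String headersBlock) {
        Matcher matcher = LAST_MODIFIED.matcher(headersBlock);
        // '.' does not match line terminators, so group(1) stops at the end of the line.
        return matcher.find() ? matcher.group(1).trim() : null;
    }
}
```

This also makes the second `Pattern.compile` call inside the loop in `parseFile` unnecessary.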

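Inconsistent formatting

if((path.getFileName().toString()).endsWith(".gz"))

    resFileName=result_dir+"\\"+fileCount+".xml";

try {

This code, for example, is formatted with extra blank lines, unlike the rest of the code.

if(firstH1!=null)
    h1=firstH1.text();
else
    h1=thisTargetURI;

There are no braces here. I strongly recommend using braces, in case you later add more statements that depend on the state of firstH1.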

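Try-catch

You are ignoring all exceptions and just printing the stack trace, which is usually something you want to avoid. It is not always possible to recover from an exception, but right now no attempt is made, and the output does not directly tell us what went wrong either.

public static void main(String[] args) throws IOException {

This is generally frowned upon. In this case you are also catching some of the things that are thrown, which can make debugging painful.

Catch and try to recover if possible, or at least continue; if you cannot, produce a real and clear error stating the cause, and terminate the application.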
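As a rough sketch of "continue where possible, fail clearly where not": the loop below reports a failure on a single file with the file name and keeps going, while a failure to list the input directory is allowed to propagate. `BatchRunner`, `FileTask`, and `runAll` are hypothetical names; `FileTask` merely stands in for `parseFile` from the question.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class BatchRunner {

    /** Hypothetical per-file step; stands in for parseFile in the question. */
    interface FileTask {
        void process(Path file) throws IOException;
    }

    /**
     * Runs the task on every .gz file in inputDir. A failure on one file is
     * reported and recorded, and the remaining files are still processed.
     * Failing to open the directory is unrecoverable here, so it propagates.
     */
    static List<Path> runAll(Path inputDir, FileTask task) throws IOException {
        List<Path> failed = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.gz")) {
            for (Path file : files) {
                try {
                    task.process(file);
                } catch (IOException e) {
                    // Say which file failed and why, then carry on with the next one.
                    System.err.println("Skipping " + file + ": " + e.getMessage());
                    failed.add(file);
                }
            }
        }
        return failed;
    }
}
```

The caller can then decide what to do with the returned list, for example exit with a non-zero status if it is not empty.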

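Duplicated code

writer.write("<doc>");
writer.newLine();
writer.write("<field name=\"url\">"+url+"</field>");
writer.newLine();
writer.write("<field name=\"meta\">"+metaKeywords+"</field>");
writer.newLine();
writer.write("<field name=\"title\">"+title+"</field>");
writer.newLine();
writer.write("<field name=\"body\">"+content+"</field>");
writer.newLine();
writer.write("<field name=\"lastmodified\">"+lastModified+"</field>");
writer.newLine();
writer.write("</doc>");
writer.newLine();

At the very least, break out "<field name=\"" + head + "\">" + data + "</field>" into a helper method, and perhaps even consider using a HashMap<K,V> to iterate over.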
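One possible shape for such a helper, as a sketch only. `DocFields`, `field`, and `doc` are illustrative names, and the values are assumed to be escaped before they get here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DocFields {

    /** Formats one <field> line; the value is assumed to be escaped already. */
    static String field(String name, String value) {
        return "<field name=\"" + name + "\">" + value + "</field>";
    }

    /** Builds a whole <doc> block by iterating over the fields in map order. */
    static String doc(Map<String, String> fields) {
        StringBuilder out = new StringBuilder("<doc>\n");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            out.append(field(entry.getKey(), entry.getValue())).append('\n');
        }
        return out.append("</doc>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", "http://www.samplePage1.com");
        fields.put("title", "My title1");
        System.out.print(doc(fields));
    }
}
```

The usage example fills a `LinkedHashMap` rather than a plain `HashMap` because the latter does not guarantee iteration order, and the field order in the output file probably should stay stable.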

title=title.replace("<", "&lt;");
title=title.replace(">", "&gt;");
title=title.replace("'", "&apos;");

The same goes for this; it is repeated in many places for different variables. Consider using a helper here as well.
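A sketch of such a helper follows. Note that it goes a little further than the replacements in the question: it also escapes `&` and `"`, which XML requires in text and attribute values, and it maps `null` to an empty string. Both of those are my assumptions about what is wanted, and `XmlEscaper` is an invented name.

```java
class XmlEscaper {

    /** Escapes the five characters that are special in XML text and attributes. */
    static String escape(String text) {
        if (text == null) {
            return ""; // assumption: a missing value becomes an empty field
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this in place, `writeToXml` would call `XmlEscaper.escape(...)` once per value instead of repeating three `replace` calls for each variable.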

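Those are some quick pointers. Once they have been addressed, feel free to ask a new question. When you do, I can give more direct feedback on the project as a whole.

Score 2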

Code Review user

Answered 2015-11-27 07:33:22

Instead of building the XML file by hand as you do now, I strongly recommend using a "proper" XML serializer, e.g. Xerces.
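To give an idea of what delegating to a serializer looks like, here is a small sketch. It does not use Xerces; it swaps in the StAX `XMLStreamWriter` that ships with the JDK (`javax.xml.stream`), simply because it needs no extra dependency. `StaxDocWriter`, `writeField`, and `toXml` are made-up names, and only three of the five fields are shown.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StaxDocWriter {

    /** Writes one <field name="...">value</field> element; the writer handles escaping. */
    static void writeField(XMLStreamWriter xml, String name, String value)
            throws XMLStreamException {
        xml.writeStartElement("field");
        xml.writeAttribute("name", name);
        xml.writeCharacters(value == null ? "" : value);
        xml.writeEndElement();
    }

    /** Serializes a single document inside a docHead root and returns the XML text. */
    static String toXml(String url, String title, String body) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("docHead");
        xml.writeStartElement("doc");
        writeField(xml, "url", url);
        writeField(xml, "title", title);
        writeField(xml, "body", body);
        xml.writeEndElement(); // closes doc
        xml.writeEndElement(); // closes docHead
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }
}
```

The point is that `writeCharacters` and `writeAttribute` take care of escaping, so the hand-written `replace` chains are no longer needed.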

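In general, the whole code is not very object-oriented for Java. It might be interesting to write this in a more "script-like" language (e.g. Python), since such a language is not as deeply rooted in OOP.

But since you have OOP available, I recommend you use it.

Currently (as mentioned) the code is very procedural rather than object-oriented. We can change that by introducing a class that holds the document information we want and knows how to serialize (and deserialize) it correctly: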

public class DocumentInformation {
    private URL url; // maybe String is more appropriate?
    private String title;
    private String metaKeywords;
    private String body;
    private String lastModified;

    // omitting constructor and getters / setters

    public void serializeToXML(XMLSerializer serializer) {
        // Serialization specific code :)
    }
}

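What you should strive for is a proper abstraction of the parsing and serialization processes, so that you can get an idea of what is happening just by looking at the main method.

Consider a main method like this (disclaimer: this is example code):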

public static void main(String[] args) {
    // skipping the nitty-gritty argument parsing
    try (DirectoryStream<Path> input = Files.newDirectoryStream(dir)) {
         Path outputFile = Paths.get(result_dir, fileCount + ".xml");
         XMLSerializer serializer = new XMLSerializer(
                   Files.newOutputStream(outputFile, StandardOpenOptions.CREATE),
                   OutputFormat.defaults);
         serializer.startDocument("docHead");
         // DirectoryStream is an Iterable, not a Stream, so adapt it before mapping
         StreamSupport.stream(input.spliterator(), false)
             .map(Parser::parse)
             .forEach(document -> document.serializeToXML(serializer));
    } catch (IOException ex) {
         // ... better error handling :D
    }
}

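This accomplishes several goals of OOP at once. First, we abstract parsing and serialization into dedicated classes. In theory these classes can change (if needed) without having to adjust the rest of the application.

Second, we now separate the responsibilities for input and output. This is much simpler than the current parseFile, which does several things at once, namely:

Reading the file
Parsing it into intermediate results
Normalizing those results
Serializing the results to XML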

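The overall gist of what is happening becomes clearer and easier to grasp.

Finally, we now make more use of the newer (and cleaner) nio to simplify processing and error handling.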
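As a small illustration of the nio point, the output side could look roughly like the sketch below: one writer per output file, opened with try-with-resources, instead of reopening the file in append mode for every record. `NioOutput` and `writeAll` are hypothetical names, and the `<doc>` blocks are assumed to be rendered already.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class NioOutput {

    /**
     * Writes all pre-rendered <doc> blocks to outputFile through a single writer.
     * try-with-resources closes the writer even if one of the writes fails.
     */
    static void writeAll(Path outputFile, List<String> docBlocks) throws IOException {
        // Create the result directory (and missing parents); does nothing if it exists.
        Path parent = outputFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        try (BufferedWriter writer =
                     Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            writer.newLine();
            writer.write("<docHead>");
            writer.newLine();
            for (String block : docBlocks) {
                writer.write(block);
                writer.newLine();
            }
            writer.write("</docHead>");
            writer.newLine();
        }
    }
}
```

Building the path with `Path.resolve` or `Paths.get(dir, name)` also avoids the hard-coded "\\" separator used in the question.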

Score 2

Original page content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/111985
