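I have searched this site many times but have not found what I actually need. I have a web.warc.gz file containing data, and I need to extract the WARC headers from it. I installed Tomcat and Wayback (1.6) and tried to extract them with the ./warc-header script that ships with Wayback, but I keep getting error messages about the format I am using: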
Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\
~/Desktop/output.csv type \r\n
USAGE: tgtWarc fieldsSrc id
tgtWarc is the path to the target WARC.gz
fieldsSrc is the path to the text of the record
make sure each line is terminated by \r\n
and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
of the header record... header...
Or I get a different kind of error:
Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz
~/Desktop/output.csv Content-Type
java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:
at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)
I am fairly sure the problem is the format of what I type on the command line, but I still cannot get it right. Can anyone help?
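For reference, judging only from the usage text above, warc-header seems to expect fieldsSrc to be an existing, \r\n-terminated text file that it writes into the target WARC as a header record, so it may not be an extraction tool at all. If the goal is just to read the headers out of the file, a minimal sketch in plain Python (standard library only, not part of Wayback) could look like the following. The function names `iter_warc_headers` and `write_headers_csv` are made up for this illustration, and the parser assumes well-formed records with a `Content-Length` field and no folded (multi-line) header values.

```python
import csv
import gzip


def iter_warc_headers(stream):
    """Yield one dict of WARC header fields per record in a binary stream.

    Hypothetical helper: assumes each record starts with a "WARC/x.y" line,
    has a Content-Length field, and uses no folded header lines.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {"_version": line.strip().decode("utf-8", "replace")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record payload so the next read starts at the next record.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def write_headers_csv(warc_path, csv_path):
    """Write (record number, field name, field value) rows to a CSV file."""
    # gzip.open reads through concatenated gzip members transparently.
    with gzip.open(warc_path, "rb") as warc, \
            open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["record", "field", "value"])
        for number, headers in enumerate(iter_warc_headers(warc), start=1):
            for field, value in headers.items():
                writer.writerow([number, field, value])
```

With the paths from the question expanded to absolute paths, this would be called as `write_headers_csv("/path/to/WEB.WARC.gz", "/path/to/output.csv")`. Each record's payload is read fully into memory before being discarded, which is fine for a sketch but may be slow on very large records.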
Posted on 2015-04-02 19:23:31
https://stackoverflow.com/questions/21922726