On line 09 of the record below there is the following line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Line 04: Content-Type: application/http;msgtype=request
Line 05: WARC-Date: 2018-11-03T17:20:02Z
Line 06: WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
Line 07: WARC-IP-Address: 54.230.195.16
Line 08: WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
Line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 10: Content-Length: 141
Line 11:
Line 12: GET /vital-signs/carbon-dioxide/ HTTP/1.1
Line 13: User-Agent: Wget/1.15 (linux-gnu)
Line 14: Accept: */*
Line 15: Host: climate.nasa.gov
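Line 16: Connection: Keep-Alive
The WARC specification says: "The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record."
I have been trying to figure out what "full block of the record" refers to. Is it lines 11 to 16? Lines 12 to 16? Or lines 1 to 16 (without line 9)? I have tried hashing each of these possibilities, but I cannot reproduce the SHA-1 (base32) value shown above.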
Posted on 2021-08-13 14:06:55
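A WARC record of an HTTP request has three parts (cf. the WARC specification): the WARC header, the HTTP request header, and the payload, which is empty for a GET request like this one.
The payload digest of the record is therefore the base32-encoded SHA-1 of the empty string. Example using Linux command-line tools: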
$> echo -n "" | openssl dgst -binary -sha1 | base32
3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
A WARC record has the following form:
warc-record = header CRLF
              block CRLF CRLF
(see the WARC specification: record model)
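As a rough illustration of this layout, the sketch below cuts the block out of a file that holds a single uncompressed WARC record. The function name `warc_block` is made up for this example, and the sketch assumes CRLF line endings, a `Content-Length: <n>` line in the WARC header, and a `head` that accepts `-c`. It is not a general WARC parser.

```shell
#!/bin/sh
# Sketch: print the block of a single, uncompressed WARC record.
# warc_block is a hypothetical helper name, not part of any WARC tool.

warc_block() {
    record=$1
    # Header size in bytes: every line up to and including the first
    # blank line (a lone CR once awk has split the input on LF).
    hdr_len=$(LC_ALL=C awk '{ n += length($0) + 1 }
                            /^\r?$/ { print n; exit }' "$record")
    # Block size in bytes, as declared by the first Content-Length line,
    # which belongs to the WARC header. "+ 0" drops the trailing CR.
    blk_len=$(LC_ALL=C awk 'tolower($1) == "content-length:" { print $2 + 0; exit }' "$record")
    # Skip the header, then keep exactly blk_len bytes. The two CRLFs
    # that close the record are left out.
    tail -c +"$((hdr_len + 1))" "$record" | head -c "$blk_len"
}
```

Piping the result through the same tools as above, for example `warc_block record.warc | openssl dgst -binary -sha1 | base32`, gives a value to compare against WARC-Block-Digest (`record.warc` is a placeholder file name).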
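The "full" block should include everything up to, but not including, the \r\n\r\n that terminates the record. That means lines 12 to 17: the HTTP request on lines 12 to 16 plus a blank line 17 that the listing above does not show. Line 11 is the blank line that ends the WARC header and is not part of the block. With CRLF line endings these lines add up to 141 bytes, which matches Content-Length. Note: the HTTP GET request itself also ends with \r\n\r\n (a trailing blank line):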
$> cat request
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive
$> tail -n2 request | hexdump -C
00000000 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 4b 65 65 70 |Connection: Keep|
00000010 2d 41 6c 69 76 65 0d 0a 0d 0a |-Alive....|
0000001a
$> cat request | openssl dgst -binary -sha1 | base32
CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
https://stackoverflow.com/questions/68768851
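A request saved with plain LF line endings, or without the final blank line, hashes to a different value. As a sketch, the `request` file used in the session above could be created with `printf` so that every line ends in CRLF. The variable names are illustrative, and the printed digest should only equal the header value if the bytes are identical to what Wget actually sent.

```shell
#!/bin/sh
# Sketch: rebuild the request block with explicit CRLF line endings.
# The file name "request" follows the shell session above.
req=request

# printf reuses the format for each argument, so every line gets \r\n.
# The final empty argument produces the blank line that ends the request.
printf '%s\r\n' \
    'GET /vital-signs/carbon-dioxide/ HTTP/1.1' \
    'User-Agent: Wget/1.15 (linux-gnu)' \
    'Accept: */*' \
    'Host: climate.nasa.gov' \
    'Connection: Keep-Alive' \
    '' > "$req"

# Should equal the Content-Length value of the WARC header (141).
size=$(wc -c < "$req" | tr -d ' ')
echo "block size: $size bytes"

# Digest in the same form as the WARC-Block-Digest value.
digest=$(openssl dgst -binary -sha1 < "$req" | base32)
echo "sha1:$digest"
```

Checking the size against Content-Length first is a cheap way to catch line-ending problems before comparing digests.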