@unnumbered WARCs management

To view WARC files, you have to load them into the daemon. Responses for
the corresponding URIs will then be transparently served from those WARCs.

There is no strict validation or checking of WARC correctness at all!
But the built-in WARC support seems to be good enough for various sources.
The following formats are supported:
Ordinary uncompressed WARC. Useful when stored on a transparently
compressed ZFS dataset.
@item @command{.warc.gz}
GZIP-compressed WARC. Multi-stream (multi-segment) files are also
supported and properly indexed.
@item @command{.warc.zst}
Zstandard compressed WARC, as in
@url{https://iipc.github.io/warc-specifications/specifications/warc-zstd/, specification}.
Multi-frame format is properly indexed. Dictionary at the beginning
It is processed with the @command{unzstd} (@file{cmd/unzstd/unzstd})
utility. It reads the compressed stream from @code{stdin}, writes
decompressed data to @code{stdout}, and prints each frame's size together
with the corresponding decompressed data size to file descriptor 3 (if it
is open). You can adjust the path to it with the
@code{-X go.stargrave.org/tofuproxy/warc.UnZSTDPath} command line option
$ tee fifos/add-warcs < warcs.txt
smth.warc-00000.warc.gz
smth.warc-00001.warc.gz
smth.warc-00002.warc.gz
Visit a URI that you know exists in those WARCs, or go to
@url{http://warc/} to view the full list of known loaded URIs from
Pay attention: the order in which WARCs are loaded is important! A WARC
can be segmented, and a single response can be split across multiple WARC
files. Each subsequently loaded WARC file overwrites any already existing
URIs.
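A minimal sketch of keeping the load order deterministic. The
@code{warc_load_order} helper is hypothetical (it is not part of
tofuproxy), and it assumes that segment numbers in the file names are
zero-padded, as in the examples here, so that byte-wise sorting matches
segment order.

```shell
# Hypothetical helper: print the given WARC names one per line,
# sorted byte-wise. With zero-padded segment numbers this equals
# the segment order.
warc_load_order() {
    printf '%s\n' "$@" | LC_ALL=C sort
}

# Possible usage (assumes the fifos/add-warcs FIFO described here):
#   warc_load_order smth.warc-*.warc.gz > fifos/add-warcs
```

The sorted list is in the same newline-separated form as
@file{warcs.txt}.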
To list and delete loaded WARCs:
$ cat fifos/list-warcs
smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0

$ echo another.warc > fifos/del-warcs
One possible reason why @file{smth.warc-00002.warc.gz} has no URIs is
that it contains continuation records of segmented responses.
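As a small sketch, the listing can be filtered for WARCs that contributed
no URIs. The @code{zero_uri_warcs} helper is hypothetical. It assumes the
two-column "name count" format shown in the listing and file names
without whitespace.

```shell
# Hypothetical helper: read "name count" lines on stdin and print
# the names whose URI count is zero.
zero_uri_warcs() {
    awk 'NF == 2 && $2 == 0 { print $1 }'
}

# Possible usage (assumes the fifos/list-warcs FIFO described here):
#   zero_uri_warcs < fifos/list-warcs
```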
Loading a WARC involves reading it in full and remembering where each
URI's response is located. You can @code{echo SAVE > fifos/add-warcs} to
save the in-memory index to disk as @file{....idx.gob} files. During
the next load, if those files exist, they are used as the index
immediately, without expensive WARC parsing.
The @code{redo warc-extract.cmd} utility uses exactly the same code for
parsing WARCs. It can be used to check whether WARCs can be loaded
successfully, to list all their URIs, to extract a specified URI, and to
pre-generate @file{.idx.gob} indexes.
$ warc-extract.cmd -idx \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
$ warc-extract.cmd -uri http://some/uri \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
The following example can be used to create a multi-frame @file{.warc.zst}
from any kind of existing WARC. It has a better compression ratio and
much higher decompression speed than @file{.warc.gz}.
$ redo cmd/enzstd/enzstd
$ ./warc-extract.cmd -for-enzstd /path/to.warc.gz |
    cmd/enzstd/enzstd > /path/to.warc.zst
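For converting several files, the same pipeline can be wrapped in a loop.
This is only a sketch: the @code{zst_name} helper is hypothetical, and it
merely derives the output name from the input name. The commands in the
commented loop are the ones from the example above.

```shell
# Hypothetical helper: print the .warc.zst name for a WARC path.
# Returns non-zero for names it does not recognise.
zst_name() {
    case "$1" in
        *.warc.gz) printf '%s\n' "${1%.gz}.zst" ;;
        *.warc)    printf '%s\n' "$1.zst" ;;
        *)         return 1 ;;
    esac
}

# Possible usage, assuming enzstd has been built as shown above:
#   for w in *.warc.gz; do
#       ./warc-extract.cmd -for-enzstd "$w" |
#           cmd/enzstd/enzstd > "$(zst_name "$w")"
#   done
```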
@url{https://www.gnu.org/software/wget/, GNU Wget} can be easily used to
$ wget ... [--page-requisites] [--recursive] \
    --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
    --warc-file smth.warc ...