@node WARCs
@unnumbered WARC management

To view WARC files, you have to load them into the daemon. Responses for
the corresponding URIs will then be transparently served from those WARCs.

There is no strict validation of WARC correctness at all!
Still, the built-in WARC support seems to be good enough for various
sources. The following formats are supported:

@table @asis

@item @file{.warc}
Ordinary uncompressed WARC. Useful for storing on a transparently
compressed ZFS dataset.

@item @file{.warc.gz}
GZIP-compressed WARC. Multi-stream (multi-segment) files are also
supported and properly indexed.

@item @file{.warc.zst}
Zstandard-compressed WARC, as described in the
@url{https://iipc.github.io/warc-specifications/specifications/warc-zstd/, specification}.
The multi-frame format is properly indexed. A dictionary at the
beginning is also supported.

It is processed with the @command{unzstd} (@file{cmd/zstd/unzstd})
utility. It reads the compressed stream from @code{stdin}, writes
decompressed data to @code{stdout}, and prints each frame's size
together with the corresponding decompressed data size to the 3rd file
descriptor (if it is open).

@end table
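
The file descriptor convention described for @command{unzstd} can be
tried by hand. This is only a sketch: the file names are placeholders,
and the exact layout of the sizes written to the 3rd descriptor is not
described here, so inspect the output before relying on it.

@example
$ cmd/zstd/unzstd <smth.warc.zst >smth.warc 3>frame-sizes.txt
@end example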

@itemize

@item
Load WARCs:

@example
$ tee fifos/add-warcs <warcs.txt
smth.warc-00000.warc.gz
smth.warc-00001.warc.gz
smth.warc-00002.warc.gz
another.warc
@end example

@item
Visit a URI that you know exists in those WARCs, or go to
@url{http://warc/} to view the full list of URIs known from the loaded
WARCs.

@item
Pay attention: the order of WARC loading is important! A WARC can be
segmented, and a single response can be split across multiple WARC
files. Each subsequently loaded WARC file overwrites URIs that already
exist.

@item
To list and delete loaded WARCs:

@example
$ cat fifos/list-warcs
smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123
$ echo another.warc >fifos/del-warcs
@end example

One possible reason why @file{smth.warc-00002.warc.gz} has no URIs is
that it contains continuation records of a segmented response.

@end itemize
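
Because the loading order matters, the list fed to
@file{fifos/add-warcs} may be easier to generate than to type. A minimal
sketch, assuming the segments follow the @file{PREFIX-NNNNN.warc.gz}
naming shown above; @code{segments} is a hypothetical helper and not
part of the daemon:

@example
# segments PREFIX: print the segment files of one crawl, in order.
# Pathname expansion yields sorted names, and the zero-padded
# counters make that order the segment order.
segments() (
    printf '%s\n' "$1"-*.warc.gz
)
@end example

It could then be used as @code{segments smth.warc | tee fifos/add-warcs}.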

Loading a WARC involves reading it in full and remembering where each
URI's response is located. You can @code{echo SAVE >fifos/add-warcs} to
save the in-memory index to disk as @file{....idx.gob} files. During
the next load, if those files exist, they are used as the index
immediately, without expensive WARC parsing.

The @command{cmd/warc-extract/warc-extract} utility uses exactly the
same code for parsing WARCs. It can be used to check whether WARCs can
be loaded successfully, to list all URIs afterwards, to extract a
specified URI, and to pre-generate @file{.idx.gob} indices.

@example
$ cmd/warc-extract/warc-extract -idx \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
$ cmd/warc-extract/warc-extract -uri http://some/uri \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
@end example

The following example can be used to create a multi-frame
@file{.warc.zst} from any kind of existing WARC. It has a better
compression ratio and much higher decompression speed than
@file{.warc.gz}.

@example
$ cmd/warc-extract/warc-extract -for-enzstd /path/to.warc.gz |
    cmd/zstd/enzstd >/path/to.warc.zst
@end example

@url{https://www.gnu.org/software/wget/, GNU Wget} can easily be used to
create WARCs:

@example
$ wget ... [--page-requisites] [--recursive] \
    --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
    --warc-file smth.warc ...
@end example

Or use the even simpler @url{https://git.jordan.im/crawl/tree/README.md, crawl}
utility, which is also written in Go.
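
For illustration, the Wget template above might be filled in as follows.
The URL, the recursion depth and the size limit are arbitrary example
values, not recommendations:

@example
$ wget --page-requisites --recursive --level=1 \
    --no-warc-keep-log --no-warc-digests --warc-max-size=1G \
    --warc-file smth.warc https://example.org/
@end example

With a size limit set, Wget is expected to number its output files,
which matches the @file{smth.warc-00000.warc.gz} style of names used in
the loading example.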