doc/warcs.texi

@node WARCs
@unnumbered WARCs management

To view WARC files, you have to load them into the daemon. Responses for
the corresponding URIs will then be transparently served from those WARCs.

There is no strict validation or checking of WARC correctness at all!
But the built-in WARC support seems to be good enough for various sources.
The following formats are supported:

@table @asis

@item @file{.warc}
Ordinary uncompressed WARC. Useful when stored on a transparently
compressed ZFS dataset.

@item @file{.warc.gz}
GZIP-compressed WARC. Multi-stream (multi-segment) files are also
supported and properly indexed.

@item @file{.warc.zst}
Zstandard-compressed WARC, as described in the
@url{https://iipc.github.io/warc-specifications/specifications/warc-zstd/, specification}.
The multi-frame format is properly indexed. A dictionary at the beginning
of the file is also supported.

It is processed with the @command{unzstd} (@file{cmd/zstd/unzstd})
utility. It reads the compressed stream from @code{stdin}, writes the
decompressed data to @code{stdout}, and prints each frame's size together
with its decompressed data size to the 3rd file descriptor (if it is
open).

@end table
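
To illustrate what ``multi-stream'' means for @file{.warc.gz}: a gzip
file may consist of several independently compressed members written
back to back, and a decompressor reads them as one continuous stream.
The following sketch only demonstrates that property with the standard
@command{gzip} tool. The record texts are placeholders, not real WARC
records, and nothing here shows how the daemon actually builds its index.

```shell
# Build a two-member gzip file by appending independently compressed streams.
tmp=$(mktemp -d)
printf 'first record\n'  | gzip -c >  "$tmp/multi.gz"
printf 'second record\n' | gzip -c >> "$tmp/multi.gz"

# A gzip decompressor reads the members back to back as one stream.
joined=$(gzip -dc "$tmp/multi.gz")
printf '%s\n' "$joined"

rm -r "$tmp"
```

Because every member is a complete gzip stream, decompression can also
start at the byte offset of any member, which is presumably what makes
indexing such files worthwhile.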

@itemize

@item
Load WARCs:

@example
$ tee fifos/add-warcs < warcs.txt
smth.warc-00000.warc.gz
smth.warc-00001.warc.gz
smth.warc-00002.warc.gz
another.warc
@end example

@item
Visit a URI that you know exists in those WARCs, or go to
@url{http://warc/} to view the full list of known URIs loaded from
those WARCs.

@item
Note that the order in which WARCs are loaded is important! A WARC can
be segmented, and a single response can be split across multiple WARC
files. Each subsequently loaded WARC file overwrites any already existing
URIs.

@item
To list loaded WARCs and to delete one of them:

@example
$ cat fifos/list-warcs
smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123
$ echo another.warc > fifos/del-warcs
@end example

One possible reason why @file{smth.warc-00002.warc.gz} has no URIs is
that it contains continuation records of segmented responses.

@end itemize
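
Since loading order matters and @file{fifos/list-warcs} prints one name
and URI count per line, both steps lend themselves to small shell
helpers. The sketch below is an illustration under assumptions, not part
of the project: it sorts the segment names from the example above so that
the zero-padded numbers come out in order, and it filters sample
@file{list-warcs}-style text for entries with a zero count. It assumes
file names without whitespace.

```shell
# Segment names from the example above, deliberately out of order.
unsorted='smth.warc-00002.warc.gz
smth.warc-00000.warc.gz
smth.warc-00001.warc.gz'

# Zero-padded segment numbers sort correctly as plain bytes.
ordered=$(printf '%s\n' "$unsorted" | LC_ALL=C sort)
printf '%s\n' "$ordered"

# Sample text in the "name count" shape printed by fifos/list-warcs.
listing='smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123'

# Collect the names whose URI count is zero.
empty=$(printf '%s\n' "$listing" | while read -r name count; do
    if [ "$count" = 0 ]; then printf '%s\n' "$name"; fi
done)
printf '%s\n' "$empty"
```

Against a running daemon, the sorted list would be written to
@file{fifos/add-warcs}, and the sample text would be replaced by the
output of @code{cat fifos/list-warcs}.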

Loading a WARC involves reading it in full and remembering where each
URI's response is located. You can @code{echo SAVE > fifos/add-warcs} to
save the in-memory index to disk as @file{....idx.gob} files. During
the next load, if those files exist, they are used as the index
immediately, without expensive WARC parsing.

The @code{cmd/warc-extract/warc-extract} utility uses exactly the same
code for parsing WARCs. It can be used to check whether WARCs can be
loaded successfully, to list all their URIs, to extract a specified
URI, and to pre-generate @file{.idx.gob} indices.

@example
$ cmd/warc-extract/warc-extract -idx \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
$ cmd/warc-extract/warc-extract -uri http://some/uri \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
@end example

The following example can be used to create a multi-frame @file{.warc.zst}
from any kind of already existing WARC. It has a better compression ratio
and much higher decompression speed than @file{.warc.gz}.

@example
$ cmd/warc-extract/warc-extract -for-enzstd /path/to.warc.gz |
    cmd/zstd/enzstd > /path/to.warc.zst
@end example
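
When several @file{.warc.gz} files need converting, the pipeline above
can be generated once per file. The following dry-run sketch only builds
and prints the commands. The input names are the placeholders used
earlier, and whether segmented WARCs may be converted one file at a time
is an assumption that should be checked before relying on it.

```shell
# Dry run: build one conversion command per input file instead of running it.
# Each output name is the input name with .warc.gz swapped for .warc.zst.
plan=$(
    for src in smth.warc-00000.warc.gz smth.warc-00001.warc.gz; do
        dst=$(printf '%s\n' "$src" | sed 's/\.warc\.gz$/.warc.zst/')
        printf '%s\n' "cmd/warc-extract/warc-extract -for-enzstd $src | cmd/zstd/enzstd > $dst"
    done
)
printf '%s\n' "$plan"
```

Once the printed commands look right, they can be piped to @command{sh}
to run them. File names containing whitespace are not handled by this
sketch.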

@url{https://www.gnu.org/software/wget/, GNU Wget} can easily be used to
create WARCs:

@example
$ wget ... [--page-requisites] [--recursive] \
    --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
    --warc-file smth.warc ...
@end example

Or use the even simpler @url{https://git.jordan.im/crawl/tree/README.md, crawl}
utility, which is also written in Go.