doc/warcs.texi

@node WARCs
@section WARCs management

To view WARC files, you have to load them into the daemon. Responses for
the corresponding URIs will then be transparently replaced with the ones
from those WARCs.

There is no strict validation or checking of WARC correctness at all!
But the built-in WARC support seems to be good enough for various
sources. The following formats are supported:

@table @asis

@item @file{.warc}
Ordinary uncompressed WARC. Useful for storing on a transparently
compressed ZFS dataset.

@item @file{.warc.gz}
GZIP compressed WARC. Multi-stream (multi-segment) formats are also
supported and properly indexed.

@item @file{.warc.zst}
Zstandard compressed WARC, as in the
@url{https://iipc.github.io/warc-specifications/specifications/warc-zstd/, specification}.
The multi-frame format is properly indexed. A dictionary at the
beginning is also supported.

It is processed with the @command{unzstd} (@command{redo
cmd/unzstd/unzstd}) utility. It reads the compressed stream from
@code{stdin}, outputs decompressed data to @code{stdout}, and prints
each frame size with the corresponding decompressed data size to the
3rd file descriptor (if it is opened). You can adjust the path to it
with the @code{-X go.stargrave.org/tofuproxy/warc.UnZSTDPath} command
line option during building.

@end table
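
To illustrate what multi-stream means for @file{.warc.gz}: a gzip file
may consist of several independently compressed members concatenated
together, and a decompressor reads them back as one continuous stream.
The following is a minimal sketch using only the standard
@command{gzip} utility, not the daemon's own code; the file name
@file{multi.gz} and the sample strings are made up for illustration.

```shell
# Compress two pieces separately and concatenate the resulting members.
tmp=$(mktemp -d)
printf 'first record\n'  | gzip -c >  "$tmp/multi.gz"
printf 'second record\n' | gzip -c >> "$tmp/multi.gz"

# A single decompression pass yields both pieces back to back.
restored=$(gzip -dc "$tmp/multi.gz")
printf '%s\n' "$restored"
rm -r "$tmp"
```

Because every member starts at a known offset, a reader that remembers
those offsets can seek straight to one member without decompressing
the ones before it.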

@itemize

@item
Load WARCs:

@example
$ tee fifos/add-warcs < warcs.txt
smth.warc-00000.warc.gz
smth.warc-00001.warc.gz
smth.warc-00002.warc.gz
another.warc
@end example

@item
Visit a URI that you know exists in those WARCs, or go to
@url{http://warc/} to view the full list of known URIs loaded from
those WARCs.

@item
Pay attention: the order in which WARCs are loaded is important! A WARC
can be segmented, and a single response can be split across multiple
WARC files. Each subsequently loaded WARC file overwrites any URIs that
already exist.

@item
To list and delete loaded WARCs:

@example
$ cat fifos/list-warcs
smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123
$ echo another.warc > fifos/del-warcs
@end example

One possible reason why @file{smth.warc-00002.warc.gz} has no URIs is
that it contains continuation records of segmented responses.

@end itemize
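
The @file{fifos/list-warcs} output shown above is line-oriented (file
name, then URI count), so it is easy to post-process. The following is
a minimal sketch that picks out the loaded WARCs contributing no URIs.
It works on a copy of the sample listing rather than on a live daemon,
and it assumes that file names contain no whitespace.

```shell
# Sample listing in the format shown above: file name, then URI count.
# With a running daemon it would come from: cat fifos/list-warcs
listing='smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123'

# Print the names whose URI count (second field) is zero.
empty=$(printf '%s\n' "$listing" | awk '$2 == 0 { print $1 }')
printf '%s\n' "$empty"
```

A zero count is not necessarily an error, so treat such a list as a
hint about which files to inspect rather than as a list of files to
delete.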

Loading a WARC involves reading it in full and remembering where each
URI's response is located. You can @code{echo SAVE > fifos/add-warcs} to
save the in-memory index to disk as a @file{....idx.gob} file. During
the next load, if that file exists, it is used as the index immediately,
without expensive WARC parsing.

@code{redo warc-extract.cmd} builds the @command{warc-extract.cmd}
utility, which uses exactly the same code for parsing WARCs. It can be
used to check whether WARCs can be loaded successfully, to list all
URIs afterwards, to extract a specified URI, and to pre-generate
@file{.idx.gob} indexes.

@example
$ warc-extract.cmd -idx \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
$ warc-extract.cmd -uri http://some/uri \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
@end example

The following example can be used to create a multi-frame
@file{.warc.zst} from any kind of already existing WARC. That format
has a better compression ratio and much higher decompression speed.

@example
$ redo cmd/enzstd/enzstd
$ ./warc-extract.cmd -for-enzstd /path/to.warc.gz |
    cmd/enzstd/enzstd > /path/to.warc.zst
@end example
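
When several WARCs have to be converted, the pipeline above can be
wrapped in a loop. The following is a minimal sketch that only prints
the commands it would run (a dry run). The helper @code{zst_name} and
the input file names are hypothetical; the two commands are the ones
from the example above.

```shell
# Hypothetical helper: map a source WARC name to its .warc.zst name.
zst_name() {
    case "$1" in
        *.warc.gz) printf '%s\n' "${1%.gz}.zst" ;;
        *.warc)    printf '%s\n' "$1.zst" ;;
        *)         return 1 ;;   # not a name this sketch knows how to map
    esac
}

# Dry run: print the conversion pipeline for each file instead of running it.
for src in /path/to.warc.gz another.warc; do
    dst=$(zst_name "$src") || continue
    printf './warc-extract.cmd -for-enzstd %s | cmd/enzstd/enzstd > %s\n' \
        "$src" "$dst"
done
```

Printing the commands first makes it possible to review the
destination names before any file is written.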

@url{https://www.gnu.org/software/wget/, GNU Wget} can be easily used to
create WARCs:

@example
$ wget ... [--page-requisites] [--recursive] \
    --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
    --warc-file smth.warc ...
@end example