@node WARCs
@section WARC management

To view WARC files, you have to load them into the daemon. Responses for
the corresponding URIs will then be transparently served from those
WARCs.

There is no strict validation or checking of WARC correctness at all!
But the built-in WARC support seems to be good enough for various
sources. Uncompressed, @command{gzip}-compressed (both single-stream and
multistream) and @command{zstd}-compressed files are supported.

Searching in compressed files is @strong{slow}: every request leads to
decompression of the file from the very beginning, so keeping
uncompressed WARCs on a compressed ZFS dataset is preferable.
@command{tofuproxy} does not take advantage of multistream gzip files.
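
A minimal sketch of decompressing WARCs before loading them, so that
lookups do not have to decompress from the start of the file on every
request. The file names are hypothetical. @option{-k} asks
@command{gzip} to keep the compressed original (this needs a reasonably
recent @command{gzip}), and @command{zstd} keeps its input by default:

@example
$ gzip -dk smth.warc-00000.warc.gz
$ zstd -d third.warc.zst
@end example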
  16
  17 @itemize
  18
  19 @item
  20 Load WARCs:
  21
  22 @example
  23 $ tee fifos/add-warcs < warcs.txt
  24 smth.warc-00000.warc.gz
  25 smth.warc-00001.warc.gz
  26 smth.warc-00002.warc.gz
  27 another.warc
  28 @end example
  29
@item
Visit a URI you know exists in those WARCs, or go to
@url{http://warc/} to view the full list of URIs known from the loaded
WARCs.

@item
Pay attention: the order in which WARCs are loaded is important! A WARC
can be segmented, and a single response can be split across multiple
WARC files. Each subsequently loaded WARC file overwrites any URIs that
already exist.

@item
To list and delete loaded WARCs:

@example
$ cat fifos/list-warcs
smth.warc-00000.warc.gz 154
smth.warc-00001.warc.gz 13
smth.warc-00002.warc.gz 0
another.warc 123
$ echo another.warc > fifos/del-warcs
@end example

One possible reason why @file{smth.warc-00002.warc.gz} has no URIs is
that it contains continuation records of segmented responses.

@end itemize

Loading a WARC involves reading it in full and remembering where each
URI's response is located. You can @code{echo SAVE > fifos/add-warcs} to
save the in-memory index to disk as a @file{....warc.idx.gob} file.
During the next load, if that file exists, it is used as the index
immediately, without the expensive WARC reading.

@code{redo warc-extract.cmd} builds the @command{warc-extract.cmd}
utility, which uses exactly the same code for parsing WARCs. It can be
used to check whether WARCs can be loaded successfully, to list all
their URIs, to extract a specified URI and to pre-generate
@file{.idx.gob} indexes.

@example
$ warc-extract.cmd -idx \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
$ warc-extract.cmd -uri http://some/uri \
    smth.warc-00000.warc.gz \
    smth.warc-00001.warc.gz \
    smth.warc-00002.warc.gz
@end example

@url{https://www.gnu.org/software/wget/, GNU Wget} can be easily used to
create WARCs:

@example
$ wget ... [--page-requisites] [--recursive] \
    --no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
    --warc-file smth.warc ...
@end example
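
As an illustration, one concrete form of the invocation above could look
like the following sketch. The URL and the size limit are hypothetical
placeholders, not recommendations. With @option{--warc-max-size} set,
Wget is expected to split its output into numbered files such as
@file{smth.warc-00000.warc.gz}:

@example
$ wget --page-requisites --recursive \
    --no-warc-keep-log --no-warc-digests --warc-max-size=1G \
    --warc-file smth.warc https://example.org/
@end example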