X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=doc%2Fwarcs.texi;h=aba62ae7d703bf7b13bbc49d9fa43f50ad420f27;hb=bae1cfe5ce46a1b758ccc4dddda2751b6ac47f3e;hp=917055346d6151914e2b45782416abaca3c47826;hpb=0c0a261a6ef4fddfc34a9150005f7964cc69c420;p=tofuproxy.git

diff --git a/doc/warcs.texi b/doc/warcs.texi
index 9170553..aba62ae 100644
--- a/doc/warcs.texi
+++ b/doc/warcs.texi
@@ -6,13 +6,33 @@ transparently replaced from those WARCs for corresponding URIs.
 There is no strict validation or checking of WARCs correctness at all!
 But built-in WARC support seems to be good enough for various sources.
 
-Uncompressed, @command{gzip} (multiple streams and single stream are
-supported) and @command{zstd} compressed ones are supported.
+The following formats are supported:
 
-Searching in compressed files is @strong{slow} -- every request will
-lead to decompression of the file from the very beginning, so keeping
-uncompressed WARCs on compressed ZFS dataset is much more preferable.
-@command{tofuproxy} does not take advantage of multistream gzip files.
+@table @asis
+
+@item @file{.warc}
+An ordinary uncompressed WARC. Convenient to store on a transparently
+compressed ZFS dataset.
+
+@item @file{.warc.gz}
+A GZIP-compressed WARC. Multi-stream (multi-segment) files are also
+supported and properly indexed.
+
+@item @file{.warc.zst}
+A Zstandard-compressed WARC, as described in the
+@url{https://iipc.github.io/warc-specifications/specifications/warc-zstd/, specification}.
+The multi-frame format is properly indexed. A dictionary at the
+beginning is also supported.
+
+It is processed with the @command{unzstd} utility (built with
+@command{redo cmd/unzstd/unzstd}). It reads a compressed stream from
+@code{stdin}, writes decompressed data to @code{stdout}, and prints
+each frame size with the corresponding decompressed data size to the
+third file descriptor (if it is opened). You can adjust the path to it
+with the @code{-X go.stargrave.org/tofuproxy/warc.UnZSTDPath}
+command-line option during building.
+
+@end table
 
 @itemize
 
@@ -56,9 +76,9 @@ it contains continuation segmented records.
 
 Loading of a WARC involves reading it in whole and remembering where
 each URI response is located. You can @code{echo SAVE > fifos/add-warcs} to
-save in-memory index to the disk as @file{....warc.idx.gob} file. During
+save the in-memory index to disk as a @file{....idx.gob} file. During
 the next load, if that file exists, it is used as index immediately,
-without expensive WARC reading.
+without expensive WARC parsing.
 
 @code{redo warc-extract.cmd} builds the @command{warc-extract.cmd} utility,
 that uses exactly the same code for parsing WARCs. It can be used to
@@ -76,6 +96,16 @@ $ warc-extract.cmd -uri http://some/uri \
     smth.warc-00002.warc.gz
 @end example
 
+The following example can be used to create a multi-frame
+@file{.warc.zst} from any kind of already existing WARC. It gives a
+better compression ratio and much higher decompression speed.
+
+@example
+$ redo cmd/enzstd/enzstd
+$ ./warc-extract.cmd -for-enzstd /path/to.warc.gz |
+    cmd/enzstd/enzstd > /path/to.warc.zst
+@end example
+
 @url{https://www.gnu.org/software/wget/, GNU Wget} can be easily used
 to create WARCs:
 