Multi-frame format is properly indexed. Dictionary at the beginning
is also supported.
It is processed with @command{unzstd} (@file{cmd/zstd/unzstd})
utility. It eats the compressed stream from @code{stdin}, outputs
decompressed data to @code{stdout}, and prints each frame's size with
the corresponding decompressed data size to the 3rd file descriptor (if it is
the next load, if those files exist, they are used as an index immediately,
without expensive WARC parsing.
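As a sketch (this exact invocation and the file names are assumptions, not
taken from the source), decompressing a @file{.warc.zst} while capturing the
per-frame size report on the 3rd file descriptor could look like:

@example
$ cmd/zstd/unzstd < smth.warc.zst > smth.warc 3> smth.warc.sizes
@end example

The @code{3>} redirection sends the frame size report to a separate file,
keeping @code{stdout} clean for the decompressed data.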
@code{cmd/warc-extract/warc-extract} utility uses exactly the same code
for parsing WARCs. It can be used to check if WARCs can be successfully
loaded, to list all URIs afterwards, to extract some specified URI, and to
pre-generate @file{.idx.gob} indices.
@example
$ cmd/warc-extract/warc-extract -idx \
smth.warc-00000.warc.gz \
smth.warc-00001.warc.gz \
smth.warc-00002.warc.gz
$ cmd/warc-extract/warc-extract -uri http://some/uri \
smth.warc-00000.warc.gz \
smth.warc-00001.warc.gz \
smth.warc-00002.warc.gz
@end example
and much higher decompression speed than @file{.warc.gz}.
@example
$ cmd/warc-extract/warc-extract -for-enzstd /path/to.warc.gz |
    cmd/zstd/enzstd > /path/to.warc.zst
@end example
@url{https://www.gnu.org/software/wget/, GNU Wget} can be easily used to
@example
--no-warc-keep-log --no-warc-digests [--warc-max-size=XXX] \
--warc-file smth.warc ...
@end example

Or the even simpler @url{https://git.jordan.im/crawl/tree/README.md, crawl}
utility, also written in Go.