sgodup -- file deduplication utility
====================================
-DESCRIPTION AND USAGE
+sgodup is a file deduplication utility. You supply two directories: the
+base one and one with possible duplicates; the utility determines
+duplicate files and replaces them with links. It aims at very high
+performance.
-sgodup is utility for duplicate files detection. You supply two
-directories: the base and one with possible duplicates, utility
-determines duplicate files and replaces them with the links. It
-is aimed to have very high performance.
+SINGLE PASS MODE
+================
-There are just few arguments:
+$ sgodup -basedir DIR -dupdir DIR -action ACTION \
+ [-minsize NNN] [-chmod NNN] [-fsync]
--basedir -- directory with files that are possible link targets
- -dupdir -- directory with possible duplicates, which are replaced
- with the links to basedir's files
- -action -- * print: just print to stdout duplicate file path with
- relative path to basedir's corresponding file
- * symlink: create symbolic link with relative path to
- basedir's corresponding file
- * hardlink: create hard link instead
- -chmod -- if specified, then chmod files in basedir and dupdir
- during scan phase. Octal representation is expected
- -fsync -- fsync directories where linking occurs
+basedir is a directory with the "original" files that are possible link
+targets. dupdir is a directory with possible duplicates, which are to
+be replaced with links to basedir's files. It is safe to specify the
+same directory as both basedir and dupdir.
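+
+For example, to just print what would be deduplicated (paths here are
+purely illustrative):
+
+$ sgodup -basedir /mnt/photos -dupdir /mnt/backup/photos -action print
+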
-There are three stages:
+This command performs three stages:
* basedir directory scan: collect all *regular* file paths, sizes and
- inodes. If -chmod is specified, then apply it at once. Empty files are
- ignored
+  inodes. If -chmod is specified, then apply it to them. Files smaller
+  than -minsize (1 byte by default) are not taken into duplicate
+  comparison (a sketch of the scan follows this list)
* dupdir directory scan: same as above. If there is no basedir's file
- with the same size, then skip dupdir's file (obviously it can not be
+  with the same size, then skip the dupdir file (obviously it cannot be a
duplicate). Check that no basedir's files have the same inode, skip
- dupdir's file otherwise, because it is already hardlinked
-* deduplication stage. For each dupdir file, find basedir file with the
- same size and compare their contents, to determine if dupdir's one is
- the duplicate. Perform specified action if so. There are two separate
- queues and processing cycles:
-
- * small files, up to 4 KiB (one disk sector): files are fully read and
- compared in memory
- * large files (everything else): read and compare first 4 KiB of files
- in memory. If they are not equal, then this is not a duplicate.
- Fully read each file's contents sequentially with 128 KiB chunks and
- calculate BLAKE2b-512 digest otherwise
+  dupdir's file otherwise (it is already a hardlink)
+* deduplication stage. For each dupdir file, find the basedir one with
+  the same size and compare their contents to determine whether the
+  dupdir one is a duplicate. Perform the specified action if so.
+  Comparison is done the following way (see the sketch after this list):
+  * read the first 4 KiB (one disk sector) of each file
+  * if those sectors differ, then the files are not duplicates
+  * otherwise read each file's contents sequentially in 128 KiB chunks
+    and compare their BLAKE2b-512 digests
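+
+The scan stages might look like the following minimal Go sketch (names
+like scan and minSize are illustrative, not sgodup's actual code; inode
+retrieval is Unix-specific and -chmod handling is omitted):
+
+    package main
+
+    import (
+        "io/fs"
+        "path/filepath"
+        "syscall"
+    )
+
+    type fileInfo struct {
+        path  string
+        size  int64
+        inode uint64
+    }
+
+    // scan collects paths, sizes and inodes of regular files not
+    // smaller than minSize.
+    func scan(root string, minSize int64) ([]fileInfo, error) {
+        var files []fileInfo
+        walk := func(path string, d fs.DirEntry, err error) error {
+            if err != nil {
+                return err
+            }
+            if !d.Type().IsRegular() {
+                return nil // only regular files are considered
+            }
+            info, err := d.Info()
+            if err != nil {
+                return err
+            }
+            if info.Size() < minSize {
+                return nil // too small for duplicate comparison
+            }
+            st := info.Sys().(*syscall.Stat_t) // Unix: inode number
+            files = append(files, fileInfo{path, info.Size(), st.Ino})
+            return nil
+        }
+        err := filepath.WalkDir(root, walk)
+        return files, err
+    }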
+
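+The comparison itself might look like this minimal sketch (assuming the
+golang.org/x/crypto/blake2b package; isDuplicate is an illustrative
+name, and both files are assumed to have equal size already):
+
+    package main
+
+    import (
+        "bytes"
+        "io"
+        "os"
+
+        "golang.org/x/crypto/blake2b"
+    )
+
+    const (
+        sectorSize = 4 * 1024   // prefix compared directly in memory
+        chunkSize  = 128 * 1024 // buffer size for sequential hashing
+    )
+
+    func isDuplicate(pathA, pathB string) (bool, error) {
+        a, err := os.Open(pathA)
+        if err != nil {
+            return false, err
+        }
+        defer a.Close()
+        b, err := os.Open(pathB)
+        if err != nil {
+            return false, err
+        }
+        defer b.Close()
+        // Read and compare the first 4 KiB of each file.
+        bufA := make([]byte, sectorSize)
+        bufB := make([]byte, sectorSize)
+        nA, err := io.ReadFull(a, bufA)
+        if err != nil && err != io.ErrUnexpectedEOF {
+            return false, err
+        }
+        nB, err := io.ReadFull(b, bufB)
+        if err != nil && err != io.ErrUnexpectedEOF {
+            return false, err
+        }
+        if nA != nB || !bytes.Equal(bufA[:nA], bufB[:nB]) {
+            return false, nil // differing prefixes: not duplicates
+        }
+        // Fully re-read both files in 128 KiB chunks and compare
+        // their BLAKE2b-512 digests.
+        digest := func(f *os.File) ([]byte, error) {
+            if _, err := f.Seek(0, io.SeekStart); err != nil {
+                return nil, err
+            }
+            h, err := blake2b.New512(nil)
+            if err != nil {
+                return nil, err
+            }
+            buf := make([]byte, chunkSize)
+            if _, err := io.CopyBuffer(h, f, buf); err != nil {
+                return nil, err
+            }
+            return h.Sum(nil), nil
+        }
+        dA, err := digest(a)
+        if err != nil {
+            return false, err
+        }
+        dB, err := digest(b)
+        if err != nil {
+            return false, err
+        }
+        return bytes.Equal(dA, dB), nil
+    }
+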
+ACTION can be one of the following:
+
+* print: print to stdout the duplicate file path together with the
+  relative path to the corresponding basedir's file
+* symlink: create a symbolic link with the relative path to the
+  corresponding basedir's file
+* hardlink: create a hard link instead
+* ns: write to stdout a series of netstring-encoded pairs: the
+  duplicate file path and its corresponding basedir's one (see the
+  sketch after this list). It is used in two pass mode. Hint: this
+  output is highly compressible
+
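+A netstring is the payload's decimal length, a colon, the payload and a
+trailing comma ("5:hello,"). A minimal Go sketch of emitting one pair
+(that each pair is written as two consecutive netstrings, as assumed
+here, is not specified above):
+
+    package main
+
+    import (
+        "fmt"
+        "io"
+    )
+
+    func writeNetstring(w io.Writer, s string) error {
+        _, err := fmt.Fprintf(w, "%d:%s,", len(s), s)
+        return err
+    }
+
+    // writePair records one duplicate: dupdir's path, then basedir's.
+    func writePair(w io.Writer, dup, base string) error {
+        if err := writeNetstring(w, dup); err != nil {
+            return err
+        }
+        return writeNetstring(w, base)
+    }
+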
+If -fsync is specified, then the directories where linking occurs are
+fsynced.
Progress is shown at each stage: how many files are counted/processed,
the total size of those files, and how much space is deduplicated.
[...]
2020/03/20 11:17:20 321,123 files deduplicated
-It is safe to specify same directory as a basedir and dupdir.
+TWO PASS MODE
+=============
+
+$ sgodup -basedir DIR -dupdir DIR -action ns [-minsize NNN] [-chmod NNN] > state
+$ sgodup -action ACTION [-fsync] -ns state
+
+If you are dealing with a huge number of small files, then simultaneous
+reading (duplicate detection) and writing (duplicate file linking) on
+the same disk can dramatically decrease performance. It is advisable to
+split the whole process into two stages: read-only duplicate detection
+and write-only duplicate linking.
+
+Start sgodup with "-action ns" and redirect its stdout to some
+temporary state file to store the information about detected
+duplicates. Then run it again with the "-ns state" option to link the
+files.
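+
+Reading the state file back might look like this sketch (same
+assumption as above: consecutive netstrings, alternating duplicate and
+basedir paths):
+
+    package main
+
+    import (
+        "bufio"
+        "fmt"
+        "io"
+        "os"
+    )
+
+    func readNetstring(r *bufio.Reader) (string, error) {
+        lenStr, err := r.ReadString(':') // decimal length up to ":"
+        if err != nil {
+            return "", err
+        }
+        var n int
+        if _, err := fmt.Sscanf(lenStr, "%d:", &n); err != nil {
+            return "", err
+        }
+        buf := make([]byte, n+1) // payload plus trailing ","
+        if _, err := io.ReadFull(r, buf); err != nil {
+            return "", err
+        }
+        if buf[n] != ',' {
+            return "", fmt.Errorf("netstring: no trailing comma")
+        }
+        return string(buf[:n]), nil
+    }
+
+    func main() {
+        r := bufio.NewReader(os.Stdin) // e.g. the state file
+        for {
+            dup, err := readNetstring(r)
+            if err == io.EOF {
+                break
+            }
+            if err != nil {
+                panic(err)
+            }
+            base, err := readNetstring(r)
+            if err != nil {
+                panic(err)
+            }
+            fmt.Println(dup, "->", base)
+        }
+    }
+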
SAFETY AND CONSISTENCY
+======================
+
+It has not been tested on 32-bit platforms and probably won't work on
+them correctly.
POSIX has no ability to atomically replace a regular file with a
symbolic/hard link. So the file is removed first, then the link is
created. sgodup
[...]
within) where utility is working with are modified.
LICENCE
+=======
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by