Higher performance, two pass mode

[sgodup.git] / README
diff --git a/README b/README

index b34dcd95fdd5581f84ac08ecb114aa71a9a0e8db..f8b2b8a7b0eafc769c8e661a228126600c99e42f 100644 (file)
--- a/README
+++ b/README
@@ -1,47 +1,53 @@
                    sgodup -- file deduplication utility
                    ====================================
  
-DESCRIPTION AND USAGE
+sgodup is utility for files deduplication. You supply two directories:
+the base and one with possible duplicates, utility determines duplicate
+files and replaces them with the links. It is aimed to have very high
+performance.
  
-sgodup is utility for duplicate files detection. You supply two
-directories: the base and one with possible duplicates, utility
-determines duplicate files and replaces them with the links. It
-is aimed to have very high performance.
+SINGLE PASS MODE
+================
  
-There are just few arguments:
+$ sgodup -basedir DIR -dupdir DIR -action ACTION \
+  [-minsize NNN] [-chmod NNN] [-fsync]
  
--basedir -- directory with files that are possible link targets
- -dupdir -- directory with possible duplicates, which are replaced
-            with the links to basedir's files
- -action -- * print: just print to stdout duplicate file path with
-              relative path to basedir's corresponding file
-            * symlink: create symbolic link with relative path to
-              basedir's corresponding file
-            * hardlink: create hard link instead
-  -chmod -- if specified, then chmod files in basedir and dupdir
-            during scan phase. Octal representation is expected
-  -fsync -- fsync directories where linking occurs
+basedir is a directory with "original" files, that are possible link
+targets. dupdir is a directory with possible duplicates, which are to be
+replaced with the links to basedir's file. It is safe to specify same
+directory as a basedir and dupdir.
  
-There are three stages:
+There are 3 stages this command will do:
  
  * basedir directory scan: collect all *regular* file paths, sizes and
-  inodes. If -chmod is specified, then apply it at once. Empty files are
-  ignored
+  inodes. If -chmod is specified, then apply it to them. Files smaller
+  than -minsize (by default it is equal to 1 bytes) are not taken for
+  duplication comparison
  * dupdir directory scan: same as above. If there is no basedir's file
-  with the same size, then skip dupdir's file (obviously it can not be
+  with the same size, then skip dupdir's one (obviously it can not be
    duplicate). Check that no basedir's files have the same inode, skip
-  dupdir's file otherwise, because it is already hardlinked
-* deduplication stage. For each dupdir file, find basedir file with the
-  same size and compare their contents, to determine if dupdir's one is
-  the duplicate. Perform specified action if so. There are two separate
-  queues and processing cycles:
-
-  * small files, up to 4 KiB (one disk sector): files are fully read and
-    compared in memory
-  * large files (everything else): read and compare first 4 KiB of files
-    in memory. If they are not equal, then this is not a duplicate.
-    Fully read each file's contents sequentially with 128 KiB chunks and
-    calculate BLAKE2b-512 digest otherwise
+  dupdir's file otherwise (it is hardlink)
+* deduplication stage. For each dupdir file, find basedir one with the
+  same size and compare their contents, to determine if dupdir one is
+  the duplicate. Perform specified action if so. Comparing is done the
+  following way:
+  * read first 4 KiB (one disk sector) of each file
+  * if that sector differs, then files are not duplicates
+  * read each file's contents sequentially with 128 KiB chunks and
+    calculate BLAKE2b-512 digest
+
+Action can be the following:
+
+* print: print to stdout duplicate file path with corresponding relative
+  path to basedir's file
+* symlink: create symbolic link with relative path to corresponding
+  basedir's file
+* hardlink: create hard link instead
+* ns: write to stdout series of netstring encoded pairs of duplicate
+  file path and its corresponding basedir's one. It is used in two pass
+  mode. Hint: it is highly compressible
+
+If -fsync is specified, then fsync directories where linking occurs.
  
  Progress is showed at each stage: how many files are counted/processed,
  total size of the files, how much space is deduplicated.
@@ -58,9 +64,27 @@ total size of the files, how much space is deduplicated.
      [...]
      2020/03/20 11:17:20 321,123 files deduplicated
  
-It is safe to specify same directory as a basedir and dupdir.
+TWO PASS MODE
+=============
+
+$ sgodup -basedir DIR -dupdir DIR -action ns [-minsize NNN] [-chmod NNN] > state
+$ sgodup -action ACTION [-fsync] -ns state
+
+If you are dealing with huge amount of small files, then simultaneous
+reading (duplicate detection) and writing (duplicate files linking) on
+the same disk can dramatically decrease performance. It is advisable to
+separate the whole process on two stages: read-only duplicates
+detection, write-only duplicates linking.
+
+Start sgodup with "-action ns" and redirect stdout output to some
+temporary state file, for storing detected duplicate files information
+in it. Then start again with "-ns state" option to relink files.
  
  SAFETY AND CONSISTENCY
+======================
+
+It was not tested on 32-bit platforms and probably won't work on them
+correctly.
  
  POSIX has no ability to atomically replace regular file with with
  symbolic/hard link. So file is removed first, then link created. sgodup
@@ -77,6 +101,7 @@ There are no warranties and any defined behaviour if directories (and files
  within) where utility is working with are modified.
  
  LICENCE
+=======
  
  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by