Higher performance, two pass mode

[sgodup.git] / README
diff --git a/README b/README

index b34dcd95fdd5581f84ac08ecb114aa71a9a0e8db..f8b2b8a7b0eafc769c8e661a228126600c99e42f 100644 (file)
--- a/README
+++ b/README
@@ -1,47 +1,53 @@
                    sgodup -- file deduplication utility
                    ====================================
  
                    sgodup -- file deduplication utility
                    ====================================
  
-DESCRIPTION AND USAGE
+sgodup is utility for files deduplication. You supply two directories:
+the base and one with possible duplicates, utility determines duplicate
+files and replaces them with the links. It is aimed to have very high
+performance.
  
  
-sgodup is utility for duplicate files detection. You supply two
-directories: the base and one with possible duplicates, utility
-determines duplicate files and replaces them with the links. It
-is aimed to have very high performance.
+SINGLE PASS MODE
+================
  
  
-There are just few arguments:
+$ sgodup -basedir DIR -dupdir DIR -action ACTION \
+  [-minsize NNN] [-chmod NNN] [-fsync]
  
  
--basedir -- directory with files that are possible link targets
- -dupdir -- directory with possible duplicates, which are replaced
-            with the links to basedir's files
- -action -- * print: just print to stdout duplicate file path with
-              relative path to basedir's corresponding file
-            * symlink: create symbolic link with relative path to
-              basedir's corresponding file
-            * hardlink: create hard link instead
-  -chmod -- if specified, then chmod files in basedir and dupdir
-            during scan phase. Octal representation is expected
-  -fsync -- fsync directories where linking occurs
+basedir is a directory with "original" files, that are possible link
+targets. dupdir is a directory with possible duplicates, which are to be
+replaced with the links to basedir's file. It is safe to specify same
+directory as a basedir and dupdir.
  
  
-There are three stages:
+There are 3 stages this command will do:
  
  * basedir directory scan: collect all *regular* file paths, sizes and
  
  * basedir directory scan: collect all *regular* file paths, sizes and
-  inodes. If -chmod is specified, then apply it at once. Empty files are
-  ignored
+  inodes. If -chmod is specified, then apply it to them. Files smaller
+  than -minsize (by default it is equal to 1 bytes) are not taken for
+  duplication comparison
  * dupdir directory scan: same as above. If there is no basedir's file
  * dupdir directory scan: same as above. If there is no basedir's file
-  with the same size, then skip dupdir's file (obviously it can not be
+  with the same size, then skip dupdir's one (obviously it can not be
    duplicate). Check that no basedir's files have the same inode, skip
    duplicate). Check that no basedir's files have the same inode, skip
-  dupdir's file otherwise, because it is already hardlinked
-* deduplication stage. For each dupdir file, find basedir file with the
-  same size and compare their contents, to determine if dupdir's one is
-  the duplicate. Perform specified action if so. There are two separate
-  queues and processing cycles:
-
-  * small files, up to 4 KiB (one disk sector): files are fully read and
-    compared in memory
-  * large files (everything else): read and compare first 4 KiB of files
-    in memory. If they are not equal, then this is not a duplicate.
-    Fully read each file's contents sequentially with 128 KiB chunks and
-    calculate BLAKE2b-512 digest otherwise
+  dupdir's file otherwise (it is hardlink)
+* deduplication stage. For each dupdir file, find basedir one with the
+  same size and compare their contents, to determine if dupdir one is
+  the duplicate. Perform specified action if so. Comparing is done the
+  following way:
+  * read first 4 KiB (one disk sector) of each file
+  * if that sector differs, then files are not duplicates
+  * read each file's contents sequentially with 128 KiB chunks and
+    calculate BLAKE2b-512 digest
+
+Action can be the following:
+
+* print: print to stdout duplicate file path with corresponding relative
+  path to basedir's file
+* symlink: create symbolic link with relative path to corresponding
+  basedir's file
+* hardlink: create hard link instead
+* ns: write to stdout series of netstring encoded pairs of duplicate
+  file path and its corresponding basedir's one. It is used in two pass
+  mode. Hint: it is highly compressible
+
+If -fsync is specified, then fsync directories where linking occurs.
  
  Progress is showed at each stage: how many files are counted/processed,
  total size of the files, how much space is deduplicated.
  
  Progress is showed at each stage: how many files are counted/processed,
  total size of the files, how much space is deduplicated.
@@ -58,9 +64,27 @@ total size of the files, how much space is deduplicated.
      [...]
      2020/03/20 11:17:20 321,123 files deduplicated
  
      [...]
      2020/03/20 11:17:20 321,123 files deduplicated
  
-It is safe to specify same directory as a basedir and dupdir.
+TWO PASS MODE
+=============
+
+$ sgodup -basedir DIR -dupdir DIR -action ns [-minsize NNN] [-chmod NNN] > state
+$ sgodup -action ACTION [-fsync] -ns state
+
+If you are dealing with huge amount of small files, then simultaneous
+reading (duplicate detection) and writing (duplicate files linking) on
+the same disk can dramatically decrease performance. It is advisable to
+separate the whole process on two stages: read-only duplicates
+detection, write-only duplicates linking.
+
+Start sgodup with "-action ns" and redirect stdout output to some
+temporary state file, for storing detected duplicate files information
+in it. Then start again with "-ns state" option to relink files.
  
  SAFETY AND CONSISTENCY
  
  SAFETY AND CONSISTENCY
+======================
+
+It was not tested on 32-bit platforms and probably won't work on them
+correctly.
  
  POSIX has no ability to atomically replace regular file with with
  symbolic/hard link. So file is removed first, then link created. sgodup
  
  POSIX has no ability to atomically replace regular file with with
  symbolic/hard link. So file is removed first, then link created. sgodup
@@ -77,6 +101,7 @@ There are no warranties and any defined behaviour if directories (and files
  within) where utility is working with are modified.
  
  LICENCE
  within) where utility is working with are modified.
  
  LICENCE
+=======
  
  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  
  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by