sgodup -- file deduplication utility
====================================

sgodup is a utility for file deduplication. You supply two directories:
the base one and one with possible duplicates. The utility finds the
duplicate files and replaces them with links. It is aimed to have very
high performance.

    $ sgodup -basedir DIR -dupdir DIR -action ACTION \
      [-minsize NNN] [-chmod NNN] [-fsync]

basedir is a directory with "original" files that are possible link
targets. dupdir is a directory with possible duplicates, which are to be
replaced with links to basedir's files. It is safe to specify the same
directory as both basedir and dupdir.

The command runs in three stages:

  * basedir directory scan: collect all *regular* file paths, sizes and
    inodes. If -chmod is specified, then apply it to them. Files smaller
    than -minsize (1 byte by default) are not considered for duplicate
    comparison
  * dupdir directory scan: same as above. If no basedir file has the
    same size, then the dupdir file is skipped (obviously it cannot be a
    duplicate). If some basedir file has the same inode, then the dupdir
    file is skipped too (it is a hard link)
  * deduplication stage: for each dupdir file, find a basedir one with
    the same size and compare their contents to determine whether the
    dupdir one is a duplicate. Perform the specified action if so.
    Comparing is done the following way:
      * read the first 4 KiB (one disk sector) of each file
      * if that sector differs, then the files are not duplicates
      * read each file's contents sequentially in 128 KiB chunks and
        calculate its BLAKE2b-512 digest

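The three stages above can be sketched in Python as follows. This is
only an illustration of the described logic, not sgodup's actual code:
the function names (scan, digest, same_contents, find_duplicates) are
hypothetical, -chmod handling is omitted, and it is an assumption that
two files count as duplicates when their digests are equal.

```python
import hashlib
import os
import stat

SECTOR = 4 * 1024    # size of the first block compared byte for byte
CHUNK = 128 * 1024   # read size used while hashing


def scan(root, minsize=1):
    """Walk root and return {size: [(path, inode), ...]} for regular files."""
    by_size = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: symlinks are not followed
            if not stat.S_ISREG(st.st_mode) or st.st_size < minsize:
                continue
            by_size.setdefault(st.st_size, []).append((path, st.st_ino))
    return by_size


def first_sector(path):
    """Return the first 4 KiB of the file."""
    with open(path, "rb") as f:
        return f.read(SECTOR)


def digest(path):
    """BLAKE2b-512 digest of the file, read sequentially in 128 KiB chunks."""
    h = hashlib.blake2b()  # the default digest size is 64 bytes (512 bits)
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.digest()


def same_contents(a, b):
    """Cheap first-sector check, then a full digest comparison."""
    if first_sector(a) != first_sector(b):
        return False
    return digest(a) == digest(b)


def find_duplicates(basedir, dupdir, minsize=1):
    """Yield (dupdir path, basedir path) for every detected duplicate."""
    base = scan(basedir, minsize)
    base_inodes = {ino for entries in base.values() for _path, ino in entries}
    for size, entries in scan(dupdir, minsize).items():
        for path, ino in entries:
            if ino in base_inodes:
                continue  # shares an inode with a basedir file: hard link
            for base_path, _ino in base.get(size, ()):
                if same_contents(base_path, path):
                    yield path, base_path
                    break
```

Grouping by size first means that files with a unique size are never
opened at all, and the first-sector check rejects most remaining
non-duplicates after a single small read.
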
Action can be one of the following:

  * print: print to stdout the duplicate file path with the
    corresponding relative path to basedir's file
  * symlink: create a symbolic link with a relative path to the
    corresponding basedir's file
  * hardlink: create a hard link instead
  * ns: write to stdout a series of netstring encoded pairs of the
    duplicate file path and its corresponding basedir's one. It is used
    in two pass mode. Hint: it is highly compressible

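For reference, a netstring is the byte length in decimal, a colon, the
data itself and a trailing comma. The sketch below encodes and decodes
path pairs that way. It assumes that a pair is simply two consecutive
netstrings (duplicate path first, basedir path second); the exact layout
sgodup writes is not specified here, and all function names are
hypothetical.

```python
def ns_encode(data: bytes) -> bytes:
    """Encode bytes as a netstring: b"<length>:<data>,"."""
    return b"%d:%s," % (len(data), data)


def ns_read(stream):
    """Read one netstring from a binary stream; return None at clean EOF."""
    digits = b""
    while True:
        c = stream.read(1)
        if c == b"":
            if digits:
                raise ValueError("truncated netstring length")
            return None
        if c == b":":
            break
        if not c.isdigit():
            raise ValueError("malformed netstring length")
        digits += c
    if not digits:
        raise ValueError("empty netstring length")
    size = int(digits)
    data = stream.read(size)
    if len(data) != size or stream.read(1) != b",":
        raise ValueError("truncated or malformed netstring")
    return data


def write_pair(stream, dup_path: bytes, base_path: bytes):
    """Assumed layout: duplicate path, then its basedir counterpart."""
    stream.write(ns_encode(dup_path) + ns_encode(base_path))


def read_pairs(stream):
    """Yield (dup_path, base_path) tuples until the stream ends."""
    while (dup_path := ns_read(stream)) is not None:
        base_path = ns_read(stream)
        if base_path is None:
            raise ValueError("odd number of netstrings")
        yield dup_path, base_path
```

Because the length is explicit, paths containing newlines, commas or
colons survive the round trip unchanged, which a line-oriented format
would not guarantee.
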
If -fsync is specified, then the directories where linking occurs are
fsynced.

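Syncing a directory makes the changed directory entries (the removed
file and the new link) durable. A minimal sketch of that operation on a
Unix-like system, with a hypothetical helper name; whether sgodup does
exactly this is an assumption:

```python
import os


def fsync_dir(path):
    """Flush a directory's entries to stable storage (Unix-like systems)."""
    # O_DIRECTORY makes the open fail if path is not a directory.
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```
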
Progress is shown at each stage: how many files are counted/processed,
the total size of the files, and how much space is deduplicated.

    2020/03/19 22:57:07 processing basedir...
    2020/03/19 22:57:07 464,329 / 0 (0%) files scanned
    2020/03/19 22:57:07 534 GiB / 0 B (0%)
    2020/03/19 22:57:12 processing dupdir...
    2020/03/19 22:57:12 362,245 / 0 (0%) files scanned
    2020/03/19 22:57:12 362 GiB / 0 B (0%)
    2020/03/19 22:57:17 deduplicating...
    2020/03/19 22:58:18 8,193 / 362,245 (2%) files processed
    2020/03/19 22:58:18 7.7 GiB / 362 GiB (2%) deduplicated
    2020/03/20 11:17:20 321,123 files deduplicated

    $ sgodup -basedir DIR -dupdir DIR -action ns [-minsize NNN] [-chmod NNN] > state
    $ sgodup -action ACTION [-fsync] -ns < state

If you are dealing with a huge number of small files, then simultaneous
reading (duplicate detection) and writing (duplicate file linking) on
the same disk can dramatically decrease performance. It is advisable to
split the whole process into two stages: read-only duplicate detection
and write-only duplicate linking.

Start sgodup with "-action ns" and redirect its stdout to a temporary
state file, to store the information about detected duplicate files in
it. Then start it again with the "-ns state" option to relink the files.

SAFETY AND CONSISTENCY
======================

It was not tested on 32-bit platforms and probably won't work on them.

POSIX has no ability to atomically replace a regular file with a
symbolic/hard link. So the file is removed first, then the link is
created. sgodup cautiously prevents possible interruption of those two
calls by a signal (TERM, INT). But any other failure could still break
the program after file removal and before link creation, leading to the
loss of the file!

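The remove-then-link sequence can be illustrated as below. The sketch
blocks TERM and INT around the two calls with a signal mask; that is one
possible way to avoid interruption and only an assumption about how
sgodup achieves it. The function name is hypothetical, and the relative
link target mirrors the "symlink" action described earlier.

```python
import os
import signal


def replace_with_symlink(dup_path, base_path):
    """Replace dup_path with a relative symlink pointing at base_path."""
    link_dir = os.path.dirname(dup_path) or "."
    target = os.path.relpath(base_path, start=link_dir)
    # Defer TERM and INT so they cannot arrive between unlink and symlink.
    old_mask = signal.pthread_sigmask(
        signal.SIG_BLOCK, {signal.SIGTERM, signal.SIGINT}
    )
    try:
        os.unlink(dup_path)
        os.symlink(target, dup_path)
    finally:
        # Restore the mask; any pending signal is delivered afterwards.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return target
```

Note that this narrows the window but does not close it: if os.symlink
itself fails (for example, because of a full disk or a permission
error), the original file is already gone.
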
It is recommended to use filesystems with snapshot capability, to be
able to roll back and restore a removed file. Or you can use "-action
print" beforehand to collect the duplicates and use its output as a log
for possible recovery.

There are no warranties and no defined behaviour if the directories the
utility is working with (and the files within them) are modified while
it runs.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.