sgodup -- file deduplication utility
====================================

sgodup is a utility for detecting duplicate files. You supply two
directories, the base one and one with possible duplicates; the utility
finds the duplicate files and replaces them with links. It is designed
for very high performance.

There are just a few arguments:

-basedir -- directory with files that are possible link targets
 -dupdir -- directory with possible duplicates, which are replaced
            with links to basedir's files
 -action -- * print: just print to stdout each duplicate file's path
              with the relative path to basedir's corresponding file
            * symlink: create a symbolic link with a relative path to
              basedir's corresponding file
            * hardlink: create a hard link instead
  -chmod -- if specified, chmod files in basedir and dupdir during
            the scan phase. Octal representation is expected
  -fsync -- fsync directories where linking occurs

There are three stages:

* basedir directory scan: collect all *regular* file paths, sizes and
  inodes. If -chmod is specified, apply it immediately. Empty files are
  skipped
* dupdir directory scan: same as above. If no basedir file has the
  same size, skip the dupdir file (it obviously cannot be a
  duplicate). If some basedir file has the same inode, skip the dupdir
  file as well, because it is already hardlinked
* deduplication stage: for each dupdir file, find a basedir file with
  the same size and compare their contents to determine whether the
  dupdir one is a duplicate. If so, perform the specified action. There
  are two separate queues and processing cycles:

  * small files, up to 4 KiB (one disk sector): files are fully read and
    compared in memory
  * large files (everything else): read and compare the first 4 KiB of
    the files in memory. If they are not equal, this is not a duplicate.
    Otherwise read each file's contents fully and sequentially in
    128 KiB chunks and calculate BLAKE2b-512 digests for comparison
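
The filtering done during the scans and the content comparison of the
last stage can be sketched as follows. This is only an illustration in
Python, not sgodup's actual code: the names scan, candidates,
same_content, SECTOR and CHUNK are made up for the example, and it
assumes that the digest is computed over the whole file.

```python
import hashlib
import os
import stat
from collections import defaultdict

SECTOR = 4 * 1024    # prefix size that is compared in memory
CHUNK = 128 * 1024   # read size used while hashing large files


def scan(root):
    """Map size -> [(path, inode)] for non-empty regular files under root."""
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: symlinks are not regular files
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                by_size[st.st_size].append((path, st.st_ino))
    return by_size


def candidates(base, dup):
    """Yield (dup_path, [base_paths]) pairs that need a content comparison."""
    for size, dups in dup.items():
        bases = base.get(size)
        if not bases:
            continue  # no basedir file of this size: cannot be a duplicate
        base_inodes = {ino for _path, ino in bases}
        for path, ino in dups:
            if ino in base_inodes:
                continue  # same inode: already hardlinked
            yield path, [p for p, _ino in bases]


def _digest(f, head):
    """BLAKE2b-512 over the already read head plus the rest of the file."""
    h = hashlib.blake2b(digest_size=64)
    h.update(head)
    for chunk in iter(lambda: f.read(CHUNK), b""):
        h.update(chunk)
    return h.digest()


def same_content(a, b):
    """Compare two files: in memory if small, by digest otherwise."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        size = os.fstat(fa.fileno()).st_size
        if size != os.fstat(fb.fileno()).st_size:
            return False
        head_a, head_b = fa.read(SECTOR), fb.read(SECTOR)
        if head_a != head_b:
            return False  # first 4 KiB differ: not a duplicate
        if size <= SECTOR:
            return True   # small file: everything was compared already
        return _digest(fa, head_a) == _digest(fb, head_b)
```

Comparing the first 4 KiB before hashing lets most non-duplicates of
equal size be rejected after a single short read of each file.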

Progress is shown at each stage: how many files are counted/processed,
the total size of the files, and how much space is deduplicated.

    2020/03/19 22:57:07 processing basedir...
    2020/03/19 22:57:07 464,329 / 0 (0%) files scanned
    2020/03/19 22:57:07 534 GiB / 0 B (0%)
    2020/03/19 22:57:12 processing dupdir...
    2020/03/19 22:57:12 362,245 / 0 (0%) files scanned
    2020/03/19 22:57:12 362 GiB / 0 B (0%)
    2020/03/19 22:57:17 deduplicating...
    2020/03/19 22:58:18 8,193 / 362,245 (2%) files processed
    2020/03/19 22:58:18 7.7 GiB / 362 GiB (2%) deduplicated
    2020/03/20 11:17:20 321,123 files deduplicated

It is safe to specify the same directory as both basedir and dupdir.

SAFETY AND CONSISTENCY

POSIX has no way to atomically replace a regular file with a
symbolic/hard link, so the file is removed first and the link is created
afterwards. sgodup takes care that a signal (TERM, INT) cannot interrupt
it between those two calls. But any other failure could still stop the
program after the file is removed and before the link is created,
leading to the loss of that file!
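
The remove-then-link sequence described above might look like the
following Python sketch. It is not sgodup's implementation:
replace_with_link and its parameters are hypothetical, and blocking the
signals with pthread_sigmask (Unix only) is merely one assumed way of
keeping TERM and INT away from the two calls.

```python
import os
import signal


def replace_with_link(dup, base, hard=False, fsync=False):
    """Remove dup and put a link to base in its place (not atomic)."""
    dup_dir = os.path.dirname(os.path.abspath(dup))
    # Assumption: hold back TERM and INT until both calls are done.
    blocked = {signal.SIGTERM, signal.SIGINT}
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, blocked)
    try:
        os.remove(dup)
        # A failure from here on leaves dup missing, as warned above.
        if hard:
            os.link(base, dup)
        else:
            # The symlink target is relative to the link's own directory.
            os.symlink(os.path.relpath(base, dup_dir), dup)
        if fsync:
            # Flush the directory entry change, like -fsync is said to do.
            fd = os.open(dup_dir, os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Blocked signals are delivered once the mask is restored, so a pending
TERM or INT still takes effect, just not between the two calls.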

It is recommended to use a filesystem with snapshot capability, to be
able to roll back and restore a removed file. Alternatively, run with
"-action print" beforehand to collect the duplicates and keep the output
as a log for possible recovery.

There are no warranties and no defined behaviour if the directories
(and the files within them) that the utility is working with are
modified while it runs.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.