sgodup -- file deduplication utility
====================================

sgodup is a file deduplication utility. You supply two directories: a
base one and one with possible duplicates; the utility determines which
files are duplicates and replaces them with links. It is aimed at very
high performance.

SINGLE PASS MODE
================

    $ sgodup -basedir DIR -dupdir DIR -action ACTION \
        [-minsize NNN] [-chmod NNN] [-fsync]

basedir is a directory with the "original" files, the possible link
targets. dupdir is a directory with possible duplicates, which are to
be replaced with links to basedir's files. It is safe to specify the
same directory as both basedir and dupdir.

The command goes through three stages:

* basedir directory scan: collect all *regular* file paths, sizes and
  inodes. If -chmod is specified, apply it to them. Files smaller than
  -minsize (1 byte by default) are not taken for duplicate comparison
* dupdir directory scan: same as above. If there is no basedir file
  with the same size, skip the dupdir one (obviously it cannot be a
  duplicate). Check that no basedir file has the same inode, skip the
  dupdir file otherwise (it is a hardlink)
* deduplication stage: for each dupdir file, find a basedir one with
  the same size and compare their contents to determine whether the
  dupdir one is a duplicate. Perform the specified action if so

Comparison is done the following way (see the sketch after this
section):

* read the first 4 KiB (one disk sector) of each file
* if that sector differs, then the files are not duplicates
* read each file's contents sequentially in 128 KiB chunks and
  calculate the BLAKE2b-512 digest

The action can be one of the following:

* print: print to stdout the duplicate file path with the corresponding
  relative path to the basedir file
* symlink: create a symbolic link with the relative path to the
  corresponding basedir file
* hardlink: create a hard link instead
* ns: write to stdout a series of netstring-encoded pairs of the
  duplicate file path and its corresponding basedir one. It is used in
  two pass mode. Hint: it is highly compressible

If -fsync is specified, the directories where linking occurs are
fsynced.

Progress is shown at each stage: how many files are counted/processed,
the total size of the files, and how much space is deduplicated.

    2020/03/19 22:57:07 processing basedir...
    2020/03/19 22:57:07 464,329 / 0 (0%) files scanned
    2020/03/19 22:57:07 534 GiB / 0 B (0%)
    2020/03/19 22:57:12 processing dupdir...
    2020/03/19 22:57:12 362,245 / 0 (0%) files scanned
    2020/03/19 22:57:12 362 GiB / 0 B (0%)
    2020/03/19 22:57:17 deduplicating...
    2020/03/19 22:58:18 8,193 / 362,245 (2%) files processed
    2020/03/19 22:58:18 7.7 GiB / 362 GiB (2%) deduplicated
    [...]
    2020/03/20 11:17:20 321,123 files deduplicated
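For illustration, here is a minimal Go sketch of that comparison
approach: compare a 4 KiB prefix first, then stream each file in
128 KiB chunks through BLAKE2b-512 and compare the digests. It uses the
golang.org/x/crypto/blake2b package; the details (such as whether the
first sector is included in the digest) are assumptions, and this is
not sgodup's actual code.

    // equalfiles: standalone sketch of the two-step comparison.
    package main

    import (
        "bytes"
        "fmt"
        "io"
        "os"

        "golang.org/x/crypto/blake2b"
    )

    // samePrefix compares the first 4 KiB of both files.
    func samePrefix(a, b *os.File) (bool, error) {
        bufA := make([]byte, 4096)
        bufB := make([]byte, 4096)
        nA, err := io.ReadFull(a, bufA)
        if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
            return false, err
        }
        nB, err := io.ReadFull(b, bufB)
        if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
            return false, err
        }
        return bytes.Equal(bufA[:nA], bufB[:nB]), nil
    }

    // digest hashes the whole file in 128 KiB chunks with BLAKE2b-512.
    func digest(f *os.File) ([]byte, error) {
        if _, err := f.Seek(0, io.SeekStart); err != nil {
            return nil, err
        }
        h, err := blake2b.New512(nil)
        if err != nil {
            return nil, err
        }
        buf := make([]byte, 128*1024)
        for {
            n, err := f.Read(buf)
            if n > 0 {
                h.Write(buf[:n])
            }
            if err == io.EOF {
                return h.Sum(nil), nil
            }
            if err != nil {
                return nil, err
            }
        }
    }

    // equalFiles reports whether two same-sized files have identical
    // contents, using the prefix check as a cheap negative filter.
    func equalFiles(pathA, pathB string) (bool, error) {
        a, err := os.Open(pathA)
        if err != nil {
            return false, err
        }
        defer a.Close()
        b, err := os.Open(pathB)
        if err != nil {
            return false, err
        }
        defer b.Close()
        if same, err := samePrefix(a, b); err != nil || !same {
            return false, err
        }
        dA, err := digest(a)
        if err != nil {
            return false, err
        }
        dB, err := digest(b)
        if err != nil {
            return false, err
        }
        return bytes.Equal(dA, dB), nil
    }

    func main() {
        same, err := equalFiles(os.Args[1], os.Args[2])
        if err != nil {
            panic(err)
        }
        fmt.Println(same)
    }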
TWO PASS MODE
=============

    $ sgodup -basedir DIR -dupdir DIR -action ns \
        [-minsize NNN] [-chmod NNN] > state
    $ sgodup -action ACTION [-fsync] -ns < state

If you are dealing with a huge number of small files, then simultaneous
reading (duplicate detection) and writing (linking of duplicate files)
on the same disk can dramatically decrease performance. It is advisable
to separate the whole process into two stages: read-only duplicate
detection, then write-only duplicate linking.

Start sgodup with "-action ns" and redirect its stdout to some
temporary state file, to store the information about the detected
duplicates in it. Then run it again with the -ns option and that state
file to relink the files.

SAFETY AND CONSISTENCY
======================

It was not tested on 32-bit platforms and probably won't work on them
correctly.

POSIX has no way to atomically replace a regular file with a
symbolic/hard link, so the file is removed first and then the link is
created. sgodup cautiously prevents possible interruption by signal
(TERM, INT) between those two calls. But any other failure could still
break the program after the file removal and before the link creation,
leading to the file's loss! It is recommended to use a filesystem with
snapshot capability, to be able to roll back and restore a removed
file. Or you can run "-action print" beforehand to collect the
duplicates and use its output as a log for possible recovery.

There are no warranties and no defined behaviour if the directories
(and the files within them) the utility is working with are modified.

LICENCE
=======

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
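For illustration, here is a minimal Go sketch of the remove-then-link
sequence described in SAFETY AND CONSISTENCY: TERM and INT are ignored
around the two non-atomic calls, and the directory is optionally
fsynced afterwards. The hardlink action and the helper name are
assumptions for the sketch; this is not sgodup's actual code.

    // relink: sketch of the non-atomic remove-then-link step.
    package main

    import (
        "os"
        "os/signal"
        "path/filepath"
        "syscall"
    )

    // relink replaces dup with a hard link to base. TERM and INT are
    // ignored for the duration of the two calls, so those signals
    // cannot terminate the process between the removal and the link
    // creation. Any other failure in between still loses dup.
    func relink(base, dup string, doFsync bool) error {
        signal.Ignore(syscall.SIGTERM, syscall.SIGINT)
        defer signal.Reset(syscall.SIGTERM, syscall.SIGINT)

        if err := os.Remove(dup); err != nil {
            return err
        }
        if err := os.Link(base, dup); err != nil {
            return err
        }
        if !doFsync {
            return nil
        }
        // fsync the directory where the linking occurred, so that the
        // new directory entry reaches stable storage.
        dir, err := os.Open(filepath.Dir(dup))
        if err != nil {
            return err
        }
        defer dir.Close()
        return dir.Sync()
    }

    func main() {
        if err := relink(os.Args[1], os.Args[2], true); err != nil {
            panic(err)
        }
    }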