                sgodup -- file deduplication utility
                ====================================

DESCRIPTION AND USAGE

sgodup is a utility for detecting duplicate files. You supply two
directories: the base and one with possible duplicates. The utility
determines which files are duplicates and replaces them with links.
It is aimed at very high performance.

There are just a few arguments:

-basedir -- directory with files that are possible link targets
 -dupdir -- directory with possible duplicates, which are replaced
            with links to basedir's files
 -action -- * print: just print to stdout the duplicate file's path
              together with the relative path to basedir's
              corresponding file
            * symlink: create a symbolic link with a relative path to
              basedir's corresponding file
            * hardlink: create a hard link instead
  -chmod -- if specified, chmod files in basedir and dupdir during
            the scan phase. Octal representation is expected
  -fsync -- fsync directories where linking occurs

There are three stages:

* basedir directory scan: collect all *regular* file paths, sizes and
  inodes. If -chmod is specified, apply it immediately. Empty files
  are ignored
* dupdir directory scan: same as above. If there is no basedir file
  with the same size, skip dupdir's file (obviously it can not be a
  duplicate). Also skip dupdir's file if a basedir file has the same
  inode, because it is already hardlinked
* deduplication stage. For each dupdir file, find a basedir file with
  the same size and compare their contents to determine whether
  dupdir's one is a duplicate. Perform the specified action if so.
  There are two separate queues and processing cycles:

  * small files, up to 4 KiB (one disk sector): files are fully read
    and compared in memory
  * large files (everything else): read and compare the first 4 KiB
    of the files in memory. If they are not equal, then this is not
    a duplicate. Otherwise, fully read each file's contents
    sequentially in 128 KiB chunks and calculate its BLAKE2b-512
    digest for comparison

Progress is shown at each stage: how many files are counted/processed,
the total size of the files, and how much space is deduplicated.

    2020/03/19 22:57:07 processing basedir...
    2020/03/19 22:57:07 464,329 / 0 (0%) files scanned
    2020/03/19 22:57:07 534 GiB / 0 B (0%)
    2020/03/19 22:57:12 processing dupdir...
    2020/03/19 22:57:12 362,245 / 0 (0%) files scanned
    2020/03/19 22:57:12 362 GiB / 0 B (0%)
    2020/03/19 22:57:17 deduplicating...
    2020/03/19 22:58:18 8,193 / 362,245 (2%) files processed
    2020/03/19 22:58:18 7.7 GiB / 362 GiB (2%) deduplicated
    [...]
    2020/03/20 11:17:20 321,123 files deduplicated

It is safe to specify the same directory as both basedir and dupdir.

SAFETY AND CONSISTENCY

POSIX offers no single call that atomically replaces a regular file
with a symbolic or hard link, so the file is removed first and the
link is created afterwards. sgodup takes care to prevent a signal
(TERM, INT) from interrupting the program between those two calls.
But any other failure could still stop the program after the file is
removed and before the link is created, and the file would be lost!

It is recommended to use filesystems with snapshot capability, so
that you can roll back and restore a removed file. Alternatively, run
with "-action print" beforehand to collect the duplicates and keep
the output as a log for possible recovery.

There are no warranties, nor any defined behaviour, if the directories
the utility is working with (and the files within them) are modified
while it runs.

LICENCE

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
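
EXAMPLE: COMPARISON STAGE SKETCH

The two comparison paths of the deduplication stage described in
DESCRIPTION AND USAGE can be sketched as follows. This is only an
illustration in Python, not sgodup's own code: the names SECTOR,
CHUNK, digest and is_duplicate are hypothetical, and returning False
for empty files merely mirrors the fact that the scan ignores them.

```python
import hashlib
import os

SECTOR = 4 * 1024    # small-file threshold and prefix length (4 KiB)
CHUNK = 128 * 1024   # sequential read size for large files (128 KiB)


def digest(path):
    """Return the BLAKE2b-512 digest of a file, read in CHUNK-sized pieces."""
    h = hashlib.blake2b()  # default digest size is 64 bytes (512 bits)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            h.update(chunk)
    return h.digest()


def is_duplicate(base_path, dup_path):
    """Decide whether dup_path has the same contents as base_path."""
    size = os.path.getsize(base_path)
    # Assumption: empty files and size mismatches are never candidates,
    # as in the scan stages.
    if size == 0 or size != os.path.getsize(dup_path):
        return False
    # Compare the first 4 KiB in memory; for small files this is
    # the whole file.
    with open(base_path, "rb") as b, open(dup_path, "rb") as d:
        if b.read(SECTOR) != d.read(SECTOR):
            return False
    if size <= SECTOR:
        return True
    # Large files with equal prefixes: compare full-content digests.
    return digest(base_path) == digest(dup_path)
```

Checking the 4 KiB prefix first means that most non-duplicates of
equal size are rejected after a single small read, and only files
with matching prefixes are read in full.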