            sgodup -- file deduplication utility
            ====================================

DESCRIPTION AND USAGE

sgodup is a utility for detecting duplicate files. You supply two
directories: the base and one with possible duplicates. The utility
determines which files are duplicates and replaces them with links.
It aims for very high performance.

There are just a few arguments:

-basedir -- directory with files that are possible link targets
 -dupdir -- directory with possible duplicates, which are replaced
            with links to basedir's files
 -action -- * print: just print the duplicate file's path to stdout,
              with the relative path to basedir's corresponding file
            * symlink: create a symbolic link with the relative path to
              basedir's corresponding file
            * hardlink: create a hard link instead
  -chmod -- if specified, chmod files in basedir and dupdir during
            the scan phase. An octal representation is expected
  -fsync -- fsync directories where linking occurs

There are three stages:

* basedir directory scan: collect all *regular* file paths, sizes and
  inodes. If -chmod is specified, it is applied immediately. Empty
  files are ignored
* dupdir directory scan: same as above. If no basedir file has the
  same size, the dupdir file is skipped (it obviously cannot be a
  duplicate). If a basedir file has the same inode, the dupdir file is
  also skipped, because it is already hardlinked
* deduplication stage. For each dupdir file, find a basedir file with
  the same size and compare their contents to determine whether the
  dupdir one is a duplicate. Perform the specified action if so.
  There are two separate queues and processing cycles:

  * small files, up to 4 KiB (one disk sector): files are fully read and
    compared in memory
  * large files (everything else): read and compare the first 4 KiB of
    the files in memory. If they are not equal, then this is not a
    duplicate. Otherwise, fully read each file's contents sequentially
    in 128 KiB chunks and calculate its BLAKE2b-512 digest

Progress is shown at each stage: how many files are counted/processed,
the total size of the files, and how much space is deduplicated.

    2020/03/19 22:57:07 processing basedir...
    2020/03/19 22:57:07 464,329 / 0 (0%) files scanned
    2020/03/19 22:57:07 534 GiB / 0 B (0%)
    2020/03/19 22:57:12 processing dupdir...
    2020/03/19 22:57:12 362,245 / 0 (0%) files scanned
    2020/03/19 22:57:12 362 GiB / 0 B (0%)
    2020/03/19 22:57:17 deduplicating...
    2020/03/19 22:58:18 8,193 / 362,245 (2%) files processed
    2020/03/19 22:58:18 7.7 GiB / 362 GiB (2%) deduplicated
    [...]
    2020/03/20 11:17:20 321,123 files deduplicated

It is safe to specify the same directory as both basedir and dupdir.

SAFETY AND CONSISTENCY

POSIX has no ability to atomically replace a regular file with a
symbolic/hard link. So the file is removed first, then the link is
created. sgodup cautiously prevents interruption of those two calls by
signals (TERM, INT). But any other failure could break the program
after the file is removed and before the link is created, leading to
the loss of that file!

It is recommended to use filesystems with snapshot capability, to be
able to roll back and restore a removed file. Or you can run
"-action print" beforehand to collect the duplicates and use its output
as a log for possible recovery.

There are no warranties and no defined behaviour if the directories
(and the files within them) that the utility is working on are modified
while it runs.
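
The comparison rules of the deduplication stage and the remove-then-link
sequence described above can be illustrated with a short sketch. It is
written in Python for brevity and is not sgodup's own code: the function
names are hypothetical, comparing the two digests directly is an
assumption about how the BLAKE2b-512 values are used, and blocking
TERM/INT with pthread_sigmask (Unix only) is only one possible way to
keep signals from interrupting the two calls.

```python
import hashlib
import os
import signal

SECTOR = 4 * 1024    # prefix compared in memory, as stated in the README
CHUNK = 128 * 1024   # sequential read size, as stated in the README


def digest(path):
    """Return the BLAKE2b-512 digest of a file, read in 128 KiB chunks."""
    h = hashlib.blake2b()  # default digest size is 64 bytes (512 bits)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.digest()


def is_duplicate(base, dup):
    """Hypothetical helper: decide whether dup has the same contents as base."""
    size = os.path.getsize(base)
    # Empty files are ignored; files of different size cannot be duplicates.
    if size == 0 or size != os.path.getsize(dup):
        return False
    # Both small and large files start with an in-memory 4 KiB comparison.
    with open(base, "rb") as a, open(dup, "rb") as b:
        if a.read(SECTOR) != b.read(SECTOR):
            return False
    if size <= SECTOR:
        return True  # small file: it has already been compared in full
    # Large file: compare full-content digests (assumed use of the digest).
    return digest(base) == digest(dup)


def replace_with_hardlink(base, dup):
    """Hypothetical helper: remove dup, then hard-link it to base.

    TERM and INT are blocked around the two calls, so they are delivered
    only after the link exists. Any other failure between the calls can
    still lose the file, exactly as the README warns.
    """
    old_mask = signal.pthread_sigmask(
        signal.SIG_BLOCK, {signal.SIGTERM, signal.SIGINT}
    )
    try:
        os.remove(dup)
        os.link(base, dup)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

A caller would check is_duplicate(base, dup) first and call
replace_with_hardlink(base, dup) only when it returns True. Both paths
must be on the same filesystem for the hard link to succeed.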

LICENCE

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.