X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=README;h=f8b2b8a7b0eafc769c8e661a228126600c99e42f;hb=81b2fc3aa1f04bd483711149e6ad51a312ce3f2c;hp=b34dcd95fdd5581f84ac08ecb114aa71a9a0e8db;hpb=4cb0b1b52ba01d3b200b56c349f4177f5ec355d3;p=sgodup.git

diff --git a/README b/README
index b34dcd9..f8b2b8a 100644
--- a/README
+++ b/README
@@ -1,47 +1,53 @@
 sgodup -- file deduplication utility
 ====================================
-DESCRIPTION AND USAGE
+sgodup is a utility for file deduplication. You supply two directories:
+the base and one with possible duplicates; the utility determines which
+files are duplicates and replaces them with links. It is aimed at very
+high performance.
-sgodup is utility for duplicate files detection. You supply two
-directories: the base and one with possible duplicates, utility
-determines duplicate files and replaces them with the links. It
-is aimed to have very high performance.
+SINGLE PASS MODE
+================
-There are just few arguments:
+$ sgodup -basedir DIR -dupdir DIR -action ACTION \
+    [-minsize NNN] [-chmod NNN] [-fsync]
--basedir -- directory with files that are possible link targets
- -dupdir -- directory with possible duplicates, which are replaced
-    with the links to basedir's files
- -action -- * print: just print to stdout duplicate file path with
-      relative path to basedir's corresponding file
-    * symlink: create symbolic link with relative path to
-      basedir's corresponding file
-    * hardlink: create hard link instead
- -chmod -- if specified, then chmod files in basedir and dupdir
-    during scan phase. Octal representation is expected
- -fsync -- fsync directories where linking occurs
+basedir is a directory with "original" files that are possible link
+targets. dupdir is a directory with possible duplicates, which are to be
+replaced with links to basedir's files. It is safe to specify the same
+directory as both basedir and dupdir.
-There are three stages:
+There are three stages this command performs:
 * basedir directory scan: collect all *regular* file paths, sizes and
-  inodes. If -chmod is specified, then apply it at once. Empty files are
-  ignored
+  inodes. If -chmod is specified, then apply it to them. Files smaller
+  than -minsize (1 byte by default) are not considered for duplicate
+  comparison
 * dupdir directory scan: same as above. If there is no basedir's file
-  with the same size, then skip dupdir's file (obviously it can not be
+  with the same size, then skip dupdir's one (obviously it cannot be a
   duplicate). Check that no basedir's files have the same inode, skip
-  dupdir's file otherwise, because it is already hardlinked
-* deduplication stage. For each dupdir file, find basedir file with the
-  same size and compare their contents, to determine if dupdir's one is
-  the duplicate. Perform specified action if so. There are two separate
-  queues and processing cycles:
-
-  * small files, up to 4 KiB (one disk sector): files are fully read and
-    compared in memory
-  * large files (everything else): read and compare first 4 KiB of files
-    in memory. If they are not equal, then this is not a duplicate.
-    Fully read each file's contents sequentially with 128 KiB chunks and
-    calculate BLAKE2b-512 digest otherwise
+  dupdir's file otherwise (it is already hardlinked)
+* deduplication stage. For each dupdir file, find a basedir file with
+  the same size and compare their contents, to determine if dupdir's
+  one is the duplicate. Perform the specified action if so.
+  Comparing is done in the
+  following way:
+  * read the first 4 KiB (one disk sector) of each file
+  * if those sectors differ, then the files are not duplicates
+  * otherwise, read each file's contents sequentially in 128 KiB chunks
+    and compare their BLAKE2b-512 digests

+The action can be one of the following:
+
+* print: print to stdout the duplicate file path with the corresponding
+  relative path to basedir's file
+* symlink: create a symbolic link with a relative path to the
+  corresponding basedir's file
+* hardlink: create a hard link instead
+* ns: write to stdout a series of netstring-encoded pairs of a duplicate
+  file path and its corresponding basedir's one. It is used in two-pass
+  mode. Hint: the output is highly compressible
+
+If -fsync is specified, then fsync the directories where linking occurs.

 Progress is shown at each stage: how many files are counted/processed,
 total size of the files, how much space is deduplicated.
@@ -58,9 +64,27 @@ total size of the files, how much space is deduplicated.
 [...]
 2020/03/20 11:17:20 321,123 files deduplicated
-It is safe to specify same directory as a basedir and dupdir.
+TWO PASS MODE
+=============
+
+$ sgodup -basedir DIR -dupdir DIR -action ns [-minsize NNN] [-chmod NNN] > state
+$ sgodup -action ACTION [-fsync] -ns state
+
+If you are dealing with a huge amount of small files, then simultaneous
+reading (duplicate detection) and writing (duplicate linking) on the
+same disk can dramatically decrease performance. It is advisable to
+split the whole process into two stages: read-only duplicate detection,
+then write-only duplicate linking.
+
+Start sgodup with "-action ns" and redirect its stdout to a temporary
+state file that stores information about the detected duplicates. Then
+start it again with the "-ns state" option to relink the files.

 SAFETY AND CONSISTENCY
+======================
+
+It has not been tested on 32-bit platforms and probably won't work on
+them correctly.

 POSIX has no ability to atomically replace a regular file
 with a symbolic/hard link.
 So the file is removed first, then the link is created. sgodup
@@ -77,6 +101,7 @@ There are no warranties and any defined behaviour if directories (and
 files within) the utility is working with are modified.

 LICENCE
+=======

 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
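The comparison procedure from SINGLE PASS MODE (compare the first 4 KiB
sector, then fall back to full-content BLAKE2b-512 digests computed over
128 KiB chunks) can be sketched roughly as follows. This is an
illustrative Python sketch, not sgodup's actual code; the function names
are made up:

```python
import hashlib

SECTOR = 4 * 1024    # first disk sector, compared directly
CHUNK = 128 * 1024   # sequential read size for full hashing

def blake2b_digest(path):
    # Stream the whole file in 128 KiB chunks into BLAKE2b-512.
    h = hashlib.blake2b(digest_size=64)  # 64 bytes == 512 bits
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            h.update(chunk)
    return h.digest()

def is_duplicate(base_path, dup_path):
    # Cheap rejection: if the first sectors differ, the files differ
    # and nothing else has to be read.
    with open(base_path, "rb") as fb, open(dup_path, "rb") as fd:
        if fb.read(SECTOR) != fd.read(SECTOR):
            return False
    # First sectors match: compare digests of the full contents.
    return blake2b_digest(base_path) == blake2b_digest(dup_path)
```

Files are only compared when their sizes already match, so a size check
before is_duplicate() is assumed to have happened during the scan
stages.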
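The -action ns output is a stream of netstrings: "<length>:<data>," with
the length in decimal ASCII. The README does not spell out the record
layout beyond "pairs of duplicate file path and its corresponding
basedir's one", so the sketch below only illustrates the netstring
format itself, with hypothetical paths:

```python
def ns_encode(data: bytes) -> bytes:
    # A netstring is "<length>:<data>," with a decimal ASCII length.
    return str(len(data)).encode() + b":" + data + b","

def ns_decode(buf: bytes):
    # Yield each netstring payload from a concatenated buffer.
    i = 0
    while i < len(buf):
        colon = buf.index(b":", i)
        length = int(buf[i:colon])
        start = colon + 1
        payload = buf[start:start + length]
        if buf[start + length:start + length + 1] != b",":
            raise ValueError("malformed netstring")
        yield payload
        i = start + length + 1
```

Because lengths and separators dominate only a little and paths share
long common prefixes, such a stream compresses well, matching the
"highly compressible" hint above.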
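The remove-then-link step from SAFETY AND CONSISTENCY can be illustrated
like this. A Python sketch, not the utility's real code; the relink
helper and its arguments are invented for illustration, and the window
between remove and link is exactly the non-atomicity that section warns
about:

```python
import os

def relink(dup_path, base_path, kind="symlink", do_fsync=False):
    # POSIX cannot atomically replace a regular file with a link,
    # so the duplicate is removed first and the link created after;
    # a crash between the two steps loses the duplicate's path.
    os.remove(dup_path)
    if kind == "symlink":
        # Symbolic links point at a path relative to the duplicate's
        # directory, mirroring the "-action symlink" description.
        target = os.path.relpath(base_path, os.path.dirname(dup_path))
        os.symlink(target, dup_path)
    else:
        os.link(base_path, dup_path)
    if do_fsync:
        # "-fsync": flush the directory so the new entry is durable.
        fd = os.open(os.path.dirname(dup_path), os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```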