X-Git-Url: http://www.git.stargrave.org/?p=glocate.git;a=blobdiff_plain;f=README;fp=README;h=358d5302a98ae9b877e6acf1c950d413e096a065959e58f8d3a631cabf2c1a37;hp=5e946c27257c1a82a915d0bfd305c5109710624b0094de0099c3d34205dd7fd3;hb=d5b8c235a1f3088c6c1e7261df3d1b565d042db2ba2ad1bbd1018782b9178e1f;hpb=411a031ec7cc707b8269acc3dfe28bc8db1bab5a9a91781c26809ae9853c6f6a

diff --git a/README b/README
index 5e946c2..358d530 100644
--- a/README
+++ b/README
@@ -1,4 +1,89 @@
 glocate -- ZFS-diff-friendly locate-like utility
 
+This utility keeps a database of the filesystem hierarchy and quickly
+displays any part of it, like ordinary *locate utilities. But unlike
+the others, it is able to eat zfs-diff's output and apply the changes
+to an existing database.
+
+Why did I write it? Indexing, even a plain "find /big", can take a
+considerable amount of time, an hour or so, and costs many I/O
+operations. But my home NAS sees relatively few changes every day. The
+only quick way to determine what exactly was modified is to traverse
+ZFS's Merkle trees and find the difference between snapshots.
+Fortunately, the zfs-diff command does exactly that, providing pretty
+machine-friendly output.
+
+Why is this utility so complicated? Initially it kept the whole database
+in memory, but that takes 2-3 GiB, which is a huge amount. Moreover, it
+had to load all of it to perform even a basic search. So the current
+implementation uses temporary files and relies heavily on data streaming.
+
+Its storage format is trivial:
+
+* 16-bit BE size of the following name
+* entity (file, directory, symbolic link, etc) name itself.
+  A directory has a trailing "/"
+* single byte indicating the entity's depth in the hierarchy
+* 64-bit BE mtime seconds
+* 64-bit BE file or directory (sum of all its files and directories) size
+
+Its indexing algorithm is the following:
+
+* traverse the whole filesystem hierarchy in *sorted* order. All
+  records are written to a temporary file, without directory sizes,
+  because they are not known in advance during the walk
+* during the walk, remember each directory's total size in memory
+* read all records from that temporary file, writing them to another
+  one, but replacing directory sizes with the remembered ones
+
+Searching is trivial:
+
+* there is no actual searching, just sequential streaming through the
+  whole database file
+* if some root is specified, then the program outputs only that root's
+  hierarchy, exiting as soon as it is finished
+
+The updating algorithm is the following:
+
+* read all [-+MR] actions from zfs-diff, validating the whole format
+* each file's "R" becomes a pair of "-" and "+" actions
+* if there are directory "R"s, then collect them and stream the current
+  database to find every path entity that needs its own "-" and "+"
+* each "+" adds an entry to the list of "M"s
+* sort all "-", "+" and "M" filenames in ascending order
+* get the entity's information for each "M" (remembering its size and mtime)
+* stream the current database records, writing them to a temporary file
+* if a record exists in the "-"-list, then skip it
+* if any "+" in the *sorted* list sorts before the record read from the
+  database, then insert it into the stream, taking its size and mtime
+  information from the "M"-list
+* if an "M" exists for the record read, then use it to alter that record
+* all that time, the directory size calculating algorithm is also at
+  work, the same one used during indexing
+* create another temporary file and copy the records into it with
+  updated directory sizes
+
+How to use it?
+
+    $ zfs snap big@snap1
+    $ cd /big ; glocate -db /tmp/glocate.db -index
+
+    $ glocate -db /tmp/glocate.db
+    [list of all files]
+
+    $ glocate -db /tmp/glocate.db -machine
+    [machine parseable list of files with sizes and mtimes]
+
+    $ glocate -db /tmp/glocate.db -tree
+    [pretty tree-like list of files with sizes and mtimes]
+
+    $ glocate -db /tmp/glocate.db some/sub/path
+    [just a part of the whole hierarchy]
+
+and update it carefully:
+
+    $ zfs snap big@snap2
+    $ zfs diff -FH big@snap1 big@snap2 | glocate -db /tmp/glocate.db -strip /big/ -update
+
 glocate is copylefted free software: see the file COPYING for
 copying conditions.
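
Note on the storage format listed in the README text of this patch: the record layout can be illustrated with a minimal Python sketch that writes and reads such records. It is not taken from glocate itself. The names encode_record and read_records are hypothetical, and unsigned integer fields and UTF-8 encoded names are assumptions, since the README gives only the field order and widths.

```python
import io
import struct

# Fixed-size tail that follows the name: depth (1 byte), mtime seconds
# (8 bytes), size (8 bytes), all big-endian. Treating the fields as
# unsigned is an assumption; the README states only their widths.
_TAIL = struct.Struct(">BQQ")


def encode_record(name: str, depth: int, mtime: int, size: int) -> bytes:
    """Serialize one record: 16-bit BE name length, name, depth, mtime, size."""
    raw = name.encode("utf-8")  # assumption: names are stored as UTF-8 bytes
    return struct.pack(">H", len(raw)) + raw + _TAIL.pack(depth, mtime, size)


def read_records(stream):
    """Yield (name, depth, mtime, size) tuples by reading the stream sequentially."""
    while True:
        head = stream.read(2)
        if not head:
            return  # clean end of the database
        if len(head) != 2:
            raise ValueError("truncated name length")
        (name_len,) = struct.unpack(">H", head)
        raw = stream.read(name_len)
        tail = stream.read(_TAIL.size)
        if len(raw) != name_len or len(tail) != _TAIL.size:
            raise ValueError("truncated record")
        depth, mtime, size = _TAIL.unpack(tail)
        yield raw.decode("utf-8"), depth, mtime, size


if __name__ == "__main__":
    db = io.BytesIO(
        encode_record("docs/", 0, 1700000000, 4096)
        + encode_record("a.txt", 1, 1700000001, 12)
    )
    for record in read_records(db):
        print(record)
```

Because every record carries its own length prefix, a reader can walk the file sequentially without loading it whole, which fits the streaming approach the README describes.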