Improved version

diff --git a/README b/README
index 5e946c27257c1a82a915d0bfd305c5109710624b0094de0099c3d34205dd7fd3..358d5302a98ae9b877e6acf1c950d413e096a065959e58f8d3a631cabf2c1a37 100644
--- a/README
+++ b/README
@@ -1,4 +1,89 @@
  glocate -- ZFS-diff-friendly locate-like utility
  
+This utility keeps a database of a filesystem hierarchy and quickly
+displays any part of it, like ordinary *locate utilities. Unlike them,
+it can also consume zfs-diff's output and apply the changes to an
+existing database.
+
+Why did I write it? Indexing with a plain "find /big" can take a
+considerable amount of time, an hour or so, and many I/O operations.
+Yet my home NAS sees relatively few changes per day. The only quick
+way to determine what exactly was modified is to traverse ZFS's Merkle
+trees and find the differences between snapshots. Fortunately, the
+zfs-diff command does exactly that and provides fairly
+machine-friendly output.
+
+Why is this utility so complicated? Initially it kept the whole
+database in memory, but that takes 2-3 GiB, which is a huge amount, and
+the database had to be loaded in full for even a basic search. So the
+current implementation uses temporary files and streams data heavily.
+
+Its storage format is trivial:
+
+* 16-bit BE size of the following name
+* the name of the entity (file, directory, symbolic link, etc.) itself.
+  A directory name has a trailing "/"
+* single byte indicating the entity's depth
+* 64-bit BE mtime in seconds
+* 64-bit BE size of the file, or total size of a directory's contents
+
+Its indexing algorithm is as follows:
+
+* traverse the whole filesystem hierarchy in *sorted* order. All
+  records are written to a temporary file without directory sizes,
+  because those are not yet known during the walk
+* during the walk, remember each directory's total size in memory
+* read all records back from that temporary file and write them to
+  another one, replacing directory sizes with the remembered ones
+
+Searching is trivial:
+
+* there is no actual searching, just sequential streaming through the
+  whole database file
+* if some root is specified, then the program outputs only the
+  hierarchy under that path and exits as soon as it has been printed
+
+The updating algorithm is as follows:
+
+* read all [-+MR] actions from zfs-diff, validating the whole format
+* each file's "R" becomes a pair of "-" and "+" actions
+* if there are directory "R"s, then collect them and stream through the
+  current database to determine each path entity that needs "-" and "+"
+* each "+" adds an entry to the list of "M"s
+* sort all "-", "+" and "M" filenames in ascending order
+* get each "M" entity's information (remembering its size and mtime)
+* stream the current database records, writing them to a temporary file
+* if a record exists in the "-" list, then skip it
+* if any "+" in the *sorted* list sorts before the record read from the
+  database, then insert it into the stream, taking its size and
+  mtime information from the "M" list
+* if any "M" exists for the record read, use it to alter that record
+* all that time the directory size calculation also runs, the same one
+  used during indexing
+* create another temporary file to copy the records with updated
+  directory sizes
+
+How to use it?
+
+    $ zfs snap big@snap1
+    $ cd /big ; glocate -db /tmp/glocate.db -index
+
+    $ glocate -db /tmp/glocate.db
+    [list of all files]
+
+    $ glocate -db /tmp/glocate.db -machine
+    [machine-parseable list of files with sizes and mtimes]
+
+    $ glocate -db /tmp/glocate.db -tree
+    [pretty tree-like list of files with sizes and mtimes]
+
+    $ glocate -db /tmp/glocate.db some/sub/path
+    [just a part of the whole hierarchy]
+
+and update it carefully:
+
+    $ zfs snap big@snap2
+    $ zfs diff -FH big@snap1 big@snap2 | glocate -db /tmp/glocate.db -strip /big/ -update
+
  glocate is copylefted free software: see the file COPYING for copying
  conditions.
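
The record layout listed under "Its storage format is trivial" in the
README text above can be illustrated with a short Python sketch. It is
not glocate's own code. It assumes that the fields are stored in the
listed order, that both 64-bit values are unsigned, and that records
follow each other with no separator; the names encode_record and
read_records are hypothetical.

```python
import io
import struct

# Assumed field order, all big-endian: 16-bit name length, name bytes,
# 1-byte depth, 64-bit mtime seconds, 64-bit size.
_HEAD = struct.Struct(">H")
_TAIL = struct.Struct(">BQQ")  # depth, mtime, size (17 bytes, no padding)


def encode_record(name: bytes, depth: int, mtime: int, size: int) -> bytes:
    """Serialize one entry; a directory name is expected to end with b"/"."""
    return _HEAD.pack(len(name)) + name + _TAIL.pack(depth, mtime, size)


def read_records(stream):
    """Yield (name, depth, mtime, size) tuples by streaming sequentially."""
    while True:
        head = stream.read(_HEAD.size)
        if not head:
            return  # clean end of the database
        if len(head) != _HEAD.size:
            raise ValueError("truncated record header")
        (name_len,) = _HEAD.unpack(head)
        name = stream.read(name_len)
        tail = stream.read(_TAIL.size)
        if len(name) != name_len or len(tail) != _TAIL.size:
            raise ValueError("truncated record")
        depth, mtime, size = _TAIL.unpack(tail)
        yield name, depth, mtime, size
```

Because every record carries its own length prefix, a reader never needs
the whole database in memory: it can walk the file one record at a time,
which matches the streaming approach the README describes.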