README

glocate -- ZFS-diff-friendly locate-like utility

This utility keeps a database of a filesystem hierarchy and quickly
displays any part of it, like ordinary *locate utilities. Unlike them,
it can consume zfs-diff's output and apply the changes to an existing
database.

Why did I write it? Indexing, even a plain "find /big", can take a
considerable amount of time, an hour or so, and costs many I/O
operations. But my home NAS sees relatively few changes each day. The
only quick way to determine exactly what was modified is to traverse
ZFS's Merkle trees and find the differences between snapshots.
Fortunately, the zfs-diff command does exactly that and provides fairly
machine-friendly output.

Why is this utility so complicated? Initially it kept the whole database
in memory, but that takes 2-3 GiB, which is a huge amount. Moreover, it
had to load the database fully to perform even a basic search. So the
current implementation uses temporary files and relies heavily on data
streaming.

Its storage format is simple: a Zstandard-compressed list of records,
each consisting of:

* 16-bit BE size of the following name
* the entity's (file, directory, symbolic link, etc.) name itself.
  A directory name has a trailing "/"
* single byte indicating the entity's depth in the hierarchy
* 64-bit BE mtime in seconds
* 64-bit BE size of the file, or for a directory the total size of all
  files and directories under it
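
The record layout above can be sketched in a few lines of Python. This
is only an illustration of the field order, not glocate's own code: the
function names are hypothetical, the signedness of mtime and size is a
guess, and the Zstandard layer is left out (the functions work on any
already-decompressed binary stream).

```python
import io
import struct

# Assumed layout, following the field list; signedness is a guess.
HEAD = struct.Struct(">H")    # 16-bit BE length of the name
TAIL = struct.Struct(">BqQ")  # depth byte, 64-bit BE mtime, 64-bit BE size


def write_record(out, name, depth, mtime, size):
    """Append one record to a binary stream."""
    raw = name.encode("utf-8", "surrogateescape")
    out.write(HEAD.pack(len(raw)) + raw + TAIL.pack(depth, mtime, size))


def read_records(inp):
    """Yield (name, depth, mtime, size) tuples until the stream ends."""
    while True:
        head = inp.read(HEAD.size)
        if not head:
            return
        (length,) = HEAD.unpack(head)
        name = inp.read(length).decode("utf-8", "surrogateescape")
        depth, mtime, size = TAIL.unpack(inp.read(TAIL.size))
        yield name, depth, mtime, size
```

Records have no index or terminator, so the only way to read them is
sequentially, which matches the streaming design described above.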

Its indexing algorithm is as follows:

* traverse the whole filesystem hierarchy in *sorted* order. All
  records are written to a temporary file without directory sizes,
  because those are not known in advance during the walk
* during the walk, remember each directory's total size in memory
* read all records back from that temporary file and write them to
  another one, replacing the directory sizes with the remembered ones
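
The two passes above can be sketched as follows. This is a simplified
model, not the real implementation: a Python list stands in for the
first temporary file, the names "ancestors" and "index" are
hypothetical, and it assumes that a directory's size is the sum of the
sizes of all files below it.

```python
def ancestors(name):
    """Yield every parent directory of a path: 'a/b/c' -> 'a/', 'a/b/'."""
    parts = name.rstrip("/").split("/")[:-1]
    for i in range(1, len(parts) + 1):
        yield "/".join(parts[:i]) + "/"


def index(entries):
    """entries: sorted iterable of (name, size); directories end with '/'."""
    first_pass = []  # stands in for the first temporary file
    totals = {}      # directory name -> accumulated size, kept in memory
    for name, size in entries:
        if name.endswith("/"):
            totals.setdefault(name, 0)
            first_pass.append((name, None))  # size is not known yet
        else:
            first_pass.append((name, size))
            for parent in ancestors(name):
                totals[parent] = totals.get(parent, 0) + size
    # Second pass: copy the records, filling in the remembered sizes.
    return [(n, totals[n] if s is None else s) for n, s in first_pass]
```

Only the per-directory totals live in memory. The records themselves
pass through once on the way out and once on the way back.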

Searching is trivial:

* there is no actual searching, just sequential streaming through the
  whole database file
* if some root is specified, then the program outputs only the
  hierarchy under that path and exits as soon as it is finished
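
A minimal sketch of that streaming filter, assuming the records arrive
in sorted order so that everything under one root is contiguous. The
function name "search" is hypothetical, and only names are streamed
here to keep the example short.

```python
def search(names, root=None):
    """Yield names from a sorted stream, optionally limited to one root."""
    if root is None:
        yield from names
        return
    prefix = root.rstrip("/") + "/"
    found = False
    for name in names:
        if name.startswith(prefix):
            found = True
            yield name
        elif found:
            return  # the subtree has ended, no need to read any further
```

The early return is what makes a search for a small subtree cheaper
than a full listing, even though both start from the beginning of the
file.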

The updating algorithm is as follows:

* read all [-+MR] actions from zfs-diff, validating the whole format
* each file's "R" becomes a "-" and a "+" action
* if there are directory "R"s, then collect them and stream through the
  current database to determine every path entity that has to be "-"
  and "+"
* each "+" also adds an entry to the list of "M"s
* sort all "-", "+" and "M" filenames in ascending order
* get the entity's information for each "M" (remembering its size and
  mtime)
* stream the current database records, writing them to a temporary file
* if a record exists in the "-"-list, then skip it
* if any "+" in the *sorted* list sorts before the record read from the
  database, then insert it into the stream, taking its size and mtime
  from the "M"-list
* if any "M" exists for the record read, then use it to alter the record
* all that time the directory size calculating algorithm also runs, the
  same one used during indexing
* create another temporary file and copy the records into it with
  updated directory sizes
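
The core of the update, parsing the actions and merging them into the
sorted record stream, could look roughly like the sketch below. It
makes several assumptions: the input is taken to be tab-separated
"action, file type, path[, new path]" lines with "/" as the directory
type, the names "read_actions" and "apply_update" are hypothetical, and
the caller supplies the size and mtime lookups. Directory renames are
only collected, directory names do not get their trailing "/" added,
and the directory size recalculation is left out.

```python
def read_actions(lines, strip=""):
    """Split diff lines into sorted "-", "+" and "M" name lists."""
    removed, added, modified, dir_renames = set(), set(), set(), []
    for line in lines:
        action, ftype, *paths = line.rstrip("\n").split("\t")
        paths = [p[len(strip):] if p.startswith(strip) else p for p in paths]
        if action == "-":
            removed.add(paths[0])
        elif action == "+":
            added.add(paths[0])
        elif action == "M":
            modified.add(paths[0])
        elif action == "R" and ftype == "/":
            dir_renames.append((paths[0], paths[1]))  # needs a database pass
        elif action == "R":
            removed.add(paths[0])  # a file rename is a "-" plus a "+"
            added.add(paths[1])
        else:
            raise ValueError("unexpected action: " + line)
    modified |= added  # every "+" also needs its size and mtime looked up
    return sorted(removed), sorted(added), sorted(modified), dir_renames


def apply_update(db, removed, added, info):
    """db: sorted (name, mtime, size) records; info: name -> (mtime, size)."""
    pending = iter(added)
    nxt = next(pending, None)
    for name, mtime, size in db:
        # Emit every "+" entry that sorts before the current record.
        while nxt is not None and nxt < name:
            yield (nxt, *info[nxt])
            nxt = next(pending, None)
        if name in removed:
            continue
        if name in info:
            mtime, size = info[name]
        yield (name, mtime, size)
    while nxt is not None:  # "+" entries that sort after the last record
        yield (nxt, *info[nxt])
        nxt = next(pending, None)
```

Because both the database and the action lists are sorted, the merge
needs a single sequential pass and never holds more than one database
record in memory.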

How to use it?

    $ zfs snap big@snap1
    $ cd /big ; glocate -db /tmp/glocate.db -index

    $ glocate -db /tmp/glocate.db
    [list of all files]

    $ glocate -db /tmp/glocate.db -machine
    [machine-parseable list of files with sizes and mtimes]

    $ glocate -db /tmp/glocate.db -tree
    [pretty tree-like list of files with sizes and mtimes]

    $ glocate -db /tmp/glocate.db some/sub/path
    [just a part of the whole hierarchy]

and update it carefully:

    $ zfs snap big@snap2
    $ zfs diff -FH big@snap1 big@snap2 | glocate -db /tmp/glocate.db -strip /big/ -update

glocate is copylefted free software: see the file COPYING for copying
conditions.