From: Sergey Matveev
Date: Sat, 13 Aug 2022 17:01:30 +0000 (+0300)
Subject: Slightly refactored documentation
X-Git-Tag: v0.1.0~5
X-Git-Url: http://www.git.stargrave.org/?p=glocate.git;a=commitdiff_plain;h=ecb9a41c73d8d7d8c75215c57d5d91f6c476098d756489036bedb8f136e8fb9e

Slightly refactored documentation
---
diff --git a/FORMAT b/FORMAT
new file mode 100644
index 0000000..6becd33
--- /dev/null
+++ b/FORMAT
@@ -0,0 +1,49 @@
+The storage format is simple: a Zstandard-compressed list of records:
+
+* 16-bit BE size of the following name
+* entity (file, directory, symbolic link, etc.) name itself.
+  A directory has a trailing "/"
+* single byte indicating the current file's depth
+* 64-bit BE mtime seconds
+* 64-bit BE file or directory (sum of all files and directories) size
+
+Index algorithm:
+
+* traverse the whole filesystem hierarchy in *sorted* order. All
+  records are written to a temporary file, without directory sizes,
+  because they are not known in advance during the walk
+* during the walk, remember in memory each directory's total size
+* read all records from that temporary file, writing them to another
+  one, replacing directory sizes with the remembered ones
+
+Search is trivial:
+
+* searching is performed on each record streamed from the database
+* if -root is specified, then the search will stop after that
+  hierarchy part is over
+* by default all elements are printed, unless you provide a single
+  argument that becomes a "*X*" pattern matched against case-lowered
+  path elements
+
+Update algorithm:
+
+* read all [-+MR] actions from "zfs diff -FH", validating the whole
+  format
+* each "R" for a file becomes "-" and "+" actions
+* if there are "R"s for directories, then stream the current database
+  and get each file entity under those directories, making "-" and "+"
+  actions correspondingly
+* each "+" also adds an entry to the list of "M"s
+* sort all "-", "+" and "M" filenames in ascending order
+* get entity's information for each "M" (remembering its size and mtime)
+* stream the current database records, writing them to a temporary
+  file, taking into account that:
+  * if a record exists in the "-"-list, then skip it
+  * if any "+" in the *sorted* list has precedence over the record
+    from the database, then insert it into the stream, taking size
+    and mtime information from the "M"-list
+  * if any "M" exists for the read record, then use it to alter it
+* all that time the directory size calculating algorithm, the same one
+  used during the index procedure, also works in parallel
+* create another temporary file to copy the records with up-to-date
+  directory sizes
diff --git a/INSTALL b/INSTALL
new file mode 100644
index 0000000..c2a9c3e
--- /dev/null
+++ b/INSTALL
@@ -0,0 +1,15 @@
+The utility is written in Go, so basically it can be installed with:
+
+    $ go install go.stargrave.org/glocate
+
+However you may have some issues with the authenticity of the
+go.stargrave.org HTTPS server, which uses the ca.cypherpunks.ru CA. Look
+at the SSL_CERT_FILE, GIT_SSL_CAINFO and GOPRIVATE environment variables.
+
+Or you can manually clone its source code and build it yourself:
+
+    $ git clone git://git.stargrave.org/glocate.git
+    $ cd glocate
+    $ go build
+
+glocate has two dependencies that will be fetched by Go automatically.
diff --git a/README b/README
index e5d6fd2..150009a 100644
--- a/README
+++ b/README
@@ -5,95 +5,23 @@
 and quickly display some part of it. Like ordinary *locate utilities.
 But unlike others, it is able to eat zfs-diff's output and apply the
 changes to existing database.
 
-Why I wrote it? Indexing, just "find /big" can take a considerable
-amount of time, like an hour or so, with many I/O operations spent. But
-my home NAS has relatively few number of changes made every day. The
-only possible quick way to determine what exactly was modified is to
-traverse over ZFS'es Merkle trees to find a difference between
-snapshots. Fortunately zfs-diff command does exactly that, providing
-pretty machine-friendly output.
-
-Why this utility is so complicated? Initially it kept all database in
-memory, but that takes 2-3 GiBs of memory, that is huge amount. Moreover
-it fully loads it to perform any basic searches. So current
-implementation uses temporary files and heavy use of data streaming.
-
-Its storage format is simple: Zstandard-compressed list of records:
-
-* 16-bit BE size of the following name
-* entity (file, directory, symbolic link, etc) name itself.
-  Directory has trailing "/"
-* single byte indicating current file's depth
-* 64-bit BE mtime seconds
-* 64-bit BE file or directory (sum of all files and directories) size
-
-Its indexing algorithm is following:
-
-* traverse over all filesystem hierarchy in a *sorted* order. All
-  records are written to temporary file, without directory sizes,
-  because they are not known in advance during the walking
-* during the walk, remember in memory each directory's total size
-* read all records from that temporary file, writing to another one, but
-  replacing directory sizes with ones remembered
-
-Searching is trivial:
-
-* searching is performed on each record streamed from the database
-* if -root is specified, then search will stop after that hierarchy part
-  is over
-* by default all elements are printed, unless you provide a single
-  argument that becomes "*X*" pattern matched on case-lowered path
-  elements
-
-Updating algorithm is following:
-
-* read all [-+MR] actions from zfs-diff, validating the whole format
-* each file's "R" becomes "-" and "+" actions
-* if there are directory "R", then collect them and stream from current
-  database to determine each path entity you have to "-" and "+"
-* each "+" adds an entry to the list of "M"s
-* sort all "-", "+" and "M" filenames in ascending order
-* get entity's information for each "M" (remembering its size and mtime)
-* stream current database records, writing them to temporary file
-* if record exists in "-"-list, then skip it
-* if any "+" exists in the *sorted* list, that has precedence over the
-  record from database, then insert it into the stream, taking size and
-  mtime information from "M"-list
-* if any "M" exists for the read record, then use it to alter it
-* all that time, directory size calculating algorithm also works, the
-  same one used during indexing
-* create another temporary file to copy the records with actualized
-  directory sizes
-
-How to use it?
-
-    $ zfs snap big@snap1
-    $ cd /big ; glocate -db /tmp/glocate.db -index
-
-    $ glocate -db /tmp/glocate.db
-    [list of all files]
-
-    $ glocate -db /tmp/glocate.db -machine
-    [machine parseable list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db -tree
-    [beauty tree-like list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db -root music
-    [just a music hierarchy path]
-
-    $ glocate -db /tmp/glocate.db -root music blasphemy | grep "/$"
-    music/Blasphemy-2001-Gods_Of_War_+_Blood_Upon_The_Altar/
-    music/Cryptopsy-1994-Blasphemy_Made_Flesh/
-    music/Infernal_Blasphemy-2005-Unleashed/
-    music/Ravenous-Assembled_In_Blasphemy/
-    music/Sect_Of_Execration-2002-Baptized_Through_Blasphemy/
-    music/Spectral_Blasphemy-2012-Blasphmemial_Catastrophic/
-
-and update it carefully, providing the strip prefix to -update:
-
-    $ zfs snap big@snap2
-    $ zfs diff -FH big@snap2 | glocate -db /tmp/glocate.db -update /big/
-
-glocate is copylefted free software: see the file COPYING for copying
-conditions.
+Why I wrote it? I have got a ZFS data storage with ~18M files, where
+even "find /storage" takes a considerable amount of time, up to an
+hour. So I have to use a separate indexed database and search against
+it. The locate family of utilities does exactly that. But none of
+them can pick up the few changes occasionally made to the dataset
+without traversing the whole dataset anyway, taking much I/O.
+
+Fortunately the ZFS design with Merkle trees is able to show us the
+difference quickly and without notable I/O. The "zfs diff" command's
+output is very machine-friendly. So a locate-like utility just has to
+be able to update its database from zfs-diff's output.
+
+Why is this utility relatively complicated? Initially it kept the
+whole database in memory, but that took 2-3 GiB of memory, which is
+a huge amount. Moreover it had to fully load it to perform even basic
+searches. So the current implementation uses temporary files and heavy
+data streaming. The database in my case takes less than 128 MiB and
+searching takes only several seconds on my machine.
+
+It is free software: see the file COPYING for copying conditions.
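
As an illustration of the record layout that the new FORMAT file above
describes, here is a minimal Go sketch. It is not part of glocate: the
Record type and field names are made up for the example, only the byte
layout follows the description, and the real database additionally
passes the whole record stream through a Zstandard compressor.

    // Illustrative sketch of one record in the layout described in FORMAT:
    // 16-bit BE name length, the name itself (directories keep a trailing
    // "/"), a single depth byte, 64-bit BE mtime seconds, 64-bit BE size.
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    type Record struct {
        Name  string // directories carry a trailing "/"
        Depth uint8
        Mtime int64  // seconds
        Size  uint64 // for directories: sum of everything beneath
    }

    func writeRecord(buf *bytes.Buffer, r Record) {
        binary.Write(buf, binary.BigEndian, uint16(len(r.Name)))
        buf.WriteString(r.Name)
        buf.WriteByte(r.Depth)
        binary.Write(buf, binary.BigEndian, r.Mtime)
        binary.Write(buf, binary.BigEndian, r.Size)
    }

    func main() {
        var buf bytes.Buffer
        writeRecord(&buf, Record{Name: "music/", Depth: 1, Mtime: 1660000000, Size: 123456})
        fmt.Printf("% x\n", buf.Bytes())
    }

Because every record is length-prefixed, the database can be read
strictly sequentially, which is what the streaming search and update
described above rely on.
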
diff --git a/USAGE b/USAGE
new file mode 100644
index 0000000..fb9e732
--- /dev/null
+++ b/USAGE
@@ -0,0 +1,47 @@
+It is advisable to create a ZFS snapshot first, to be sure that there
+is some checkpoint state:
+
+    # zfs snap storage@snap1
+    $ cd /storage/.zfs/snapshot/snap1
+
+Run the index procedure:
+
+    $ glocate -db /tmp/db -index
+
+After that, you can print all filenames:
+
+    $ glocate -db /tmp/db
+
+List them with sizes and mtimes in machine-parseable format:
+
+    $ glocate -db /tmp/db -machine
+
+Or in human-friendly tree-like format:
+
+    $ glocate -db /tmp/db -tree
+
+You can limit the hierarchy with the -root option:
+
+    $ glocate -db /tmp/db -root music
+    [just the music hierarchy]
+
+And you can specify a glob pattern for a case-insensitive match against
+each path element; it is automatically wrapped with "*":
+
+    $ glocate -db /tmp/db -root music blasphemy | grep "/$"
+    music/Blasphemy-2001-Gods_Of_War_+_Blood_Upon_The_Altar/
+    music/Cryptopsy-1994-Blasphemy_Made_Flesh/
+    music/Infernal_Blasphemy-2005-Unleashed/
+    music/Ravenous-Assembled_In_Blasphemy/
+    music/Sect_Of_Execration-2002-Baptized_Through_Blasphemy/
+    music/Spectral_Blasphemy-2012-Blasphmemial_Catastrophic/
+
+If you changed your dataset(s) somehow, then you should create a new
+snapshot and feed its diff to the command:
+
+    # zfs snap storage@snap2
+    # zfs diff -FH storage@snap2 |
+        glocate -db /tmp/db -update /storage/
+
+The argument to -update is the prefix stripped from each filename in
+the diff's output.
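
The update procedure above consumes "zfs diff -FH" output. A minimal
Go sketch of how the [-+MR] actions could be collected from it follows.
This is not glocate's actual parser: the tab-separated change-type,
file-type, path(s) line layout is an assumption to be checked against
zfs(8), and the directory-rename expansion that glocate performs by
streaming its database is left out.

    // Illustrative sketch: collect "-", "+" and "M" path lists from
    // zfs-diff lines read on stdin, as the update algorithm expects.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "sort"
        "strings"
    )

    func main() {
        var minus, plus, modified []string
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            fields := strings.Split(scanner.Text(), "\t")
            if len(fields) < 3 {
                continue // not a zfs-diff line we understand
            }
            path := fields[2]
            switch fields[0] {
            case "-":
                minus = append(minus, path)
            case "+":
                plus = append(plus, path)
                modified = append(modified, path) // each "+" also becomes an "M"
            case "M":
                modified = append(modified, path)
            case "R": // a file rename is a removal plus a creation
                if len(fields) >= 4 {
                    minus = append(minus, path)
                    plus = append(plus, fields[3])
                    modified = append(modified, fields[3])
                }
            }
        }
        // the update algorithm wants all three lists sorted in ascending order
        sort.Strings(minus)
        sort.Strings(plus)
        sort.Strings(modified)
        fmt.Println("-:", minus)
        fmt.Println("+:", plus)
        fmt.Println("M:", modified)
    }

A hypothetical invocation, mirroring the USAGE example above:

    # zfs diff -FH storage@snap2 | go run parse.go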