From: Sergey Matveev
Date: Sat, 13 Aug 2022 17:01:30 +0000 (+0300)
Subject: Slightly refactored documentation
X-Git-Tag: v0.1.0~5
X-Git-Url: http://www.git.stargrave.org/?p=glocate.git;a=commitdiff_plain;h=ecb9a41c73d8d7d8c75215c57d5d91f6c476098d756489036bedb8f136e8fb9e

Slightly refactored documentation
---
diff --git a/FORMAT b/FORMAT
new file mode 100644
index 0000000..6becd33
--- /dev/null
+++ b/FORMAT
@@ -0,0 +1,49 @@
+The storage format is simple: a Zstandard-compressed list of records:
+
+* 16-bit BE size of the following name
+* entity (file, directory, symbolic link, etc.) name itself.
+  A directory has a trailing "/"
+* single byte indicating the current file's depth
+* 64-bit BE mtime seconds
+* 64-bit BE file or directory (sum of all files and directories) size
+
+Index algorithm:
+
+* traverse the whole filesystem hierarchy in *sorted* order. All
+  records are written to a temporary file, without directory sizes,
+  because they are not known in advance during the walk
+* during the walk, remember in memory each directory's total size
+* read all records from that temporary file, writing them to another
+  one, replacing directory sizes with the remembered ones
+
+Search is trivial:
+
+* searching is performed on each record streamed from the database
+* if -root is specified, then the search will stop after that
+  hierarchy part is over
+* by default all elements are printed, unless you provide a single
+  argument that becomes a "*X*" pattern matched against case-lowered
+  path elements
+
+Update algorithm:
+
+* read all [-+MR] actions from "zfs diff -FH", validating the whole
+  format
+* each "R" for a file becomes "-" and "+" actions
+* if there are "R"s for directories, then stream the current database
+  and get each file entity under those directories, making "-" and "+"
+  actions correspondingly
+* each "+" also adds an entry to the list of "M"s
+* sort all "-", "+" and "M" filenames in ascending order
+* get entity's information for each "M" (remembering its size and mtime)
+* stream the current database records, writing them to a temporary
+  file, taking into account that:
+  * if a record exists in the "-"-list, then skip it
+  * if any "+" in the *sorted* list has precedence over the record
+    from the database, then insert it into the stream, taking size
+    and mtime information from the "M"-list
+  * if any "M" exists for the read record, then use it to alter it
+* all that time the directory size calculating algorithm, the same one
+  used during the index procedure, also works in parallel
+* create another temporary file to copy the records with up-to-date
+  directory sizes
diff --git a/INSTALL b/INSTALL
new file mode 100644
index 0000000..c2a9c3e
--- /dev/null
+++ b/INSTALL
@@ -0,0 +1,15 @@
+The utility is written in Go, so basically it can be installed with:
+
+    $ go install go.stargrave.org/glocate
+
+However you may have some issues with the authenticity of the
+go.stargrave.org HTTPS server, which uses the ca.cypherpunks.ru CA. Look
+at the SSL_CERT_FILE, GIT_SSL_CAINFO and GOPRIVATE environment variables.
+
+Or you can manually clone its source code and build it yourself:
+
+    $ git clone git://git.stargrave.org/glocate.git
+    $ cd glocate
+    $ go build
+
+glocate has two dependencies that will be fetched by Go automatically.
diff --git a/README b/README
index e5d6fd2..150009a 100644
--- a/README
+++ b/README
@@ -5,95 +5,23 @@
 and quickly display some part of it. Like ordinary *locate utilities.
 But unlike others, it is able to eat zfs-diff's output and apply the
 changes to existing database.
 
-Why I wrote it? Indexing, just "find /big" can take a considerable
-amount of time, like an hour or so, with many I/O operations spent. But
-my home NAS has relatively few number of changes made every day. The
-only possible quick way to determine what exactly was modified is to
-traverse over ZFS'es Merkle trees to find a difference between
-snapshots. Fortunately zfs-diff command does exactly that, providing
-pretty machine-friendly output.
-
-Why this utility is so complicated? Initially it kept all database in
-memory, but that takes 2-3 GiBs of memory, that is huge amount. Moreover
-it fully loads it to perform any basic searches. So current
-implementation uses temporary files and heavy use of data streaming.
-
-Its storage format is simple: Zstandard-compressed list of records:
-
-* 16-bit BE size of the following name
-* entity (file, directory, symbolic link, etc) name itself.
-  Directory has trailing "/"
-* single byte indicating current file's depth
-* 64-bit BE mtime seconds
-* 64-bit BE file or directory (sum of all files and directories) size
-
-Its indexing algorithm is following:
-
-* traverse over all filesystem hierarchy in a *sorted* order. All
-  records are written to temporary file, without directory sizes,
-  because they are not known in advance during the walking
-* during the walk, remember in memory each directory's total size
-* read all records from that temporary file, writing to another one, but
-  replacing directory sizes with ones remembered
-
-Searching is trivial:
-
-* searching is performed on each record streamed from the database
-* if -root is specified, then search will stop after that hierarchy part
-  is over
-* by default all elements are printed, unless you provide a single
-  argument that becomes "*X*" pattern matched on case-lowered path
-  elements
-
-Updating algorithm is following:
-
-* read all [-+MR] actions from zfs-diff, validating the whole format
-* each file's "R" becomes "-" and "+" actions
-* if there are directory "R", then collect them and stream from current
-  database to determine each path entity you have to "-" and "+"
-* each "+" adds an entry to the list of "M"s
-* sort all "-", "+" and "M" filenames in ascending order
-* get entity's information for each "M" (remembering its size and mtime)
-* stream current database records, writing them to temporary file
-* if record exists in "-"-list, then skip it
-* if any "+" exists in the *sorted* list, that has precedence over the
-  record from database, then insert it into the stream, taking size and
-  mtime information from "M"-list
-* if any "M" exists for the read record, then use it to alter it
-* all that time, directory size calculating algorithm also works, the
-  same one used during indexing
-* create another temporary file to copy the records with actualized
-  directory sizes
-
-How to use it?
-
-    $ zfs snap big@snap1
-    $ cd /big ; glocate -db /tmp/glocate.db -index
-
-    $ glocate -db /tmp/glocate.db
-    [list of all files]
-
-    $ glocate -db /tmp/glocate.db -machine
-    [machine parseable list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db -tree
-    [beauty tree-like list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db -root music
-    [just a music hierarchy path]
-
-    $ glocate -db /tmp/glocate.db -root music blasphemy | grep "/$"
-    music/Blasphemy-2001-Gods_Of_War_+_Blood_Upon_The_Altar/
-    music/Cryptopsy-1994-Blasphemy_Made_Flesh/
-    music/Infernal_Blasphemy-2005-Unleashed/
-    music/Ravenous-Assembled_In_Blasphemy/
-    music/Sect_Of_Execration-2002-Baptized_Through_Blasphemy/
-    music/Spectral_Blasphemy-2012-Blasphmemial_Catastrophic/
-
-and update it carefully, providing the strip prefix to -update:
-
-    $ zfs snap big@snap2
-    $ zfs diff -FH big@snap2 | glocate -db /tmp/glocate.db -update /big/
-
-glocate is copylefted free software: see the file COPYING for copying
-conditions.
+Why I wrote it? I have got a ZFS data storage with ~18M files, where
+even "find /storage" takes a considerable amount of time, up to an
+hour. So I have to use a separate indexed database and search against
+it. The locate family of utilities does exactly that. But none of
+them can pick up the few changes occasionally made to the dataset
+without traversing the whole dataset anyway, taking much I/O.
+
+Fortunately the ZFS design with Merkle trees is able to show us the
+difference quickly and without notable I/O. The "zfs diff" command's
+output is very machine-friendly. So a locate-like utility just has to
+be able to update its database from zfs-diff's output.
+
+Why is this utility relatively complicated? Initially it kept the
+whole database in memory, but that took 2-3 GiB of memory, which is
+a huge amount. Moreover it had to fully load it to perform even basic
+searches. So the current implementation uses temporary files and heavy
+data streaming. The database in my case takes less than 128 MiB and
+searching takes only several seconds on my machine.
+
+It is free software: see the file COPYING for copying conditions.
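
As an illustration of the record layout that the new FORMAT file above
describes, here is a minimal Go sketch. It is not part of glocate: the
Record type and field names are made up for the example, only the byte
layout follows the description, and the real database additionally
passes the whole record stream through a Zstandard compressor.

    // Illustrative sketch of one record in the layout described in FORMAT:
    // 16-bit BE name length, the name itself (directories keep a trailing
    // "/"), a single depth byte, 64-bit BE mtime seconds, 64-bit BE size.
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    type Record struct {
        Name  string // directories carry a trailing "/"
        Depth uint8
        Mtime int64  // seconds
        Size  uint64 // for directories: sum of everything beneath
    }

    func writeRecord(buf *bytes.Buffer, r Record) {
        binary.Write(buf, binary.BigEndian, uint16(len(r.Name)))
        buf.WriteString(r.Name)
        buf.WriteByte(r.Depth)
        binary.Write(buf, binary.BigEndian, r.Mtime)
        binary.Write(buf, binary.BigEndian, r.Size)
    }

    func main() {
        var buf bytes.Buffer
        writeRecord(&buf, Record{Name: "music/", Depth: 1, Mtime: 1660000000, Size: 123456})
        fmt.Printf("% x\n", buf.Bytes())
    }

Because every record is length-prefixed, the database can be read
strictly sequentially, which is what the streaming search and update
described above rely on.
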
diff --git a/USAGE b/USAGE
new file mode 100644
index 0000000..fb9e732
--- /dev/null
+++ b/USAGE
@@ -0,0 +1,47 @@
+It is advisable to create a ZFS snapshot first, to be sure that there
+is some checkpoint state:
+
+    # zfs snap storage@snap1
+    $ cd /storage/.zfs/snapshot/snap1
+
+Run the index procedure:
+
+    $ glocate -db /tmp/db -index
+
+After that, you can print all filenames:
+
+    $ glocate -db /tmp/db
+
+List them with sizes and mtimes in machine-parseable format:
+
+    $ glocate -db /tmp/db -machine
+
+Or in human-friendly tree-like format:
+
+    $ glocate -db /tmp/db -tree
+
+You can limit the hierarchy with the -root option:
+
+    $ glocate -db /tmp/db -root music
+    [just the music hierarchy]
+
+And you can specify a glob pattern for a case-insensitive match against
+each path element; it is automatically wrapped with "*":
+
+    $ glocate -db /tmp/db -root music blasphemy | grep "/$"
+    music/Blasphemy-2001-Gods_Of_War_+_Blood_Upon_The_Altar/
+    music/Cryptopsy-1994-Blasphemy_Made_Flesh/
+    music/Infernal_Blasphemy-2005-Unleashed/
+    music/Ravenous-Assembled_In_Blasphemy/
+    music/Sect_Of_Execration-2002-Baptized_Through_Blasphemy/
+    music/Spectral_Blasphemy-2012-Blasphmemial_Catastrophic/
+
+If you changed your dataset(s) somehow, then you should create a new
+snapshot and feed its diff to the command:
+
+    # zfs snap storage@snap2
+    # zfs diff -FH storage@snap2 |
+        glocate -db /tmp/db -update /storage/
+
+The argument to -update is the prefix stripped from each filename in
+the diff's output.
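
The update procedure above consumes "zfs diff -FH" output. A minimal
Go sketch of how the [-+MR] actions could be collected from it follows.
This is not glocate's actual parser: the tab-separated change-type,
file-type, path(s) line layout is an assumption to be checked against
zfs(8), and the directory-rename expansion that glocate performs by
streaming its database is left out.

    // Illustrative sketch: collect "-", "+" and "M" path lists from
    // zfs-diff lines read on stdin, as the update algorithm expects.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "sort"
        "strings"
    )

    func main() {
        var minus, plus, modified []string
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            fields := strings.Split(scanner.Text(), "\t")
            if len(fields) < 3 {
                continue // not a zfs-diff line we understand
            }
            path := fields[2]
            switch fields[0] {
            case "-":
                minus = append(minus, path)
            case "+":
                plus = append(plus, path)
                modified = append(modified, path) // each "+" also becomes an "M"
            case "M":
                modified = append(modified, path)
            case "R": // a file rename is a removal plus a creation
                if len(fields) >= 4 {
                    minus = append(minus, path)
                    plus = append(plus, fields[3])
                    modified = append(modified, fields[3])
                }
            }
        }
        // the update algorithm wants all three lists sorted in ascending order
        sort.Strings(minus)
        sort.Strings(plus)
        sort.Strings(modified)
        fmt.Println("-:", minus)
        fmt.Println("+:", plus)
        fmt.Println("M:", modified)
    }

A hypothetical invocation, mirroring the USAGE example above:

    # zfs diff -FH storage@snap2 | go run parse.go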