X-Git-Url: http://www.git.stargrave.org/?p=glocate.git;a=blobdiff_plain;f=README;h=150009a3a3bd10419e829f4a0b697c0f228c6ff3c254dbeba74c6f0ce38feb78;hp=e5d6fd212d1112f1fa25948c8884a29ec50308cf4aa665a5fccd931cd5a99ae5;hb=ecb9a41c73d8d7d8c75215c57d5d91f6c476098d756489036bedb8f136e8fb9e;hpb=e5a64361a0537c82d3fbc21b986fc2815f394571f46a0e4128ced789a6712f36

diff --git a/README b/README
index e5d6fd2..150009a 100644
--- a/README
+++ b/README
@@ -5,95 +5,23 @@ and quickly display some part of it. Like ordinary *locate utilities.
 But unlike others, it is able to eat zfs-diff's output and apply the
 changes to existing database.
 
-Why I wrote it? Indexing, just "find /big" can take a considerable
-amount of time, like an hour or so, with many I/O operations spent. But
-my home NAS has relatively few number of changes made every day. The
-only possible quick way to determine what exactly was modified is to
-traverse over ZFS'es Merkle trees to find a difference between
-snapshots. Fortunately zfs-diff command does exactly that, providing
-pretty machine-friendly output.
-
-Why this utility is so complicated? Initially it kept all database in
-memory, but that takes 2-3 GiBs of memory, that is huge amount. Moreover
-it fully loads it to perform any basic searches. So current
-implementation uses temporary files and heavy use of data streaming.
-
-Its storage format is simple: Zstandard-compressed list of records:
-
-* 16-bit BE size of the following name
-* entity (file, directory, symbolic link, etc) name itself.
-  Directory has trailing "/"
-* single byte indicating current file's depth
-* 64-bit BE mtime seconds
-* 64-bit BE file or directory (sum of all files and directories) size
-
-Its indexing algorithm is following:
-
-* traverse over all filesystem hierarchy in a *sorted* order. All
-  records are written to temporary file, without directory sizes,
-  because they are not known in advance during the walking
-* during the walk, remember in memory each directory's total size
-* read all records from that temporary file, writing to another one, but
-  replacing directory sizes with ones remembered
-
-Searching is trivial:
-
-* searching is performed on each record streamed from the database
-* if -root is specified, then search will stop after that hierarchy part
-  is over
-* by default all elements are printed, unless you provide a single
-  argument that becomes "*X*" pattern matched on case-lowered path
-  elements
-
-Updating algorithm is following:
-
-* read all [-+MR] actions from zfs-diff, validating the whole format
-* each file's "R" becomes "-" and "+" actions
-* if there are directory "R", then collect them and stream from current
-  database to determine each path entity you have to "-" and "+"
-* each "+" adds an entry to the list of "M"s
-* sort all "-", "+" and "M" filenames in ascending order
-* get entity's information for each "M" (remembering its size and mtime)
-* stream current database records, writing them to temporary file
-* if record exists in "-"-list, then skip it
-* if any "+" exists in the *sorted* list, that has precedence over the
-  record from database, then insert it into the stream, taking size and
-  mtime information from "M"-list
-* if any "M" exists for the read record, then use it to alter it
-* all that time, directory size calculating algorithm also works, the
-  same one used during indexing
-* create another temporary file to copy the records with actualized
-  directory sizes
-
-How to use it?
-
-    $ zfs snap big@snap1
-    $ cd /big ; glocate -db /tmp/glocate.db -index
-
-    $ glocate -db /tmp/glocate.db
-    [list of all files]
-
-    $ glocate -db /tmp/glocate.db -machine
-    [machine parseable list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db -tree
-    [beauty tree-like list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db -root music
-    [just a music hierarchy path]
-
-    $ glocate -db /tmp/glocate.db -root music blasphemy | grep "/$"
-    music/Blasphemy-2001-Gods_Of_War_+_Blood_Upon_The_Altar/
-    music/Cryptopsy-1994-Blasphemy_Made_Flesh/
-    music/Infernal_Blasphemy-2005-Unleashed/
-    music/Ravenous-Assembled_In_Blasphemy/
-    music/Sect_Of_Execration-2002-Baptized_Through_Blasphemy/
-    music/Spectral_Blasphemy-2012-Blasphmemial_Catastrophic/
-
-and update it carefully, providing the strip prefix to -update:
-
-    $ zfs snap big@snap2
-    $ zfs diff -FH big@snap2 | glocate -db /tmp/glocate.db -update /big/
-
-glocate is copylefted free software: see the file COPYING for copying
-conditions.
+Why I wrote it? I have got ~18M files ZFS data storage, where even
+"find /storage" takes considerable amount of time, up to an hour.
+So I have to use separate indexed database and search against it.
+locate family of utilities does exactly that. But none of them are
+able to detect a few seldom made changes to the dataset, without
+traversing through the whole dataset anyway, taking much IO.
+
+Fortunately ZFS design with Merkle trees is able to show us the
+difference quickly and without notable IO. "zfs diff" command's
+output is very machine friendly. So locate-like utility has to be able
+to update its database with zfs-diff's output.
+
+Why this utility is so relatively complicated? Initially it kept all
+database in memory, but that took 2-3 GiBs of memory, that is huge
+amount. Moreover it fully loads it to perform any basic searches. So
+current implementation uses temporary files and heavy use of data
+streaming. Database in my case takes less than 128MiB of data. And
+searching takes only several seconds on my machine.
+
+It is free software: see the file COPYING for copying conditions.