X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=README;h=150009a3a3bd10419e829f4a0b697c0f228c6ff3c254dbeba74c6f0ce38feb78;hb=66cef1659e7d49d330e693b5747bed1167c6294622908a868adc0597c7d5743a;hp=8c52b350ce33c6e2b4f83ae2dd4fc980b342b8d3fa0b79ac2917988101f77477;hpb=db65ffeff7274def395c8ee747873d0e9d8250b75f543b6ac0d7bbd079cce66d;p=glocate.git diff --git a/README b/README index 8c52b35..150009a 100644 --- a/README +++ b/README @@ -5,85 +5,23 @@ and quickly display some part of it. Like ordinary *locate utilities. But unlike others, it is able to eat zfs-diff's output and apply the changes to existing database. -Why I wrote it? Indexing, just "find /big" can take a considerable -amount of time, like an hour or so, with many I/O operations spent. But -my home NAS has relatively few number of changes made every day. The -only possible quick way to determine what exactly was modified is to -traverse over ZFS'es Merkle trees to find a difference between -snapshots. Fortunately zfs-diff command does exactly that, providing -pretty machine-friendly output. - -Why this utility is so complicated? Initially it kept all database in -memory, but that takes 2-3 GiBs of memory, that is huge amount. Moreover -it fully loads it to perform any basic searches. So current -implementation uses temporary files and heavy use of data streaming. - -Its storage format is simple: Zstandard-compressed list of records: - -* 16-bit BE size of the following name -* entity (file, directory, symbolic link, etc) name itself. - Directory has trailing "/" -* single byte indicating current file's depth -* 64-bit BE mtime seconds -* 64-bit BE file or directory (sum of all files and directories) size - -Its indexing algorithm is following: - -* traverse over all filesystem hierarchy in a *sorted* order. 
All - records are written to temporary file, without directory sizes, - because they are not known in advance during the walking -* during the walk, remember in memory each directory's total size -* read all records from that temporary file, writing to another one, but - replacing directory sizes with ones remembered - -Searching is trivial: - -* there is no actual searching, just a streaming through all the - database file sequentially -* if some root is specified, then the program will output only its - hierarchy path, exiting after it is finished - -Updating algorithm is following: - -* read all [-+MR] actions from zfs-diff, validating the whole format -* each file's "R" becomes "-" and "+" actions -* if there are directory "R", then collect them and stream from current - database to determine each path entity you have to "-" and "+" -* each "+" adds an entry to the list of "M"s -* sort all "-", "+" and "M" filenames in ascending order -* get entity's information for each "M" (remembering its size and mtime) -* stream current database records, writing them to temporary file -* if record exists in "-"-list, then skip it -* if any "+" exists in the *sorted* list, that has precedence over the - record from database, then insert it into the stream, taking size and - mtime information from "M"-list -* if any "M" exists for the read record, then use it to alter it -* all that time, directory size calculating algorithm also works, the - same one used during indexing -* create another temporary file to copy the records with actualized - directory sizes - -How to use it? 
- 
-    $ zfs snap big@snap1
-    $ cd /big ; glocate -db /tmp/glocate.db -index
-
-    $ glocate -db /tmp/glocate.db
-    [list of all files]
-
-    $ glocate -db /tmp/glocate.db -machine
-    [machine parseable list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db -tree
-    [beauty tree-like list of files with sizes and mtimes]
-
-    $ glocate -db /tmp/glocate.db some/sub/path
-    [just a part of the whole hierarchy]
-
-and update it carefully:
-
-    $ zfs snap big@snap2
-    $ zfs diff -FH big@snap2 | glocate -db /tmp/glocate.db -strip /big/ -update
-
-glocate is copylefted free software: see the file COPYING for copying
-conditions.
+Why did I write it? I have a ZFS data storage with ~18M files, where
+even a plain "find /storage" takes a considerable amount of time, up
+to an hour. So I have to use a separate indexed database and search
+against it. The locate family of utilities does exactly that. But
+none of them can detect the few changes seldom made to the dataset
+without traversing the whole dataset anyway, spending much I/O.
+
+Fortunately ZFS's design, with its Merkle trees, can show the
+difference quickly and without notable I/O. The "zfs diff" command's
+output is very machine-friendly. So a locate-like utility only has to
+be able to update its database with zfs-diff's output.
+
+Why is this utility relatively complicated? Initially it kept the
+whole database in memory, but that took 2-3 GiB, which is a huge
+amount. Moreover, it fully loaded it to perform even basic searches.
+So the current implementation uses temporary files and makes heavy
+use of data streaming. In my case the database takes less than
+128 MiB, and searching takes only several seconds on my machine.
+
+It is free software: see the file COPYING for copying conditions.
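
The README relies on "zfs diff -FH" output being machine friendly, and
its update algorithm starts by turning every "R" (rename) action into a
"-" for the old name plus a "+" for the new one. Below is a minimal
illustrative sketch of that first step, assuming the usual tab-separated
"zfs diff -FH" layout (change type, file type, path, and for renames a
second path). It is not part of glocate, which is written in Go, and the
sample paths are made up:

```python
def parse_zfs_diff(lines):
    """Parse "zfs diff -FH" lines into (change, file_type, path) tuples.

    Change types are "-", "+", "M" and "R"; with -F a file-type column
    ("F" regular file, "/" directory, "@" symlink, ...) is present.
    As the README describes, each "R" is split into a "-" action for
    the old path and a "+" action for the new path.
    """
    actions = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        change, ftype = fields[0], fields[1]
        if change == "R":
            # rename: old name disappears, new name appears
            actions.append(("-", ftype, fields[2]))
            actions.append(("+", ftype, fields[3]))
        elif change in ("-", "+", "M"):
            actions.append((change, ftype, fields[2]))
        else:
            raise ValueError("unknown zfs-diff action: " + change)
    return actions

if __name__ == "__main__":
    sample = [
        "M\t/\t/big/dir",
        "R\tF\t/big/dir/old\t/big/dir/new",
        "+\tF\t/big/dir/file",
    ]
    for act in parse_zfs_diff(sample):
        print(act)
```

After this splitting, the resulting "-", "+" and "M" lists can be
sorted by filename and merged against the streamed database records,
as the README's update algorithm outlines.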