X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;ds=sidebyside;f=Documentation%2Fpublic-inbox-extindex-format.pod;fp=Documentation%2Fpublic-inbox-extindex-format.pod;h=acb21cd9b4a97ebb114744b4d4f62c6776b2fa17;hb=6887925b5e5028090d92dedc8a584b701d56e180;hp=0000000000000000000000000000000000000000;hpb=1bf653ad139bf7bb3d853ab0b5eae3eaa1b13a95;p=public-inbox.git diff --git a/Documentation/public-inbox-extindex-format.pod b/Documentation/public-inbox-extindex-format.pod new file mode 100644 index 00000000..acb21cd9 --- /dev/null +++ b/Documentation/public-inbox-extindex-format.pod @@ -0,0 +1,110 @@ +% public-inbox developer manual + +=head1 NAME + +public-inbox extindex format description + +=head1 DESCRIPTION + +The extindex is an index-only evolution of the per-inbox +SQLite and Xapian indices used by L +and L. It exists to facilitate +searches across multiple inboxes as well as to reduce index +space when messages are cross-posted to several existing +inboxes. + +It transparently indexes messages across any combination of v1 and v2 +inboxes and data about inboxes themselves. + +=head1 DIRECTORY LAYOUT + +While inspired by v2, there is no git blob storage nor +C DB. + +Instead, there is an C (all caps) git repo which treats +every indexed v1 inbox or v2 epoch as a git alternate. + +As with v2 inboxes, it uses C and Xapian "shards" +for WWW and IMAP use. Several exclusive new tables are added +to deal with L and metadata. + +Unlike v1 and v2 inboxes, it is NOT designed to map to a NNTP +newsgroup. Thus it lacks C to enforce the +unique Message-ID requirement of NNTP. + +=head2 INDEX OVERVIEW AND DEFINITIONS + + $SCHEMA_VERSION - DB schema version (for Xapian) + $SHARD - Integer starting with 0 based on parallelism + + foo/ # "foo" is the name of the index + - ei.lock # lock file to protect global state + - ALL.git # empty, alternates for inboxes + - ei$SCHEMA_VERSION/$SHARD # per-shard Xapian DB + - ei$SCHEMA_VERSION/over.sqlite3 # overview DB for WWW, IMAP + - ei$SCHEMA_VERSION/misc # misc Xapian DB + +File and directory names are intentionally different from +analogous v2 names to ensure extindex and v2 inboxes can +easily be distinguished from each other. + +=head2 XREF3 DEDUPLICATION + +Due to cross-posted messages being the norm in the large Linux kernel +development community and Xapian indices being the primary consumer of +storage, it makes sense to deduplicate indexing as much as possible. + +The internal storage format is based on the NNTP "Xref" tuple, +but with the addition of a third element: the git blob OID. +Thus the triple is expressed in string form as: + + $NEWSGROUP_NAME:$ARTICLE_NUM:$OID + +If no C is configured for an inbox, the C +of the inbox is used. + +This data is stored in the C table of over.sqlite3. + +=head2 misc XAPIAN DB + +In addition to the numeric Xapian shards for indexing messages, +there is a new, in-development Xapian index for storing data +about inboxes themselves and other non-message data. This +index allows us to speed up operations involving hundreds or +thousands of inboxes. + +=head1 BENEFITS + +In addition to providing cross-inbox search capabilities, it can +also replace per-inbox Xapian shards (but not per-inbox +over.sqlite3). This allows reduction in disk space, open file +handles, and associated memory use. + +=head1 CAVEATS + +Relocating v1 and v2 inboxes on the filesystem will require +extindex to be garbage-collected and/or reindexed. + +Configuring and maintaining stable C names before any +messages are indexed from every inbox can avoid expensive +reindexing and rely exclusively on GC. + +=head1 LOCKING + +L locking exclusively locks the empty ei.lock file +for all non-atomic operations. + +=head1 THANKS + +Thanks to the Linux Foundation for sponsoring the development +and testing. + +=head1 COPYRIGHT + +Copyright 2020 all contributors L + +License: AGPL-3.0+ L + +=head1 SEE ALSO + +L