% public-inbox developer manual

=head1 NAME

public-inbox-extindex-format - external index format description

=head1 DESCRIPTION

The extindex is an index-only evolution of the per-inbox
SQLite and Xapian indices used by L<public-inbox-v2-format(5)>
and L<public-inbox-v1-format(5)>. It exists to facilitate
searches across multiple inboxes as well as to reduce index
space when messages are cross-posted to several existing
inboxes.

It transparently indexes messages across any combination of v1 and v2
inboxes, as well as data about the inboxes themselves.

=head1 DIRECTORY LAYOUT

While inspired by v2, there is no git blob storage nor
C<msgmap.sqlite3> DB.

Instead, there is an C<ALL.git> (all caps) git repo which treats
every indexed v1 inbox or v2 epoch as a git alternate.
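
As a rough illustration of the alternates arrangement, the Python
sketch below builds the lines that could go into git's standard
C<objects/info/alternates> file, which names one object directory per
line. The example paths, the helper name C<alternates_lines>, the
assumption that a v1 inbox is a bare git repo at its C<inboxdir>, and
the C<git/N.git> epoch layout for v2 are illustrative assumptions only.
How public-inbox itself writes this file (path style, ordering) is not
described here.

```python
from pathlib import PurePosixPath


def alternates_lines(v1_inboxes, v2_inboxes):
    """Return candidate lines for ALL.git/objects/info/alternates.

    Hypothetical helper, not part of public-inbox.
    v1_inboxes: iterable of inboxdir paths (assumed to be bare git repos)
    v2_inboxes: mapping of inboxdir -> iterable of epoch numbers
                (epochs assumed to live at inboxdir/git/N.git)
    """
    lines = []
    for inboxdir in v1_inboxes:
        # one alternate per v1 inbox
        lines.append(str(PurePosixPath(inboxdir) / "objects"))
    for inboxdir, epochs in v2_inboxes.items():
        # one alternate per v2 epoch
        for epoch in sorted(epochs):
            epoch_dir = PurePosixPath(inboxdir) / "git" / f"{epoch}.git"
            lines.append(str(epoch_dir / "objects"))
    return lines


# made-up inbox locations, for demonstration only
example = alternates_lines(
    ["/srv/inboxes/old-list"],
    {"/srv/inboxes/big-list": [1, 0]},
)
```

With alternates in place, git can read blobs from every listed object
store through C<ALL.git> without copying them, which is why C<ALL.git>
itself can stay empty.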

As with v2 inboxes, it uses C<over.sqlite3> and Xapian "shards"
for WWW and IMAP use. Several exclusive new tables are added
to deal with L</XREF3 DEDUPLICATION> and metadata.

Unlike v1 and v2 inboxes, it is NOT designed to map to an NNTP
newsgroup. Thus it lacks C<msgmap.sqlite3> to enforce the
unique Message-ID requirement of NNTP.

=head2 INDEX OVERVIEW AND DEFINITIONS

  $SCHEMA_VERSION - DB schema version (for Xapian)
  $SHARD - Integer starting with 0 based on parallelism

  foo/                              # "foo" is the name of the index
  - ei.lock                         # lock file to protect global state
  - ALL.git                         # empty, alternates for inboxes
  - ei$SCHEMA_VERSION/$SHARD        # per-shard Xapian DB
  - ei$SCHEMA_VERSION/over.sqlite3  # overview DB for WWW, IMAP
  - ei$SCHEMA_VERSION/misc          # misc Xapian DB

File and directory names are intentionally different from
analogous v2 names to ensure extindex and v2 inboxes can
easily be distinguished from each other.
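
The layout above can be sketched in Python as a function that expands
the C<$SCHEMA_VERSION> and C<$SHARD> placeholders into concrete paths.
The function name C<extindex_paths> and the schema version and shard
count passed to it are placeholders chosen for this example, not values
taken from public-inbox.

```python
from pathlib import PurePosixPath


def extindex_paths(topdir, schema_version, nr_shards):
    """List the paths expected under an extindex directory.

    Hypothetical helper that only mirrors the layout table above.
    """
    top = PurePosixPath(topdir)
    ei = top / f"ei{schema_version}"
    paths = [top / "ei.lock", top / "ALL.git"]
    # shards are numbered from 0 up to nr_shards - 1
    paths += [ei / str(shard) for shard in range(nr_shards)]
    paths += [ei / "over.sqlite3", ei / "misc"]
    return [str(p) for p in paths]


# schema version 15 and two shards are placeholder values
layout = extindex_paths("foo", 15, 2)
for path in layout:
    print(path)
```

A tool could use such a list to tell an extindex apart from a v2 inbox,
since none of these names collide with the analogous v2 names.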

=head2 XREF3 DEDUPLICATION

Because cross-posted messages are the norm in the large Linux kernel
development community and Xapian indices are the primary consumer of
storage, it makes sense to deduplicate indexing as much as possible.

The internal storage format is based on the NNTP "Xref" tuple,
but with the addition of a third element: the git blob OID.
Thus the triple is expressed in string form as:

	$NEWSGROUP_NAME:$ARTICLE_NUM:$OID

If no C<newsgroup> is configured for an inbox, the C<inboxdir>
of the inbox is used instead.

This data is stored in the C<xref3> table of C<over.sqlite3>.
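
The string form of the triple can be illustrated with a small Python
sketch that formats and parses it. The type name C<Xref3>, its field
names, and both helper functions are assumptions made for this example.
The sketch covers only the string form shown above and says nothing
about the actual columns of the C<xref3> table.

```python
from typing import NamedTuple


class Xref3(NamedTuple):
    ibx_id: str       # newsgroup name, or inboxdir if no newsgroup is set
    article_num: int  # article number within that inbox
    oid: str          # git blob OID in hex


def format_xref3(x):
    """Render a triple as NEWSGROUP_NAME:ARTICLE_NUM:OID."""
    return f"{x.ibx_id}:{x.article_num}:{x.oid}"


def parse_xref3(s):
    """Parse the string form back into a triple."""
    # Split from the right: an inboxdir standing in for a newsgroup
    # name could itself contain ':' characters.
    ibx_id, num, oid = s.rsplit(":", 2)
    return Xref3(ibx_id, int(num), oid)


# made-up newsgroup name and OID, for demonstration only
sample = parse_xref3("inbox.comp.example:42:" + "a" * 40)
```

Splitting from the right is a design choice of this sketch: the article
number and OID never contain a colon, so the first field can safely
absorb any remaining colons.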

=head2 MISC XAPIAN DB

In addition to the numeric Xapian shards for indexing messages,
there is a new, in-development Xapian index for storing data
about inboxes themselves and other non-message data. This
index allows us to speed up operations involving hundreds or
thousands of inboxes.

=head1 BENEFITS

In addition to providing cross-inbox search capabilities, extindex can
also replace per-inbox Xapian shards (but not per-inbox
C<over.sqlite3>). This allows reduction in disk space, open file
handles, and associated memory use.

=head1 CAVEATS

Relocating v1 and v2 inboxes on the filesystem will require
extindex to be garbage-collected and/or reindexed.

Configuring stable C<newsgroup> names for every inbox before any of
its messages are indexed, and keeping those names stable afterwards,
avoids expensive reindexing and lets extindex rely exclusively on GC.

=head1 LOCKING

An exclusive L<flock(2)> lock is held on the empty C<ei.lock> file
for all non-atomic operations.
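
The locking rule can be sketched in Python with C<fcntl.flock>, which
wraps L<flock(2)>. This is a minimal illustration of taking an
exclusive lock on C<ei.lock>, not the code public-inbox uses. The
context manager name C<ei_locked> is hypothetical, and the sketch
creates the lock file if it is missing only so that it can run on its
own.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def ei_locked(topdir):
    """Hold an exclusive flock(2) on topdir/ei.lock for the 'with' body."""
    lock_path = os.path.join(topdir, "ei.lock")
    # O_CREAT is only for this demo; a real extindex already has the file
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield fd
    finally:
        # closing the descriptor also releases the lock
        os.close(fd)


# demonstration against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    with ei_locked(tmp):
        pass  # non-atomic updates to the index would go here
```

Because the lock is advisory, it only protects against other processes
that also take the lock before touching global state.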

=head1 THANKS

Thanks to the Linux Foundation for sponsoring the development

=head1 COPYRIGHT

Copyright 2020-2021 all contributors L<mailto:meta@public-inbox.org>

License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>

=head1 SEE ALSO

L<public-inbox-v2-format(5)>