Documentation/public-inbox-v2-format.pod

% public-inbox developer manual

=head1 NAME

public-inbox v2 repository description

=head1 DESCRIPTION

The v2 format is designed primarily to address several
scalability problems of the original format described at
L<public-inbox-v1-format(5)>.  It also handles messages with
duplicate or multiple Message-IDs.

=head1 INBOX LAYOUT

The key change in v2 is that the inbox is no longer a bare git
repository, but a directory with two or more git repositories.
v2 divides git repositories into time-based "epochs" and divides
Xapian databases into "partitions" for parallelism.

=head2 INBOX OVERVIEW AND DEFINITIONS

  $EPOCH - Integer starting with 0 based on time
  $SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
  $PART - Integer (0..NPROCESSORS)

  foo/                              # assuming "foo" is the name of the list
  - inbox.lock                      # lock file (flock) to protect global state
  - git/$EPOCH.git                  # normal git repositories
  - all.git                         # empty git repo, alternates to git/$EPOCH.git
  - xap$SCHEMA_VERSION/$PART        # per-partition Xapian DB
  - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
  - msgmap.sqlite3                  # same as the v1 msgmap

For blob lookups, the reader only needs to open the "all.git"
repository, whose $GIT_DIR/objects/info/alternates file references
every $EPOCH.git repo.

Individual $EPOCH.git repos DO NOT use alternates themselves, as
git currently limits the nesting depth of alternates to 5.

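As a rough illustration of the layout above, the following Python
sketch lists the epoch repositories under an inbox directory and
writes one alternates line per epoch into all.git.  The function
names, the use of absolute paths, and the ordering by epoch number
are assumptions made for this example; they are not taken from the
public-inbox code.

```python
import os
import re


def epoch_object_dirs(inbox_dir):
    """Return the objects/ path of every git/$EPOCH.git, ordered by epoch."""
    git_dir = os.path.join(os.path.abspath(inbox_dir), "git")
    epochs = []
    for name in os.listdir(git_dir):
        m = re.fullmatch(r"(\d+)\.git", name)
        if m:  # skip anything that is not named like an epoch
            epochs.append(int(m.group(1)))
    return [os.path.join(git_dir, f"{n}.git", "objects")
            for n in sorted(epochs)]


def write_alternates(inbox_dir):
    """Rewrite all.git/objects/info/alternates, one line per epoch."""
    info = os.path.join(inbox_dir, "all.git", "objects", "info")
    os.makedirs(info, exist_ok=True)
    lines = epoch_object_dirs(inbox_dir)
    with open(os.path.join(info, "alternates"), "w") as fh:
        fh.write("".join(line + "\n" for line in lines))
    return lines
```

With such a file in place, a reader can point GIT_DIR at foo/all.git
and never needs to open the individual epochs.
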
=head2 GIT EPOCHS

One of the inherent scalability problems with git itself is that
the full history of a project must be stored and carried around
to all clients.  To address this problem, the v2 format uses
multiple git repositories, stored as time-based "epochs".

We currently divide epochs into roughly one gigabyte segments;
this size could be made configurable in the future if needed.

A pleasant side-effect of this design is that the git packs of
older epochs are stable, allowing them to be cloned without
requiring expensive pack generation.  This also allows clients to
clone only the epochs they are interested in to save bandwidth
and storage.

To minimize changes to existing v1-based code and simplify our
code, we use the "alternates" mechanism described in
L<gitrepository-layout(5)> to link all the epoch repositories
with a single read-only "all.git" endpoint.

Readers retrieve blobs via the "all.git" repository, while
writers write blobs directly to epochs.

=head2 GIT TREE LAYOUT

One key problem specific to v1 was that large trees were
frequently a performance problem: name lookups are expensive, and
unpredictable file names limited deltafication opportunities.
As a result, all Xapian-enabled installations retrieve
blob object_ids directly in v1, bypassing tree lookups.

While dividing git repositories into epochs caps the growth of
trees, worst-case tree size was still unnecessary overhead and
worth eliminating.

So in contrast to the big trees of v1, the v2 git tree contains
only a single file at the top-level of the tree, either 'm' (for
'mail' or 'message') or 'd' (for 'deleted').  A tree does not
have 'm' and 'd' at the same time.

Mail is still stored in blobs (instead of inline with the commit
object) as we still need a stable reference in the indices in
case commit history is rewritten to comply with legal
requirements.

After-the-fact invocations of L<public-inbox-index(1)> will ignore
messages written to 'd' after they are written to 'm'.

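The effect of 'm' and 'd' commits on a later indexing run can be
sketched in Python as follows.  This is only a model: the history is
represented as a list of (filename, key) pairs in commit order, and
matching a deletion to an earlier message by an opaque key is an
assumption of this example, not a description of how
public-inbox-index matches them.

```python
def live_messages(history):
    """Return keys of messages that were not deleted after being added.

    `history` is a sequence of (filename, key) pairs, oldest first,
    where filename is 'm' (message) or 'd' (deleted).
    """
    deleted = set()
    live = []
    # Walk newest-first so a deletion is seen before the 'm' it removes.
    for name, key in reversed(list(history)):
        if name == "d":
            deleted.add(key)
        elif name == "m" and key not in deleted:
            live.append(key)
    live.reverse()  # restore oldest-first order
    return live
```

A message that is added again after its deletion stays visible in
this model, since only the older 'm' commit precedes the 'd' commit.
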
Deltafication is not significantly improved over v1, but overall
storage for trees is made as small as possible.  Initial
statistics and benchmarks showing the benefits of this approach
are documented at:

L<https://public-inbox.org/meta/20180209205140.GA11047@dcvr/>

=head2 XAPIAN PARTITIONS

A second scalability problem in v1 was the inability to
utilize multiple CPU cores for Xapian indexing.  This is
addressed by using partitions in Xapian to perform import
indexing in parallel.

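The text above does not say how documents are assigned to
partitions.  One simple scheme, assumed here purely for
illustration, is to take an integer message number modulo the
partition count, so that each indexing worker owns one partition.
The names partition_for and split_work are hypothetical.

```python
def partition_for(num, nparts):
    """Map a message number to a partition index in 0..nparts-1."""
    if nparts < 1:
        raise ValueError("need at least one partition")
    return num % nparts


def split_work(nums, nparts):
    """Group message numbers into one bucket per partition."""
    buckets = [[] for _ in range(nparts)]
    for n in nums:
        buckets[partition_for(n, nparts)].append(n)
    return buckets
```

Each bucket can then be handed to a separate process, which is what
lets indexing use more than one CPU core.
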
As with git alternates, Xapian natively supports a read-only
interface which transparently abstracts away the knowledge of
multiple partitions.  This allows us to simplify our read-only
code paths.

The performance of the storage device is now the bottleneck on
larger multi-core systems.  In our experience, performance is
improved with high-quality and high-quantity solid-state storage.
Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
consistent performance while developing this feature.

Rotational storage devices are NOT recommended for indexing of
large mail archives, but are fine for backup and usable for
small instances.

Our use of the L</OVERVIEW DB> requires Xapian document IDs to
remain stable.  Thus, L<xapian-compact(1)> and L<copydatabase(8)>
must be run with the C<--no-renumber> switch.

=head2 OVERVIEW DB

Towards the end of v2 development, it became apparent that Xapian
did not perform well for sorting large result sets used to
generate the landing page in the PSGI UI (/$INBOX/) or many
queries used by the NNTP server.  Thus, SQLite was employed and
the Xapian "skeleton" DB was renamed to the "overview" DB (after
the NNTP OVER/XOVER commands).

The overview DB maintains all the header information necessary
to implement the NNTP OVER/XOVER commands and non-search
endpoints of the PSGI UI.

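To show why an indexed SQLite table suits this kind of sorted
listing, here is a small Python sketch.  The table name, column
names and index are hypothetical and chosen only for the example;
they do not describe the actual schema of over.sqlite3.

```python
import sqlite3


def open_overview(path=":memory:"):
    """Open (or create) a toy overview DB with a timestamp index."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS overview ("
        " num INTEGER PRIMARY KEY,"  # article number
        " ts INTEGER NOT NULL,"      # message timestamp
        " subject TEXT)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS idx_ts ON overview (ts)")
    return db


def recent(db, limit):
    """Return the newest `limit` messages as (num, subject) tuples."""
    cur = db.execute(
        "SELECT num, subject FROM overview"
        " ORDER BY ts DESC, num DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()
```

With the index on ts, the newest rows can be read in order without
sorting the whole result set first.
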
In the future, Xapian will become completely optional for v2 (as
it is for v1) as SQLite turns out to be powerful enough to
maintain overview information.  Most of the PSGI and all of the
NNTP functionality will be possible with only SQLite in addition
to git.

The overview DB was an instrumental piece in maintaining near
constant-time read performance on a dataset 2-3 times larger
than LKML history as of 2018.

=head3 GHOST MESSAGES

The overview DB also includes references to "ghost" messages,
or messages which have replies but have not been seen by us.
Thus it is expected to have more rows than the "msgmap" DB
described below.

=head2 msgmap.sqlite3

The SQLite msgmap DB is unchanged from v1, but it is now at the
top-level of the directory.

=head1 OBJECT IDENTIFIERS

There are three distinct types of identifiers.  content_id is the
new one for v2 and should make message removal and deduplication
easier.  object_id and Message-ID are already known.

=over

=item object_id

The blob identifier git uses (currently SHA-1).  No need to
publicly expose this outside of normal git ops (cloning) and
there's no need to make this searchable.  As with v1 of
public-inbox, this is stored as part of the Xapian document so
expensive name lookups can be avoided for document retrieval.

=item Message-ID

The email header; duplicates allowed for archival purposes.
This remains a searchable field in Xapian.  Note: it's possible
for emails to have multiple Message-ID headers (and L<git-send-email(1)>
had that bug for a bit), so we take all of them into account.
In case of conflicts detected by content_id below, we generate a new
Message-ID based on content_id; if the generated Message-ID still
conflicts, a random one is generated.

=item content_id

A hash of relevant headers and raw body content, used for
purging unwanted content.  This is not stored anywhere,
but always calculated on-the-fly.

For now, the relevant headers are:

        Subject, From, Date, References, In-Reply-To, To, Cc

Received, List-Id, and similar headers are NOT part of content_id as
they differ across lists and we will want removal to be able to cross
lists.

The textual parts of the body are decoded, CRLF normalized to
LF, and trailing whitespace stripped.  Notably, hashing the
raw body risks being broken by list signatures; but we can use
filters (e.g. PublicInbox::Filter::Vger) to clean the body for
imports.

content_id is SHA-256 for now, but since it is never stored, the
hash can be changed at any time without making DB changes.

=back

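The content_id description above can be approximated with the
following Python sketch.  It is not the public-inbox implementation:
the exact byte layout fed to the hash, the treatment of non-text
parts, and reading "trailing whitespace" as whitespace at the end of
each text part are all assumptions of this example.

```python
import hashlib
from email import message_from_bytes

# Headers listed in the text; Received, List-Id etc. are left out.
CONTENT_ID_HEADERS = ("Subject", "From", "Date", "References",
                      "In-Reply-To", "To", "Cc")


def normalize_text(text):
    """Normalize CRLF to LF and drop trailing whitespace."""
    return text.replace("\r\n", "\n").rstrip()


def content_id(raw):
    """Return a SHA-256 hex digest over selected headers and the body."""
    msg = message_from_bytes(raw)
    h = hashlib.sha256()
    for name in CONTENT_ID_HEADERS:
        for value in msg.get_all(name, []):
            h.update(f"{name}: {value}\n".encode("utf-8", "replace"))
    for part in msg.walk():
        if part.is_multipart():
            continue  # only leaf parts carry content
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "ascii"
            try:
                text = payload.decode(charset, "replace")
            except LookupError:  # unknown charset name
                text = payload.decode("ascii", "replace")
            payload = normalize_text(text).encode("utf-8")
        h.update(payload)
    return h.hexdigest()
```

Two copies of a message that differ only in Received or List-Id
headers, or in CRLF versus LF line endings, hash to the same value
under this sketch, which is the property cross-list removal needs.
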
=head1 LOCKING

An exclusive L<flock(2)> lock is taken on the empty inbox.lock
file for all non-atomic operations.

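A minimal Python sketch of this locking pattern follows, using the
standard fcntl module on a Unix-like system.  The helper name
inbox_lock and the file mode are assumptions for illustration only.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def inbox_lock(inbox_dir):
    """Hold an exclusive flock on inbox.lock for the enclosed block."""
    path = os.path.join(inbox_dir, "inbox.lock")
    # The file stays empty; it exists only to carry the lock.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A writer would wrap each non-atomic update in "with inbox_lock(dir):"
so that concurrent writers are serialized.
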
=head1 HEADERS

Same handling as with v1, except the Message-ID header will
be generated if not provided or conflicting.  "Bytes", "Lines"
and "Content-Length" headers are stripped and not allowed, as
they can interfere with further processing.

The "Status" mbox header is also stripped as that header makes
no sense in a public archive.

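The header handling above might look like the following Python
sketch.  The generated Message-ID format and the "example.invalid"
domain are made up for this example, and "taken" stands in for
whatever lookup detects a conflicting Message-ID; neither reflects
the actual public-inbox code.

```python
import secrets
from email import message_from_bytes

STRIPPED_HEADERS = ("Bytes", "Lines", "Content-Length", "Status")


def strip_headers(msg):
    """Remove headers that are not allowed in the archive."""
    for name in STRIPPED_HEADERS:
        del msg[name]  # removes every occurrence; no error if absent
    return msg


def choose_message_id(msg, content_id_hex, taken):
    """Return a Message-ID for msg that is not in the `taken` set."""
    mid = msg.get("Message-ID")
    if mid and mid.strip() not in taken:
        return mid.strip()
    # Missing or conflicting: first try one derived from content_id.
    candidate = f"<{content_id_hex}@example.invalid>"
    while candidate in taken:
        # Still conflicting: fall back to a random one.
        candidate = f"<{secrets.token_hex(16)}@example.invalid>"
    return candidate
```

For instance, a message parsed with message_from_bytes can be passed
to strip_headers before choose_message_id is asked for its final
Message-ID.
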
=head1 THANKS

Thanks to the Linux Foundation for sponsoring the development
and testing of the v2 repository format.

=head1 COPYRIGHT

Copyright 2018-2019 all contributors L<mailto:meta@public-inbox.org>

License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>

=head1 SEE ALSO

L<gitrepository-layout(5)>, L<public-inbox-v1-format(5)>