X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=Documentation%2Fpublic-inbox-v2-format.pod;h=d87a717d40bcce2ddeade5c5e3bb7f63abba4c38;hb=e39585ee2bdcbeaab7b6bd33b3568021042d0879;hp=05ef32a9b6782cf79469d2206c20d51d4bf636bd;hpb=cf35d38e7f845393659dfce0249a76d529a2c92c;p=public-inbox.git

diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index 05ef32a9..d87a717d 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -2,7 +2,7 @@
 
 =head1 NAME
 
-public-inbox v2 repository description
+public-inbox v2 format description
 
 =head1 DESCRIPTION
 
@@ -16,21 +16,21 @@ Message-IDs.
 The key change in v2 is the inbox is no longer a bare git
 repository, but a directory with two or more git repositories.
 v2 divides git repositories by time "epochs" and Xapian
-databases for parallelism by "partitions".
+databases for parallelism by "shards".
 
 =head2 INBOX OVERVIEW AND DEFINITIONS
 
-$EPOCH - Integer starting with 0 based on time
-$SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
-$PART - Integer (0..NPROCESSORS)
+  $EPOCH - Integer starting with 0 based on time
+  $SCHEMA_VERSION - DB schema version (for Xapian)
+  $SHARD - Integer starting with 0 based on parallelism
 
-foo/                              # assuming "foo" is the name of the list
-- inbox.lock                      # lock file (flock) to protect global state
-- git/$EPOCH.git                  # normal git repositories
-- all.git                         # empty git repo, alternates to git/$EPOCH.git
-- xap$SCHEMA_VERSION/$PART        # per-partition Xapian DB
-- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
-- msgmap.sqlite3                  # same the v1 msgmap
+  foo/                              # "foo" is the name of the inbox
+  - inbox.lock                      # lock file to protect global state
+  - git/$EPOCH.git                  # normal git repositories
+  - all.git                         # empty, alternates to $EPOCH.git
+  - xap$SCHEMA_VERSION/$SHARD       # per-shard Xapian DB
+  - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP, threading
+  - msgmap.sqlite3                  # same the v1 msgmap
 
 For blob lookups, the reader only needs to open the "all.git"
 repository with $GIT_DIR/objects/info/alternates which references
@@ -95,21 +95,21 @@ are documented at:
 
 L
 
-=head2 XAPIAN PARTITIONS
+=head2 XAPIAN SHARDS
 
 Another second scalability problem in v1 was the inability to
 utilize multiple CPU cores for Xapian indexing. This is
-addressed by using partitions in Xapian to perform import
+addressed by using shards in Xapian to perform import
 indexing in parallel.
 
 As with git alternates, Xapian natively supports a read-only
 interface which transparently abstracts away the knowledge of
-multiple partitions. This allows us to simplify our read-only
+multiple shards. This allows us to simplify our read-only
 code paths.
 
 The performance of the storage device is now the bottleneck on
 larger multi-core systems. In our experience, performance is
-improves with high-quality and high-quantity solid-state storage.
+improved with high-quality and high-quantity solid-state storage.
 Issuing TRIM commands with L was necessary to
 maintain consistent performance while developing this feature.
 
@@ -117,6 +117,11 @@ Rotational storage devices are NOT recommended for indexing of
 large mail archives; but are fine for backup and usable for small
 instances.
 
+Our use of the L requires Xapian document IDs to
+remain stable. Using L and
+L wrappers are recommended over tools
+provided by Xapian.
+
 =head2 OVERVIEW DB
 
 Towards the end of v2 development, it became apparent Xapian did
@@ -130,10 +135,10 @@ The overview DB maintains all the header information necessary to
 implement the NNTP OVER/XOVER commands and non-search endpoints of
 of the PSGI UI.
 
-In the future, Xapian will become completely optional for v2 (as
-it is for v1) as SQLite turns out to be powerful enough to
-maintain overview information. Most of the PSGI and all of the
-NNTP functionality will be possible with only SQLite in addition
+Xapian has become completely optional for v2 (as it is for v1), but
+SQLite remains required for v2. SQLite turns out to be powerful
+enough to maintain overview information. Most of the PSGI and all
+of the NNTP functionality is possible with only SQLite in addition
 to git.
 
 The overview DB was an instrumental piece in maintaining near
@@ -163,7 +168,7 @@ easier. object_id and Message-ID are already known.
 =item object_id
 
 The blob identifier git uses (currently SHA-1). No need to
-publically expose this outside of normal git ops (cloning) and
+publicly expose this outside of normal git ops (cloning) and
 there's no need to make this searchable. As with v1 of
 public-inbox, this is stored as part of the Xapian document so
 expensive name lookups can be avoided for document retrieval.
@@ -210,7 +215,7 @@ for all non-atomic operations.
 
 =head1 HEADERS
 
-Same handling as with v1, except the Message-ID header will will
+Same handling as with v1, except the Message-ID header will
 be generated if not provided or conflicting. "Bytes", "Lines"
 and "Content-Length" headers are stripped and not allowed, they
 can interfere with further processing.
@@ -221,11 +226,11 @@ no sense in a public archive.
 =head1 THANKS
 
 Thanks to the Linux Foundation for sponsoring the development
-and testing of the v2 repository format.
+and testing of the v2 format.
 
 =head1 COPYRIGHT
 
-Copyright 2018-2019 all contributors L
+Copyright 2018-2020 all contributors L
 
 License: AGPL-3.0+ L