X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=Documentation%2Fpublic-inbox-v2-format.pod;h=e93d7fc701d9f3191081629c6ecfab744d5d7c4c;hb=3e639ca78aa70ad6a6598bcf32d4b72696e3ebfb;hp=7dfe3296363b1c5032eca127ee13e901a364a9a8;hpb=96d4a98d1a28ec64b5abd8289ddd4177ff87ad7e;p=public-inbox.git
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index 7dfe3296..e93d7fc7 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -2,7 +2,7 @@
=head1 NAME
-public-inbox v2 repository description
+public-inbox-v2-format - structure of public inbox v2 archives
=head1 DESCRIPTION
@@ -16,21 +16,21 @@ Message-IDs.
The key change in v2 is the inbox is no longer a bare git
repository, but a directory with two or more git repositories.
v2 divides git repositories by time "epochs" and Xapian
-databases for parallelism by "partitions".
+databases for parallelism by "shards".
=head2 INBOX OVERVIEW AND DEFINITIONS
-$EPOCH - Integer starting with 0 based on time
-$SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
-$PART - Integer (0..NPROCESSORS)
+ $EPOCH - Integer starting with 0 based on time
+ $SCHEMA_VERSION - DB schema version (for Xapian)
+ $SHARD - Integer starting with 0 based on parallelism
-foo/ # assuming "foo" is the name of the list
-- inbox.lock # lock file (flock) to protect global state
-- git/$EPOCH.git # normal git repositories
-- all.git # empty git repo, alternates to git/$EPOCH.git
-- xap$SCHEMA_VERSION/$PART # per-partition Xapian DB
-- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
-- msgmap.sqlite3 # same the v1 msgmap
+ foo/ # "foo" is the name of the inbox
+ - inbox.lock # lock file to protect global state
+ - git/$EPOCH.git # normal git repositories
+ - all.git # empty, alternates to $EPOCH.git
+ - xap$SCHEMA_VERSION/$SHARD # per-shard Xapian DB
+ - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP, threading
+ - msgmap.sqlite3 # same as the v1 msgmap
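
As a rough illustration of the layout above, the following Python sketch lists the paths of a v2 inbox and the lines one might expect in all.git/objects/info/alternates. The function names, the schema version value passed in, and the relative "../../git/N.git/objects" form are assumptions made for illustration; only the directory and file names come from the overview.

```python
from pathlib import PurePosixPath


def v2_layout(inbox_dir, n_epochs, schema_version, n_shards):
    """List the paths named in the overview (hypothetical helper)."""
    root = PurePosixPath(inbox_dir)
    paths = [root / "inbox.lock", root / "all.git", root / "msgmap.sqlite3"]
    # One ordinary git repository per time-based epoch, numbered from 0.
    paths += [root / "git" / f"{epoch}.git" for epoch in range(n_epochs)]
    # One Xapian DB per shard, plus the overview DB, under xap$SCHEMA_VERSION.
    xap = root / f"xap{schema_version}"
    paths += [xap / str(shard) for shard in range(n_shards)]
    paths.append(xap / "over.sqlite3")
    return [str(p) for p in paths]


def alternates_lines(n_epochs):
    """Lines for all.git/objects/info/alternates (assumed relative form).

    git resolves relative alternates against the objects directory that
    holds the file, hence the leading "../../".
    """
    return [f"../../git/{epoch}.git/objects" for epoch in range(n_epochs)]


if __name__ == "__main__":
    for path in v2_layout("foo", n_epochs=2, schema_version=15, n_shards=3):
        print(path)
    print("\n".join(alternates_lines(2)))
```

With such an alternates file in place, a reader that opens only all.git can resolve blobs stored in any epoch.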
For blob lookups, the reader only needs to open the "all.git"
repository with $GIT_DIR/objects/info/alternates which references
@@ -95,16 +95,16 @@ are documented at:
L
-=head2 XAPIAN PARTITIONS
+=head2 XAPIAN SHARDS
A second scalability problem in v1 was the inability to
utilize multiple CPU cores for Xapian indexing. This is
-addressed by using partitions in Xapian to perform import
+addressed by using shards in Xapian to perform import
indexing in parallel.
As with git alternates, Xapian natively supports a read-only
interface which transparently abstracts away the knowledge of
-multiple partitions. This allows us to simplify our read-only
+multiple shards. This allows us to simplify our read-only
code paths.
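
The idea of sharding for parallel writes while readers see a single database can be sketched in plain Python. This is a toy model, not the Xapian API: the modulo assignment rule and every function name here are assumptions chosen for illustration, and the text above does not say how documents are mapped to shards.

```python
def shard_for(doc_num, n_shards):
    """Pick a shard for a document number (assumed rule: modulo)."""
    return doc_num % n_shards


def partition(doc_nums, n_shards):
    """Write side: split documents so each shard can be indexed by its own worker."""
    shards = [[] for _ in range(n_shards)]
    for num in doc_nums:
        shards[shard_for(num, n_shards)].append(num)
    return shards


def search_all(shards, predicate):
    """Read side: merge per-shard hits so callers never deal with shards."""
    hits = []
    for shard in shards:
        hits.extend(num for num in shard if predicate(num))
    return sorted(hits)


if __name__ == "__main__":
    shards = partition(range(1, 8), n_shards=3)
    print(shards)
    print(search_all(shards, lambda num: num % 2 == 0))
```

Only the write side needs to know the shard count; the read side takes whatever shards exist and presents one result set.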
The performance of the storage device is now the bottleneck on
@@ -113,9 +113,19 @@ improved with high-quality and high-quantity solid-state storage.
Issuing TRIM commands with L was necessary to maintain
consistent performance while developing this feature.
-Rotational storage devices are NOT recommended for indexing of
-large mail archives; but are fine for backup and usable for
-small instances.
+Rotational storage devices perform significantly worse than
+solid state storage for indexing of large mail archives; but are
+fine for backup and usable for small instances.
+
+As of public-inbox 1.6.0, the C
+option of L may be used with a high shard
+count to ensure individual shards fit into page cache when the entire
+Xapian DB cannot.
+
+Our use of the L requires Xapian document IDs to
+remain stable. Using L and
+L wrappers is recommended over tools
+provided by Xapian.
=head2 OVERVIEW DB
@@ -128,12 +138,12 @@ OVER/XOVER commands).
The overview DB maintains all the header information necessary
to implement the NNTP OVER/XOVER commands and non-search
-endpoints of of the PSGI UI.
+endpoints of the PSGI UI.
-In the future, Xapian will become completely optional for v2 (as
-it is for v1) as SQLite turns out to be powerful enough to
-maintain overview information. Most of the PSGI and all of the
-NNTP functionality will be possible with only SQLite in addition
+Xapian has become completely optional for v2 (as it is for v1), but
+SQLite remains required for v2. SQLite turns out to be powerful
+enough to maintain overview information. Most of the PSGI and all
+of the NNTP functionality is possible with only SQLite in addition
to git.
The overview DB was an instrumental piece in maintaining near
@@ -154,7 +164,7 @@ top-level of the directory.
=head1 OBJECT IDENTIFIERS
-There are three distinct type of identifiers. content_id is the
+There are three distinct types of identifiers. content_hash is the
new one for v2 and should make message removal and deduplication
easier. object_id and Message-ID are already known.
@@ -163,7 +173,7 @@ easier. object_id and Message-ID are already known.
=item object_id
The blob identifier git uses (currently SHA-1). No need to
-publically expose this outside of normal git ops (cloning) and
+publicly expose this outside of normal git ops (cloning) and
there's no need to make this searchable. As with v1 of
public-inbox, this is stored as part of the Xapian document so
expensive name lookups can be avoided for document retrieval.
@@ -174,11 +184,11 @@ The email header; duplicates allowed for archival purposes.
This remains a searchable field in Xapian. Note: it's possible
for emails to have multiple Message-ID headers (and L
had that bug for a bit); so we take all of them into account.
-In case of conflicts detected by content_id below, we generate a new
-Message-ID based on content_id; if the generated Message-ID still
+In case of conflicts detected by content_hash below, we generate a new
+Message-ID based on content_hash; if the generated Message-ID still
conflicts, a random one is generated.
-=item content_id
+=item content_hash
A hash of relevant headers and raw body content for
purging of unwanted content. This is not stored anywhere,
@@ -188,7 +198,7 @@ For now, the relevant headers are:
Subject, From, Date, References, In-Reply-To, To, Cc
-Received, List-Id, and similar headers are NOT part of content_id as
+Received, List-Id, and similar headers are NOT part of content_hash as
they differ across lists and we will want removal to be able to cross
lists.
@@ -198,7 +208,7 @@ raw body risks being broken by list signatures; but we can use
filters (e.g. PublicInbox::Filter::Vger) to clean the body for
imports.
-content_id is SHA-256 for now; but can be changed at any time
+content_hash is SHA-256 for now; but can be changed at any time
without making DB changes.
=back
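
The identifiers above can be illustrated with a minimal Python sketch. It hashes the listed headers plus the raw body with SHA-256, and falls back to a hash-derived and then a random Message-ID on conflict. This is not the actual public-inbox algorithm: the exact bytes fed to the hash, the "@example.invalid" suffix, the assumption of LF line endings, and all function names are hypothetical.

```python
import hashlib
import uuid
from email.parser import BytesParser

# Headers the text lists as relevant; Received, List-Id and similar are excluded.
RELEVANT_HEADERS = ("Subject", "From", "Date", "References",
                    "In-Reply-To", "To", "Cc")


def content_hash(raw):
    """SHA-256 over the relevant headers and the raw body (illustrative only)."""
    headers = BytesParser().parsebytes(raw, headersonly=True)
    # Assumes LF line endings; the body is everything after the first blank line.
    _, _, body = raw.partition(b"\n\n")
    digest = hashlib.sha256()
    for name in RELEVANT_HEADERS:
        for value in headers.get_all(name, []):
            line = f"{name}: {value}\n"
            digest.update(line.encode("utf-8", "surrogateescape"))
    digest.update(body)
    return digest.hexdigest()


def resolve_message_id(mid, chash, taken):
    """Return a Message-ID not in `taken`, the set of IDs used by other content."""
    if mid not in taken:
        return mid
    candidate = f"{chash}@example.invalid"  # hypothetical hash-derived form
    if candidate not in taken:
        return candidate
    return f"{uuid.uuid4().hex}@example.invalid"  # last resort: random
```

Because list-specific headers are left out of the hash, two copies of one message delivered through different lists hash identically, which is what lets removal cross lists.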
@@ -221,11 +231,11 @@ no sense in a public archive.
=head1 THANKS
Thanks to the Linux Foundation for sponsoring the development
-and testing of the v2 repository format.
+and testing of the v2 format.
=head1 COPYRIGHT
-Copyright 2018-2019 all contributors L
+Copyright 2018-2021 all contributors L
License: AGPL-3.0+ L