X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=Documentation%2Fpublic-inbox-v2-format.pod;h=e93d7fc701d9f3191081629c6ecfab744d5d7c4c;hb=0ae89efce11e1e3b10a067c61c5b4cde30fa2b3b;hp=28d3550cc3fc091b5c1978290bece59568a508f5;hpb=c477bdd8a80eecc319b680764edfb24bd12cb7b2;p=public-inbox.git

diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index 28d3550c..e93d7fc7 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -2,7 +2,7 @@
 
 =head1 NAME
 
-public-inbox v2 repository description
+public-inbox-v2-format - structure of public inbox v2 archives
 
 =head1 DESCRIPTION
 
@@ -20,17 +20,17 @@ databases for parallelism by "shards".
 
 =head2 INBOX OVERVIEW AND DEFINITIONS
 
-$EPOCH - Integer starting with 0 based on time
-$SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
-$PART - Integer (0..NPROCESSORS)
+  $EPOCH - Integer starting with 0 based on time
+  $SCHEMA_VERSION - DB schema version (for Xapian)
+  $SHARD - Integer starting with 0 based on parallelism
 
-foo/                            # assuming "foo" is the name of the list
-- inbox.lock                    # lock file (flock) to protect global state
-- git/$EPOCH.git                # normal git repositories
-- all.git                       # empty git repo, alternates to git/$EPOCH.git
-- xap$SCHEMA_VERSION/$SHARD     # per-shard Xapian DB
-- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
-- msgmap.sqlite3                # same the v1 msgmap
+  foo/                            # "foo" is the name of the inbox
+  - inbox.lock                    # lock file to protect global state
+  - git/$EPOCH.git                # normal git repositories
+  - all.git                       # empty, alternates to $EPOCH.git
+  - xap$SCHEMA_VERSION/$SHARD     # per-shard Xapian DB
+  - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP, threading
+  - msgmap.sqlite3                # same the v1 msgmap
 
 For blob lookups, the reader only needs to open the "all.git"
 repository with $GIT_DIR/objects/info/alternates which references
@@ -113,9 +113,14 @@ improved with high-quality and high-quantity solid-state
 storage. Issuing TRIM commands with L was necessary to
 maintain consistent performance while developing this feature.
 
-Rotational storage devices are NOT recommended for indexing of
-large mail archives; but are fine for backup and usable for
-small instances.
+Rotational storage devices perform significantly worse than
+solid state storage for indexing of large mail archives; but are
+fine for backup and usable for small instances.
+
+As of public-inbox 1.6.0, the C
+option of L may be used with a high shard
+count to ensure individual shards fit into page cache when the entire
+Xapian DB cannot.
 
 Our use of the L requires Xapian document IDs
 to remain stable. Using L and
@@ -133,7 +138,7 @@ OVER/XOVER commands).
 
 The overview DB maintains all the header information necessary
 to implement the NNTP OVER/XOVER commands and non-search
-endpoints of of the PSGI UI.
+endpoints of the PSGI UI.
 
 Xapian has become completely optional for v2 (as it is for v1), but
 SQLite remains required for v2. SQLite turns out to be powerful
@@ -159,7 +164,7 @@ top-level of the directory.
 
 =head1 OBJECT IDENTIFIERS
 
-There are three distinct type of identifiers. content_id is the
+There are three distinct type of identifiers. content_hash is the
 new one for v2 and should make message removal and deduplication
 easier. object_id and Message-ID are already known.
 
@@ -168,7 +173,7 @@ easier. object_id and Message-ID are already known.
 =item object_id
 
 The blob identifier git uses (currently SHA-1). No need to
-publically expose this outside of normal git ops (cloning) and
+publicly expose this outside of normal git ops (cloning) and
 there's no need to make this searchable. As with v1 of
 public-inbox, this is stored as part of the Xapian document so
 expensive name lookups can be avoided for document retrieval.
@@ -179,11 +184,11 @@ The email header; duplicates allowed for archival purposes.
 This remains a searchable field in Xapian.
 Note: it's possible for emails to have multiple Message-ID headers
 (and L had that bug for a bit); so we take all of them into account.
-In case of conflicts detected by content_id below, we generate a new
-Message-ID based on content_id; if the generated Message-ID still
+In case of conflicts detected by content_hash below, we generate a new
+Message-ID based on content_hash; if the generated Message-ID still
 conflicts, a random one is generated.
 
-=item content_id
+=item content_hash
 
 A hash of relevant headers and raw body content for
 purging of unwanted content. This is not stored anywhere,
@@ -193,7 +198,7 @@ For now, the relevant headers are:
 
   Subject, From, Date, References, In-Reply-To, To, Cc
 
-Received, List-Id, and similar headers are NOT part of content_id as
+Received, List-Id, and similar headers are NOT part of content_hash as
 they differ across lists and we will want removal to be able
 to cross lists.
 
@@ -203,7 +208,7 @@ raw body risks being broken by list signatures; but we can use
 filters (e.g. PublicInbox::Filter::Vger) to clean the body for
 imports.
 
-content_id is SHA-256 for now; but can be changed at any time
+content_hash is SHA-256 for now; but can be changed at any time
 without making DB changes.
 
 =back
@@ -226,11 +231,11 @@ no sense in a public archive.
 
 =head1 THANKS
 
 Thanks to the Linux Foundation for sponsoring the development
-and testing of the v2 repository format.
+and testing of the v2 format.
 
 =head1 COPYRIGHT
 
-Copyright 2018-2019 all contributors L
+Copyright 2018-2021 all contributors L
 
 License: AGPL-3.0+ L