X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=Documentation%2Fpublic-inbox-v2-format.pod;h=e93d7fc701d9f3191081629c6ecfab744d5d7c4c;hb=0b73ad048e715065efc3ed3eb1c376e945957693;hp=730f663381069df0b31957a7771f248349ec29c5;hpb=95bdac7f09c69036efed537a4d03d5bdd2ae4eb6;p=public-inbox.git

diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index 730f6633..e93d7fc7 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -2,7 +2,7 @@
 
 =head1 NAME
 
-public-inbox v2 repository description
+public-inbox-v2-format - structure of public inbox v2 archives
 
 =head1 DESCRIPTION
 
@@ -113,9 +113,14 @@ improved with high-quality and high-quantity solid-state storage.
 Issuing TRIM commands with L was necessary to maintain
 consistent performance while developing this feature.
 
-Rotational storage devices are NOT recommended for indexing of
-large mail archives; but are fine for backup and usable for
-small instances.
+Rotational storage devices perform significantly worse than
+solid state storage for indexing of large mail archives; but are
+fine for backup and usable for small instances.
+
+As of public-inbox 1.6.0, the C
+option of L may be used with a high shard
+count to ensure individual shards fit into page cache when the entire
+Xapian DB cannot.
 
 Our use of the L requires Xapian
 document IDs to remain stable. Using L and
@@ -133,7 +138,7 @@ OVER/XOVER commands).
 
 The overview DB maintains all the header information necessary to
 implement the NNTP OVER/XOVER commands and non-search
-endpoints of of the PSGI UI.
+endpoints of the PSGI UI.
 
 Xapian has become completely optional for v2 (as it is for v1), but
 SQLite remains required for v2. SQLite turns out to be powerful
@@ -159,7 +164,7 @@ top-level of the directory.
 
 =head1 OBJECT IDENTIFIERS
 
-There are three distinct type of identifiers. content_id is the
+There are three distinct type of identifiers. content_hash is the
 new one for v2 and should make message removal and deduplication
 easier. object_id and Message-ID are already known.
 
@@ -179,11 +184,11 @@ The email header; duplicates allowed for archival purposes.
 This remains a searchable field in Xapian. Note: it's possible
 for emails to have multiple Message-ID headers (and L
 had that bug for a bit); so we take all of them into account.
-In case of conflicts detected by content_id below, we generate a new
-Message-ID based on content_id; if the generated Message-ID still
+In case of conflicts detected by content_hash below, we generate a new
+Message-ID based on content_hash; if the generated Message-ID still
 conflicts, a random one is generated.
 
-=item content_id
+=item content_hash
 
 A hash of relevant headers and raw body content for
 purging of unwanted content. This is not stored anywhere,
@@ -193,7 +198,7 @@ For now, the relevant headers are:
 
 Subject, From, Date, References, In-Reply-To, To, Cc
 
-Received, List-Id, and similar headers are NOT part of content_id as
+Received, List-Id, and similar headers are NOT part of content_hash as
 they differ across lists and we will want removal to be able to cross
 lists.
 
@@ -203,7 +208,7 @@ raw body risks being broken by list signatures; but we can use
 filters (e.g. PublicInbox::Filter::Vger) to clean the body for
 imports.
 
-content_id is SHA-256 for now; but can be changed at any time
+content_hash is SHA-256 for now; but can be changed at any time
 without making DB changes.
 
 =back
@@ -226,11 +231,11 @@ no sense in a public archive.
 =head1 THANKS
 
 Thanks to the Linux Foundation for sponsoring the development
-and testing of the v2 repository format.
+and testing of the v2 format.
 
 =head1 COPYRIGHT
 
-Copyright 2018-2020 all contributors L
+Copyright 2018-2021 all contributors L
 
 License: AGPL-3.0+ L
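
Note on content_hash: the patch only renames the identifier, but its description (a SHA-256 over a fixed set of headers plus the raw body, ignoring per-list headers such as Received and List-Id) can be illustrated with a rough sketch. This is not the public-inbox implementation. The function name `content_hash_sketch`, the NUL-separated framing, and the whitespace stripping are assumptions made for illustration only. The header list and the choice of SHA-256 come from the text of the patch.

```python
import hashlib
from email.parser import HeaderParser

# Header list as given in the documentation text of the patch.
RELEVANT_HEADERS = ("Subject", "From", "Date", "References",
                    "In-Reply-To", "To", "Cc")


def content_hash_sketch(raw_message: str) -> str:
    """Hash the relevant headers and the raw body of one message.

    Hypothetical helper: the byte framing used here (lowercased header
    name, NUL, stripped value, NUL) is an assumption, not the format
    public-inbox uses.
    """
    # HeaderParser parses headers only and leaves the body as one string.
    msg = HeaderParser().parsestr(raw_message)
    digest = hashlib.sha256()
    for name in RELEVANT_HEADERS:
        # A header may appear more than once; feed every occurrence.
        for value in msg.get_all(name, []):
            framed = f"{name.lower()}\0{str(value).strip()}\0"
            digest.update(framed.encode("utf-8"))
    # Received, List-Id and similar headers are never fed to the digest,
    # so copies of a message from different lists hash identically.
    digest.update(msg.get_payload().encode("utf-8"))
    return digest.hexdigest()
```

Under these assumptions, two copies of a message that differ only in Received or List-Id produce the same digest, while a change to Subject or to the body produces a different one. That matches the stated goal of letting removal cross lists.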