X-Git-Url: http://www.git.stargrave.org/?a=blobdiff_plain;f=Documentation%2Fpublic-inbox-v2-format.pod;h=86a9b8f224905fa3ae89613e880c71c8dc130afd;hb=e61ade9e03e754b5bde70518223b1e9d92ab57e4;hp=d87a717d40bcce2ddeade5c5e3bb7f63abba4c38;hpb=d87053bf6cae0258125e84e1531d5f1206c53288;p=public-inbox.git

diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index d87a717d..86a9b8f2 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -113,9 +113,14 @@ improved with high-quality and high-quantity solid-state storage.
 Issuing TRIM commands with L was necessary to maintain
 consistent performance while developing this feature.
 
-Rotational storage devices are NOT recommended for indexing of
-large mail archives; but are fine for backup and usable for
-small instances.
+Rotational storage devices perform significantly worse than
+solid state storage for indexing of large mail archives; but are
+fine for backup and usable for small instances.
+
+As of public-inbox 1.6.0, the C
+option of L may be used with a high shard
+count to ensure individual shards fit into page cache when the entire
+Xapian DB cannot.
 
 Our use of the L requires Xapian document IDs
 to remain stable. Using L and
@@ -159,7 +164,7 @@ top-level of the directory.
 
 =head1 OBJECT IDENTIFIERS
 
-There are three distinct type of identifiers. content_id is the
+There are three distinct type of identifiers. content_hash is the
 new one for v2 and should make message removal and deduplication
 easier. object_id and Message-ID are already known.
 
@@ -179,11 +184,11 @@ The email header; duplicates allowed for archival purposes.
 This remains a searchable field in Xapian. Note: it's possible
 for emails to have multiple Message-ID headers (and L
 had that bug for a bit); so we take all of them into account.
-In case of conflicts detected by content_id below, we generate a new
-Message-ID based on content_id; if the generated Message-ID still
+In case of conflicts detected by content_hash below, we generate a new
+Message-ID based on content_hash; if the generated Message-ID still
 conflicts, a random one is generated.
 
-=item content_id
+=item content_hash
 
 A hash of relevant headers and raw body content for
 purging of unwanted content. This is not stored anywhere,
@@ -193,7 +198,7 @@ For now, the relevant headers are:
 
 	Subject, From, Date, References, In-Reply-To, To, Cc
 
-Received, List-Id, and similar headers are NOT part of content_id as
+Received, List-Id, and similar headers are NOT part of content_hash as
 they differ across lists and we will want removal to be able
 to cross lists.
 
@@ -203,7 +208,7 @@ raw body risks being broken by list signatures; but we can use
 filters (e.g. PublicInbox::Filter::Vger) to clean the body for
 imports.
 
-content_hash is SHA-256 for now; but can be changed at any time
+content_hash is SHA-256 for now; but can be changed at any time
 without making DB changes.
 
 =back
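
The content_hash item above describes a SHA-256 over a fixed set of headers plus the raw body, with list-specific headers such as Received and List-Id left out so that the same message matches across lists. A minimal Python sketch of that idea follows. It is not public-inbox's implementation (which is written in Perl): the function name `content_hash`, the `RELEVANT_HEADERS` tuple, and the byte layout fed to the digest (lowercased name, NUL, value, NUL) are assumptions made only for illustration.

```python
import hashlib
from email import message_from_bytes

# Headers the documentation lists as relevant "for now".
RELEVANT_HEADERS = (
    "Subject", "From", "Date", "References", "In-Reply-To", "To", "Cc",
)


def content_hash(raw: bytes) -> str:
    """Return a hex SHA-256 over the relevant headers and the raw body.

    Hypothetical helper: the framing of each header inside the digest is
    an assumption of this sketch, not the format public-inbox uses.
    """
    raw = raw.replace(b"\r\n", b"\n")           # normalize line endings
    head, _, body = raw.partition(b"\n\n")      # split headers from body
    msg = message_from_bytes(head + b"\n\n")    # parse the header block only

    digest = hashlib.sha256()
    for name in RELEVANT_HEADERS:
        # A header may appear more than once; include every occurrence.
        for value in msg.get_all(name, []):
            digest.update(name.lower().encode("ascii") + b"\0")
            digest.update(str(value).encode("utf-8", "replace") + b"\0")
    digest.update(body)                         # body is hashed as-is
    return digest.hexdigest()
```

Because Received and List-Id never reach the digest, two copies of one message delivered through different lists produce the same value, which is what allows removal to cross lists. Any body cleanup (for example by a filter that strips list signatures) would have to happen before this function is called.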