=head1 NAME
-public-inbox v2 repository description
+public-inbox-v2-format - structure of public inbox v2 archives
=head1 DESCRIPTION
Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
consistent performance while developing this feature.
-Rotational storage devices are NOT recommended for indexing of
-large mail archives; but are fine for backup and usable for
-small instances.
+Rotational storage devices perform significantly worse than
+solid state storage for indexing of large mail archives; but are
+fine for backup and usable for small instances.
+
+As of public-inbox 1.6.0, the C<publicInbox.indexSequentialShard>
+option of L<public-inbox-index(1)> may be used with a high shard
+count to ensure individual shards fit into page cache when the entire
+Xapian DB cannot.
Our use of the L</OVERVIEW DB> requires Xapian document IDs to
remain stable. Using L<public-inbox-compact(1)> and
The overview DB maintains all the header information necessary
to implement the NNTP OVER/XOVER commands and non-search
-endpoints of of the PSGI UI.
+endpoints of the PSGI UI.
Xapian has become completely optional for v2 (as it is for v1), but
SQLite remains required for v2. SQLite turns out to be powerful
=head1 OBJECT IDENTIFIERS
-There are three distinct type of identifiers. content_id is the
+There are three distinct types of identifiers. content_hash is the
new one for v2 and should make message removal and deduplication
easier. object_id and Message-ID are already known.
This remains a searchable field in Xapian. Note: it's possible
for emails to have multiple Message-ID headers (and L<git-send-email(1)>
had that bug for a bit); so we take all of them into account.
-In case of conflicts detected by content_id below, we generate a new
-Message-ID based on content_id; if the generated Message-ID still
+In case of conflicts detected by content_hash below, we generate a new
+Message-ID based on content_hash; if the generated Message-ID still
conflicts, a random one is generated.
-=item content_id
+=item content_hash
A hash of relevant headers and raw body content for
purging of unwanted content. This is not stored anywhere,
Subject, From, Date, References, In-Reply-To, To, Cc
-Received, List-Id, and similar headers are NOT part of content_id as
+Received, List-Id, and similar headers are NOT part of content_hash as
they differ across lists and we will want removal to be able to cross
lists.
filters (e.g. PublicInbox::Filter::Vger) to clean the body for
imports.
-content_id is SHA-256 for now; but can be changed at any time
+content_hash is SHA-256 for now; but can be changed at any time
without making DB changes.
=back
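
The content_hash description above can be illustrated with a minimal
sketch. It is not the public-inbox implementation: the function name
C<content_hash_sketch>, the NUL-separated framing, and the CRLF
normalization are assumptions made for illustration. Only the header
list and the choice of SHA-256 come from the text.

```python
import hashlib
from email import message_from_bytes

# Headers the text lists as part of content_hash.
HASHED_HEADERS = ("Subject", "From", "Date", "References",
                  "In-Reply-To", "To", "Cc")


def content_hash_sketch(raw: bytes) -> str:
    """Return a hex SHA-256 over the listed headers and the raw body.

    The byte framing here is hypothetical; the real implementation
    may serialize the fields differently.
    """
    # Assumption: normalize CRLF so transport line endings do not matter.
    raw = raw.replace(b"\r\n", b"\n")
    # The body is everything after the first blank line.
    _head, _sep, body = raw.partition(b"\n\n")
    msg = message_from_bytes(raw)
    digest = hashlib.sha256()
    for name in HASHED_HEADERS:
        # get_all() returns every occurrence of a repeated header.
        for value in msg.get_all(name, []):
            digest.update(name.lower().encode("ascii") + b"\0")
            digest.update(str(value).encode("utf-8", "replace") + b"\0")
    # Received, List-Id and similar headers are deliberately never read.
    digest.update(b"\0body\0" + body)
    return digest.hexdigest()
```

Two copies of one message delivered through different lists differ only
in headers such as Received and List-Id, so under this sketch they hash
to the same value, which is what lets removal work across lists.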
=head1 THANKS
Thanks to the Linux Foundation for sponsoring the development
-and testing of the v2 repository format.
+and testing of the v2 format.
=head1 COPYRIGHT