Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:56 +0000 (09:57 +0000)]
www: cleanup expensive fallback for legacy URLs
Back in the day, we compressed long Message-IDs to SHA-1
hexdigests for the URL. Those URLs now get a 301 redirect, in
the hope that we can remove these checks some day to reduce
overhead.
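A minimal Python sketch of the idea, not the project's actual code: it
assumes the legacy URL component is the 40-character SHA-1 hexdigest of
the Message-ID and that the canonical URL is "/<Message-ID>/". The
names legacy_key, resolve_legacy and known_mids are hypothetical.

```python
import hashlib
import re


def legacy_key(message_id):
    """SHA-1 hexdigest of a Message-ID, as a legacy URL may have carried it."""
    return hashlib.sha1(message_id.encode("utf-8")).hexdigest()


def resolve_legacy(path_component, known_mids):
    """Return (status, location) for one URL path component.

    known_mids is a hypothetical iterable of Message-IDs in the inbox.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", path_component):
        return (404, None)
    # The expensive part: every candidate has to be hashed and compared.
    for mid in known_mids:
        if legacy_key(mid) == path_component:
            return (301, "/" + mid + "/")
    return (404, None)
```

The scan over candidates is what makes the fallback costly, which is why
a permanent redirect (so clients stop asking) is attractive.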
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:52 +0000 (09:57 +0000)]
search: get rid of most lookup_* subroutines
Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.
For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:50 +0000 (09:57 +0000)]
v2writable: support purging messages from git entirely
Purging existing messages is fairly straightforward since we can
take advantage of Xapian and lookup the git object_id with it.
Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).
Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:45 +0000 (09:57 +0000)]
v2writable: append, instead of prepending generated Message-ID
The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox. So whatever Message-ID we use
to deduplicate internally will be secondary and less important.
All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:44 +0000 (09:57 +0000)]
www: remove unnecessary ghost checks
We do not need to care about ghosts at multiple call sites; they
cannot have a {blob} field and we've stored the blob field in
Xapian since SCHEMA_VERSION=13.
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 21:27:00 +0000 (21:27 +0000)]
githttpbackend: avoid infinite loop on generic PSGI servers
We must detect EOF when reading a POST body with standard PSGI servers.
This does not affect deployments using the standard public-inbox-httpd;
but most smaller inboxes should be able to get away using a generic
PSGI server.
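The general pattern, sketched in Python rather than the project's Perl:
a read loop over a request body must treat an empty read as EOF, or it
spins forever. The function name copy_body and the buffer size are
assumptions for illustration only.

```python
import io


def copy_body(src, dst, bufsiz=8192):
    """Copy a request body from src to dst until EOF; return bytes copied."""
    total = 0
    while True:
        buf = src.read(bufsiz)
        if not buf:
            # An empty read means EOF: stop instead of looping forever.
            break
        dst.write(buf)
        total += len(buf)
    return total
```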
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:31:44 +0000 (20:31 +0000)]
www: support cloning individual v2 git partitions
This will require multiple client invocations, but should reduce
load on the server and make it easier for readers to only clone
the latest data.
Unfortunately, supporting a cloneurl file for externally-hosted
repos will be more difficult as we cannot easily know if the
clones use v1 or v2 repositories, or how many git partitions
they have.
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:24:21 +0000 (20:24 +0000)]
githttpbackend: avoid infinite loop on generic PSGI servers
We must detect EOF when reading a POST body with standard PSGI servers.
This does not affect deployments using the standard public-inbox-httpd;
but most smaller inboxes should be able to get away using a generic
PSGI server.
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 18:00:41 +0000 (18:00 +0000)]
feed: fix new.html for v2
I forgot that this endpoint is still accessible (even if not linked).
This also simplifies new.html all around and removes some unused
clutter from the old days while we're at it.
Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.
We can no longer rely on tree name lookups for v2. This also
optimizes v1 by relying on git blob object_id lookups while
avoiding process spawning overhead for "git log".
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:39:30 +0000 (03:39 +0000)]
v2writable: add NNTP article number regeneration support
Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian. Article
numbers will not remain consistent when we add purge support,
though.
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 09:04:50 +0000 (09:04 +0000)]
v2writable: support reindexing Xapian
This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 01:52:58 +0000 (01:52 +0000)]
use both Date: and Received: times
We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.
This also gives us determinism for commit times in most cases,
as we'll use the Received: timestamp there, as well.
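A rough Python sketch of the two sort keys described above, using only
the standard library. It assumes every message has a parseable Date:
header and that the timestamp in a Received: header follows its last
";". The helper names date_ts and received_ts are hypothetical.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def date_ts(msg):
    """Timestamp from the Date: header (used within a thread)."""
    return parsedate_to_datetime(msg["Date"]).timestamp()


def received_ts(msg):
    """Newest timestamp in any Received: header, else the Date: value."""
    stamps = []
    for hdr in msg.get_all("Received", []):
        # The timestamp follows the last ';' in a Received: header.
        _, sep, when = hdr.rpartition(";")
        if not sep:
            continue
        try:
            stamps.append(parsedate_to_datetime(when.strip()).timestamp())
        except (TypeError, ValueError):
            pass  # unparseable date: ignore this header
    return max(stamps) if stamps else date_ts(msg)


future = message_from_string(
    "Date: Fri, 01 Jan 2038 00:00:00 +0000\n"
    "Received: from a by b; Wed, 21 Mar 2018 00:10:00 +0000\n"
    "Subject: future\n\nbody\n"
)
sane = message_from_string(
    "Date: Wed, 21 Mar 2018 00:30:00 +0000\n"
    "Received: from c by d; Wed, 21 Mar 2018 00:31:00 +0000\n"
    "Subject: sane\n\nbody\n"
)
by_date = [m["Subject"] for m in sorted([future, sane], key=date_ts)]
by_received = [m["Subject"] for m in sorted([future, sane], key=received_ts)]
```

Sorting by Date: puts the message with the wrong clock last (newest),
while sorting by Received: places it where the mail server actually saw
it.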
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 23:24:50 +0000 (23:24 +0000)]
content_id: do not take Message-Id into account
If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator. This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
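To illustrate, here is a Python sketch of a content digest that ignores
Message-Id. The header selection in CONTENT_HEADERS and the choice of
SHA-256 are assumptions made for the example; the commit does not state
which headers or digest the real content_id uses.

```python
import hashlib
from email import message_from_string

# Hypothetical header selection; Message-Id is deliberately absent.
CONTENT_HEADERS = ("Subject", "From", "Date", "To", "Cc")


def content_id(raw):
    """Digest of selected headers plus body, ignoring Message-Id."""
    msg = message_from_string(raw)
    h = hashlib.sha256()
    for name in CONTENT_HEADERS:
        for value in msg.get_all(name, []):
            h.update(f"{name}\0{value}\0".encode("utf-8"))
    # Assumes a simple, non-multipart message whose payload is a string.
    h.update(b"\0b\0" + msg.get_payload().encode("utf-8"))
    return h.digest()
```

Two copies of a message that differ only in a regenerated Message-Id
then compare equal, so a restarted -watch does not import the same
content again.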
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:59 +0000 (08:14 +0000)]
v2writable: remove "resent" message for duplicate Message-IDs
public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy. This message was only useful
during initial development and imports.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:56 +0000 (08:14 +0000)]
v2writable: allow disabling parallelization
While parallel processes improve import speed for initial
imports, they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. Disabling
them saves some memory for daily use and even helps improve
readability of some subroutines by showing which methods they
call remotely.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:46 +0000 (08:14 +0000)]
import: force Message-ID generation for v1 here
This allows us to share code for generating Message-IDs
between v1 and v2 repos.
For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:
The workaround for this would be to reuse the bad message from
the archive itself.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:43 +0000 (08:14 +0000)]
import: implement barrier operation for v1 repos
This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg. This also ensures dumb HTTP
clients can see changes to V2 repos immediately.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:42 +0000 (08:14 +0000)]
import: (v2): write deletes to a separate '_' subdirectory
In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree. This keeps the
top-level tree small and more amenable to deltafication.
This helps the common case where "m" is the most commonly
changed file at the top level.
Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:41 +0000 (08:14 +0000)]
import: (v2) delete writes the blob into history in subdir
This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.
This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs). It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:34 +0000 (08:14 +0000)]
v2writable: support "barrier" operation to avoid reforking
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 07:51:09 +0000 (07:51 +0000)]
extmsg: rework partial MID matching to favor current inbox
The current inbox is more important for partial Message-ID
matching, so we try harder on that to fix common errors before
moving onto other inboxes. Then, prevent expensive scanning of
other inboxes by requiring a Message-ID length of at least 16
bytes.
Finally, we limit the overall partial responses to 200 when
scanning other inboxes to avoid excessive memory usage.
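The policy can be sketched in Python as follows. Only the two limits
(16 bytes, 200 responses) come from the commit message; plain substring
matching, the function name partial_matches and the list-based inboxes
are assumptions standing in for the real lookup.

```python
MIN_PARTIAL_LEN = 16  # bytes, from the commit message
MAX_PARTIAL = 200     # overall response cap, from the commit message


def partial_matches(fragment, current_inbox, other_inboxes):
    """Return Message-IDs containing fragment, favoring the current inbox."""
    found = [mid for mid in current_inbox if fragment in mid]
    if found:
        return found
    if len(fragment.encode("utf-8")) < MIN_PARTIAL_LEN:
        return []  # too short to justify scanning other inboxes
    for inbox in other_inboxes:
        for mid in inbox:
            if fragment in mid:
                found.append(mid)
                if len(found) >= MAX_PARTIAL:
                    return found  # cap memory use
    return found
```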
Eric Wong [Wed, 7 Mar 2018 19:05:20 +0000 (19:05 +0000)]
nntp: do not drain rbuf if there is a command pending
Some clients pipeline requests aggressively (enough to match
LINE_MAX) and we should not read from the client socket until we
know there's no pending command in our read buffer.
Eric Wong [Wed, 7 Mar 2018 09:46:46 +0000 (09:46 +0000)]
nntp: improve fairness during XOVER and similar commands
For other commands generating long responses, we generally want
to yield to another client after emitting 100 lines. However,
XOVER-based responses already query 200 lines worth of responses
at a time, so we were sending 20000 lines before yielding to
other clients. This may help avoid timeouts for some clients.
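A toy Python model of the fairness problem, not the server's real event
loop: each client gets a turn of at most lines_per_slice lines before
the next client is served. The function serve and its arguments are
hypothetical.

```python
from collections import deque


def serve(clients, lines_per_slice=100):
    """Round-robin over clients, sending at most lines_per_slice per turn.

    clients maps a name to a list of response lines. Returns the
    (name, line) pairs in the order they were sent.
    """
    queue = deque((name, iter(lines)) for name, lines in clients.items())
    sent = []
    while queue:
        name, it = queue.popleft()
        for _ in range(lines_per_slice):
            line = next(it, None)
            if line is None:
                break  # response finished; drop this client
            sent.append((name, line))
        else:
            queue.append((name, it))  # slice used up: yield to the next client
    return sent
```

With a slice of 100 lines, a client waiting behind a 250-line response
is served after 100 lines; with an effective slice of 100 * 200 = 20000
lines it waits for the whole response.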
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 07:28:56 +0000 (07:28 +0000)]
v2writable: detect and use previous partition count
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unreachable by search.
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 04:15:38 +0000 (04:15 +0000)]
favor Received: date over Date: header globally
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users for having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:56:15 +0000 (20:56 +0000)]
v2: avoid redundant/repeated configs for git partition repos
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:21:05 +0000 (20:21 +0000)]
searchidx: store the primary MID in doc data for NNTP
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 18:47:47 +0000 (18:47 +0000)]
nntp: use NNTP article numbers for lookups
Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups. This will prevent us
from finding the "wrong" internal Message-ID.
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 17:57:57 +0000 (17:57 +0000)]
mid: truncate excessively long MIDs early
Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out. It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 05:55:26 +0000 (05:55 +0000)]
searchidx: avoid excessive XNQ indexing with diffs
When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface. This results in better indexing performance and
10-15% smaller Xapian indices.
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 05:14:33 +0000 (05:14 +0000)]
mid: be strict with References, but loose on Message-Id
Traditionally we've been more lax on parsing Message-Id
and allow it without the angle brackets. We've always been
strict on References and can't have it be pointlessly
large when some MUA decides to use HTML-escaped angle
brackets ("&lt;", "&gt;").
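A small Python sketch of the distinction, with hypothetical function
names and a regular expression chosen for the example (the real parser
is not shown in this log): Message-Id is accepted with or without angle
brackets, while References only yields IDs inside literal brackets.

```python
import re

# One Message-ID inside literal angle brackets, no whitespace inside.
MID_IN_BRACKETS = re.compile(r"<([^<>\s]+)>")


def mid_from_message_id(value):
    """Lenient: accept a Message-Id with or without angle brackets."""
    m = MID_IN_BRACKETS.search(value)
    return m.group(1) if m else value.strip()


def mids_from_references(value):
    """Strict: only IDs enclosed in literal angle brackets count."""
    return MID_IN_BRACKETS.findall(value)
```

A References header full of HTML-escaped brackets then yields nothing,
instead of one pointlessly large bogus ID.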
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 20:46:55 +0000 (20:46 +0000)]
search: revert to using 'Q' as a uniQue id per-Xapian conventions
'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 19:32:19 +0000 (19:32 +0000)]
v2writable: inject new Message-IDs on true duplicates
Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).
Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
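The fallback order can be sketched in Python like this. Deriving the
deterministic ID from a SHA-1 of the raw message, the "example.invalid"
host part and the name generated_mid are all assumptions for
illustration; the commit only says that a deterministic ID is tried
first and a random one is used if that fails.

```python
import hashlib
import secrets


def generated_mid(raw, seen, host="example.invalid"):
    """Pick an unused Message-ID for raw message bytes.

    seen is the set of Message-IDs already taken.
    """
    # First choice: deterministic, so re-imports produce the same ID.
    candidate = f"{hashlib.sha1(raw).hexdigest()}@{host}"
    if candidate not in seen:
        return candidate
    # The deterministic ID is taken (possibly on purpose): go random.
    while True:
        candidate = f"{secrets.token_hex(20)}@{host}"
        if candidate not in seen:
            return candidate
```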
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 18:27:54 +0000 (18:27 +0000)]
content_id: no need to be human-friendly
We merely use this for internal comparisons and do not store
this in Xapian. So using a shorter, non-human readable digest
is enough. Furthermore, introduce "content_digest" which
returns the Digest::SHA object for extra changes.