Eric Wong (Contractor, The Linux Foundation) [Wed, 4 Apr 2018 21:24:59 +0000 (21:24 +0000)]
v2: support incremental indexing + purge
This is important for people running mirrors via "git fetch",
as they need to be kept up-to-date. Purging is also now
supported in mirrors.
The short-lived "--regenerate" option is gone and is now
implicitly enabled as a result. It's still cheap when
article number regeneration is unnecessary, as we track
the range for each git repository.
Eric Wong (Contractor, The Linux Foundation) [Wed, 4 Apr 2018 21:11:47 +0000 (21:11 +0000)]
searchidx: ensure duplicated Message-IDs can be linked together
This allows us to emulate the display of thread-aware MUAs when
multiple messages share the same Message-ID. This also is a
place where "public-inbox-index --reindex" is useful to fix
existing messages and no schema version bump is necessary.
Eric Wong (Contractor, The Linux Foundation) [Tue, 3 Apr 2018 11:09:12 +0000 (11:09 +0000)]
nntp: simplify the long_response API
We worked around the default range/termination conditions of
long_response in many cases to reduce calls to SQLite or Xapian.
So continue that trend and become more like the PSGI API
which doesn't force callers to specify an article range or
work inside a loop.
Eric Wong (Contractor, The Linux Foundation) [Tue, 3 Apr 2018 11:09:11 +0000 (11:09 +0000)]
msgmap: replace id_batch with ids_after
id_batch had an overly complicated interface; replace it
with ids_after, which is simpler and takes advantage of
selectcol_arrayref in DBI. This allows simplification of
callers and the diffstat agrees with me.
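The idea of fetching "all ids after N" can be sketched as follows. This is a Python analogue rather than the project's Perl code; the table name `msgmap`, the column `num`, and the `limit` parameter are assumptions made for illustration.

```python
import sqlite3

def ids_after(conn, num, limit=1000):
    """Return up to `limit` article numbers greater than `num`, ascending."""
    rows = conn.execute(
        "SELECT num FROM msgmap WHERE num > ? ORDER BY num ASC LIMIT ?",
        (num, limit),
    )
    # Flatten single-column rows, similar in spirit to selectcol_arrayref
    return [r[0] for r in rows]

def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, only for this sketch
    conn.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT)")
    conn.executemany(
        "INSERT INTO msgmap (num, mid) VALUES (?, ?)",
        [(n, "m%d@example" % n) for n in range(1, 8)],
    )
    seen, last = [], 0
    while True:
        batch = ids_after(conn, last, limit=3)
        if not batch:
            break
        seen.extend(batch)
        last = batch[-1]  # the caller only remembers the last id it saw
    return seen
```

The caller keeps a single integer of state between batches, which is what makes the interface simple.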
Eric Wong (Contractor, The Linux Foundation) [Tue, 3 Apr 2018 11:09:09 +0000 (11:09 +0000)]
view: avoid offset during pagination
OFFSET in SQLite gets painful to deal with. Instead,
rely on timestamps (from Received:) for pagination.
This also sets us up for more precise Date searching
in case we want it.
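A minimal sketch of timestamp-based pagination without OFFSET, written in Python rather than the project's Perl. The table `over`, its columns `ts` and `num`, and the use of `num` as a tie-breaker for equal timestamps are assumptions for illustration, not the actual schema or query.

```python
import sqlite3

def page_older(conn, cursor=None, limit=3):
    """Return (rows, next_cursor); rows are (ts, num), newest first."""
    if cursor is None:
        rows = conn.execute(
            "SELECT ts, num FROM over ORDER BY ts DESC, num DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        ts, num = cursor
        # Continue strictly after the last row seen instead of using OFFSET
        rows = conn.execute(
            "SELECT ts, num FROM over "
            "WHERE ts < ? OR (ts = ? AND num < ?) "
            "ORDER BY ts DESC, num DESC LIMIT ?",
            (ts, ts, num, limit),
        ).fetchall()
    next_cursor = rows[-1] if rows else None
    return rows, next_cursor

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE over (num INTEGER PRIMARY KEY, ts INTEGER)")
    conn.execute("CREATE INDEX over_ts ON over (ts, num)")
    conn.executemany(
        "INSERT INTO over (num, ts) VALUES (?, ?)",
        [(1, 100), (2, 200), (3, 200), (4, 300), (5, 400)],
    )
    nums, cursor = [], None
    while True:
        rows, cursor = page_older(conn, cursor, limit=2)
        if not rows:
            break
        nums.extend(num for _ts, num in rows)
    return nums
```

Each page costs the same regardless of how deep the reader has paged, because the query never has to skip over earlier rows.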
Eric Wong (Contractor, The Linux Foundation) [Tue, 3 Apr 2018 11:09:08 +0000 (11:09 +0000)]
nntp: make XOVER, XHDR, OVER, HDR and NEWNEWS faster
While SQLite is faster than Xapian for some queries we
use, it sucks at handling OFFSET. Fortunately, we do
not need offsets when retrieving sorted results and
can bake it into the query.
For inbox.comp.version-control.git (v1 Xapian),
XOVER and XHDR are over 20x faster.
Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:56 +0000 (00:04 +0000)]
over: speedup get_thread by avoiding JOIN
JOIN operations on SQLite can be disastrously slow.
This reduces per-message pages with the thread overview
at the bottom of those pages from over 800ms to ~60ms.
In comparison, the v1 code took around 70-80ms using
Xapian on my machine.
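One way to avoid a JOIN is to split the lookup into several small indexed queries. The sketch below is in Python, not the project's Perl, and the tables `mid2num` and `over` with columns `mid`, `num`, `tid`, and `ts` are hypothetical; it shows the general shape only and is not the actual get_thread code.

```python
import sqlite3

def get_thread_nums(conn, mid):
    """Return article numbers in the thread containing `mid`, oldest first."""
    # Step 1: Message-ID -> article number (single-table lookup)
    row = conn.execute(
        "SELECT num FROM mid2num WHERE mid = ? LIMIT 1", (mid,)
    ).fetchone()
    if row is None:
        return []
    # Step 2: article number -> thread id
    row = conn.execute(
        "SELECT tid FROM over WHERE num = ?", (row[0],)
    ).fetchone()
    if row is None:
        return []
    # Step 3: thread id -> all members, still without any JOIN
    cur = conn.execute(
        "SELECT num FROM over WHERE tid = ? ORDER BY ts ASC, num ASC",
        (row[0],),
    )
    return [r[0] for r in cur]

def demo_conn():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE over (num INTEGER PRIMARY KEY, tid INTEGER, ts INTEGER)"
    )
    conn.execute("CREATE INDEX over_tid ON over (tid)")
    conn.execute("CREATE TABLE mid2num (mid TEXT PRIMARY KEY, num INTEGER)")
    conn.executemany(
        "INSERT INTO over (num, tid, ts) VALUES (?, ?, ?)",
        [(1, 10, 100), (2, 10, 300), (3, 20, 200), (4, 10, 200)],
    )
    conn.executemany(
        "INSERT INTO mid2num (mid, num) VALUES (?, ?)",
        [("a", 1), ("b", 2), ("c", 3), ("d", 4)],
    )
    return conn
```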
Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:55 +0000 (00:04 +0000)]
www: rework query responses to avoid COUNT in SQLite
In many cases, we do not care about the total number of
messages. It's a rather expensive operation in SQLite
(Xapian only provides an estimate).
For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system. Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
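A common way to page through results without COUNT is to fetch one row more than the page size and use the extra row only as a "more results exist" flag. This Python sketch illustrates that general technique; it is not necessarily what this commit does, and the `over` table and its columns are assumptions.

```python
import sqlite3

def recent(conn, limit):
    """Return (nums, has_more) for the newest `limit` rows, without COUNT."""
    rows = conn.execute(
        "SELECT num FROM over ORDER BY ts DESC, num DESC LIMIT ?",
        (limit + 1,),  # one extra row tells us whether a next page exists
    ).fetchall()
    has_more = len(rows) > limit
    return [r[0] for r in rows[:limit]], has_more

def demo_conn():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE over (num INTEGER PRIMARY KEY, ts INTEGER)")
    conn.executemany(
        "INSERT INTO over (num, ts) VALUES (?, ?)",
        [(n, n * 100) for n in range(1, 6)],
    )
    return conn
```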
Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:53 +0000 (00:04 +0000)]
v2writable: simplify barrier vs checkpoints
searchidx_checkpoint was too convoluted and confusing.
Since barrier is mostly the same thing, use that instead
and add an fsync option for the overview DB.
Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:52 +0000 (00:04 +0000)]
replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
Eric Wong (Contractor, The Linux Foundation) [Sun, 1 Apr 2018 23:23:07 +0000 (23:23 +0000)]
v2writable: fix parallel termination
I was too aggressively disabling parallelization to speed up
the test suite and broke this :x Re-enable parallelization
for the v2reindex test so we can catch it later.
Eric Wong (Contractor, The Linux Foundation) [Sun, 1 Apr 2018 23:15:04 +0000 (23:15 +0000)]
v2: one file, really
We need to ensure there is only one file in the top-level tree
at any commit so the "add; remove; add;" sequence on the same
message is detected properly.
Otherwise, git will not detect the second "add" unless
a second message is added to history.
Deletes are now stored in "d" (and not "D" or "_/D") at the
top-level. There's no need to have a "_" to reduce churn
as "m" and "d" should never co-exist. It's now lowercased to
make it easier-to-distinguish from "D" in git-log output.
Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:48 +0000 (01:20 +0000)]
feed: optimize query for feeds, too
This is a smaller improvement than the landing /$INBOX/ page
because full message bodies are shown; but still saves around
100ms for my system with LKML.
Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:46 +0000 (01:20 +0000)]
view: drop load_results
It's no longer necessary to have this since load_expand
now populates $smsg->mid with the "preferred" Message-ID.
This saves around 10ms on the homepage for me.
Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:44 +0000 (01:20 +0000)]
v2writable: go backwards through alternate Message-IDs
This is consistent with how we internally generate new
Message-IDs to break conflicts and allows ->reindex to
succeed while walking backwards through history.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:19 +0000 (20:17 +0000)]
public-inbox-compact: new tool for driving xapian-compact
Having multiple Xapian partitions is mostly pointless after
the initial import. We can compact all the partitions into
one while keeping the skeleton separate.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 19:14:12 +0000 (19:14 +0000)]
search: retry_reopen on first_smsg_by_mid
This was causing errors while attempting to load messages via
the WWW interface while mass-importing LKML. While we're at it,
remove unnecessary eval from lookup_article.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:56 +0000 (09:57 +0000)]
www: cleanup expensive fallback for legacy URLs
Back in the day, we compressed long Message-IDs to SHA-1
hexdigests for the URL. This now redirects with a 301 in
the hopes we can remove these checks some day to reduce
overhead.
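To illustrate why the fallback is expensive, here is a hedged Python sketch of mapping a SHA-1 hexdigest back to a Message-ID. The exact normalization of the Message-ID before hashing, the function names, and the linear scan are assumptions for illustration; the real lookup may differ.

```python
import hashlib

def legacy_mid_digest(mid):
    """SHA-1 hexdigest of a Message-ID string (normalization is assumed)."""
    return hashlib.sha1(mid.encode("utf-8")).hexdigest()

def resolve_legacy(component, known_mids):
    """Return (301, mid) if `component` is the digest of a known Message-ID."""
    if len(component) != 40:  # SHA-1 hexdigests are always 40 characters
        return None
    for mid in known_mids:
        if legacy_mid_digest(mid) == component:
            # A permanent redirect lets clients update to the full URL
            return (301, mid)
    return None
```

A digest cannot be reversed, so some form of lookup is unavoidable; redirecting permanently gives clients a chance to stop using the old URLs.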
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:52 +0000 (09:57 +0000)]
search: get rid of most lookup_* subroutines
Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.
For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:50 +0000 (09:57 +0000)]
v2writable: support purging messages from git entirely
Purging existing messages is fairly straightforward since we can
take advantage of Xapian and lookup the git object_id with it.
Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).
Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:45 +0000 (09:57 +0000)]
v2writable: append, instead of prepending generated Message-ID
The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox. So whatever Message-ID we use
to deduplicate internally will be secondary and less important.
All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:44 +0000 (09:57 +0000)]
www: remove unnecessary ghost checks
We do not need to care about ghosts at multiple call sites; they
cannot have a {blob} field and we've stored the blob field in
Xapian since SCHEMA_VERSION=13.
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:31:44 +0000 (20:31 +0000)]
www: support cloning individual v2 git partitions
This will require multiple client invocations, but should reduce
load on the server and make it easier for readers to only clone
the latest data.
Unfortunately, supporting a cloneurl file for externally-hosted
repos will be more difficult as we cannot easily know if the
clones use v1 or v2 repositories, or how many git partitions
they have.
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:24:21 +0000 (20:24 +0000)]
githttpbackend: avoid infinite loop on generic PSGI servers
We must detect EOF when reading a POST body with standard PSGI servers.
This does not affect deployments using the standard public-inbox-httpd;
but most smaller inboxes should be able to get away using a generic
PSGI server.
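The EOF check can be sketched with a generic file-like input stream. This is a Python analogue, not the PSGI code itself; `read_body`, `content_length`, and `chunk_size` are names chosen for this example.

```python
import io

def read_body(stream, content_length, chunk_size=8192):
    """Read up to `content_length` bytes, stopping early on EOF."""
    chunks = []
    remaining = content_length
    while remaining > 0:
        buf = stream.read(min(chunk_size, remaining))
        if not buf:
            # EOF before the declared length arrived: without this check
            # the loop would spin forever on empty reads
            break
        chunks.append(buf)
        remaining -= len(buf)
    return b"".join(chunks)
```

For example, `read_body(io.BytesIO(b"hello"), 10)` returns the five available bytes and terminates even though ten were promised.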
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 18:00:41 +0000 (18:00 +0000)]
feed: fix new.html for v2
I forget this endpoint is still accessible (even if not linked).
This also simplifies new.html all around and removes some unused
clutter from the old days while we're at it.
Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.
We can no longer rely on tree name lookups for v2. This also
optimizes v1 by relying on git blob object_id lookups while
avoiding process spawning overhead for "git log".
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:39:30 +0000 (03:39 +0000)]
v2writable: add NNTP article number regeneration support
Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian. Article
numbers will not remain consistent when we add purge support,
though.
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 09:04:50 +0000 (09:04 +0000)]
v2writable: support reindexing Xapian
This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 01:52:58 +0000 (01:52 +0000)]
use both Date: and Received: times
We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.
This also gives us determinism for commit times in most cases,
as we'll use the Received: timestamp there, as well.
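A rough Python sketch of extracting both timestamps from a message. The helper names and the fallback to Date: when no Received: header parses are assumptions; the project's own Perl parsing may handle more edge cases.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

def date_ts(msg):
    """Timestamp from Date:, used for ordering within a thread."""
    return parsedate_to_datetime(msg["Date"]).timestamp()

def received_ts(msg):
    """Newest timestamp found in any Received: header."""
    stamps = []
    for hdr in msg.get_all("Received", []):
        # The date follows the last ';' in a Received: header
        _, sep, when = hdr.rpartition(";")
        if not sep:
            continue
        try:
            stamps.append(parsedate_to_datetime(when.strip()).timestamp())
        except (TypeError, ValueError):
            continue  # skip unparsable dates
    # Assumed fallback: use Date: when no Received: date is usable
    return max(stamps) if stamps else date_ts(msg)

RAW = (
    "Received: from a by b; Wed, 21 Mar 2018 01:00:00 +0000\n"
    "Received: from c by a; Wed, 21 Mar 2018 00:59:00 +0000\n"
    "Date: Fri, 01 Jan 2100 00:00:00 +0000\n"
    "Message-ID: <x@example>\n"
    "\n"
    "body\n"
)

def demo():
    msg = message_from_string(RAW)
    return date_ts(msg), received_ts(msg)
```

In the sample, the sender's clock claims the year 2100, but the Received: timestamp keeps the message at its real arrival time in the overall ordering.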
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 23:24:50 +0000 (23:24 +0000)]
content_id: do not take Message-Id into account
If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator. This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
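The principle can be sketched as a digest over selected headers and the body that deliberately leaves out Message-Id. The header list, the SHA-256 hash, and the byte layout below are assumptions for illustration, not the project's actual content_id definition.

```python
import hashlib
from email import message_from_string

# Hypothetical header selection; Message-Id is intentionally absent
HEADERS = ("Subject", "From", "Date", "To", "Cc")

def content_id(raw):
    """Digest of selected headers plus body, ignoring Message-Id."""
    msg = message_from_string(raw)
    h = hashlib.sha256()
    for name in HEADERS:
        for value in msg.get_all(name, []):
            h.update(("%s: %s\n" % (name, value)).encode("utf-8"))
    # Everything after the first blank line is treated as the body
    _, _, body = raw.partition("\n\n")
    h.update(b"\n")
    h.update(body.encode("utf-8"))
    return h.hexdigest()

def sample(mid, body="hello\n"):
    return (
        "From: a@example.com\n"
        "Subject: test\n"
        "Message-Id: <%s>\n"
        "\n"
        "%s" % (mid, body)
    )
```

Two copies of the same mail that differ only in a regenerated Message-Id then hash to the same value, so the second copy can be recognized as a duplicate.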
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:59 +0000 (08:14 +0000)]
v2writable: remove "resent" message for duplicate Message-IDs
public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy. This message was only useful
during initial development and imports.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:56 +0000 (08:14 +0000)]
v2writable: allow disabling parallelization
While parallel processes improve import speed for initial
imports, they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. This saves
some memory for daily use and even helps improve readability of
some subroutines by showing which methods they call remotely.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:46 +0000 (08:14 +0000)]
import: force Message-ID generation for v1 here
This allows us to share code for generating Message-IDs
between v1 and v2 repos.
For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:
The workaround for this would be to reuse the bad message from
the archive itself.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:43 +0000 (08:14 +0000)]
import: implement barrier operation for v1 repos
This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg. This also ensures dumb HTTP
clients can see changes to V2 repos immediately.
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:42 +0000 (08:14 +0000)]
import: (v2): write deletes to a separate '_' subdirectory
In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree. This keeps the
top-level tree small and more amenable to deltafication.
This helps the common case where "m" is the most commonly
changed file at the top level.
Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.