Sergey Matveev's repositories - public-inbox.git/log
public-inbox.git
6 years ago www: support cloning individual v2 git partitions
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:31:44 +0000 (20:31 +0000)]
www: support cloning individual v2 git partitions

This will require multiple client invocations, but should reduce
load on the server and make it easier for readers to only clone
the latest data.

Unfortunately, supporting a cloneurl file for externally-hosted
repos will be more difficult as we cannot easily know if the
clones use v1 or v2 repositories, or how many git partitions
they have.

6 years ago githttpbackend: avoid infinite loop on generic PSGI servers
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:24:21 +0000 (20:24 +0000)]
githttpbackend: avoid infinite loop on generic PSGI servers

We must detect EOF when reading a POST body with standard PSGI servers.
This does not affect deployments using the standard public-inbox-httpd;
but most smaller inboxes should be able to get away with using a generic
PSGI server.
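
The failure mode described above is a read loop that never notices end-of-input. A minimal Python sketch of the guard follows; the project itself is written in Perl, and `read_post_body`, its parameters, and the chunk size are illustrative assumptions rather than public-inbox code.

```python
import io


def read_post_body(stream, content_length, chunk_size=8192):
    """Read up to content_length bytes from stream, stopping early on EOF."""
    chunks = []
    remaining = content_length
    while remaining > 0:
        data = stream.read(min(chunk_size, remaining))
        if not data:
            # EOF: the client sent less than it promised.  Without this
            # check the loop would spin forever on empty reads.
            break
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)


# The declared length (100) is larger than what actually arrives.
body = read_post_body(io.BytesIO(b"0032want"), content_length=100)
```

The loop ends either when the declared length has been consumed or when a read returns nothing, whichever comes first.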

6 years ago http: fix modification of read-only value
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:09:06 +0000 (20:09 +0000)]
http: fix modification of read-only value

This fails in the rare case we get a partial send() on "\r\n"
when writing chunked HTTP responses out.

6 years ago view: depend on SearchMsg for Message-ID
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:33:18 +0000 (19:33 +0000)]
view: depend on SearchMsg for Message-ID

Since we need to handle messages with multiple and duplicate
Message-ID headers, our thread skeleton display must account
for that.

Since we have a "preferred" Message-ID in case of conflicts,
use it as the UUID in an Atom feed so readers do not get
confused by conflicts.

6 years ago searchview: remove unnecessary imports from MID module
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:25:47 +0000 (19:25 +0000)]
searchview: remove unnecessary imports from MID module

We do not need many of these, anymore.

6 years ago www: get rid of unnecessary 'inbox' name reference
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:15:10 +0000 (19:15 +0000)]
www: get rid of unnecessary 'inbox' name reference

We use the actual Inbox object everywhere else and don't
need the name of the inbox separated from the object.

6 years ago v2writable: warn on unseen deleted files
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:09:08 +0000 (19:09 +0000)]
v2writable: warn on unseen deleted files

It would be a bug to have deleted files marked but not
seen in our histories.

6 years ago searchidx: warn about vivifying multiple ghosts
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 18:19:31 +0000 (18:19 +0000)]
searchidx: warn about vivifying multiple ghosts

This should help us detect bugs sooner in case we have
space waste problems.

6 years ago view: permalink (per-message) view shows multiple messages
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 20:29:24 +0000 (20:29 +0000)]
view: permalink (per-message) view shows multiple messages

This needs tests and further refinement, but current tests pass.

6 years ago feed: fix new.html for v2
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 18:00:41 +0000 (18:00 +0000)]
feed: fix new.html for v2

I forget this endpoint is still accessible (even if not linked).
This also simplifies new.html all around and removes some unused
clutter from the old days while we're at it.

6 years ago t/psgi_v2: minimal test for Atom feed and t.mbox.gz
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 02:24:03 +0000 (02:24 +0000)]
t/psgi_v2: minimal test for Atom feed and t.mbox.gz

Some test coverage is better than none, here.

6 years ago search: reopen DB if each_smsg_by_mid fails
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 02:03:46 +0000 (02:03 +0000)]
search: reopen DB if each_smsg_by_mid fails

This gives more up-to-date data in case of failure and allows us
to avoid reopening in more places ourselves.

6 years ago www: $MESSAGE_ID/raw endpoint supports "duplicates"
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 01:54:16 +0000 (01:54 +0000)]
www: $MESSAGE_ID/raw endpoint supports "duplicates"

Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.

6 years ago import: consolidate mid prepend logic, here
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 18:21:54 +0000 (18:21 +0000)]
import: consolidate mid prepend logic, here

This also quiets down warnings from -watch when spam training
happens on messages without Message-Id.

6 years ago feed: $INBOX/new.atom endpoint supports v2 inboxes
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:48:29 +0000 (08:48 +0000)]
feed: $INBOX/new.atom endpoint supports v2 inboxes

We can no longer rely on tree name lookups for v2.  This also
optimizes v1 by relying on git blob object_id lookups while
avoiding process spawning overhead for "git log".

6 years ago v2writable: DEBUG_DIFF respects $TMPDIR
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:16:22 +0000 (08:16 +0000)]
v2writable: DEBUG_DIFF respects $TMPDIR

The File::Temp API is a bit tricky and needs TMPDIR explicitly
enabled if a template is given.

6 years ago v2writable: clarify header cleanups
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:14:19 +0000 (08:14 +0000)]
v2writable: clarify header cleanups

We want to make it clear to the code and DEBUG_DIFF users
that we do not introduce messages with unsuitable headers
into public archives.

6 years ago v2writable: add NNTP article number regeneration support
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:39:30 +0000 (03:39 +0000)]
v2writable: add NNTP article number regeneration support

Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian.  Article
numbers will not remain consistent when we add purge support,
though.

6 years ago t/altid.t: extra tests for mid_set
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:06:56 +0000 (03:06 +0000)]
t/altid.t: extra tests for mid_set

I'll be relying on some of this behavior for regenerating NNTP
article numbers off fresh clones.

6 years ago v2writable: support reindexing Xapian
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 09:04:50 +0000 (09:04 +0000)]
v2writable: support reindexing Xapian

This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.

6 years ago fix syntax warnings
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 00:12:25 +0000 (00:12 +0000)]
fix syntax warnings

I keep forgetting to run "make syntax"

6 years ago msgmap: add tmp_clone to create an anonymous copy
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 08:29:25 +0000 (08:29 +0000)]
msgmap: add tmp_clone to create an anonymous copy

This will be used to keep track of Message-ID <-> NNTP Article
numbers to prevent article number reuse when reindexing.

6 years ago use both Date: and Received: times
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 01:52:58 +0000 (01:52 +0000)]
use both Date: and Received: times

We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.

This also gives us determinism for commit times in most cases,
as we'll use the Received: timestamp there, as well.
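
The two sort keys can be sketched in Python with the standard `email` package. This is only an illustration of the idea, not the project's Perl code: the helper names, the sample message, and the choice to treat zone-less dates as UTC are all assumptions.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def _parse(value):
    """Parse an RFC 5322 date string; return None if it is unusable."""
    try:
        dt = parsedate_to_datetime(value.strip())
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        # Assumption: treat zone-less dates as UTC so they stay comparable.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def received_time(msg):
    """Newest timestamp found in any Received: header, or None."""
    times = []
    for hdr in msg.get_all("Received", []):
        # The timestamp follows the last ';' in a Received: header.
        _, sep, date = hdr.rpartition(";")
        if sep:
            dt = _parse(date)
            if dt is not None:
                times.append(dt)
    return max(times) if times else None


def date_time(msg):
    """Timestamp from the Date: header, or None."""
    hdr = msg.get("Date")
    return _parse(hdr) if hdr else None


raw = (
    "Received: from mx.example by archive.example; Wed, 21 Mar 2018 01:52:58 +0000\n"
    "Received: from laptop.example by mx.example; Wed, 21 Mar 2018 01:52:50 +0000\n"
    "Date: Fri, 01 Jan 2038 00:00:00 +0000\n"
    "Subject: from the future\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
overall_key = received_time(msg) or date_time(msg)  # landing-page order
thread_key = date_time(msg)                         # order within a thread
```

Here the sender's clock claims 2038, but the overall key comes from the Received: headers, so the message sorts with the rest of March 2018.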

6 years ago InboxWritable: add mbox/maildir parsing + import logic
Eric Wong (Contractor, The Linux Foundation) [Tue, 20 Mar 2018 21:00:00 +0000 (21:00 +0000)]
InboxWritable: add mbox/maildir parsing + import logic

This will make it easier to as well as supporting future
Filter API users.  It allows simplifying our ad-hoc
import_vger_from_mbox script.

6 years ago import: discard all the same headers as MDA
Eric Wong (Contractor, The Linux Foundation) [Tue, 20 Mar 2018 19:50:03 +0000 (19:50 +0000)]
import: discard all the same headers as MDA

Reduce the places where we have duplicate logic for discarding
unwanted headers.

6 years ago introduce InboxWritable class
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 20:49:30 +0000 (20:49 +0000)]
introduce InboxWritable class

This code will be shared with future mass-import tools.

6 years ago content_id: do not take Message-Id into account
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 23:24:50 +0000 (23:24 +0000)]
content_id: do not take Message-Id into account

If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator.  This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.

6 years ago v2writable: remove "resent" message for duplicate Message-IDs
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:59 +0000 (08:14 +0000)]
v2writable: remove "resent" message for duplicate Message-IDs

public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy.  This message was only useful
during initial development and imports.

6 years ago v2writable: add DEBUG_DIFF env support
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:58 +0000 (08:14 +0000)]
v2writable: add DEBUG_DIFF env support

This can help us track down some differences during import,
if needed.

6 years ago scripts/import_vger_from_mbox: filter out same headers as MDA
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:57 +0000 (08:14 +0000)]
scripts/import_vger_from_mbox: filter out same headers as MDA

Perhaps we should filter these headers out in Import

6 years ago v2writable: allow disabling parallelization
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:56 +0000 (08:14 +0000)]
v2writable: allow disabling parallelization

While parallel processes improve import speed for initial
imports, they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init.  This
saves some memory for daily use and even helps improve readability
of some subroutines by showing which methods they call remotely.

6 years ago searchidxpart: s/barrier/remote_barrier/
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:55 +0000 (08:14 +0000)]
searchidxpart: s/barrier/remote_barrier/

Be consistent with our "remote_" prefix for other IPC subs

6 years ago watchmaildir: support v2 repositories
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:54 +0000 (08:14 +0000)]
watchmaildir: support v2 repositories

Unfortunately this gives up some minor performance tweaks we
made to avoid reforking import processes.

6 years ago v2writable: ensure ->done is idempotent
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:53 +0000 (08:14 +0000)]
v2writable: ensure ->done is idempotent

This matches Import::done behavior

6 years ago t/watch_maildir: note the reason for FIFO creation
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:52 +0000 (08:14 +0000)]
t/watch_maildir: note the reason for FIFO creation

I had to dig through commit history for this and we should
better document our tests (along with everything else).

6 years ago Lock: new base class for writable lockers
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:51 +0000 (08:14 +0000)]
Lock: new base class for writable lockers

This reduces code duplication needed for locking and
hopefully makes things easier to understand.

6 years ago index: s/GIT_DIR/REPO_DIR/
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:50 +0000 (08:14 +0000)]
index: s/GIT_DIR/REPO_DIR/

No functional changes, yet, but this makes future changes
easier to read.

6 years ago import: enable locking under v2
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:49 +0000 (08:14 +0000)]
import: enable locking under v2

Instead of using ssoma-based locking, enable locking via Import
for now.

6 years ago v2writable: test for idempotent removals
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:48 +0000 (08:14 +0000)]
v2writable: test for idempotent removals

This will make reindexing easier.

6 years ago import: switch to URL-safe Base64 for Message-IDs
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:47 +0000 (08:14 +0000)]
import: switch to URL-safe Base64 for Message-IDs

Hexdigests are too long and shorter Message-IDs are easier
to deal with.
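
The length difference is easy to see in a small Python sketch. The digest algorithm (SHA-256) and the stripping of `=` padding are assumptions made for illustration; the commit message does not say which digest or padding rule the Perl code uses.

```python
import base64
import hashlib


def digest_forms(data: bytes):
    """Return (hexdigest, URL-safe Base64) for the same raw digest."""
    raw = hashlib.sha256(data).digest()  # assumed algorithm, 32 raw bytes
    hexdigest = raw.hex()
    # URL-safe Base64 uses '-' and '_' instead of '+' and '/'.
    b64 = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return hexdigest, b64


hexdigest, b64 = digest_forms(b"example message body")
```

For a 32-byte digest the hex form is 64 characters while the unpadded Base64 form is 43, and the Base64 alphabet chosen here needs no escaping inside a URL path.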

6 years ago import: force Message-ID generation for v1 here
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:46 +0000 (08:14 +0000)]
import: force Message-ID generation for v1 here

This allows us to share code for generating Message-IDs
between v1 and v2 repos.

For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:

The workaround for this would be to reuse the bad message from
the archive itself.

6 years ago watchmaildir: use content_digest to generate Message-Id
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:45 +0000 (08:14 +0000)]
watchmaildir: use content_digest to generate Message-Id

This can probably be moved to Import for code reuse.

6 years ago mid: mid_mime uses v2-compatible mids function
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:44 +0000 (08:14 +0000)]
mid: mid_mime uses v2-compatible mids function

This allows us to be more consistent in dealing with completely
empty Message-Ids.

6 years ago import: implement barrier operation for v1 repos
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:43 +0000 (08:14 +0000)]
import: implement barrier operation for v1 repos

This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg.  This also ensures dumb HTTP
clients can see changes to V2 repos immediately.

6 years ago import: (v2): write deletes to a separate '_' subdirectory
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:42 +0000 (08:14 +0000)]
import: (v2): write deletes to a separate '_' subdirectory

In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree.  This keeps the
top-level tree small and more amenable to deltafication.
This helps the common case where "m" is the most commonly
changed file at the top level.

Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.

6 years ago import: (v2) delete writes the blob into history in subdir
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:41 +0000 (08:14 +0000)]
import: (v2) delete writes the blob into history in subdir

This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.

This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs).  It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.

6 years ago skeleton: barrier init requires a lock
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:40 +0000 (08:14 +0000)]
skeleton: barrier init requires a lock

Writing to the main skeleton pipe requires a lock since it's
shared with partition processes.

6 years ago v2writable: implement remove correctly
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:39 +0000 (08:14 +0000)]
v2writable: implement remove correctly

We need to hide removals from anybody hitting the search engine.

6 years ago search: allow ->reopen to be chainable
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:38 +0000 (08:14 +0000)]
search: allow ->reopen to be chainable

Makes life a little easier for V2Writable...

6 years ago searchidx: do not delete documents while iterating
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:37 +0000 (08:14 +0000)]
searchidx: do not delete documents while iterating

Followup-to: ebb59815035b42c2
  ("searchidx: do not modify Xapian DB while iterating")

6 years ago v2writable: remove unnecessary idx_init call
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:36 +0000 (08:14 +0000)]
v2writable: remove unnecessary idx_init call

We no longer need it with ->barrier working

6 years ago use string ref for Email::Simple->new
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:35 +0000 (08:14 +0000)]
use string ref for Email::Simple->new

Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.

6 years ago v2writable: support "barrier" operation to avoid reforking
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:34 +0000 (08:14 +0000)]
v2writable: support "barrier" operation to avoid reforking

Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient.  Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.

6 years ago content_id: use Sender header if From is not available
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:33 +0000 (08:14 +0000)]
content_id: use Sender header if From is not available

We will be using Sender: in more places if the From: header
is not available, this is one of them.

Followup-to: ("import: fall back to Sender for extracting name and email")

6 years ago extmsg: rework partial MID matching to favor current inbox
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 07:51:09 +0000 (07:51 +0000)]
extmsg: rework partial MID matching to favor current inbox

The current inbox is more important for partial Message-ID
matching, so we try harder on that to fix common errors before
moving onto other inboxes.  Then, prevent expensive scanning of
other inboxes by requiring a Message-ID length of at least 16
bytes.

Finally, we limit the overall partial responses to 200 when
scanning other inboxes to avoid excessive memory usage.
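
The policy above can be sketched in Python. The two numeric limits come from the commit message; everything else is an assumption for illustration: inboxes are modeled as plain lists of Message-ID strings, matching is a simple substring test, and `partial_matches` is a hypothetical name.

```python
MIN_PARTIAL_LEN = 16  # bytes required before other inboxes are scanned
MAX_PARTIAL = 200     # cap on matches collected from other inboxes


def partial_matches(query, current_inbox, other_inboxes):
    """Match in the current inbox first, then (maybe) in the others."""
    found = [mid for mid in current_inbox if query in mid]
    if len(query.encode("utf-8")) < MIN_PARTIAL_LEN:
        # Too short to justify an expensive scan of unrelated inboxes.
        return found
    others = []
    for inbox in other_inboxes:
        for mid in inbox:
            if query in mid:
                others.append(mid)
                if len(others) >= MAX_PARTIAL:
                    # Bound memory use: stop at the response limit.
                    return found + others
    return found + others


current = ["20180319075109.GA1234@example"]
elsewhere = [["0123456789abcdef-%d@example" % i for i in range(300)]]
short_hits = partial_matches("GA1234", current, elsewhere)
long_hits = partial_matches("0123456789abcdef", current, elsewhere)
```

A short query only ever consults the current inbox, while a query of 16 bytes or more may reach the other inboxes but stops collecting at 200 matches.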

6 years ago v2writable: detect and use previous partition count
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 07:28:56 +0000 (07:28 +0000)]
v2writable: detect and use previous partition count

We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.

6 years ago scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 03:51:08 +0000 (03:51 +0000)]
scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping

It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and need the escaping.
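
The two conventions differ in which body lines get quoted. A minimal Python sketch of both follows; the function names are illustrative and this is not the script's Perl code.

```python
import re

# mboxrd quotes any line that starts with zero or more '>' then "From ".
MBOXRD_FROM = re.compile(r"^(>*From )", re.MULTILINE)
# mboxo quotes only lines that start with "From " exactly.
MBOXO_FROM = re.compile(r"^(From )", re.MULTILINE)


def escape_mboxrd(body: str) -> str:
    return MBOXRD_FROM.sub(r">\1", body)


def unescape_mboxrd(body: str) -> str:
    # Reversible: strip exactly one leading '>' from quoted From lines.
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)


def escape_mboxo(body: str) -> str:
    return MBOXO_FROM.sub(r">\1", body)


body = "From here\n>From there\nplain text\n"
rd = escape_mboxrd(body)
o = escape_mboxo(body)
```

Because mboxrd also quotes lines that already begin with `>From `, unescaping restores the original body exactly, whereas mboxo cannot tell an escaped line from one that was written as `>From ` in the first place.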

6 years ago import: fall back to Sender for extracting name and email
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 03:49:42 +0000 (03:49 +0000)]
import: fall back to Sender for extracting name and email

This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>

6 years ago favor Received: date over Date: header globally
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 04:15:38 +0000 (04:15 +0000)]
favor Received: date over Date: header globally

The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy.  We still show the Date: in per-message (permalink)
views, which may expose users for having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.

6 years ago v2writable: remove unnecessary skeleton commit
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 02:59:35 +0000 (02:59 +0000)]
v2writable: remove unnecessary skeleton commit

Not a big deal since we still commit to the skeleton for every
single partition (barrier work abandoned).

6 years ago search: each_smsg_by_mid uses skeleton if available
Eric Wong (Contractor, The Linux Foundation) [Mon, 5 Mar 2018 23:13:47 +0000 (23:13 +0000)]
search: each_smsg_by_mid uses skeleton if available

We do not need the large DBs for MID scans.

6 years ago search: favor skeleton DB for lookup_mail
Eric Wong (Contractor, The Linux Foundation) [Sun, 4 Mar 2018 20:04:29 +0000 (20:04 +0000)]
search: favor skeleton DB for lookup_mail

The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.

6 years ago INSTALL: document more optional dependencies
Eric Wong (Contractor, The Linux Foundation) [Mon, 5 Mar 2018 17:06:46 +0000 (17:06 +0000)]
INSTALL: document more optional dependencies

I've missed a few things over time :x

6 years ago v2: avoid redundant/repeated configs for git partition repos
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:56:15 +0000 (20:56 +0000)]
v2: avoid redundant/repeated configs for git partition repos

We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration.  The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.

6 years ago import: consolidate object info for v2 imports
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:33:25 +0000 (20:33 +0000)]
import: consolidate object info for v2 imports

It's easier to store everything in one array ref similar
to what our Git->check routine returns

6 years ago searchidx: store the primary MID in doc data for NNTP
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:21:05 +0000 (20:21 +0000)]
searchidx: store the primary MID in doc data for NNTP

We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen.  This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.

6 years ago nntp: fix NEWNEWS command
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:18:34 +0000 (20:18 +0000)]
nntp: fix NEWNEWS command

I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.

Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")

6 years ago nntp: use NNTP article numbers for lookups
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 18:47:47 +0000 (18:47 +0000)]
nntp: use NNTP article numbers for lookups

Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups.  This will prevent us
from finding the "wrong" internal Message-ID.

6 years ago mid: truncate excessively long MIDs early
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 17:57:57 +0000 (17:57 +0000)]
mid: truncate excessively long MIDs early

Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out.  It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.

6 years ago searchidx: add NNTP article number as a searchable term
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 17:42:20 +0000 (17:42 +0000)]
searchidx: add NNTP article number as a searchable term

Since we support duplicate MIDs in v2, the NNTP article number
becomes the true unique identifier and we want a way to do fast
lookups on it.

While we're at it, stop putting XPATH in the term partitions
since we only need it in the skeleton DB.

6 years ago searchidx: use add_boolean_term for internal terms
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 17:26:16 +0000 (17:26 +0000)]
searchidx: use add_boolean_term for internal terms

Aside from the Message-Id ('Q'), these terms do not appear in
content and thus have no business contributing to the Xapian
document length.

Thanks-to Olly Betts for the tip on xapian-discuss
<20180228004400.GU12724@survex.com>

6 years ago v2writable: generated Message-ID goes first
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 07:31:54 +0000 (07:31 +0000)]
v2writable: generated Message-ID goes first

This is to make SearchMsg behave more sanely under NNTP.

6 years ago searchidxskeleton: add a note about locking
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 07:16:29 +0000 (07:16 +0000)]
searchidxskeleton: add a note about locking

It's tempting to rely on the atomicity of smaller-than-PIPE_BUF
writes, but it doesn't work if mixed with larger ones.
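
POSIX only guarantees atomicity for pipe writes of at most PIPE_BUF bytes, so once larger writes share the pipe, every writer needs to hold a lock. A minimal Unix-only Python sketch of that discipline follows; `locked_write` and the temporary lock file are assumptions for illustration, not the project's locking code.

```python
import fcntl
import os
import select
import tempfile


def locked_write(pipe_fd, lock_file, data):
    """Write all of data to pipe_fd while holding an exclusive lock."""
    fcntl.flock(lock_file, fcntl.LOCK_EX)
    try:
        view = memoryview(data)
        while view:
            written = os.write(pipe_fd, view)
            view = view[written:]
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)


r, w = os.pipe()
with tempfile.TemporaryFile() as lock:
    small = b"a" * 16
    # One byte over the limit: no atomicity guarantee for this write.
    large = b"b" * (select.PIPE_BUF + 1)
    locked_write(w, lock, small)
    locked_write(w, lock, large)
os.close(w)

received = b""
while True:
    chunk = os.read(r, 65536)
    if not chunk:
        break
    received += chunk
os.close(r)
```

This demo runs in a single process, so nothing actually contends for the lock. With real concurrent writers, each process would need to open the lock file itself, because `flock` locks belong to the open file description and are shared across `fork`.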

6 years ago searchidx: avoid excessive XNQ indexing with diffs
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 05:55:26 +0000 (05:55 +0000)]
searchidx: avoid excessive XNQ indexing with diffs

When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface.  This results in better indexing performance and
10-15% smaller Xapian indices.

6 years ago mid: be strict with References, but loose on Message-Id
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 05:14:33 +0000 (05:14 +0000)]
mid: be strict with References, but loose on Message-Id

Traditionally we've been more lax on parsing Message-Id
and allow it without the angle brackets.  We've always been
strict on References and can't have it be pointlessly
large when some MUA decides to use HTML-escaped angle
brackets ("&lt;", "&gt;").
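
The strict and loose behaviours can be contrasted in a short Python sketch. The regular expression is an assumption about what counts as a bracketed id, and the function names only echo the vocabulary of this log; this is not the Perl implementation.

```python
import re

# Assumed shape of a bracketed id: no whitespace or nested brackets.
MID_IN_BRACKETS = re.compile(r"<([^<>\s]+)>")


def references(header_value: str):
    """Strict: only tokens wrapped in literal angle brackets count."""
    return MID_IN_BRACKETS.findall(header_value)


def message_id(header_value: str):
    """Loose: prefer a bracketed id, else accept the bare value."""
    match = MID_IN_BRACKETS.search(header_value)
    return match.group(1) if match else header_value.strip()


good_refs = references("<a@example> <b@example>")
escaped_refs = references("&lt;a@example&gt; &lt;b@example&gt;")
bare_mid = message_id("a@example")
```

With HTML-escaped brackets the strict parser yields no references at all, instead of storing one long bogus value, while a Message-Id lacking brackets is still accepted.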

6 years ago searchidx: support indexing multiple MIDs
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 04:00:09 +0000 (04:00 +0000)]
searchidx: support indexing multiple MIDs

It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.

6 years ago search: revert to using 'Q' as a uniQue id per-Xapian conventions
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 20:46:55 +0000 (20:46 +0000)]
search: revert to using 'Q' as a uniQue id per-Xapian conventions

'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.

6 years ago v2writable: inject new Message-IDs on true duplicates
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 19:32:19 +0000 (19:32 +0000)]
v2writable: inject new Message-IDs on true duplicates

Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).

Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
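
The fallback from a deterministic to a random id can be sketched in Python. The digest algorithm, the `@localhost` domain, and the `taken` set standing in for the existing Message-ID index are all assumptions made for illustration.

```python
import base64
import hashlib
import secrets


def _b64(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def new_mid(content: bytes, taken: set, domain: str = "localhost") -> str:
    """Pick a Message-ID for content that is not already in taken."""
    # First choice: derived from the content, so repeated imports of
    # the same message produce the same id.
    candidate = "%s@%s" % (_b64(hashlib.sha256(content).digest()), domain)
    if candidate not in taken:
        return candidate
    # The predictable id is occupied (possibly on purpose): give up
    # determinism and use a random id instead.
    while True:
        candidate = "%s@%s" % (_b64(secrets.token_bytes(32)), domain)
        if candidate not in taken:
            return candidate


first = new_mid(b"message without a usable id", set())
again = new_mid(b"message without a usable id", set())
forced = new_mid(b"message without a usable id", {first})
```

The first two calls agree because the id is derived from the content; the third call finds that id occupied and falls back to a random one, so the message can still be archived.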

6 years ago content_id: no need to be human-friendly
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 18:27:54 +0000 (18:27 +0000)]
content_id: no need to be human-friendly

We merely use this for internal comparisons and do not store
this in Xapian.  So using a shorter, non-human readable digest
is enough.  Furthermore, introduce "content_digest" which
returns the Digest::SHA object for extra changes.

6 years ago searchidx: use new `references' method for parsing References
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 09:53:11 +0000 (09:53 +0000)]
searchidx: use new `references' method for parsing References

It's shorter and more convenient, here.

6 years ago content_id: use `mids' and `references' for MID extraction
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 09:39:31 +0000 (09:39 +0000)]
content_id: use `mids' and `references' for MID extraction

These already take care of deduping internally, so we'll save
ourselves at least some of the trouble while using a more
consistent API.  While we're at it, hash the header name as
well, since we need to distinguish which header a certain value
came from.
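
Why the header name is hashed along with the value can be shown in a few lines of Python. The digest algorithm and the NUL separators are assumptions for illustration, and the function returns the hash object much as "content_digest" is described elsewhere in this log; this is not the Perl implementation.

```python
import hashlib


def content_digest(headers, body: bytes):
    """headers: iterable of (name, value) pairs chosen for hashing."""
    dig = hashlib.sha256()  # assumed algorithm
    for name, value in headers:
        # NUL separators keep ("ab", "c") distinct from ("a", "bc").
        dig.update(name.lower().encode("utf-8") + b"\0")
        dig.update(value.encode("utf-8") + b"\0")
    dig.update(body)
    return dig


as_subject = content_digest([("Subject", "hello")], b"body\n").hexdigest()
as_sender = content_digest([("Sender", "hello")], b"body\n").hexdigest()
repeat = content_digest([("subject", "hello")], b"body\n").hexdigest()
```

The same value under two different header names produces two different digests, which a value-only hash could not distinguish.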

6 years ago mid: add `mids' and `references' methods for extraction
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 09:38:35 +0000 (09:38 +0000)]
mid: add `mids' and `references' methods for extraction

We'll be using a more consistent API for extracting Message-IDs
from various headers.

6 years ago evcleanup: do not create event loop if nothing was registered
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 03:43:30 +0000 (03:43 +0000)]
evcleanup: do not create event loop if nothing was registered

This was creating an unnecessary epoll descriptor via
Danga::Socket when using V2Writable to import a mbox.  That
said, there should probably be a better way of detecting whether
or not we're inside a Danga::Socket event loop.

Fixes: 427245acacaf04a8
       ("evcleanup: ensure deferred close from timers are handled ASAP")

6 years ago v2writable: deduplicate detection on add
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 03:39:09 +0000 (03:39 +0000)]
v2writable: deduplicate detection on add

This is a bit expensive in a multi-process situation because
we need to make our indices and packs visible to the read-only
pieces.

6 years ago evcleanup: disable outside of daemon
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 03:12:23 +0000 (03:12 +0000)]
evcleanup: disable outside of daemon

We'll be using these in a more OO manner for V2Writable
(which doesn't use Danga::Socket), so let's not unnecessarily
register cleanup handlers intended for network daemons.

6 years ago content_id: special treatment for Message-Id headers
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 03:08:33 +0000 (03:08 +0000)]
content_id: special treatment for Message-Id headers

Some emails in LKML archives are identical with the only
difference being s/References:/In-Reply-To:/ in the headers.
Since this difference doesn't affect how we handle message
threading, we will treat them the same way for the purposes
of deduplication.

There may be more changes to how we do content_id along these
lines (e.g. using msg_iter to walk the message).
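
One way to treat the two headers alike is to fold them into a single ordered, deduplicated list before comparing messages. The Python sketch below assumes headers arrive as a plain dict and uses a simple bracket regex; both are illustrative choices, not the project's code.

```python
import re

MID_IN_BRACKETS = re.compile(r"<([^<>\s]+)>")


def thread_refs(headers: dict):
    """Ids referenced by a message, regardless of which header held them."""
    seen = []
    for name in ("References", "In-Reply-To"):
        for mid in MID_IN_BRACKETS.findall(headers.get(name, "")):
            if mid not in seen:
                seen.append(mid)
    return seen


via_references = thread_refs({"References": "<parent@example>"})
via_in_reply_to = thread_refs({"In-Reply-To": "<parent@example>"})
via_both = thread_refs({
    "References": "<root@example> <parent@example>",
    "In-Reply-To": "<parent@example>",
})
```

Two copies of a message that differ only in which header carries the parent id reduce to the same list, so they compare equal for deduplication.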

6 years ago searchidx: add PID to error message when die-ing
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 00:50:56 +0000 (00:50 +0000)]
searchidx: add PID to error message when die-ing

6 years ago search: remove informational "warning" message
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 00:49:51 +0000 (00:49 +0000)]
search: remove informational "warning" message

It was making imports too noisy.

6 years ago v2writable: delete ::Import obj when ->done
Eric Wong (Contractor, The Linux Foundation) [Thu, 1 Mar 2018 08:24:00 +0000 (08:24 +0000)]
v2writable: delete ::Import obj when ->done

As with the ::Import class this wraps, we want this to be
usable as a checkpoint and be able to call ->add afterwards.
We'll be relying on ->done to flush changes through all
partition and skeleton DBs for deduplication checks.

6 years ago v2/ui: get nntpd and init tests running on v2
Eric Wong (Contractor, The Linux Foundation) [Wed, 28 Feb 2018 22:29:38 +0000 (22:29 +0000)]
v2/ui: get nntpd and init tests running on v2

A work-in-progress, but it appears the v2 UI pieces
will not require a lot of work.

6 years ago search: query_xover uses skeleton DB iff available
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Feb 2018 08:29:09 +0000 (08:29 +0000)]
search: query_xover uses skeleton DB iff available

The skeleton DB is where we store all the information needed
for NNTP overviews via XOVER.  This seems to be the only change
necessary (besides eventually handling duplicates)
to support our nntpd interface for v2 repositories.

6 years ago searchidx: do not modify Xapian DB while iterating
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Feb 2018 20:25:23 +0000 (20:25 +0000)]
searchidx: do not modify Xapian DB while iterating

Iterating through a list of documents while modifying them does
not seem to be supported in Xapian and it can trigger
DatabaseCorruptError exceptions.  This only worked with past
datasets out of dumb luck.  With the work-in-progress "v2"
public-inbox layout, this problem might become more visible
as the "thread skeleton" is partitioned out to a separate,
smaller Xapian database.

I've reproduced the problem on both Debian 8.x and 9.x with
Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend)
respectively.
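
The general remedy is to finish iterating before modifying anything. The Python sketch below uses a plain dict as a stand-in for the document store; it does not use the Xapian API, and `delete_matching` is a hypothetical helper.

```python
def delete_matching(db: dict, predicate):
    """Delete every document matching predicate; return the deleted ids."""
    # Pass 1: read only.  Collect ids while the iterator is live.
    doomed = [docid for docid, doc in db.items() if predicate(doc)]
    # Pass 2: modify.  No iterator over db exists any more.
    for docid in doomed:
        del db[docid]
    return doomed


db = {1: "ghost", 2: "mail", 3: "ghost"}
removed = delete_matching(db, lambda doc: doc == "ghost")
```

Python happens to fail loudly when a dict changes size during iteration (it raises RuntimeError); per the message above, the Xapian equivalent only surfaced as occasional DatabaseCorruptError exceptions.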

6 years ago searchidxskeleton: extra error checking
Eric Wong (Contractor, The Linux Foundation) [Wed, 28 Feb 2018 17:37:00 +0000 (17:37 +0000)]
searchidxskeleton: extra error checking

I added these while chasing down the DatabaseCorruptError
exceptions which turned out to be caused by Xapian DB
modifications during iteration.

6 years ago v2writable: commit to skeleton via remote partitions
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Feb 2018 20:29:55 +0000 (20:29 +0000)]
v2writable: commit to skeleton via remote partitions

We need to ensure Xapian transaction commits are made to remote
partitions before associated commits hit the skeleton DB.

This causes unnecessary commits to be made to the skeleton DB;
but they're mostly harmless.  Further work will be necessary
to ensure proper ordering and avoidance of unnecessary commits.

6 years ago rename SearchIdxThread to SearchIdxSkeleton
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Feb 2018 00:41:21 +0000 (00:41 +0000)]
rename SearchIdxThread to SearchIdxSkeleton

Interchangeably using "all", "skel", "threader", etc. was
confusing.  Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.

6 years ago search: use different Enquire object for skeleton queries
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Feb 2018 23:42:14 +0000 (23:42 +0000)]
search: use different Enquire object for skeleton queries

A different Xapian DB requires the use of a different Enquire
object.  This is necessary for get_thread and thread skeleton
to work in the PSGI UI.

6 years ago searchidx: index values in the threader
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Feb 2018 23:41:11 +0000 (23:41 +0000)]
searchidx: index values in the threader

We will need timestamp, YYYYMMDD, article number, and line count
for querying thread information (including XOVER for NNTP).

6 years ago search: reopen skeleton DB as well
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Feb 2018 23:25:52 +0000 (23:25 +0000)]
search: reopen skeleton DB as well

Any Xapian DB is subject to the same errors and retries.
Perhaps in the future this can be made more granular to avoid
unnecessary reopens.

6 years ago searchidxpart: force integers into add_message
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Feb 2018 23:02:13 +0000 (23:02 +0000)]
searchidxpart: force integers into add_message

Make data passed via Storable to the skeleton worker
a little neater.

6 years ago searchidxthread: load doc data for references
Eric Wong (Contractor, The Linux Foundation) [Sat, 24 Feb 2018 06:58:55 +0000 (06:58 +0000)]
searchidxthread: load doc data for references

Otherwise, references and thread linking don't happen
across subject mismatches.  Oops, this is important.