Sergey Matveev's repositories - public-inbox.git/log
6 years ago wwwstream: flesh out clone instructions for v2
Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:43 +0000 (01:20 +0000)]
wwwstream: flesh out clone instructions for v2

Relying solely on git for v2 repos is probably not
so useful, so add pointers to public-inbox-init/index
commands.

6 years ago v2writable: convert some fatal reindex errors to warnings
Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:42 +0000 (01:20 +0000)]
v2writable: convert some fatal reindex errors to warnings

By supporting purge and allowing users to delete git partitions,
we open ourselves up to gaps and un-reindexable data.  Let
that be.

6 years ago v2writable: allow gaps in git partitions
Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:41 +0000 (01:20 +0000)]
v2writable: allow gaps in git partitions

Somebody may only care about the most recent history,
so allow -init and -index to operate quietly on missing
partitions.

6 years ago search: warn on reopens and die on total failure
Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:40 +0000 (01:20 +0000)]
search: warn on reopens and die on total failure

-watch on a busy/giant Maildir caused too many Xapian
errors while attempting to browse.

6 years ago mda: support v2 inboxes
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:20 +0000 (20:17 +0000)]
mda: support v2 inboxes

I mainly focus on -watch for mirroring busy mailing lists, but
using -mda should remain an option.

6 years ago public-inbox-compact: new tool for driving xapian-compact
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:19 +0000 (20:17 +0000)]
public-inbox-compact: new tool for driving xapian-compact

Having multiple Xapian partitions is mostly pointless after
the initial import.  We can compact all the partitions into
one while keeping the skeleton separate.

6 years ago v2writable: initializing an existing inbox is idempotent
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:18 +0000 (20:17 +0000)]
v2writable: initializing an existing inbox is idempotent

And we do not want to start making confused repos if somebody
leaves out "-V2" the second time around.

6 years ago import: run_die supports redirects as spawn does
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:17 +0000 (20:17 +0000)]
import: run_die supports redirects as spawn does

We'll be using it in more future tests and scripts.

6 years ago search: retry_reopen on first_smsg_by_mid
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 19:14:12 +0000 (19:14 +0000)]
search: retry_reopen on first_smsg_by_mid

This was causing errors while attempting to load messages via
the WWW interface while mass-importing LKML.  While we're at it,
remove unnecessary eval from lookup_article.

6 years ago view: get rid of some unnecessary imports
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:57 +0000 (09:57 +0000)]
view: get rid of some unnecessary imports

We no longer need some of these old subroutines which
assumed a single Message-ID for each message.

6 years ago www: cleanup expensive fallback for legacy URLs
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:56 +0000 (09:57 +0000)]
www: cleanup expensive fallback for legacy URLs

Back in the day, we compressed long Message-IDs to SHA-1
hexdigests for the URL.  This now redirects with a 301 in
the hopes we can remove these checks some day to reduce
overhead.
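
A minimal sketch of the legacy scheme described above, written in Python rather than the project's Perl. The length cutoff, the function names, and the URL shape are assumptions made for illustration; the commit message only says that long Message-IDs were compressed to SHA-1 hexdigests and that such URLs are now answered with a 301.

```python
import hashlib

# Hypothetical cutoff; the commit message does not say where "long" begins.
LEGACY_MAX_LEN = 40


def legacy_url_id(mid: str) -> str:
    """Return the identifier an old-style URL might have carried."""
    if len(mid) <= LEGACY_MAX_LEN:
        return mid
    return hashlib.sha1(mid.encode("utf-8")).hexdigest()


def resolve_legacy(url_id: str, known_mids):
    """Map a legacy hexdigest back to a Message-ID.

    Returns (301, location) on a hit and (404, None) otherwise.
    A linear scan like this one would be costly on a large inbox,
    which is the kind of overhead the commit hopes to drop one day.
    """
    for mid in known_mids:
        if hashlib.sha1(mid.encode("utf-8")).hexdigest() == url_id:
            return 301, "/" + mid + "/"  # hypothetical URL layout
    return 404, None
```

Answering with a permanent redirect lets clients and crawlers learn the canonical URL, so the fallback is hit less often over time.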

6 years ago mbox: avoid extracting Message-ID for linkification
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:55 +0000 (09:57 +0000)]
mbox: avoid extracting Message-ID for linkification

We can avoid a small amount of overhead and use the "preferred"
Message-ID based on what is in the SearchMsg object.

6 years ago v2writable: cleanup: get rid of unused fields
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:54 +0000 (09:57 +0000)]
v2writable: cleanup: get rid of unused fields

The layout of this structure ended up being a bit different
and the read-only access is handled through the ::Inbox class,
instead.

6 years ago search: move find_doc_ids to searchidx
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:53 +0000 (09:57 +0000)]
search: move find_doc_ids to searchidx

We do not need this subroutine for read-only use in Search.pm

6 years ago search: get rid of most lookup_* subroutines
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:52 +0000 (09:57 +0000)]
search: get rid of most lookup_* subroutines

Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.

For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.

6 years ago search: cleanup uniqueness checking
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:51 +0000 (09:57 +0000)]
search: cleanup uniqueness checking

The only Xapian term which should be unique is the NNTP article
number; so we no longer need find_unique_doc_id.

6 years ago v2writable: support purging messages from git entirely
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:50 +0000 (09:57 +0000)]
v2writable: support purging messages from git entirely

Purging existing messages is fairly straightforward since we can
take advantage of Xapian and lookup the git object_id with it.

Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).

Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.

6 years ago public-inbox-convert: tool for converting old to new inboxes
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:49 +0000 (09:57 +0000)]
public-inbox-convert: tool for converting old to new inboxes

This should make it easier to let users perform comparisons and
migrate to v2 if needed.

6 years ago searchmsg: document why we store To: and Cc: for NNTP
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:48 +0000 (09:57 +0000)]
searchmsg: document why we store To: and Cc: for NNTP

Otherwise I would forget and be tempted to remove them.

6 years ago www: fix attachment downloads for conflicted Message-IDs
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:47 +0000 (09:57 +0000)]
www: fix attachment downloads for conflicted Message-IDs

By using the "primary" Message-ID in WwwAttach, we can avoid
conflicts in the links we use for downloading attachments.

6 years ago lookup by Message-ID favors the "primary" one
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:46 +0000 (09:57 +0000)]
lookup by Message-ID favors the "primary" one

The Message-ID mapped to an NNTP article number is stronger,
so we will favor that for attachment lookups.

6 years ago v2writable: append, instead of prepending generated Message-ID
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:45 +0000 (09:57 +0000)]
v2writable: append, instead of prepending generated Message-ID

The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox.  So whatever Message-ID we use
to deduplicate internally will be secondary and less important.

All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.

6 years ago www: remove unnecessary ghost checks
Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:44 +0000 (09:57 +0000)]
www: remove unnecessary ghost checks

We do not need to care about ghosts at multiple call sites; they
cannot have a {blob} field and we've stored the blob field in
Xapian since SCHEMA_VERSION=13.

6 years ago www: support cloning individual v2 git partitions
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:31:44 +0000 (20:31 +0000)]
www: support cloning individual v2 git partitions

This will require multiple client invocations, but should reduce
load on the server and make it easier for readers to only clone
the latest data.

Unfortunately, supporting a cloneurl file for externally-hosted
repos will be more difficult as we cannot easily know if the
clones use v1 or v2 repositories, or how many git partitions
they have.

6 years ago githttpbackend: avoid infinite loop on generic PSGI servers
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:24:21 +0000 (20:24 +0000)]
githttpbackend: avoid infinite loop on generic PSGI servers

We must detect EOF when reading a POST body with standard PSGI servers.
This does not affect deployments using the standard public-inbox-httpd;
but most smaller inboxes should be able to get away using a generic
PSGI server.
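
The EOF check mentioned above can be sketched as follows. This is a Python illustration under stated assumptions, not the project's Perl code: `copy_post_body` and its parameters are hypothetical names, and the input is modelled as any file-like object with a `read` method.

```python
import io


def copy_post_body(inp, out, bufsize=8192):
    """Copy a request body from inp to out, stopping at EOF.

    A read() that returns an empty bytes object signals EOF; without
    this check the loop would spin forever on an exhausted stream.
    """
    total = 0
    while True:
        buf = inp.read(bufsize)
        if not buf:  # EOF: nothing more will arrive
            break
        out.write(buf)
        total += len(buf)
    return total
```

For instance, `copy_post_body(io.BytesIO(b"data"), io.BytesIO())` copies four bytes and then stops on the empty read.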

6 years ago http: fix modification of read-only value
Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:09:06 +0000 (20:09 +0000)]
http: fix modification of read-only value

This fails in the rare case we get a partial send() on "\r\n"
when writing chunked HTTP responses out.

6 years ago view: depend on SearchMsg for Message-ID
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:33:18 +0000 (19:33 +0000)]
view: depend on SearchMsg for Message-ID

Since we need to handle messages with multiple and duplicate
Message-ID headers, our thread skeleton display must account
for that.

Since we have a "preferred" Message-ID in case of conflicts,
use it as the UUID in an Atom feed so readers do not get
confused by conflicts.

6 years ago searchview: remove unnecessary imports from MID module
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:25:47 +0000 (19:25 +0000)]
searchview: remove unnecessary imports from MID module

We do not need many of these, anymore.

6 years ago www: get rid of unnecessary 'inbox' name reference
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:15:10 +0000 (19:15 +0000)]
www: get rid of unnecessary 'inbox' name reference

We use the actual Inbox object everywhere else and don't
need the name of the inbox separated from the object.

6 years ago v2writable: warn on unseen deleted files
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:09:08 +0000 (19:09 +0000)]
v2writable: warn on unseen deleted files

It would be a bug to have deleted files marked but not
seen in our histories.

6 years ago searchidx: warn about vivifying multiple ghosts
Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 18:19:31 +0000 (18:19 +0000)]
searchidx: warn about vivifying multiple ghosts

This should help us detect bugs sooner in case we have
space waste problems.

6 years ago view: permalink (per-message) view shows multiple messages
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 20:29:24 +0000 (20:29 +0000)]
view: permalink (per-message) view shows multiple messages

This needs tests and further refinement, but current tests pass.

6 years ago feed: fix new.html for v2
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 18:00:41 +0000 (18:00 +0000)]
feed: fix new.html for v2

I forget this endpoint is still accessible (even if not linked).
This also simplifies new.html all around and removes some unused
clutter from the old days while we're at it.

6 years ago t/psgi_v2: minimal test for Atom feed and t.mbox.gz
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 02:24:03 +0000 (02:24 +0000)]
t/psgi_v2: minimal test for Atom feed and t.mbox.gz

Some test coverage is better than none, here.

6 years ago search: reopen DB if each_smsg_by_mid fails
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 02:03:46 +0000 (02:03 +0000)]
search: reopen DB if each_smsg_by_mid fails

This gives more up-to-date data in that case and allows us
to avoid reopening in more places ourselves.

6 years ago www: $MESSAGE_ID/raw endpoint supports "duplicates"
Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 01:54:16 +0000 (01:54 +0000)]
www: $MESSAGE_ID/raw endpoint supports "duplicates"

Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.

6 years ago import: consolidate mid prepend logic, here
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 18:21:54 +0000 (18:21 +0000)]
import: consolidate mid prepend logic, here

This also quiets down warnings from -watch when spam training
happens on messages without Message-Id.

6 years ago feed: $INBOX/new.atom endpoint supports v2 inboxes
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:48:29 +0000 (08:48 +0000)]
feed: $INBOX/new.atom endpoint supports v2 inboxes

We can no longer rely on tree name lookups for v2.  This also
optimizes v1 by relying on git blob object_id lookups while
avoiding process spawning overhead for "git log".

6 years ago v2writable: DEBUG_DIFF respects $TMPDIR
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:16:22 +0000 (08:16 +0000)]
v2writable: DEBUG_DIFF respects $TMPDIR

The File::Temp API is a bit tricky and needs TMPDIR explicitly
enabled if a template is given.

6 years ago v2writable: clarify header cleanups
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:14:19 +0000 (08:14 +0000)]
v2writable: clarify header cleanups

We want to make it clear to the code and DEBUG_DIFF users
that we do not introduce messages with unsuitable headers
into public archives.

6 years ago v2writable: add NNTP article number regeneration support
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:39:30 +0000 (03:39 +0000)]
v2writable: add NNTP article number regeneration support

Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian.  Article
numbers will not remain consistent when we add purge support,
though.

6 years ago t/altid.t: extra tests for mid_set
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:06:56 +0000 (03:06 +0000)]
t/altid.t: extra tests for mid_set

I'll be relying on some of this behavior for regenerating NNTP
article numbers off fresh clones.

6 years ago v2writable: support reindexing Xapian
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 09:04:50 +0000 (09:04 +0000)]
v2writable: support reindexing Xapian

This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.

6 years ago fix syntax warnings
Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 00:12:25 +0000 (00:12 +0000)]
fix syntax warnings

I keep forgetting to run "make syntax"

6 years ago msgmap: add tmp_clone to create an anonymous copy
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 08:29:25 +0000 (08:29 +0000)]
msgmap: add tmp_clone to create an anonymous copy

This will be used to keep track of Message-ID <-> NNTP Article
numbers to prevent article number reuse when reindexing.

6 years ago use both Date: and Received: times
Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 01:52:58 +0000 (01:52 +0000)]
use both Date: and Received: times

We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.

This also gives us determinism for commit times in most cases,
as we'll use the Received: timestamp there, as well.
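
The two sort keys described in this message can be sketched in Python with the standard `email` package. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, and taking the text after the last ';' of a Received: value as its date follows the usual header layout.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def date_header_ts(msg):
    """Timestamp from Date:, used here to order messages inside a thread."""
    return parsedate_to_datetime(msg["Date"]).timestamp()


def newest_received_ts(msg):
    """Newest timestamp found in any Received: header, or None."""
    stamps = []
    for value in msg.get_all("Received", []):
        # The date of a Received: header follows the last ';'.
        _, sep, date_part = value.rpartition(";")
        if not sep:
            continue
        try:
            stamps.append(parsedate_to_datetime(date_part.strip()).timestamp())
        except (TypeError, ValueError):
            continue  # skip dates that cannot be parsed
    return max(stamps) if stamps else None


def overall_sort_key(msg):
    """Prefer the Received: time; fall back to Date: when none is usable."""
    ts = newest_received_ts(msg)
    return ts if ts is not None else date_header_ts(msg)
```

With this split, a message whose Date: lies decades in the future still sorts by when a mail server actually saw it, while replies inside one thread keep the order the sender intended.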

6 years ago InboxWritable: add mbox/maildir parsing + import logic
Eric Wong (Contractor, The Linux Foundation) [Tue, 20 Mar 2018 21:00:00 +0000 (21:00 +0000)]
InboxWritable: add mbox/maildir parsing + import logic

This will make it easier to as well as supporting future
Filter API users.  It allows simplifying our ad-hoc
import_vger_from_mbox script.

6 years ago import: discard all the same headers as MDA
Eric Wong (Contractor, The Linux Foundation) [Tue, 20 Mar 2018 19:50:03 +0000 (19:50 +0000)]
import: discard all the same headers as MDA

Reduce the places where we have duplicate logic for discarding
unwanted headers.

6 years ago introduce InboxWritable class
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 20:49:30 +0000 (20:49 +0000)]
introduce InboxWritable class

This code will be shared with future mass-import tools.

6 years ago content_id: do not take Message-Id into account
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 23:24:50 +0000 (23:24 +0000)]
content_id: do not take Message-Id into account

If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator.  This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
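
A minimal Python sketch of a content digest that ignores Message-Id, so that a reused or regenerated Message-Id does not change the result. The header selection, the SHA-256 digest, and the name `content_id` as used here are assumptions for illustration; the commit message does not spell out what the real content_id hashes.

```python
import hashlib
from email import message_from_string

# Hypothetical header selection; Message-ID is deliberately absent.
HASHED_HEADERS = ("From", "To", "Cc", "Subject", "Date")


def content_id(msg):
    """Digest of selected headers plus body, ignoring Message-ID.

    This sketch only handles single-part messages with a text payload.
    """
    h = hashlib.sha256()
    for name in HASHED_HEADERS:
        for value in msg.get_all(name, []):
            h.update(f"{name}: {value}\n".encode("utf-8"))
    h.update(b"\n")
    h.update(str(msg.get_payload()).encode("utf-8"))
    return h.hexdigest()
```

Two copies of one message that differ only in their Message-Id header hash to the same value here, which is the property the deduplication relies on.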

6 years ago v2writable: remove "resent" message for duplicate Message-IDs
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:59 +0000 (08:14 +0000)]
v2writable: remove "resent" message for duplicate Message-IDs

public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy.  This message was only useful
during initial development and imports.

6 years ago v2writable: add DEBUG_DIFF env support
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:58 +0000 (08:14 +0000)]
v2writable: add DEBUG_DIFF env support

This can help us track down some differences during import,
if needed.

6 years ago scripts/import_vger_from_mbox: filter out same headers as MDA
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:57 +0000 (08:14 +0000)]
scripts/import_vger_from_mbox: filter out same headers as MDA

Perhaps we should filter these headers out in Import

6 years ago v2writable: allow disabling parallelization
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:56 +0000 (08:14 +0000)]
v2writable: allow disabling parallelization

While parallel processes improve import speed for initial
imports, they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init.  Disabling
them saves some memory for daily use and even helps improve
readability of some subroutines by showing which methods they call
remotely.

6 years ago searchidxpart: s/barrier/remote_barrier/
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:55 +0000 (08:14 +0000)]
searchidxpart: s/barrier/remote_barrier/

Be consistent with our "remote_" prefix for other IPC subs

6 years ago watchmaildir: support v2 repositories
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:54 +0000 (08:14 +0000)]
watchmaildir: support v2 repositories

Unfortunately this gives up some minor performance tweaks we
made to avoid reforking import processes.

6 years ago v2writable: ensure ->done is idempotent
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:53 +0000 (08:14 +0000)]
v2writable: ensure ->done is idempotent

This matches Import::done behavior

6 years ago t/watch_maildir: note the reason for FIFO creation
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:52 +0000 (08:14 +0000)]
t/watch_maildir: note the reason for FIFO creation

I had to dig through commit history for this and we should
better document our tests (along with everything else).

6 years ago Lock: new base class for writable lockers
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:51 +0000 (08:14 +0000)]
Lock: new base class for writable lockers

This reduces code duplication needed for locking and
hopefully makes things easier to understand.

6 years ago index: s/GIT_DIR/REPO_DIR/
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:50 +0000 (08:14 +0000)]
index: s/GIT_DIR/REPO_DIR/

No functional changes, yet, but this makes future changes
easier-to-read.

6 years ago import: enable locking under v2
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:49 +0000 (08:14 +0000)]
import: enable locking under v2

Instead of using ssoma-based locking, enable locking via Import
for now.

6 years ago v2writable: test for idempotent removals
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:48 +0000 (08:14 +0000)]
v2writable: test for idempotent removals

This will make reindexing easier.

6 years ago import: switch to URL-safe Base64 for Message-IDs
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:47 +0000 (08:14 +0000)]
import: switch to URL-safe Base64 for Message-IDs

Hexdigests are too long and shorter Message-IDs are easier
to deal with.
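
To make the size difference concrete, here is a small Python sketch comparing the two encodings of one digest. SHA-1, the `@localhost` suffix, and the function names are assumptions for illustration; the commit message only states that hexdigests were replaced by URL-safe Base64.

```python
import base64
import hashlib


def digest_forms(content: bytes):
    """Return (hexdigest, url_safe_base64) of the same SHA-1 digest."""
    raw = hashlib.sha1(content).digest()  # 20 bytes
    hexed = raw.hex()                     # 40 characters
    # URL-safe Base64 uses '-' and '_' instead of '+' and '/';
    # trailing '=' padding is stripped here.
    b64 = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return hexed, b64


def generated_mid(content: bytes, domain: str = "localhost") -> str:
    """Build a hypothetical generated Message-ID from the short form."""
    return digest_forms(content)[1] + "@" + domain
```

A 20-byte digest takes 40 characters as hex but 27 as unpadded Base64, and the URL-safe alphabet avoids characters that would need escaping in a path.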

6 years ago import: force Message-ID generation for v1 here
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:46 +0000 (08:14 +0000)]
import: force Message-ID generation for v1 here

This allows us to share code for generating Message-IDs
between v1 and v2 repos.

For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:

The workaround for this would be to reuse the bad message from
the archive itself.

6 years ago watchmaildir: use content_digest to generate Message-Id
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:45 +0000 (08:14 +0000)]
watchmaildir: use content_digest to generate Message-Id

This can probably be moved to Import for code reuse.

6 years ago mid: mid_mime uses v2-compatible mids function
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:44 +0000 (08:14 +0000)]
mid: mid_mime uses v2-compatible mids function

This allows us to be more consistent in dealing with completely
empty Message-Ids.

6 years ago import: implement barrier operation for v1 repos
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:43 +0000 (08:14 +0000)]
import: implement barrier operation for v1 repos

This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg.  This also ensures dumb HTTP
clients can see changes to V2 repos immediately.

6 years ago import: (v2): write deletes to a separate '_' subdirectory
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:42 +0000 (08:14 +0000)]
import: (v2): write deletes to a separate '_' subdirectory

In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree.  This keeps the
top-level tree small and more amenable to deltafication.
This helps the common case where "m" is the most commonly
changed file at the top level.

Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.

6 years ago import: (v2) delete writes the blob into history in subdir
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:41 +0000 (08:14 +0000)]
import: (v2) delete writes the blob into history in subdir

This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.

This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs).  It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.

6 years ago skeleton: barrier init requires a lock
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:40 +0000 (08:14 +0000)]
skeleton: barrier init requires a lock

Writing to the main skeleton pipe requires a lock since it's
shared with partition processes.

6 years ago v2writable: implement remove correctly
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:39 +0000 (08:14 +0000)]
v2writable: implement remove correctly

We need to hide removals from anybody hitting the search engine.

6 years ago search: allow ->reopen to be chainable
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:38 +0000 (08:14 +0000)]
search: allow ->reopen to be chainable

Makes life a little easier for V2Writable...

6 years ago searchidx: do not delete documents while iterating
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:37 +0000 (08:14 +0000)]
searchidx: do not delete documents while iterating

Followup-to: ebb59815035b42c2
  ("searchidx: do not modify Xapian DB while iterating")

6 years ago v2writable: remove unnecessary idx_init call
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:36 +0000 (08:14 +0000)]
v2writable: remove unnecessary idx_init call

We no longer need it with ->barrier working

6 years ago use string ref for Email::Simple->new
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:35 +0000 (08:14 +0000)]
use string ref for Email::Simple->new

Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.

6 years ago v2writable: support "barrier" operation to avoid reforking
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:34 +0000 (08:14 +0000)]
v2writable: support "barrier" operation to avoid reforking

Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient.  Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
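
The idea of a barrier can be illustrated for the SQLite part alone with Python's `sqlite3` module. This is a sketch under assumptions: the `Writer` class, the `msgs` table, and `count_visible` are hypothetical, and the fast-import and Xapian sides are not modelled.

```python
import sqlite3


class Writer:
    """Long-lived writer that publishes work via barrier() instead of
    closing and reopening its database handle."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS msgs (mid TEXT)")
        self.db.commit()

    def add(self, mid):
        # Stays inside the open transaction until the next barrier.
        self.db.execute("INSERT INTO msgs (mid) VALUES (?)", (mid,))

    def barrier(self):
        # Commit pending work; the connection stays usable afterwards.
        self.db.commit()


def count_visible(path):
    """Count rows as seen by an independent reader connection."""
    reader = sqlite3.connect(path)
    try:
        return reader.execute("SELECT COUNT(*) FROM msgs").fetchone()[0]
    finally:
        reader.close()
```

A lookup for duplicates or removals only needs the writer to reach a barrier; nothing has to be stopped and started again.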

6 years ago content_id: use Sender header if From is not available
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:33 +0000 (08:14 +0000)]
content_id: use Sender header if From is not available

We will be using Sender: in more places if the From: header
is not available; this is one of them.

Followup-to: ("import: fall back to Sender for extracting name and email")

6 years ago extmsg: rework partial MID matching to favor current inbox
Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 07:51:09 +0000 (07:51 +0000)]
extmsg: rework partial MID matching to favor current inbox

The current inbox is more important for partial Message-ID
matching, so we try harder on that to fix common errors before
moving onto other inboxes.  Then, prevent expensive scanning of
other inboxes by requiring a Message-ID length of at least 16
bytes.

Finally, we limit the overall partial responses to 200 when
scanning other inboxes to avoid excessive memory usage.
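
The limits above can be sketched in Python as follows. The 16-byte minimum and the cap of 200 come from the commit message; modelling each inbox as a plain iterable of Message-ID strings, matching by substring, and the function name are assumptions for illustration.

```python
MIN_PARTIAL_LEN = 16       # minimum length, in bytes, before other inboxes are scanned
MAX_PARTIAL_RESULTS = 200  # cap on partial responses when scanning other inboxes


def partial_matches(partial, current_inbox, other_inboxes):
    """Collect Message-IDs containing `partial`, current inbox first."""
    found = [mid for mid in current_inbox if partial in mid]
    if found:
        return found
    # Too short: scanning every other inbox would be expensive.
    if len(partial.encode("utf-8")) < MIN_PARTIAL_LEN:
        return found
    for inbox in other_inboxes:
        for mid in inbox:
            if partial in mid:
                found.append(mid)
                if len(found) >= MAX_PARTIAL_RESULTS:
                    return found  # bound memory use
    return found
```

Short fragments are therefore only ever resolved against the current inbox, and a long fragment that matches widely elsewhere stops after 200 hits.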

6 years ago v2writable: detect and use previous partition count
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 07:28:56 +0000 (07:28 +0000)]
v2writable: detect and use previous partition count

We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
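
One way to detect an existing partition count is to look at what is already on disk before falling back to a machine-dependent default. The Python sketch below assumes a layout in which partitions are subdirectories named 0, 1, 2, and so on; that layout and the function name are hypothetical and not taken from the commit message.

```python
import os


def detect_partitions(xap_dir, default):
    """Count existing partition directories, else return `default`.

    Only subdirectories whose names are decimal integers are counted,
    so unrelated entries in the same directory are ignored.
    """
    try:
        names = os.listdir(xap_dir)
    except FileNotFoundError:
        return default
    count = sum(
        1 for name in names
        if name.isdigit() and os.path.isdir(os.path.join(xap_dir, name))
    )
    return count or default
```

Reusing the detected count keeps a repository consistent even when it is opened on a machine whose default (for example, one derived from the CPU count) would differ.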

6 years ago scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 03:51:08 +0000 (03:51 +0000)]
scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping

It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and need the escaping.
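
For readers unfamiliar with the two variants, here is a short Python sketch of the "From " line handling. It follows the commonly documented mboxo and mboxrd rules and is not taken from the import script; the function names are hypothetical.

```python
import re

# mboxrd: any line of zero or more '>' followed by "From " gains one '>'.
_FROM_RD = re.compile(r"^(>*From )", re.MULTILINE)
# mboxo: only lines that start with "From " gain a '>'.
_FROM_O = re.compile(r"^(From )", re.MULTILINE)


def escape_mboxrd(body: str) -> str:
    """Reversible escaping used by mboxrd."""
    return _FROM_RD.sub(r">\1", body)


def escape_mboxo(body: str) -> str:
    """Lossy escaping used by mboxo; '>From ' lines are left alone."""
    return _FROM_O.sub(r">\1", body)


def unescape_mboxrd(body: str) -> str:
    """Strip exactly one '>' from lines matching '>' + '>'* + 'From '."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)
```

The difference matters because mboxo cannot tell an escaped "From " line from one that was already quoted in the original, while mboxrd round-trips both.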

6 years ago import: fall back to Sender for extracting name and email
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 03:49:42 +0000 (03:49 +0000)]
import: fall back to Sender for extracting name and email

This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>

6 years ago favor Received: date over Date: header globally
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 04:15:38 +0000 (04:15 +0000)]
favor Received: date over Date: header globally

The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy.  We still show the Date: in per-message (permalink)
views, which may expose users who have incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.

6 years ago v2writable: remove unnecessary skeleton commit
Eric Wong (Contractor, The Linux Foundation) [Tue, 6 Mar 2018 02:59:35 +0000 (02:59 +0000)]
v2writable: remove unnecessary skeleton commit

Not a big deal since we still commit to the skeleton for every
single partition (barrier work abandoned).

6 years ago search: each_smsg_by_mid uses skeleton if available
Eric Wong (Contractor, The Linux Foundation) [Mon, 5 Mar 2018 23:13:47 +0000 (23:13 +0000)]
search: each_smsg_by_mid uses skeleton if available

We do not need the large DBs for MID scans.

6 years ago search: favor skeleton DB for lookup_mail
Eric Wong (Contractor, The Linux Foundation) [Sun, 4 Mar 2018 20:04:29 +0000 (20:04 +0000)]
search: favor skeleton DB for lookup_mail

The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.

6 years ago INSTALL: document more optional dependencies
Eric Wong (Contractor, The Linux Foundation) [Mon, 5 Mar 2018 17:06:46 +0000 (17:06 +0000)]
INSTALL: document more optional dependencies

I've missed a few things over time :x

6 years ago v2: avoid redundant/repeated configs for git partition repos
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:56:15 +0000 (20:56 +0000)]
v2: avoid redundant/repeated configs for git partition repos

We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration.  The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.

6 years ago import: consolidate object info for v2 imports
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:33:25 +0000 (20:33 +0000)]
import: consolidate object info for v2 imports

It's easier to store everything in one array ref similar
to what our Git->check routine returns

6 years ago searchidx: store the primary MID in doc data for NNTP
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:21:05 +0000 (20:21 +0000)]
searchidx: store the primary MID in doc data for NNTP

We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen.  This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.

6 years ago nntp: fix NEWNEWS command
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 20:18:34 +0000 (20:18 +0000)]
nntp: fix NEWNEWS command

I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.

Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")

6 years ago nntp: use NNTP article numbers for lookups
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 18:47:47 +0000 (18:47 +0000)]
nntp: use NNTP article numbers for lookups

Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups.  This will prevent us
from finding the "wrong" internal Message-ID.

6 years ago mid: truncate excessively long MIDs early
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 17:57:57 +0000 (17:57 +0000)]
mid: truncate excessively long MIDs early

Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out.  It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
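
Truncation by byte length can be sketched in Python as below. The limit of 244 bytes and the function name are hypothetical values chosen for illustration; the commit message does not state the actual cutoff.

```python
# Hypothetical limit; the commit message does not give the real value.
MAX_MID_BYTES = 244


def truncate_mid(mid: str, limit: int = MAX_MID_BYTES) -> str:
    """Cut an over-long Message-ID down to at most `limit` bytes.

    Decoding with errors="ignore" drops a multi-byte character that
    the cut would otherwise split in half.
    """
    raw = mid.encode("utf-8")
    if len(raw) <= limit:
        return mid
    return raw[:limit].decode("utf-8", errors="ignore")
```

Two distinct long Message-IDs that share their first `limit` bytes collapse to the same term, which is acceptable only because duplicate Message-IDs are already handled elsewhere.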

6 years ago searchidx: add NNTP article number as a searchable term
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 17:42:20 +0000 (17:42 +0000)]
searchidx: add NNTP article number as a searchable term

Since we support duplicate MIDs in v2, the NNTP article number
becomes the true unique identifier and we want a way to do fast
lookups on it.

While we're at it, stop putting XPATH in the term partitions
since we only need it in the skeleton DB.

6 years ago searchidx: use add_boolean_term for internal terms
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 17:26:16 +0000 (17:26 +0000)]
searchidx: use add_boolean_term for internal terms

Aside from the Message-Id ('Q'), these terms do not appear in
content and thus have no business contributing to the Xapian
document length.

Thanks-to Olly Betts for the tip on xapian-discuss
<20180228004400.GU12724@survex.com>

6 years ago v2writable: generated Message-ID goes first
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 07:31:54 +0000 (07:31 +0000)]
v2writable: generated Message-ID goes first

This is to make SearchMsg behave more sanely under NNTP.

6 years ago searchidxskeleton: add a note about locking
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 07:16:29 +0000 (07:16 +0000)]
searchidxskeleton: add a note about locking

It's tempting to rely on the atomicity of smaller-than-PIPE_BUF
writes, but it doesn't work if mixed with larger ones.

6 years ago searchidx: avoid excessive XNQ indexing with diffs
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 05:55:26 +0000 (05:55 +0000)]
searchidx: avoid excessive XNQ indexing with diffs

When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface.  This results in better indexing performance and
10-15% smaller Xapian indices.

6 years ago mid: be strict with References, but loose on Message-Id
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 05:14:33 +0000 (05:14 +0000)]
mid: be strict with References, but loose on Message-Id

Traditionally we've been more lax on parsing Message-Id
and allow it without the angle brackets.  We've always been
strict on References and can't have it be pointlessly
large when some MUA decides to use HTML-escaped angle
brackets ("&lt;", "&gt;").
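
The strict and loose policies can be contrasted with a short Python sketch. The regular expression and function names are assumptions for illustration and do not reproduce the project's parser.

```python
import re

# An ID wrapped in literal angle brackets, with no whitespace inside.
_BRACKETED = re.compile(r"<([^<>\s]+)>")


def references(value: str):
    """Strict: only tokens wrapped in literal '<' and '>' count."""
    return _BRACKETED.findall(value)


def message_id(value: str):
    """Loose: prefer a bracketed ID, else accept the bare stripped value."""
    m = _BRACKETED.search(value)
    if m:
        return m.group(1)
    value = value.strip()
    return value or None
```

With this split, a References: header written with HTML-escaped brackets yields no IDs at all instead of one enormous bogus token, while a Message-Id lacking brackets is still usable.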

6 years ago searchidx: support indexing multiple MIDs
Eric Wong (Contractor, The Linux Foundation) [Sat, 3 Mar 2018 04:00:09 +0000 (04:00 +0000)]
searchidx: support indexing multiple MIDs

It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.

6 years ago search: revert to using 'Q' as a uniQue id per-Xapian conventions
Eric Wong (Contractor, The Linux Foundation) [Fri, 2 Mar 2018 20:46:55 +0000 (20:46 +0000)]
search: revert to using 'Q' as a uniQue id per-Xapian conventions

'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.