Sergey Matveev's repositories - public-inbox.git/log
Eric Wong (Contractor, The Linux Foundation) [Tue, 3 Apr 2018 11:09:08 +0000 (11:09 +0000)]
nntp: make XOVER, XHDR, OVER, HDR and NEWNEWS faster

While SQLite is faster than Xapian for some queries we
use, it sucks at handling OFFSET.  Fortunately, we do
not need offsets when retrieving sorted results and
can bake it into the query.

For inbox.comp.version-control.git (v1 Xapian),
XOVER and XHDR are over 20x faster.
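
A minimal sketch of the general technique, not the actual public-inbox Perl code: instead of asking SQLite to skip rows with OFFSET, the starting position is baked into the WHERE clause on the sorted column.  The `overview` table, its `num` column and both helper names are hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical stand-in for an overview table keyed by article number.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY, subject TEXT)")
db.executemany(
    "INSERT INTO overview (num, subject) VALUES (?, ?)",
    [(n, f"message {n}") for n in range(1, 1001)],
)


def page_with_offset(offset, limit):
    # SQLite still has to walk past `offset` rows before returning any.
    cur = db.execute(
        "SELECT num FROM overview ORDER BY num LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return [row[0] for row in cur]


def page_after(last_num, limit):
    # The primary-key lookup can seek directly to the first wanted row.
    cur = db.execute(
        "SELECT num FROM overview WHERE num > ? ORDER BY num LIMIT ?",
        (last_num, limit),
    )
    return [row[0] for row in cur]
```

Both helpers return the same rows when article numbers are contiguous; only the second avoids scanning the skipped range.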

Eric Wong (Contractor, The Linux Foundation) [Tue, 3 Apr 2018 11:09:07 +0000 (11:09 +0000)]
rename+rewrite test using Benchmark module

There'll be more performance-related tests in the future.

Eric Wong (Contractor, The Linux Foundation) [Tue, 3 Apr 2018 11:09:06 +0000 (11:09 +0000)]
t/thread-all.t: modernize test to support modern inboxes

We'll be adding more tests in the same vein as this
to improve NNTP performance.

Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:56 +0000 (00:04 +0000)]
over: speedup get_thread by avoiding JOIN

JOIN operations on SQLite can be disastrously slow.
This reduces per-message pages with the thread overview
at the bottom of those pages from over 800ms to ~60ms.
In comparison, the v1 code took around 70-80ms using
Xapian on my machine.
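
One common way to avoid a JOIN is to split it into two simple indexed lookups.  The sketch below assumes that approach; the source does not say exactly how get_thread was rewritten, and the `overview` schema (`mid`, `tid`, `ts`) and function names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE overview (
    num INTEGER PRIMARY KEY,
    mid TEXT UNIQUE,   -- Message-ID (illustrative)
    tid INTEGER,       -- thread id (illustrative)
    ts  INTEGER        -- sort timestamp (illustrative)
);
CREATE INDEX idx_tid_ts ON overview (tid, ts);
""")
db.executemany("INSERT INTO overview VALUES (?, ?, ?, ?)", [
    (1, "a@example", 10, 100),
    (2, "b@example", 10, 200),
    (3, "c@example", 20, 150),
    (4, "d@example", 10, 300),
])


def thread_with_join(mid):
    # Self-join: find the message, then every row sharing its thread id.
    cur = db.execute(
        "SELECT t.num FROM overview AS m "
        "JOIN overview AS t ON t.tid = m.tid "
        "WHERE m.mid = ? ORDER BY t.ts",
        (mid,),
    )
    return [row[0] for row in cur]


def thread_without_join(mid):
    # Step 1: resolve the thread id with a single-row lookup.
    row = db.execute(
        "SELECT tid FROM overview WHERE mid = ?", (mid,)
    ).fetchone()
    if row is None:
        return []
    # Step 2: fetch the thread through the (tid, ts) index.
    cur = db.execute(
        "SELECT num FROM overview WHERE tid = ? ORDER BY ts", (row[0],)
    )
    return [r[0] for r in cur]
```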

Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:55 +0000 (00:04 +0000)]
www: rework query responses to avoid COUNT in SQLite

In many cases, we do not care about the total number of
messages.  It's a rather expensive operation in SQLite
(Xapian only provides an estimate).

For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system.  Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
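
When the total is only needed to decide whether a "next page" link should appear, one option is to fetch one row more than is displayed.  This is a sketch under that assumption, not a description of what the WWW code does; the `overview` table and `newest` helper are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY)")
db.executemany("INSERT INTO overview VALUES (?)", [(n,) for n in range(1, 6)])


def newest(limit):
    # Ask for limit + 1 rows; the extra row only signals "there is more".
    cur = db.execute(
        "SELECT num FROM overview ORDER BY num DESC LIMIT ?", (limit + 1,)
    )
    nums = [row[0] for row in cur]
    return nums[:limit], len(nums) > limit
```

No COUNT(*) is issued, so the cost depends on the page size rather than on the size of the inbox.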

Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:54 +0000 (00:04 +0000)]
t/over: test empty Subject: line matching

We need to ensure we don't match NULL 'sid' columns in the
`over' table.
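
The SQL behaviour behind this can be sketched as follows, assuming a NULL `sid` stands for a message without a usable Subject.  The table layout and helper names are hypothetical; the actual test in t/over is not reproduced here.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY, sid INTEGER)")
db.executemany(
    "INSERT INTO overview VALUES (?, ?)",
    [(1, None), (2, None), (3, 7), (4, 7)],
)


def by_sid_equals(sid):
    # "=" never matches NULL, so subject-less rows are not grouped.
    cur = db.execute(
        "SELECT num FROM overview WHERE sid = ? ORDER BY num", (sid,)
    )
    return [row[0] for row in cur]


def by_sid_is(sid):
    # "IS" treats NULL as equal to NULL and would lump those rows together.
    cur = db.execute(
        "SELECT num FROM overview WHERE sid IS ? ORDER BY num", (sid,)
    )
    return [row[0] for row in cur]
```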

Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:53 +0000 (00:04 +0000)]
v2writable: simplify barrier vs checkpoints

searchidx_checkpoint was too convoluted and confusing.
Since barrier is mostly the same thing, use that instead
and add an fsync option for the overview DB.

Eric Wong (Contractor, The Linux Foundation) [Mon, 2 Apr 2018 00:04:52 +0000 (00:04 +0000)]
replace Xapian skeleton with SQLite overview DB

This ought to provide better performance and scalability
which is less dependent on inbox size.  Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.

This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.

Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.

Eric Wong (Contractor, The Linux Foundation) [Sun, 1 Apr 2018 06:30:37 +0000 (06:30 +0000)]
search: reduce columns stored in Xapian

We can store :bytes and :lines in doc_data since we never
sort or search by them.  We don't have much use for the Date:
stamp at the moment, either.

Eric Wong (Contractor, The Linux Foundation) [Sun, 1 Apr 2018 23:24:26 +0000 (23:24 +0000)]
scripts/import_vger_from_mbox: set address properly

For objects like Inbox, the '-' prefixed hash keys are
probably intended for auto-generated/hidden parameters.

Eric Wong (Contractor, The Linux Foundation) [Sun, 1 Apr 2018 23:23:44 +0000 (23:23 +0000)]
truncate Message-IDs and References consistently

We need to stop ghost messages from generating longer
Message-IDs than Xapian can handle with terms.

Eric Wong (Contractor, The Linux Foundation) [Sun, 1 Apr 2018 23:23:07 +0000 (23:23 +0000)]
v2writable: fix parallel termination

I was too aggressively disabling parallelization to speed up
the test suite and broke this :x  Re-enable parallelization
for the v2reindex test so we can catch it later.

Eric Wong (Contractor, The Linux Foundation) [Sun, 1 Apr 2018 23:15:04 +0000 (23:15 +0000)]
v2: one file, really

We need to ensure there is only one file in the top-level tree
at any commit so the "add; remove; add;" sequence on the same
message is detected properly.

Otherwise, git will not detect the second "add" unless
a second message is added to history.

Deletes are now stored in "d" (and not "D" or "_/D") at the
top-level.  There's no need to have a "_" to reduce churn
as "m" and "d" should never co-exist.  It's now lowercased to
make it easier to distinguish from "D" in git-log output.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 20:55:13 +0000 (20:55 +0000)]
searchidx: correct warning for over-vivification

We will vivify multiple ghosts if a message has multiple
Message-IDs.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 17:46:31 +0000 (17:46 +0000)]
v2: respect core.sharedRepository in git configs

Ensure -convert and -compact do not make repositories
unreadable on live servers.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 18:15:25 +0000 (18:15 +0000)]
t/v2writable: use simplify permissions reading

We have Git::qx nowadays.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 18:03:01 +0000 (18:03 +0000)]
search: move permissions handling to InboxWritable

We'll be making sure V2Writable uses this.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 20:31:48 +0000 (20:31 +0000)]
convert: avoid redundant "done\n" statement for fast-import

This bug was hidden due to timing problems with eatmydata or
running with tmpfs for TMPDIR.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:47 +0000 (01:20 +0000)]
msgtime: parse 3-digit years properly

Some folks had bad mail clients which generated 3-digit years
around Y2K...
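
For reference, the obsolete date syntax in RFC 5322 (section 4.3) says a 3-digit year is interpreted by adding 1900, and a 2-digit year by adding 2000 (00-49) or 1900 (50-99).  The helper below sketches that rule; whether msgtime follows it exactly is not stated here, and `normalize_year` is a hypothetical name.

```python
def normalize_year(year_text):
    """Expand a 2- or 3-digit year string per RFC 5322 section 4.3."""
    year = int(year_text)
    digits = len(year_text)
    if digits == 2:
        # 00-49 are taken as 20xx, 50-99 as 19xx.
        return year + (2000 if year < 50 else 1900)
    if digits == 3:
        # e.g. "100" was a broken client's rendering of the year 2000.
        return year + 1900
    return year
```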

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:48 +0000 (01:20 +0000)]
feed: optimize query for feeds, too

This is a smaller improvement than the landing /$INBOX/ page
because full message bodies are shown; but still saves around
100ms for my system with LKML.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:46 +0000 (01:20 +0000)]
view: drop load_results

It's no longer necessary to have this since load_expand
now populates $smsg->mid with the "preferred" Message-ID.
This saves around 10ms on the homepage for me.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:45 +0000 (01:20 +0000)]
view: speed up homepage loading time with date clamp

This saves over 400ms on my system with the full LKML
with over 2.8 million messages.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:44 +0000 (01:20 +0000)]
v2writable: go backwards through alternate Message-IDs

This is consistent with how we internally generate new
Message-IDs to break conflicts and allows ->reindex to
succeed while walking backwards through history

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:43 +0000 (01:20 +0000)]
wwwstream: flesh out clone instructions for v2

Relying solely on git for v2 repos is probably not
so useful, so add pointers to public-inbox-init/index
commands.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:42 +0000 (01:20 +0000)]
v2writable: convert some fatal reindex errors to warnings

By supporting purge and allowing users to delete git partitions,
we can open up ourselves to gaps and un-reindexible data.  Let
that be.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:41 +0000 (01:20 +0000)]
v2writable: allow gaps in git partitions

Somebody may only care about the most recent history,
so allow -init and -index to operate quietly on missing
partitions.

Eric Wong (Contractor, The Linux Foundation) [Fri, 30 Mar 2018 01:20:40 +0000 (01:20 +0000)]
search: warn on reopens and die on total failure

-watch on a busy/giant Maildir caused too many Xapian
errors while attempting to browse.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:20 +0000 (20:17 +0000)]
mda: support v2 inboxes

I mainly focus on -watch for mirroring busy mailing lists, but
using -mda should remain an option.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:19 +0000 (20:17 +0000)]
public-inbox-compact: new tool for driving xapian-compact

Having multiple Xapian partitions is mostly pointless after
the initial import.  We can compact all the partitions into
one while keeping the skeleton separate.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:18 +0000 (20:17 +0000)]
v2writable: initializing an existing inbox is idempotent

And we do not want to start making confused repos if somebody
leaves out "-V2" the second time around.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 20:17:17 +0000 (20:17 +0000)]
import: run_die supports redirects as spawn does

We'll be using it in more future tests and scripts.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 19:14:12 +0000 (19:14 +0000)]
search: retry_reopen on first_smsg_by_mid

This was causing errors while attempting to load messages via
the WWW interface while mass-importing LKML.  While we're at it,
remove unnecessary eval from lookup_article.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:57 +0000 (09:57 +0000)]
view: get rid of some unnecessary imports

We no longer need some of these old subroutines which
assumed a single Message-ID for each message.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:56 +0000 (09:57 +0000)]
www: cleanup expensive fallback for legacy URLs

Back in the day, we compressed long Message-IDs to SHA-1
hexdigests for the URL.  This now issues a 301 redirect in
the hopes we can remove these checks some day to reduce
overhead.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:55 +0000 (09:57 +0000)]
mbox: avoid extracting Message-ID for linkification

We can avoid a small amount of overhead and use the "preferred"
Message-ID based on what is in the SearchMsg object.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:54 +0000 (09:57 +0000)]
v2writable: cleanup: get rid of unused fields

The layout of this structure ended up being a bit different
and the read-only access is handled through the ::Inbox class,
instead.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:53 +0000 (09:57 +0000)]
search: move find_doc_ids to searchidx

We do not need this subroutine for read-only use in Search.pm

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:52 +0000 (09:57 +0000)]
search: get rid of most lookup_* subroutines

Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.

For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:51 +0000 (09:57 +0000)]
search: cleanup uniqueness checking

The only Xapian term which should be unique is the NNTP article
number; so we no longer need find_unique_doc_id.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:50 +0000 (09:57 +0000)]
v2writable: support purging messages from git entirely

Purging existing messages is fairly straightforward since we can
take advantage of Xapian and lookup the git object_id with it.

Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).

Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:49 +0000 (09:57 +0000)]
public-inbox-convert: tool for converting old to new inboxes

This should make it easier to let users perform comparisons and
migrate to v2 if needed.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:48 +0000 (09:57 +0000)]
searchmsg: document why we store To: and Cc: for NNTP

Otherwise I would forget and be tempted to remove them.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:47 +0000 (09:57 +0000)]
www: fix attachment downloads for conflicted Message-IDs

By using the "primary" Message-ID in WwwAttach, we can avoid
conflicts in the links we use for downloading attachments.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:46 +0000 (09:57 +0000)]
lookup by Message-ID favors the "primary" one

The Message-ID mapped to an NNTP article number is stronger,
so we will favor that for attachment lookups.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:45 +0000 (09:57 +0000)]
v2writable: append, instead of prepending generated Message-ID

The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox.  So whatever Message-ID we use
to deduplicate internally will be secondary and less important.

All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.

Eric Wong (Contractor, The Linux Foundation) [Thu, 29 Mar 2018 09:57:44 +0000 (09:57 +0000)]
www: remove unnecessary ghost checks

We do not need to care about ghosts at multiple call sites; they
cannot have a {blob} field and we've stored the blob field in
Xapian since SCHEMA_VERSION=13.

Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:31:44 +0000 (20:31 +0000)]
www: support cloning individual v2 git partitions

This will require multiple client invocations, but should reduce
load on the server and make it easier for readers to only clone
the latest data.

Unfortunately, supporting a cloneurl file for externally-hosted
repos will be more difficult as we cannot easily know if the
clones use v1 or v2 repositories, or how many git partitions
they have.

Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:24:21 +0000 (20:24 +0000)]
githttpbackend: avoid infinite loop on generic PSGI servers

We must detect EOF when reading a POST body with standard PSGI servers.
This does not affect deployments using the standard public-inbox-httpd;
but most smaller inboxes should be able to get away using a generic
PSGI server.
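
The failure mode can be illustrated with a Python analogue (PSGI itself is Perl, and this is not the githttpbackend code): a loop that waits for Content-Length bytes must also stop when the input stream reports EOF, or a short body makes it spin forever.  `read_body` and its parameters are hypothetical.

```python
import io


def read_body(stream, content_length, bufsize=8192):
    """Read up to content_length bytes, stopping early on EOF."""
    chunks = []
    remaining = content_length
    while remaining > 0:
        buf = stream.read(min(bufsize, remaining))
        if not buf:
            # EOF: the client sent fewer bytes than it announced.
            break
        chunks.append(buf)
        remaining -= len(buf)
    return b"".join(chunks)


# A body shorter than the announced length still terminates.
short = read_body(io.BytesIO(b"abc"), 10)
```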

Eric Wong (Contractor, The Linux Foundation) [Tue, 27 Mar 2018 20:09:06 +0000 (20:09 +0000)]
http: fix modification of read-only value

This fails in the rare case we get a partial send() on "\r\n"
when writing chunked HTTP responses out.

Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:33:18 +0000 (19:33 +0000)]
view: depend on SearchMsg for Message-ID

Since we need to handle messages with multiple and duplicate
Message-ID headers, our thread skeleton display must account
for that.

Since we have a "preferred" Message-ID in case of conflicts,
use it as the UUID in an Atom feed so readers do not get
confused by conflicts.

Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:25:47 +0000 (19:25 +0000)]
searchview: remove unnecessary imports from MID module

We do not need many of these, anymore.

Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:15:10 +0000 (19:15 +0000)]
www: get rid of unnecessary 'inbox' name reference

We use the actual Inbox object everywhere else and don't
need the name of the inbox separated from the object.

Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 19:09:08 +0000 (19:09 +0000)]
v2writable: warn on unseen deleted files

It would be a bug to have deleted files marked but not
seen in our histories.

Eric Wong (Contractor, The Linux Foundation) [Mon, 26 Mar 2018 18:19:31 +0000 (18:19 +0000)]
searchidx: warn about vivifying multiple ghosts

This should help us detect bugs sooner in case we have
space waste problems.

Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 20:29:24 +0000 (20:29 +0000)]
view: permalink (per-message) view shows multiple messages

This needs tests and further refinement, but current tests pass.

Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 18:00:41 +0000 (18:00 +0000)]
feed: fix new.html for v2

I forget this endpoint is still accessible (even if not linked).
This also simplifies new.html all around and removes some unused
clutter from the old days while we're at it.

Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 02:24:03 +0000 (02:24 +0000)]
t/psgi_v2: minimal test for Atom feed and t.mbox.gz

Some test coverage is better than none, here.

Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 02:03:46 +0000 (02:03 +0000)]
search: reopen DB if each_smsg_by_mid fails

This gives more up-to-date data in that case and allows us
to avoid reopening in more places ourselves.

Eric Wong (Contractor, The Linux Foundation) [Fri, 23 Mar 2018 01:54:16 +0000 (01:54 +0000)]
www: $MESSAGE_ID/raw endpoint supports "duplicates"

Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.
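
For readers unfamiliar with mboxrd: every body line matching ">*From " gains one extra ">" when written, and loses one when read, which keeps several messages separable in a single response.  A small sketch of that quoting rule follows; the function names are hypothetical and this is not the public-inbox implementation.

```python
import re

# Lines that would be mistaken for an mbox "From " separator.
_ESCAPE = re.compile(rb"^(>*From )", re.MULTILINE)
_UNESCAPE = re.compile(rb"^>(>*From )", re.MULTILINE)


def mboxrd_escape(body):
    # Prefix one more ">" so the quoting stays reversible.
    return _ESCAPE.sub(rb">\1", body)


def mboxrd_unescape(body):
    # Strip exactly one ">" from each quoted "From " line.
    return _UNESCAPE.sub(rb"\1", body)


original = b"From here on\n>From a quote\nplain text\n"
escaped = mboxrd_escape(original)
```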

Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 18:21:54 +0000 (18:21 +0000)]
import: consolidate mid prepend logic, here

This also quiets down warnings from -watch when spam training
happens on messages without Message-Id.

Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:48:29 +0000 (08:48 +0000)]
feed: $INBOX/new.atom endpoint supports v2 inboxes

We can no longer rely on tree name lookups for v2.  This also
optimizes v1 by relying on git blob object_id lookups while
avoiding process spawning overhead for "git log".

Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:16:22 +0000 (08:16 +0000)]
v2writable: DEBUG_DIFF respects $TMPDIR

The File::Temp API is a bit tricky and needs TMPDIR explicitly
enabled if a template is given.

Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 08:14:19 +0000 (08:14 +0000)]
v2writable: clarify header cleanups

We want to make it clear to the code and DEBUG_DIFF users
that we do not introduce messages with unsuitable headers
into public archives.

Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:39:30 +0000 (03:39 +0000)]
v2writable: add NNTP article number regeneration support

Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian.  Article
numbers will not remain consistent when we add purge support,
though.

Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 03:06:56 +0000 (03:06 +0000)]
t/altid.t: extra tests for mid_set

I'll be relying on some of this behavior for regenerating NNTP
article numbers off fresh clones.

Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 09:04:50 +0000 (09:04 +0000)]
v2writable: support reindexing Xapian

This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.

Eric Wong (Contractor, The Linux Foundation) [Thu, 22 Mar 2018 00:12:25 +0000 (00:12 +0000)]
fix syntax warnings

I keep forgetting to run "make syntax"

Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 08:29:25 +0000 (08:29 +0000)]
msgmap: add tmp_clone to create an anonymous copy

This will be used to keep track of Message-ID <-> NNTP Article
numbers to prevent article number reuse when reindexing.

Eric Wong (Contractor, The Linux Foundation) [Wed, 21 Mar 2018 01:52:58 +0000 (01:52 +0000)]
use both Date: and Received: times

We want to rely on Date: to sort messages within individual
threads since it keeps messages from git-send-email(1) sorted.
However, since developers occasionally have the clock set
wrong on their machines, sort overall messages by the newest
date in a Received: header so the landing page isn't forever
polluted by messages from the future.

This also gives us determinism for commit times in most cases,
as we'll use the Received: timestamp there, as well.
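
The two timestamps can be sketched in Python as below.  This is an illustration rather than the Perl implementation: `msg_times` is a hypothetical helper, it assumes unfolded Received: headers whose date follows the last ";", and falling back to Date: when no Received: header exists is an assumption of this sketch.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def msg_times(raw):
    """Return (date_ts, received_ts) as epoch seconds."""
    msg = message_from_string(raw)
    date_ts = parsedate_to_datetime(msg["Date"]).timestamp()
    received = []
    for header in msg.get_all("Received", []):
        # The timestamp is the part after the final semicolon.
        stamp = header.rsplit(";", 1)[-1].strip()
        received.append(parsedate_to_datetime(stamp).timestamp())
    # Newest Received: wins; fall back to Date: (assumption).
    received_ts = max(received) if received else date_ts
    return date_ts, received_ts


raw = (
    "Date: Fri, 01 Jan 2038 00:00:00 +0000\n"
    "Received: from a by b; Wed, 21 Mar 2018 01:52:58 +0000\n"
    "Subject: clock set wrong\n"
    "\n"
    "body\n"
)
date_ts, received_ts = msg_times(raw)
```

Sorting a thread by `date_ts` keeps a patch series in order, while sorting the landing page by `received_ts` keeps the message from 2038 out of the top slot.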

Eric Wong (Contractor, The Linux Foundation) [Tue, 20 Mar 2018 21:00:00 +0000 (21:00 +0000)]
InboxWritable: add mbox/maildir parsing + import logic

This will make imports easier to implement, as well as supporting
future Filter API users.  It allows simplifying our ad-hoc
import_vger_from_mbox script.

Eric Wong (Contractor, The Linux Foundation) [Tue, 20 Mar 2018 19:50:03 +0000 (19:50 +0000)]
import: discard all the same headers as MDA

Reduce the places where we have duplicate logic for discarding
unwanted headers.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 20:49:30 +0000 (20:49 +0000)]
introduce InboxWritable class

This code will be shared with future mass-import tools.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 23:24:50 +0000 (23:24 +0000)]
content_id: do not take Message-Id into account

If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator.  This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
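
A rough sketch of the idea: hash a fixed set of headers plus the body, deliberately leaving Message-Id out, so two copies that differ only in a regenerated Message-Id get the same digest.  The header list, the SHA-256 choice and the `content_id` name are assumptions for illustration; the real content_id inputs are not listed in this log.

```python
import hashlib
from email import message_from_string

# Illustrative header selection; Message-Id is intentionally absent.
HASHED_HEADERS = ("Subject", "From", "Date", "To", "Cc")


def content_id(raw):
    msg = message_from_string(raw)
    digest = hashlib.sha256()
    for name in HASHED_HEADERS:
        for value in msg.get_all(name, []):
            # NUL separators keep header boundaries unambiguous.
            digest.update(f"{name}\0{value}\0".encode("utf-8"))
    digest.update(b"body\0")
    digest.update(msg.get_payload().encode("utf-8"))
    return digest.hexdigest()


first = "Message-Id: <one@example>\nFrom: a@example\nSubject: hi\n\nhello\n"
again = "Message-Id: <two@example>\nFrom: a@example\nSubject: hi\n\nhello\n"
other = "Message-Id: <one@example>\nFrom: a@example\nSubject: hi\n\nbye\n"
```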

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:59 +0000 (08:14 +0000)]
v2writable: remove "resent" message for duplicate Message-IDs

public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy.  This message was only useful
during initial development and imports.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:58 +0000 (08:14 +0000)]
v2writable: add DEBUG_DIFF env support

This can help us track down some differences during import,
if needed.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:57 +0000 (08:14 +0000)]
scripts/import_vger_from_mbox: filter out same headers as MDA

Perhaps we should filter these headers out in Import

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:56 +0000 (08:14 +0000)]
v2writable: allow disabling parallelization

While parallel processes improve import speed for initial
imports, they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init.  Disabling
them saves some memory for daily use and even helps improve
readability of some subroutines by showing which methods they call
remotely.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:55 +0000 (08:14 +0000)]
searchidxpart: s/barrier/remote_barrier/

Be consistent with our "remote_" prefix for other IPC subs

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:54 +0000 (08:14 +0000)]
watchmaildir: support v2 repositories

Unfortunately this gives up some minor performance tweaks we
made to avoid reforking import processes.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:53 +0000 (08:14 +0000)]
v2writable: ensure ->done is idempotent

This matches Import::done behavior

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:52 +0000 (08:14 +0000)]
t/watch_maildir: note the reason for FIFO creation

I had to dig through commit history for this and we should
better document our tests (along with everything else).

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:51 +0000 (08:14 +0000)]
Lock: new base class for writable lockers

This reduces code duplication needed for locking
and hopefully makes things easier to understand.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:50 +0000 (08:14 +0000)]
index: s/GIT_DIR/REPO_DIR/

No functional changes, yet, but this makes future changes
easier-to-read.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:49 +0000 (08:14 +0000)]
import: enable locking under v2

Instead of using ssoma-based locking, enable locking via Import
for now.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:48 +0000 (08:14 +0000)]
v2writable: test for idempotent removals

This will make reindexing easier.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:47 +0000 (08:14 +0000)]
import: switch to URL-safe Base64 for Message-IDs

Hexdigests are too long and shorter Message-IDs are easier
to deal with.
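
The size difference is easy to see with a 20-byte digest: hex needs 40 characters, while unpadded URL-safe Base64 needs 27 and uses only characters that are safe in a URL path.  SHA-1 and the `digest_forms` helper are assumptions for this sketch; the digest actually used for generated Message-IDs is not named in this entry.

```python
import base64
import hashlib


def digest_forms(content):
    """Return (hexdigest, unpadded URL-safe Base64) of one digest."""
    raw = hashlib.sha1(content).digest()
    hexed = raw.hex()
    # urlsafe_b64encode uses "-" and "_" instead of "+" and "/".
    b64 = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return hexed, b64


hexed, b64 = digest_forms(b"example message content")
```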

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:46 +0000 (08:14 +0000)]
import: force Message-ID generation for v1 here

This allows us to share code for generating Message-IDs
between v1 and v2 repos.

For v1, this introduces a slight incompatibility in message
removal iff the original message lacked a Message-ID AND
the training request came from a message which did not
pass through the public-inbox:

The workaround for this would be to reuse the bad message from
the archive itself.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:45 +0000 (08:14 +0000)]
watchmaildir: use content_digest to generate Message-Id

This can probably be moved to Import for code reuse.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:44 +0000 (08:14 +0000)]
mid: mid_mime uses v2-compatible mids function

This allows us to be more consistent in dealing with completely
empty Message-Ids.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:43 +0000 (08:14 +0000)]
import: implement barrier operation for v1 repos

This will allow WatchMaildir to use ->barrier operations instead
of reaching inside for nchg.  This also ensures dumb HTTP
clients can see changes to V2 repos immediately.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:42 +0000 (08:14 +0000)]
import: (v2): write deletes to a separate '_' subdirectory

In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree.  This keeps the
top-level tree small and more amenable to deltafication.
This helps the common case where "m" is the most commonly
changed file at the top level.

Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:41 +0000 (08:14 +0000)]
import: (v2) delete writes the blob into history in subdir

This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.

This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs).  It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:40 +0000 (08:14 +0000)]
skeleton: barrier init requires a lock

Writing to the main skeleton pipe requires a lock since it's
shared with partition processes.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:39 +0000 (08:14 +0000)]
v2writable: implement remove correctly

We need to hide removals from anybody hitting the search engine.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:38 +0000 (08:14 +0000)]
search: allow ->reopen to be chainable

Makes life a little easier for V2Writable...

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:37 +0000 (08:14 +0000)]
searchidx: do not delete documents while iterating

Followup-to: ebb59815035b42c2
  ("searchidx: do not modify Xapian DB while iterating")

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:36 +0000 (08:14 +0000)]
v2writable: remove unnecessary idx_init call

We no longer need it with ->barrier working

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:35 +0000 (08:14 +0000)]
use string ref for Email::Simple->new

Email::Simple is slightly faster this way, and Email::MIME
and PublicInbox::MIME both wrap that.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:34 +0000 (08:14 +0000)]
v2writable: support "barrier" operation to avoid reforking

Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient.  Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.

Eric Wong (Contractor, The Linux Foundation) [Mon, 19 Mar 2018 08:14:33 +0000 (08:14 +0000)]
content_id: use Sender header if From is not available

We will be using Sender: in more places if the From: header
is not available, this is one of them.

Followup-to: ("import: fall back to Sender for extracting name and email")