Eric Wong [Sat, 16 Oct 2021 01:00:55 +0000 (01:00 +0000)]
httpd: move pipeline logic into event_step
Most of the HTTP server code was written for Danga::Socket and
not fully-transitioned to take advantage of PublicInbox::DS.
This change brings it up-to-date with the style of pipeline
handling used for -imapd and -nntpd.
Eric Wong [Sat, 16 Oct 2021 01:00:54 +0000 (01:00 +0000)]
imapd+nntpd: drop timer-based expiration
It's needlessly complex and O(n), so it doesn't scale well to a
high number of clients nor is it easy-to-scale with the data
structures available to us in pure Perl.
In any case, I see no evidence of either -imapd or -nntpd
experiencing high connection loads on public-facing sites.
-httpd has never had its own timer-based expiration, either.
Fwiw, public-inbox.org itself has been running a public-facing
HTTP/HTTPS server with no userspace idle client expiration for
the past 8 years or so with no ill effect. Clients can come and go
as they wish, and SO_KEEPALIVE takes care of truly broken
connections if they're gone for ~2 hours.
Internet connections drop all the time, so it should be harmless to
drop connections w/o warning since both NNTP and IMAP protocols
have well-defined semantics for determining if a message was
truncated (as does HTTP/1.1+).
Eric Wong [Fri, 15 Oct 2021 14:02:15 +0000 (14:02 +0000)]
lei forget-search: support multiple args
I've been testing a lot of searches which I don't want to keep
around, so make it easy to remove a bunch at once. We'll behave
like rm(1) and keep going in the face of failure.
Eric Wong [Fri, 15 Oct 2021 13:30:55 +0000 (13:30 +0000)]
lei + ipc: simplify process reaping
Simplify our APIs and force dwaitpid() to work in async mode for
all lei workers. This avoids having lingering zombies for
parallel searches if one worker finishes soon before another.
The old distinction between "old" and "new" workers was
needlessly complex, error-prone, and embarrassingly bad.
We also never handled v2:// writers properly before on
Ctrl-C/Ctrl-Z (SIGINT/SIGTSTP), so add them to @WQ_KEYS
to ensure they get handled by $lei when appropriate.
Eric Wong [Fri, 15 Oct 2021 09:52:53 +0000 (09:52 +0000)]
lei q: avoid kw lookup failure on remote mboxrd
When importing several sources in parallel via http(s) mboxrd,
we need to be able to get keywords of uncommitted documents
directly from shard workers. Otherwise, Xapian DocNotFound
errors happen because the read-only LeiSearch won't see
documents from uncommitted transactions. Keep in mind that it's
possible the keywords can be changed on-the-fly even for
uncommitted documents because of inotify watches from LeiNoteEvent.
Eric Wong [Fri, 15 Oct 2021 07:30:01 +0000 (07:30 +0000)]
www: various help text updates
`dt:' documentation is redundant with `d:' approxidate support;
so drop `dt:' since mairix uses `d:'. We'll also document
`rt:' since there are legit messages from senders with broken
clocks.
Reduce indentation level of help texts to be in 2-space
increments to avoid using too much horizontal space.
We'll always place IMAP ahead of NNTP since it's alphabetical
and there's likely more IMAP clients out there.
Add "--ng NEWSGROUP" to -init instructions if configured.
There's also some minor wording changes throughout.
Eric Wong [Thu, 14 Oct 2021 13:16:09 +0000 (13:16 +0000)]
lei up --all: send signals to workers, receive errors
The redispatch mechanism wasn't routing signals and messages
between redispatched workers and script/lei properly. We now
rely on PktOp to do bidirectional message forwarding and
carefully avoid circular references by using PktOp.
Eric Wong [Thu, 14 Oct 2021 13:16:07 +0000 (13:16 +0000)]
lei: TSTP affects all curl and related subprocesses
By relying more on pgroups for remaining processes,
this lets us pause all curl+tail subprocesses with a single
kill(2) to avoid cluttering stderr.
We won't bother pausing the pigz/gzip/bzip2/xz compressor
process nor cat-file processes, though, since those don't write
to the terminal and they idle soon after the workers react to
SIGSTOP.
AutoReap is hoisted out from TestCommon.pm. CLONE_SKIP
is gone since we won't be using Perl threads any time
soon (they're discouraged by the maintainers of Perl).
Eric Wong [Thu, 14 Oct 2021 13:16:06 +0000 (13:16 +0000)]
git: cat-file --batch are their own pgrp
We want these long-lived processes to die naturally when their
parent dies. Hopefully this improves graceful shutdown for
-extindex because I'm interrupting a lot of reindexing...
Eric Wong [Thu, 14 Oct 2021 13:16:05 +0000 (13:16 +0000)]
git: ->fail invokes current callback
While we try to invoke all pending callbacks to force error
handling, the current callback wasn't getting invoked on
async_abort if my_read/my_readline failed.
Eric Wong [Thu, 14 Oct 2021 04:32:53 +0000 (04:32 +0000)]
clone+fetch: respect umask for all downloaded files
Since public inboxes are usually intended to be public,
the File::Temp default permission of 0600 is wrong.
Just respect the user's umask in this case as git-clone
does.
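A minimal sketch of the idea, assuming a File::Temp-created file which
gets widened to whatever the umask allows. The template name and the
pinned 022 umask are only for illustration and are not taken from the
actual clone/fetch code.

```perl
use strict;
use warnings;
use File::Temp ();

my $old_umask = umask(022); # pin the umask so the result is predictable
my $tmp = File::Temp->new(TEMPLATE => 'clone-XXXX', TMPDIR => 1);
my $path = $tmp->filename;
my $before = (stat($path))[2] & 07777; # File::Temp creates files as 0600

# widen to what a plain open(2) with mode 0666 would have produced
my $want = 0666 & ~umask();
chmod($want, $path) or die "chmod $path: $!";
my $after = (stat($path))[2] & 07777;

umask($old_umask); # restore the caller's umask
printf("%04o -> %04o\n", $before, $after);
```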
This doesn't work for "lei add-external --mirror", yet;
but it will...
Eric Wong [Thu, 14 Oct 2021 03:12:25 +0000 (03:12 +0000)]
lei inspect: account for non-extindex inboxes
Inbox->xdb does not exist, but this code path was apparently
never tested :x I noticed this on a basic v2 inbox, but it could
happen with any v1/v2 inbox. Move ->num2docid into Search
so it's less awkward to use.
Eric Wong [Thu, 14 Oct 2021 06:06:29 +0000 (06:06 +0000)]
extindex: guard against buggy unrefs
I noticed some unref messages which shouldn't have been
happening, but they were. Which is troubling. So add
a guard around an unref path until we can get to the bottom
of this.
Eric Wong [Wed, 13 Oct 2021 10:16:08 +0000 (10:16 +0000)]
eml: avoid Encode 2.87..3.12 leak
Encode::FB_CROAK leaks memory in old versions of Encode:
<https://rt.cpan.org/Public/Bug/Display.html?id=139622>
Since I expect there's still many users on old systems and old
Perls, we can use "$SIG{__WARN__} = \&croak" here with
Encode::FB_WARN to emulate Encode::FB_CROAK behavior.
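A rough sketch of that emulation, assuming a small wrapper sub
(`decode_or_croak' is a hypothetical name, not the one used in Eml):

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode ();

# Hypothetical wrapper: decode UTF-8 octets and die on malformed input
# without ever passing Encode::FB_CROAK to Encode.
sub decode_or_croak {
	my ($octets) = @_; # a copy, since FB_WARN may modify its source
	# any warning emitted while decoding becomes a fatal error
	local $SIG{__WARN__} = \&croak;
	Encode::decode('UTF-8', $octets, Encode::FB_WARN);
}

my $ok = decode_or_croak("caf\xc3\xa9"); # valid UTF-8
my $bad = eval { decode_or_croak("caf\xe9 au lait") }; # lone 0xE9 byte
my $err = $@;
```

The `local' keeps the handler scoped to the decode call, so unrelated
warnings elsewhere stay non-fatal.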
Eric Wong [Wed, 13 Oct 2021 10:16:07 +0000 (10:16 +0000)]
t/www_listing: require opt-in for grokmirror tests
grokmirror 2.x seems to idle in several places for 5s at-a-time,
causing t/www_listing.t to take longer than "make check-run" on
a 4-core system when run without grokmirror. So make it
optional but add some test knobs to allow tailing the log
output so I can see what's going on.
Eric Wong [Wed, 13 Oct 2021 07:00:36 +0000 (07:00 +0000)]
treewide: use warn() or carp() instead of env->{psgi.errors}
Large chunks of our codebase and 3rd-party dependencies do not
use ->{psgi.errors}, so trying to standardize on it was a
fruitless endeavor. Since warn() and carp() are standard
mechanisms within Perl, just use those instead and simplify a
bunch of existing code.
Eric Wong [Tue, 12 Oct 2021 22:44:56 +0000 (22:44 +0000)]
extindex: flush pending reindex before unref
This prevents unnecessary message renumbering and I/O.
Without this change, there is a small window for long-running
WWW streaming requests to miss a message that was unref-ed
before reindexing. If we expose an "All Mail" mailbox via
IMAP/JMAP, this will save client traffic.
Eric Wong [Tue, 12 Oct 2021 11:47:05 +0000 (11:47 +0000)]
www: _/text/config/raw Last-Modified: is mm->created_at
This allows IMAP mirrors to keep UIDVALIDITY synchronized (and
"LIST ACTIVE.TIMES" in NNTP). "lei add-external --mirror" will
automatically set it, as will the combination of
public-inbox-clone + public-inbox-index.
This avoids the need for extra endpoints or config entries,
at least...
Eric Wong [Tue, 12 Oct 2021 11:47:04 +0000 (11:47 +0000)]
msgmap: ->new_file to support $ibx arg, drop ->new
The original Msgmap->new API was v1-specific and not necessary.
The ->new_file API now supports an $ibx object being passed to
it, simplifying -no_fsync use. It will also make an upcoming
change easier...
Eric Wong [Tue, 12 Oct 2021 11:47:03 +0000 (11:47 +0000)]
daemon: unconditionally close Xapian shards on cleanup
The cost of opening a Xapian DB (even with shards) isn't high,
so save some FDs and just close it. We hit Xapian far less than
over.sqlite3 and we discard the MSet ASAP even when streaming
large responses.
This simplifies our code a bit and hopefully helps reduce
fragmentation by increasing mortality of late allocations.
Eric Wong [Tue, 12 Oct 2021 11:47:01 +0000 (11:47 +0000)]
msgmap: use DBI->prepare_cached
msgmap is not performance-critical enough to justify doing our
own prepared statement caching. Just rely on the functionality
of DBI here so future changes will be easier.
There's also minor style changes to avoid dirtying cache lines
with refcount bumps: we repeat hash lookups rather than
attempting to store the results as locals.
Eric Wong [Tue, 12 Oct 2021 11:46:59 +0000 (11:46 +0000)]
search: delete QueryParser along with DB handle
Xapian::QueryParser is attached to the Xapian::Database,
so holding onto the QueryParser was preventing us from
releasing DB handles if a query was performed.
Eric Wong [Mon, 11 Oct 2021 08:06:20 +0000 (08:06 +0000)]
extindex: avoid invalid blobs after unref
When unref-ing a blob from xref3, make sure the "preferred"
smsg->{blob} doesn't point to the blob we just unrefed. This
is necessary because we periodically checkpoint our extindex
process to allow -watch and -mda processes to run.
This also gets rid of a lot of redundant code for ->remove_xref3,
since it's all handled in ExtSearchIdx, now.
Eric Wong [Mon, 11 Oct 2021 08:06:19 +0000 (08:06 +0000)]
extindex: more consistent doc removal
We need to ensure a message is consistently removed from eidxq,
over and Xapian in all cases. Removing from eidxq saves users
from some noisy error messages.
Eric Wong [Mon, 11 Oct 2021 08:06:18 +0000 (08:06 +0000)]
extindex: share unref logic in more places
We can use the same logic for --gc, --reindex, and
'd' log entries.
They're similar enough and the actual need to unref should
be fairly rare. We could go a lot faster if we didn't show
progress for --gc and --reindex, actually.
Eric Wong [Mon, 11 Oct 2021 08:06:16 +0000 (08:06 +0000)]
sqlite: PRAGMA optimize on close
As recommended by SQLite documentation[1]:
To achieve the best long-term query performance without the need
to do a detailed engineering analysis of the application schema
and SQL, it is recommended that applications run "PRAGMA optimize"
(with no arguments) just before closing each database connection.
Hopefully that works for our use cases and can make things
faster for us.
Eric Wong [Mon, 11 Oct 2021 08:06:15 +0000 (08:06 +0000)]
extindex: speed up --reindex --fast
This required some tweaking of xref3 indices in over.sqlite3,
but the end result is it brings no-op "--reindex --fast --all"
checks down to roughly 20 minutes (from 30-40 minutes) on
lore/all.
This is faster because a bunch of small SQLite queries are still
slower en masse than a bunch of perlops. Despite the lack of IPC
overhead, crossing .so boundaries and repeating lookups over
btrees is still slower than doing the same with Perl hash tables.
Eric Wong [Sun, 10 Oct 2021 14:25:17 +0000 (14:25 +0000)]
lei/store: keep ".err-XXXX" in stderr tmpfile
This is slightly more meaningful since the file is already
in ~/.local/share/lei/store, so "lei_store" was redundant
(and the "XXXX" are random characters replaced by File::Temp).
Eric Wong [Sun, 10 Oct 2021 14:25:16 +0000 (14:25 +0000)]
extindex: --gc doesn't touch ghost entries
We were deleting ghost entries; this was usually harmless since
other messages could fill-in-the-blanks, but could cause
misthreading in odd cases where a big chunk of a thread is
missing and the latest messages only referenced ghosts.
We'll also save some cycles when scanning Xapian shards since
docids won't be <= 0.
Eric Wong [Sun, 10 Oct 2021 14:25:15 +0000 (14:25 +0000)]
extindex: minor cost reductions
Don't bother decoding the 20-byte SHA-1 to a 40-byte hex value
since we don't read it, anyways. We can also use the on-stack
ibx->eidx_key value instead of dispatching the method again.
Eric Wong [Sun, 10 Oct 2021 14:25:13 +0000 (14:25 +0000)]
set nodatacow on more SQLite files
We'll set nodatacow when detecting existing but empty
files, and also their directories in more cases (for
auxiliary -wal, -journal, -shm files). Hopefully
this keeps performance reasonable on CoW FSes.
Eric Wong [Sat, 9 Oct 2021 12:03:36 +0000 (12:03 +0000)]
view: save memory by dropping smsg->{from_name} on use
We'll also save a few LoC when generating it. $smsg objects can
linger a while when rendering large threads, so saving a few
bytes here can add up to several hundred KB saved.
I noticed this while chasing the ref cycle leak in commit b28e74c9dc0a (www: fix ref cycle from threading w/ extindex, 2021-10-03).
While there's no longer a leak, releasing memory earlier can
allow it to be reused sooner and reduce both memory traffic and
memory pressure.
Eric Wong [Sat, 9 Oct 2021 12:03:35 +0000 (12:03 +0000)]
http: avoid Perl target cache for psgi.input
We do this by using syswrite to populate env->{psgi.input}. The
substr() call in IO::Handle->write will trigger Perl's target/scratchpad and
result in a permanent allocation. Since this is a cold path,
that allocation is pointless, and syswrite() can already write a
substring.
Allowing Perl to cache a large allocation in a cold path only
results in fragmentation and wasted RAM.
write(2) on a regular file won't result in short writes
unless the FS quotas or free space limits are hit, or the buffer
is close to overflowing (e.g. the 0x7ffff000-byte Linux limit).
Since our HTTP server will never buffer that much in RAM,
there's no need to retry syswrite nor rely on the retrying
implicit in IO::Handle->write and the "print" perlop.
Eric Wong [Sat, 9 Oct 2021 12:03:34 +0000 (12:03 +0000)]
view: discard Eml->{bdy} when done using
We can release the raw body buffer once we've obtained a copy of
the decoded buffer. This reduces memory pressure ahead of some
expensive diff processing.
Eric Wong [Fri, 8 Oct 2021 10:20:04 +0000 (10:20 +0000)]
git: fatalize async callback errors by default
This should help us catch BUG: errors (and then some) in
-extindex and other read-write code paths. Only read-only
daemons should warn on async callback failures, since those
aren't capable of causing data loss.
Eric Wong [Wed, 6 Oct 2021 11:50:42 +0000 (11:50 +0000)]
ds: tmpio: avoid Perl target cache
The use of `substr' here as an argument to `print' was causing Perl
to internally cache its target buffer. Since `syswrite()'
already offers a buffer offset arg and length limits, just use
`syswrite' directly. We were using autoflush anyways, so the
lack of buffering was of no concern performance-wise.
The target buffer could get to roughly ~10MB under some loads,
but it was usually a cold path and using memory which cannot be
released nor reused in other places.
note: IO::Handle::write uses `substr' internally, too;
so nothing would be gained using IO::Handle::write.
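For illustration, writing part of a buffer with the offset and length
arguments of `syswrite' might look like the sketch below. The buffer
contents and the anonymous temporary file are made up for the example
and are not taken from tmpio itself.

```perl
use strict;
use warnings;

my $buf = "HEADER:payload bytes\n";
my $off = length('HEADER:'); # bytes we do not want to write
my $len = length($buf) - $off;

open(my $fh, '+>', undef) or die "open anonymous tmpfile: $!";

# syswrite(FH, SCALAR, LENGTH, OFFSET) writes a slice of $buf directly,
# so no substr() temporary is needed for the write
my $w = syswrite($fh, $buf, $len, $off);
defined($w) or die "syswrite: $!";
$w == $len or die "short write: $w of $len bytes";

# read it back to show only the slice reached the file
sysseek($fh, 0, 0) or die "sysseek: $!";
my $r = sysread($fh, my $got, $len);
defined($r) or die "sysread: $!";
```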
Eric Wong [Wed, 6 Oct 2021 11:19:36 +0000 (11:19 +0000)]
msg_iter: split_quotes adds trailing "\n"
The regexp in split_quotes relies on the presence of a
final "\n", so add it wherever we need to instead of
making it the responsibility of every caller.
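To show why the final "\n" matters, here is a small sketch with a
line-anchored regexp similar in spirit to the one in split_quotes.
The sub name `split_quoted' and the exact regexp are assumptions made
for this example, not a copy of the real code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for split_quotes: returns alternating chunks of
# unquoted text and runs of ">"-quoted lines.
sub split_quoted {
	my ($txt) = @_;
	# the callee guarantees a final LF so the last quoted line matches
	$txt .= "\n" unless $txt =~ /\n\z/;
	grep { length } split(/((?:^>[^\n]*\n)+)/m, $txt);
}

# the input lacks a trailing "\n" on its last (quoted) line
my @chunks = split_quoted("hi\n> q1\n> q2");
```

Without the appended "\n", the last quoted line would not match the
regexp and would end up in a separate, unquoted chunk.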
This probably doesn't matter in practice since every
email seems to have a "\n" as the final byte (due to
the way SMTP works), but maybe there's some odd ones
that'll get imported via lei.
Eric Wong [Wed, 6 Oct 2021 10:12:21 +0000 (10:12 +0000)]
overidx: subject_path: allow non-ASCII char in subject matches
This should bring us closer to the "Base subject" definition in
IMAP ORDEREDSUBJECT (RFC 5256 2.1). Larger changes may cause
some breakage (until --reindex). But for now, a reindex will
prevent the non-ASCII subjects from being normalized to the
same fuzzy "thread" in the thread view.
Eric Wong [Wed, 6 Oct 2021 09:44:50 +0000 (09:44 +0000)]
extindex: --gc checkpoints
We need to ensure -extindex --gc runs don't prevent other
work from happening in the meantime. I actually caused
my -extindex to OOM due to the lack of checkpoints :x
We'll also hoist out the shard scanning into its own sub
in preparation for lei/store usage.
Eric Wong [Tue, 5 Oct 2021 09:40:17 +0000 (09:40 +0000)]
index: --reindex w/ --{since,until,before,after}
This lets administrators reindex specific time ranges
according to git "approxidate" formats. These arguments
are passed directly to underlying git-log(1) invocations
and may still reach into old epochs.
Since these options rely on git committer dates (which we infer
from the most recent Received: header), they are not guaranteed
to be strictly tied to git history and it's possible to
over/under-reindex some messages. It's probably not a major
problem in practice, though; reindexing a few extra messages
is generally harmless aside from some extra device wear.
Since this currently relies on git-log, these options do not
affect -extindex, yet.
Eric Wong [Mon, 4 Oct 2021 11:11:43 +0000 (11:11 +0000)]
overidx: update comment for new sub name
`shard_remove_eidx_info' was made unnecessary with commit 82b805db3ad9 (searchidxshard: IPC conversion, part 2, 2021-01-03)
and we now call `remove_eidx_info' directly.
Eric Wong [Mon, 4 Oct 2021 08:26:33 +0000 (08:26 +0000)]
{dir,inbox}idle: use level-triggered epoll
Both read(2) on inotify and kevent(2) return a finite amount of
events. Let the kernel notify us again in cases where we'd
need to retry instead of looping ourselves. This can prevent
missed/delayed notifications while still ensuring fairness in
busy event loops.
Making them immortal doesn't seem worth it, since doing immortal
allocations after process startup leads to fragmentation. While
the allocations made by highlight are small, those small
allocations can break up contiguous regions and prevent
consolidation by the malloc implementation.
Since instantiating code generators doesn't seem too expensive,
just use and delete them ASAP.
Eric Wong [Mon, 4 Oct 2021 00:07:17 +0000 (19:07 -0500)]
www: fix ref cycle from threading w/ extindex
Unlike v1 inboxes (which don't accept duplicate Message-IDs at
all), and v2 inboxes (which generate a new Message-ID for
duplicates), extindex must accept duplicate Message-IDs as-is.
This was fine for storage, but prevented the reference-cycle-breaking
mechanism of our message threading display algorithm from working
reliably. It could no longer delete the ->{parent} field from
clobbered entries in the %id_table.
So we now take into account reused Message-IDs and never clobber
entries in %id_table. Instead, we mark reused Message-IDs as
"imposters" and special-case them by injecting them as children
after all other threading is complete.
This cycle was noticed using a pre-release of Devel::Mwrap::PSGI:
https://80x24.org/mwrap-perl.git
Eric Wong [Sat, 2 Oct 2021 11:18:34 +0000 (11:18 +0000)]
content_hash: normalize whitespace before hashing addresses
This should prevent some false duplicates. I noticed this
while implementing "lei mail-diff", and only noticed it when
I implemented the ContentDigestDbg wrapper for mail-diff.
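As a sketch of the general idea only: the normalization rules and the
choice of SHA-256 below are assumptions for illustration and may not
match what content_hash actually does.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical normalization; the real content_hash rules may differ.
sub addr_digest {
	my ($addr) = @_;
	$addr =~ s/\s+/ /g;        # collapse runs of whitespace and folds
	$addr =~ s/\A\s+|\s+\z//g; # trim both ends
	sha256_hex($addr);
}

# same address, once folded across lines and once on a single line
my $folded = addr_digest("Alice  Example\n\t<a\@example.com>");
my $plain = addr_digest('Alice Example <a@example.com>');
my $other = addr_digest('Bob Example <b@example.com>');
```

Both spellings of the first address hash to the same value, so they
would no longer be treated as different messages on that basis alone.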
Eric Wong [Sat, 2 Oct 2021 11:18:33 +0000 (11:18 +0000)]
lei mail-diff: diagnostic command to diff mail contents
This is useful in finding the cause of deduplication bugs,
and possibly the cause of missing threads reported by
Konstantin in <20211001130527.z7eivotlgqbgetzz@meerkat.local>
usage:
u=https://yhbt.net/lore/all/87czop5j33.fsf@tynnyri.adurom.net/raw
lei mail-diff $u
Eric Wong [Fri, 1 Oct 2021 09:54:44 +0000 (09:54 +0000)]
ds: inline set_cloexec
I'm thinking we can drop support for Linux <2.6.27 soonish and
just use EPOLL_CLOEXEC. Perl without signalfd (or
EVFILT_SIGNAL) is miserable, actually.
Eric Wong [Fri, 1 Oct 2021 09:54:43 +0000 (09:54 +0000)]
inbox: keep DB handles if git processes are live
Having git processes outlive DB handles is likely to hurt
from a fragmentation perspective if the DB handle needs to
be recreated immediately due to a git->cat_async callback.
So only unref DB handles when we're sure there's no live
git users left, otherwise check the inodes.
We'll also avoid needless localization checks in git->cleanup
and make the return value more obvious since the pid fields are
unconditionally deleted nowadays.