Eric Wong [Tue, 5 Oct 2021 09:40:17 +0000 (09:40 +0000)]
index: --reindex w/ --{since,until,before,after}
This lets administrators reindex specific time ranges
according to git "approxidate" formats. These arguments
are passed directly to underlying git-log(1) invocations
and may still reach into old epochs.
Since these options rely on git committer dates (which we infer
from the most recent Received: header), they are not guaranteed
to be strictly tied to git history and it's possible to
over/under-reindex some messages. It's probably not a major
problem in practice, though; reindexing a few extra messages
is generally harmless aside from some extra device wear.
Since this currently relies on git-log, these options do not
affect -extindex, yet.
Eric Wong [Mon, 4 Oct 2021 11:11:43 +0000 (11:11 +0000)]
overidx: update comment for new sub name
`shard_remove_eidx_info' was made unnecessary with commit 82b805db3ad9 (searchidxshard: IPC conversion, part 2, 2021-01-03)
and we now call `remove_eidx_info' directly.
Eric Wong [Mon, 4 Oct 2021 08:26:33 +0000 (08:26 +0000)]
{dir,inbox}idle: use level-triggered epoll
Both read(2) on inotify and kevent(2) return a finite number of
events. Let the kernel notify us again in cases where we'd
need to retry instead of looping ourselves. This can prevent
missed/delayed notifications while still ensuring fairness in
busy event loops.
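The level-triggered behavior relied on here can be sketched as follows. This is a Linux-only illustration in Python, not the Perl code from this commit; `read_in_small_chunks` and `demo` are hypothetical names. The fd is registered without EPOLLET, so the kernel keeps reporting it while unread data remains.

```python
import os
import select

def read_in_small_chunks(fd, chunk_size):
    """Read at most chunk_size bytes per wakeup.

    The fd is registered without EPOLLET, so it is level-triggered:
    epoll keeps reporting it as readable while unread data remains,
    and no userspace retry loop is needed after a short read.
    """
    ep = select.epoll()
    ep.register(fd, select.EPOLLIN)
    chunks = []
    try:
        while ep.poll(0):  # timeout 0: only report what is ready right now
            data = os.read(fd, chunk_size)
            if not data:  # EOF (writer closed); stop instead of spinning
                break
            chunks.append(data)
    finally:
        ep.close()
    return chunks

def demo():
    # Hypothetical demo: 8 bytes in a pipe, consumed 3 bytes per wakeup.
    r, w = os.pipe()
    try:
        os.write(w, b"abcdefgh")
        return read_in_small_chunks(r, 3)
    finally:
        os.close(r)
        os.close(w)
```

Each short read leaves data behind, and the next poll reports the fd again without any explicit requeue.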
Making them immortal doesn't seem worth it, since doing immortal
allocations after process startup leads to fragmentation. While
the allocations made by highlight are small, those small
allocations can break up contiguous regions and prevent
consolidation by the malloc implementation.
Since instantiating code generators doesn't seem too expensive,
just use and delete them ASAP.
Eric Wong [Mon, 4 Oct 2021 00:07:17 +0000 (19:07 -0500)]
www: fix ref cycle from threading w/ extindex
Unlike v1 inboxes (which don't accept duplicate Message-IDs at
all), and v2 inboxes (which generate a new Message-ID for
duplicates), extindex must accept duplicate Message-IDs as-is.
This was fine for storage, but prevented the reference-cycle
mechanism of our message threading display algorithm from working
reliably. It could no longer delete the ->{parent} field from
clobbered entries in the %id_table.
So we now take into account reused Message-IDs and never clobber
entries in %id_table. Instead, we mark reused Message-IDs as
"imposters" and special-case them by injecting them as children
after all other threading is complete.
This cycle was noticed using a pre-release of Devel::Mwrap::PSGI:
https://80x24.org/mwrap-perl.git
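A rough sketch of the "never clobber, inject imposters last" idea, in Python rather than the Perl used by the threading code. The data layout and the choice to attach each imposter under the first entry sharing its Message-ID are assumptions for illustration only.

```python
def thread_messages(msgs):
    """msgs: iterable of (msgid, parent_msgid_or_None) in arrival order.

    Returns (id_table, imposters). Entries in id_table are never
    overwritten; a reused Message-ID becomes an "imposter" instead.
    """
    id_table = {}
    imposters = []
    for mid, parent in msgs:
        node = {"mid": mid, "parent": parent, "children": []}
        if mid in id_table:
            imposters.append(node)  # reused Message-ID: do not clobber
        else:
            id_table[mid] = node

    # Normal threading pass over the unique entries only.
    for node in id_table.values():
        parent = id_table.get(node["parent"])
        if parent is not None and parent is not node:
            parent["children"].append(node)

    # Assumption: imposters are injected as children of the original
    # entry with the same Message-ID, after all other threading is done.
    for node in imposters:
        id_table[node["mid"]]["children"].append(node)
    return id_table, imposters
```

With input `[("a", None), ("b", "a"), ("a", "b")]`, the second "a" claims "b" as its parent; keeping it out of the table leaves the original "a" as the root.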
Eric Wong [Sat, 2 Oct 2021 11:18:34 +0000 (11:18 +0000)]
content_hash: normalize whitespace before hashing addresses
This should prevent some false duplicates. I noticed this
while implementing "lei mail-diff", and only noticed it when
I implemented the ContentDigestDbg wrapper for mail-diff.
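To illustrate why whitespace matters here, a minimal Python sketch (not the actual content_hash implementation; the header list, separators and digest choice are assumptions):

```python
import hashlib
import re

def normalize_address(value):
    # Collapse runs of whitespace (including folded header lines)
    # into single spaces before hashing.
    return re.sub(r"\s+", " ", value).strip()

def content_hash(headers, body):
    """headers: dict mapping header name to a list of raw values."""
    h = hashlib.sha256()
    for name in ("From", "To", "Cc"):  # hypothetical header selection
        for value in headers.get(name, []):
            h.update(f"{name}\0{normalize_address(value)}\0".encode())
    h.update(body.encode())
    return h.hexdigest()
```

Two copies of a message differing only in how an address header was folded or spaced then hash identically, while a different address still changes the digest.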
Eric Wong [Sat, 2 Oct 2021 11:18:33 +0000 (11:18 +0000)]
lei mail-diff: diagnostic command to diff mail contents
This is useful in finding the cause of deduplication bugs,
and possibly the cause of missing threads reported by
Konstantin in <20211001130527.z7eivotlgqbgetzz@meerkat.local>
usage:
u=https://yhbt.net/lore/all/87czop5j33.fsf@tynnyri.adurom.net/raw
lei mail-diff $u
Eric Wong [Fri, 1 Oct 2021 09:54:44 +0000 (09:54 +0000)]
ds: inline set_cloexec
I'm thinking we can drop support for Linux <2.6.27 soonish and
just use EPOLL_CLOEXEC. Perl without signalfd (or
EVFILT_SIGNAL) is miserable, actually.
Eric Wong [Fri, 1 Oct 2021 09:54:43 +0000 (09:54 +0000)]
inbox: keep DB handles if git processes are live
Having git processes outlive DB handles is likely to hurt
from a fragmentation perspective if the DB handle needs to
be recreated immediately due to a git->cat_async callback.
So only unref DB handles when we're sure there's no live
git users left, otherwise check the inodes.
We'll also avoid needless localization checks in git->cleanup
and make the return value more obvious since the pid fields are
unconditionally deleted nowadays.
Eric Wong [Fri, 1 Oct 2021 09:54:42 +0000 (09:54 +0000)]
inbox: inline and eliminate git_cleanup
It was probably incorrect to use from max_git_epoch, and it's
small enough to inline into do_cleanup. We'll also eliminate
the unnecessary deletion of {-altid_map} while we're in the
area, since we no longer cache/memoize that.
Fixes: 7e5cea05f061e757 ("inbox: rewrite cleanup to be more aggressive")
Eric Wong [Fri, 1 Oct 2021 09:54:41 +0000 (09:54 +0000)]
ds: simplify signalfd use
Since signalfd is often combined with our event loop, give it a
convenient API and reduce the code duplication required to use it.
EventLoop is replaced with ::event_loop to allow consistent
parameter passing and avoid needlessly passing the package name
on stack.
We also avoid exporting SFD_NONBLOCK since it's the only flag we
support. There's no sense in having the memory overhead of a
constant function when it's in cold code.
Eric Wong [Fri, 1 Oct 2021 09:54:40 +0000 (09:54 +0000)]
ipc: run Net::SSLeay::randomize
Currently we don't use OpenSSL from child processes of parents
which use OpenSSL, but we may in the future. So ensure OpenSSL
initializes its PRNG after these forks to avoid one security
pitfall down the line.
Eric Wong [Fri, 1 Oct 2021 09:54:38 +0000 (09:54 +0000)]
listener: switch to level-triggered epoll
On second thought, the ->requeue + accept retry code path isn't
worth the userspace complexity and overhead. Level-triggered
epoll has always annoyed me since it takes an inefficient code
path in the kernel; but taking our less-efficient code path in
Perl seems even worse. We also need to take load distribution
into account for multi-worker systems.
Eric Wong [Fri, 1 Oct 2021 09:54:37 +0000 (09:54 +0000)]
doc: lei-security: some more updates
Virtual users will probably be used for read-write IMAP/JMAP
support. The potential for various kernel/hardware bugs and
attacks also needs to be highlighted.
Eric Wong [Fri, 1 Oct 2021 02:10:27 +0000 (02:10 +0000)]
search_view: various navigation tweaks
This improves the "&x=t" navigation between the thread overview
(skeleton) section at the bottom and jumping back to the top for
the mbox download form. The "--links below ..." text ought to
be helpful for users unfamiliar with the /$MSGID/T/ and /$MSGID/t/
views.
Eric Wong [Wed, 29 Sep 2021 21:25:20 +0000 (21:25 +0000)]
git: shorten --git-dir= in CLI with chdir in spawn
Long pathnames are difficult to read and distinguish in ps(1)
output. Deep paths can also slow down pathname resolution
when dealing with loose objects, so we put "cat-file --batch"
deeper into the directory tree.
Since v2 processes are in the form of $INBOXDIR/all.git, keep
the basename of $INBOXDIR in --git-dir= so it's easy to
distinguish between processes just by looking at ps(1).
While "git -C" also exists, it's only present in git 1.8.5+.
We also need to keep in mind the "directory" pointed to by
--git-dir= need not be a directory (nor a symlink pointing
to one).
This reduces pathname resolution overhead for v1 and v2 inbox
git processes, but unfortunately not for extindex since that
needs to store alternates as absolute paths.
Eric Wong [Wed, 29 Sep 2021 12:40:46 +0000 (07:40 -0500)]
ds: simplify idle time expiry, slightly
While it doesn't look like $EXPMAP can be populated in
non-obvious ways via ->DESTROY, it still makes sense to keep it
close to some of our other code around cleanup to reduce
the likelihood of subtle bugs in case semantics change.
Eric Wong [Wed, 29 Sep 2021 03:02:54 +0000 (03:02 +0000)]
t/solver_git: fix test to work with git <2.29
'git diff --abbrev=40' did not abbreviate /^index / lines of
diff output with git <2.29, and 40 will be insufficient for
SHA-256. --full-index has been around since 2005, so it's safe
to rely on.
Tested git version 2.20.0 (Debian buster).
Fixes: 751df49e7db8ba77 ("lei rediff: add --drq and --dequote-only")
Eric Wong [Tue, 28 Sep 2021 23:11:06 +0000 (23:11 +0000)]
inbox: drop memoization/preload, cleanup expires caches
cloneurl, description, and base_url are no longer memoized. The
non-$env form of base_url is rare in WWW, and is fast enough to
not require memoization.
cloneurl and description are now expired during cleanup,
allowing admins to change these files without restarting
(or SIGHUP).
-altid_map is no longer cached nor memoized at all, since the
endpoint(s) which hit it seem rarely accessed.
nntp_url and imap_url are now cached (instead of memoized) in
case an inbox is unvisited for a long time. They remain cached
since the truthiness check gets called in every per-inbox HTML
page, which can potentially be expensive.
Eric Wong [Tue, 28 Sep 2021 23:11:05 +0000 (23:11 +0000)]
inbox: rewrite cleanup to be more aggressive
Avoid relying on a giant cleanup hash and instead use the new
DS->add_uniq_timer API to amortize the pause times associated
with having to cleanup many inboxes. We can also use smaller
intervals for this.
We now discard SQLite DB handles at cleanup. Each of these can
use several megabytes of memory, which adds up with
hundreds/thousands of inboxes. Since per-inbox access intervals
are unpredictable and opening an SQLite handle is relatively
inexpensive, release memory more aggressively to avoid the heap
having to hit swap.
Eric Wong [Tue, 28 Sep 2021 23:11:04 +0000 (23:11 +0000)]
www: do not bump {over} refcnt on long responses
SQLite files may be replaced or removed by admins while
generating large thread or mailbox responses. Ensure we
don't hold onto DBI handles and associated file descriptors
past their cleanup.
Eric Wong [Tue, 28 Sep 2021 07:53:49 +0000 (07:53 +0000)]
www+httpd: lower priority of large mbox downloads
While each git blob request is treated fairly w.r.t other git
blob requests, responses triggering thousands of git blob
requests can still noticeably increase latency for
less-expensive responses.
Move large mbox results and the nasty all.mbox endpoint to
a low priority queue which only fires once per-event loop
iteration. This reduces the response time of short HTTP
responses while many gigantic mboxes are being downloaded
simultaneously, but still maximizes use of available I/O
when there are no inexpensive HTTP responses happening.
This only affects PublicInbox::WWW users who use
public-inbox-httpd, not generic PSGI servers.
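One reading of the low-priority queue, sketched in Python under assumptions: `Loop`, `normal` and `low` are hypothetical names, and "fires once per iteration" is taken to mean at most one low-priority task per loop pass.

```python
from collections import deque

class Loop:
    """Toy event loop with a normal queue and a low-priority queue."""

    def __init__(self):
        self.normal = deque()
        self.low = deque()

    def run_once(self):
        # Run every normal task that was queued before this iteration.
        for _ in range(len(self.normal)):
            self.normal.popleft()()
        # Then let at most one low-priority task make progress.
        if self.low:
            self.low.popleft()()

def demo():
    log = []
    loop = Loop()
    loop.normal.extend([lambda: log.append("n1"), lambda: log.append("n2")])
    loop.low.extend([lambda: log.append("l1"), lambda: log.append("l2")])
    loop.run_once()
    loop.normal.append(lambda: log.append("n3"))
    loop.run_once()
    return log
```

Cheap responses are never stuck behind a backlog of expensive ones, yet the expensive ones still advance whenever the loop is otherwise idle.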
Eric Wong [Mon, 27 Sep 2021 21:05:45 +0000 (16:05 -0500)]
lei completion: workaround old Perl bug
While `$argv[-1]' is `undef' on an empty @argv, using `$argv[-1]'
as a subroutine argument would fail incorrectly with:
Modification of non-creatable array value attempted, subscript -1 at ...
...even though we'd never attempt to modify @_ itself in the
subroutines being called. Work around the bug (tested on
5.16.3) by passing `undef' explicitly when `$argv[-1]' is
already `undef'.
Eric Wong [Mon, 27 Sep 2021 07:53:07 +0000 (02:53 -0500)]
config: get_1: use full parameter name
Instead of passing the prefix section and key separately, pass
them together as is commonly done with git-config(1) usage as
well as our ->get_all API. This inconsistency in the get_1 API
is a needless footgun and confused me a bit while working on
"lei up" the other week.
Eric Wong [Mon, 27 Sep 2021 04:59:31 +0000 (04:59 +0000)]
lei rediff: add --drq and --dequote-only
More switches which can be useful for users who pipe from text
editors. --drq can be helpful while writing patch review email
replies, and perhaps --dequote-only, too.
Eric Wong [Sun, 26 Sep 2021 01:42:38 +0000 (01:42 +0000)]
t/run.perl: less confusing error reporting
The $sigchld handler was reporting the last test (successful or
not) for a given PID in case a worker dies prematurely.
Instead, redisplay all failed tests in $run_log to ensure the
report only shows failed tests, and not the last started (and
possibly successful) one.
Eric Wong [Sun, 26 Sep 2021 01:30:47 +0000 (01:30 +0000)]
www_listing: support /all/ search as a 302 redirect
This allows users to search /all/ from the top-level WwwListing
without extra manual steps, although there are still extra network
roundtrips incurred.
No vertical whitespace is added, and there's no clumsy radio
buttons nor menus to deal with. Users only have to use a
different <input type=submit /> button. I forgot how to do this
until I realized we already do something similar with multiple
submit buttons for threaded vs non-threaded mboxrd.gz downloads.
Eric Wong [Sun, 26 Sep 2021 00:02:32 +0000 (00:02 +0000)]
lei note-event: ignore kw_changed exceptions
The note-event worker may see changes before a Xapian shard
commit happens, meaning keyword lookups fail as a result.
Just emit the request to the lei/store worker since it's a
fairly cheap operation at this point.
We'll try harder to look for kw changes, too, since
deduplication changes may lead to multiple docids being
resolved for a single message.
Eric Wong [Sat, 25 Sep 2021 22:16:45 +0000 (22:16 +0000)]
search: avoid setting undef hashtable entries
`undef' entries still take up a slot in the hash table, and
cause the `exists' check to false-positive in ->cleanup_shards.
This should fully fix the (innocuous) messages introduced in
commit 63d7b8ce (daemons: revamp periodic cleanup task, 2021-09-23)
Eric Wong [Sat, 25 Sep 2021 22:16:44 +0000 (22:16 +0000)]
extmsg: search_partial: use ->isrch if available
This allows us to avoid creating ibx->{search}->{xdb} at this
spot by using an `undef' value. This is a step towards
eliminating the innocuous "/path/to/inboxdir/xap15 has no shards"
messages introduced in commit 63d7b8ce (daemons: revamp
periodic cleanup task, 2021-09-23)
Eric Wong [Sat, 25 Sep 2021 08:49:43 +0000 (08:49 +0000)]
lei forget-external: split into separate file
This was written before we had auto-loading, and forget-external
should be a rarely-used command that's not worth loading at
startup. Do some golfing while we're in the area, too.
Eric Wong [Sat, 25 Sep 2021 07:08:38 +0000 (07:08 +0000)]
doc: lei-rm: remove unnecessary -F values
-F is really only useful for distinguishing between mbox
variants and single message/rfc822 files. URLs and
directory-based formats can be auto-detected easily enough.
Eric Wong [Sat, 25 Sep 2021 06:17:54 +0000 (06:17 +0000)]
lei: make pkt_op easier-to-use and understand
Since switching to SOCK_SEQPACKET, we no longer have to use
fixed-width records to guarantee atomic reads. Thus we can
maintain more human-readable/searchable PktOp opcodes.
Furthermore, we can infer the subroutine name in many cases
to avoid repeating ourselves by specifying a command-name
twice (e.g. $ops->{CMD} => [ \&CMD, $obj ]; can now simply be
written as: $ops->{CMD} => [ $obj ] if CMD is a method of
$obj).
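The name inference can be sketched like so. This is a Python analogue of the idea rather than the PktOp code; `dispatch` and `Counter` are hypothetical.

```python
def dispatch(ops, cmd, *args):
    """Look up cmd in ops and call its handler.

    Long form:  ops[cmd] = [func, obj]  -> func(obj, *args)
    Short form: ops[cmd] = [obj]        -> obj.<cmd>(*args)
    """
    entry = ops[cmd]
    if callable(entry[0]):
        func, *bound = entry
        return func(*bound, *args)
    obj, *bound = entry
    # No explicit handler: infer the method from the command name.
    return getattr(obj, cmd)(*bound, *args)

class Counter:
    def __init__(self):
        self.n = 0

    def incr(self, by):
        self.n += by
        return self.n

def demo():
    c = Counter()
    first = dispatch({"incr": [Counter.incr, c]}, "incr", 2)  # long form
    second = dispatch({"incr": [c]}, "incr", 3)               # short form
    return first, second
```

Both forms reach the same method; the short form just avoids spelling the command name twice.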
Eric Wong [Sat, 25 Sep 2021 05:49:45 +0000 (05:49 +0000)]
lei2mail: augment_inprogress: guard against closed FDs
I'm not sure what caused it, but $err was undef and caused print
to fail, leading to an event loop error. Guard the timer with
an eval and assume warn() can't trigger an event loop failure.
Eric Wong [Sat, 25 Sep 2021 05:49:44 +0000 (05:49 +0000)]
lei: restore old sigmask before daemon exit
If the event loop fails, we want blocking waitpid (wait4) calls
to be interruptible with SIGTERM via "kill $PID" rather than
SIGKILL. Though a failing event loop is something we should
avoid...
Implicit stdin based on standard input being a pipe or regular
file is here to stay, so save users the trouble of typing '-'
or '--stdin'.
Inline::C is required as of commit 1d6e1f9a6a66 (lei: require
Socket::MsgHdr or Inline::C, drop oneshot, 2021-05-26); but
Socket::MsgHdr still gives a noticeable improvement in bash
completion speed.
Also, spell-out "MESSAGE-ID" since "MID" is actually not a
common abbreviation ("MSGID" is used by RFC 3977 and several
other RFCs, I recall).
Eric Wong [Sat, 25 Sep 2021 03:21:01 +0000 (03:21 +0000)]
t/v2mirror: check dependencies for legacy test
We still need Email::MIME to test against old revisions.
We'll also depend on the revision just prior to the
manifest.js.gz introduction to avoid loading Danga::Socket,
since it was getting loaded even with `plackup'.
Finally, we'll disable Inline::C usage with old Spawn.pm
since our old code included alloca.h, which is not
portable to FreeBSD.
Eric Wong [Fri, 24 Sep 2021 10:56:45 +0000 (10:56 +0000)]
fetch: support v2 w/o manifest on old WWW
There may still be pre-manifest.js.gz versions of
PublicInbox::WWW running and serving v2 inboxes.
While -clone and "add-external --mirror" were working, -fetch
was failing due to 301 redirect to $INBOX_URL/manifest.js.gz/
and not the expected 404. Update the code to deal with a JSON
decode error (from the 301) and ensure v2 epochs detection is
correct (and not using a shadowed variable).
Eric Wong [Fri, 24 Sep 2021 10:56:44 +0000 (10:56 +0000)]
clone|fetch|--mirror: cull manifest in partial mirrors
This makes it easier for users to enable fetching on a
previously read-only epoch. Prior to this change, users were
required to delete manifest.js.gz in addition to adding the
writable bit. Now, they just have to "chmod +w $EPOCH_DIR".
Eric Wong [Fri, 24 Sep 2021 10:56:43 +0000 (10:56 +0000)]
clone|--mirror: fix and test against pre-manifest WWW
There may still be pre-manifest.js.gz versions of PublicInbox::WWW
running and serving v2 inboxes.
Since $INBOX_URL/manifest.js.gz was not understood, it was
assumed to be a Message-ID and 301-ed to
"$INBOX_URL/manifest.js.gz/" with a trailing slash, so our 404
checks were invalid. Update our fallbacks to deal with 301
by catching JSON decoding errors to trigger HTML scraping.
For HTML parsing, be sure to not be fooled by potential
user-generated content and only scan the part after the last
<hr>.
We also need to avoid propagating $? from curl unnecessarily
when we can continue safely.
Finally, update v2mirror.t with tests to use PublicInbox::WWW
from our "v1.1.0-pre1" tag to ensure these code paths get tested.
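The fallback order described above (try JSON, fall back to scraping, only trust the part after the last <hr>) might look roughly like this in Python. `parse_manifest_or_html` is a hypothetical helper, and the body is assumed to be already decompressed text.

```python
import json

def parse_manifest_or_html(body):
    """Return ("manifest", data) for valid JSON, else ("html", tail).

    An old server may answer the manifest URL with a redirect to an
    HTML page instead of a 404, so a JSON decode error is the signal
    to fall back to scraping.
    """
    try:
        return ("manifest", json.loads(body))
    except json.JSONDecodeError:
        # Only look at the page footer: anything before the last <hr>
        # may contain user-generated content.
        tail = body.rsplit("<hr>", 1)[-1]
        return ("html", tail)
```

What gets extracted from the tail is left out here, since that depends on the page layout of the old server.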
Eric Wong [Fri, 24 Sep 2021 10:56:41 +0000 (10:56 +0000)]
clone|--mirror: support --epoch=RANGE for partial clones
Partial (v2) clones should be a useful addition for users wanting
to conserve storage while having fast access to recent messages.
Continuing work started in 876e74283ff3 (fetch: ignore
non-writable epoch dirs, 2021-09-17), this creates bare,
read-only epoch git repos. These git repos have the remotes
pre-configured, but do not fetch any objects.
The goal is to allow users to set the writable bit on a
previously-skipped epoch and start fetching it.
Shell completion support may not be necessary given how short
the epoch ranges are, here.
Eric Wong [Thu, 23 Sep 2021 10:37:42 +0000 (10:37 +0000)]
lei_xsearch: use localtime for user message
It's probably least confusing for user-facing messages to
display times in the user's configured timezone. I considered
appending "UTC" to the message and sticking with gmtime(), too,
but this output isn't intended to be web-cache friendly, nor do
we expect users across multiple timezones to view the same
output.
Eric Wong [Thu, 23 Sep 2021 05:53:03 +0000 (05:53 +0000)]
xcpdb: avoid race when shards are added
It's possible for the rename() sequence to cause read-only
daemons using ->xdb_shards_flat to load an incomplete set of
contiguous shards and get invalid docids for search results.
With this change, we favor the case where search is momentarily
unavailable rather than giving wrong results during the small
window where Xapcmd->commit_changes runs.
Eric Wong [Thu, 23 Sep 2021 00:46:25 +0000 (00:46 +0000)]
daemons: revamp periodic cleanup task
Neither Inboxes nor ExtSearch objects were retrying correctly
when there are live git processes, but the inboxes were getting
rescanned for search or other reasons. Ensure the scan retries
eventually if there are live processes.
We also need to update the cleanup task to detect Xapian shard
count changes, since Xapian ->reopen is enough to detect any
other Xapian changes. Otherwise, we just issue an inexpensive
->reopen call and let Xapian check whether there's anything
worth reopening.
This also lets us eliminate the Devel::Peek dependency.
Eric Wong [Wed, 22 Sep 2021 09:45:17 +0000 (09:45 +0000)]
gcf2 + extsearch: check for unlinked files on Linux
Check for unlinked mmap-ed files via /proc/$PID/maps every 60s
or so.
ExtSearch (extindex) is compatible-enough with Inbox objects to
be wired into the old per-inbox code, but the startup cost is
projected to be much higher down the line when there's >30K
inboxes, so we scan /proc/$PID/maps for deleted files before
unlinking. With old Inbox objects, it was (and is) simpler to
just kill processes w/o checking due to the low startup cost
(and non-portability of checking).
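A minimal sketch of the /proc/$PID/maps check, in Python and Linux-only. The function names are hypothetical, and the parsing assumes the usual six-column maps layout with " (deleted)" appended to the pathname of unlinked files.

```python
def parse_deleted(lines):
    """Return pathnames of mappings whose backing file was unlinked."""
    suffix = " (deleted)"
    found = []
    for line in lines:
        # address perms offset dev inode pathname (pathname may be absent)
        fields = line.rstrip("\n").split(None, 5)
        if len(fields) == 6 and fields[5].endswith(suffix):
            found.append(fields[5][: -len(suffix)])
    return found

def deleted_mappings(pid="self"):
    # Linux-only: read the live mappings of a process.
    with open(f"/proc/{pid}/maps") as fh:
        return parse_deleted(fh)

SAMPLE = [
    "7f0000000000-7f0000001000 r--s 00000000 08:01 1234 /tmp/over.sqlite3 (deleted)\n",
    "7f0000002000-7f0000003000 r-xp 00000000 08:01 5678 /usr/lib/libc.so.6\n",
    "7f0000004000-7f0000005000 rw-p 00000000 00:00 0 \n",
]
```

A periodic timer could call `deleted_mappings()` and only restart the process when the list is non-empty.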
Eric Wong [Wed, 22 Sep 2021 02:24:34 +0000 (02:24 +0000)]
lei up: avoid excessively parallel --all
We shouldn't dispatch all outputs right away since they
can be expensive CPU-wise. Instead, rely on DESTROY to
trigger further redispatches.
This also fixes a circular reference bug for the single-output
case that could lead to a leftover script/lei after MUA exit.
I'm not sure how --jobs/-j should work when the actual xsearch
and lei2mail have their own parallelism ("--jobs=$X,$M"), but
it's better than having thousands of subtasks running.
Fixes: b34a267efff7b831 ("lei up: fix --mua with single output")
Eric Wong [Tue, 21 Sep 2021 09:29:45 +0000 (09:29 +0000)]
lei: umask(077) before opening errors.log
There's a chance some sensitive information (e.g. folder names)
can end up in errors.log, though $XDG_RUNTIME_DIR or
/tmp/lei-$UID/ will have 0700 permissions, anyways.
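The effect of setting the umask before opening the log can be sketched in Python. Restoring the previous umask afterwards is an assumption made for this example, not something this commit states.

```python
import os
import stat
import tempfile

def open_private_log(path):
    """Open path for appending with no group/other permission bits."""
    old = os.umask(0o077)
    try:
        return open(path, "a")
    finally:
        os.umask(old)  # assumption: restore the caller's umask

def demo_mode():
    # Create a log in a throwaway directory and report its mode bits.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "errors.log")
        with open_private_log(path) as fh:
            fh.write("example\n")
        return stat.S_IMODE(os.stat(path).st_mode)
```

A newly created file ends up with mode 0600 regardless of the umask the process inherited.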
Eric Wong [Tue, 21 Sep 2021 09:29:44 +0000 (09:29 +0000)]
script/lei: handle SIGTSTP and SIGCONT
Sometimes it's useful to pause an expensive query or
refresh-mail-sync to do something else. While lei-daemon and
lei/store can't be paused since they're shared across clients,
per-invocation WQ workers can be paused safely using the
unblockable SIGSTOP.
While we're at it, drop the ETOOMANYREFS hint since it
hasn't been a problem since we drastically reduced FD passing
early in development.
Eric Wong [Tue, 21 Sep 2021 07:41:59 +0000 (07:41 +0000)]
lei q: improve --limit behavior and progress
Avoid slurping gigantic (e.g. 100000) result sets into a single
response if a giant limit is specified, and instead use 10000
as a window for the mset with a given offset. We'll also warn
and hint about the --limit= switch when the estimated
result set is larger than the default limit.
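The windowing can be sketched as a generator. This is Python for illustration only; `fetch(offset, count)` is a hypothetical stand-in for requesting one mset window from the search backend.

```python
def windowed(fetch, limit, window=10000):
    """Yield up to `limit` results, requesting at most `window` at a time."""
    offset = 0
    while offset < limit:
        batch = fetch(offset, min(window, limit - offset))
        if not batch:  # backend has no more results
            break
        yield from batch
        offset += len(batch)

def demo():
    data = list(range(25000))
    calls = []

    def fetch(offset, count):
        calls.append((offset, count))
        return data[offset:offset + count]

    out = list(windowed(fetch, 100000))
    return out == data, calls
```

Even with a limit of 100000, no single request asks for more than 10000 results, and iteration stops once the backend runs dry.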
Eric Wong [Tue, 21 Sep 2021 07:41:55 +0000 (07:41 +0000)]
lei: various completion improvements
"lei export-kw" no longer completes for anonymous sources.
More commands use "lei refresh-mail-sync" as a basis for their
completion work, as well.
";AUTH=ANONYMOUS@" is stripped from completions since it was
preventing bash completion from working on AUTH=ANONYMOUS IMAP
URLs. I'm not sure if there's a better way, but all of our code
works fine without specifying AUTH=ANONYMOUS as a command-line
arg.
Finally, we fall back to using more candidates if none can
be found, allowing multiple URLs to be completed.
Eric Wong [Tue, 21 Sep 2021 07:41:52 +0000 (07:41 +0000)]
lei lcat: use single queue for ordering
If lcat-ing multiple argument types (blobs vs folders),
maintain the original order of the arguments instead of
dumping all blobs before folder contents.
Eric Wong [Tue, 21 Sep 2021 07:41:51 +0000 (07:41 +0000)]
lei: simplify internal arg2folder usage
We can set opt->{quiet} for (internal) 'note-event' command
to quiet ->qerr, since we use ->qerr everywhere else. And
we'll just die() instead of setting a ->{fail} message, since
eval + die are more in line with the rest of our Perl code.
Eric Wong [Tue, 21 Sep 2021 07:41:50 +0000 (07:41 +0000)]
lei_mail_sync: account for non-unique cases
NNTP servers, IMAP servers, and various MUAs may recycle
"unique" identifiers due to software bugs or careless BOFHs.
Warn about them, but always be prepared to account for them.
Eric Wong [Mon, 20 Sep 2021 13:00:33 +0000 (13:00 +0000)]
gcf2: fix loading at runtime
We need to waitpid synchronously on pkg-config to use $?.
When loading Gcf2 inside the event loop, implicit dwaitpid
done by PublicInbox::ProcessPipe would not call waitpid in
time to zero $?. This was causing one of my -httpd to
occasionally fall back to git(1) instead of using Gcf2.