Eric Wong [Wed, 27 Jan 2021 09:42:25 +0000 (03:42 -0600)]
lei: set PWD correctly for path expansion
While commit d1b9582872d1824f166a038dcf32b6ae8c6dc735
("lei: pass FD to CWD via cmsg, use fchdir on server")
ensured things work properly to get the daemon in the
right directory, it forgot to deal with places where
we expand relative paths based on the current working
directory.
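
As a rough illustration (in Python rather than the project's Perl), expanding a relative path against the client's working directory instead of the daemon's might look like the sketch below. `expand_path` and `client_pwd` are hypothetical names, not lei's API.

```python
import os.path

def expand_path(path, client_pwd):
    """Resolve `path` against the client's directory, not the daemon's cwd."""
    if os.path.isabs(path):
        return os.path.normpath(path)
    # normpath collapses ".." lexically; it does not resolve symlinks
    return os.path.normpath(os.path.join(client_pwd, path))
```

The point is only that the base directory must come from the client; the daemon's own cwd is not a reliable substitute.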
Eric Wong [Mon, 25 Jan 2021 06:41:57 +0000 (22:41 -0800)]
spawn: split() on regexp, not a literal string
It doesn't appear Perl (as of 5.32.x) has any internal
optimization for splitting on a single-byte, so give it
a regexp instead of letting it compile and discard a
new one every single time.
Eric Wong [Mon, 25 Jan 2021 06:41:56 +0000 (22:41 -0800)]
miscidx: switch to lazy transactions
This fixes a sporadic failure on a 1/2 core VM where
"git cat-file --batch" hasn't started up by the time
$cleanup->() destroys the ALL.git directory in t/lei.t
(but not t/lei-oneshot.t).
This happens because dwaitpid() runs inside the event loop
asynchronously and we were able to return to the client before
the cat-file process could even start.
I could not reproduce this failure on my usual 4-core
workstation via "schedtool -a 0x1" to force the entire
test to use a single core.
Lazy transactions match OverIdx and SearchIdx behavior, and
I've verified this lets us avoid problems with old Xapian
versions (on CentOS 7.x) which failed to set FD_CLOEXEC.
Eric Wong [Mon, 25 Jan 2021 04:53:46 +0000 (19:53 -0900)]
doc: start working on public-inbox-extindex(1) manpage
It's barely started, even though I began writing this weeks ago;
I'm still unsure about some behavioral/usability things and
hoping work on lei(1) can flush them out.
Eric Wong [Mon, 25 Jan 2021 01:18:57 +0000 (17:18 -0800)]
lei q: continue remote search if torsocks(1) is missing
torsocks is just one of many ways to get curl to use Tor,
so we'll continue if we can't find torsocks in our PATH
and assume the user has a proxy configured via curlrc,
the command-line, environment variable, or even firewall
rules.
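
A minimal sketch of the fallback described above, in Python rather than the project's Perl. `curl_command` and `want_tor` are hypothetical names; the only behavior taken from the commit is "wrap curl in torsocks when it is in PATH, otherwise carry on without it".

```python
import shutil

def curl_command(uri, want_tor, which=shutil.which):
    """Build an argv for curl, prefixing torsocks only if it is installed."""
    cmd = ["curl", uri]
    if want_tor and which("torsocks") is not None:
        cmd.insert(0, "torsocks")
    # If torsocks is missing we proceed anyway and assume the user has
    # configured a proxy some other way (curlrc, environment, firewall).
    return cmd
```

The `which` parameter exists only so the lookup can be stubbed out in tests.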
Eric Wong [Mon, 25 Jan 2021 01:18:55 +0000 (17:18 -0800)]
lei q: demangle and quiet curl output
curl(1) writes to stderr one byte-at-a-time (presumably for the
progress bar). This ends up being unreadable on my terminal
when parallel processes are trying to write error messages.
So instead, we'll capture the output to a file and run
'tail -f' on it if --verbose is enabled.
Since HTTP 404s from non-existent results are a common response,
we'll ignore them and stay silent, matching behavior of local
searches.
Eric Wong [Mon, 25 Jan 2021 01:18:54 +0000 (17:18 -0800)]
lei q: drop "oid" output format
The default deduplication command-line arguments would be
nonsensical for such an option and probably confusing. It
doesn't seem worth the code to support OID-only output when it's
easy enough to use one of the JSON formats to extract the same
info.
We also don't have OIDs if using remotes, and the
to-be-implemented memoization will be optional.
Eric Wong [Mon, 25 Jan 2021 01:18:53 +0000 (17:18 -0800)]
lei: reinstate JSON smsg output deduplication
This was accidentally clobbered completely in
("lei q: fix JSON overview with remote externals").
There are now more tests to prevent future regressions.
Eric Wong [Sun, 24 Jan 2021 11:46:55 +0000 (04:46 -0700)]
smsg: parse_references: micro-optimization
With Perl 5.10+, we can rely on the defined-or-assignment (//=)
operator to avoid repeatedly rewriting an SV.
This may not provide a measurable difference here, but
it's more consistent with current style where things like
commit a05445fb400108e60ede7d377cf3b26a0392eb24
("config: config_fh_parse: micro-optimize") provide a measurable
improvement.
Eric Wong [Sun, 24 Jan 2021 11:46:53 +0000 (04:46 -0700)]
lei q: fix JSON overview with remote externals
We can't (and don't need to) repeatedly get the $each_smsg
callback for each URI since that clobbers {ovv_buf} before
it can be output.
I initially thought this was a dedupe-related bug and
moved the dedupe code into the $each_smsg callback to
minimize differences. Nevertheless it's a nice code
reduction.
I also thought it was related to incomplete smsg info,
so {references} is now filled in correctly for dedupe.
Eric Wong [Sun, 24 Jan 2021 11:46:51 +0000 (04:46 -0700)]
lei q: honor --no-local to force remote searches
This can be useful for testing remote behavior, or for
augmenting local results. It'll also be possible to explicitly
include/exclude externals via CLI switches (once names are
decided).
Eric Wong [Sun, 24 Jan 2021 11:46:49 +0000 (04:46 -0700)]
ipc: get rid of wq_set_recv_modes
Just open every FD as read/write. Perl (or any non-broken
runtime) won't care and won't attempt to use F_SETFL to alter
file description flags; as attempting to change those would
lead to unpleasant side effects if the file description is
shared with another process.
Eric Wong [Sun, 24 Jan 2021 11:46:48 +0000 (04:46 -0700)]
ipc: wq supports arbitrarily large payloads
This should not be needed, but somebody using lei could
theoretically create thousands of external URLs and
only have a handful of workers, which means the per-worker
URI list could be large.
Eric Wong [Sun, 24 Jan 2021 11:46:47 +0000 (04:46 -0700)]
lei q: limit concurrency to 4 remote connections
Unfortunately, this isn't a per-host limit, yet; but
nevertheless reduces load on existing PublicInbox::WWW
instances, since requesting a mboxrd is one of the more
expensive operations.
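
For illustration only, a global (not per-host) cap on concurrent fetches can be sketched in Python with a fixed-size pool. lei itself uses worker processes, not threads; `fetch_all` and `fetch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_REMOTE_CONNS = 4  # the limit named in the commit subject

def fetch_all(uris, fetch, max_conns=MAX_REMOTE_CONNS):
    """Call fetch(uri) for every URI, never running more than max_conns at once."""
    with ThreadPoolExecutor(max_workers=max_conns) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(fetch, uris))
```

Because the cap is global, four slow requests to one host can still occupy every slot, which is the limitation the commit message points out.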
Eric Wong [Fri, 22 Jan 2021 20:01:19 +0000 (20:01 +0000)]
treewide: reseed RNG in child processes
This prevents name conflicts leading to retries and slowdowns in
temporary file name generation. No actual data corruption
resulted because all temporary files are opened with O_EXCL
anyways.
This may increase security for IMAP, NNTP, and HTTPS sessions
using TLS, but it's all public data anyways.
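
The underlying problem is generic: a forked child inherits the parent's PRNG state, so siblings produce identical "random" values unless they reseed. A small Python sketch (not the project's Perl; `child_draws` is a hypothetical helper, and an explicit `random.Random` instance is used so the inherited state is visible):

```python
import os
import random

def child_draws(rng, reseed, nchildren=2):
    """Fork children that each report their first 64-bit draw from rng."""
    draws = []
    for _ in range(nchildren):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: report one draw, then exit without cleanup
            try:
                os.close(r)
                if reseed:
                    rng.seed(os.urandom(16))  # fresh entropy per child
                os.write(w, str(rng.getrandbits(64)).encode())
            finally:
                os._exit(0)
        os.close(w)
        with os.fdopen(r, "rb") as f:
            draws.append(int(f.read()))
        os.waitpid(pid, 0)
    return draws
```

Without reseeding, every child returns the same number, which is what leads to colliding temporary file names and retries.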
Eric Wong [Sat, 23 Jan 2021 10:27:54 +0000 (10:27 +0000)]
lei forget-external: don't show redundant "not found"
Pathname/URL canonicalization may not change the result at
all, so there's no point in trying (and failing) the same
form twice if pre and post-canonicalization are identical.
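
A sketch of the idea in Python (not the project's Perl): try the canonical form and the as-is form, but only once when they are identical. The canonicalization rules in `canon` are assumptions made up for this example, not lei's actual rules.

```python
import os.path
from urllib.parse import urlsplit, urlunsplit

def canon(loc):
    """Hypothetical canonicalizer: normalize URL case and trailing slash, or a path."""
    if "://" in loc:
        u = urlsplit(loc)
        path = u.path.rstrip("/") + "/"
        return urlunsplit((u.scheme.lower(), u.netloc.lower(), path, u.query, u.fragment))
    return os.path.normpath(loc)

def forget_candidates(loc):
    """Forms to look up, without repeating an identical one."""
    c = canon(loc)
    return [loc] if c == loc else [c, loc]
```

With a single candidate there is only one lookup, so at most one "not found" message can be shown.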
Eric Wong [Sat, 23 Jan 2021 10:27:53 +0000 (10:27 +0000)]
lei q: support a bunch of curl(1) options
Some of these options will make sense when on weird networks
(behind firewalls, etc.) Some of these options may not make
sense at all.
This accommodates users who prefer the SOCKS5 proxy support in
curl over torsocks(1), but we'll still support torsocks
by default since some Tor instances aren't on the default
127.0.0.1:9050.
Eric Wong [Sat, 23 Jan 2021 10:27:50 +0000 (10:27 +0000)]
lei: default "-f $mfolder" args for common MUAs
At least mail, mailx, mutt, and neomutt follow this convention.
Heirloom mailx doesn't support Maildir (our default), but GNU
mailutils mail/mailx does.
Eric Wong [Sat, 23 Jan 2021 10:27:47 +0000 (10:27 +0000)]
lei: support remote externals
Via curl(1), since that lets us easily use tor on a
per-connection basis via LD_PRELOAD (torsocks) or proxy.
We'll eventually support more curl options which can allow
users to get past firewalls and deal with other odd network
configurations.
Eric Wong [Thu, 21 Jan 2021 19:46:23 +0000 (19:46 +0000)]
lei: forget-external support with canonicalization
For proper matching, we'll do a better job of canonicalizing
URLs and path names. Of course, users may edit
the file outside of lei, so ensure we try both the canonicalized
and as-is form provided by the user.
I also don't think we'll need to store externals info in
MiscIdx; just the config file is fine.
Eric Wong [Thu, 21 Jan 2021 19:46:22 +0000 (19:46 +0000)]
lei: remove @TO_CLOSE_ATFORK_CHILD
..At least limit it to a single file handle. The write end
EOFpipe can be limited in scope and auto-closed when $quit is
clobbered, leaving only the listener. The listener is the only
handle that needs to be closed explicitly due to it being on the
stack in the Listener->event_step => accept_dispatch => lei_$FOO
code path.
Everything else gets clobbered by DS->Reset in children after
forking.
Eric Wong [Thu, 21 Jan 2021 19:46:21 +0000 (19:46 +0000)]
lei_xsearch: reduce reference paths to lxs
Having an extra reference to LeiXSearch from the OpPipe $done_op
map is unnecessary and makes the reference graph more complex
than it needs to be. Just use $lei->{lxs} to simplify and
reduce the likelihood of bugs.
The signal handlers on the client side were unnecessary;
all we need is to handle socket EOF properly in the daemon
by killing xsearch and l2m workers.
Eric Wong [Thu, 21 Jan 2021 19:46:18 +0000 (19:46 +0000)]
lei_to_mail: avoid segfault on exit
Worker exit causes DESTROY ordering to become unpredictable and
leads to Perl segfaulting. Instead, rely on OnDestroy and
explicit triggering after wq_worker_loop to ensure we finish
all outstanding git requests before worker exit.
Eric Wong [Thu, 21 Jan 2021 19:46:16 +0000 (19:46 +0000)]
lei: show {pct} and {oid} in From_ lines and filenames
From_ lines are shown when mbox* variants are output to stdout,
making {oid} and {pct} information visible without risking being
propagated to other importer processes if they were in
lei-specific X-* headers.
Maildirs already had OIDs in the filename, now they gain Xapian
{pct} in case anybody cares.
Eric Wong [Thu, 21 Jan 2021 19:46:13 +0000 (19:46 +0000)]
lei_overview: rename {relevance} => {pct}
The old name was too long compared to the rest of the field
names. With the Xapian method being named ->get_percent,
"pct" is a well known abbreviation for "percent" and already
used internally by our wrapper.
..And clean up some excess whitespace while we're in the area.
Eric Wong [Wed, 20 Jan 2021 05:04:43 +0000 (14:04 +0900)]
lei: allow more mbox inode types
We may attempt to write an mbox to any terminal, block, or
character device, not just regular files and FIFOs/pipes.
The only thing that is known to not work is a directory.
Sockets may be possible with some OSes (e.g. Plan 9) or
filesystems. This fixes t/lei.t on FreeBSD 11.x
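
The relaxed check amounts to "anything but a directory". A Python sketch of that rule (not the project's Perl; `mbox_dest_ok` is a hypothetical name, and treating a missing path as acceptable is an assumption):

```python
import os
import stat

def mbox_dest_ok(path):
    """Accept regular files, FIFOs, devices, etc.; reject only directories."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return True  # assumption: a missing path can simply be created
    return not stat.S_ISDIR(mode)
```

Rejecting one known-bad type, rather than listing the good ones, keeps the check valid on systems where sockets or other inode types are writable.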
Eric Wong [Tue, 19 Jan 2021 09:34:33 +0000 (09:34 +0000)]
t/lei: fix double-running of socket test with oneshot
We split out t/lei-oneshot.t and t/lei.t so it's easier
to isolate run-mode specific bugs and behavior and there's
no reason to rerun the socket daemon tests.
Eric Wong [Tue, 19 Jan 2021 09:34:31 +0000 (09:34 +0000)]
lei q: fix augment of compressed mailboxes
We need to delay writing out the mailbox until the compressor
process is up and running, so have startq wait a bit. This
means we must create the pipe early and hand it off to the
workers before augmenting, despite spawning the
gzip/pigz/xz/bzip2 process after augment is complete.
Eric Wong [Tue, 19 Jan 2021 09:34:30 +0000 (09:34 +0000)]
lei: write daemon errors to the sock directory
Most everything should be captured by the __WARN__ handlers and
routed to syslog, but it appears Perl may write to stderr in
some emergency cases, as can libc or other libraries. Just
point it to a small file that's cleared on reboot.
Eric Wong [Tue, 19 Jan 2021 09:34:28 +0000 (09:34 +0000)]
lei q: fix SIGPIPE handling from lei2mail workers
We need to properly propagate SIGPIPE to the top-level
lei-daemon process and avoid relying on auto-close,
since auto-close triggers Perl warnings when explicit
close() does not.
Eric Wong [Tue, 19 Jan 2021 09:34:27 +0000 (09:34 +0000)]
lei q: start ->mset while query_prepare runs
We don't need the result of query_prepare (for augmenting or
mass unlinking) until we're ready to deduplicate and write
results to the filesystem. This ought to let us hide some of
the cost of Xapian searches on multi-device/core systems for
extremely expensive searches.
Eric Wong [Mon, 18 Jan 2021 10:30:32 +0000 (04:30 -0600)]
lei_to_mail: optimize for MUAs
Instead of optimizing our own performance, this optimizes
our data to reduce work done by the MUA consumer.
Maildir and mbox destinations no longer support any notion of
the IMAP \Recent flag. JMAP has no functioning \Recent
equivalent, and neither do we.
In practice, having MUAs (e.g. mutt) clear the \Recent flag when
committing changes to the mbox is expensive: it creates a
rename(2) storm with Maildir and overwrites the entire mbox.
For mboxcl2 (and mboxcl), we'll further optimize mutt behavior
by setting the Lines: header in addition to Content-Length.
With these changes, mutt exits instantaneously on mboxcl2,
mboxcl, and Maildirs generated by "lei q".
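
To make the mboxcl2 part concrete, here is a Python sketch (not the project's Perl) of emitting one entry with both headers. `mboxcl2_entry` is a hypothetical helper; the exact header order and trailing newline lei writes are assumptions.

```python
def mboxcl2_entry(from_line, headers, body):
    """Build one mbox entry carrying Content-Length and Lines for the body.

    from_line: bytes without trailing newline
    headers:   bytes of existing headers, each ending in b"\n"
    body:      bytes of the message body
    """
    extra = b"Content-Length: %d\nLines: %d\n" % (len(body), body.count(b"\n"))
    # The blank line ends the header block; the final newline separates entries.
    return from_line + b"\n" + headers + extra + b"\n" + body + b"\n"
```

With both values present, a reader can skip over or count the body without scanning it.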
Eric Wong [Mon, 18 Jan 2021 10:30:31 +0000 (04:30 -0600)]
lei q: parallelize Maildir and mbox writing
With 4 dedicated workers, this seems to provide a 100-120%
speedup on a 4 core machine when writing thousands of search
results to a Maildir or mbox. This also sets us up for
high-latency IMAP destinations in the future.
This opens the door to more speedup opportunities such
as optimizing dedupe locking and other ways to reduce
contention.
This change is fairly complex and convoluted, unfortunately.
Further work may allow us to simplify it and even improve
performance.
Eric Wong [Sun, 17 Jan 2021 08:52:27 +0000 (20:52 -1200)]
lei q: add --mua-cmd switch
It can be convenient to invoke an MUA as search results
are being written to it, as an eager person may want to
start seeing results ASAP. This lets Maildir users
see results in the MUA as we are writing them. Users
of IMAP will eventually be able to take advantage of
this, too.
Since we don't support mbox locking (yet?), we'll only invoke
the MUA after results are done for mbox formats.
Eric Wong [Sat, 16 Jan 2021 11:36:23 +0000 (23:36 -1200)]
lei: q: results output to Maildir and mbox* working
All the augment and deduplication stuff seems to be working
based on unit tests. OpPipe is a nice general addition that
will probably make future state machines easier.
Eric Wong [Sat, 16 Jan 2021 11:36:22 +0000 (23:36 -1200)]
ipc: children don't kill on DESTROY, reduce FD sharing
Children should not be blindly killing siblings on ->DESTROY
since they're typically shorter-lived than parents. We'll
also be more careful about on-stack variables and now we
can rely exclusively on delete ops to close FDs.
We also need to fix our SIGPIPE handling for the oneshot case
while fixing a typo for delete, so we write "!" to the EOF pipe
to ensure the parent oneshot process exits on the first worker
that hits SIGPIPE, rather than waiting for the last worker to
hit SIGPIPE.
Eric Wong [Sun, 17 Jan 2021 07:09:59 +0000 (07:09 +0000)]
extindex: fix w/ Xapian 1.2.21..1.2.24
Xapian v1.2.21..v1.2.24 failed to set the close-on-exec flag
on the flintlock FD, causing "git cat-file" processes to
hold onto the lock and prevent subsequent Xapian::WritableDatabase
from locking the DB. So clean up git processes after committing
the miscidx transaction.
Eric Wong [Sun, 17 Jan 2021 07:09:58 +0000 (07:09 +0000)]
t/shared_kv: workaround old File::Spec
The version of File::Spec shipped with Perl 5.16.3 memoizes the
value of File::Spec->tmpdir, causing changes to $ENV{TMPDIR}
after-the-fact to be ignored.
We'll only work around this in the test since it's innocuous and
unlikely to matter in real-world usage (and there are many places
where we'd have to work around this in non-test code).
Eric Wong [Sun, 17 Jan 2021 07:09:56 +0000 (07:09 +0000)]
initialize scalar for `vec' perlop modification
Older Perls (tested 5.16.3) would warn on uninitialized scalars while
newer (tested 5.28.1) do not. Just initialize it to an empty string
since it'll be filled in by `vec'.
Eric Wong [Thu, 14 Jan 2021 07:06:27 +0000 (19:06 -1200)]
lei: pass FD to CWD via cmsg, use fchdir on server
Perl chdir() automatically does fchdir(2) if given a file
or directory handle since 5.8.8/5.10.0, so we can safely
rely on it given our 5.10.1+ requirement.
This means we no longer have to waste several milliseconds
loading the Cwd.so and making stat() calls to ensure
ENV{PWD} is correct and usable in the server. It also lets
us work in directories that are no longer accessible via
pathname.
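
The same mechanism can be sketched in Python (not the project's Perl): the client sends an open directory FD as SCM_RIGHTS ancillary data over an AF_UNIX socket, and the server calls fchdir(2) on it. `send_dir_fd` and `recv_fd_and_chdir` are hypothetical names; the one-byte payload is arbitrary.

```python
import array
import os
import socket

def send_dir_fd(sock, path):
    """Client side: pass an FD for `path` via SCM_RIGHTS."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fds = array.array("i", [fd])
        sock.sendmsg([b"d"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
    finally:
        os.close(fd)  # the receiver gets its own duplicate

def recv_fd_and_chdir(sock):
    """Server side: receive one FD and make it the working directory."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    fd = fds[0]
    try:
        os.fchdir(fd)  # no pathname lookup needed
    finally:
        os.close(fd)
```

Since no pathname is involved on the server side, this keeps working even when the directory can no longer be reached by name.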
Eric Wong [Thu, 14 Jan 2021 07:06:24 +0000 (19:06 -1200)]
lei: q: lock stdout on overview output
Most writes to stdout aren't atomic and we need locking to
prevent workers from interleaving and corrupting JSON output.
The one case where stdout won't require locking is if it's pointed
to a regular file with O_APPEND, as POSIX O_APPEND semantics
guarantee atomicity.
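
A Python sketch (not the project's Perl) of the check this implies: locking is skipped only for a regular file opened with O_APPEND. `stdout_needs_lock` is a hypothetical name.

```python
import fcntl
import os
import stat

def stdout_needs_lock(fd):
    """True unless fd is a regular file opened with O_APPEND."""
    is_regular = stat.S_ISREG(os.fstat(fd).st_mode)
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return not (is_regular and flags & os.O_APPEND)
```

Pipes, terminals and regular files opened without O_APPEND all fall into the "needs locking" case.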
Eric Wong [Thu, 14 Jan 2021 07:06:23 +0000 (19:06 -1200)]
lei_overview: rename "references" to "refs"
"references" was too long of a name compared to the other field
names we output in the JSON. While we currently don't have a
"refs:" search prefix for the "References:" header, we may in
the future.
Eric Wong [Thu, 14 Jan 2021 07:06:22 +0000 (19:06 -1200)]
search: rename "ts:" prefix to "rt:"
Meaning "Received time", as it is the best description of the
value we use from the "Received:" header, if present. JMAP
calls it "receivedAt", but "rt:" seems like a better
abbreviation being in line with "dt:" for the "Date" header.
"Timestamp" ("ts") was potentially ambiguous given the presence
of the "Date" header.
Eric Wong [Thu, 14 Jan 2021 07:06:18 +0000 (19:06 -1200)]
lei: reduce live FD references in wq child
We can shrink the @TO_CLOSE_ATFORK_CHILD array by two
elements, at least. It may be possible to eliminate this
array entirely but clobbering $quit doesn't seem to
remove references to $eof_w or the $listener socket.
Eric Wong [Thu, 14 Jan 2021 07:06:17 +0000 (19:06 -1200)]
lei: do not unlink socket path at exit
This matches existing -httpd/-nntpd/-imapd daemon behavior.
From what I can recall, it is less racy for the process doing
bind(2) to unlink it if stale.
Eric Wong [Thu, 14 Jan 2021 07:06:16 +0000 (19:06 -1200)]
daemon+watch: fix localization of %SIG for non-signalfd users
It turns out "local" did not take effect in the way we used it:
http://nntp.perl.org/group/perl.perl5.porters/258784
<CAHhgV8hPbcmkzWizp6Vijw921M5BOXixj4+zTh3nRS9vRBYk8w@mail.gmail.com>
Fortunately, none of the old use cases seem affected, unlike the
previous lei change to ensure consistent SIGPIPE handling.
Eric Wong [Thu, 14 Jan 2021 07:06:15 +0000 (19:06 -1200)]
lei: test SIGPIPE, stop xsearch workers on client abort
The new test ensures consistency between oneshot and
client/daemon users. Cancelling an in-progress result now also
stops xsearch workers to avoid wasted CPU and I/O.
Note the lei->atfork_child_wq usage changes; they work around
a bug in Perl 5: http://nntp.perl.org/group/perl.perl5.porters/258784
<CAHhgV8hPbcmkzWizp6Vijw921M5BOXixj4+zTh3nRS9vRBYk8w@mail.gmail.com>
This switches the internal protocol to use SOCK_SEQPACKET
AF_UNIX sockets to prevent merging of the messages the daemon
sends to the client to run the pager and kill/exit the client script.
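
The property being relied on is that SOCK_SEQPACKET preserves message boundaries, whereas a stream socket may coalesce back-to-back writes. A Python sketch (not the project's Perl); the helper names and the message payloads used with them are made up for illustration, not lei's wire format.

```python
import socket

def send_commands(sock, commands):
    """Send each command as its own packet."""
    for cmd in commands:
        sock.send(cmd)

def recv_commands(sock, count, bufsize=4096):
    """Each recv() returns exactly one packet on a SOCK_SEQPACKET socket."""
    return [sock.recv(bufsize) for _ in range(count)]
```

For example, with `socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)`, two sends arrive as two separate messages even if the receiver reads them late. This assumes a platform that supports SOCK_SEQPACKET on AF_UNIX, such as Linux.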
Eric Wong [Sun, 10 Jan 2021 12:15:19 +0000 (12:15 +0000)]
lei: query: restore JSON output overview
This internal API is better suited for fork-friendliness (but
locking + dedupe still needs to be re-added).
Normal "json" is the default, though stream-friendly "concatjson"
and "jsonl" (AKA "ndjson" AKA "ldjson") all seem working
(though tests aren't working, yet).
For normal "json", the biggest downside is the necessity of a
trailing "null" element at the end of the array because of
parallel processes, since (AFAIK) regular JSON doesn't allow
trailing commas, unlike JavaScript.
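
The trailing "null" trick can be sketched in Python (not the project's Perl): every writer appends "record," without knowing whether it is last, and a final "null]" closes the array. `render` and `parse` are hypothetical helpers.

```python
import json

def render(records):
    """Emit a JSON array where every record is followed by a comma."""
    out = ["["]
    for rec in records:
        out.append(json.dumps(rec) + ",\n")
    out.append("null]")  # sentinel absorbs the last comma
    return "".join(out)

def parse(text):
    """Consumer side: load the array and drop the trailing null sentinel."""
    data = json.loads(text)
    if data and data[-1] is None:
        data.pop()
    return data
```

Consumers of the plain "json" format need to ignore the final element; the "concatjson" and "jsonl" formats avoid the issue because they have no enclosing array.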
Eric Wong [Sun, 10 Jan 2021 12:15:18 +0000 (12:15 +0000)]
lei_xsearch: transfer 4 FDs internally, drop IO::FDPass
It's easier to make the code more generic by transferring
all four FDs (std(in|out|err) + socket) instead of omitting
stdin.
We'll be reading from stdin on some imports, and possibly
outputting to stdout, so omitting stdin now would needlessly
complicate things.
The differences with IO::FDPass "1" code paths and the "4"
code paths used by Inline::C and Socket::MsgHdr are far too
much to support and test at the moment.
Eric Wong [Sun, 10 Jan 2021 12:15:17 +0000 (12:15 +0000)]
lei: run pager in client script
While most single keystrokes work fine when the pager is
launched from the background daemon, Ctrl-C and WINCH can cause
strangeness when connected to the wrong terminal.
Eric Wong [Sun, 10 Jan 2021 12:15:16 +0000 (12:15 +0000)]
lei: fork + FD cleanup
Do a better job of closing FDs that we don't want shared with
the work queue workers. We'll also fix naming and use
"atfork_prepare" instead of "atfork_parent" to match
pthread_atfork(3) naming.
Eric Wong [Sun, 10 Jan 2021 12:15:15 +0000 (12:15 +0000)]
lei: get rid of client {pid} field
Using kill(2) is too dangerous since extremely long
queries may mean the original PID of the aborted lei(1)
client process gets recycled by a new process. It would
be bad if the lei_xsearch worker process issued a kill
on the wrong process.
So just rely on sending the exit message via socket.
Eric Wong [Sun, 10 Jan 2021 12:15:14 +0000 (12:15 +0000)]
ipc: drop unused fields, default sighandlers for wq
Relying on signal handlers to kill a particular worker was a
laggy/racy idea and I gave up on the idea of targeting workers
explicitly and instead chose to make wq_worker_decr stop the
next idle worker ->wq_exit.
We will however attempt to support sending signals to
a process group.