Eric Wong [Sun, 10 Jan 2021 12:15:02 +0000 (12:15 +0000)]
cmd_ipc: send FDs with buffer payload
For another step in syscall reduction, we'll support
transferring 3 FDs and a buffer with a single sendmsg/recvmsg
syscall using Socket::MsgHdr if available.
Beyond script/lei itself, this will be used for internal IPC
between search backends (perhaps with SOCK_SEQPACKET). There's
a chance this could make it to the public-facing daemons, too.
This adds an optional dependency on the Socket::MsgHdr package,
available as libsocket-msghdr-perl on Debian-based distros
(but not CentOS 7.x and FreeBSD 11.x, at least).
Our Inline::C version in PublicInbox::Spawn remains the last
choice for script/lei due to the high startup time, and
IO::FDPass remains supported for non-Debian distros.
Since the socket name prefix changes from 3 to 4, we'll also
take this opportunity to make the argv+env buffer transfer less
error-prone by relying on argc instead of designated delimiters.
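As an illustration only (public-inbox is Perl and uses Socket::MsgHdr for this), the idea of moving three descriptors and a buffer in one sendmsg/recvmsg pair can be sketched in Python. The helper names, the payload and the use of a stream socketpair are assumptions made for the example, not the project's code.

```python
import array
import os
import socket

def send_fds_with_payload(sock, fds, payload):
    # One sendmsg(2): payload travels in the iovec, the FDs travel as
    # SCM_RIGHTS ancillary data.
    anc = [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
            array.array("i", fds).tobytes())]
    return sock.sendmsg([payload], anc)

def recv_fds_with_payload(sock, nfds, bufsize=4096):
    # One recvmsg(2): returns (list of newly installed FDs, payload bytes).
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(nfds * fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore a trailing partial int, if any.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    return list(fds), data

def demo():
    a, b = socket.socketpair()
    r, w = os.pipe()
    received = []
    try:
        # Stand-ins for stdin, stdout and stderr (hypothetical choice).
        send_fds_with_payload(a, [r, w, w], b"argv+env")
        received, payload = recv_fds_with_payload(b, 3)
        # Writing through a received FD reaches the original pipe.
        os.write(received[1], b"ok")
        return len(received), payload, os.read(r, 2)
    finally:
        for fd in [r, w, *received]:
            os.close(fd)
        a.close()
        b.close()
```

Compared with passing one descriptor per call, this needs a single syscall on each side regardless of how many descriptors accompany the buffer.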
Eric Wong [Sun, 10 Jan 2021 12:15:01 +0000 (12:15 +0000)]
ipc: add support for asynchronous callbacks
Similar to git->cat_async, this will let us deal with responses
asynchronously, as well as being able to mix synchronous and
asynchronous code transparently (though perhaps not optimally).
Eric Wong [Sun, 10 Jan 2021 12:15:00 +0000 (12:15 +0000)]
ds: block signals when reaping
This lets us call dwaitpid long before a process exits
and not have to wait around for it.
This is advantageous for lei where we can run dwaitpid on the
pager as soon as we spawn it, instead of waiting for a client
socket to go away on DESTROY.
Eric Wong [Sun, 10 Jan 2021 12:14:58 +0000 (12:14 +0000)]
lei query + pagination sorta working
Parallelism and interactivity with pager + SIGPIPE needs work;
but results are shown and phrase search works without shell
users having to apply Xapian quoting rules on top of standard
shell quoting.
Eric Wong [Tue, 5 Jan 2021 01:29:10 +0000 (01:29 +0000)]
v2writable: exact discontiguous history handling
We've always temporarily unindexed messages before reindexing
them again if there's discontiguous history.
This change improves the mechanism we use to prevent NNTP and
IMAP clients from seeing duplicate messages.
Previously, we relied on mapping Message-IDs to NNTP article
numbers to ensure clients would not see the same message twice.
This worked for most messages, but not for messages with
reused or duplicate Message-IDs.
Instead of relying on Message-IDs as a key, we now rely on the
git blob object ID for exact content matching. This allows
truly different messages to show up for NNTP|IMAP clients, while
still preventing those clients from seeing the same message again.
Eric Wong [Tue, 5 Jan 2021 09:04:37 +0000 (09:04 +0000)]
address: pairs: new helper for JMAP (and maybe lei)
Per JMAP RFC 8621 sec 4.1.2.3, we should be able to
denote the lack of a phrase/comment corresponding to an
email address with a JSON "null" (or Perl `undef').
Eric Wong [Tue, 5 Jan 2021 09:04:36 +0000 (09:04 +0000)]
lei: use client env as-is, drop daemon-env command
There may be subtle misbehaviours when mixing the existing
daemon env and the client-supplied env. Just do the simplest
thing and use the client env as-is.
We'll also start the ->event_step callback since we'll need
to remember some things for long-lived commands.
Eric Wong [Tue, 5 Jan 2021 09:04:34 +0000 (09:04 +0000)]
lei: completion: fix filename completion
"-o default" is what we want from "complete", "-o filename" just
tells readline the result from the "_lei" function might be a
filename and quote appropriately.
Eric Wong [Mon, 4 Jan 2021 04:16:23 +0000 (04:16 +0000)]
lei: fix opt_dash to pass non-dash args to @argv
The special "<>" handling in Getopt::Long actually invokes the
callback for every single command-line arg, not just those
prefixed by "-". This will let us pass arbitrary non-dashed
words for search queries so users can type queries naturally
without quoting (unless they want phrase search).
Eric Wong [Sun, 3 Jan 2021 20:58:29 +0000 (20:58 +0000)]
lei: prefer IO::FDPass over our Inline::C recv_3fds
While our recv_3fds() implementation is more efficient
syscall-wise, loading Inline takes nearly 50ms on my machine
even after Inline::C memoizes the build. The current ~20ms in
the fast path is barely acceptable to me, and 50ms would be
unusable.
Eventually, script/lei may invoke tcc(1) or cc(1) directly in
the fast path, but it needs @INC for the slow path, at least.
We'll encode the number of FDs into the socket name to allow
parallel installations, for now.
Eric Wong [Sun, 3 Jan 2021 02:06:16 +0000 (02:06 +0000)]
ipc: switch to one-way pipes
This fixes a performance regression in multi-process v2 indexing
due to the switch to PublicInbox::IPC. While Unix sockets are
fewer FDs to manage, pipes allow unprivileged processes to use
larger buffers (up to 1M) on out-of-the-box Linux instances.
A larger buffer via F_SETPIPE_SZ afforded by pipes was proven
valuable during v2 development in 2018 and continues to be
valuable when we get significant amounts of one-way traffic from
the producer parent to worker children.
Compression may be an option for systems without F_SETPIPE_SZ;
but it increases CPU usage with no memory bandwidth savings on
hosts where larger buffers are available.
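A rough Python sketch of the pipe-buffer enlargement described above, for illustration only. The function name is hypothetical, and the numeric fallbacks 1031/1032 are the Linux values of F_SETPIPE_SZ/F_GETPIPE_SZ, used when the running Python's fcntl module does not export them.

```python
import fcntl
import os

# Linux-specific fcntl commands; fall back to the Linux numeric values
# when this Python version does not expose the names.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

def make_big_pipe(want=1 << 20):
    """Create a pipe and try to grow its buffer to `want` bytes.

    Returns (read_fd, write_fd, size); size is None when the kernel
    cannot report it (e.g. on systems without F_GETPIPE_SZ).
    """
    r, w = os.pipe()
    try:
        fcntl.fcntl(w, F_SETPIPE_SZ, want)
    except OSError:
        pass  # over the unprivileged limit or unsupported: keep default
    try:
        size = fcntl.fcntl(w, F_GETPIPE_SZ)
    except OSError:
        size = None
    return r, w, size
```

On an out-of-the-box Linux host the unprivileged ceiling comes from /proc/sys/fs/pipe-max-size, so a request above that value simply leaves the default size in place here.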
Eric Wong [Sun, 3 Jan 2021 02:06:15 +0000 (02:06 +0000)]
use Eml (or MIME) objects for all indexing paths
We don't need to be keeping the raw message around after it hits
git. Shard work now relies on Storable (or Sereal) and all of
the indexing code relies on the Email::MIME-like API of Eml to
access interesting parts of the message.
Similarly, smsg->{raw_bytes} is no longer carried around and we
do the CRLF adjustment when setting smsg->{bytes}.
There's also a small simplification to t/import.t while
we're in the area to use xqx instead of spawn/popen_rd.
Eric Wong [Sun, 3 Jan 2021 02:06:12 +0000 (02:06 +0000)]
searchidxshard: use PublicInbox::IPC to kill lots of code
It's nice to prove the new code works by swapping it into
the current V2Writable / SearchIdxShard packages. This is
only the first step for the core bits, and we'll be able
to delete more code in a subsequent patch.
Eric Wong [Sun, 3 Jan 2021 02:12:06 +0000 (12:12 -1400)]
gcf2client: split out request API from regular git
While Gcf2Client is designed to mimic what git-cat-file writes
to stdout, its request format is different to support requests
with a git repository path included.
We'll highlight the distinction and make the GitAsyncCat support
code easier-to-follow as a result.
Since Gcf2Client relies on DS, we can rely on DS-specific code
here, too, and use a single Unix socket instead of separate
input and output pipes, reducing memory overhead in both user
and kernel space. Due to the interactive nature of requests and
responses, the buffer size limitations of Unix sockets on Linux
seem inconsequential here (just like they are for existing "git
cat-file --batch" use).
Eric Wong [Sun, 3 Jan 2021 11:24:51 +0000 (11:24 +0000)]
lei: fix output race in client/daemon mode
The daemon needs to flush stdout before disconnecting or killing
clients, otherwise they may reread empty data on redirected
outputs. We also don't want to unbuffer stdout too early in
case we have lots of small chunks of data to output.
The received stderr ($self->{2}) will always have autoflush, matching normal
STDERR behavior.
Eric Wong [Sun, 3 Jan 2021 11:24:50 +0000 (11:24 +0000)]
send and receive all 3 FDs at once
We'll always be transferring stdin, stdout, and stderr together
for lei. Perhaps I lack imagination or foresight, but I can't
think of a reason to send more or fewer FDs.
Eric Wong [Sat, 2 Jan 2021 08:32:04 +0000 (08:32 +0000)]
lei_store: alternative unconfigured "git var" workaround
While the changes to git->qx/git->popen from commit 171a9c24022ad7ef
will be useful for the lei daemon, hiding git error messages from
actual users is probably wrong and we'll just localize GIT_*
vars for testing.
Eric Wong [Fri, 1 Jan 2021 05:47:49 +0000 (17:47 -1200)]
import: unset GIT_CONFIG with `git config --global'
GIT_CONFIG is set by -convert, and users may have it set
for other reasons. In either case, it conflicts with
any attempt to use `git config --global` so we have
to unset it.
We need to use an absolute path after chdir in run modes
where scripts aren't loaded into in-memory subs.
The oneshot test was also failing under TEST_RUN_MODE=0 due to
no "lei-oneshot" command existing on the FS. So we force a
socket failure by making XDG_RUNTIME_DIR too large to fit into
the 108-byte .sun_path field of "struct sockaddr_un". This
even lets us simplify lei-oneshot significantly.
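The forced socket failure can be illustrated with a small Python sketch (not the project's Perl test code). The function name and the "lei/sock" layout under the runtime directory are hypothetical. The point is only that a path longer than .sun_path cannot be bound.

```python
import os
import socket

def bind_fails(runtime_dir):
    """Return True if a Unix socket cannot be bound under runtime_dir."""
    path = runtime_dir + "/lei/sock"  # hypothetical socket location
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.bind(path)
    except OSError:
        # Raised for an over-long path before the filesystem is touched;
        # also raised for other failures such as a missing directory.
        return True
    finally:
        s.close()
    os.unlink(path)  # bind succeeded: remove the socket file we created
    return False
```

A runtime directory name of a couple hundred bytes is enough to exceed the 108-byte field and trigger the failure path.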
Eric Wong [Thu, 31 Dec 2020 13:51:54 +0000 (13:51 +0000)]
on_destroy: support PID owner guard
Since we'll be forking for Xapian indexing and maybe
other places, having a simple guard in place to ensure
OnDestroy doesn't unexpectedly unlink files or similar
is a safer option.
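A minimal Python sketch of such a PID owner guard, for illustration only. The class shape and parameter names are assumptions and do not mirror the Perl OnDestroy API. It relies on CPython running __del__ as soon as the last reference goes away.

```python
import os

class OnDestroy:
    """Run `callback` when the object is destroyed, optionally only in
    the process that owns it."""

    def __init__(self, callback, owner_pid=None):
        self.callback = callback
        self.owner_pid = owner_pid  # None means "run in any process"

    def __del__(self):
        # A forked child inherits this object; skip the cleanup there so
        # it cannot unlink files the parent still needs.
        if self.owner_pid is not None and self.owner_pid != os.getpid():
            return
        self.callback()
```

Passing os.getpid() as owner_pid at construction time is enough to make the cleanup a no-op in any process forked afterwards.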
Eric Wong [Thu, 31 Dec 2020 13:51:52 +0000 (13:51 +0000)]
avoid calling waitpid from children in DESTROY
Objects with DESTROY callbacks get propagated to children, so we
must be careful to not invoke waitpid from children on their
sibling processes. Only parents (and their parents...) can reap
child processes.
Eric Wong [Thu, 31 Dec 2020 13:51:51 +0000 (13:51 +0000)]
lei: avoid Spawn package when starting daemon
Spawn was designed to speed up process spawning inside
long-lived daemons with largish memory usage. It does not help
for short-lived scripts which only exist to start and connect to
a daemon.
This change actually speeds up initial lei startup from
~190ms to ~140ms(!). Normal usage once the daemon is running
is unaffected, at <20ms for help text.
While we're in the area, simplify Cwd error message generation,
too.
Eric Wong [Thu, 31 Dec 2020 13:51:50 +0000 (13:51 +0000)]
syscall: SFD_NONBLOCK can be a constant, again
Since Perl exposes O_NONBLOCK as a constant, we can safely make
SFD_NONBLOCK a constant, too. This is not the case for
SFD_CLOEXEC, since O_CLOEXEC is not exposed by Perl despite
being used internally in the interpreter.
Eric Wong [Thu, 31 Dec 2020 13:51:49 +0000 (13:51 +0000)]
use PublicInbox::DS for dwaitpid
This simplifies our code and provides a more consistent API for
error handling. PublicInbox::DS can be loaded nowadays on all
*BSDs and Linux distros easily without extra packages to
install.
The downside is possibly increased startup time, but it's
probably not as big a problem with lei being a daemon
(and -mda possibly following suit).
Eric Wong [Thu, 31 Dec 2020 13:51:47 +0000 (13:51 +0000)]
searchidxshard: call DS->Reset at worker start
The daemon for the local email interface will be inside
the DS->EventLoop. -watch currently doesn't trigger this
bug since it doesn't enable parallelism, but it may in
the future.
Eric Wong [Thu, 31 Dec 2020 13:51:46 +0000 (13:51 +0000)]
lei_to_mail: open FIFOs O_WRONLY so we block
Opening a FIFO with O_RDWR always succeeds on Linux, which
causes the cat(1) process invoked by t/lei_to_mail.t to get
stuck. Furthermore O_APPEND makes no sense on FIFOs and
perhaps there's some kernel out there which will reject it.
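The difference between the two open modes can be observed with a short Python sketch. It is an illustration under Linux semantics, not the project's code, and the function name is hypothetical. O_NONBLOCK is added to the write-only open purely so that the example reports the missing reader instead of blocking.

```python
import errno
import os
import tempfile

def probe_fifo():
    """Return the errno from a non-blocking write-only open of a FIFO
    that has no reader (None if the open unexpectedly succeeds)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "fifo")
        os.mkfifo(path)
        # O_RDWR: on Linux this returns at once even though nobody else
        # has the FIFO open, so it never waits for a real reader.
        fd = os.open(path, os.O_RDWR)
        os.close(fd)
        # O_WRONLY: a blocking open would wait for a reader; with
        # O_NONBLOCK the kernel reports the missing reader instead.
        try:
            fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as e:
            return e.errno
        os.close(fd)
        return None
```

A plain O_WRONLY open (without O_NONBLOCK) is what gives the blocking behaviour the commit wants.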
Eric Wong [Thu, 31 Dec 2020 13:51:40 +0000 (13:51 +0000)]
lei_to_mail: unlink mboxes if not augmenting
This matches mairix(1) behavior and may be safer if there's
concurrent readers on the existing mbox, especially since
we don't currently implement mbox locking (nor does mairix).
Eric Wong [Thu, 31 Dec 2020 13:51:39 +0000 (13:51 +0000)]
ipc: use shutdown(2), base atfork* callback
shutdown(2) on a socket can be preferable if there's multiple
forked processes writing to a single worker and we really want
to shut things down ASAP.
It may also be good to provide an ipc_worker_exit method which
subclasses can override if needed for graceful shutdown. But we
won't need equivalents to atexit(3) since we can rely on DESTROY
handlers given this is Perl5.
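Why shutdown(2) helps when several processes share one socket can be shown with a small Python sketch (illustrative only; the function name is hypothetical). A duplicated descriptor stands in for a forked process that still holds the writing end open.

```python
import socket

def eof_after_shutdown():
    """Return what the reader sees after one holder calls shutdown()."""
    reader, writer = socket.socketpair()
    other = writer.dup()  # stands in for a forked process sharing the socket
    reader.settimeout(5)  # fail instead of hanging if EOF never arrives
    try:
        # shutdown() acts on the socket itself, so the reader gets EOF
        # even though `other` still has it open. close() alone would
        # leave the reader waiting until every holder had closed.
        writer.shutdown(socket.SHUT_WR)
        return reader.recv(1)
    finally:
        for s in (reader, writer, other):
            s.close()
```

The empty read is the end-of-file indication a worker can act on immediately.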
Eric Wong [Thu, 31 Dec 2020 13:51:36 +0000 (13:51 +0000)]
mid: use defined-or with `push' for uniqueness check
As shown recently in commit a05445fb400108e60ede7d377cf3b26a0392eb24
("config: config_fh_parse: micro-optimize"), relying on
the return value of `push' and defined-or operators can avoid
modifying the hash value scalar with an increment.
Eric Wong [Thu, 31 Dec 2020 13:51:35 +0000 (13:51 +0000)]
lei: rename "extinbox" => "external"
The words "extinbox" and "extindex" are too close and easy to
confuse with the other. Rename "extinbox" to "external", since
these could be IMAP, JMAP or other non-public-inbox search APIs.
Eric Wong [Thu, 31 Dec 2020 13:51:32 +0000 (13:51 +0000)]
ipc: generic IPC dispatch based on Storable
I intend to use this with LeiStore when importing from multiple
slow sources at once (e.g. curl, IMAP, etc). This is because
over.sqlite3 can only have a single writer, and we'll have
several slow readers running in parallel.
Watch and SearchIdxShard should also be able to use this code
in the future, but this will be proven with LeiStore, first.
Eric Wong [Thu, 31 Dec 2020 13:51:30 +0000 (13:51 +0000)]
lei_to_mail: support for non-seekable outputs
Users may wish to pipe output to "git am", "spamc",
or similar, so we need to support those cases and
not bail out on lseek(2) or ftruncate(2) failures.
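One way to tell seekable from non-seekable outputs, sketched in Python for illustration (the function name is hypothetical and this is not the project's Perl code): probe with lseek(2) and treat ESPIPE as "this is a pipe, socket or FIFO" rather than as a fatal error.

```python
import errno
import os

def can_seek(fd):
    """Return False for pipes, sockets and FIFOs; True for regular files."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)  # query the offset without moving it
        return True
    except OSError as e:
        if e.errno == errno.ESPIPE:
            return False
        raise  # any other error is a genuine failure
```

A writer can then skip truncation and repositioning when the result is False and simply stream its output.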
Eric Wong [Thu, 31 Dec 2020 13:51:27 +0000 (13:51 +0000)]
lei_to_mail: start --augment, dedupe, bz2 and xz
--augment will match the mairix(1) option of the same
name to augment existing search results. We'll need
to implement deduplication for a better user experience.
mutt ships with compressed mbox support for bz2 and xz,
at least, so we'll support those out-of-the-box.
Eric Wong [Thu, 31 Dec 2020 13:51:25 +0000 (13:51 +0000)]
lei_to_mail: start atomic and compressed mbox writing
We'll allow using multiple workers to write to a single
mbox (which could be compressed). This can be done
safely with O_APPEND + syswrite for uncompressed files,
and using a lock when piping to pigz/gzip/bzip2/xz.
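The uncompressed case can be sketched in Python as follows. This is an illustration, not the project's code: the function name and the fake one-line records are assumptions. Each worker opens the file with O_APPEND and emits every record with a single write call, so records from different workers do not interleave.

```python
import os
import tempfile

def append_from_workers(path, nworkers=4, nmsgs=50):
    """Fork workers that each append records to `path`; return all lines."""
    pids = []
    for w in range(nworkers):
        pid = os.fork()
        if pid == 0:  # child
            status = 1
            try:
                fd = os.open(path,
                             os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
                for i in range(nmsgs):
                    # One write(2) per record; O_APPEND makes the kernel
                    # position each write at the current end of file.
                    os.write(fd, b"From worker%d msg%d\n" % (w, i))
                os.close(fd)
                status = 0
            finally:
                os._exit(status)  # never return into the parent's code
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    with open(path, "rb") as fh:
        return fh.read().splitlines()
```

Compressed output is different because the compressor process sits between the workers and the file, which is why the commit uses a lock there.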
Eric Wong [Fri, 1 Jan 2021 04:51:46 +0000 (04:51 +0000)]
Merge tag 'v1.6.1' into eidx
public-inbox 1.6.1 - minor bugfix release
* tag 'v1.6.1': (31 commits)
public-inbox 1.6.1 - minor bugfix release
import: drop X-Status in addition to Status
eml: fix undefined vars on <Perl 5.28
t/config: test --get-urlmatch for git <2.26
inboxidle: avoid needless syscalls on refresh
inboxidle: clue users into resolving ENOSPC from inotify
inbox: name variable for values loop iterator
public-inbox-v[12]-format.pod: make lexgrog happy
manifest.js.gz: fix per-inbox /$INBOX/manifest.js.gz
Fix manpage section of perl module documentation
t/psgi_v2: ignore warnings on missing P::M::ReverseProxy
daemon: support --daemonize without Net::Server::Daemonize
doc: v2-format: drop repeated word
over: ensure old, merged {tid} is really gone
wwwattach: prevent deep-linking via Referer match
t/eml.t: workaround newer Email::MIME* behavior
nntp: attempt RFC 5536 3.1.5-conformant Path: headers
nntp: delimit Newsgroup: header with commas
tls: epollbit: account for miscellaneous OpenSSL errors
scripts/dupe-finder: restore $dbh variable
...
Eric Wong [Thu, 31 Dec 2020 13:24:36 +0000 (13:24 +0000)]
Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
ds: flatten + reuse @events, epoll_wait style fixes
ds: simplify EventLoop implementation
check defined return value for localized slurp errors
import: check for git->qx errors, clearer return values
git: qx: avoid extra "local" for scalar context case
search: remove {mset} option for ->mset method
search: remove pointless {relevance} setting
miscsearch: take reopen from Search and use it
extsearch: unconditionally reopen on access
extindex: allow using --all without EXTINDEX_DIR
extindex: add undocumented --no-scan switch
extindex: enable autoflush on STDOUT/STDERR
extindex: various --watch signal handling fixes
extindex: --watch for inotify-based updates
eml: fix undefined vars on <Perl 5.28
t/config: test --get-urlmatch for git <2.26
default to CORE::warn in $SIG{__WARN__} handlers
inbox: name variable for values loop iterator
inboxidle: avoid needless syscalls on refresh
inboxidle: clue users into resolving ENOSPC from inotify
...
Eric Wong [Sat, 26 Dec 2020 11:13:11 +0000 (11:13 +0000)]
lei: rename proposed "query" command to "q", add JSON output
Using "query" as a verb may be confusing when we'll also refer to
them as nouns with the "<ls|rm|mv>-query" sub commands. "query"
is also many characters to type without tab-completion on what I
expect to be one of the most commonly used sub-commands.
Furthermore, "q" is also the common query parameter name used by
our PSGI interface, as is the case with several major web search
engines; so there's an element of familiarity there.
The name "search" was disregarded because "show" could be a
commonly used lei sub-command, too, and typing "se" for
tab-completion may be slow since two-handed typists on QWERTY
keyboards won't be able to use alternating hands.
"f" or "find" could be a possibility here, too; but we're
currently using the term "forget" as a weaker version of
"remove" or "rm", though "ignore" could be substituted for
"forget", perhaps...
Kyle Meyer noted the lack of (proposed) JSON output support
so that's been added to the proposed UI.
Eric Wong [Sun, 27 Dec 2020 20:02:51 +0000 (20:02 +0000)]
lei_xsearch: cross-(inbox|extindex) search
While a single extindex combines multiple inboxes into a single
search index, extindex still requires up-front indexing on items
which can be searched. XSearch has no on-disk footprint itself
and uses Xapian DBs of existing publicinbox and extindex
("extinbox") exclusively.
XSearch still suffers from the multi-shard Xapian scalability
problems which led to the creation of extindex, but I expect the
number of shards to remain relatively low.
I envision users hosting public-inbox instances on their
workstations will only have two extindex combined by this, one
read-only extindex for serving public archives, and one
read-write extindex managed by LeiStore for private mail.
Eric Wong [Thu, 17 Dec 2020 09:14:48 +0000 (09:14 +0000)]
import: drop X-Status in addition to Status
It's actually supported by mutt, dovecot[1], and likely some other
software to augment the Status: header. While dovecot doesn't
expose X-Status to clients, mutt will write 'A' (answered) and
'F' to X-Status (but not T (draft)).
So we'll drop it like we do Status since it's not suitable for
public mail, but sticking it in an @UNWANTED_HEADERS array will
allow us to configure an override if needed.
Consistently returning the equivalent of pollfd.revents in a
portable manner was never worth the effort for us, as we use the
same ->event_step callback regardless of POLLIN/POLLOUT/POLLHUP.
Being Perl, @events knows its size and we don't have to return
a maximum index for the caller to iterate on.
We can also avoid redundant integer coercion ("+0") since we
ensure everything is an IV in other places.
Finally, vec() is preferable to ("\0" x $size) for resizing
buffers because it only needs to write the extended portion
and not overwrite the entire buffer.
Eric Wong [Sun, 27 Dec 2020 02:53:06 +0000 (02:53 +0000)]
ds: simplify EventLoop implementation
More importantly, make it easier-to-find the sub by avoiding
runtime manipulation of subroutine names. There's no point in
avoiding a potential call to _InitPoller in EventLoop since
entering EventLoop is rare.
On the contrary, PublicInbox::DS->new is called often and this
change to avoid entering _InitPoller there may have more
benefits (which may still be unmeasurable).
Eric Wong [Sun, 27 Dec 2020 19:38:29 +0000 (19:38 +0000)]
search: remove {mset} option for ->mset method
The ->mset method always returns a Xapian mset nowadays, so
naming a parameter {mset} is too confusing. As it does with
MiscSearch, setting the {relevance} parameter to -1 now sorts by
ascending docid order. -2 is now supported for descending
docid order, too, since it may be useful for lei users.
Eric Wong [Sun, 27 Dec 2020 11:01:42 +0000 (11:01 +0000)]
miscsearch: take reopen from Search and use it
As with ExtSearch, MiscSearch lacks a janky cleanup timer of
PublicInbox::Inbox objects, leading to info about
inboxes/newsgroups going stale. Fortunately, we don't use
MiscSearch very heavily, yet.
In the future, we may be able to detect new inboxes without
having to SIGHUP or restart daemons using MiscSearch.
Eric Wong [Sun, 27 Dec 2020 11:01:41 +0000 (11:01 +0000)]
extsearch: unconditionally reopen on access
Since ExtSearch lacks the janky cleanup timer of
PublicInbox::Inbox objects, its search results get stale.
Reopen the Xapian DB on every ->search call for now, as
reducing reopen calls doesn't seem worth the complexity.
The Xapian::Database::reopen operation itself takes only ~50us
on my old workstation with 3 shards totaling <200GB. Other
parts of Xapian dominate the search time, so the reopen seems
inconsequential with single-digit shard counts.
Eric Wong [Sat, 26 Dec 2020 10:16:24 +0000 (10:16 +0000)]
extindex: allow using --all without EXTINDEX_DIR
If "--all" is specified to index all inboxes, implicitly choose
the configured [extindex "all"] external index since "--all" is
incompatible with specifying inbox directories on the
command-line.
Eric Wong [Sat, 26 Dec 2020 10:16:23 +0000 (10:16 +0000)]
extindex: add undocumented --no-scan switch
This makes diagnosing --watch problems easier when there's
50K inboxes by avoiding the lengthy scan (which is the reason
--watch exists in the first place).
Eric Wong [Sat, 26 Dec 2020 10:16:22 +0000 (10:16 +0000)]
extindex: enable autoflush on STDOUT/STDERR
With --watch, the output may be redirected to a pipe or socket
which Perl may decide to buffer. Ensure Perl doesn't buffer
these outputs since they can provide real-time status updates
in response to signals or FS activity.
Eric Wong [Sat, 26 Dec 2020 10:16:21 +0000 (10:16 +0000)]
extindex: various --watch signal handling fixes
We need to clobber the SIGUSR1 resync queue on SIGHUP to
invalidate old inbox objects. Furthermore, the lengthy
initial scan needs to ignore signals intended for the
event loop to avoid unexpected behavior. Finally, add
some progress output to inform users on the terminal.
Eric Wong [Sat, 26 Dec 2020 01:44:37 +0000 (01:44 +0000)]
extindex: --watch for inotify-based updates
This reuses existing InboxIdle infrastructure to update external
indices based on per-inbox updates. This is an alternative to
auto-updating external indices via the -index command and also
works with existing uses of -mda and public-inbox-watch.
Using inotify (or EVFILT_VNODE) allows watching thousands of
inboxes without having to scan every single one at every
invocation.
This is especially beneficial in cases where an external index
is not writable to the users writing to per-inbox indices.
Eric Wong [Sat, 26 Dec 2020 12:25:42 +0000 (12:25 +0000)]
eml: fix undefined vars on <Perl 5.28
Encode::MIME::Header::_decode_octets did not correctly default
to Encode::FB_DEFAULT until Encode 2.93 (perl5.git commit
0c541dc5633a341cf44b818014b58e7f8be532e9). Provide the default
again to work with older Perls.
Eric Wong [Sat, 26 Dec 2020 12:30:35 +0000 (12:30 +0000)]
t/config: test --get-urlmatch for git <2.26
While git 1.8.5 learned --get-urlmatch, git did not learn to
match URLs against wildcards until 2.26. So only depend on
1.8.5 for this test since 2.26 is too new.