Eric Wong [Mon, 15 Mar 2021 11:57:54 +0000 (12:57 +0100)]
test_common: minor simplifications to setup_public_inboxes
This results in a small reduction in on-disk footprint
by removing Xapian docdata and reduction in code by removing
an unnecessary -index invocation.
Eric Wong [Mon, 15 Mar 2021 11:57:52 +0000 (12:57 +0100)]
test_common: add create_inbox helper sub
This saves over 100ms in t/lei-q-remote-import.t so far when
TMPDIR is on an SSD. If we can memoize inbox creation to save a
few dozen milliseconds every test, this could add up to
noticeable savings across our entire test suite.
Eric Wong [Sun, 14 Mar 2021 11:12:00 +0000 (13:12 +0200)]
lei q: do not import unnecessarily from externals
We only want to auto import messages that are exclusively in
remote externals. Messages in local externals are not
auto-imported to save space and reduce wear on storage devices.
Eric Wong [Sat, 13 Mar 2021 15:40:27 +0000 (15:40 +0000)]
searchidx: fix -Lmedium for IDs and filenames
This fixes "m:", "l:", "f:", "t:", "c:", "dfn:", and "n:" search
prefixes under indexlevel=medium when mixed with indexlevel=full
inboxish. We need positional data for Message-IDs, List-Id,
email addresses and filenames for exact matches, though we still
want to support wildcards.
Fortunately the storage cost is still small as these prefixes
tend to be small compared to message bodies. These are NOT
boolean terms since wildcard support and partial matching are
desired.
Eric Wong [Fri, 12 Mar 2021 10:39:43 +0000 (10:39 +0000)]
lei q: mbox*: disable changing parallelism, add --rsyncable
Unfortunately, being mairix-compatible with --threads means we
can't change thread-count of gzip, bzip2, or xz when writing to
compressed mbox with a --threads= parameter. It's probably not
worth changing, anyways, so another switch or additional value
for --jobs= won't be added.
While we're in the area, add --rsyncable support since
most installations of gzip support it nowadays.
Fixes: 5beb4a5f6585acd ("lei: replace --thread with --threads")
Eric Wong [Fri, 12 Mar 2021 10:39:42 +0000 (10:39 +0000)]
lei: rearrange OPT_DESC and drop some TBD switches
It'll be easier for us to have the option-spec in front of the
command instead of the other way around. The option-spec in
front makes it easier to sort and keep track of potentially
confusing/ambiguous use of command-line switches between
different commands.
We'll also update some of the proposed switches while we're
at it.
Eric Wong [Thu, 11 Mar 2021 01:45:39 +0000 (19:45 -0600)]
msg_part_text: discover text in application/octet-stream
Some poorly-configured MUAs will send application/octet-stream
even for text-only attachments. We can't expect all MUAs to be
configured with proper MIME types, and there is plenty of
historical mail that falls into this unfortunate category.
v2: simplify the check and ensure returned text is Perl "utf8"
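A minimal sketch of the fallback idea described above, written in Python rather than the project's Perl. The function names and the heuristic (no NUL bytes, valid UTF-8) are assumptions for illustration, not the check msg_part_text actually performs.

```python
def looks_like_text(body: bytes) -> bool:
    """Assumed heuristic: no NUL bytes and the payload is valid UTF-8."""
    if b"\0" in body:
        return False
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def part_text(content_type: str, body: bytes):
    """Return decoded text for text/* parts and text-like
    application/octet-stream parts; return None for anything else."""
    ctype = content_type.split(";", 1)[0].strip().lower()
    if ctype.startswith("text/"):
        return body.decode("utf-8", errors="replace")
    if ctype == "application/octet-stream" and looks_like_text(body):
        return body.decode("utf-8")
    return None
```

With this sketch, a patch sent as application/octet-stream is returned as text, while a binary payload with the same declared type is skipped.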
Eric Wong [Thu, 11 Mar 2021 10:45:38 +0000 (02:45 -0800)]
v2writable: fix undocumented --xapian-only
We can't pass $self and GLOBs across IPC channels transparently.
I only noticed this because I'm testing the application/octet-stream
fallback with https://public-inbox.org/meta/20210311014539.19756-1-e@80x24.org/
Fixes: bf8df8160076d7a1 ("searchidxshard: use PublicInbox::IPC to kill lots of code")
Eric Wong [Wed, 10 Mar 2021 13:23:44 +0000 (13:23 +0000)]
lei import: skip trashed Maildir messages
This matches IMAP behavior in NetReader in skipping \\Deleted
messages. Since lei may be used for personal, non-public mail,
Draft messages are NOT skipped by "lei import".
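For reference, Maildir stores flags as letters after ":2," in the filename, with "T" meaning trashed and "D" meaning draft. The Python sketch below shows the skip rule described above; the function names are hypothetical and this is not the code used by "lei import".

```python
def maildir_flags(filename: str) -> str:
    """Return the flag letters after ":2," in a Maildir filename,
    or "" when the name carries no info section (e.g. files in new/)."""
    _, sep, info = filename.rpartition(":2,")
    return info if sep else ""


def should_import(filename: str) -> bool:
    """Skip trashed (T) messages; drafts (D) are still imported."""
    return "T" not in maildir_flags(filename)
```

Only the flag section is inspected, so a "T" elsewhere in the filename (such as in the hostname) does not cause a message to be skipped.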
Eric Wong [Wed, 10 Mar 2021 13:23:43 +0000 (13:23 +0000)]
lei import: simplify Maildir handling
Having a one-off Maildir functionality in LeiStore doesn't seem
worth the maintenance burden, especially given an upcoming
change to skip trashed messages.
I expect this will hurt performance slightly with extra IPC
overhead for the socket copy, but "lei import" may eventually
become rare or at least not hit messages redundantly.
Eric Wong [Fri, 5 Mar 2021 01:38:29 +0000 (18:38 -0700)]
lei q: fix --import-before default and FIFO output
commit 6c551bffd75afb41d9b5e4774068abe7e06ed0e7
("lei q: --import-augment for mbox and mbox.gz") added a check
in _pre_augment_mbox for the option being a ref() to distinguish
between default values and user-supplied values (which are
non-ref SCALARs from Getopt::Long).
However, LeiQuery failed to use a SCALAR ref as the default
value, making the check in _pre_augment_mbox useless. We
now update LeiQuery to use \1 instead of 1 as the default
value so "lei q -f mboxrd ..." to stdout works once again.
Unfortunately, testing with redirects pointed to regular
files didn't trigger the code paths being updated. Testing
with a FIFO revealed further bugs in the FIFO handling code
which are also fixed in this commit.
We'll also update the $lei->out error message to be
less-specific about "stdout" and use the term "output", instead,
since LeiToMail replaces stdout for all mbox outputs.
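The default-versus-user-supplied distinction above relies on Perl's \1 being a SCALAR ref while Getopt::Long values are plain scalars. A loose Python analogy, with a hypothetical Default wrapper and resolve helper standing in for the ref() check, might look like this:

```python
class Default:
    """Wrapper marking a value as a built-in default (the role \\1 plays in Perl)."""

    def __init__(self, value):
        self.value = value


def resolve(opt):
    """Return (value, user_supplied) for an option that may be wrapped in Default."""
    if isinstance(opt, Default):
        return opt.value, False
    return opt, True


# The default must actually be wrapped, otherwise the check can never fire;
# that is the kind of mistake this commit fixes on the Perl side.
opts = {"import-before": Default(True)}
```

If the default were stored as a bare True, resolve() would report it as user-supplied, which mirrors how the check in _pre_augment_mbox was rendered useless.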
Eric Wong [Fri, 5 Mar 2021 03:10:58 +0000 (19:10 -0800)]
search: use "z:" instead of "bytes:" prefix
So far, searching by size has never been publicly documented,
and IMHO, of questionable utility. In any case, "z:" is what
mairix(1) uses, so it may be familiar to existing mairix users
(I've never used this prefix myself).
So far, this prefix is only used internally in tests and in
auto-translated queries from IMAP; thus this incompatible change
is unlikely to affect anyone.
Eric Wong [Thu, 4 Mar 2021 09:03:13 +0000 (17:03 +0800)]
lei_xsearch: cleanup {pkt_op_p} on exceptions
We must ensure pkt_op_p doesn't live beyond the scope of
->do_query in the top-level lei-daemon, otherwise it can leave a
stray socket hanging around in case of exceptions.
Eric Wong [Wed, 3 Mar 2021 13:48:56 +0000 (13:48 +0000)]
lei: use maildir_each_eml in more places
This saves us some code and redundant callsites for
eml_from_path. We'll change maildir_each_eml to include the
filename to facilitate an upcoming change to "lei q" without
--augment.
Eric Wong [Wed, 3 Mar 2021 13:48:55 +0000 (13:48 +0000)]
lei_xsearch: add_eml for remote mboxrd, not set_eml
set_eml will clobber any existing keywords. Since remote
mboxrds cannot (and should not) be sending keywords to us,
we shouldn't let remote external requests clobber already-set
keywords if they exist.
Eric Wong [Mon, 1 Mar 2021 05:47:36 +0000 (11:47 +0600)]
lei p2q: fix /dev/null filenames, fix phrase quoting rules
/dev/null mis-handling was reported by Kyle Meyer.
Phrase quoting rules are also refined to avoid leaving spaces
unquoted when "phrase generator" characters exist. Also,
context-free hunk headers no longer clobber the in_diff
state of the parser, since git can still generate those.
Eric Wong [Sun, 28 Feb 2021 12:25:28 +0000 (18:25 +0600)]
lei q: improve early aborts w/ remote externals
We must issue LeiStore->done if a client disconnects
while we're streaming from a remote external. This
can happen via SIGPIPE, or if a client process is
interrupted by any other means.
Eric Wong [Sun, 28 Feb 2021 12:25:27 +0000 (18:25 +0600)]
lei q: fix "-" shortcut for --stdin
Due to the way our option parser handles this special case, it
must be the first option spec. This helps us document things
better, even, since many commands accept either a pathname or
--stdin|-.
Eric Wong [Sun, 28 Feb 2021 12:25:26 +0000 (18:25 +0600)]
lei p2q: patch-to-query generator for "lei q --stdin"
Instead of teaching the to-be-implemented "lei show" to search
threads/messages based on commits, this orthogonal sub-command is
designed to generate queries for use with "lei q --stdin".
URI-escaped query parameters may be generated with --uri for
HTTP(S) public-inbox instances, but otherwise the output is
designed for "lei q --stdin".
To find threads for a given git commit from a git worktree:
lei p2q $COMMIT_OID | lei q --stdin -t ...
It can also read via --stdin|-:
curl $INBOX_URL/$MSGID/raw | lei p2q - | lei q --stdin -t
Or from the filesystem:
lei p2q $(git format-patch -1) | lei q --stdin -t
This defaults to only generating "dfpost:"-prefixed terms since
I've found those most useful for finding messages relating to a
commit. This is subject to change.
--want=s@ is a comma-separated or multi-value list of prefixes
that defaults to "dfpost7". Not all are implemented, yet, but
s, dfn, dfpre, and dfpost all seem to mostly work. Phrase
handling may need to be tweaked to work with Xapian.
OR, NEAR, ADJ, AND, NOT may be used with --want
(e.g. --want=dfpost,OR,dfn)
Prefixing the field prefix with '+' or '-' (e.g. --want=+dfpost)
generates "+dfpost:$EXTRACTED_OID" for Xapian. For non-boolean
search prefixes, wildcard (*) may also be supplied: (--want=dfn*)
For boolean search prefixes, suffixing the field prefix with a
digit (e.g. --want=dfpost7) provides a minimum length, allowing
truncated variations to be searched. This is helpful for
finding older messages as git chooses longer dfpost|dfpre
abbreviations as repos get larger.
Automatic date range generation is not implemented, yet.
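The --want syntax described above (optional '+' or '-' sign, field prefix, optional minimum-length digits, optional '*' wildcard, with operators mixed in) can be sketched in Python as follows. This is an illustration under assumptions, not the parser "lei p2q" uses: parse_want and truncations are hypothetical names, and generating every length from the minimum up to the extracted OID is a guess at what "truncated variations" means.

```python
import re

OPERATORS = {"OR", "NEAR", "ADJ", "AND", "NOT"}
WANT_RE = re.compile(r"\A([+-]?)([a-z]+)(\d*)(\*?)\Z")


def parse_want(spec: str):
    """Split a comma-separated --want value into operator strings and field dicts."""
    out = []
    for item in spec.split(","):
        if item in OPERATORS:
            out.append(item)
            continue
        m = WANT_RE.match(item)
        if m is None:
            raise ValueError(f"unrecognized --want item: {item!r}")
        sign, prefix, minlen, wildcard = m.groups()
        out.append({
            "sign": sign,
            "prefix": prefix,
            "minlen": int(minlen) if minlen else None,
            "wildcard": bool(wildcard),
        })
    return out


def truncations(prefix: str, oid: str, minlen: int):
    """List prefix:OID terms from minlen characters up to the full extracted OID."""
    start = min(minlen, len(oid))
    return [f"{prefix}:{oid[:n]}" for n in range(start, len(oid) + 1)]
```

For example, truncations("dfpost", "abcdef0123", 7) yields four terms, from "dfpost:abcdef0" through "dfpost:abcdef0123", which would let a query match older messages that used shorter abbreviations.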
Eric Wong [Fri, 26 Feb 2021 09:41:38 +0000 (22:41 -1100)]
lei q: support mbox locking by default
While this diverges from mairix(1) behavior, it's the safer
option. We'll follow Debian policy by supporting fcntl and
dotlocks by default (in that order). Users who do not want
locking can use "--lock=none".
This will be used in a read-only capacity for watching
mailboxes for keyword updates via inotify or EVFILT_VNODE.
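The locking order described above (fcntl first, then a dotlock) can be sketched in Python as below. This is an illustration only: mbox_lock is a hypothetical helper, and unlike real dotlock implementations it fails immediately instead of retrying when the ".lock" file already exists.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def mbox_lock(path: str, methods=("fcntl", "dotlock")):
    """Hold the requested locks on an mbox file for the duration of the block.
    Passing methods=() corresponds to "--lock=none"."""
    lockfile = path + ".lock"
    have_dotlock = False
    with open(path, "ab") as fh:
        try:
            if "fcntl" in methods:
                # blocks until the exclusive POSIX lock is granted
                fcntl.lockf(fh, fcntl.LOCK_EX)
            if "dotlock" in methods:
                # O_EXCL makes creation fail if another process holds the dotlock
                fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o666)
                os.close(fd)
                have_dotlock = True
            yield fh
        finally:
            if have_dotlock:
                os.unlink(lockfile)
            if "fcntl" in methods:
                fcntl.lockf(fh, fcntl.LOCK_UN)
```

Taking both locks covers readers and writers that only honor one of the two conventions, which is the usual argument for the Debian-style ordering.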
Eric Wong [Thu, 25 Feb 2021 10:11:06 +0000 (10:11 +0000)]
lei q: -tt marks direct hits as "flagged"
This can be used to quickly distinguish messages which were
direct hits when doing thread expansion vs messages that
were merely part of the same thread.
This is NOT mairix-derived behavior, but I occasionally found
it useful when looking at results in an MUA to know whether
a message was a direct hit or not.
This makes "-t" consistent with non-"-t" cases as far as keyword
reading goes.
Eric Wong [Thu, 25 Feb 2021 10:11:04 +0000 (10:11 +0000)]
lei import: use --in-format/-F for consistency
Since we recommend $IN_FORMAT:$LOCATION, this is hopefully not
intrusive (not that this is released software, yet). This is
to be consistent with "lei convert" usage.
We'll keep "-f" only for output formats, since that is used
for "lei q" and "lei convert" for outputs.
Eric Wong [Thu, 25 Feb 2021 10:11:03 +0000 (10:11 +0000)]
lei convert: support IMAP output and "-F eml" inputs
eml ("message/rfc822" MIME type) is supported by "lei import",
so it probably makes sense to support it via convert, at least
for tests. And IMAP output is supported in "lei q -o $MFOLDER",
so this only required renaming {nrd} => {net} and initializing
outputs before augment preparation (creating the IMAP folder).
Eric Wong [Wed, 24 Feb 2021 23:37:18 +0000 (05:37 +0600)]
lei q: auto-memoize remote messages into lei/store
This lets users avoid network traffic on subsequent searches at
the expense of local disk space. --no-import-remote may be
specified to reverse this trade-off for users with little
storage.
Eric Wong [Wed, 24 Feb 2021 23:37:17 +0000 (05:37 +0600)]
lei_external: don't treat IPv6 URLs as globs
IPv6 addresses are hexadecimals and colons inside brackets, so
add some DWIM-ery to ensure we don't attempt to treat addresses
like "http://[dead:beef]/foo/" as a glob.
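A minimal Python sketch of the DWIM check described above: strip a bracketed host made of hex digits, colons and dots before looking for glob metacharacters. The regexes and the looks_like_glob name are assumptions for illustration, not the logic in lei_external.

```python
import re

# Bracketed host directly after "://", e.g. [dead:beef] or [::ffff:1.2.3.4]
IPV6_HOST = re.compile(r"(?<=://)\[[0-9A-Fa-f:.]+\]")
GLOB_CHARS = re.compile(r"[*?\[\]]")


def looks_like_glob(url: str) -> bool:
    """True if url has glob metacharacters outside a bracketed IPv6 host."""
    without_host = IPV6_HOST.sub("", url, count=1)
    return GLOB_CHARS.search(without_host) is not None
```

Brackets elsewhere in the URL, such as a character class in the path, are still treated as glob syntax.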
Uwe Kleine-König [Wed, 24 Feb 2021 08:54:56 +0000 (09:54 +0100)]
www: use PublicInbox::WwwStream
This prevents the following problem logged to the webserver's error log:
E: Undefined subroutine &PublicInbox::WwwStream::code_footer called at /usr/share/perl5/PublicInbox/WwwListing.pm line 102.
in PublicInbox::ConfigIter=ARRAY(0x557aea68b1a8)::each_section at /usr/share/perl5/PublicInbox/ConfigIter.pm line 37.
Fixes: 7a3946ef122e ("www: support listing of inboxes")
Eric Wong [Tue, 23 Feb 2021 10:01:15 +0000 (04:01 -0600)]
lei q: reduce default lei2mail workers
While disk I/O is typically buffered for good scheduling,
git blob decoding uses a non-trivial amount of CPU time
and it helps to leave some CPU available for it.
Eric Wong [Tue, 23 Feb 2021 10:01:14 +0000 (04:01 -0600)]
lei: support "-C" to chdir in all sub commands
We'll also support "-C" at the end of most commands to give
users a little more flexibility when building command-lines.
This conflicts with "lei daemon-kill -CHLD", so that's
special-cased since "-C" makes no sense with daemon-kill,
anyways.
Unlike "git show", the to-be-implemented "lei show" will diverge
and enable "--find-copies[=<n>]" by default, so "-C[<n>]" won't
be necessary.
Eric Wong [Mon, 22 Feb 2021 11:22:59 +0000 (08:22 -0300)]
lei_auth: trim and remove leftover worker code
LeiAuth is no longer a separate worker process. Instead, it's
used directly by LeiToMail and LeiImport for sharing auth info
from the first worker to the rest of the workers, using
lei-daemon as a message router. So drop the old code to reduce
human cognitive load and interpreter memory overhead.
Eric Wong [Mon, 22 Feb 2021 11:22:57 +0000 (08:22 -0300)]
net_reader: mic_get: reuse connections if cache enabled
We only enable {mic_cached} in WQ workers, and those
aren't expected to fork again going forward. So caching
here avoids a penalty for the non-augmenting (imap_delete_all)
call with "lei q".
Eric Wong [Mon, 22 Feb 2021 11:22:56 +0000 (08:22 -0300)]
lei q: reduce wasted IMAP connection for auth
We can rework the first lei2mail worker to authenticate, and
then share auth info with the rest of the lei2mail workers. As
with "lei import", this uses PktOp and lei-daemon to share
updated credentials between the first and subsequent l2m workers.