Eric Wong [Wed, 21 Jul 2021 14:05:49 +0000 (14:05 +0000)]
extsearch: support publicinbox.*.boost parameter
This behaves identically to the lei external "boost" parameter in
prioritizing raw messages for extindex.
Relying exclusively on the config file order doesn't work well
for mirrors since it's impossible to guarantee config file
ordering via grokmirror hooks.
Config file ordering remains the default if boost is
unconfigured, or in case of ties.
Note: I chose the name "boost" rather than "priority" or "rank"
since I always get confused by whether higher or lower numbers
take precedence when it comes to kernel scheduling. "weight" is
also a part of Xapian API terminology, which we currently do not
expose to configuration (but may in the future).
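As a rough illustration of the ordering rule described above (not the
actual Perl implementation), the sketch below sorts inboxes by boost and
falls back to config file order on ties or when boost is unset. The
`Inbox` tuple, its field names, treating an unconfigured boost as 0, and
the assumption that a higher boost takes precedence are all choices made
for this example.

```python
from typing import List, NamedTuple, Optional


class Inbox(NamedTuple):
    """Hypothetical stand-in for one publicinbox.<name> config section."""
    name: str
    config_order: int            # position in the config file
    boost: Optional[int] = None  # None means boost is unconfigured


def prioritize(inboxes: List[Inbox]) -> List[Inbox]:
    # Higher boost sorts first (assumption); an unset boost counts as 0.
    # Ties fall back to config file order.
    return sorted(inboxes, key=lambda i: (-(i.boost or 0), i.config_order))
```

With no boost configured anywhere, the result is simply config file
order, which matches the default described in this commit.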
Eric Wong [Tue, 20 Jul 2021 08:58:58 +0000 (08:58 +0000)]
httpd: fix SIGHUP by invalidating cache on reload
Since we require separate PublicInbox::HTTPD instances for each
listen socket address (in order to support {SERVER_<NAME|PORT>}
for PSGI env), the old cache needed to be invalidated on rare
app refreshes.
SIGHUP has always been broken in -httpd (but not -imapd or
-nntpd) due to this cache.
Update the daemon documentation and 5.10.1-ize some bits while
we're in the area.
Eric Wong [Thu, 8 Jul 2021 08:25:19 +0000 (08:25 +0000)]
extindex: dedupe: reduce SQLite contention and dirty data
Complex queries cause SQLite to block readers for longer than
their retry period. For dedupe, it was also preventing us from
making good use of checkpoints due to the query time.
With many deduplications, checkpoints are necessary to maintain
system health due to having too much data piled up.
Eric Wong [Wed, 7 Jul 2021 23:24:55 +0000 (23:24 +0000)]
extsearchidx: ignore Eml warnings across the board
There's nothing we can do about misformatted emails and headers
we get from untrusted sources. They're too noisy and those
messages already exist in public-inboxes, anyways, so just
keep things quiet so we can spot real problems more easily.
Eric Wong [Tue, 6 Jul 2021 12:42:03 +0000 (12:42 +0000)]
extindex: --gc: avoid SQLite lock conflict on shard cleanup
Xapian shard cleanup only requires read-only access to
over.sqlite3, so avoid opening it with read-write access since
create_tables will hit lock conflicts on "INSERT OR IGNORE"
statements.
Eric Wong [Tue, 6 Jul 2021 12:42:02 +0000 (12:42 +0000)]
extindex: implement --dedupe to fix old extindices
This is intended to fix older indices that had deduplication
bugs for matching content. It'll also make dealing with
future changes to ContentHash easier since that's never
guaranteed stable.
It also supports --dry-run to print changes only without
making them.
Eric Wong [Tue, 6 Jul 2021 12:42:01 +0000 (12:42 +0000)]
eml: relax warn_ignore regexps for current Email::Address::XS
These seem needed with the data I'm currently working on, but I
haven't changed my version of Email::Address::XS since my last
Debian stable upgrade (to buster).
Eric Wong [Fri, 2 Jul 2021 21:02:23 +0000 (21:02 +0000)]
lei import: increase flags search batch size, display progress
IMAP flag-only synchronization doesn't fetch entire messages,
so iff a user specified a batch size for full messages, we can
safely bump the flags search batch size to 10000 times that.
Since I sometimes wonder why nothing happens for several seconds
after starting "lei import $URL", we'll also show some progress
during the flag synchronization phase.
Eric Wong [Fri, 2 Jul 2021 20:42:09 +0000 (20:42 +0000)]
extsearchidx: extra assertions for deduplication flow
I haven't found any bugs from this (still looking for missed
deduplication bugs), and it's a bit shorter and more likely to
catch future bugs. Clean up an unnecessary ->{mid} array copy
while we're at it, too.
Eric Wong [Wed, 30 Jun 2021 17:58:54 +0000 (17:58 +0000)]
searchidx: default BATCH_BYTES to 8MB on 64-bit systems
This default seems closer to reasonable on 64-bit systems which
are the norm these days. 32-bit systems gain 48K so it's an
even 1 MB, but we need to keep 32-bit systems from using too
much since there are still some ancient systems out there with
small inboxes.
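A minimal sketch of the word-size-dependent default described above,
written in Python rather than the project's Perl. The function name and
the `pointer_bits` parameter are hypothetical; only the 8MB and 1 MB
figures come from this commit message.

```python
import struct
from typing import Optional


def default_batch_bytes(pointer_bits: Optional[int] = None) -> int:
    # Detect the native pointer width unless the caller supplies one.
    if pointer_bits is None:
        pointer_bits = struct.calcsize("P") * 8
    # 8MB on 64-bit systems, an even 1 MB on everything smaller.
    return 8 * 1024 * 1024 if pointer_bits >= 64 else 1024 * 1024
```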
Eric Wong [Fri, 25 Jun 2021 01:06:39 +0000 (01:06 +0000)]
extindex: maintain pack symlinks and use "git multi-pack-index"
This is a fair amount of complexity, but it speeds up
"git cat-file --batch" startup by 3-4% with 50K packfiles
with a hot kernel cache.
This appears extremely sensitive to RAM available to
the kernel page cache with my SATA 2 SSD. Faster storage
and more RAM can bring loading pack.
2.60s vs 2.69s were the best cases on my workstation with and
without the multi-pack-index, however times could be all over
the place (even in the minutes) with more activity on my
workstation.
Getting sub-minute times requires a git patch to speed up
alt_odb_usable():
<https://lore.kernel.org/20210624005806.12079-1-e@80x24.org/>
Otherwise, prepare to wait several minutes.
It's also easier to patch and install git locally since the
git.git build system defaults to prefix=$HOME and dealing with
dynamic linking with libgit2 is more difficult for end users
relying on Inline::C.
libgit2 remains in use for the non-ALL.git case, but maybe it's
not necessary (libgit2 is significantly slower than git in
Debian 10 due to SHA-1 collision checking).
Eric Wong [Wed, 23 Jun 2021 11:14:22 +0000 (07:14 -0400)]
www: do not warn on blank query parameters
Sometimes users (or bots) may lead queries with '&' and
trigger uninitialized variable warnings, just ignore them
and give consumers a $ctx->{qp}->{''} entry.
While we're in the area, pass a regexp rather than scalar string
to the `split' perlop to prevent Perl from recompiling the
regexp on every call.
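The two points above can be sketched in Python (the real code is Perl;
`parse_query` and `_PAIR_SEP` are names invented for this example). A
query string with a leading '&' yields an empty-key entry instead of a
warning, and the separator pattern is compiled once rather than on each
call. Mapping the empty key to an empty string is an assumption.

```python
import re
from urllib.parse import unquote_plus

# Compiled once at import time instead of on every call.
_PAIR_SEP = re.compile(r"[&;]+")


def parse_query(query_string: str) -> dict:
    qp = {}
    for pair in _PAIR_SEP.split(query_string):
        # A leading separator produces an empty pair; keep it quietly
        # under the '' key rather than warning about it.
        key, _, value = pair.partition("=")
        qp[unquote_plus(key)] = unquote_plus(value)
    return qp
```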
Eric Wong [Wed, 23 Jun 2021 11:14:21 +0000 (07:14 -0400)]
www_listing: start updating for pagination + search
When dealing with thousands of inboxes, displaying all of
them on a single page isn't going to work. So steal some
pagination and search results code from the message search
to generate some basic HTML output that looks good in w3m.
Eric Wong [Wed, 23 Jun 2021 11:14:20 +0000 (07:14 -0400)]
search: make xap_terms easier-to-use and use it more
This allows us to simplify callers throughout, and exceptions
can no longer be silently hidden. MiscSearch now uses xap_terms
for looking up eidx_key terms for a code reduction.
We also simplify LeiStore->_msg_kw for runtime use by moving the
MsetIterator handling into t/lei_store.t test case.
Eric Wong [Tue, 22 Jun 2021 10:04:36 +0000 (10:04 +0000)]
lei: use open() perlop for -C (chdir)
This is for consistency with the open() at initial accept, in
case we hit a code path which expects Perl directory handles
rather than "file handles". Both work with the chdir() perlop
(fchdir(2), in our case).
Eric Wong [Sun, 20 Jun 2021 04:33:19 +0000 (04:33 +0000)]
lei sucks: don't warn or error out on missing dependencies
%INC can hold undef. This can be hit on a Linux machine missing
Linux::Inotify2. Loading PublicInbox::KQNotify is attempted and
PublicInbox/KQNotify.pm always exists, causing the `undef' entry
in %INC when it fails to load IO::KQueue.
Eric Wong [Sat, 19 Jun 2021 03:22:28 +0000 (03:22 +0000)]
view: extra check for redundant messages in HTML view
There appear to be some cases of duplicates appearing due to
-extindex. I haven't nailed down the cause of it, yet, but
this should make things easier for readers using the PSGI
HTML interface in the meantime.
The raw mboxrd remains undeduplicated for now, and the
correct fix/workaround would be some fsck-like mode for
public-inbox-extindex.
Eric Wong [Fri, 18 Jun 2021 21:44:38 +0000 (18:44 -0300)]
scripts: add syscall-list tool for development
We'll be supporting inotify directly as we do with epoll so
Linux users won't have to deal with XS, extra DSOs or install
Linux::Inotify2 (and common::sense) modules.
Eric Wong [Thu, 17 Jun 2021 22:00:47 +0000 (22:00 +0000)]
lei/store: cull redundant docids based on blob OID
I'm not sure how this happened (only once for me in March), but
it should not happen... In any case, we'll operate on the
lowest numbered docid and cull redundant index entries when
lei/store is open for read-write.
This also fixes the normal lei/store removal path to clean up
the xref3 table (since it's not done automatically for
public-facing -eidx due to the multi-list nature of it).
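The culling rule above (keep the lowest numbered docid per blob OID,
drop the rest) might look roughly like this. It is a Python sketch over
a plain dict, not the lei/store code; the function name and the
docid-to-OID mapping are assumptions for illustration.

```python
from collections import defaultdict


def cull_redundant(docid_to_oid: dict):
    """Return ({oid: docid kept}, [docids to cull])."""
    by_oid = defaultdict(list)
    for docid, oid in docid_to_oid.items():
        by_oid[oid].append(docid)
    keep, cull = {}, []
    for oid, docids in by_oid.items():
        docids.sort()
        keep[oid] = docids[0]     # operate on the lowest numbered docid
        cull.extend(docids[1:])   # everything else is redundant
    return keep, sorted(cull)
```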
Eric Wong [Sun, 13 Jun 2021 18:12:06 +0000 (18:12 +0000)]
lei index+import: reject keywords from R/O IMAP
Since users can't set IMAP flags in read-only IMAP folders,
we won't clobber local flags when importing from IMAP. This
also enables the local_blob fallback used for lei-index to
be used for index deduplication.
Eric Wong [Sat, 12 Jun 2021 00:10:45 +0000 (00:10 +0000)]
net_reader: canonicalize URL args on add_url
This fixes cases when users specify an IMAP or NNTP URL
with standard port numbers explicitly.
In other words, this allows users to use
"lei ls-mail-source nntps://public-inbox.org:563/" and
"lei ls-mail-source imaps://public-inbox.org:993/"
without hitting "BUG:" errors.
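One way to picture the canonicalization is to drop the port when it is
the scheme's standard one, so both spellings of a URL compare equal.
This Python sketch is an assumption about the general technique, not the
net_reader code; `canonicalize` and `DEFAULT_PORTS` are hypothetical
names.

```python
from urllib.parse import urlsplit, urlunsplit

# Standard ports for the schemes mentioned above.
DEFAULT_PORTS = {"imap": 143, "imaps": 993, "nntp": 119, "nntps": 563}


def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    # Strip the port only when it is explicit and matches the default.
    if colon and port.isdigit() and int(port) == DEFAULT_PORTS.get(parts.scheme):
        hostport = host
    return urlunsplit((parts.scheme, userinfo + at + hostport,
                       parts.path, parts.query, parts.fragment))
```

Non-standard ports and URLs without a port pass through unchanged.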
Eric Wong [Fri, 11 Jun 2021 09:42:40 +0000 (09:42 +0000)]
lei ls-mail-source: list IMAP folders and NNTP groups
While other tools can provide the same functionality, having
integration with git-credential is convenient, here. Caching
and completion will be implemented separately.
Eric Wong [Wed, 9 Jun 2021 23:27:50 +0000 (20:27 -0300)]
lei tag: less confusing warning about unimported messages
"unimported" is more meaningful than "missing", here. And
instead of having every worker spew about unimported messages,
we'll accumulate and only print one warning line. This
necessitated altering ->DESTROY behavior and persisting
the client socket within the $lei object itself, not just
the PktOp consumer object.
Eric Wong [Wed, 9 Jun 2021 22:39:24 +0000 (22:39 +0000)]
lei import: support --new-only for IMAP
Taking ~40s to synchronize a ~75K message IMAP folder is
still a lot of time, so support an option to only touch
new messages.
This is similar to "offlineimap -q" (quick) or "mbsync --new"
switches, but lei already accepts "-q" as a shortcut for
--quiet. "--new" could work, but "--new-only" might be more
descriptive (or "--only-new"?), since the default
also fetches new messages.
v2: warn for non-IMAP sources, I'm not sure it's worth it for
Maildir or other sources, yet. It will also make sense
for MH and JMAP once we support them.
Eric Wong [Wed, 9 Jun 2021 07:47:49 +0000 (07:47 +0000)]
lei tag: parallelize Maildir access
Since Maildir isn't guaranteed to have any sort of order, we
can parallelize inputs, here. On a 4-core system, this reduced
one of my tag invocations from 5.5 to 1.4s.
Eric Wong [Wed, 9 Jun 2021 10:03:05 +0000 (10:03 +0000)]
lei/store: do eidx_init before creating R/W lms dbh
Sharing lms->{dbh} with eidx shards appears to be the cause of
the "Issuing rollback() due to DESTROY without explicit
disconnect() of DBD::SQLite::db handle" messages I've been
seeing from "lei up".
Eric Wong [Tue, 8 Jun 2021 23:56:13 +0000 (23:56 +0000)]
lei pmdir: fix nproc for <= 4 CPUs
I forgot my FreeBSD VM has 8 cores, actually, and tweaked the
nproc detection on that machine before finalizing commit 10b523eb017162240b1ac3647f8dcbbf2be348a7
("lei import: speed up repeated Maildir imports")
Fixes: 10b523eb01716224 ("lei import: speed up repeated Maildir imports")
Eric Wong [Tue, 8 Jun 2021 09:50:21 +0000 (09:50 +0000)]
lei import: speed up repeated Maildir imports
On a 4-core CPU, this speeds up "lei import" on a largish
Maildir inbox with 75K messages from ~8 minutes down to ~40s.
Parallelizing alone did not bring any improvement and may
even hurt performance slightly, depending on CPU availability.
However, creating the index on the "fid" and "name" columns in
blob2name yields us the same speedup we got.
Parallelizing IMAP makes more sense due to the fact most IMAP
stores are non-local and subject to network latency.
Followup-to: bdecd7ed8e0dcf0b45491b947cd737ba8cfe38a3 ("lei import: speed up kw updates for old IMAP messages")
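To show why an index on the "fid" and "name" columns matters, the Python
sketch below asks SQLite for its query plan with and without such an
index. Only the table name and those two columns come from this commit;
the `oidbin` column, the index name `idx_fid_name`, and the query itself
are assumptions for illustration.

```python
import sqlite3


def lookup_plan(with_index: bool) -> str:
    db = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema; the real table may differ.
    db.execute("CREATE TABLE blob2name (oidbin BLOB, fid INTEGER, name TEXT)")
    if with_index:
        db.execute("CREATE INDEX idx_fid_name ON blob2name (fid, name)")
    rows = db.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT oidbin FROM blob2name WHERE fid = ? AND name = ?",
        (1, "x"),
    ).fetchall()
    db.close()
    # The last column of each row is the human-readable plan detail.
    return " ".join(row[-1] for row in rows)
```

Without the index the plan is a full table scan per lookup; with it, the
lookup becomes an index search, which is the kind of change that turns
minutes into seconds on a 75K message folder.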
Eric Wong [Tue, 8 Jun 2021 09:50:19 +0000 (09:50 +0000)]
lei: safety fix for multiple WQ classes
For commands utilizing multiple workers, this simple change
generalizes the persistence mechanism and prevents
lei->dclose from causing script/lei to exit if there are
still in-flight workers.
This ought to prevent read-after-write consistency problems that
occasionally manifest in scripts (e.g. test cases) but usually
go unnoticed in normal use.
Eric Wong [Mon, 7 Jun 2021 19:06:30 +0000 (19:06 +0000)]
lei/store: checkpoint commits mail_sync.sqlite3
We mainly rely on ->done with lei/store, but moving to
->checkpoint probably makes sense. Note: over, msgmap, and
mail_sync all have slightly different transaction behavior;
perhaps they can be unified in the future.
Eric Wong [Sat, 5 Jun 2021 21:04:50 +0000 (21:04 +0000)]
INSTALL: note about lei metadata storage
Since lei is for personal mailboxes, I don't think lei needs to
keep keyword and label changes in history. And fix a minor
wording problem ("or" => "nor") while we're at it.
Eric Wong [Thu, 3 Jun 2021 01:05:20 +0000 (01:05 +0000)]
lei import: speed up kw updates for old IMAP messages
On a 4-core CPU, this speeds up "lei import" on a largish IMAP
inbox with 75K messages from ~21 minutes down to 40s.
Parallelizing with the new LeiImportKw WQ worker class gives a
near-linear speedup and brought the runtime down to ~5:40.
The new idx_fid_uid index on the "fid" and "uid" columns of
blob2num in mail_sync.sqlite3 brought us the final speedup.
An additional index on over.sqlite3#xref3(oidbin) did not help,
since idx_nntp already exists and speeds up the new ->oidbin_exists
internal API.
I initially experimented with a separate "lei import-kw" command
but decided against it since it's useless outside of IMAP+JMAP
and would require extra cognitive overhead for both users and
hackers. So LeiImportKw is just a WQ worker used by "lei import"
and not its own user-visible command.
v2: fix ikw_done_wait arg handling (ugh, confusing API :x)
Eric Wong [Sun, 30 May 2021 11:45:44 +0000 (11:45 +0000)]
lei import: import IMAP flag changes from old messages
This makes "lei import" behavior with IMAP folders more
consistent with that with Maildir.
Opening IMAP folders read-write with "SELECT" (instead of
read-only with "EXAMINE") was necessary, since it lets an IMAP
server communicate to us as to whether or not it's worth
refetching IMAP flags of previously imported messages.
Fetching UID+FLAGS only is one of the fastest IMAP operations
with dovecot, our -imapd and presumably other common IMAP servers.
It is issued by common MUAs such as mutt after every SELECT.
Users may now rely on "lei import" exclusively to merge mail and
keywords into lei/store, and "lei export-kw" to propagate
keyword changes back to IMAP servers.
A sticks-and-stones workflow for personal mailboxes is currently:
    lei import imaps://$MY_PERSONAL_INBOX
    lei q --mua=$MUA -o /tmp/results SEARCH TERMS...
    # do stuff from within $MUA to /tmp/results
    lei import /tmp/results # read keyword changes from MUA
    lei export-kw imaps://$MY_PERSONAL_INBOX
    # repeat when new stuff shows up in personal inbox
The next goal is to automate repeated imports + export-kw
commands with inotify and IMAP IDLE.
Eric Wong [Sat, 29 May 2021 20:20:39 +0000 (20:20 +0000)]
lei q: --sort and --save|v2 are incompatible
Saved searches rely on (reverse) docid ordering for efficient
incremental results, and sorting any other way prevents that.
Update comment description in LeiQuery while we're at it:
"ls-query" and "rm-query" are "ls-search" and "forget-search",
respectively, and "mv-query" is implicit with "edit-search"
Eric Wong [Sat, 29 May 2021 20:20:38 +0000 (20:20 +0000)]
lei import|lcat: improve+fix single message IMAP support
lcat can now dump the memoized contents of entire IMAP folders,
not just a single UID. It's now parallelized and pipelined for
multiple lei2mail workers.
Furthermore, various forms of JSON output work consistently
with blob-only output, now.
While working on this, I noticed NetReader was passing UID URLs
to imap_each callbacks, which was causing mail_sync.sqlite3 to
store UIDs in `folders'. That was clearly wrong, so it's now fixed.
Eric Wong [Fri, 28 May 2021 22:37:21 +0000 (22:37 +0000)]
lei q|up: support v2:/path/to/inboxdir destination
This allows "lei-managed pseudo mailing lists" as described
by Konstantin.
Alternates use is optional and can be enabled via --shared.
This doesn't manage or edit ~/.public-inbox/config; presumably
there'll need to be some tweaking of search parameters before
finalizing and making the inbox publicly accessible via HTTP/NNTP.
Eric Wong [Fri, 28 May 2021 19:47:23 +0000 (19:47 +0000)]
lei: retry_reopen on read-only Xapian access
Xapian DBs may be modified by a parallel process while we're
reading them, and Xapian's MVCC model places the burden on readers
to retry operations.
We'll also have retry_reopen croak instead of die on errors,
which ought to help us track down some "Document not found"
errors I've occasionally seen when using "lei <q|up>".
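The reader-side retry pattern described above can be sketched as
follows. This is a Python illustration with stand-in types, not the
Perl retry_reopen: the `DatabaseModifiedError` class defined here, the
`max_tries` limit, and the `db` object with a `reopen()` method are
assumptions chosen to keep the example self-contained.

```python
class DatabaseModifiedError(Exception):
    """Stand-in for the error a reader sees when a writer changed the DB."""


def retry_reopen(db, operation, max_tries: int = 10):
    for _ in range(max_tries):
        try:
            return operation(db)
        except DatabaseModifiedError:
            # Our snapshot is stale; reopen to see the latest revision
            # and try the operation again.
            db.reopen()
    raise RuntimeError(f"gave up after {max_tries} reopen attempts")
```

Any other exception propagates to the caller untouched, so unrelated
failures are not masked by the retry loop.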
Eric Wong [Fri, 28 May 2021 00:07:54 +0000 (00:07 +0000)]
lei_mail_sync: debug code for uncommitted txn
I'm not 100% sure why, but "lei up" seems to cause uncommitted
transaction errors. LeiToMail calls sto->set_sync_info, but
LeiXSearch should call sto->done and lms_commit, so I'm not
sure where the uncommitted transaction is coming from...
Eric Wong [Fri, 28 May 2021 00:07:53 +0000 (00:07 +0000)]
viewdiff: escape '{' and '}' for regexp
Perl 5 doesn't warn on this, yet, but it warns on unescaped
'(' and ')' nowadays, so it's conceivable Perl could start
warning on this in the future. So future-proof our code and
reduce reader confusion.
Eric Wong [Fri, 28 May 2021 00:07:52 +0000 (00:07 +0000)]
viewdiff: make $UNSAFE a variable
There's no sense in using a constant here since it
gets copied into the uri_escape_utf8 function anyways.
Furthermore, inlined constants still leave behind a
subroutine and subs cost several KB of memory.
Finally, add a comment as to why it's different than the default
escape, since I just spent a minute wondering that.
Eric Wong [Wed, 26 May 2021 18:08:57 +0000 (18:08 +0000)]
lei: require Socket::MsgHdr or Inline::C, drop oneshot
The cost of supporting separate code paths between oneshot and
daemon isn't worth the trouble; especially if there are more
users to support. The test suite time nearly doubles with
oneshot, so that's hurting developer productivity.
FD passing is currently required to work efficiently with
remote HTTP(S) queries which return large messages, as seen in
commit 708b182a57373172f5523f3dc297659d58e03b58
("ipc: wq: handle >MAX_ARG_STRLEN && <EMSGSIZE case").
Additionally, upcoming support for IMAP IDLE and inotify-based
monitoring of Maildirs cannot work properly without a background
daemon.
Eric Wong [Tue, 25 May 2021 22:20:01 +0000 (22:20 +0000)]
ipc: wq: handle >MAX_ARG_STRLEN && <EMSGSIZE case
WQWorkers are limited roughly to MAX_ARG_STRLEN (the kernel
limit of argv + environ) to avoid excessive memory growth.
Occasionally, we need to send larger messages via workqueues
that are too small to hit EMSGSIZE on the sender.
This fixes "lei q" when using HTTP(S) externals, since that
code path sends large Eml objects from lei_xsearch workers
directly to lei2mail WQ workers.
Eric Wong [Tue, 25 May 2021 22:20:00 +0000 (22:20 +0000)]
ipc: avoid potential stack-not-refcounted bug
This fixes a potential problem with Carp::longmess
firing somewhere deeper in the stack. This is not a known
problem at this time, but something I noticed while chasing
something else.
Eric Wong [Tue, 25 May 2021 11:01:36 +0000 (11:01 +0000)]
lei forget-mail-sync: new command to drop sync information
Sometimes a user stops caring to sync an IMAP or Maildir
folder, or wants to force a resync. Let them run this
command to have lei forget all the sync information about
the mail folder.
This won't delete any stored messages in git, but will
leave "lei index" users with dangling references.
Eric Wong [Sun, 23 May 2021 21:36:50 +0000 (21:36 +0000)]
lei inspect: use LeiMailSync->match_imap_url
Move match_imap_url into LeiMailSync so it can be used in more
places, such as "lei inspect". Upcoming commands such as
"lei forget-mail-sync" and {add,forget,pause,resume}-watch will
also support relaxed IMAP matching rules since there's
no reasonable way to expect users to use ";UIDVALIDITY=" on the
command-line.
Eric Wong [Sun, 23 May 2021 08:01:16 +0000 (08:01 +0000)]
lei <q|up>: set \Recent on non-empty mbox and Maildir
Despite JMAP not supporting the equivalent of the IMAP \Recent
flag, it is useful for "lei q --augment", and "lei up" users to
be able to distinguish new results from old-but-unread messages
in an mbox or Maildir.
For mbox family messages, we'll drop the "O" status flag when
appending to mboxes, and we'll write to the "new" subdirectory
of Maildirs.
Behavior when writing to initially empty Maildirs and mboxes
remains unchanged since there's no need to distinguish between
new and old results in the initial case. Having users wait
for a rename(2) storm or complete mbox rewrite hurts UX.
With IMAP mailboxes, \Recent is already enforced by the IMAP
server and IMAP clients have no way of changing it(*)
(*) mutt uses the "Old" IMAP flag which isn't part of RFC 3501,
other MUAs may do similar things.
Eric Wong [Sun, 23 May 2021 01:38:28 +0000 (01:38 +0000)]
lei export-kw: relax IMAP URL matching
It's unreasonable to expect UIDVALIDITY= to be specified in
command-line arguments. We'll also check for cases without
"$USER@" or ";AUTH=", since we accept those forms on the
command-line.
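A rough Python sketch of what relaxed matching could mean: scheme, host
and folder must agree, while "$USER@", ";AUTH=" and ";UIDVALIDITY=" are
only compared when the command-line argument actually supplies them.
The function names and the exact rules here are assumptions for
illustration and may not mirror the real matching logic.

```python
import re
from urllib.parse import urlsplit

_UIDVALIDITY = re.compile(r";UIDVALIDITY=(\d+)", re.IGNORECASE)


def split_imap(url: str) -> dict:
    parts = urlsplit(url)
    userinfo, _, hostport = parts.netloc.rpartition("@")
    user, _, auth = userinfo.partition(";")  # auth looks like "AUTH=PLAIN"
    uidv = _UIDVALIDITY.search(parts.path)
    return {
        "scheme": parts.scheme,
        "host": hostport.lower(),
        "path": _UIDVALIDITY.sub("", parts.path),
        "user": user or None,
        "auth": auth or None,
        "uidvalidity": uidv.group(1) if uidv else None,
    }


def relaxed_match(stored: str, given: str) -> bool:
    s, g = split_imap(stored), split_imap(given)
    # These parts must always agree.
    if any(s[k] != g[k] for k in ("scheme", "host", "path")):
        return False
    # Optional parts only count when given on the command-line.
    return all(g[k] is None or g[k] == s[k]
               for k in ("user", "auth", "uidvalidity"))
```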