Sergey Matveev's repositories - public-inbox.git/log
public-inbox.git
Eric Wong [Tue, 1 Sep 2020 01:15:00 +0000 (01:15 +0000)]
edit+purge: support `--help' and `-h' like other commands

And while we're at it, note edit is *destructive* to encourage
reading the fine manual.

Eric Wong [Tue, 1 Sep 2020 01:14:59 +0000 (01:14 +0000)]
admin: improve minimum version text

"inboxes 1 inboxes not supported by ..." was non-sensical.
Now it'll show "-V1 inbox not supported by ...", instead.

Eric Wong [Tue, 1 Sep 2020 01:14:58 +0000 (01:14 +0000)]
script/*: set executable bit on -learn and -imapd

It's useful to mark that they're meant to be executable, even
if the shebang is useless.

Eric Wong [Tue, 1 Sep 2020 05:55:45 +0000 (05:55 +0000)]
t/v2dupindex: test indexing mirrors with duplicate messages

While it's not a known problem, our deduplicating logic may
change in the future; or a BOFH could be manually injecting
duplicate messages directly into the git epoch repositories.

Ensure indexing in mirrors doesn't break when there are
duplicates.  This is in preparation for detached indices
for multi-inbox search.

Eric Wong [Tue, 1 Sep 2020 16:54:31 +0000 (16:54 +0000)]
index: check for xapian-compact when using --compact

Otherwise, users may be frustrated to discover it missing
after a long indexing run.

Eric Wong [Mon, 31 Aug 2020 04:41:40 +0000 (04:41 +0000)]
replace ParentPipe with EOFpipe

ParentPipe was a subset of EOFpipe, except EOFpipe correctly
accounts for theoretical(*) spurious wakeups on the pipe.

(*) AFAIK, spurious wakeups are/were more likely on TCP sockets
    due to checksum failures, something that's not a problem on
    local pipes.  We're also not sharing pipes like we do with
    listen sockets on accept(2), so there's no chance of another
    process grabbing bytes (unless we have bugs in our code).
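
The EOF-on-pipe idea described above can be sketched in Python (not the
project's Perl).  This is only an illustration under assumptions: the names
`wait_for_eof' and `run_and_wait' are hypothetical and are not taken from
EOFpipe itself.  The child inherits the write end of a pipe; the parent treats
only an empty read as "done", so a wakeup with nothing to read is ignored.

```python
import os
import select
import subprocess
import sys


def wait_for_eof(rfd):
    """Block until every holder of the pipe's write end has closed it."""
    while True:
        select.select([rfd], [], [])
        try:
            data = os.read(rfd, 1)
        except BlockingIOError:
            continue  # spurious wakeup: readable was reported, nothing there
        if data == b"":
            return  # real EOF
        # Any stray bytes are ignored; keep waiting for EOF.


def run_and_wait(argv):
    """Spawn argv (hypothetical helper) and reap it once the pipe hits EOF."""
    rfd, wfd = os.pipe()
    os.set_blocking(rfd, False)
    child = subprocess.Popen(argv, pass_fds=(wfd,))
    os.close(wfd)  # the parent must drop its copy, or EOF never arrives
    try:
        wait_for_eof(rfd)
    finally:
        os.close(rfd)
    # EOF normally means the child exited (it could also have closed the
    # descriptor early), so this wait should return promptly.
    return child.wait()


if __name__ == "__main__":
    print(run_and_wait([sys.executable, "-c", "pass"]))
```

This sketch is POSIX-only, since it relies on descriptor inheritance via
`pass_fds'.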

Eric Wong [Mon, 31 Aug 2020 04:41:39 +0000 (04:41 +0000)]
ds: avoid unnecessary timer for waitpid

It doesn't seem necessary, since we won't call dwaitpid()
until we see an EOF.

Eric Wong [Mon, 31 Aug 2020 04:41:38 +0000 (04:41 +0000)]
watch: use EOFpipe to reduce dwaitpid wakeups

It's a bit inefficient to use a pipe, here.  However, using
dwaitpid() on a process that's not expected to exit soon is
also inefficient as it causes excessive wakeups as most of
our inbox-writing code expects synchronous waitpid().

This only affects -watch instances configured for NNTP and IMAP
clients.

Eric Wong [Mon, 31 Aug 2020 04:41:37 +0000 (04:41 +0000)]
ds: avoid excessive queueing when reaping PIDs

We should not enqueue reap_pids() to run more than once per
EventLoop iteration.  We'll start reformatting reap_pids
to tabs, too, since we're no longer Danga::Socket.

We should also be able to remove timer usage for reaping
down-the-line once we stop abusing dwaitpid() in -watch.
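
A minimal Python sketch of "at most one reap per event loop iteration",
assuming a simple next-tick queue.  `TinyLoop' and its attribute names are
hypothetical and only illustrate the guard flag; the real code is Perl and
differs.

```python
class TinyLoop:
    """Hypothetical event loop with a next-tick queue and a reap guard."""

    def __init__(self):
        self.next_tick = []      # callbacks to run on the next iteration
        self.reap_armed = False  # True while reap_pids is already queued
        self.reap_runs = 0       # counts how often reaping actually ran

    def enqueue_reap(self):
        # Queue reap_pids only if it is not already pending.
        if self.reap_armed:
            return
        self.reap_armed = True
        self.next_tick.append(self.reap_pids)

    def reap_pids(self):
        self.reap_armed = False  # allow re-arming for later iterations
        self.reap_runs += 1      # a real loop would call waitpid() here

    def run_once(self):
        queue, self.next_tick = self.next_tick, []
        for callback in queue:
            callback()


loop = TinyLoop()
for _ in range(3):       # three requests arrive in the same iteration
    loop.enqueue_reap()
loop.run_once()
print(loop.reap_runs)    # reaping ran once, not three times
```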

Eric Wong [Mon, 31 Aug 2020 04:41:36 +0000 (04:41 +0000)]
watch: comments and tiny cleanups

Get rid of an unused variable, prefix a warning and try to
better document control flow around various callbacks.

Eric Wong [Mon, 31 Aug 2020 04:41:35 +0000 (04:41 +0000)]
watch: block signals before fork on non-signalfd/kevent systems

In case there are non-Linux or BSD users w/o IO::KQueue, we
shouldn't let signal handlers fire in the child processes.

The child processes always assumed signals were blocked by
the parent, so no changes were necessary, there.

Eric Wong [Mon, 31 Aug 2020 04:41:34 +0000 (04:41 +0000)]
watch: avoid unnecessary spawning on spam removals

This should further mitigate lock contention problems
when -watch is configured to watch on a Maildir for spam
while performing a large NNTP import.

There is now a small risk a message won't get removed if
it's in the current (uncommitted) fast-import batch, but that's
unlikely given the batch size is now only 10 messages.

If that small window is hit, flipping the \Seen flag
(e.g. marking it unread, and then read again) will trigger
another removal attempt via IMAP or Maildir.

Eric Wong [Mon, 31 Aug 2020 04:41:33 +0000 (04:41 +0000)]
watch: log signal activities to STDERR

Sometimes it may not be apparent when/if a signal is
processed; this hopefully improves the situation.

We'll also change the process title when we're quitting
to better inform users.

Eric Wong [Mon, 31 Aug 2020 04:41:32 +0000 (04:41 +0000)]
rename WatchMaildir => Watch

This is no longer limited to Maildirs now that IMAP and NNTP
support exists; so give it a shorter name.

Eric Wong [Mon, 31 Aug 2020 04:41:31 +0000 (04:41 +0000)]
watchmaildir: use v5.10.1, drop warnings

Declare 5.10.1 to avoid potential compatibility problems with
Perl 7/8 down the line.  We'll rely on the command-line to set
or drop warnings during development, at least.

Eric Wong [Mon, 31 Aug 2020 04:41:30 +0000 (04:41 +0000)]
watch: limit batch size of NNTP and IMAP workers, too

We don't want to monopolize locks because processes can easily
block each other if using `watchspam' on a Maildir while a big
NNTP or IMAP import is happening.

This can also happen if somebody configured a single inbox to
watch from several sources to merge several mailboxes into one
(e.g. both an IMAP and Maildir are watched).

Eric Wong [Mon, 31 Aug 2020 04:33:37 +0000 (04:33 +0000)]
doc: expand on indexBatchSize regarding fragmentation

And change the documentation reference in -tuning to
point to the -index manpage while we're at it.

Eric Wong [Sat, 29 Aug 2020 20:32:19 +0000 (20:32 +0000)]
imapd: filter out unusable flags from search

Quiet down logs from -imapd when clients are blindly
sending some unsupported flag conditions (e.g. "DRAFT",
"DELETED") specified in RFC 3501.

Eric Wong [Sat, 29 Aug 2020 03:48:39 +0000 (03:48 +0000)]
tests: check-run: fixup un-squashed simplification

Link: https://public-inbox.org/meta/20200828221803.GA89978@dcvr/

Eric Wong [Fri, 28 Aug 2020 10:13:00 +0000 (10:13 +0000)]
tests: check-run: show skipped tests

We'll deduplicate redundant lines and show counts of skipped
tests to ensure it's easy to notice if something is unexpectedly
skipped.

Eric Wong [Fri, 28 Aug 2020 10:12:59 +0000 (10:12 +0000)]
imaptracker: update_last: simplify callers

By making it a no-op if last_uid is not defined.  This isn't a
hot code path, so the extra method dispatch isn't an issue.
It'll save some indentation/wrapping in future commits.

Eric Wong [Fri, 28 Aug 2020 10:12:58 +0000 (10:12 +0000)]
watch: flush changes to inbox before updating IMAPTracker

Data needs to hit inboxes, first.  Otherwise it's possible to
skip messages in case git-fast-import is killed before it sees
"done\n".  Now, -watch will just waste a little bandwidth in
re-downloading a seen message if it's interrupted immediately
before updating IMAPTracker.
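
The ordering described above (make the data durable first, record progress
second) can be sketched in Python.  Everything named here is a hypothetical
stand-in: `RecordingInbox', `RecordingTracker' and `import_batch' only record
the call order and are not the project's classes.

```python
class RecordingInbox:
    """Hypothetical inbox writer that logs the order of operations."""

    def __init__(self, log):
        self.log = log

    def add(self, raw):
        self.log.append("add")

    def flush(self):
        self.log.append("flush")  # real code would commit the import here


class RecordingTracker:
    """Hypothetical tracker that remembers the last UID processed."""

    def __init__(self, log):
        self.log = log
        self.last_uid = None

    def update_last(self, uid):
        if uid is None:  # nothing was fetched, so this is a no-op
            return
        self.log.append("update_last")
        self.last_uid = uid


def import_batch(messages, inbox, tracker):
    """Write all messages, flush, and only then record the last UID."""
    last_uid = None
    for uid, raw in messages:
        inbox.add(raw)
        last_uid = uid
    inbox.flush()                  # data must hit the inbox first
    tracker.update_last(last_uid)  # a crash before this line re-downloads


log = []
tracker = RecordingTracker(log)
import_batch([(7, b"one"), (8, b"two")], RecordingInbox(log), tracker)
print(log, tracker.last_uid)
```

If the process dies between the flush and the tracker update, the worst case
is fetching an already-seen message again, which matches the trade-off stated
in the commit message.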

Eric Wong [Fri, 28 Aug 2020 04:22:00 +0000 (04:22 +0000)]
Makefile.PL: run check-man for <= 80 columns on check-run, too

I mostly use "make check-run" instead of the slower "make check"
target, nowadays, so add this check to ensure the rendered
manpage is always visible to more users who need big fonts.

Eric Wong [Thu, 27 Aug 2020 22:05:00 +0000 (22:05 +0000)]
www: more descriptive pagination

Being an easily confused person, I find "next" and "prev"
ambiguous as to whether messages on the next or previous page
will be newer or older than the current page.  Clarify that for
the threaded /$INBOX/ view and search results.

For search results sorted by relevance, we'll use "[>= $SCORE]"
or "[<= $SCORE]" to indicate directionality.

This also fixes $INBOX/new.html for unindexed v1 inboxes.

Eric Wong [Thu, 27 Aug 2020 22:04:59 +0000 (22:04 +0000)]
www: improve navigation around contemporary threads

Sometimes it's useful to quickly get to threads and messages
which are contemporaries of the current thread/message being
focused on.  This hopefully improves navigation by:

a) making the top line (where $INBOX_DIR/description is shown)
   a link to the latest topics in search results and
   per-thread/per-message views.

b) providing a link to contemporaries ("~YYYY-MM-DD")
   around the thread overview skeleton area for per-thread
   and per-message views.

Eric Wong [Thu, 27 Aug 2020 12:17:06 +0000 (12:17 +0000)]
doc: watch: expand on NNTP and IMAP-specific knobs

There are a few more, but maybe they're too esoteric
to be worth documenting at the moment (batch sizes, timeouts, etc).

Eric Wong [Thu, 27 Aug 2020 12:17:05 +0000 (12:17 +0000)]
doc: move watch config docs to -watch manpage

The -config manpage is a bit long and the -watch stuff is
isolated from the rest of it while we start documenting NNTP and
IMAP support.

I'm not entirely happy with the way IMAP and NNTP are
configured, but it's still good enough for small setups.

This also fixes a long-standing misplaced comment about
`publicinboxwatch.spamcheck' affecting all configured inboxes;
that comment was actually for `publicinboxwatch.watchspam'.

We'll omit documenting NNTP for `watchspam', for now, given the
lack of \Seen flags in NNTP and I'm not sure if it's even
useful.  There may not be any newsgroups for sharing confirmed
spam, either...

Eric Wong [Thu, 27 Aug 2020 12:17:04 +0000 (12:17 +0000)]
watch: imap: only remove \Seen spam

This matches the behavior of Maildir `watchspam' handling in not
removing unseen messages.  NNTP can't match this behavior, since
NNTP servers don't store flags, clients do.

Eric Wong [Thu, 27 Aug 2020 12:17:03 +0000 (12:17 +0000)]
doc: speling fickses

Eric Wong [Thu, 27 Aug 2020 12:17:02 +0000 (12:17 +0000)]
doc: document graceful shutdown signals

Same as the read-only daemons.

Eric Wong [Thu, 27 Aug 2020 12:17:01 +0000 (12:17 +0000)]
overidx: inline create_ghost sub

There's no need for this to be a separate sub since there's
only a single caller.  This saves a few kilobytes at least
in short-lived processes.

Eric Wong [Thu, 27 Aug 2020 12:17:00 +0000 (12:17 +0000)]
imaptracker: preserve WAL journal_mode if set by user

It's no problem for most users to enable WAL, here, since
there's only a single process doing both reading and writing
(unlike the read-only daemons).  However, WAL doesn't work on
network filesystems, so it can't be enabled by default.

Eric Wong [Thu, 27 Aug 2020 12:16:59 +0000 (12:16 +0000)]
watchmaildir: ensure I:/W:/E: prefixes in warnings

For consistency in output, any URL/path-context-dependent
prefixes should have the same prefix as the actual warning which
triggered it.

Eric Wong [Thu, 27 Aug 2020 07:51:25 +0000 (07:51 +0000)]
git: show more context info on failures

I'm seeing "read: Connection timed out" in my syslog from
-httpd.  The fail() calls in PublicInbox::Git seem to be the
only code path of ours which could trigger it...

ETIMEDOUT shouldn't happen on pipes, only sockets; and all of
our socket operations are non-blocking.  So this could be
cgit-wwwhighlight-filter.lua, but that's connecting over
localhost, though on fairly loaded HW.

Eric Wong [Wed, 26 Aug 2020 22:02:57 +0000 (22:02 +0000)]
search: allow testing with current xapian.git and 1.5.x

A `PI_XAPIAN' environment variable is now exposed for testing
purposes.  We'll also deal with the removal of
`NumberValueRangeProcessor' and use `NumberRangeProcessor'
in its place, but continue favoring the old Search::Xapian
since that's all that's packaged for Debian 10.x stable.

Eric Wong [Wed, 26 Aug 2020 08:17:42 +0000 (08:17 +0000)]
msgmap: use v5.10.1

We use the defined-or (`//', `//=') operators in 5.10,
so require 5.10.1 like the rest of our codebase.  Update
an outdated comment while we're at it.

Eric Wong [Wed, 26 Aug 2020 08:17:41 +0000 (08:17 +0000)]
over*: use v5.10.1, drop warnings

v5.10.1 lets us use the lighter parent.pm instead of base.pm,
and we'll rely on the shebang to enable warnings (or not).

While we're in the area, drop a no-longer-necessary import for
PublicInbox::Search, since OverIdx doesn't require search.

Eric Wong [Wed, 26 Aug 2020 08:17:40 +0000 (08:17 +0000)]
over: recent: remove expensive COUNT query

As noted in commit 87dca6d8d5988c5eb54019cca342450b0b7dd6b7
("www: rework query responses to avoid COUNT in SQLite"),
COUNT on many rows is expensive on big SQLite DBs.

We've already stopped using that code path long ago in WWW
while -imapd and -nntpd never used it.  So we'll adjust our
remaining test cases to not need it, either.

Eric Wong [Wed, 26 Aug 2020 08:17:39 +0000 (08:17 +0000)]
over: rename ->disconnect to ->dbh_close

Since we got rid of over->connect, `disconnect' no longer pairs
with it.  So name it after the `close(2)' syscall it ultimately
issues.

Eric Wong [Wed, 26 Aug 2020 08:17:38 +0000 (08:17 +0000)]
over: rename ->connect method to ->dbh

`->connect' is confused with the perlfunc for the `connect(2)'
syscall, and also `DBI->connect'.  Since SQLite doesn't use
sockets, the word "connect" needlessly confuses me.  Give
it a short name to match the field name we use for it, which
also matches the variable name used by the DBI(3pm) and
DBD::SQLite(3pm) manpages.

Eric Wong [Tue, 25 Aug 2020 20:26:24 +0000 (20:26 +0000)]
v2writable: compatibility with SWIG Xapian binding

The SWIG binding won't auto-convert IV/UV to PV like the XS
Search::Xapian binding would, so work around that shortcoming
for now.

Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex")

Eric Wong [Tue, 25 Aug 2020 10:23:14 +0000 (10:23 +0000)]
grok-pull.post_update_hook: flock(2) before SQLite check

Unlike DBD::SQLite, the sqlite3(1) CLI does not have a default
busy timeout enabled, so it easily times out while acquiring a
SHARED lock for read-only queries.  We can avoid battery-wasting
polling from the SQLite timeout handler by relying on flock(2)
as we do in our Perl code.

Furthermore, this avoids triggering some locking problems[1]
from a long "SELECT COUNT(*) ..." query and reindex.

While there may be other SQLite-related parallelism issues[1],
this works around one of them by relying on flock(2).

[1] https://public-inbox.org/meta/20200825001204.GA840@dcvr/

Eric Wong [Tue, 25 Aug 2020 03:02:47 +0000 (03:02 +0000)]
over+msgmap: respect WAL journal_mode if set

WAL actually seems to have ideal locking characteristics given
concurrency problems I'm experiencing with --reindex running
in parallel with expensive read-only SQLite queries:
<https://public-inbox.org/meta/20200825001204.GA840@dcvr/>

Unfortunately, we cannot blindly use WAL while preserving
compatibility with existing setups nor our guarantees that
read-only daemons are indeed "read-only".

However, respect a user's choice to set WAL on their
own if they're comfortable with giving -nntpd/-httpd/-imapd
processes write permission to the directory storing SQLite DBs.
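
One way to "respect WAL if set" is to read the current journal_mode before
touching it.  The Python sketch below is an assumption-laden illustration, not
the project's Perl: the helper names are hypothetical, and falling back to
TRUNCATE when WAL is not set is a guess made for the example.

```python
import os
import sqlite3
import tempfile


def journal_mode(dbh):
    """Return the connection's current journal mode in lower case."""
    return dbh.execute("PRAGMA journal_mode").fetchone()[0].lower()


def open_respecting_wal(path):
    """Open path; leave WAL alone if an admin already enabled it."""
    dbh = sqlite3.connect(path)
    if journal_mode(dbh) != "wal":
        # Assumed default for this sketch only.
        dbh.execute("PRAGMA journal_mode = TRUNCATE")
    return dbh


# Demo: WAL is stored in the database file, so it survives a reopen.
tmpdir = tempfile.mkdtemp()
wal_path = os.path.join(tmpdir, "wal.sqlite3")
admin = sqlite3.connect(wal_path)
admin.execute("PRAGMA journal_mode = WAL")
admin.execute("CREATE TABLE t (x)")
admin.commit()
admin.close()

dbh = open_respecting_wal(wal_path)
print(journal_mode(dbh))
dbh.close()
```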

Eric Wong [Tue, 25 Aug 2020 03:02:46 +0000 (03:02 +0000)]
msgmap: use "CREATE TABLE IF NOT EXISTS"

It's fewer queries and matches what we do in OverIdx.
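
For illustration, a short Python sketch of why "CREATE TABLE IF NOT EXISTS"
needs fewer queries: there is no separate existence check, and repeating the
statement is harmless.  The table name and columns are hypothetical and are
not the real msgmap schema.

```python
import sqlite3


def ensure_table(dbh):
    """Create the demo table if missing; safe to call repeatedly."""
    # Hypothetical schema, for illustration only.
    dbh.execute(
        "CREATE TABLE IF NOT EXISTS demo_map ("
        "num INTEGER PRIMARY KEY AUTOINCREMENT, "
        "mid TEXT UNIQUE NOT NULL)"
    )


dbh = sqlite3.connect(":memory:")
ensure_table(dbh)
ensure_table(dbh)  # second call neither fails nor alters the table
```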

Eric Wong [Tue, 25 Aug 2020 03:02:45 +0000 (03:02 +0000)]
over: skip nodatacow on the journal

This file gets truncated anyhow, so it won't fragment.

Eric Wong [Tue, 25 Aug 2020 10:51:29 +0000 (10:51 +0000)]
doc: 1.6.0 release notes update

A few more things happened, here.

Eric Wong [Tue, 25 Aug 2020 10:51:20 +0000 (10:51 +0000)]
doc: add some more tuning notes

I've learned a thing or three about btrfs in the past few
weeks and remembered some old HDD things, too.

The Xapian MultiDatabase problem will need to be addressed
for 1.7...

Eric Wong [Sun, 23 Aug 2020 21:00:27 +0000 (21:00 +0000)]
searchidx: croak for Xapian DB open failure

croak() can give more context on the failure, and setting
`PERL5OPT=-MCarp=verbose' can force a stacktrace.

Eric Wong [Sun, 23 Aug 2020 07:49:18 +0000 (07:49 +0000)]
examples: add imapd systemd examples

We've got examples for all the other daemons, too!

Eric Wong [Sat, 22 Aug 2020 19:51:36 +0000 (19:51 +0000)]
index: --sequential-shard checkpoints after each shard

There's no reason we'd want Xapian to defer flushing once we've
indexed everything belonging to a particular shard.

Eric Wong [Sat, 22 Aug 2020 06:06:27 +0000 (06:06 +0000)]
mbox: disable "&t" on existing Xapian until full reindex

Expanding threads via over.sqlite3 for mbox.gz downloads without
Xapian effectively collapsing on the THREADID column leads to
repeated messages getting downloaded.

To avoid that situation, use a "has_threadid" Xapian metadata
flag that's only set on --reindex (and brand new Xapian DBs).

This allows admins to upgrade WWW or do --reindex in any order;
without worrying about users eating up bandwidth and CPU cycles.

Eric Wong [Sat, 22 Aug 2020 06:06:26 +0000 (06:06 +0000)]
search: support downloading mboxes results with full thread

Finally, the addition of THREADID for collapsing results
in Xapian lets us emulate the "mairix --threads" feature.
That is, instead of returning only the matching messages,
the entire thread is included in the downloaded mbox.gz.

This requires a "public-inbox-index --reindex" to be usable.

Eric Wong [Sat, 22 Aug 2020 06:06:25 +0000 (06:06 +0000)]
searchidx: index THREADID in Xapian

This is the `tid' column from over.sqlite3; and will be used for
IMAP and JMAP search (among other things).

Eric Wong [Sat, 22 Aug 2020 06:06:24 +0000 (06:06 +0000)]
searchidx: put all shard-related stuff in SearchIdxShard.pm

We'll also rename the /^remote_/ prefix to "shard_", since
remote implies the process is on a different host.  These
methods only pass messages to a child process on the same host
OR perform operations within the same process.

Eric Wong [Sat, 22 Aug 2020 06:06:23 +0000 (06:06 +0000)]
searchidxshard: clear $msgref buffer properly

Merely assigning `undef' to a scalar does not free the
underlying buffer memory of a scalar.

Eric Wong [Sat, 22 Aug 2020 00:41:25 +0000 (00:41 +0000)]
searchview: fix mbox.gz downloads for lynx users

Unlike w3m and links, the lynx browser seems to require a `name'
attribute for `<input type=submit>' elements.  Maybe some other
browsers do, too.  The `name' attribute for submit elements
doesn't seem to cause any harm for w3m or links users, either;
despite not (AFAIK) being part of historical or current HTML
specs.

Eric Wong [Thu, 20 Aug 2020 20:24:57 +0000 (20:24 +0000)]
search: add mset_to_artnums method

We can avoid importing mdocid() in several places by using
this method, simplifying callers.

Eric Wong [Thu, 20 Aug 2020 20:24:56 +0000 (20:24 +0000)]
init+index: support --skip-docdata for Xapian

Since we no longer read document data from Xapian, allow users
to opt-out of storing it.

This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).

Eric Wong [Thu, 20 Aug 2020 20:24:55 +0000 (20:24 +0000)]
t/nntpd-v2: set PI_TEST_VERSION=2 properly

Numbers are hard :<

Eric Wong [Thu, 20 Aug 2020 20:24:54 +0000 (20:24 +0000)]
smsg: remove from_mitem

We no longer read docdata.glass from anywhere in our code base.

Some adjustments were needed to t/search.t to deal with the
Xapian::WritableDatabase committing at different times, since
our ->query is avoided from PublicInbox::SearchIdx to avoid
needing a {over_ro} field.

Eric Wong [Thu, 20 Aug 2020 20:24:53 +0000 (20:24 +0000)]
mbox: avoid Xapian docdata in search results

Another place where we can reduce kernel page cache overhead
by hitting over.sqlite3 instead of docdata.glass.

Eric Wong [Thu, 20 Aug 2020 20:24:52 +0000 (20:24 +0000)]
extmsg: avoid using Xapian docdata

Once again, over.sqlite3 contains everything necessary for
Message-ID resolution.  Also, Xapian may be completely
unnecessary with the advent of over.sqlite3, but that's for
another time.

Eric Wong [Thu, 20 Aug 2020 20:24:51 +0000 (20:24 +0000)]
searchview: convert nested and Atom display to over.sqlite3

git blob retrieval dominates on these; "&x=t" (nested) is
roughly the same due to increased overhead for ->get_percent
storage balancing out the mass-loading from SQLite.

Atom "&x=A" is sped up slightly and uses less memory in the
long-lived response.

Eric Wong [Thu, 20 Aug 2020 20:24:50 +0000 (20:24 +0000)]
searchview: speed up search summary by ~10%

Instead of loading one article at-a-time from over.sqlite3, we
can use SQL to mass-load IN (?,?, ...) all results with a single
SQLite query.  Despite SQLite being in-process and having no
network latency, the reduction in SQL query executions from
loading multiple rows at once speeds things up significantly.

We'll keep the over->get_art optimizations from the previous
commit, since it still speeds up long-lived responses, slightly.
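
The "IN (?,?, ...)" mass-load described above can be sketched with Python's
sqlite3 module.  This is a sketch under assumptions, not the project's query:
the function name and the two-column table are hypothetical, and SQLite caps
the number of bound parameters per statement, so very large result sets would
need chunking.

```python
import sqlite3


def load_many(dbh, nums):
    """Fetch every row whose `num' is in nums with a single query."""
    nums = list(nums)
    if not nums:
        return {}
    placeholders = ",".join("?" * len(nums))  # e.g. "?,?,?"
    sql = f"SELECT num, subject FROM over WHERE num IN ({placeholders})"
    return {num: subject for num, subject in dbh.execute(sql, nums)}


# Hypothetical two-column table standing in for the real schema.
dbh = sqlite3.connect(":memory:")
dbh.execute("CREATE TABLE over (num INTEGER PRIMARY KEY, subject TEXT)")
dbh.executemany("INSERT INTO over VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])
print(load_many(dbh, [1, 3]))
```

Returning a dict keyed by article number lets the caller restore whatever
order the search results came in, since SQL gives no ordering guarantee for
an IN list.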

Eric Wong [Thu, 20 Aug 2020 20:24:49 +0000 (20:24 +0000)]
searchview: use over.sqlite3 instead of Xapian docdata

This is a step towards improving kernel page cache hit rates by
relying on over.sqlite3 for document data instead of Xapian.
Some micro-optimization to over->get_art was required to
maintain performance.

Eric Wong [Thu, 20 Aug 2020 20:24:48 +0000 (20:24 +0000)]
smsg: reduce utf8::decode call sites

Both callers of load_from_data call utf8::decode, so just
do utf8::decode in load_from_data.

Eric Wong [Thu, 20 Aug 2020 20:24:47 +0000 (20:24 +0000)]
search: make qparse_new an internal function

We'll probably be reusing it from another package in a future commit.

Eric Wong [Thu, 20 Aug 2020 20:24:46 +0000 (20:24 +0000)]
searchquery: split off from searchview

Since this was already a separate package, split it off
into its own file since SearchView may not handle inbox
groups.

Eric Wong [Thu, 20 Aug 2020 20:24:45 +0000 (20:24 +0000)]
search: export mdocid subroutine

No need to have awkward globrefs for this.

Eric Wong [Thu, 20 Aug 2020 20:24:44 +0000 (20:24 +0000)]
search: improve comments around constants

We'll probably be adding more value columns like THREADID to sort
on.

Eric Wong [Thu, 20 Aug 2020 20:24:43 +0000 (20:24 +0000)]
www: reduce long-lived PublicInbox::Search references

While this is unlikely to be a problem in current practice,
keeping Xapian DBs open for long responses can interfere with
free space recovery after -compact.

In the future, it will interfere with inbox search grouping
and lead to unexpected results.

Eric Wong [Thu, 20 Aug 2020 20:24:42 +0000 (20:24 +0000)]
xapcmd: simplify {reindex} parameter passing

No need to localize it, here, since we can just refer to it
in the `$opt' hashref.  Hopefully this improves readability
for others like it does for me.

I sometimes wonder if the concept of a stack in high-level
languages is even necessary...

Eric Wong [Thu, 20 Aug 2020 20:24:41 +0000 (20:24 +0000)]
search: v2: ensure shards are numerically sorted

This seems required to correctly get the NNTP article number
from Xapian docid on combined Xapian DBs.  The default
(ASCII-betical) sorting was only acceptable for -imapd users
until somebody hit 11 (or more) shards, which is a rare case.
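
A tiny Python illustration of why the default string sort breaks once shard
names reach two digits.  The shard names here are just "0", "1", ... for the
example, and `sort_shards' is a hypothetical helper.

```python
def sort_shards(names):
    """Sort shard directory names by numeric value, not string order."""
    return sorted(names, key=int)


shards = [str(n) for n in range(12)]  # "0" through "11"
print(sorted(shards))       # string order: "10" and "11" land before "2"
print(sort_shards(shards))  # numeric order: "0", "1", "2", ... "11"
```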

Eric Wong [Thu, 20 Aug 2020 20:24:40 +0000 (20:24 +0000)]
init: drop -N alias for --skip-artnum

It may be too easily confused for --newsgroup or --ng.  This is
too rarely used and never made it into a release, so it should
be fine.

Eric Wong [Thu, 20 Aug 2020 20:24:39 +0000 (20:24 +0000)]
init: support --newsgroup option

We can reduce the need to edit the config file for NNTP group names
this way.

Eric Wong [Thu, 20 Aug 2020 20:24:38 +0000 (20:24 +0000)]
init: support --help and -?

And speed those up with some lazy loading, too.

Eric Wong [Thu, 20 Aug 2020 20:24:37 +0000 (20:24 +0000)]
compact: support --help/-? and perform lazy loading

This probably won't be used much, but --help can still
make sense.

Eric Wong [Thu, 20 Aug 2020 20:24:36 +0000 (20:24 +0000)]
admin: progress shows the inbox being indexed

This is helpful with --all, or when multiple inboxes
are being indexed.

Eric Wong [Thu, 20 Aug 2020 20:24:35 +0000 (20:24 +0000)]
doc: note -compact and -xcpdb are rarely used

Slowly improving the learning curve...

Eric Wong [Tue, 11 Aug 2020 19:52:02 +0000 (19:52 +0000)]
v2writable: show newline after "indexing all of .. " message

Otherwise things get very confusing when verbosity is enabled :x

Eric Wong [Wed, 19 Aug 2020 08:02:33 +0000 (08:02 +0000)]
smsg: handle wide characters in raw mail headers

There may be messages in the wild with wide characters in
headers which aren't RFC2047-encoded.  Assume UTF-8 so
those fields can round trip through over.sqlite3.

This doesn't affect docdata.glass in Xapian, but it does
affect how over.sqlite3 stores the same deflated info.

Eric Wong [Sat, 15 Aug 2020 05:21:02 +0000 (05:21 +0000)]
doc: add public-inbox-tuning(7) manpage

Determining storage device speed and latencies doesn't
seem portable or even possible with the wide variety
of storage layers in use.

This means we need to write a tuning document and hope
users read and improve on it :P

Eric Wong [Thu, 13 Aug 2020 08:04:04 +0000 (08:04 +0000)]
grok-pull.post_update_hook: favor --sequential-shard for HDD

--sequential-shard offers better performance on HDD than -j0
since the on-disk active set can be kept small (with -j $HIGH_NUM).
--batch-size can also be helpful for systems with much RAM.

Eric Wong [Thu, 13 Aug 2020 08:04:37 +0000 (08:04 +0000)]
index|compact|xcpdb: support --all switch

For -index, this is a convenient way to quickly index all
inboxes after a grok-pull.  Might as well support it for
rarely used commands like -compact and -xcpdb, too.

Eric Wong [Wed, 12 Aug 2020 09:17:19 +0000 (09:17 +0000)]
v2writable: remove IdxStack import

We use IdxStack via log2stack() from SearchIdx, now.

Eric Wong [Wed, 12 Aug 2020 09:17:18 +0000 (09:17 +0000)]
xcpdb: wire up new index options and --help

--sequential-shard also disables the copy parallelism (--jobs),
so it can be useful for systems unable to handle parallel random
I/O but which still want many shards.

There was a missing "use strict", too, which is fixed.

Eric Wong [Wed, 12 Aug 2020 09:17:17 +0000 (09:17 +0000)]
admin: don't warn when --jobs exceeds shards

Established tools like make(1), prove(1) and xargs(1) don't warn
when the desired parallelism level can't be met, either.

Eric Wong [Wed, 12 Aug 2020 09:17:16 +0000 (09:17 +0000)]
xapcmd: reduce CPU idling when shards exceeds job count

In case there's unbalanced shards AND we're limiting parallelism
while using many shards, spawn the next task in the queue ASAP
once a task is done, instead of waiting for all tasks to finish
before spawning the next batch.

Unbalanced shards probably aren't a big issue for most users;
however many smaller shards with few jobs can be useful for HDD
users to reduce the effect of random writes.
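
The difference between "wait for the whole batch" and "spawn the next task as
soon as one finishes" can be shown with a small Python simulation.  The task
durations are made-up numbers and both function names are hypothetical; this
models scheduling only and spawns nothing.

```python
import heapq


def makespan_batched(durations, jobs):
    """Total time when each batch of `jobs' tasks must finish together."""
    total = 0
    for i in range(0, len(durations), jobs):
        total += max(durations[i:i + jobs])  # slowest task gates the batch
    return total


def makespan_asap(durations, jobs):
    """Total time when a freed slot immediately takes the next task."""
    running = []  # min-heap of finish times for in-flight tasks
    now = 0
    for d in durations:
        if len(running) == jobs:
            now = heapq.heappop(running)  # earliest finisher frees a slot
        heapq.heappush(running, now + d)
    return max(running) if running else 0


unbalanced = [10, 1, 1, 1]  # one big shard, three small ones
print(makespan_batched(unbalanced, jobs=2))
print(makespan_asap(unbalanced, jobs=2))
```

With one large shard and several small ones, the batched variant leaves a job
slot idle while the large shard finishes; the ASAP variant keeps that slot
busy with the small shards.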

Eric Wong [Wed, 12 Aug 2020 09:17:15 +0000 (09:17 +0000)]
xcpdb: support --no-fsync from CLI

This was omitted in 8b1950055d51d436 :x

Fixes: 8b1950055d51d436 ("index+xcpdb: rename `--no-sync' to `--no-fsync'")

Eric Wong [Wed, 12 Aug 2020 09:17:14 +0000 (09:17 +0000)]
xapcmd: simplify sub reference

We don't need to fully-qualify when referring to subs in
the same namespace, nor do we need to make a SCALAR ref only
to dereference it.

(Yes, still learning Perl :x)

Eric Wong [Mon, 10 Aug 2020 02:12:05 +0000 (02:12 +0000)]
convert: set No_COW on copied SQLite files

We'll use our existing logic and use sqlite_backup_from_file,
which appeared in 1.39 (along with sqlite_backup_to_file).

Eric Wong [Mon, 10 Aug 2020 02:12:04 +0000 (02:12 +0000)]
convert: check ARGV more correctly

Instead of silently ignoring excessive args, don't let a user
specify an extra directory.  Furthermore, we'll support the odd
case where BOFH wants to name an $INBOX_DIR to be `0' :P

Eric Wong [Mon, 10 Aug 2020 02:12:03 +0000 (02:12 +0000)]
convert: speed up --help

Lazy-loading dependencies speeds up --help by several hundred
milliseconds and is a huge step towards user-friendliness.
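
Lazy loading for a fast --help can be sketched in Python as follows.  The tool
name, the usage text and the choice of sqlite3 as the "heavy" dependency are
all placeholders for the example; the actual command is Perl and loads
different modules.

```python
import sys

USAGE = "usage: demo-tool [--help] INBOX_DIR\n"  # hypothetical usage text


def main(argv):
    """Answer --help before importing anything expensive."""
    if "--help" in argv or "-h" in argv:
        sys.stdout.write(USAGE)
        return 0
    # Deferred import: only paid for when real work is requested.
    import sqlite3  # stands in for any slow-to-load dependency
    dbh = sqlite3.connect(":memory:")
    dbh.close()
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```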

Eric Wong [Mon, 10 Aug 2020 02:12:02 +0000 (02:12 +0000)]
convert: support new -index options

Converting v1 inboxes to v2 can be a painful experience
on HDD.  Some of the new options in the CLI or config
file make it less painful.

Eric Wong [Mon, 10 Aug 2020 02:12:01 +0000 (02:12 +0000)]
searchidx: use singular `$opt' for consistency with v2

The rest of our indexing code uses `$opt' instead of `$opts'.

Eric Wong [Mon, 10 Aug 2020 02:12:00 +0000 (02:12 +0000)]
index: cleanup internal variables

Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.

We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.

We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.

Eric Wong [Mon, 10 Aug 2020 02:11:59 +0000 (02:11 +0000)]
admin: use a generic variable name

We parse other options, too, not just --max-size.

Eric Wong [Mon, 10 Aug 2020 02:11:58 +0000 (02:11 +0000)]
avoid File::Temp::tempfile in more places

We can use open(..., undef) natively in Perl in t/import.t

In places where we need a pathname, the File::Temp OO API
gives us auto-unlinking for free.

Eric Wong [Mon, 10 Aug 2020 02:11:57 +0000 (02:11 +0000)]
msgmap: tmp_clone: simplify + meaningful filename

Trying to use the newer ->sqlite_backup_to_dbh method doesn't
seem worth it, as we'll have to support DBD::SQLite <= 1.60
for another decade or more.

Dumping 'msgmap-XXXXXXX' into $INBOX_DIR can appear a bit
confusing to users, so give it a "mm_tmp-$PID-XXXXXXXX" name
to emphasize it's a temporary file tied to a given PID.

We also don't want to penalize read-only daemons with
loading File::Temp, so do it lazily.

Eric Wong [Mon, 10 Aug 2020 02:11:56 +0000 (02:11 +0000)]
index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior

-index now invokes ->DESTROY like xcpdb does, which is necessary
to cleanup $INBOX_DIR/msgmap-XXXXXXX files.  We'll also exit
with the expected values for various signals by adding 128
as described in <https://www.tldp.org/LDP/abs/html/exitcodes.html>

-xcpdb now terminates worker processes and xapian-compact(1)
invocations when prematurely killed, too.
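
The "add 128 to the signal number" convention mentioned above can be sketched
in Python.  `exit_status_for_signal' is a hypothetical helper for the example
and is not part of -index or -xcpdb.

```python
import signal


def exit_status_for_signal(signum):
    """Return the shell-style exit status for death by signal signum."""
    return 128 + int(signum)


# The four signals named in the commit subject.
for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP, signal.SIGPIPE):
    print(sig.name, exit_status_for_signal(sig))
```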