Sergey Matveev's repositories - public-inbox.git/log
public-inbox.git
4 years ago watchmaildir: use v5.10.1, drop warnings
Eric Wong [Mon, 31 Aug 2020 04:41:31 +0000 (04:41 +0000)]
watchmaildir: use v5.10.1, drop warnings

Declare 5.10.1 to avoid potential compatibility problems with
Perl 7/8 down the line.  We'll rely on the command-line to set
or drop warnings during development, at least.

4 years ago watch: limit batch size of NNTP and IMAP workers, too
Eric Wong [Mon, 31 Aug 2020 04:41:30 +0000 (04:41 +0000)]
watch: limit batch size of NNTP and IMAP workers, too

We don't want to monopolize locks because processes can easily
block each other if using `watchspam' on a Maildir while a big
NNTP or IMAP import is happening.

This can also happen if somebody configured a single inbox to
watch from several sources to merge several mailboxes into one
(e.g. both an IMAP and Maildir are watched).

4 years ago doc: expand on indexBatchSize regarding fragmentation
Eric Wong [Mon, 31 Aug 2020 04:33:37 +0000 (04:33 +0000)]
doc: expand on indexBatchSize regarding fragmentation

And change the documentation reference in -tuning to
point to the -index manpage while we're at it.

4 years ago imapd: filter out unusable flags from search
Eric Wong [Sat, 29 Aug 2020 20:32:19 +0000 (20:32 +0000)]
imapd: filter out unusable flags from search

Quiet down logs from -imapd when clients are blindly
sending some unsupported flag conditions (e.g. "DRAFT",
"DELETED") specified in RFC 3501.

4 years ago tests: check-run: fixup un-squashed simplification
Eric Wong [Sat, 29 Aug 2020 03:48:39 +0000 (03:48 +0000)]
tests: check-run: fixup un-squashed simplification

Link: https://public-inbox.org/meta/20200828221803.GA89978@dcvr/

4 years ago tests: check-run: show skipped tests
Eric Wong [Fri, 28 Aug 2020 10:13:00 +0000 (10:13 +0000)]
tests: check-run: show skipped tests

We'll deduplicate redundant lines and show counts of skipped
tests to ensure it's easy to notice if something is unexpectedly
skipped.
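
The deduplicate-and-count idea can be sketched in Python as follows. This is only an illustration, not the project's Perl code; the `summarize_skips` name and the sample messages are hypothetical.

```python
from collections import Counter


def summarize_skips(lines):
    """Collapse repeated skip messages, keeping first-seen order."""
    counts = Counter(lines)  # Counter keeps insertion order on Python 3.7+
    return [f"{msg} (x{n})" if n > 1 else msg for msg, n in counts.items()]


if __name__ == "__main__":
    # Hypothetical skip messages, for illustration only.
    skipped = [
        "skipped: optional module missing",
        "skipped: network tests disabled",
        "skipped: optional module missing",
    ]
    for line in summarize_skips(skipped):
        print(line)
```

Showing a count next to each distinct reason keeps the output short while still making an unexpected skip easy to spot.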

4 years ago imaptracker: update_last: simplify callers
Eric Wong [Fri, 28 Aug 2020 10:12:59 +0000 (10:12 +0000)]
imaptracker: update_last: simplify callers

By making it a no-op if last_uid is not defined.  This isn't a
hot code path, so the extra method dispatch isn't an issue.
It'll save some indentation/wrapping in future commits.

4 years ago watch: flush changes to inbox before updating IMAPTracker
Eric Wong [Fri, 28 Aug 2020 10:12:58 +0000 (10:12 +0000)]
watch: flush changes to inbox before updating IMAPTracker

Data needs to hit inboxes, first.  Otherwise it's possible to
skip messages in case git-fast-import is killed before it sees
"done\n".  Now, -watch will just waste a little bandwidth in
re-downloading a seen message if it's interrupted immediately
before updating IMAPTracker.
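
A minimal Python sketch of the ordering described above: write and flush the data first, record progress second. The `Inbox` and `Tracker` classes are hypothetical in-memory stand-ins, not the real PublicInbox modules.

```python
class Inbox:
    """Hypothetical stand-in for an inbox writer with buffered adds."""

    def __init__(self):
        self.pending, self.stored = [], []

    def add(self, raw):
        self.pending.append(raw)

    def flush(self):
        # Only after flush() is the data considered durable.
        self.stored.extend(self.pending)
        self.pending.clear()


class Tracker:
    """Hypothetical stand-in for the last-seen-UID store."""

    def __init__(self):
        self.last_uid = None

    def update_last(self, uid):
        if uid is None:  # no-op when nothing was fetched
            return
        self.last_uid = uid


def import_batch(messages, inbox, tracker):
    last_uid = None
    for uid, raw in messages:
        inbox.add(raw)
        last_uid = uid
    inbox.flush()                  # 1. data hits the inbox first
    tracker.update_last(last_uid)  # 2. only then record progress
```

If the process dies between the two steps, the tracker still points at the older UID, so the worst case is fetching an already-stored message again rather than skipping one.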

4 years ago Makefile.PL: run check-man for <= 80 columns on check-run, too
Eric Wong [Fri, 28 Aug 2020 04:22:00 +0000 (04:22 +0000)]
Makefile.PL: run check-man for <= 80 columns on check-run, too

I mostly use "make check-run" instead of the slower "make check"
target, nowadays, so add this check to ensure the rendered
manpage is always visible to more users who need big fonts.

4 years ago www: more descriptive pagination
Eric Wong [Thu, 27 Aug 2020 22:05:00 +0000 (22:05 +0000)]
www: more descriptive pagination

Being an easily confused person, I find "next" and "prev"
ambiguous as to whether messages on the next or previous page
will be newer or older than the current page.  Clarify that for
the threaded /$INBOX/ view and search results.

For search results sorted by relevance, we'll use "[>= $SCORE]"
or "[<= $SCORE]" to indicate directionality.

This also fixes $INBOX/new.html for unindexed v1 inboxes.

4 years ago www: improve navigation around contemporary threads
Eric Wong [Thu, 27 Aug 2020 22:04:59 +0000 (22:04 +0000)]
www: improve navigation around contemporary threads

Sometimes it's useful to quickly get to threads and messages
which are contemporaries of the current thread/message being
focused on.  This hopefully improves navigation by:

a) making the top line (where $INBOX_DIR/description is shown)
   a link to the latest topics in search results and
   per-thread/per-message views.

b) providing a link to contemporaries ("~YYYY-MM-DD")
   around the thread overview skeleton area for per-thread
   and per-message views.

4 years ago doc: watch: expand on NNTP and IMAP-specific knobs
Eric Wong [Thu, 27 Aug 2020 12:17:06 +0000 (12:17 +0000)]
doc: watch: expand on NNTP and IMAP-specific knobs

There are a few more, but maybe they're too esoteric
to be worth documenting at the moment (batch sizes, timeouts, etc).

4 years ago doc: move watch config docs to -watch manpage
Eric Wong [Thu, 27 Aug 2020 12:17:05 +0000 (12:17 +0000)]
doc: move watch config docs to -watch manpage

The -config manpage is a bit long and the -watch stuff is
isolated from the rest of it while we start documenting NNTP and
IMAP support.

I'm not entirely happy with the way IMAP and NNTP are
configured, but it's still good enough for small setups.

This also fixes a long-standing misplaced comment about
`publicinboxwatch.spamcheck' affecting all configured inboxes;
that comment was actually for `publicinboxwatch.watchspam'.

We'll omit documenting NNTP for `watchspam', for now, given the
lack of \Seen flags in NNTP and I'm not sure if it's even
useful.  There may not be any newsgroups for sharing confirmed
spam, either...

4 years ago watch: imap: only remove \Seen spam
Eric Wong [Thu, 27 Aug 2020 12:17:04 +0000 (12:17 +0000)]
watch: imap: only remove \Seen spam

This matches the behavior of Maildir `watchspam' handling in not
removing unseen messages.  NNTP can't match this behavior, since
NNTP servers don't store flags, clients do.

4 years ago doc: speling fickses
Eric Wong [Thu, 27 Aug 2020 12:17:03 +0000 (12:17 +0000)]
doc: speling fickses

4 years ago doc: document graceful shutdown signals
Eric Wong [Thu, 27 Aug 2020 12:17:02 +0000 (12:17 +0000)]
doc: document graceful shutdown signals

Same as the read-only daemons.

4 years ago overidx: inline create_ghost sub
Eric Wong [Thu, 27 Aug 2020 12:17:01 +0000 (12:17 +0000)]
overidx: inline create_ghost sub

There's no need for this to be a separate sub since there's
only a single caller.  This saves a few kilobytes at least
in short-lived processes.

4 years ago imaptracker: preserve WAL journal_mode if set by user
Eric Wong [Thu, 27 Aug 2020 12:17:00 +0000 (12:17 +0000)]
imaptracker: preserve WAL journal_mode if set by user

It's no problem for most users to enable WAL, here, since
there's only a single process doing both reading and writing
(unlike the read-only daemons).  However, WAL doesn't work on
network filesystems, so it can't be enabled by default.
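
One way to express "keep WAL if the user already set it" is sketched below in Python. The `open_preserving_wal` name and the TRUNCATE fallback are assumptions of this sketch, not a statement of what the Perl code does.

```python
import os
import sqlite3
import tempfile


def open_preserving_wal(path):
    """Open an SQLite DB, leaving journal_mode alone if it is already WAL."""
    con = sqlite3.connect(path)
    mode = con.execute("PRAGMA journal_mode").fetchone()[0]
    if mode.lower() != "wal":
        # Assumed fallback for this sketch: a rollback journal in TRUNCATE mode.
        con.execute("PRAGMA journal_mode = TRUNCATE")
    return con


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        con = open_preserving_wal(os.path.join(d, "demo.sqlite3"))
        print(con.execute("PRAGMA journal_mode").fetchone()[0])
        con.close()
```

WAL is a persistent property of the database file, so a user who switches it on once with `PRAGMA journal_mode = WAL` keeps it across later opens under this scheme.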

4 years ago watchmaildir: ensure I:/W:/E: prefixes in warnings
Eric Wong [Thu, 27 Aug 2020 12:16:59 +0000 (12:16 +0000)]
watchmaildir: ensure I:/W:/E: prefixes in warnings

For consistency in output, any URL/path-context-dependent
prefixes should have the same prefix as the actual warning which
triggered it.

4 years ago git: show more context info on failures
Eric Wong [Thu, 27 Aug 2020 07:51:25 +0000 (07:51 +0000)]
git: show more context info on failures

I'm seeing "read: Connection timed out" in my syslog from
-httpd.  The fail() calls in PublicInbox::Git seem to be the
only code path of ours which could trigger it...

ETIMEDOUT shouldn't happen on pipes, only sockets; and all of
our socket operations are non-blocking.  So this could be
cgit-wwwhighlight-filter.lua, but that's connecting over
localhost, though on fairly loaded HW.

4 years ago search: allow testing with current xapian.git and 1.5.x
Eric Wong [Wed, 26 Aug 2020 22:02:57 +0000 (22:02 +0000)]
search: allow testing with current xapian.git and 1.5.x

A `PI_XAPIAN' environment variable is now exposed for testing
purposes.  We'll also deal with the removal of
`NumberValueRangeProcessor' and use `NumberRangeProcessor'
in its place, but continue favoring the old Search::Xapian
since that's all that's packaged for Debian 10.x stable.

4 years ago msgmap: use v5.10.1
Eric Wong [Wed, 26 Aug 2020 08:17:42 +0000 (08:17 +0000)]
msgmap: use v5.10.1

We use the defined-or (`//', `//=') operators in 5.10,
so require 5.10.1 like the rest of our codebase.  Update
an outdated comment while we're at it.

4 years ago over*: use v5.10.1, drop warnings
Eric Wong [Wed, 26 Aug 2020 08:17:41 +0000 (08:17 +0000)]
over*: use v5.10.1, drop warnings

v5.10.1 lets us use the lighter parent.pm instead of base.pm,
and we'll rely on the shebang to enable warnings (or not).

While we're in the area, drop a no-longer-necessary import for
PublicInbox::Search, since OverIdx doesn't require search.

4 years ago over: recent: remove expensive COUNT query
Eric Wong [Wed, 26 Aug 2020 08:17:40 +0000 (08:17 +0000)]
over: recent: remove expensive COUNT query

As noted in commit 87dca6d8d5988c5eb54019cca342450b0b7dd6b7
("www: rework query responses to avoid COUNT in SQLite"),
COUNT on many rows is expensive on big SQLite DBs.

We've already stopped using that code path long ago in WWW
while -imapd and -nntpd never used it.  So we'll adjust our
remaining test cases to not need it, either.

4 years ago over: rename ->disconnect to ->dbh_close
Eric Wong [Wed, 26 Aug 2020 08:17:39 +0000 (08:17 +0000)]
over: rename ->disconnect to ->dbh_close

Since we got rid of over->connect, `disconnect' no longer pairs
with it.  So name it after the `close(2)' syscall it ultimately
issues.

4 years ago over: rename ->connect method to ->dbh
Eric Wong [Wed, 26 Aug 2020 08:17:38 +0000 (08:17 +0000)]
over: rename ->connect method to ->dbh

`->connect' is confused with the perlfunc for the `connect(2)'
syscall, and also `DBI->connect'.  Since SQLite doesn't use
sockets, the word "connect" needlessly confuses me.  Give
it a short name to match the field name we use for it, which
also matches the variable name used by the DBI(3pm) and
DBD::SQLite(3pm) manpages.

4 years ago v2writable: compatibility with SWIG Xapian binding
Eric Wong [Tue, 25 Aug 2020 20:26:24 +0000 (20:26 +0000)]
v2writable: compatibility with SWIG Xapian binding

The SWIG binding won't auto-convert IV/UV to PV like the XS
Search::Xapian binding would, so work around that shortcoming
for now.

Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex")

4 years ago grok-pull.post_update_hook: flock(2) before SQLite check
Eric Wong [Tue, 25 Aug 2020 10:23:14 +0000 (10:23 +0000)]
grok-pull.post_update_hook: flock(2) before SQLite check

Unlike DBD::SQLite, the sqlite3(1) CLI does not have a default
busy timeout enabled, so it easily times out while acquiring a
SHARED lock for read-only queries.  We can avoid battery-wasting
polling from the SQLite timeout handler by relying on flock(2)
as we do in our Perl code.

Furthermore, this avoids triggering some locking problems[1]
from a long "SELECT COUNT(*) ..." query and reindex.

While there may be other SQLite-related parallelism issues[1],
this works around one of them by relying on flock(2).

[1] https://public-inbox.org/meta/20200825001204.GA840@dcvr/
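
The same idea, taking flock(2) before touching SQLite, can be sketched in Python. This is an analogy under stated assumptions: the lock file path, the table name and the choice of a shared lock are all hypothetical, and the actual hook is a shell script.

```python
import fcntl
import os
import sqlite3
import tempfile


def query_under_flock(lock_path, db_path, sql):
    """Run one read-only query while holding a shared flock(2)."""
    with open(lock_path, "a") as lock:
        # flock() sleeps in the kernel until the lock is granted,
        # so there is no busy-timeout polling loop.
        fcntl.flock(lock, fcntl.LOCK_SH)
        try:
            con = sqlite3.connect(db_path)
            try:
                return con.execute(sql).fetchone()[0]
            finally:
                con.close()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        db = os.path.join(d, "demo.sqlite3")
        con = sqlite3.connect(db)
        con.execute("CREATE TABLE msg (num INTEGER PRIMARY KEY)")
        con.executemany("INSERT INTO msg VALUES (?)", [(1,), (2,), (3,)])
        con.commit()
        con.close()
        print(query_under_flock(os.path.join(d, "demo.lock"), db,
                                "SELECT COUNT(*) FROM msg"))
```

This only helps if writers take the same lock; it coordinates cooperating processes rather than replacing SQLite's own locking.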

4 years ago over+msgmap: respect WAL journal_mode if set
Eric Wong [Tue, 25 Aug 2020 03:02:47 +0000 (03:02 +0000)]
over+msgmap: respect WAL journal_mode if set

WAL actually seems to have ideal locking characteristics given
concurrency problems I'm experiencing with --reindex running
in parallel with expensive read-only SQLite queries:
<https://public-inbox.org/meta/20200825001204.GA840@dcvr/>

Unfortunately, we cannot blindly use WAL while preserving
compatibility with existing setups nor our guarantees that
read-only daemons are indeed "read-only".

However, respect a user's choice to set WAL on their
own if they're comfortable with giving -nntpd/-httpd/-imapd
processes write permission to the directory storing SQLite DBs.

4 years ago msgmap: use "CREATE TABLE IF NOT EXISTS"
Eric Wong [Tue, 25 Aug 2020 03:02:46 +0000 (03:02 +0000)]
msgmap: use "CREATE TABLE IF NOT EXISTS"

It's fewer queries and matches what we do in OverIdx.
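
For illustration, a small Python sketch of why `IF NOT EXISTS` saves a query: the schema setup can run unconditionally on every open, with no prior lookup for the table. The table layout here is hypothetical, not the real msgmap schema.

```python
import sqlite3


def ensure_schema(con):
    """Idempotent schema setup: safe to call on every open."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS msgmap ("
        "num INTEGER PRIMARY KEY AUTOINCREMENT, "
        "mid VARCHAR UNIQUE NOT NULL)"
    )


if __name__ == "__main__":
    con = sqlite3.connect(":memory:")
    ensure_schema(con)
    ensure_schema(con)  # second call is a no-op instead of an error
    print("schema ready")
```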

4 years ago over: skip nodatacow on the journal
Eric Wong [Tue, 25 Aug 2020 03:02:45 +0000 (03:02 +0000)]
over: skip nodatacow on the journal

This file gets truncated anyhow, so it won't fragment.

4 years ago doc: 1.6.0 release notes update
Eric Wong [Tue, 25 Aug 2020 10:51:29 +0000 (10:51 +0000)]
doc: 1.6.0 release notes update

A few more things happened, here.

4 years ago doc: add some more tuning notes
Eric Wong [Tue, 25 Aug 2020 10:51:20 +0000 (10:51 +0000)]
doc: add some more tuning notes

I've learned a thing or three about btrfs in the past few
weeks and remembered some old HDD things, too.

The Xapian MultiDatabase problem will need to be addressed
for 1.7...

4 years ago searchidx: croak for Xapian DB open failure
Eric Wong [Sun, 23 Aug 2020 21:00:27 +0000 (21:00 +0000)]
searchidx: croak for Xapian DB open failure

croak() can give more context on the failure, and setting
`PERL5OPT=-MCarp=verbose' can force a stacktrace.

4 years ago examples: add imapd systemd examples
Eric Wong [Sun, 23 Aug 2020 07:49:18 +0000 (07:49 +0000)]
examples: add imapd systemd examples

We've got examples for all the other daemons, too!

4 years ago index: --sequential-shard checkpoints after each shard
Eric Wong [Sat, 22 Aug 2020 19:51:36 +0000 (19:51 +0000)]
index: --sequential-shard checkpoints after each shard

There's no reason we'd want Xapian to defer flushing once we've
indexed everything belonging to a particular shard.

4 years ago mbox: disable "&t" on existing Xapian until full reindex
Eric Wong [Sat, 22 Aug 2020 06:06:27 +0000 (06:06 +0000)]
mbox: disable "&t" on existing Xapian until full reindex

Expanding threads via over.sqlite3 for mbox.gz downloads without
Xapian effectively collapsing on the THREADID column leads to
repeated messages getting downloaded.

To avoid that situation, use a "has_threadid" Xapian metadata
flag that's only set on --reindex (and brand new Xapian DBs).

This allows admins to upgrade WWW or do --reindex in any order;
without worrying about users eating up bandwidth and CPU cycles.

4 years ago search: support downloading mboxes results with full thread
Eric Wong [Sat, 22 Aug 2020 06:06:26 +0000 (06:06 +0000)]
search: support downloading mboxes results with full thread

Finally, the addition of THREADID for collapsing results
in Xapian lets us emulate the "mairix --threads" feature.
That is, instead of returning only the matching messages,
the entire thread is included in the downloaded mbox.gz.

This requires a "public-inbox-index --reindex" to be usable.

4 years ago searchidx: index THREADID in Xapian
Eric Wong [Sat, 22 Aug 2020 06:06:25 +0000 (06:06 +0000)]
searchidx: index THREADID in Xapian

This is the `tid' column from over.sqlite3; and will be used for
IMAP and JMAP search (among other things).

4 years ago searchidx: put all shard-related stuff in SearchIdxShard.pm
Eric Wong [Sat, 22 Aug 2020 06:06:24 +0000 (06:06 +0000)]
searchidx: put all shard-related stuff in SearchIdxShard.pm

We'll also rename the /^remote_/ prefix to "shard_", since
remote implies the process is on a different host.  These
methods only pass messages to a child process on the same host
OR perform operations within the same process.

4 years ago searchidxshard: clear $msgref buffer properly
Eric Wong [Sat, 22 Aug 2020 06:06:23 +0000 (06:06 +0000)]
searchidxshard: clear $msgref buffer properly

Merely assigning `undef' to a scalar does not free the
underlying buffer memory of a scalar.

4 years ago searchview: fix mbox.gz downloads for lynx users
Eric Wong [Sat, 22 Aug 2020 00:41:25 +0000 (00:41 +0000)]
searchview: fix mbox.gz downloads for lynx users

Unlike w3m and links, the lynx browser seems to require a `name'
attribute for `<input type=submit>' elements.  Maybe some other
browsers do, too.  The `name' attribute for submit elements
doesn't seem to cause any harm for w3m or links users, either;
despite not (AFAIK) being part of historical or current HTML
specs.

4 years ago search: add mset_to_artnums method
Eric Wong [Thu, 20 Aug 2020 20:24:57 +0000 (20:24 +0000)]
search: add mset_to_artnums method

We can avoid importing mdocid() in several places by using
this method, simplifying callers.

4 years ago init+index: support --skip-docdata for Xapian
Eric Wong [Thu, 20 Aug 2020 20:24:56 +0000 (20:24 +0000)]
init+index: support --skip-docdata for Xapian

Since we no longer read document data from Xapian, allow users
to opt-out of storing it.

This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).

4 years ago t/nntpd-v2: set PI_TEST_VERSION=2 properly
Eric Wong [Thu, 20 Aug 2020 20:24:55 +0000 (20:24 +0000)]
t/nntpd-v2: set PI_TEST_VERSION=2 properly

Numbers are hard :<

4 years ago smsg: remove from_mitem
Eric Wong [Thu, 20 Aug 2020 20:24:54 +0000 (20:24 +0000)]
smsg: remove from_mitem

We no longer read docdata.glass from anywhere in our code base.

Some adjustments were needed to t/search.t to deal with the
Xapian::WritableDatabase committing at different times, since
our ->query is avoided from PublicInbox::SearchIdx to avoid
needing a {over_ro} field.

4 years ago mbox: avoid Xapian docdata in search results
Eric Wong [Thu, 20 Aug 2020 20:24:53 +0000 (20:24 +0000)]
mbox: avoid Xapian docdata in search results

Another place where we can reduce kernel page cache overhead
by hitting over.sqlite3 instead of docdata.glass.

4 years ago extmsg: avoid using Xapian docdata
Eric Wong [Thu, 20 Aug 2020 20:24:52 +0000 (20:24 +0000)]
extmsg: avoid using Xapian docdata

Once again, over.sqlite3 contains everything necessary for
Message-ID resolution.  Also, Xapian may be completely
unnecessary with the advent of over.sqlite3, but that's for
another time.

4 years ago searchview: convert nested and Atom display to over.sqlite3
Eric Wong [Thu, 20 Aug 2020 20:24:51 +0000 (20:24 +0000)]
searchview: convert nested and Atom display to over.sqlite3

git blob retrieval dominates on these; "&x=t" (nested) is
roughly the same since the increased overhead for ->get_percent
storage balances out the mass-loading from SQLite.

Atom "&x=A" is sped up slightly and uses less memory in the
long-lived response.

4 years ago searchview: speed up search summary by ~10%
Eric Wong [Thu, 20 Aug 2020 20:24:50 +0000 (20:24 +0000)]
searchview: speed up search summary by ~10%

Instead of loading one article at-a-time from over.sqlite3, we
can use SQL to mass-load IN (?,?, ...) all results with a single
SQLite query.  Despite SQLite being in-process and having no
network latency, the reduction in SQL query executions from
loading multiple rows at once speeds things up significantly.

We'll keep the over->get_art optimizations from the previous
commit, since it still speeds up long-lived responses, slightly.
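
The mass-loading pattern can be sketched in Python with the standard sqlite3 module. The `overview` table, its columns and the `load_many` helper are hypothetical; only the `IN (?,?, ...)` technique comes from the message above.

```python
import sqlite3


def load_many(con, nums):
    """Fetch all requested rows with one query instead of one per article."""
    if not nums:
        return []
    placeholders = ",".join("?" * len(nums))
    sql = f"SELECT num, subject FROM overview WHERE num IN ({placeholders})"
    rows = {num: subject for num, subject in con.execute(sql, list(nums))}
    # IN (...) does not preserve the caller's order, so restore it here.
    return [(n, rows[n]) for n in nums if n in rows]


if __name__ == "__main__":
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY, subject TEXT)")
    con.executemany("INSERT INTO overview VALUES (?, ?)",
                    [(1, "first"), (2, "second"), (3, "third")])
    print(load_many(con, [3, 1]))
```

SQLite caps the number of bound parameters per statement, so very large result pages would need to be loaded in chunks.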

4 years ago searchview: use over.sqlite3 instead of Xapian docdata
Eric Wong [Thu, 20 Aug 2020 20:24:49 +0000 (20:24 +0000)]
searchview: use over.sqlite3 instead of Xapian docdata

This is a step towards improving kernel page cache hit rates by
relying on over.sqlite3 for document data instead of Xapian.
Some micro-optimization to over->get_art was required to
maintain performance.

4 years ago smsg: reduce utf8::decode call sites
Eric Wong [Thu, 20 Aug 2020 20:24:48 +0000 (20:24 +0000)]
smsg: reduce utf8::decode call sites

Both callers of load_from_data call utf8::decode, so just
do utf8::decode in load_from_data.

4 years ago search: make qparse_new an internal function
Eric Wong [Thu, 20 Aug 2020 20:24:47 +0000 (20:24 +0000)]
search: make qparse_new an internal function

We'll probably be reusing it from another package in a future commit.

4 years ago searchquery: split off from searchview
Eric Wong [Thu, 20 Aug 2020 20:24:46 +0000 (20:24 +0000)]
searchquery: split off from searchview

Since this was already a separate package, split it off
into its own file since SearchView may not handle inbox
groups.

4 years ago search: export mdocid subroutine
Eric Wong [Thu, 20 Aug 2020 20:24:45 +0000 (20:24 +0000)]
search: export mdocid subroutine

No need to have awkward globrefs for this.

4 years ago search: improve comments around constants
Eric Wong [Thu, 20 Aug 2020 20:24:44 +0000 (20:24 +0000)]
search: improve comments around constants

We'll probably be adding more value columns like THREADID to sort
on.

4 years ago www: reduce long-lived PublicInbox::Search references
Eric Wong [Thu, 20 Aug 2020 20:24:43 +0000 (20:24 +0000)]
www: reduce long-lived PublicInbox::Search references

While this is unlikely to be a problem in current practice,
keeping Xapian DBs open for long responses can interfere with
free space recovery after -compact.

In the future, it will interfere with inbox search grouping
and lead to unexpected results.

4 years ago xapcmd: simplify {reindex} parameter passing
Eric Wong [Thu, 20 Aug 2020 20:24:42 +0000 (20:24 +0000)]
xapcmd: simplify {reindex} parameter passing

No need to localize it, here, since we can just refer to it
in the `$opt' hashref.  Hopefully this improves readability
for others like it does for me.

I sometimes wonder if the concept of a stack in high-level
languages is even necessary...

4 years ago search: v2: ensure shards are numerically sorted
Eric Wong [Thu, 20 Aug 2020 20:24:41 +0000 (20:24 +0000)]
search: v2: ensure shards are numerically sorted

This seems required to correctly get the NNTP article number
from Xapian docid on combined Xapian DBs.  The default
(ASCII-betical) sorting was only acceptable for -imapd users
until somebody hit 11 (or more) shards, which is a rare case.
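
The sorting pitfall is easy to reproduce. In this Python sketch the shard directory names are assumed to be plain decimal strings, and `sort_shards` is a hypothetical helper.

```python
def sort_shards(names):
    """Sort shard directory names by numeric value, not character order."""
    return sorted(names, key=int)


if __name__ == "__main__":
    shards = [str(i) for i in range(11)]  # 11 shards: "0" through "10"
    print(sorted(shards))       # string order puts "10" right after "1"
    print(sort_shards(shards))  # numeric order: 0, 1, 2, ... 10
```

With ten or fewer shards every name is a single digit, so both orders agree; the difference only appears once shard "10" exists.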

4 years ago init: drop -N alias for --skip-artnum
Eric Wong [Thu, 20 Aug 2020 20:24:40 +0000 (20:24 +0000)]
init: drop -N alias for --skip-artnum

It may be too easily confused for --newsgroup or --ng.  This is
too rarely used and never made it into a release, so it should
be fine.

4 years ago init: support --newsgroup option
Eric Wong [Thu, 20 Aug 2020 20:24:39 +0000 (20:24 +0000)]
init: support --newsgroup option

We can reduce the need to edit the config file for NNTP group names
this way.

4 years ago init: support --help and -?
Eric Wong [Thu, 20 Aug 2020 20:24:38 +0000 (20:24 +0000)]
init: support --help and -?

And speed those up with some lazy loading, too.

4 years ago compact: support --help/-? and perform lazy loading
Eric Wong [Thu, 20 Aug 2020 20:24:37 +0000 (20:24 +0000)]
compact: support --help/-? and perform lazy loading

This probably won't be used much, but --help can still
make sense.

4 years ago admin: progress shows the inbox being indexed
Eric Wong [Thu, 20 Aug 2020 20:24:36 +0000 (20:24 +0000)]
admin: progress shows the inbox being indexed

This is helpful with --all, or when multiple inboxes
are being indexed.

4 years ago doc: note -compact and -xcpdb are rarely used
Eric Wong [Thu, 20 Aug 2020 20:24:35 +0000 (20:24 +0000)]
doc: note -compact and -xcpdb are rarely used

Slowly improving the learning curve...

4 years ago v2writable: show newline after "indexing all of .. " message
Eric Wong [Tue, 11 Aug 2020 19:52:02 +0000 (19:52 +0000)]
v2writable: show newline after "indexing all of .. " message

Otherwise things get very confusing when verbosity is enabled :x

4 years ago smsg: handle wide characters in raw mail headers
Eric Wong [Wed, 19 Aug 2020 08:02:33 +0000 (08:02 +0000)]
smsg: handle wide characters in raw mail headers

There may be messages in the wild with wide characters in
headers which aren't RFC 2047 encoded.  Assume UTF-8 so
those fields can round trip through over.sqlite3.
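
As a rough Python illustration of "assume UTF-8 so the field round-trips": raw header bytes that are valid UTF-8 decode to text and encode back to the same bytes. The `decode_header_bytes` helper and its replacement-character fallback are assumptions of this sketch.

```python
def decode_header_bytes(raw: bytes) -> str:
    """Treat un-encoded 8-bit header bytes as UTF-8."""
    # errors="replace" is this sketch's choice for bytes that are not
    # valid UTF-8; such input will not round-trip exactly.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    raw = "Subject: h\u00e9llo \u2603".encode("utf-8")
    text = decode_header_bytes(raw)
    print(text.encode("utf-8") == raw)
```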

This doesn't affect docdata.glass in Xapian, but it does
affect how over.sqlite3 stores the same deflated info.

4 years ago doc: add public-inbox-tuning(7) manpage
Eric Wong [Sat, 15 Aug 2020 05:21:02 +0000 (05:21 +0000)]
doc: add public-inbox-tuning(7) manpage

Determining storage device speed and latencies doesn't
seem portable or even possible with the wide variety
of storage layers in use.

This means we need to write a tuning document and hope
users read and improve on it :P

4 years ago grok-pull.post_update_hook: favor --sequential-shard for HDD
Eric Wong [Thu, 13 Aug 2020 08:04:04 +0000 (08:04 +0000)]
grok-pull.post_update_hook: favor --sequential-shard for HDD

--sequential-shard offers better performance on HDD than -j0
since the on-disk active set can be kept small (with -j $HIGH_NUM).
--batch-size can also be helpful for systems with much RAM.

4 years ago index|compact|xcpdb: support --all switch
Eric Wong [Thu, 13 Aug 2020 08:04:37 +0000 (08:04 +0000)]
index|compact|xcpdb: support --all switch

For -index, this is a convenient way to quickly index all
inboxes after a grok-pull.  Might as well support it for
rarely used commands like -compact and -xcpdb, too.

4 years ago v2writable: remove IdxStack import
Eric Wong [Wed, 12 Aug 2020 09:17:19 +0000 (09:17 +0000)]
v2writable: remove IdxStack import

We use IdxStack via log2stack() from SearchIdx, now.

4 years ago xcpdb: wire up new index options and --help
Eric Wong [Wed, 12 Aug 2020 09:17:18 +0000 (09:17 +0000)]
xcpdb: wire up new index options and --help

--sequential-shard also disables the copy parallelism (--jobs),
so it can be useful for systems unable to handle parallel random
I/O but still want many shards.

There was a missing "use strict", too, which is fixed.

4 years ago admin: don't warn when --jobs exceeds shards
Eric Wong [Wed, 12 Aug 2020 09:17:17 +0000 (09:17 +0000)]
admin: don't warn when --jobs exceeds shards

Established tools like make(1), prove(1) and xargs(1) don't warn
when the desired parallelism level can't be met, either.

4 years ago xapcmd: reduce CPU idling when shards exceeds job count
Eric Wong [Wed, 12 Aug 2020 09:17:16 +0000 (09:17 +0000)]
xapcmd: reduce CPU idling when shards exceeds job count

In case there are unbalanced shards AND we're limiting parallelism
while using many shards, spawn the next task in the queue ASAP
once a task is done, instead of waiting for all tasks to finish
before spawning the next batch.

Unbalanced shards probably aren't a big issue for most users;
however many smaller shards with few jobs can be useful for HDD
users to reduce the effect of random writes.
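
The scheduling difference can be shown with a small Python simulation. It models task durations only and spawns no processes; the function names and the sample durations are hypothetical.

```python
import heapq


def makespan_batched(durations, jobs):
    """Run tasks in fixed batches; each batch waits for its slowest task."""
    return sum(max(durations[i:i + jobs])
               for i in range(0, len(durations), jobs))


def makespan_asap(durations, jobs):
    """Start the next queued task as soon as any worker becomes free."""
    workers = [0] * jobs  # time at which each worker becomes free
    heapq.heapify(workers)
    for d in durations:
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + d)
    return max(workers)


if __name__ == "__main__":
    shard_costs = [8, 1, 1, 1, 8, 1]  # unbalanced shards
    print(makespan_batched(shard_costs, 2))  # 17
    print(makespan_asap(shard_costs, 2))     # 11
```

With batches, a worker that finishes a small shard sits idle until the large shard in the same batch completes; starting the next task immediately removes that idle time.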

4 years ago xcpdb: support --no-fsync from CLI
Eric Wong [Wed, 12 Aug 2020 09:17:15 +0000 (09:17 +0000)]
xcpdb: support --no-fsync from CLI

This was omitted in 8b1950055d51d436 :x

Fixes: 8b1950055d51d436 ("index+xcpdb: rename `--no-sync' to `--no-fsync'")

4 years ago xapcmd: simplify sub reference
Eric Wong [Wed, 12 Aug 2020 09:17:14 +0000 (09:17 +0000)]
xapcmd: simplify sub reference

We don't need to fully-qualify when referring to subs in
the same namespace, nor do we need to make a SCALAR ref only
to dereference it.

(Yes, still learning Perl :x)

4 years ago convert: set No_COW on copied SQLite files
Eric Wong [Mon, 10 Aug 2020 02:12:05 +0000 (02:12 +0000)]
convert: set No_COW on copied SQLite files

We'll use our existing logic and use sqlite_backup_from_file,
which appeared in 1.39 (along with sqlite_backup_to_file).

4 years ago convert: check ARGV more correctly
Eric Wong [Mon, 10 Aug 2020 02:12:04 +0000 (02:12 +0000)]
convert: check ARGV more correctly

Instead of silently ignoring excessive args, don't let a user
specify an extra directory.  Furthermore, we'll support the odd
case where BOFH wants to name an $INBOX_DIR to be `0' :P

4 years ago convert: speed up --help
Eric Wong [Mon, 10 Aug 2020 02:12:03 +0000 (02:12 +0000)]
convert: speed up --help

Lazy-loading dependencies speeds up --help by several hundred
milliseconds and is a huge step towards user-friendliness.

4 years ago convert: support new -index options
Eric Wong [Mon, 10 Aug 2020 02:12:02 +0000 (02:12 +0000)]
convert: support new -index options

Converting v1 inboxes to v2 can be a painful experience
on HDD.  Some of the new options in the CLI or config
file make it less painful.

4 years ago searchidx: use singular `$opt' for consistency with v2
Eric Wong [Mon, 10 Aug 2020 02:12:01 +0000 (02:12 +0000)]
searchidx: use singular `$opt' for consistency with v2

The rest of our indexing code uses `$opt' instead of `$opts'.

4 years ago index: cleanup internal variables
Eric Wong [Mon, 10 Aug 2020 02:12:00 +0000 (02:12 +0000)]
index: cleanup internal variables

Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.

We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.

We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.

4 years ago admin: use a generic variable name
Eric Wong [Mon, 10 Aug 2020 02:11:59 +0000 (02:11 +0000)]
admin: use a generic variable name

We parse other options, too, not just --max-size.

4 years ago avoid File::Temp::tempfile in more places
Eric Wong [Mon, 10 Aug 2020 02:11:58 +0000 (02:11 +0000)]
avoid File::Temp::tempfile in more places

We can use open(..., undef) natively in Perl in t/import.t

In places where we need a pathname, the File::Temp OO API
gives us auto-unlinking for free.
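
A loose Python analogy of the two cases above: an anonymous temporary file when no pathname is needed, and a named one that is removed automatically when a pathname is needed. The helper names and the `example-` prefix are hypothetical, and this does not show the Perl API itself.

```python
import os
import tempfile


def anonymous_roundtrip(data: bytes) -> bytes:
    """Write to a temp file that never needs a pathname, then read it back."""
    with tempfile.TemporaryFile() as fh:
        fh.write(data)
        fh.seek(0)
        return fh.read()


def named_tempfile_is_cleaned_up() -> bool:
    """Use a named temp file; it is unlinked automatically on close."""
    with tempfile.NamedTemporaryFile(prefix="example-") as tmp:
        path = tmp.name
        existed = os.path.exists(path)
    return existed and not os.path.exists(path)


if __name__ == "__main__":
    print(anonymous_roundtrip(b"hello"))
    print(named_tempfile_is_cleaned_up())
```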

4 years ago msgmap: tmp_clone: simplify + meaningful filename
Eric Wong [Mon, 10 Aug 2020 02:11:57 +0000 (02:11 +0000)]
msgmap: tmp_clone: simplify + meaningful filename

Trying to use the newer ->sqlite_backup_to_dbh method doesn't
seem worth it, as we'll have to support DBD::SQLite <= 1.60
another decade or more.

Dumping 'msgmap-XXXXXXX' into $INBOX_DIR can appear a bit
confusing to users, so give it a "mm_tmp-$PID-XXXXXXXX" name
to emphasize it's a temporary file tied to a given PID.

We also don't want to penalize read-only daemons with
loading File::Temp, so do it lazily.

4 years ago index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior
Eric Wong [Mon, 10 Aug 2020 02:11:56 +0000 (02:11 +0000)]
index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior

-index now invokes ->DESTROY like xcpdb does, which is necessary
to cleanup $INBOX_DIR/msgmap-XXXXXXX files.  We'll also exit
with the expected values for various signals by adding 128
as described in <https://www.tldp.org/LDP/abs/html/exitcodes.html>

-xcpdb now terminates worker processes and xapian-compact(1)
invocations when prematurely killed, too.
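
The "128 plus signal number" exit convention can be sketched in Python as follows. The `install_handlers` helper and its `cleanup` callback are hypothetical; the sketch only illustrates the convention and is not how -index or -xcpdb are written.

```python
import signal
import sys


def exit_code_for(signum: int) -> int:
    """Shell convention: a process ended by signal N exits with 128 + N."""
    return 128 + signum


def install_handlers(cleanup):
    """Run `cleanup` and exit with the conventional code on common signals."""
    def handler(signum, frame):
        cleanup()  # e.g. remove temporary files, stop workers
        sys.exit(exit_code_for(signum))

    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP, signal.SIGPIPE):
        signal.signal(sig, handler)


if __name__ == "__main__":
    print(exit_code_for(signal.SIGINT))   # 130
    print(exit_code_for(signal.SIGTERM))  # 143
```

Exiting through `sys.exit` rather than dying abruptly lets normal interpreter cleanup run, which is the Python counterpart of getting destructors invoked.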

4 years ago doc: add some notes around -xcpdb / -edit / -purge
Eric Wong [Mon, 10 Aug 2020 02:11:55 +0000 (02:11 +0000)]
doc: add some notes around -xcpdb / -edit / -purge

These rarely-used commands have some caveats that needed
expanding on.

4 years ago doc: index: more notes about latest changes
Eric Wong [Mon, 10 Aug 2020 02:11:54 +0000 (02:11 +0000)]
doc: index: more notes about latest changes

With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful.  I was able to index LKML in ~16 hours on a
system that had other activity on it.  The big downside was it
was eating up over 5g of RAM :x.

We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.

4 years ago index: --sequential-shard works incrementally
Eric Wong [Mon, 10 Aug 2020 02:11:53 +0000 (02:11 +0000)]
index: --sequential-shard works incrementally

We should never reindex all data in Xapian unless --reindex is
specified on the command-line.  This means users who put
publicInbox.indexSequentialShard in their config file won't have
to put up with a full reindex at every invocation, only when
they specify --reindex.

We'll also clean up the progress output to not emit nonsensical
ranges where the starting number is higher than the end.

4 years ago index: require --reindex when using --xapian-only
Eric Wong [Mon, 10 Aug 2020 02:11:52 +0000 (02:11 +0000)]
index: require --reindex when using --xapian-only

This is to avoid user error with a currently undocumented switch,
since --xapian-only always goes through the full history at
the moment.

4 years ago favor `getconf _NPROCESSORS_ONLN` over GNU nproc
Eric Wong [Sat, 8 Aug 2020 11:24:05 +0000 (11:24 +0000)]
favor `getconf _NPROCESSORS_ONLN` over GNU nproc

getconf(1) itself is POSIX, while `_NPROCESSORS_ONLN' is not.
However, FreeBSD (tested 11.4 and 12.1) and glibc (tested CentOS
7.x and Debian 10.x) both support `getconf _NPROCESSORS_ONLN'.

GNU coreutils (and thus `nproc' or `gnproc') are not installed
by default on the *BSDs, so we'll try the option most likely
to exist on both glibc and *BSDs out-of-the-box.
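
A minimal sketch of that preference order in POSIX shell follows.
`detect_nproc' is a hypothetical helper name, and the final fallback
to 1 is an assumption for illustration, not necessarily what the
project does:

```sh
#!/bin/sh
# Sketch only: detect_nproc is a hypothetical helper name.
detect_nproc() {
	# Try the option most likely to exist out-of-the-box first.
	n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
		n=$(nproc 2>/dev/null) ||
		n=$(gnproc 2>/dev/null) ||
		n=1
	# Guard against empty or non-numeric output (e.g. "undefined").
	case $n in
	'' | *[!0-9]*) n=1 ;;
	esac
	echo "$n"
}

echo "online processors: $(detect_nproc)"
```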

4 years agodir_idle: require Perl 5.22+ for kqueue
Eric Wong [Sat, 8 Aug 2020 04:59:49 +0000 (04:59 +0000)]
dir_idle: require Perl 5.22+ for kqueue

IO::KQueue requires us to use fileno(DIRHANDLE) for setting up
kqueue watches.  This use of fileno() is only supported since
Perl 5.22, so BSD users on older Perl will have to fall back to
old polling.

This currently affects users of -watch, but will affect other
read-only Xapian users soon.

4 years agosupport setting No_COW on Perl <5.22
Eric Wong [Sat, 8 Aug 2020 04:59:48 +0000 (04:59 +0000)]
support setting No_COW on Perl <5.22

fileno(DIRHANDLE) only works on Perl 5.22+, so we need to use
dirfd(3) ourselves from Inline::C (or rely on chattr(1) being
installed).

While we're at it, rename `set_nodatacow' to `nodatacow_fd'
for consistency with `nodatacow_dir'.

4 years agoindex: add built-in --help / -?
Eric Wong [Fri, 7 Aug 2020 10:52:18 +0000 (10:52 +0000)]
index: add built-in --help / -?

Eventually, commonly-used commands run by the user will all
support --help / -? for user-friendliness.  The changes from
up-front `use' to lazy `require' speed up `--help' by 3x or so.

4 years agosearchidx: use Perl truthiness to detect XAPIAN_FLUSH_THRESHOLD
Eric Wong [Fri, 7 Aug 2020 10:52:17 +0000 (10:52 +0000)]
searchidx: use Perl truthiness to detect XAPIAN_FLUSH_THRESHOLD

XAPIAN_FLUSH_THRESHOLD is a C string in the environment, so
users may be tempted to assign an empty string in their
shell, e.g. `XAPIAN_FLUSH_THRESHOLD= <command>' instead of using
the `unset' POSIX shell built-in.

With either a value of "0" or "" (empty string), Xapian will
fall back to its default (10000 documents), which causes grief
for memory-starved users.
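
To illustrate the three cases, here is a shell approximation of the
Perl truthiness test, where unset, empty and "0" all count as false.
`flush_threshold_is_usable' and `describe' are hypothetical helper
names; the actual check lives in Perl code.

```sh
#!/bin/sh
# Sketch only: flush_threshold_is_usable is a hypothetical helper name.
# It approximates Perl truthiness: unset, "" and "0" are all false.
flush_threshold_is_usable() {
	case ${XAPIAN_FLUSH_THRESHOLD-} in
	'' | 0) return 1 ;;
	*) return 0 ;;
	esac
}

describe() {
	if flush_threshold_is_usable; then
		echo "usable"
	else
		echo "falls back to the Xapian default"
	fi
}

unset XAPIAN_FLUSH_THRESHOLD
echo "unset:  $(describe)"
XAPIAN_FLUSH_THRESHOLD=
echo "empty:  $(describe)"
XAPIAN_FLUSH_THRESHOLD=0
echo "zero:   $(describe)"
XAPIAN_FLUSH_THRESHOLD=500000
echo "500000: $(describe)"
```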

4 years agoindex: max out XAPIAN_FLUSH_THRESHOLD if using --batch-size
Eric Wong [Fri, 7 Aug 2020 10:52:16 +0000 (10:52 +0000)]
index: max out XAPIAN_FLUSH_THRESHOLD if using --batch-size

If XAPIAN_FLUSH_THRESHOLD is unset, Xapian will default to
10000.  That limits the effectiveness of users specifying
extremely large values of --batch-size.

While we're at it, localize the changes to globals since -index
may be eval-ed in tests (and perhaps production code in the
future).

4 years agoindex: --compact respects --sequential-shard
Eric Wong [Fri, 7 Aug 2020 10:52:15 +0000 (10:52 +0000)]
index: --compact respects --sequential-shard

Since the --compact switch works on Xapian shards,
it makes sense that --sequential-shard affects our
usage of xapian-compact(1).

4 years agov2writable: fix batch size accounting
Eric Wong [Fri, 7 Aug 2020 10:52:14 +0000 (10:52 +0000)]
v2writable: fix batch size accounting

We need to account for whether shard parallelization is
enabled or not, since users of parallelization are expected
to have more RAM.

4 years agoindex+xcpdb: rename `--no-sync' to `--no-fsync'
Eric Wong [Fri, 7 Aug 2020 01:14:06 +0000 (01:14 +0000)]
index+xcpdb: rename `--no-sync' to `--no-fsync'

We'll continue supporting `--no-sync' even though it has yet to
make it into a release, but the term `sync' is overloaded in our
codebase, which may be confusing to new hackers and users.

None of our code nor dependencies issue the sync(2) syscall,
either, only fsync(2) and fdatasync(2).
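
Accepting both spellings can be sketched in POSIX shell as below.
This is only an illustration of keeping an old alias alongside a
renamed switch; `want_fsync' is a hypothetical helper, and the real
commands parse their options in Perl.

```sh
#!/bin/sh
# Sketch only: want_fsync is a hypothetical helper name.
# Prints 0 if fsync was disabled via either spelling, 1 otherwise.
want_fsync() {
	fsync=1
	for arg in "$@"; do
		case $arg in
		--no-fsync | --no-sync) fsync=0 ;;
		esac
	done
	echo "$fsync"
}

echo "no options: $(want_fsync)"
echo "--no-fsync: $(want_fsync --no-fsync)"
echo "--no-sync:  $(want_fsync --no-sync)"
```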

4 years agoindex: support --xapian-only switch
Eric Wong [Fri, 7 Aug 2020 01:14:05 +0000 (01:14 +0000)]
index: support --xapian-only switch

This is useful for speeding up indexing runs when only Xapian
rules change but SQLite indexing doesn't change.  This mostly
implies `--reindex', but does NOT pick up new messages (because
SQLite indexing needs to occur for that).

I'm leaving this undocumented in the manpage for now since it's
mainly to speed up development and testing.  Users upgrading to
1.6.0 will be advised to `--reindex --rethread', anyways, due to
the threading improvements since 1.1.0-pre1.

It may make sense to document for 1.7+ when there are Xapian-only
indexing changes, though.