Eric Wong [Sun, 10 May 2020 22:37:11 +0000 (22:37 +0000)]
xt/eml_check_limits: check limits against an inbox
This allows maintainers to easily check limits against the
contents of existing inboxes. This script covers most of
the new limits enforced by PublicInbox::Eml.
Eric Wong [Mon, 11 May 2020 04:27:36 +0000 (04:27 +0000)]
spawn: use ~/.cache/public-inbox/inline-c if writable
Despite several memory reductions and pure Perl performance
improvements, Inline::C spawn() still gives us a noticeable
performance boost.
More user-oriented command-line programs are likely coming;
setting PERL_INLINE_DIRECTORY is annoying to users, and so is
poor performance. So allow users to opt-in to using our
Inline::C code once by creating a `~/.cache/public-inbox/inline-c'
directory.
XDG_CACHE_HOME is respected to override the location of ~/.cache
independent of HOME, according to
https://specifications.freedesktop.org/basedir-spec/0.6/ar01s03.html
v2: use "/nonexistent" if HOME is undefined, since that's
the home of the "nobody" user on both FreeBSD and Debian.
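
A minimal sketch of the lookup order described above, written in
Python for illustration rather than in the project's Perl. The
function name inline_c_dir() and the "env" parameter are
hypothetical; only the directory layout, the XDG_CACHE_HOME
override, and the "/nonexistent" fallback come from this message.

```python
import os

def inline_c_dir(env=None):
    """Return the opt-in Inline::C directory, or None if it is unusable."""
    env = os.environ if env is None else env
    cache_home = env.get("XDG_CACHE_HOME")
    if not cache_home:
        # HOME may be undefined; "/nonexistent" is the fallback named above
        home = env.get("HOME") or "/nonexistent"
        cache_home = os.path.join(home, ".cache")
    path = os.path.join(cache_home, "public-inbox", "inline-c")
    # opting in means the user created the directory and it is writable
    if os.path.isdir(path) and os.access(path, os.W_OK):
        return path
    return None
```

Nothing is created on the user's behalf in this sketch: a missing
or read-only directory simply means "no Inline::C".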
Eric Wong [Sun, 10 May 2020 19:38:23 +0000 (19:38 +0000)]
build: check-manifest runs after tests
And just treat it as a non-fatal nag when checking the rest of the
codebase. Calling it "check-manifest" as a `make' target
preserves the old behavior, which causes the check to fail
if a file were added to the worktree without changing the
MANIFEST.
Eric Wong [Sun, 10 May 2020 06:21:05 +0000 (06:21 +0000)]
eml: rename limits to match postfix names
They're still part of our internal API at this point, but
reusing the same names as those used by postfix makes sense for
now to reduce cognitive overheads of learning new things.
There's no "mime_parts_limit", but the name is consistent
with "mime_nesting_limit".
Eric Wong [Sun, 10 May 2020 06:21:04 +0000 (06:21 +0000)]
eml: enforce a maximum header length
While our header processing is more efficient than
Email::*::Header, capping the maximum size for a `m//g' match
still limits memory growth on a header we care for.
Use the same limit as postfix (header_size_limit=102400), since
messages fetched via git/HTTP/NNTP/etc can bypass MTA limits.
Eric Wong [Sat, 9 May 2020 08:37:00 +0000 (08:37 +0000)]
search: remove documentation for "lid:"
I'm not sure it's necessary, since "mid:" is similarly
undocumented. Also, "t:", "c:", "f:" don't offer boolean
analogues for exact matches on To/Cc/From headers, despite
having similar tokens as List-Id inside angle brackets.
Eric Wong [Sat, 9 May 2020 08:27:38 +0000 (08:27 +0000)]
emlcontentfoo: quiet warning on missing attributes
This bug was also present in Email::MIME::ContentType:
commit ae081fb576d8507efca4928116ad81efa756c723 (refs/pull/pull/9/head)
in https://github.com/rjbs/Email-MIME-ContentType.git
Our fix is shorter, but dependent on 5.10+ as our codebase
relies on Perl 5.10 features, anyways.
Eric Wong [Sat, 9 May 2020 08:27:37 +0000 (08:27 +0000)]
eml: speed up common LF-only emails
Emails from a *nix MTA are typically LF-only, so we don't need the
complexity of the RE engine when a simple index() works. We
still need to ensure there's no "\r\n\r\n" before the first
"\n\n", but two calls to index() are still faster than a RE
match.
This gives a 2-5% speedup in some informal tests and saves ~30MB
when scanning a 30MB spam message on newer versions of Perl.
I'll have to diagnose why Perl wastes so much memory doing
RE matches on giant strings, though.
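
The two-lookup idea can be sketched as follows, in Python for
illustration only. split_header_body() is a hypothetical name, and
the CRLF branch here also uses a plain substring search, which is
an assumption of this sketch and not necessarily what the Perl
code falls back to.

```python
def split_header_body(raw: bytes):
    """Split raw message bytes into (header, body) at the first blank line."""
    lf = raw.find(b"\n\n")
    crlf = raw.find(b"\r\n\r\n")
    if lf >= 0 and (crlf < 0 or lf < crlf):
        # common case: LF-only, and no CRLF blank line comes first
        return raw[:lf + 1], raw[lf + 2:]
    if crlf >= 0:
        # a CRLF blank line ends the header before any LF-only one
        return raw[:crlf + 2], raw[crlf + 4:]
    # no blank line at all: treat the whole message as header
    return raw, b""
```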
Eric Wong [Sat, 9 May 2020 08:27:36 +0000 (08:27 +0000)]
eml: reduce RE captures and possible side effects
Since Perl 5.6, the `@-' (aka @LAST_MATCH_START) and `@+' (aka
@LAST_MATCH_END) arrays provide integer offsets for every match
as documented in perlvar(1), regardless of regexp modifiers.
We can avoid relying on $1 in the epilogue scan, entirely.
So use these instead of relying on m//g and pos(), since the `g'
modifier can be affected by m//g matches performed in other
places.
Unrelated, but while we're in the area: remove some unnecessary
use of (?:...), too.
Kyle Meyer [Sat, 9 May 2020 18:57:46 +0000 (18:57 +0000)]
viewdiff: don't increment the reported hunk line number
For a diff hunk starting at line N, diff_hunk() constructs the link
with "#n(N + 1)". This sends the viewer one line below the first
context line. Although this is minor and may not even be noticed,
there's not an obvious reason to increment the line number, so switch
to using the reported value as is.
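
As an illustration of "using the reported value as is", here is a
small Python sketch. hunk_anchors() and HUNK_RE are hypothetical
names and this is not the diff_hunk() code itself; only the "#n"
prefix comes from the description above.

```python
import re

# matches e.g. "@@ -10,7 +12,8 @@ optional function context"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def hunk_anchors(line):
    """Return ("#nOLD", "#nNEW") link fragments for a hunk header, or None."""
    m = HUNK_RE.match(line)
    if m is None:
        return None
    old_start, new_start = m.group(1), m.group(2)
    # use the reported start lines unchanged, with no "+ 1"
    return "#n" + old_start, "#n" + new_start
```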
Eric Wong [Thu, 7 May 2020 21:05:53 +0000 (21:05 +0000)]
eml: remove dependency on Email::MIME::Encodings
Since Email::MIME usage is going away, Email::MIME::Encodings
might as well go away, too. We can also use fewer branches
and just rely on hash lookups, unlike E::M::E.
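
The "hash lookups instead of branches" approach could look roughly
like this in Python (an illustration, not the Perl in this
commit). DECODERS and decode_body() are hypothetical names, and
passing unknown encodings through unchanged is an assumption of
the sketch.

```python
import base64
import quopri

def _identity(data: bytes) -> bytes:
    return data

# one table lookup replaces a chain of if/elsif branches
DECODERS = {
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
    "7bit": _identity,
    "8bit": _identity,
    "binary": _identity,
}

def decode_body(cte: str, data: bytes) -> bytes:
    """Decode data according to a Content-Transfer-Encoding value."""
    # assumption: unknown encodings are returned as-is
    return DECODERS.get(cte.strip().lower(), _identity)(data)
```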
Eric Wong [Thu, 7 May 2020 21:05:52 +0000 (21:05 +0000)]
EmlContentFoo: relax Encode version requirement
We want to support Perl v5.10.1 out-of-the-box with minimal
download/installation time. Installing Encode from CPAN
requires a compiler and lengthy build+install time.
So mimic find_mime_encoding() using what Perl v5.10.1 provides
out-of-the-box.
Since we're getting rid of Email::MIME, get rid of
Email::MIME::ContentType, too; since we may introduce
speedups down the line specific to our codebase.
Eric Wong [Thu, 7 May 2020 21:05:48 +0000 (21:05 +0000)]
eml: pure-Perl replacement for Email::MIME
Email::MIME eats memory, wastes time parsing out all the
headers, and some problems can't be fixed without breaking
compatibility for other projects which depend on it.
Informal benchmarks show a ~2x improvement in general
stats gathering scripts and ~10% improvement in HTML
view rendering.
We also don't need the ability to create MIME messages, just
parse them and maybe drop an attachment.
While this isn't the zero-copy or streaming MIME parser of my
dreams; it's still an improvement in that it doesn't keep a
scalar copy of the raw body around along with subparts. It also
doesn't parse subparts up front, so it can also replace our uses
of Email::Simple.
Eric Wong [Thu, 7 May 2020 21:05:47 +0000 (21:05 +0000)]
smsg: use capitalization for header retrieval
PublicInbox::Eml will have case-sensitive memoization to
avoid the need to call `lc' to retrieve common headers,
so ensure we call $mime->header() with the common
capitalization.
Unfortunately, we need to continue using lowercase for field
names for smsg, since NNTP requires case-insensitivity when
matching headers and method dispatch is expensive.
Eric Wong [Thu, 7 May 2020 21:05:46 +0000 (21:05 +0000)]
filter/rubylang: avoid recursing subparts to strip trailers
Mailman only seems to add trailers (or signatures) as
attachments at the top-level of MIME messages. So don't bother
recursing with ->walk_parts since ->walk_parts is non-trivial to
recreate in the Email::MIME replacement I'm working on.
Eric Wong [Thu, 7 May 2020 21:05:45 +0000 (21:05 +0000)]
msg_iter: pass $idx as a scalar, not array
This doesn't make any difference for most multipart
messages (or any single part messages). However,
this starts having space savings when parts start
nesting.
Eric Wong [Thu, 7 May 2020 21:05:44 +0000 (21:05 +0000)]
msg_iter: make ->each_part method for PublicInbox::MIME
The reliance on Email::MIME->subparts is a tad inefficient with
a work-in-progress module to replace Email::MIME. So move
towards using ->each_part as a class-specific iterator which can
take advantage of more class-specific optimizations in the
yet-to-be-revealed PublicInbox::Eml and PublicInbox::Gmime
classes.
The msg_iter() sub remains for compatibility with existing
3rd-party scripts/modules which use our small public Perl API
and Email::MIME.
Eric Wong [Fri, 8 May 2020 01:59:01 +0000 (01:59 +0000)]
www: preload: load all encodings at startup
Encode lazy-loads encodings on an as-needed basis. This is
great for short-lived programs, but leads to fragmentation in
long-lived daemons where immortal allocations can get
interleaved with short-lived, per-request allocations.
Since we have no idea which encodings will be needed when
there's a constant flow of incoming mail, just preload
everything available at startup.
Eric Wong [Thu, 7 May 2020 03:00:09 +0000 (03:00 +0000)]
search: support searching on List-Id
We'll support both probabilistic matches via `l:' and boolean
matches via `lid:' for exact matches, similar to how both `m:'
and `mid:' are supported. Only text inside angle braces (`<'
and `>') is supported, since I'm not sure if there's value in
searching on the optional phrases (which would require decoding
with ->header_str instead of ->header_raw).
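
Extracting only the angle-braced part of a raw List-Id header can
be sketched like this in Python (illustrative only). list_ids() is
a hypothetical helper, the sample header values in the usage note
are made up, and how the extracted values are fed to the indexer
is not shown.

```python
import re

# text between `<' and `>'; the optional phrase before it is ignored
LIST_ID_RE = re.compile(r"<([^<>]+)>")

def list_ids(header_raw: str):
    """Return every angle-braced identifier found in a raw List-Id value."""
    return LIST_ID_RE.findall(header_raw)
```

With a made-up value such as 'Some List <list.example.org>' this
yields only the identifier, so the undecoded phrase never needs to
be looked at.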
Eric Wong [Wed, 6 May 2020 10:40:54 +0000 (10:40 +0000)]
viewdiff: stricter highlighting and linkification check
Sometimes senders draw ASCII tables and such which we
get fooled into attempting highlighting and diffstat
anchoring.
We now require 3 consecutive diff header lines:
/^--- /, /^\Q+++\E /, and /^@@ /
to enable diff highlighting (whether generated with git or not).
The presence of a line matching /^diff / is not sufficient or
even useful to us for highlighting diffs, since that could just
be part of a line-wrapped sentence.
However, we'll now check for the presence of a line matching
/^diff --git / before enabling diffstat anchors. Otherwise
cover letters for a patch series may fool us into creating
anchors for diffstats.
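
The two checks above can be sketched as follows, in Python for
illustration. diff_features() is a hypothetical name; the real
code works line-by-line while rendering, and whether the two
checks are combined there is not something this sketch claims.

```python
def diff_features(lines):
    """Return (highlight, anchors) for a list of message body lines."""
    highlight = False
    # require "--- ", "+++ " and "@@ " on three consecutive lines
    for a, b, c in zip(lines, lines[1:], lines[2:]):
        if a.startswith("--- ") and b.startswith("+++ ") and c.startswith("@@ "):
            highlight = True
            break
    # diffstat anchors additionally need a "diff --git " line somewhere
    anchors = any(line.startswith("diff --git ") for line in lines)
    return highlight, anchors
```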
Eric Wong [Wed, 6 May 2020 10:40:53 +0000 (10:40 +0000)]
viewdiff: assume diffstat and diff order are identical
For non-malicious messages, we can assume the diffstat and actual
diff appear in the same order. Thus we can store {-long_paths} as
an arrayref and only compare the first element when we encounter
a truncated path.
This should make HTML rendering stable when there are basename
conflicts in a message such as
https://lore.kernel.org/backports/1393202754-12919-13-git-send-email-hauke@hauke-m.de/
This diffstat anchor linkification can still be defeated by
users who make actual path names beginning with "...", but we
won't waste CPU cycles on it, either.
Eric Wong [Wed, 29 Apr 2020 11:14:43 +0000 (11:14 +0000)]
t/precheck: remove Email::Simple->create from tests
It's likely we'll replace Email::Simple using our Email::MIME
alternative/replacement, as well. So reduce the API surface we
interact with and make it easier to swap implementations.
Eric Wong [Tue, 28 Apr 2020 08:48:58 +0000 (08:48 +0000)]
git: various minor speedups
While testing performance improvements elsewhere, I noticed some
micro-optimizations could give a small ~2-3% speedup in my test
using the git async API to parse a large inbox.
The `read' perlfunc already has read-in-full behavior (unless
git is killed unexpectedly), so there's no point in using a
loop. SearchIdxShard in the parallel v2 indexing code path
never looped on `read', either.
Furthermore, we can avoid method dispatch overhead on ->getline
and ->print by using `readline' and `print' as ops which can be
resolved during the Perl compilation phase.
Finally, avoid passing the IO handle around as a parameter,
since avoiding hash lookups with a local variable has its own
costs in stack and refcount bumping.
Eric Wong [Sat, 25 Apr 2020 05:52:22 +0000 (05:52 +0000)]
tests: remove Email::MIME->create use entirely
Replace them with .eml files generated with the help of
Email::MIME, but without some extraneous and unnecessary
headers, and strip mime_load down to just loading files.
This will give us more freedom to experiment with other mail
libraries which may be more correct, better maintained, use
less memory and/or be faster than Email::MIME.
Eric Wong [Sat, 25 Apr 2020 05:52:21 +0000 (05:52 +0000)]
testcommon: introduce mime_load sub
We'll use this to create, memoize, and reuse .eml files. This
will be used to reduce (and eventually eliminate) our dependency
on Email::MIME in tests.
Eric Wong [Tue, 21 Apr 2020 20:30:06 +0000 (20:30 +0000)]
doc: note some changes for 1.5
As an established project (:P), it's important to document when
new features appear in manpages. Users may be reading new
documentation online which doesn't reflect an older version they
have installed.
Eric Wong [Tue, 21 Apr 2020 21:16:12 +0000 (21:16 +0000)]
t/*.t: use Email::MIME->create over PublicInbox::MIME->create
PublicInbox::MIME only supports ->new, and is only different
from Email::MIME for old versions of Email::MIME. In the
future, PublicInbox::MIME may not be a subclass of Email::MIME
at all.
Eric Wong [Tue, 21 Apr 2020 06:57:34 +0000 (06:57 +0000)]
make zlib-related modules a hard dependency
This allows us to simplify some of our existing code and make
future changes easier.
I doubt anybody goes through the trouble to have a Perl
installation without zlib support. The zlib source code is even
bundled with Perl since 5.9.3 for systems without existing zlib
development headers and libraries.
Of course, zlib is also a requirement of git, too; and we're not
going to stop using git :)
In the second line, the omission character " is appended, but the
entire subject is shown. To display the subject with duplicated parts
omitted, regenerate it from the array that is modified by
dedupe_subject().
view: strip omission character from current message in thread view
In the thread view shown at the top of a message, the subject for the
current message is dropped, leaving just the sender's name. However,
if skel_dump() omitted part of the subject because it was duplicated,
the omission character is still displayed:
* [PATCH v2] t/www_listing: avoid 'once' warnings
2020-03-21 1:10 ` [PATCH 2/2] t/www_listing: avoid 'once' warnings Eric Wong
@ 2020-03-21 5:24 ` " Eric Wong
Note the " on the last line.
Adjust the regular expression in _th_index_lite() to account for the
omission character.
Eric Wong [Tue, 21 Apr 2020 03:22:51 +0000 (03:22 +0000)]
t/nntpd: reduce dependencies on internal API
Since the advent of run_script(), we can rely on it to simplify
our test code. Changes like this will let us evolve the
internal API more easily while preserving stable CLI interfaces,
especially since we test the v2 path by default, now.
Eric Wong [Mon, 20 Apr 2020 22:55:37 +0000 (22:55 +0000)]
index: support --max-size / publicinbox.indexMaxSize
In normal mail paths, we can rely on MTAs being configured with
reasonable limits in the -watch and -mda mail injection paths.
However, the MTA is bypassed in a git-only delivery path, so a BOFH
could inject a large message and DoS users attempting to mirror
a public-inbox.
This doesn't protect unindexed WWW interfaces from Email::MIME
memory explosions on v1 inboxes. Probably nobody cares about
unindexed WWW interfaces anymore, especially now that Xapian is
optional for indexing.
Eric Wong [Fri, 17 Apr 2020 09:33:31 +0000 (09:33 +0000)]
qspawn: remove Perl 5.16.x leak workaround
It seems no longer necessary to work around this Perl 5.16.3 bug
after the removal of anonymous subs from all of our internal
code in
https://public-inbox.org/meta/20191225075104.22184-1-e@80x24.org/
Tested with repeated clones (both aborted and completed)
in a CentOS 7.x VM which was once able to reproduce leaks
before the workaround appeared in 2fc42236f72ad16a
("qspawn: workaround Perl 5.16.3 leak, re-enable Deflater")
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Eric Wong [Mon, 20 Apr 2020 09:33:38 +0000 (09:33 +0000)]
drop needless `eval {}' around Config->new
It hasn't been needed since commit 089cca37fa036411
("config: ignore missing config files"). And we
actually want to propagate errors when we can't
start new processes or if git(1) is missing.
Eric Wong [Sun, 19 Apr 2020 23:19:35 +0000 (23:19 +0000)]
import: init_bare: use pure Perl
Even on systems with Inline::C spawn(), this cuts a primed
"make check-run" time by 2-3% on Linux, and roughly 5-7% on
FreeBSD when using vfork-enabled spawn.
I doubt anybody cares: this omits the sample hooks and some
empty and useless-for-us or obsolete directories created by
git-init(1).
The watchheader key supports only a single value. Supporting multiple
watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc:
explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't
clear if there was a need.
One scenario in which matching multiple headers would be convenient is
when someone wants to set up public-inbox archives for some small
projects but does _not_ want to run mailing lists for them, instead
allowing others to follow the project by any of the pull mechanisms.
Using a common underlying address, an address alias for each project
is configured via a third-party email provider, with messages for each
alias being exposed as a separate public-inbox archive. In this
setup, messages for an inbox cannot be selected by a List-ID header
but can be identified by the inbox's address in either the To or Cc
header.
To support such a use case, update the watchheader handling to
consider multiple values, accepting a message if it matches any value.
While selecting a message based on matching _any_ rather than _all_
values is motivated by the above scenario, it's worth noting that the
"any" behavior is consistent with how multiple listid config values
are handled.
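
The "any" semantics can be sketched like this, in Python for
illustration. accepts() is a hypothetical name, the (header,
value) pairs stand in for parsed watchheader config values, and
the case-insensitive substring match is an assumption of this
sketch rather than a statement about the actual matching rule.

```python
from email import message_from_string

def accepts(msg, watchheaders):
    """True if any (header name, needle) pair matches the message."""
    for name, needle in watchheaders:
        for value in msg.get_all(name, []):
            # assumption: case-insensitive substring match
            if needle.lower() in value.lower():
                return True
    return False
```

For the alias scenario above, the same address would be listed
once for To and once for Cc, and a message is accepted when either
header carries it.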
Eric Wong [Fri, 17 Apr 2020 10:24:45 +0000 (10:24 +0000)]
doc: start writeup on semi-automatic memory management
I don't consider Perl's memory management "automatic". Instead,
having an extra bit of control as a hacker is nice and there's
no need to burden ordinary users with GC tuning knobs.
Eric Wong [Sat, 18 Apr 2020 03:38:53 +0000 (03:38 +0000)]
reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME
parsing. It was never done for v2 indexing, even though v1->v2
conversions did NOT remove those From_ lines. There was never a
need to remove From_ lines in the v1 SearchIdx paths, either.
Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message
thread reveals a ~0.5% speed improvement. This will become
more apparent when we have a faster MIME parser.
Eric Wong [Sat, 18 Apr 2020 03:38:50 +0000 (03:38 +0000)]
favor `do {}' over `eval {}' for localized slurp
I did not know to use the return value of `do' back in the day.
There's probably no practical difference in these cases, but
`eval' is overkill for these uses and may hide actual errors.
We can get rid of a few redundant `scalar' ops and pass scalar
refs to Email::MIME->new to avoid copies in a few more places,
too.
Eric Wong [Sat, 18 Apr 2020 03:38:47 +0000 (03:38 +0000)]
searchidx: die on cat-file failures
We always use the object ID from "git <log|rev-list>" for
retrieving blobs, so fail loudly if the git repository is
corrupt instead of silently continuing.
Eric Wong [Sat, 18 Apr 2020 03:38:46 +0000 (03:38 +0000)]
inboxwritable: mime_from_path: reuse in more places
There's nothing Maildir-specific about the function, so
`maildir_path_load' was a bad name. So give it a more
appropriate name and use it in our tests.
This saves us some code and inconsistency by reusing an
existing internal library routine in more places. We can drop
the "From_" line in some of our (formerly) mbox sample files.
Eric Wong [Fri, 17 Apr 2020 09:28:49 +0000 (09:28 +0000)]
searchthread: reduce indirection by removing container
We can rid ourselves of a layer of indirection by subclassing
PublicInbox::Smsg instead of using a container object to hold
each $smsg. Furthermore, the `{id}' vs. `{mid}' field name
confusion is eliminated.
This reduces the size of the $rootset passed to walk_thread by
around 15%, that is over 50K memory when rendering a /$INBOX/
landing page.
Eric Wong [Thu, 16 Apr 2020 00:29:38 +0000 (00:29 +0000)]
t/httpd-corner: improve reliability and diagnostics
The graceful-shutdown-on-PUT test is unreliable because we can't
rely on a FIFO as we do with the GET tests. So increase the
delay to 100ms since that seems enough on my system even with
CONFIG_HZ=100.
Add a timeout and backtrace to the $check_self sub to help with
further diagnostics while we're at it, too.
It would be nice if there were a portable syscall tracing
mechanism we could attach to the -httpd process to make the test
more deterministic...
I've observed FreeBSD 11.2 read(2) having one of three
behaviors after a failed write(2) on a socket:
1) returning number of bytes read
2) failing with ECONNRESET
3) returning with EOF
1) is the most common, and I've only seen 1) on Linux. It may
be possible to use SO_LINGER or shutdown(2) to ensure 1) always
happens, but SO_LINGER behavior seems inconsistent across OSes,
especially with non-blocking sockets.
Since these tests are corner-cases where we're dealing with
broken/malicious clients, let's continue spending the least
amount of syscalls protecting ourselves in the daemon and
instead make the client-side test code tolerate more socket
implementations.
Eric Wong [Sat, 11 Apr 2020 10:53:28 +0000 (10:53 +0000)]
dskqxs: ignore EV_SET errors on EVFILT_WRITE
Just like the EPOLL_CTL_ADD emulation path, the EPOLL_CTL_MOD
and EPOLL_CTL_DEL emulation paths can fail if attempting to
install an EVFILT_WRITE for a read-only pipe.
I've only observed this on the EPOLL_CTL_DEL emulation path, but
I suspect it could happen on the EPOLL_CTL_MOD path as well.
Increasing the amount of read-only pipes we rely on with altid
exports via sqlite3 made this old bug more apparent and
reproducible while looping the test suite.
This may be adjusted in the future to deal with write-only
pipes, but we currently don't have any of those watched by
kqueue.
Eric Wong [Tue, 7 Apr 2020 09:35:14 +0000 (04:35 -0500)]
doc: add technical/whyperl
Some people don't like Perl; but it exists, there's no
avoiding it with everything that depends on it. And
nearly all code still works unmodified after 20 years.
Eric Wong [Tue, 7 Apr 2020 21:55:53 +0000 (21:55 +0000)]
tests: document run_mode=1 as not implemented
It was implemented at some point, but it was more things to
support and the worst of both worlds: both unrealistic compared
to real-world use and slower than run_mode=2.
Eric Wong [Mon, 6 Apr 2020 08:32:52 +0000 (08:32 +0000)]
view: do not redundantly obfuscate addresses
We shouldn't rerun the address obfuscator on data we've
already run through. Instead, run through the unescaped
text part and substitute the UTF-8 "\x{2022}" substitution
before it hits HTML escaping