Eric Wong [Sat, 10 Dec 2016 01:09:46 +0000 (01:09 +0000)]
search: favor In-Reply-To over last References iff IRT exists
Some email clients set the References headers backwards, so
trust the In-Reply-To header if (and only if) it exists and
is parseable as direct parent of the current message.
For affected repos, this will require reindexing (via
"public-inbox-index --reindex"), but there will be no
version bump for this bugfix.
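A minimal sketch of the parent-selection rule described above, in Python rather than the project's Perl. The names `MID_RE` and `pick_parent` are hypothetical, and the Message-ID pattern is a simplifying assumption, not the actual indexing code:

```python
import re

# Assumed, simplified Message-ID pattern: anything between one pair of <>.
MID_RE = re.compile(r"<([^<>\s]+)>")

def pick_parent(in_reply_to, references):
    """Return the Message-ID (without <>) to treat as the direct parent."""
    if in_reply_to:
        m = MID_RE.search(in_reply_to)
        if m:
            # In-Reply-To exists and is parseable: trust it.
            return m.group(1)
    # Otherwise fall back to the last entry of References.
    refs = MID_RE.findall(references or "")
    return refs[-1] if refs else None
```

With a backwards References header such as "<b@x> <a@x>" and In-Reply-To set to "<b@x>", this picks "b@x" instead of the last reference.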
Eric Wong [Tue, 6 Dec 2016 23:40:33 +0000 (23:40 +0000)]
linkify: implement Markdown link compatibility (again)
Although unescaped parentheses in URLs are technically allowed,
they are uncommon. However, Markdown-like syntaxes are
unfortunately common for URLs, so we might as well support them.
This fixes parentheses detection at sentence endings, as seen
in practice on emails.
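One way to handle both cases (Markdown-style links and parentheses at sentence endings) is to trim trailing punctuation and any unbalanced closing parenthesis after matching. This is an illustrative heuristic in Python with hypothetical names (`URL_RE`, `trim_url`, `find_urls`); it is not claimed to be what the linkify code does:

```python
import re

# Deliberately loose URL matcher; trimming happens afterwards.
URL_RE = re.compile(r"https?://[^\s<>\"]+")

def trim_url(url):
    """Drop trailing sentence punctuation and unbalanced ')' characters."""
    while url:
        last = url[-1]
        if last in ".,;:!?":
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More closers than openers: the ')' belongs to the prose.
            url = url[:-1]
        else:
            break
    return url

def find_urls(text):
    return [trim_url(m.group(0)) for m in URL_RE.finditer(text)]
```

Balanced parentheses inside the URL survive, while the closing parenthesis of a Markdown link or a parenthetical remark is left out of the link.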
Eric Wong [Tue, 6 Dec 2016 23:01:39 +0000 (23:01 +0000)]
linkify: implement Markdown link compatibility
Although unescaped parentheses in URLs are technically allowed,
they are uncommon. However, Markdown-like syntaxes are
unfortunately common for URLs, so we might as well support them.
Eric Wong [Sat, 3 Dec 2016 00:24:06 +0000 (00:24 +0000)]
atom: switch to getline/close for response bodies
This will let us stream larger Atom document bodies without
wasting too much memory and reduce the amount of round-trip
requests needed to get necessary information.
Hopefully clients are using streaming (SAX) parsers, too.
This is the final transition in the core public-inbox
code to allow migrating to a "pull"-based body streaming
scheme which allows an HTTP server to respond appropriately
to backpressure from slow clients.
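The "pull" model can be sketched as a body object that hands out one chunk per getline call, so the server decides when to ask for more. The following Python sketch only mirrors the getline/close shape; `FeedBody` and `drain` are hypothetical names and the chunk contents are placeholders:

```python
class FeedBody:
    """Response body that produces one chunk per getline() call."""

    def __init__(self, entries):
        self._entries = iter(entries)
        self._state = "head"
        self.closed = False

    def getline(self):
        if self._state == "head":
            self._state = "entries"
            return "<feed>"
        if self._state == "entries":
            entry = next(self._entries, None)
            if entry is not None:
                return "<entry>%s</entry>" % entry
            self._state = "done"
            return "</feed>"
        return None  # end of body

    def close(self):
        self.closed = True

def drain(body):
    """Stand-in for a server loop: pull chunks until the body is exhausted."""
    chunks = []
    try:
        while True:
            chunk = body.getline()
            if chunk is None:
                break
            chunks.append(chunk)
    finally:
        body.close()
    return chunks
```

A real server would call getline only when the client socket is writable, which is where the backpressure handling comes from; only one chunk needs to be in memory at a time.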
Eric Wong [Sat, 3 Dec 2016 00:24:51 +0000 (00:24 +0000)]
searchview: fix <title> tag in Atom feed
This only affects the Atom feed for search results.
"xmlstarlet val" failed to detect or warn about this,
and I only noticed this bug while working on another
patch.
Eric Wong [Tue, 29 Nov 2016 21:40:35 +0000 (21:40 +0000)]
note the source code is AGPL for cloning
This should be adequate warning for folks who may be
uncomfortable or uncertain about even possessing AGPL
source code due to employer agreements and such.
Disclaimer: I remain completely in favor of AGPL and strong
copyleft, and am more than willing to risk my own future on it.
However, I refuse to even nudge people into downloading AGPL
source code if it presents any legal risk to them.
Eric Wong [Fri, 4 Nov 2016 21:11:35 +0000 (21:11 +0000)]
index: allow indexing before configuration
One may build the initial index on a powerful host and transfer
it to a weaker one for incremental indexing. Thus there is
no requirement to have a configured public-inbox for building
the index unless a user needs altid support or some such.
Eric Wong [Wed, 5 Oct 2016 23:47:29 +0000 (23:47 +0000)]
thread: use hash + array instead of hand-rolled linked list
This starts to show noticeable performance improvements when
attempting to thread over 400 messages; but the improvement
may not be measurable with fewer.
However, the resulting code is much shorter and (IMHO)
much easier to understand.
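As a rough illustration of the hash + array idea (in Python, with hypothetical names, and without the ghost containers a full threading algorithm needs), each Message-ID maps to an array of its children instead of being chained through next/child pointers:

```python
def build_threads(msgs):
    """msgs: list of (mid, parent_mid_or_None) pairs.

    Returns (roots, children), where children maps each Message-ID
    to a list of the Message-IDs that reply to it.
    """
    children = {mid: [] for mid, _ in msgs}
    roots = []
    for mid, parent in msgs:
        if parent in children and parent != mid:
            children[parent].append(mid)
        else:
            # No parent, or the parent is not in this set of messages.
            roots.append(mid)
    return roots, children
```

Lookups by Message-ID are a single hash access, and siblings can be sorted with an ordinary array sort.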
Eric Wong [Wed, 5 Oct 2016 23:47:28 +0000 (23:47 +0000)]
thread: fix sorting without topmost
This bug was hidden, and we may not be able to efficiently
implement a topmost subroutine with the hash-based (vs
linked-list) container for threading in the next
commit.
Eric Wong [Fri, 9 Sep 2016 09:05:18 +0000 (09:05 +0000)]
TODO: updates for done items
The existing string -> number date range Xapian query is good
enough, and having too much flexibility is probably bad for
caching (as well as increasing our attack surface, because
parsing queries is tricky).
Tags-as-skiplists are probably not worth the effort given
Xapian, and we may have to import old messages after-the-fact,
anyways, and message delivery for mirrors is never orderly.
Other items are all done and need to be maintained (like the
search engine docs for the mairix-compatibility features that
just got pushed out).
Eric Wong [Fri, 9 Sep 2016 00:01:29 +0000 (00:01 +0000)]
search: avoid mindlessly calling body_set
It's not worth entering a complex codepath in Email::MIME to
save some (probably immeasurable amount of) memory, here. We've
already stopped doing this in our WWW code a while back, too.
If we really cared enough about it, we'd prioritize work on a
streaming replacement for Email::MIME.
Eric Wong [Fri, 9 Sep 2016 00:01:28 +0000 (00:01 +0000)]
search: fix compatibility with Debian wheezy
Specifying the "d:" field only worked for
NumberValueRangeProcessor in older versions of Xapian, such
as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1).
This slipped through since I rarely use wheezy, anymore, and
perhaps nobody else does, either. Perhaps wheezy support may be
dropped, soon.
Unfortunately, this requires a schema version bump.
Eric Wong [Fri, 9 Sep 2016 00:01:27 +0000 (00:01 +0000)]
search: increase term positions for each quoted hunk
We pay a storage cost for storing positional information
in Xapian, so make good use of it by attempting to preserve
it for (hopefully) better search results.
Eric Wong [Fri, 9 Sep 2016 00:01:25 +0000 (00:01 +0000)]
search: fix space regressions from recent changes
As of Xapian 1.0.4 (from 2007), it is possible to use
Search::Xapian::QueryParser::add_prefix multiple times with the
same user field name but different term prefixes.
This brings my current git@vger mirror from 6.5GB to 2.1GB
(both sizes are after xapian-compact).
Eric Wong [Fri, 9 Sep 2016 00:01:24 +0000 (00:01 +0000)]
search: more granular message body searching
"bs:" and "b:" are adapted from mairix(1)
We will also support searching explicitly for quoted vs
non-quoted text via "q:" and "nq:" prefixes since sometimes
readers will not care for quoted text.
In the future, we will support parsing diffs (perhaps when
repobrowse integration is complete).
Note: this roughly doubles the size of the Xapian database due
to the additional information; so this change may not be worth
it.
Eric Wong [Fri, 9 Sep 2016 00:01:23 +0000 (00:01 +0000)]
search: drop longer subject: prefix for search
We only document the "s:" anyways. While the long name is more
descriptive, the ambiguity makes agnostic caching (by Varnish or
similar) slightly harder and longer URLs are more likely to be
accidentally truncated when shared.
Eric Wong [Fri, 9 Sep 2016 00:01:22 +0000 (00:01 +0000)]
search: allow searching user fields (To/Cc/From)
Sometimes it can be useful to search based on who the
message was sent to, sent by, or Cc:-ed. Of course,
headers can be faked, but they usually are not...
Anyways this mostly matches the behavior of mairix(1).
Eric Wong [Thu, 8 Sep 2016 20:15:25 +0000 (20:15 +0000)]
doc: document PERL_INLINE_DIRECTORY usage
For now, we will document this since it allows better
performance without the burden of extensions. Perhaps one day
far in the future Perl can natively support vfork(2) AND that
version of Perl will be widely available, but I suspect that day
is at least a decade away, if not two:
Eric Wong [Thu, 8 Sep 2016 19:44:16 +0000 (19:44 +0000)]
view: handle missing Content-Type in message
Email::MIME internally assumes "text/plain" for messages
missing a Content-Type, but does not expose that in the
Email::MIME::content_type API method. We must assume it
ourselves to avoid uninitialized value warnings for the
rare (nowadays) MUAs which do not set it.
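The same defaulting can be shown with Python's standard email package, where a missing header comes back as None and the caller has to supply "text/plain" itself. `effective_content_type` is a hypothetical helper, not part of public-inbox:

```python
from email import message_from_string

def effective_content_type(raw):
    """Return the lowercased media type, assuming text/plain when absent."""
    msg = message_from_string(raw)
    ct = msg["Content-Type"]  # None when the header is missing
    if ct is None:
        return "text/plain"
    # Drop parameters such as charset and normalize case.
    return ct.split(";", 1)[0].strip().lower()
```

Applying the default explicitly avoids passing an undefined value into later string comparisons.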
Eric Wong [Tue, 23 Aug 2016 21:23:53 +0000 (21:23 +0000)]
www: give tor2web some exposure, too
Not everybody can run Tor; hopefully more can use Tor2web,
even if it compromises their privacy. This should help
make the system more resilient for users unable to use Tor.
Eric Wong [Thu, 18 Aug 2016 04:44:07 +0000 (04:44 +0000)]
www: implement generic help text
Begin documenting some basic help functionality.
I may tweak the anchor names of the various HTML endpoints
to be more consistent with each other (old ones will be
supported for a short while), so I'm not documenting
those, for now.
This may become part of a builtin key-value store for
basic texts, but this probably shouldn't become a wiki
engine, either.
Eric Wong [Thu, 18 Aug 2016 02:02:50 +0000 (02:02 +0000)]
linkify: be stricter about matching RFC 3986
We're not to-the-letter about percent-encoding, but
we should allow all the characters. This is mainly
so we can effectively use the link to some Wikipedia
pages with parentheses in them:
Eric Wong [Thu, 18 Aug 2016 01:10:35 +0000 (01:10 +0000)]
view: try assuming UTF-8 for bogus charsets
For some reason, Alpine will set X-UNKNOWN for valid UTF-8.
Since we favor UTF-8 HTML anyways, try forcing Email::MIME to
handle text/plain as UTF-8 which might show up better.
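A sketch of the fallback in Python, assuming the body is available as raw bytes. `decode_body` is a hypothetical name, and the second return value stands in for the "possibly mangled" warning shown to users:

```python
def decode_body(raw, charset):
    """Decode raw bytes; return (text, guessed) where guessed flags a fallback."""
    try:
        return raw.decode(charset), False
    except LookupError:
        # Unknown charset name such as X-UNKNOWN: assume UTF-8 and
        # replace anything that does not decode cleanly.
        return raw.decode("utf-8", errors="replace"), True
```

Only unknown charset names trigger the fallback here; bytes that are invalid for a known charset still raise, which a caller may want to handle separately.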
Eric Wong [Thu, 18 Aug 2016 00:54:25 +0000 (00:54 +0000)]
view: try to display bogus charsets for text/plain
Alpine seems to set charset=X-UNKNOWN for valid UTF-8 text,
which causes Email::MIME::body_str to fail as X-UNKNOWN
is not a valid encoding. So, blindly display the body
as plain-text but warn users about possibly mangled text.
Reported-by: Thomas Ferris Nicolaisen <tfnico@gmail.com>
Eric Wong [Tue, 16 Aug 2016 08:49:26 +0000 (08:49 +0000)]
search: add YYYYMMDD search range via "d:" prefix
This is similar to mairix in that it uses a "d:" prefix; but
only takes YYYYMMDD, for now. Using custom date/time parsers
via Perl will be much more work:
Eric Wong [Tue, 16 Aug 2016 08:49:25 +0000 (08:49 +0000)]
search: drop pointless range processors for Unix timestamp
The Unix timestamp isn't meaningful for users searching;
we will start indexing the YYYYMMDD date stamp, which may
use StringValueRangeProcessor instead.
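A string range works for YYYYMMDD because fixed-width, zero-padded dates sort the same way lexically as chronologically. The Python sketch below illustrates that property; the ".." range separator and the helper names are assumptions made for the example:

```python
import re

YMD = re.compile(r"\d{8}")

def parse_range(arg):
    """Parse 'YYYYMMDD..YYYYMMDD'; either end may be left empty."""
    lo, sep, hi = arg.partition("..")
    if not sep:
        raise ValueError("expected a '..' range")
    for value in (lo, hi):
        if value and not YMD.fullmatch(value):
            raise ValueError("expected YYYYMMDD, got %r" % value)
    return lo or None, hi or None

def in_range(ymd, lo, hi):
    # Plain string comparison is enough for fixed-width date stamps.
    return (lo is None or lo <= ymd) and (hi is None or ymd <= hi)
```

No date arithmetic or timezone handling is needed at query time, since the comparison is purely on the stored string.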
Eric Wong [Sun, 14 Aug 2016 10:21:10 +0000 (10:21 +0000)]
www: do not double-clean Message-IDs from internal DBs
Ensure we usually strip one level of '<>' from Message-IDs,
since our internal SQLite, Xapian, and SHA-1 storage all
assume that.
Realistically, we screw up if somebody has '<<' or '>>',
but those are screwed up mail clients and we can deal with
it another time. Currently, this means some messages with
'>>' in References or Message-Id are not handled correctly,
yet, but we match the behavior of Mail::Thread in keeping
the extra '>'.
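To illustrate why double-cleaning matters, here is a simplified Python sketch that strips exactly one enclosing pair. `strip_one_level` is a hypothetical name and does not reproduce the project's handling of malformed IDs:

```python
def strip_one_level(mid):
    """Remove a single enclosing <...> pair, if present."""
    mid = mid.strip()
    if mid.startswith("<") and mid.endswith(">"):
        return mid[1:-1]
    return mid
```

Applying it twice to an ID that legitimately arrived as "<<a@b>>" yields a different key than applying it once, so values read back from storage that is already cleaned should not be cleaned again.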
Eric Wong [Sun, 14 Aug 2016 10:21:09 +0000 (10:21 +0000)]
www: do not unnecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&',
"'", '(', ')', '*', '+', ',', ';', '=' are all allowed
in path-absolute where we have the Message-ID.
In any case, it seems '@' is fairly common in path components
nowadays and too common in Message-IDs.
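In Python terms, the same policy amounts to passing those characters as "safe" to the standard percent-encoder. `mid_to_path` is a hypothetical helper used only for illustration:

```python
from urllib.parse import quote

# Characters RFC 3986 allows unescaped in a path segment, beyond the
# unreserved set that quote() never escapes.
SAFE = "@:!$&'()*+,;="

def mid_to_path(mid):
    """Percent-encode a Message-ID for use as one path component."""
    return quote(mid, safe=SAFE)
```

The slash is intentionally left out of the safe set, since an unescaped '/' would split the Message-ID across path components.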
Eric Wong [Sun, 14 Aug 2016 10:21:17 +0000 (10:21 +0000)]
www: ensure XML validity for some odd ASCII chars
I've seen 0x1b (\e) in at least one message and some other
possibly non-printable chars. In any case, make sure they're
valid XML with us-ascii encoding, at least as far as
xmlstarlet(1) is concerned.
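XML 1.0 forbids most C0 control characters (everything below 0x20 except tab, newline and carriage return), so one approach is to substitute them before emitting the document. The Python sketch below uses "?" as an assumed placeholder; the replacement actually chosen by the project is not stated here:

```python
import re

# C0 controls that are not legal in XML 1.0 text (tab, LF and CR are legal).
_INVALID_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def xml_safe(text):
    """Replace control characters that would make an XML document invalid."""
    return _INVALID_XML.sub("?", text)
```

Markup characters such as '<' and '&' still need the usual escaping; this only covers the control characters.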
Eric Wong [Sat, 13 Aug 2016 00:22:01 +0000 (00:22 +0000)]
extmsg: reorder and add more Message-ID lookup services
gmane is down at the moment, so lower that in priority
(hopefully it will be brought back up, again). Wikipedia also
lists a few more project-specific list providers, so include
those as well: https://en.wikipedia.org/wiki/Message-ID
Eric Wong [Fri, 12 Aug 2016 19:52:35 +0000 (19:52 +0000)]
www: allow including links to NNTP sites in HTML footer
Improve the discoverability of NNTP endpoints for users
who still know what NNTP is.
==> ~/.public-inbox/config <==
; aliases for the locally-run nntpd can be specified in
; the "publicinbox" section:
[publicinbox]
nntpserver = nntp://ou63pmih66umazou.onion/
nntpserver = news.public-inbox.org
; NNTPS is not supported natively, yet,
; but one can use haproxy or similar
; nntpserver = nntps://news.public-inbox.invalid/
; mirrors for specific inboxes may be specified either as full
; NNTP (or NNTPS) URLs, or with the server name only if the
; newsgroup name is specified for a local NNTP server
[publicinbox "git"]
...
newsgroup = inbox.a.b.c
nntpmirror = nntp://czquwvybam4bgbro.onion/
nntpmirror = hjrcffqmbrq6wope.onion
; there may be a mirror on a different server with a
; different name:
nntpmirror = nntp://news.example.com/differently.named.group
; (And I really need to write manpages for all this...)
Eric Wong [Thu, 11 Aug 2016 00:23:48 +0000 (00:23 +0000)]
search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified
by serial number (such as NNTP article numbers in gmane). Those
links may become inaccessible (as is the current case for
gmane), so ensure users can still search based on old serial
numbers.
Now, I run the following periodically to get article numbers
from gmane (while news.gmane.org remains):
; relative pathnames expand to $mainrepo/public-inbox/$file
altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git"
periodically.
This ought to allow searching for "gmane:12345" to work for
Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article
serial numbers, use of those for public links is discouraged
since it encourages centralization.
Eric Wong [Tue, 9 Aug 2016 23:59:10 +0000 (23:59 +0000)]
searchidx: allow searching Message-IDs in free-form text
It is not unheard of for users to attempt finding messages by
entering Message-IDs into the "Search" box instead of using the
existing URL structure. So make it possible for them.
Fwiw, I've definitely encountered users who enter entire URLs
into generic search engines.
Eric Wong [Tue, 9 Aug 2016 00:41:37 +0000 (00:41 +0000)]
searchidx: avoid holding Xapian lock in cat-file
We must ensure the cat-file process is launched before Xapian
grabs the lock, too. Our use of "git cat-file --batch" has
the same problem as "git log" did (which was fixed in
commit 3713c727cda431a0dc2865a7878c13ecf9f21851,
"searchidx: release Xapian FDs before spawning git log").
Eric Wong [Sat, 6 Aug 2016 01:58:47 +0000 (01:58 +0000)]
mbox: be fair to other HTTP clients
At least for public-inbox-httpd, this allows us to avoid having
a client monopolize one event loop tick of the server for too
long. It hurts throughput for the /all.mbox.gz endpoint, but I
doubt anybody cares and the latency improvement for other
clients would be appreciated.
We already do the same fairness thing for HTML pages.