Eric Wong [Sat, 22 Aug 2015 11:41:22 +0000 (11:41 +0000)]
search: consistently pass options and flags
Most of our special query functions require exact matches, so none
of the flags we normally use are necessary for query parsing.
Eric Wong [Sat, 22 Aug 2015 11:41:21 +0000 (11:41 +0000)]
view: reference total followups
In case there are huge threads, readers should know about them
even though we currently lack the navigation to display them.
Eric Wong [Sat, 22 Aug 2015 11:41:20 +0000 (11:41 +0000)]
view: misc cleanups and simplifications
Less code should be easier-to-read.
Eric Wong [Sat, 22 Aug 2015 11:41:19 +0000 (11:41 +0000)]
search: split search indexing to a separate file
This makes organization easier and reduces the amount of code
loaded for a PSGI, mod_perl or CGI instance.
Eric Wong [Sat, 22 Aug 2015 08:07:57 +0000 (08:07 +0000)]
view: prevent 'once' warnings for sub ref
Perl seems to warn incorrectly for this; work around it.
Eric Wong [Sat, 22 Aug 2015 08:00:37 +0000 (08:00 +0000)]
remove XML::Atom::SimpleFeed dependency
We will attempt to generate Atom feeds "by hand" as the
XML::Atom::SimpleFeed API does not support streaming output.
Since email is large and servers are small, this should prevent
wasting memory when we generate larger feeds.
Of course, we hope clients use SAX parsers capable of handling
large streams without slurping.
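
A minimal sketch of what generating a feed "by hand" could look like, written in Python rather than the project's Perl. The function name `atom_chunks` and the entry fields are hypothetical, and the output omits elements a complete Atom feed requires (such as `updated`); the point is only that the feed is yielded piece by piece instead of being built as one large string.

```python
from xml.sax.saxutils import escape

def atom_chunks(title, entries):
    """Yield an Atom-like feed in small pieces (hypothetical helper)."""
    yield '<?xml version="1.0" encoding="utf-8"?>\n'
    yield '<feed xmlns="http://www.w3.org/2005/Atom">\n'
    yield '<title>%s</title>\n' % escape(title)
    for e in entries:  # entries may be a lazy iterator; one entry in memory at a time
        yield '<entry><id>%s</id><title>%s</title></entry>\n' % (
            escape(e['id']), escape(e['title']))
    yield '</feed>\n'
```

A server that accepts an iterable response body can write each chunk out as it is produced.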
Eric Wong [Sat, 22 Aug 2015 05:06:57 +0000 (05:06 +0000)]
www: enable and expand preload from mod_perl2
Hopefully this saves us some memory with CoW on *nix.
Eric Wong [Sat, 22 Aug 2015 05:06:56 +0000 (05:06 +0000)]
INSTALL: document IO::Compress::Gzip dependency
Otherwise folks won't get downloadable mboxes
Eric Wong [Sat, 22 Aug 2015 05:06:55 +0000 (05:06 +0000)]
cgi: remove static file generation support for now
We may not support this after all, CGI.pm is already
legacy-enough and far more powerful.
Eric Wong [Sat, 22 Aug 2015 00:06:45 +0000 (00:06 +0000)]
stream HTML views as much as possible
This should allow progressive rendering on the client and reduce
memory usage on the server. Unfortunately XML::Atom::SimpleFeed
does not yet support streaming, so we may not use it in the
future.
Eric Wong [Fri, 21 Aug 2015 23:43:12 +0000 (23:43 +0000)]
search: s/count/total/ for results
This is hopefully less ambiguous, as the word "count" confused
me, too.
Eric Wong [Fri, 21 Aug 2015 23:34:29 +0000 (23:34 +0000)]
mbox: drop unnecessary imports
These are no longer necessary.
Eric Wong [Fri, 21 Aug 2015 21:42:23 +0000 (21:42 +0000)]
switch to gzipped mboxes
Mboxes may be huge, so only support downloading gzipped mboxes
to save bandwidth and to get free checksumming.
Streaming output means we should not be wasting too much memory
on this unless the chosen server sucks.
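
As an illustration only (Python, not the project's Perl code), streaming gzip output might be sketched as below. The name `gzip_stream` is hypothetical. The gzip container carries a CRC32 trailer, which is where the "free checksumming" comes from.

```python
import zlib

def gzip_stream(chunks, level=6):
    """Compress an iterable of bytes into a gzip stream, chunk by chunk."""
    # wbits=31 selects the gzip container (header plus CRC32 trailer)
    z = zlib.compressobj(level, zlib.DEFLATED, 31)
    for chunk in chunks:
        out = z.compress(chunk)
        if out:  # the compressor may buffer and return nothing yet
            yield out
    yield z.flush()  # emit remaining data and the trailer
```

Only the compressor state and the current chunk are held in memory, regardless of how large the mbox is.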
Eric Wong [Fri, 21 Aug 2015 21:42:22 +0000 (21:42 +0000)]
mbox: stream entire thread, regardless of size
Since mbox is usually downloaded, support fetching infinitely large
responses via streaming.
Eric Wong [Fri, 21 Aug 2015 01:29:04 +0000 (01:29 +0000)]
support dumping thread as an mbox
Some folks may not want to download and install Perl code like
ssoma, so allow downloading an mbox containing the entire
thread.
Eric Wong [Fri, 21 Aug 2015 01:29:03 +0000 (01:29 +0000)]
view: "next" link in thread view goes to next Subject line
It's a bit disconcerting to jump to the authorship line.
Eric Wong [Fri, 21 Aug 2015 01:29:02 +0000 (01:29 +0000)]
view: cleanup and reduce duplication
This also avoids incorrectly incrementing $part_nr when
we skip a part due to bad Content-Type.
Eric Wong [Thu, 20 Aug 2015 19:15:18 +0000 (19:15 +0000)]
feed: fix extra, unnecessary quote
Oops!
Eric Wong [Thu, 20 Aug 2015 10:17:34 +0000 (10:17 +0000)]
search: preserve References: order in document data
We need proper ordering of References to thread messages
correctly. We would lose this order if we load the terms
from the database, so set it directly in the document data.
Do not bother with a separate In-Reply-To, since Mail::Thread
just merges the IRT into References. This bumps our schema
version once again.
Eric Wong [Thu, 20 Aug 2015 08:54:32 +0000 (08:54 +0000)]
avoid using header_raw for Message-ID retrieval
This is for consistency with ssoma. I doubt it makes
a difference in practice, but in case somebody decides
any of the Message-ID-containing headers should have
strange characters, we'll decode and attempt to thread
them. This isn't an attack vector, just a way to
make messages thread improperly which is pointless...
Eric Wong [Thu, 20 Aug 2015 08:51:51 +0000 (08:51 +0000)]
view: simplify message threading dumpers
Eric Wong [Thu, 20 Aug 2015 06:44:39 +0000 (06:44 +0000)]
dead code cleanup
We may not be using subject_path after all.
Eric Wong [Thu, 20 Aug 2015 06:23:27 +0000 (06:23 +0000)]
www: remove useless no-op assignment statement
Oops
Eric Wong [Thu, 20 Aug 2015 04:15:31 +0000 (04:15 +0000)]
misc documentation updates
Threading in Xapian is mostly supported by now; so start
documenting things.
Eric Wong [Thu, 20 Aug 2015 04:01:59 +0000 (04:01 +0000)]
replace references to lynx
Table rendering in lynx is crap compared to w3m and links.
However, we still use it for filtering HTML since the renderer
is otherwise nice...
Eric Wong [Tue, 18 Aug 2015 06:23:06 +0000 (06:23 +0000)]
search: index_sync allows specifying alternate HEAD
This should allow us to sync the index to a temporary head
to update the Xapian index before we update the real HEAD
index.
Eric Wong [Thu, 20 Aug 2015 02:51:28 +0000 (02:51 +0000)]
view: do not fold top-level messages in thread
This hopefully reduces clicking. We may drop folding entirely
since we can use Xapian to make searching easier.
Eric Wong [Thu, 20 Aug 2015 02:43:20 +0000 (02:43 +0000)]
index: layout fix + title and Atom feed links at top
Add some spacing between topics to improve readability when
scanning or in case a subject gets too long.
The title and Atom feed may not be highly-visible otherwise.
While we're at it, use the proper "Atom feed" terminology since
some folks may not understand just what "atom" means.
Eric Wong [Thu, 20 Aug 2015 02:32:29 +0000 (02:32 +0000)]
search: bump schema version to 5 for subject_path
In "index: simplify main landing page if search-enabled",
subject normalization went a little farther to drop trailing
'.' characters, so we will need to re-index.
Eric Wong [Thu, 20 Aug 2015 02:30:32 +0000 (02:30 +0000)]
view: reduce memory usage when displaying large threads
We want to minimize the time any large objects or strings
are referenced. We can do threading entirely from the
mini_mime-generated messages and lazily load full messages
when rendering the display.
Eric Wong [Thu, 20 Aug 2015 02:30:31 +0000 (02:30 +0000)]
search: reject ghosts in all cases
We do not need ghost messages in any of our thread views
Eric Wong [Thu, 20 Aug 2015 02:30:30 +0000 (02:30 +0000)]
search: avoid needless decode
Email::MIME should handle everything for us and make things
work nicely with Xapian (assuming I understand how encoding
works in Perl).
While we're at it, reduce temporary strings and arrays by
using destructive operations and clobbering parts as we
iterate through them.
Eric Wong [Thu, 20 Aug 2015 02:30:29 +0000 (02:30 +0000)]
index: simplify main landing page if search-enabled
We can display /t/$MESSAGE_ID.html easily with a Xapian search
index, so rely on it instead of trying to display messages inline.
Eric Wong [Thu, 20 Aug 2015 02:30:28 +0000 (02:30 +0000)]
view: avoid nesting <a> tags from auto-linkification
It is wrong HTML to have <a> tags nested due to auto-linkification.
Eric Wong [Thu, 20 Aug 2015 02:30:27 +0000 (02:30 +0000)]
use tables for rendering comment nesting
This is more space efficient since we don't need to place padding
bytes in front of every line. While this unfortunately does not
render well on lynx; w3m, links, elinks can all render tables
sanely.
Tables are also superior for long lines which require wrapping
inside <pre> containers.
Eric Wong [Thu, 20 Aug 2015 02:30:26 +0000 (02:30 +0000)]
feed: move timestamp parsing to view
We don't need to duplicate logic across both files.
Eric Wong [Thu, 20 Aug 2015 02:30:25 +0000 (02:30 +0000)]
feed: remove threading from index
We'll be making the index smarter for people with search
support enabled. Otherwise, it'll be chronological and
a bit dumb. At least it'll use less memory.
Eric Wong [Wed, 19 Aug 2015 19:46:22 +0000 (19:46 +0000)]
www: redirect /f/$MESSAGE_ID.txt links to /m/$MESSAGE_ID.txt
Some people (e.g. myself :p) may try to guess URLs and hit a
404. Redirect to the /m/ version.
Note: we prefer to redirect to canonical URLs to improve
caching.
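
A rough sketch of the redirect rule in Python; the function name `canonical_redirect` is hypothetical and the 301 status code is an assumption, since the commit message does not say which redirect status is used.

```python
import re

# Matches /f/$MESSAGE_ID.txt and captures the Message-ID part.
_F_TXT = re.compile(r'\A/f/(.+)\.txt\Z')

def canonical_redirect(path):
    """Return (status, location) for guessed /f/*.txt URLs, else None."""
    m = _F_TXT.match(path)
    if m:
        return (301, '/m/%s.txt' % m.group(1))  # 301 is assumed, not confirmed
    return None
```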
Eric Wong [Wed, 19 Aug 2015 19:36:11 +0000 (19:36 +0000)]
view: return empty string to avoid undefined values
Sometimes we have filter bugs and let HTML slip through...
Eric Wong [Wed, 19 Aug 2015 19:31:08 +0000 (19:31 +0000)]
view: fix spacing on missing ghosts
We must not prematurely indent if we have no message header to
display.
Eric Wong [Tue, 18 Aug 2015 03:17:17 +0000 (03:17 +0000)]
view: close anchor tag correctly before starting another
Noticed by tidy
Eric Wong [Tue, 18 Aug 2015 03:17:16 +0000 (03:17 +0000)]
public-inbox-index: exit with usage if not given an arg
I often forget how to use this myself :x
Eric Wong [Tue, 18 Aug 2015 02:05:32 +0000 (02:05 +0000)]
thread: another workaround for a Mail::Thread bug
Yay for monkey patching!
ref: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=795913
ref: https://rt.cpan.org/Ticket/Display.html?id=106498
Eric Wong [Tue, 18 Aug 2015 01:13:03 +0000 (01:13 +0000)]
search: bump SCHEMA_VERSION to 4
The following two commits affect indexing behavior, so
change the schema version to avoid compatibility problems
or missing messages:
search: common Subject: normalization for Re: prefixes
search: avoid creating ghosts for circular References
Eric Wong [Tue, 18 Aug 2015 01:11:06 +0000 (01:11 +0000)]
search: expose $PublicInbox::Search::LANG variable
This makes it easier to reconfigure for non-English users
Eric Wong [Tue, 18 Aug 2015 01:11:05 +0000 (01:11 +0000)]
search: common Subject: normalization for Re: prefixes
Drop German ("Aw:") support since it's non-standard and
is not supported by Mail::Thread and non-English prefixes
are more likely to conflict with prefixes used in Free Software
development where ("subsection:") prefixes are common and English is the
common language.
Anyways we don't filter "Vs: " (Finnish) or "Sv: "
(Norwegian, Swedish, Danish, Icelandic), either.
ref:
https://en.wikipedia.org/wiki/RE_(e-mail)#Abbreviations_in_other_languages
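
For illustration, a Python sketch of English-only prefix stripping along the lines described above. The name `normalize_subject` and the exact regexp are assumptions, not the project's code.

```python
import re

# Matches one or more leading "Re:" prefixes in any letter case.
_RE_PREFIX = re.compile(r'\A\s*(?:re:\s*)+', re.IGNORECASE)

def normalize_subject(subject):
    """Strip leading "Re:" prefixes only; "Aw:", "Sv:", "Vs:" are left alone."""
    return _RE_PREFIX.sub('', subject).strip()
```

A "subsection:" prefix such as "search: fix bug" passes through unchanged, which is the conflict the commit wants to avoid.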
Eric Wong [Tue, 18 Aug 2015 01:11:04 +0000 (01:11 +0000)]
search: avoid creating ghosts for circular References
Some mail software incorrectly creates circular references
and causes us to create ghosts before the actual mail doc
is created.
Eric Wong [Tue, 18 Aug 2015 01:08:28 +0000 (01:08 +0000)]
view: cleaner Message-ID filtering for References
Avoid compiling a weird and potentially fragile regexp every
time and use the same logic as our search module to dedupe
References.
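
One way to sketch order-preserving deduplication of References in Python. The helper name `unique_mids` and the simple `<...>` pattern are assumptions for illustration; the project's actual logic lives in its Perl search module.

```python
import re

_MID = re.compile(r'<([^>]+)>')  # compiled once at import time

def unique_mids(references):
    """Return Message-IDs from a References: header; first occurrence wins."""
    seen = set()
    out = []
    for mid in _MID.findall(references):
        if mid not in seen:
            seen.add(mid)
            out.append(mid)
    return out
```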
Eric Wong [Mon, 17 Aug 2015 20:15:31 +0000 (20:15 +0000)]
view: do not recompress already-compressed MID for anchors
This is merely for display, so on the off chance somebody does
send a 40-byte MID with nothing but hexadecimal characters,
the worst that could happen is we repeat an anchor name in the
rendered HTML. This has no impact on git archival or Xapian
indexing.
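
A sketch of the idea in Python, assuming the compressed form is a 40-character hexadecimal SHA-1 digest (the commit message implies 40 hex characters but does not name the hash). The name `anchor_id` is hypothetical.

```python
import hashlib
import re

_HEX40 = re.compile(r'\A[0-9a-f]{40}\Z')

def anchor_id(mid):
    """Return an anchor-safe token for a Message-ID (assumed scheme: SHA-1 hex)."""
    if _HEX40.match(mid):
        return mid  # looks compressed already; do not hash it again
    return hashlib.sha1(mid.encode('utf-8')).hexdigest()
```

A genuine Message-ID that happens to be 40 hex characters is passed through unchanged, which is the harmless collision case the message describes.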
Eric Wong [Mon, 17 Aug 2015 16:49:31 +0000 (16:49 +0000)]
search: simplify indexing operation
There's no need to make a transaction for each message when doing
incremental indexing against a git repository. While we're at it,
simplify the interface for callers, too, and do not auto-create
the Xapian database if it was not explicitly enabled.
Eric Wong [Mon, 17 Aug 2015 08:19:44 +0000 (08:19 +0000)]
public-inbox-{learn,mda}: automatically sync index
We'll ignore errors, for now, but should eventually warn or
log. And yes, this is a dirty, dirty hack but we'll fix this
ASAP tomorrow.
Eric Wong [Mon, 17 Aug 2015 08:05:03 +0000 (08:05 +0000)]
view: always compress Message-IDs for anchors
Valid URLs do not make valid anchor ids.
Eric Wong [Mon, 17 Aug 2015 07:56:39 +0000 (07:56 +0000)]
search: bump schema version for '%' compression change
commit
0fea7793b22efd2596983283947ee43687e0cfac
("mid: compress Message-IDs with '%' in them")
requires re-indexing of repositories with '%' in Message-IDs :<
Eric Wong [Mon, 17 Aug 2015 07:46:54 +0000 (07:46 +0000)]
mid: compress Message-IDs with '%' in them
Some HTTP servers (apache2 2.2.22-13+deb7u5) on my system
apparently do not handle "%25" correctly. I'm not yet sure if
it's something weird with my rewrite rules or what....
Eric Wong [Mon, 17 Aug 2015 03:20:44 +0000 (03:20 +0000)]
search: apply mid_compression to subject paths, too
Otherwise we'll be wasting space in our index for long
subjects.
Eric Wong [Mon, 17 Aug 2015 02:41:18 +0000 (02:41 +0000)]
drop bodies and messages ASAP after processing
We can rely on reference counting to lower memory usage for
big messages.
Eric Wong [Mon, 17 Aug 2015 02:41:16 +0000 (02:41 +0000)]
feed: disable the generator statement
No need to waste bandwidth, here
Eric Wong [Mon, 17 Aug 2015 02:41:14 +0000 (02:41 +0000)]
search: use raw headers without MIME decoding
This should be less error-prone in case somebody tries to screw with
us and our thread_id mechanism or somehow waste our resources.
Unfortunately Mail::Thread isn't smart enough for this, yet, so we
may need to downgrade to Email::Simple objects as a workaround.
Or simply not worry about the display so much if somebody is
intentionally trying to make it thread badly/incorrectly.
Eric Wong [Mon, 17 Aug 2015 02:41:13 +0000 (02:41 +0000)]
terminology: replies => followups
Replies are only direct replies, but followups could be any message
further down the thread. The latter is more useful.
Eric Wong [Mon, 17 Aug 2015 02:41:12 +0000 (02:41 +0000)]
www: simplify parameter passing to feed
No need to create a new hash when we can reuse the existing one
more.
Eric Wong [Mon, 17 Aug 2015 02:41:11 +0000 (02:41 +0000)]
WWW: eliminate "top" parameter for feeds
This parameter hasn't been used since
commit
5adf8d639e9b5abd4cbac975d70ddc0fb76541fc
("feed: dead code elimination around dropped endpoints")
Eric Wong [Mon, 17 Aug 2015 02:41:10 +0000 (02:41 +0000)]
favor /t/ to /s/, since subjects may change mid-thread
/t/ always falls back to subject path searching anyways,
so there's little lost besides perhaps more readable URLs.
Unfortunately people still use non-compliant mail clients which fail
to set In-Reply-To or References headers :<
Eric Wong [Mon, 17 Aug 2015 02:41:09 +0000 (02:41 +0000)]
feed: remove unnecessary time parameter in index state
We no longer do "smart" time displays as of
commit
ea0e8649f90d1fd0850a41c0ca16642faadf4f14
("view: simplify timestamp generation").
In retrospect, that commit also made us more cache-friendly, too.
Eric Wong [Mon, 17 Aug 2015 02:41:06 +0000 (02:41 +0000)]
skip search test if search support is missing
We will not require Search::Xapian to be installed.
Eric Wong [Mon, 17 Aug 2015 03:11:43 +0000 (03:11 +0000)]
Merge remote-tracking branch 'origin/search'
* origin/search:
view: deduplicate common code for loading search results
SearchMsg: ensure metadata for ghost messages mid
implement /s/$SUBJECT_PATH.html lookups
search: remove unnecessary xpfx export
www: /t/$MESSAGE_ID.html for threads
view: hoist out index_walk function
view: reply threading adjustment
thread: common sorting code
view: display replies in per-message view
search: make search results more OO
extract redundant Message-ID handling code
search: implement index_sync to fixup indexer
initial search backend implementation
Eric Wong [Sun, 16 Aug 2015 20:51:05 +0000 (20:51 +0000)]
view: kill leading empty lines correctly
Was too sleepy to be coding last night :x
Eric Wong [Sun, 16 Aug 2015 09:12:24 +0000 (09:12 +0000)]
view: cleaner killing of leading/trailing whitespace
No point in wasting bytes even if it gets compressed over
the wire, it'll use more memory when rendering on the
client.
Eric Wong [Sun, 16 Aug 2015 01:42:13 +0000 (01:42 +0000)]
view: hoist out index_walk function
We will reuse it for thread views when powered by Xapian.
Eric Wong [Sun, 16 Aug 2015 08:53:41 +0000 (08:53 +0000)]
view: deduplicate common code for loading search results
More to come later.
Eric Wong [Sun, 16 Aug 2015 08:32:18 +0000 (08:32 +0000)]
SearchMsg: ensure metadata for ghost messages mid
Ghosts have no document data in them.
Perhaps we should just rely on terms for Message-ID
and avoid storing that in the document data...
Eric Wong [Sun, 16 Aug 2015 08:14:40 +0000 (08:14 +0000)]
implement /s/$SUBJECT_PATH.html lookups
Quick-and-dirty wiring up of Subject: paths.
This may prove more memorizable and easier-to-share than
/t/$MESSAGE_ID.html links, but less strict.
This changes our schema version to 1, since we now
use lower-case subject paths.
Eric Wong [Sun, 16 Aug 2015 07:25:11 +0000 (07:25 +0000)]
search: remove unnecessary xpfx export
SearchMsg calls it with the full module path anyways.
Eric Wong [Sun, 16 Aug 2015 02:17:14 +0000 (02:17 +0000)]
www: /t/$MESSAGE_ID.html for threads
This should bring up nearly the entire thread a given
Message-ID is linked to.
Eric Wong [Sun, 16 Aug 2015 01:42:13 +0000 (01:42 +0000)]
view: hoist out index_walk function
We will reuse it for thread views when powered by Xapian.
Eric Wong [Sat, 15 Aug 2015 23:57:39 +0000 (23:57 +0000)]
view: reply threading adjustment
Give changes in subject their own line to reduce line wrapping,
but avoid showing any redundant subjects by maintaining a hash
of subjects already displayed.
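
The "hash of subjects already displayed" can be sketched in Python as follows. The function name `subject_lines` and the `(message_id, subject)` input shape are hypothetical.

```python
def subject_lines(messages):
    """Yield (subject_or_None, message_id); a subject is shown only once."""
    seen = set()
    for mid, subject in messages:
        if subject in seen:
            yield None, mid  # redundant subject, suppress it
        else:
            seen.add(subject)
            yield subject, mid  # new or changed subject gets its own line
```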
Eric Wong [Sat, 15 Aug 2015 23:41:21 +0000 (23:41 +0000)]
thread: common sorting code
We'll be sharing the same threading, so it makes sense to sort
replies using the same code and message headers without repeating
ourselves.
This also standardizes on sorting on X-PI-TS (Unix epoch in seconds)
instead of using X-PI-Date differently in two different places.
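
As a small Python illustration of sorting on an epoch-seconds header: the dict-based message shape and the name `sort_by_ts` are assumptions; only the X-PI-TS header name comes from the commit message.

```python
def sort_by_ts(messages):
    """Sort messages by X-PI-TS, compared as integers rather than strings."""
    return sorted(messages, key=lambda m: int(m['X-PI-TS']))
```

Comparing integers avoids the string-ordering pitfall where "10" would sort before "9".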
Eric Wong [Sat, 15 Aug 2015 09:28:34 +0000 (09:28 +0000)]
view: display replies in per-message view
This can be used to quickly scan for replies in a message without
displaying an entire thread.
Eric Wong [Sat, 15 Aug 2015 09:28:33 +0000 (09:28 +0000)]
search: make search results more OO
This will relieve callers of the need to decode the data
we store internally in Xapian
Eric Wong [Sat, 15 Aug 2015 09:28:32 +0000 (09:28 +0000)]
extract redundant Message-ID handling code
Quit repeating ourselves and use a common MID module
instead.
Eric Wong [Sat, 15 Aug 2015 09:28:31 +0000 (09:28 +0000)]
search: implement index_sync to fixup indexer
We need to make the indexer executable and installable
while we're at it.
Eric Wong [Thu, 13 Aug 2015 02:32:22 +0000 (02:32 +0000)]
initial search backend implementation
This shall allow us to search for replies/threads more easily.
Eric Wong [Wed, 12 Aug 2015 22:41:10 +0000 (22:41 +0000)]
view: consistent ordering of Cc: addresses
This fixes a minor test failure in t/cgi.t
Tested with perl 5.18.2-2ubuntu1 on Ubuntu 14.04.3 LTS
Eric Wong [Wed, 5 Aug 2015 23:36:42 +0000 (23:36 +0000)]
view: remove unused $enc_mime Encoding object
Unneeded since commit
e022d3377fd2c50fd9931bf96394728958a90bf3
("huge refactor of encoding handling")
Eric Wong [Wed, 5 Aug 2015 23:29:34 +0000 (23:29 +0000)]
view: pass fallback encoding to enc_for
This fixes the fallback to message encoding if the message
itself was not UTF-8
Eric Wong [Sun, 2 Aug 2015 06:35:57 +0000 (06:35 +0000)]
public-inbox-learn: preserve headers for ham injection
We must inject headers properly for injecting ham, otherwise
List-Id headers get dropped.
Eric Wong [Wed, 29 Jul 2015 18:09:41 +0000 (18:09 +0000)]
view: simplify timestamp generation
It seems less ambiguous to parse a consistent format in quiet lists
where messages are sparse.
Eric Wong [Mon, 20 Jul 2015 21:53:14 +0000 (21:53 +0000)]
feed: extract subroutines for threading
We'll be using this in the future for displaying per-thread
views.
Eric Wong [Tue, 14 Jul 2015 21:09:50 +0000 (21:09 +0000)]
scripts/dc-dlvr.pre: ensure stderr gets back to the MTA
We want to be able to reject errors back to the MTA.
Eric Wong [Tue, 14 Jul 2015 21:01:18 +0000 (21:01 +0000)]
reject HTML loudly and automatically
This should hopefully reduce the delay between when a user fails
to send plain-text to when an admin such as myself notices the
HTML mail in a sea of spam.
Unfortunately, this can lead to backscatter, so avoid doing it
until it's passed through spamc, at least.
Eric Wong [Mon, 6 Jul 2015 21:22:22 +0000 (21:22 +0000)]
feed: compile regexps only once
This avoids some runtime penalties associated with recompiling
a regexp based on a constant local variable.
Eric Wong [Mon, 6 Jul 2015 20:52:33 +0000 (20:52 +0000)]
view: reduce empty <a>, use "id" instead of "name" attributes
This is probably more compliant, and saves us a few bytes
on the uncompressed HTML.
Eric Wong [Mon, 6 Jul 2015 20:11:29 +0000 (20:11 +0000)]
feed: close body tag correctly in index
Oops, noticed by manual inspection. One day we'll run tidy in tests
to validate.
Eric Wong [Fri, 5 Jun 2015 17:45:26 +0000 (17:45 +0000)]
public-inbox-mda: preserve SpamAssassin headers in spam
We want to be able to prioritize spam downstream to check for
borderline cases.
Eric Wong [Wed, 4 Mar 2015 20:50:34 +0000 (20:50 +0000)]
view: fix linkification and quote-folding conflicts
We can't add newlines to links, unfortunately, because
quote-folding is line-based and (being regexp-based) needs
to happen after linkification.
Eric Wong [Mon, 9 Feb 2015 22:33:50 +0000 (22:33 +0000)]
view: generate links for common protocols in browsers
SpamAssassin queries URI blacklists, so it's probably OK
to enable this without being used as a linkfarm.
Eric Wong [Mon, 29 Dec 2014 19:25:50 +0000 (19:25 +0000)]
doc/design_www: remove item for auto-generated links
SpamAssassin queries URI blacklists, so it's probably OK
to start generating links in the future...
Eric Wong [Mon, 12 Jan 2015 01:16:04 +0000 (01:16 +0000)]
import_slrnspool: fork a process for each message
This prevents process growth when importing large messages.
Memory growth could be due to the sliding sbrk window in glibc malloc
or a circular reference in the Email::* Perl code somewhere.
Eric Wong [Sun, 11 Jan 2015 23:58:55 +0000 (23:58 +0000)]
import_slrnspool: load private config key
PublicInbox::Config->lookup won't return unknown keys
Eric Wong [Sun, 11 Jan 2015 23:55:27 +0000 (23:55 +0000)]
import_slrnspool: graceful exit for interruptibility
This should alleviate fears of interrupting the process.
Eric Wong [Sun, 11 Jan 2015 23:13:08 +0000 (23:13 +0000)]
import_slrnspool: make filtering optional