3 public-inbox-index - create and update search indices
7 public-inbox-index [OPTIONS] INBOX_DIR...
9 public-inbox-index [OPTIONS] --all
13 public-inbox-index creates and updates the search, overview and
14 NNTP article number database used by the read-only public-inbox
15 HTTP and NNTP interfaces. Currently, this requires
16 L<DBD::SQLite> and L<DBI> Perl modules. L<Search::Xapian>
17 is optional, only to support the PSGI search interface.
19 Once the initial indices are created by public-inbox-index,
20 L<public-inbox-mda(1)> and L<public-inbox-watch(1)> will
21 automatically maintain them.
23 Running this manually to update indices is only required if
24 relying on L<git-fetch(1)> to mirror an existing public-inbox;
25 or if upgrading to a new version of public-inbox using
26 the C<--reindex> option.
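A manual update of a mirror could be sketched as below. This is a minimal illustration, assuming a v1 inbox whose C<INBOX_DIR> is itself a bare git repository with a configured remote; v2 inboxes keep their git data in a different layout, so the fetch step would differ. The helper names C<run> and C<update_mirror>, the C<DRY_RUN> variable and the example path are hypothetical and not part of public-inbox.

```shell
# Dry-run helper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Fetch new commits into a v1 mirror, then bring its indices up to date.
# $1 is a hypothetical path to the inbox's bare git repository.
update_mirror() {
    inbox_dir=$1
    run git --git-dir="$inbox_dir" fetch &&
        run public-inbox-index "$inbox_dir"
}

# Example (prints the two commands instead of running them):
#   DRY_RUN=1; update_mirror /srv/inbox.git
```

Indexing only runs if the fetch succeeds, so a failed fetch never triggers a needless index pass.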
28 Having the overview and article number database is essential to
29 running the NNTP interface, and strongly recommended for the
30 HTTP interface as it provides thread grouping in addition to
31 normal search functionality.
39 Influences the number of Xapian indexing shards in a
40 (L<public-inbox-v2-format(5)>) inbox.
42 See L<public-inbox-init(1)/--jobs> for a full description
45 C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
46 to disable parallel indexing regardless of the number of
49 If the inbox has not been indexed or initialized, C<JOBS - 1>
50 shards will be created (one job is always needed for indexing
51 the overview and article number mapping).
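The C<JOBS - 1> rule for a never-indexed inbox can be worked through with a small helper. This is only a sketch of the arithmetic stated above; the function name C<initial_shards> is hypothetical, and values below 2 are rejected simply because this sketch does not model them.

```shell
# Number of Xapian shards created for a never-indexed inbox, following
# the "JOBS - 1" rule: one job is reserved for the overview and article
# number mapping.
initial_shards() {
    njobs=$1
    if [ "$njobs" -lt 2 ]; then
        echo "initial_shards: JOBS must be 2 or more in this sketch" >&2
        return 1
    fi
    echo $((njobs - 1))
}

# Example: initial_shards 4    (prints 3)
```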
53 Default: the number of existing Xapian shards
57 Compacts the Xapian DBs after indexing. This is recommended
58 when using C<--reindex> to avoid running out of disk space
59 while indexing multiple inboxes.
61 While this option takes a negligible amount of time compared to
62 C<--reindex>, it requires temporarily duplicating the entire
63 contents of the Xapian DB.
65 This switch may be specified twice, in which case compaction
66 happens both before and after indexing to minimize the temporal
67 footprint of the (re)indexing operation.
69 Available since public-inbox 1.4.0.
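The combination of C<--reindex> with one or two C<--compact> switches could be assembled as follows. This is an illustrative sketch that only prints the command line; the function name C<reindex_cmd> and the example path are hypothetical.

```shell
# Print a reindex command line with the requested number of --compact
# switches (1 = compact after indexing, 2 = before and after).
# The path is printed as-is, so quote it yourself if it contains spaces.
reindex_cmd() {
    passes=$1
    inbox_dir=$2
    cmd='public-inbox-index --reindex'
    i=0
    while [ "$i" -lt "$passes" ]; do
        cmd="$cmd --compact"
        i=$((i + 1))
    done
    printf '%s %s\n' "$cmd" "$inbox_dir"
}

# Example: reindex_cmd 2 /srv/inbox
```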
73 Forces a re-index of all messages in the inbox.
74 This can be used for in-place upgrades and bugfixes while
75 NNTP/HTTP server processes are utilizing the index. Keep in
76 mind this roughly doubles the size of the already-large
77 Xapian database. Using this with C<--compact> or running
78 L<public-inbox-compact(1)> afterwards is recommended to
81 public-inbox protects writes to various indices with
82 L<flock(2)>, so it is safe to reindex (and rethread) while
83 L<public-inbox-watch(1)>, L<public-inbox-mda(1)> or
84 L<public-inbox-learn(1)> run.
86 This does not touch the NNTP article number database.
87 It does not affect threading unless C<--rethread> is
92 Index all inboxes configured in ~/.public-inbox/config.
93 This is an alternative to specifying individual inbox directories
98 Regenerate internal THREADID and message thread associations
101 This fixes some bugs in older versions of public-inbox. While
102 it is possible to use this without C<--reindex>, it makes little
105 Available in public-inbox 1.6.0 (PENDING).
109 Run L<git-gc(1)> to prune and expire reflogs if discontiguous history
110 is detected. This is intended to be used in mirrors after running
111 L<public-inbox-edit(1)> or L<public-inbox-purge(1)> to ensure data
112 is expunged from mirrors.
114 Available since public-inbox 1.2.0.
116 =item --max-size SIZE
118 Sets or overrides L</publicinbox.indexMaxSize> on a
119 per-invocation basis. See L</publicinbox.indexMaxSize>
122 Available since public-inbox 1.5.0.
124 =item --batch-size SIZE
126 Sets or overrides L</publicinbox.indexBatchSize> on a
127 per-invocation basis. See L</publicinbox.indexBatchSize>
130 When using rotational storage but abundant RAM, using a large
131 value (e.g. C<500m>) with C<--sequential-shard> can
132 significantly speed up and reduce fragmentation during the
133 initial index and full C<--reindex> invocations (but not
134 incremental updates).
136 Available in public-inbox 1.6.0 (PENDING).
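For the rotational-storage case described above, an initial index might be invoked along these lines. This is a sketch that prints the command rather than running it; the function name C<bulk_index_cmd> and the example path are hypothetical, and the C<500m> default simply mirrors the example value in the text.

```shell
# Print an initial-index command tuned for rotational storage with
# abundant RAM: a large batch size combined with --sequential-shard.
bulk_index_cmd() {
    inbox_dir=$1
    batch_size=${2:-500m}
    printf 'public-inbox-index --batch-size=%s --sequential-shard %s\n' \
        "$batch_size" "$inbox_dir"
}

# Example: bulk_index_cmd /srv/inbox 250m
```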
140 Disables L<fsync(2)> and L<fdatasync(2)> operations on SQLite
141 and Xapian. This is only effective with Xapian 1.4+. This is
142 primarily intended for systems with low RAM and the small
143 (default) C<--batch-size=1m>. Users of large C<--batch-size>
144 may even find disabling L<fdatasync(2)> causes too much dirty
145 data to accumulate, resulting in latency spikes from writeback.
147 Available in public-inbox 1.6.0 (PENDING).
149 =item --sequential-shard
151 Sets or overrides L</publicinbox.indexSequentialShard> on a
152 per-invocation basis. See L</publicinbox.indexSequentialShard>
155 Available in public-inbox 1.6.0 (PENDING).
159 Stop storing document data in Xapian on an existing inbox.
161 See L<public-inbox-init(1)/--skip-docdata> for description and caveats.
163 Available in public-inbox 1.6.0 (PENDING).
169 For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>.
170 All public-inbox-specific files are contained within the
171 C<$GIT_DIR/public-inbox/> directory.
173 v2 inboxes are described in L<public-inbox-v2-format(5)>.
179 =item publicinbox.indexMaxSize
181 Prevents indexing of messages larger than the specified size
182 value. A single suffix modifier of C<k>, C<m> or C<g> is
183 supported, thus a value of C<1m> prevents indexing of
184 messages larger than one megabyte.
186 This is useful for avoiding memory exhaustion in mirrors
187 via git. It does not prevent L<public-inbox-mda(1)> or
188 L<public-inbox-watch(1)> from importing (and indexing)
191 This option is only available in public-inbox 1.5 or later.
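How a suffixed size value maps to bytes can be illustrated with a small converter. This is a sketch, not the parser public-inbox uses; it assumes the suffixes are binary multiples (C<k> = 1024 bytes), as in L<git-config(1)>, and the function name C<size_to_bytes> is hypothetical.

```shell
# Convert a size such as 512, 64k, 1m or 2g to bytes.
# Assumption: suffixes are binary multiples (k = 1024), as in git-config(1).
size_to_bytes() {
    num=${1%[kmg]}      # strip a single trailing k, m or g, if any
    suffix=${1#"$num"}  # whatever was stripped (may be empty)
    case $num in
        '' | *[!0-9]* | 0?*)
            # empty, non-numeric, or leading zero (would be read as octal)
            echo "size_to_bytes: invalid size: $1" >&2
            return 1
            ;;
    esac
    case $suffix in
        '') mult=1 ;;
        k) mult=1024 ;;
        m) mult=$((1024 * 1024)) ;;
        g) mult=$((1024 * 1024 * 1024)) ;;
    esac
    echo $((num * mult))
}

# Example: size_to_bytes 1m    (prints 1048576)
```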
195 =item publicinbox.indexBatchSize
197 Flushes changes to the filesystem and releases locks after
198 indexing the given number of bytes. The default value of C<1m>
199 (one megabyte) is low to minimize memory use and reduce
200 contention with parallel invocations of L<public-inbox-mda(1)>,
201 L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
203 Increase this value on powerful systems to improve throughput at
204 the expense of memory use. The reduction of lock granularity
205 may not be noticeable on fast systems. With SSDs, values above
206 C<4m> have little benefit.
208 For L<public-inbox-v2-format(5)> inboxes, this value is
209 multiplied by the number of Xapian shards. Thus a typical v2
210 inbox with 3 shards will flush every 3 megabytes by default
211 unless parallelism is disabled via C<--sequential-shard>
214 This influences memory usage of Xapian, but it is not exact.
215 The actual memory used by Xapian and Perl has been observed
216 in excess of 10x this value.
218 This option is available in public-inbox 1.6 or later.
219 public-inbox 1.5 and earlier used the current default, C<1m>.
221 Default: 1m (one megabyte)
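The multiplication by shard count for v2 inboxes can be worked through as follows. This is a sketch of the arithmetic only; the function name C<flush_interval_bytes> is hypothetical, and one megabyte is taken to be 1048576 bytes, which is an assumption.

```shell
# Approximate number of bytes indexed between flushes for a v2 inbox:
# the batch size (in bytes) multiplied by the number of Xapian shards.
flush_interval_bytes() {
    batch_bytes=$1
    shards=$2
    echo $((batch_bytes * shards))
}

# Example: the default 1m batch with 3 shards
#   flush_interval_bytes 1048576 3    (prints 3145728, i.e. 3 megabytes)
```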
223 =item publicinbox.indexSequentialShard
225 For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
226 allows indexing Xapian shards in multiple passes. This speeds up
227 indexing on rotational storage with high seek latency by allowing
228 individual shards to fit into the kernel page cache.
230 Using a higher-than-normal number of C<--jobs> with
231 L<public-inbox-init(1)> may be required to ensure individual
232 shards are small enough to fit into cache.
234 Warning: interrupting C<public-inbox-index(1)> while this option
235 is in use may leave the search indices out-of-date with respect
236 to SQLite databases. WWW and IMAP users may notice incomplete
237 search results, but it is otherwise non-fatal. Using C<--reindex>
238 will bring everything back up-to-date.
240 Available in public-inbox 1.6.0 (PENDING).
242 This is ignored on L<public-inbox-v1-format(5)> inboxes.
244 Default: false, shards are indexed in parallel
246 =item publicinbox.<name>.indexSequentialShard
248 Identical to L</publicinbox.indexSequentialShard>,
249 but only affects the inbox matching E<lt>nameE<gt>.
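A per-inbox setting could be written to the config file as sketched below. The public-inbox config file uses the L<git-config(1)> file format, so C<publicinbox.E<lt>nameE<gt>.indexSequentialShard> becomes a variable inside a C<[publicinbox "NAME"]> section. The function name C<enable_sequential_shard>, the example file path and the inbox name C<meta> are hypothetical.

```shell
# Append a per-inbox indexSequentialShard setting to a config file in
# git-config(1) format. Inbox names containing double quotes or
# backslashes would need escaping, which this sketch does not do.
enable_sequential_shard() {
    config_file=$1
    inbox_name=$2
    printf '[publicinbox "%s"]\n\tindexSequentialShard = true\n' \
        "$inbox_name" >>"$config_file"
}

# Example: enable_sequential_shard "$HOME/.public-inbox/config" meta
```

Appending a second section with the same name is harmless in this format, since repeated sections are merged when the file is read.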
259 Used to override the default "~/.public-inbox/config" value.
261 =item XAPIAN_FLUSH_THRESHOLD
263 The number of documents to update before committing changes to
264 disk. This environment variable is handled directly by Xapian, refer to
265 Xapian API documentation for more details.
267 For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
270 Setting C<XAPIAN_FLUSH_THRESHOLD> or
271 C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
272 L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
273 L<public-inbox-watch(1)> tasks to wait for long and unpredictable
274 periods of time during C<--reindex>.
276 Default: none, uses C<publicinbox.indexBatchSize>
282 Occasionally, public-inbox will update its schema version and
283 require a full index by running this command.
287 Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
289 The mail archives are hosted at L<https://public-inbox.org/meta/>
290 and L<http://hjrcffqmbrq6wope.onion/meta/>
294 Copyright 2016-2020 all contributors L<mailto:meta@public-inbox.org>
296 License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
300 L<Search::Xapian>, L<DBD::SQLite>