On a powerful (by my standards) machine with 16GB RAM and a
7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.

Indexing started off fast, but got progressively slower as the
contents of the inbox (including Xapian + SQLite DBs) could no
longer be cached by the kernel. Once the on-disk size
increased, HDD seek contention between the Xapian shard workers
slowed the process down to a crawl.

With a single shard, it still took around 3.5 days to index on
the HDD. That's not good, but it's far better than not
finishing after 7 days. So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.
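
A minimal sketch of the intended usage, adapted from the test case
in this patch. The inbox name, URL, and address are the test's
placeholders, and the temporary directory and PI_CONFIG override
are assumptions added here so the example does not touch a real
configuration:

```sh
# Work in a throwaway directory; PI_CONFIG keeps the real config untouched.
tmp=$(mktemp -d)
export PI_CONFIG="$tmp/config"

if command -v public-inbox-init >/dev/null 2>&1
then
	# -j1 requests a single Xapian shard for the new -V2 inbox
	public-inbox-init -V2 -j1 m "$tmp/m" http://example.com/m \
		alt@example.com
	# Shards are numbered directories under xap*/, as globbed by the
	# test in this patch; a single entry is expected here
	ls -d "$tmp"/m/xap*/?
else
	echo 'public-inbox-init not installed, skipping' >&2
fi
```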

For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size. In the past,
a higher-end consumer-grade MLC SSD on similar hardware indexed
a similarly-sized data set in ~4 hours.
Default: unset, no epochs are skipped
+=item -j, --jobs=JOBS
+
+Control the number of Xapian index shards in a
+C<-V2> (L<public-inbox-v2-format(5)>) inbox.
+
+It is useful to use a single shard (C<-j1>) for inboxes on
+high-latency storage (e.g. rotational HDD) unless the system has
+enough RAM to cache 5-10x the size of the git repository.
+
+It is generally not useful to specify higher values than the
+default due to contention in the top-level producer process.
+
+Default: the number of online CPUs, up to 4
+
my $version = undef;
my $indexlevel = undef;
my $skip_epoch;
my %opts = (
'V|version=i' => \$version,
'L|indexlevel=s' => \$indexlevel,
'S|skip|skip-epoch=i' => \$skip_epoch,
);
GetOptions(%opts) or usage();
PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel;
+if (defined $jobs) {
+ die "--jobs is only supported for -V2 inboxes\n" if $version == 1;
+ die "--jobs=$jobs must be >= 1\n" if $jobs <= 0;
+ $creat_opt->{nproc} = $jobs;
+}
+
PublicInbox::InboxWritable->new($ibx, $creat_opt)->init_inbox(0, $skip_epoch);
# needed for git prior to v2.1.0
ok(-d "$tmpdir/m/git/$i.git", "mirror $i OK");
}
-@cmd = ("-init", '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
+@cmd = ("-init", '-j1', '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
'alt@example.com');
ok(run_script(\@cmd), 'initialized public-inbox -V2');
+my @shards = glob("$tmpdir/m/xap*/?");
+is(scalar(@shards), 1, 'got a single shard on init');
ok(run_script([qw(-index -j0), "$tmpdir/m"]), 'indexed');