Michael McCandless commented on LUCENE-1458:
In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard - in which case we won't have to worry about the OS swapping out those pages.
I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems - if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised.
In other words...
* I trust the OS to do a decent enough job on underpowered systems.
* High-powered systems should strive to avoid swapping entirely. To aid in that endeavor, we minimize per-process RAM consumption by maximizing our use of mmap and treating the system IO cache backing buffers as interprocess shared memory.
These are the two extremes, but I think most common are all the apps
in between. Take a large Jira instance, where the app itself is also
consuming a lot of RAM, doing a lot of its own IO, etc., and where
perhaps searching is done infrequently enough relative to other
operations that the OS no longer considers the pages you hit for the
terms index hot enough to keep around.
This is a good read, but I find it overly trusting of VM.
How can the VM system possibly make good decisions about what to swap
out? It can't know whether a page is being used for the terms dict
index, the terms dict, norms, stored fields, or postings. LRU is not a
good policy, because some pages (the terms index) are far, far more
costly to miss than others.
From Java we have even more ridiculous problems: sometimes the OS
swaps out garbage... and then massive swapping takes place when GC
runs, swapping the garbage back in only to then throw it away. Ugh!
I think we need to aim for *consistency*: a given search should not
suddenly take 10 seconds because the OS decided to swap out a few
critical structures (like the term index). Unfortunately we can't
really achieve that today, especially from Java.
I've seen my desktop OS (Mac OS X 10.5.5, based on FreeBSD) make
stupid VM decisions: if I run something that does a single-pass
through many GB of on-disk data (eg re-encoding a video), it then
swaps out the vast majority of my apps even though I have 6 GB RAM. I
hit tons (many seconds) of swapping just switching back to my mail
client. It's infuriating. I've seen Linux do the same thing, but at
least Linux lets you tune this behavior ("swappiness"); I had to
disable swapping entirely on my desktop.
Similarly, when a BG merge is burning through data, or say a backup
kicks off and moves many GB, or the simple act of iterating through a
big postings list, the OS will gleefully evict my terms index or norms
in order to populate its IO cache with data it won't need again for a
very long time.
I bet the VM system fails to show graceful degradation: if I don't
have enough RAM to hold my entire index, then walking through postings
lists will evict my terms index and norms, making all searches slower.
In the ideal world, an IndexReader would be told how much RAM to use.
It would spend that RAM wisely, eg first on the terms index, second on
norms, third maybe on select column-stride fields, etc. It would pin
these pages so the OS couldn't swap them out (we can't do this from
Java... though as a workaround we could use a silly thread). Or, if
the OS found itself tight on RAM, it would ask the app to free things
up instead of blindly picking pages to swap out, which does not happen
today.
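The "silly thread" workaround mentioned above could look roughly like the sketch below. This is purely illustrative (the class and names are hypothetical): periodically touching one byte per page only nudges the OS's LRU accounting in our favor; it cannot truly pin pages the way mlock() would.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: keep a buffer's pages "warm" by touching them
 *  periodically from a background daemon thread, so the OS is less
 *  likely to evict them. This only influences LRU; it is not pinning. */
public class PageToucher implements Runnable {
  private static final int PAGE_SIZE = 4096;  // assumed OS page size
  private final ByteBuffer buffer;            // e.g. the mapped terms index
  private volatile long blackhole;            // keep the reads observable

  public PageToucher(ByteBuffer buffer) {
    this.buffer = buffer;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      long sum = 0;
      for (int pos = 0; pos < buffer.limit(); pos += PAGE_SIZE) {
        sum += buffer.get(pos);               // touch one byte per page
      }
      blackhole = sum;
      try {
        Thread.sleep(10_000);                 // re-touch every 10 seconds
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public static Thread start(ByteBuffer buffer) {
    Thread t = new Thread(new PageToucher(buffer), "terms-index-toucher");
    t.setDaemon(true);
    t.start();
    return t;
  }
}
```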
From Java we could try using WeakReference, but I fear the
communication from the OS to the JRE is too weak. I.e., I'd want my
WeakReference cleared only when the OS is threatening to swap out my
pages.
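For illustration, a sketch of the reference-based idea (all names hypothetical). SoftReference, which the JVM clears under heap pressure, is somewhat closer to the desired "clear only when memory is tight" behavior than WeakReference, but neither hears anything about OS-level swapping, which is exactly the weakness noted above.

```java
import java.lang.ref.SoftReference;

/** Hypothetical sketch: hold the decoded terms index behind a reference
 *  the JVM may clear under memory pressure, reloading on demand.
 *  SoftReference is used because WeakReferences are cleared on any GC,
 *  not just under pressure -- and even SoftReference reflects JVM heap
 *  pressure, not OS swapping. */
public class TermsIndexCache {
  private volatile SoftReference<long[]> indexRef = new SoftReference<>(null);

  /** Hypothetical loader: would read the .tii offsets from disk. */
  private long[] loadFromDisk() {
    return new long[] {0L, 1024L, 2048L};  // placeholder data
  }

  public long[] get() {
    long[] index = indexRef.get();
    if (index == null) {                   // cleared, or never loaded
      index = loadFromDisk();
      indexRef = new SoftReference<>(index);
    }
    return index;
  }
}
```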
> Plus during that binary search the IO system is loading whole pages into
> the IO cache, even though you'll only peek at the first few bytes of each.
I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer.
But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that.
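A rough Java sketch of what binary searching blocks of the main dictionary might look like. The block layout is invented for illustration (each block starting with a 2-byte length plus the UTF-8 bytes of its first term), not the actual tis format; in practice the buffer would come from FileChannel.map() so that forks share the same IO-cache-backed pages.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the proposal: drop the index file and binary
 *  search fixed-size blocks of the main dictionary directly. */
public class BlockDictSearcher {
  private final ByteBuffer dict;   // stand-in for the mapped dictionary file
  private final int blockSize;

  public BlockDictSearcher(ByteBuffer dict, int blockSize) {
    this.dict = dict;
    this.blockSize = blockSize;
  }

  /** Assumed layout: block starts with a 2-byte length, then UTF-8 term. */
  private String firstTermOfBlock(int block) {
    int base = block * blockSize;
    int len = dict.getShort(base) & 0xFFFF;
    byte[] bytes = new byte[len];
    for (int i = 0; i < len; i++) {
      bytes[i] = dict.get(base + 2 + i);
    }
    return new String(bytes, StandardCharsets.UTF_8);
  }

  /** Returns the last block whose first term is <= target; the target,
   *  if present, must live in that block. Note each probe hits a
   *  different region of the file, so a cold search faults in roughly
   *  one page per probe. */
  public int findBlock(String target) {
    int lo = 0, hi = dict.capacity() / blockSize - 1, ans = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (firstTermOfBlock(mid).compareTo(target) <= 0) {
        ans = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return ans;
  }
}
```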
I'm not convinced this'll be a win in practice. You are now paying an
even higher overhead cost for each "check" of your binary search,
especially with something like pulsing which inlines more stuff into
the terms dict. I agree it's simpler, but I think that's trumped by
the performance hit.
In Lucene java, the concurrency model we are aiming for is a single
JVM sharing a single instance of IndexReader. I do agree, if fork()
is the basis of your concurrency model then sharing pages becomes
critical. However, modern OSs implement copy-on-write sharing of VM
pages after a fork, so that's another good path to sharing?
bq. Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.
Have you tried any actual tests swapping these approaches in as your
terms index impl? Tests of fully hot and fully cold ends of the
spectrum would be interesting, but also tests where a big segment
merge or a backup is running in the background...
bq. That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks.
That's a nice goal. Our biggest cost in Lucene is warming the
FieldCache, used for sorting, function queries, etc. Column-stride
fields should go a ways towards improving this.
bq. It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths.
I think if we make the pluggable API simple, and capture the
complexity inside each impl, such that it can be well tested in
isolation, it's acceptable.
bq. If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.
In my approach here, the blob is opaque to the terms dict reader: it
simply seeks to the right spot in the tis file, and then asks the
codec to decode the entry. TermsDictReader is entirely unaware of
what/how is stored there.
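That decoupling could be sketched along these lines (the interface and class names are illustrative, not the actual patch's API): the reader only seeks, and a pluggable codec interprets whatever bytes it finds.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec interface: decodes one term's entry starting at
 *  the buffer's current position. */
interface TermEntryCodec<T> {
  T decode(ByteBuffer tis);
}

/** Hypothetical reader: seeks to the entry's offset in the tis file,
 *  then lets the codec interpret the bytes. The reader never learns
 *  what the codec stored there. */
class TermsDictReader {
  private final ByteBuffer tis;   // stand-in for the mapped .tis file

  TermsDictReader(ByteBuffer tis) {
    this.tis = tis;
  }

  <T> T lookup(long offset, TermEntryCodec<T> codec) {
    tis.position((int) offset);
    return codec.decode(tis);
  }
}
```

A pulsing codec could inline postings right into the entry, while a plain codec stores only freq/prox offsets; the reader code would not change either way.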
> Further steps towards flexible indexing
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
> Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
> * Switches to a new more efficient terms dict format. This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo). At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas. Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array). It should be faster to init too.
> This part is basically done.
> * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers. EG there is no more TermInfo used
> when reading the new format.
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
> * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility). EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.