> it'd be nice to genericize MultiLevelSkipListWriter so that it could index
> arbitrary files
+1 on this idea. Using skip lists for the term index would be an
improvement.
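
To make the idea concrete, here is a minimal sketch of a multi-level skip index over a sorted list of entries. It is not Lucene's MultiLevelSkipListWriter: the class name SkipIndexSketch, the floor() method, and the in-memory String[] standing in for the indexed file are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a multi-level skip index; not Lucene code. */
public class SkipIndexSketch {
    private final String[] base;   // sorted entries; stands in for the indexed file
    private final int interval;    // every interval-th entry is promoted one level up
    // levels.get(0) is the lowest skip level, the last element is the top level
    private final List<String[]> levels = new ArrayList<>();

    public SkipIndexSketch(String[] sortedEntries, int interval) {
        if (interval < 2) {
            throw new IllegalArgumentException("interval must be >= 2");
        }
        this.base = sortedEntries;
        this.interval = interval;
        String[] below = sortedEntries;
        // Keep promoting every interval-th entry until a level is small enough.
        while (below.length > interval) {
            String[] level = new String[(below.length + interval - 1) / interval];
            for (int i = 0; i < level.length; i++) {
                level[i] = below[i * interval];
            }
            levels.add(level);
            below = level;
        }
    }

    /** Returns the position in base of the greatest entry <= target, or -1 if none. */
    public int floor(String target) {
        if (base.length == 0 || base[0].compareTo(target) > 0) {
            return -1;
        }
        int pos = 0; // index within the level currently being scanned
        for (int k = levels.size() - 1; k >= 0; k--) {
            String[] level = levels.get(k);
            // Move right while the next entry on this level is still <= target.
            while (pos + 1 < level.length && level[pos + 1].compareTo(target) <= 0) {
                pos++;
            }
            pos *= interval; // same entry, expressed as an index in the level below
        }
        // Final short scan over the base entries.
        while (pos + 1 < base.length && base[pos + 1].compareTo(target) <= 0) {
            pos++;
        }
        return pos;
    }
}
```

In a real implementation the top levels could be held on the heap while the lower levels are read from the file and left to the IO cache, which is the split Michael describes below.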
On Tue, Nov 18, 2008 at 12:27 PM, Michael McCandless (JIRA) <jira@apac...> wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648739#action_12648739 ]
>
> Michael McCandless commented on LUCENE-1458:
> --------------------------------------------
>
> bq. Can we design a format that allows us to rely upon the operating system's
> virtual memory and avoid caching in process memory altogether?
>
> Interesting! I've been wondering what you're up to over on KS, Marvin :)
>
> I'm not sure it'll be a win in practice: I'm not sure I'd trust the
> OS's IO cache to "make the right decisions" about what to cache. Plus
> during that binary search the IO system is loading whole pages into
> the IO cache, even though you'll only peek at the first few bytes of
> each.
>
> We could also explore something in-between, eg it'd be nice to
> genericize MultiLevelSkipListWriter so that it could index arbitrary
> files, then we could use that to index the terms dict. You could
> choose to spend dedicated process RAM on the higher levels of the skip
> tree, and then tentatively trust IO cache for the lower levels.
>
> I'd like to eventually make the TermsDict index pluggable so one could
> swap in different indexers like this (it's not now).
>
>
> > Further steps towards flexible indexing
> > ---------------------------------------
> >
> > Key: LUCENE-1458
> > URL: https://issues.apache.org/jira/browse/LUCENE-1458
> > Project: Lucene - Java
> > Issue Type: New Feature
> > Components: Index
> > Affects Versions: 2.9
> > Reporter: Michael McCandless
> > Assignee: Michael McCandless
> > Priority: Minor
> > Fix For: 2.9
> >
> > Attachments: LUCENE-1458.patch, LUCENE-1458.patch
> >
> >
> > I attached a very rough checkpoint of my current patch, to get early
> > feedback. All tests pass, though back compat tests don't pass due to
> > changes to package-private APIs plus certain bugs in tests that
> > happened to work (eg calling TermPositions.nextPosition() too many times,
> > which the new API asserts against).
> > [Aside: I think, when we commit changes to package-private APIs such
> > that back-compat tests don't pass, we could go back, make a branch on
> > the back-compat tag, commit changes to the tests to use the new
> > package private APIs on that branch, then fix nightly build to use the
> > tip of that branch?]
> > There's still plenty to do before this is committable! This is a
> > rather large change:
> > * Switches to a new more efficient terms dict format. This still
> > uses tii/tis files, but the tii only stores term & long offset
> > (not a TermInfo). At seek points, tis encodes term & freq/prox
> > offsets absolutely instead of with deltas. Also, tis/tii
> > are structured by field, so we don't have to record field number
> > in every term.
> > .
> > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> > .
> > RAM usage when loading terms dict index is significantly less
> > since we only load an array of offsets and an array of String (no
> > more TermInfo array). It should be faster to init too.
> > .
> > This part is basically done.
> > * Introduces modular reader codec that strongly decouples terms dict
> > from docs/positions readers. EG there is no more TermInfo used
> > when reading the new format.
> > .
> > There's nice symmetry now between reading & writing in the codec
> > chain -- the current docs/prox format is captured in:
> > {code}
> > FormatPostingsTermsDictWriter/Reader
> > FormatPostingsDocsWriter/Reader (.frq file) and
> > FormatPostingsPositionsWriter/Reader (.prx file).
> > {code}
> > This part is basically done.
> > * Introduces a new "flex" API for iterating through the fields,
> > terms, docs and positions:
> > {code}
> > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> > {code}
> > This replaces TermEnum/Docs/Positions. SegmentReader emulates the
> > old API on top of the new API to keep back-compat.
> >
> > Next steps:
> > * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> > fix any hidden assumptions.
> > * Expose new API out of IndexReader, deprecate old API but emulate
> > old API on top of new one, switch all core/contrib users to the
> > new API.
> > * Maybe switch to AttributeSources as the base class for TermsEnum,
> > DocsEnum, PostingsEnum -- this would give readers API flexibility
> > (not just index-file-format flexibility). EG if someone wanted
> > to store payload at the term-doc level instead of
> > term-doc-position level, you could just add a new attribute.
> > * Test performance & iterate.
>
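
For reference, the in-memory terms index described in the first bullet above (an array of terms plus a parallel array of long offsets) can be sketched roughly as follows. This is only an illustration under assumptions: the class TermsIndexSketch and the method seekOffset() are hypothetical names, not taken from the patch, and plain String ordering stands in for whatever term ordering the real format uses.

```java
import java.util.Arrays;

/** Hypothetical sketch of a terms index held as two parallel arrays; not from the patch. */
public class TermsIndexSketch {
    private final String[] indexTerms; // sorted sample of terms (the indexed terms)
    private final long[] offsets;      // offsets[i] = position of indexTerms[i] in the terms file

    public TermsIndexSketch(String[] indexTerms, long[] offsets) {
        if (indexTerms.length != offsets.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.indexTerms = indexTerms;
        this.offsets = offsets;
    }

    /**
     * Returns the file offset from which a scan for target should start,
     * or -1 if target sorts before the first indexed term.
     */
    public long seekOffset(String target) {
        int i = Arrays.binarySearch(indexTerms, target);
        if (i < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1,
            // so the greatest indexed term <= target sits at insertionPoint - 1.
            i = -i - 2;
        }
        return i < 0 ? -1 : offsets[i];
    }
}
```

A caller would seek the terms file to the returned offset and scan forward from there; no per-term TermInfo object is needed in RAM.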