Matt Kangas wrote:
> On Mon, 17 Jan 2005 16:17:46 +0100, Andrzej Bialecki <ab@geto...> wrote:
>
>>.... Or we could provide a separate hook to call some other type of filter,
>>let's say ExtendedContentFilter, after the Content has been parsed:
>>
>> Content filter(Content content, Parse parse);
>>
>>This approach has also the benefit that you could replace the original
>>content with something more suitable for web interface preview (e.g.
>>replace PDF with HTML - currently Nutch doesn't allow you out-of-the-box
>>to view cached copies of non-html formats).
>
>
> Andrzej, I think this is a great idea. The ContentFilter interface
> would be much more useful if the parsed data was available for
> analysis too. I'd suggest keeping the interface very simple -- perhaps
> the above signature is all that's needed. If a given filter doesn't
> care about Parse data, it can ignore it.
>
> However, I'm not sure about content-transforming filters. Wouldn't you
> want to get both Content and Parse back from filter() if this was the
> goal?
In the general case, I don't know yet... ;-) Both arguments are passed as
references and filter() can return only one object, so if you replace both
of them with other instances you are in trouble: the caller never sees the
second replacement. You could add a simple data-holder class with these
two fields and return that from filter().
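To make the holder idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: Content and Parse are simplified stand-ins and not the real Nutch classes, and FilterResult, the holder-returning filter() signature and HolderDemo are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in for Nutch's Content; the real class differs.
final class Content {
    final String contentType;
    final byte[] data;

    Content(String contentType, byte[] data) {
        this.contentType = contentType;
        this.data = data;
    }
}

// Hypothetical stand-in for Nutch's Parse; the real class differs.
final class Parse {
    final String text;
    final Properties metadata = new Properties();

    Parse(String text) {
        this.text = text;
    }
}

// Hypothetical holder: lets a filter hand back both instances at once.
final class FilterResult {
    final Content content;
    final Parse parse;

    FilterResult(Content content, Parse parse) {
        this.content = content;
        this.parse = parse;
    }
}

// Hypothetical variant of the proposed hook that returns the holder.
interface ExtendedContentFilter {
    FilterResult filter(Content content, Parse parse);
}

final class HolderDemo {
    // A filter that replaces BOTH instances; the caller sees both via the holder.
    static final ExtendedContentFilter REPLACE_BOTH = (content, parse) -> {
        byte[] html = ("<pre>" + parse.text + "</pre>").getBytes(StandardCharsets.UTF_8);
        return new FilterResult(new Content("text/html", html), new Parse(parse.text.trim()));
    };

    public static void main(String[] args) {
        FilterResult r = REPLACE_BOTH.filter(
                new Content("application/pdf", new byte[0]), new Parse("  hello  "));
        System.out.println(r.content.contentType + " / " + r.parse.text);
    }
}
```

The cost is one extra class in the API, which is why this is probably only worth it if filters really need to replace both objects.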
But in the case of non-HTML formats like PDF and Word, I imagine that
normally you would want to keep the text and metadata from parsing the
original format, optionally adding some metadata to the Parse instance
(but without replacing it with a new instance...). In other words, I
think that only the Content instance would be completely replaced (hence
it needs to be returned), and the Parse instance would be only slightly
modified.
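A rough sketch of that case, under stated assumptions: DocContent and DocParse are simplified stand-ins (not the real Nutch Content and Parse), PreviewFilter and PdfPreviewFilter are hypothetical names, and the "originalContentType" metadata key is made up for the example. The filter returns a new content object and only adds metadata to the existing parse object.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in for Nutch's Content; the real class differs.
final class DocContent {
    final String contentType;
    final byte[] data;

    DocContent(String contentType, byte[] data) {
        this.contentType = contentType;
        this.data = data;
    }
}

// Hypothetical stand-in for Nutch's Parse, with mutable metadata only.
final class DocParse {
    final String text;
    final Properties metadata = new Properties();

    DocParse(String text) {
        this.text = text;
    }
}

// Same shape as the signature proposed in the thread, on the stand-in types.
interface PreviewFilter {
    DocContent filter(DocContent content, DocParse parse);
}

// Replaces PDF content with an HTML preview built from the parsed text.
final class PdfPreviewFilter implements PreviewFilter {
    @Override
    public DocContent filter(DocContent content, DocParse parse) {
        if (!"application/pdf".equals(content.contentType)) {
            return content; // leave other formats untouched
        }
        // Modify the existing Parse in place: record where the preview came from.
        parse.metadata.setProperty("originalContentType", content.contentType);
        // Naive preview; a real filter would HTML-escape the text.
        String html = "<html><body><pre>" + parse.text + "</pre></body></html>";
        // Only the content is replaced, so it is the only thing returned.
        return new DocContent("text/html", html.getBytes(StandardCharsets.UTF_8));
    }
}
```

With this shape the caller keeps its reference to the parse object and simply swaps in whatever content filter() returns.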
I've just had a look at the methods available in Content and
Parse/ParseData: Content is basically read-only, as is Parse.text, and
ParseData only lets you change the metadata part... Hmmm. Not much can be
changed here by the filter. We could add setters to these classes, though.
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
Nutch-developers@list...
https://lists.sourceforge.net/lists/listinfo/nutch-developers