Matt Kangas wrote:
> Stefan and/or Doug,
>
> Here's a followup to my Jan 3 diff. This time I added two hooks to the
> Fetcher, for URLFilter and also for a new interface, ContentFilter.
> These allow one to:
> - filter out URLs prior to fetching, and
> - filter out fetched content prior to writing to a segment

While the idea of ContentFilter is very useful, I have some doubts
about using URLFilter during fetching. If you don't want to fetch some
URLs, then you should not put them in the fetchlist in the first place.
In other words, I think the URLFilter part of this patch should be moved
to FetchListTool.java, between lines 508 and 509.

Also, in other places we use the factory pattern to get an instance of
URLFilter, rather than passing one in through a setter. Perhaps we
should use the same pattern here as well?
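
To make the two suggestions above concrete, here is a minimal sketch in
Java. It is not code from the patch or from Nutch itself: the names
SketchURLFilter, SketchURLFilterFactory and SketchFetchListBuilder are
hypothetical stand-ins, and the "return null to drop the URL" convention
and the http-only rule are assumptions made for illustration. It only
shows the shape of the idea: the fetchlist builder obtains its filter
from a factory and drops unwanted URLs before they reach the fetchlist.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a URL filter: returns the URL to keep it, or null to drop it. */
interface SketchURLFilter {
    String filter(String url);
}

/** Hypothetical factory: callers ask for the shared filter instead of receiving one via a setter. */
final class SketchURLFilterFactory {
    private static SketchURLFilter instance;

    private SketchURLFilterFactory() {}

    static synchronized SketchURLFilter getFilter() {
        if (instance == null) {
            // Assumption: a real factory would pick the implementation from configuration.
            // Here we hard-code a filter that keeps only http:// URLs.
            instance = url -> url.startsWith("http://") ? url : null;
        }
        return instance;
    }
}

/** Hypothetical fetchlist builder: filtering happens here, before anything is fetched. */
final class SketchFetchListBuilder {
    private SketchFetchListBuilder() {}

    static List<String> build(List<String> candidates) {
        SketchURLFilter filter = SketchURLFilterFactory.getFilter();
        List<String> fetchList = new ArrayList<>();
        for (String url : candidates) {
            String kept = filter.filter(url);
            if (kept != null) {
                fetchList.add(kept); // only accepted URLs ever reach the fetcher
            }
        }
        return fetchList;
    }
}
```

With this arrangement the fetcher needs no URL filtering hook of its
own, since rejected URLs never appear in its input.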
>
> This should provide a lot of flexibility for people who don't want to
> index the entire web. The only drawback I see is that the interface is
> too simple to be leveraged from the command-line; you'd have to make
> your own custom CrawlTool and plug in filters at the appropriate point
> in the crawl cycle.

There is a middle-ground solution here, I think: you could implement a
simple content filter that filters, e.g., based on a regex match against
the content metadata. The regexes could be read from a text file. The
filter could then be activated from the command line with a switch
pointing to the location of the regex file.
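
A rough sketch of such a filter follows, in Java. The ContentFilter
interface from the patch is not reproduced here, so the class name
RegexMetadataFilterSketch and its accept() method are hypothetical. The
file format (one regex per line, blank lines and lines starting with "#"
ignored) and the "drop on match" rule are also assumptions; a real
filter could just as well keep content on a match.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical content filter that rejects content whose metadata matches any regex from a file. */
final class RegexMetadataFilterSketch {
    private final List<Pattern> rejectPatterns = new ArrayList<>();

    /** Reads one regex per line; blank lines and lines starting with '#' are skipped (assumed format). */
    RegexMetadataFilterSketch(Path regexFile) throws IOException {
        for (String line : Files.readAllLines(regexFile, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            rejectPatterns.add(Pattern.compile(trimmed));
        }
    }

    /** Returns false if any metadata entry, rendered as "name: value", matches any pattern. */
    boolean accept(Map<String, String> metadata) {
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String rendered = entry.getKey() + ": " + entry.getValue();
            for (Pattern pattern : rejectPatterns) {
                if (pattern.matcher(rendered).find()) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

For example, a regex file containing the single line
"^Content-Type: image/" would drop image content and keep everything
else. The path passed to the constructor would come from whatever
command-line switch the tool ends up defining.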
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
Nutch-developers@list...
https://lists.sourceforge.net/lists/listinfo/nutch-developers