
n : nutch-developers@lists.sourceforge.net 17 January 2005 • 11:17PM -0500

Re: [Nutch-dev] Implementing geography-by-IP filtering?
by Andrzej Bialecki

Chirag Chaman wrote:
> Andrzej:
>
> On the same note, let me list examples of certain analysis that should be
> helpful and I'd appreciate it if you can point where is an appropriate place
> to add the code. Right now these sit external for us, but it would be nice
> to integrate them to Nutch.

A general note: my view on these issues is that one should implement
the control points as early as possible in the processing chain, i.e.
as soon as sufficient information is available to make an informed
decision. This limits the amount of data to be processed, which
could make a huge difference in terms of storage/CPU/bandwidth.

However, one may want to postpone decisions to later stages if some
other processing (e.g. language detection) is expensive and is run
anyway in one of the later stages.

>
> 1. Content - total size < X bytes - discard and mark.

The content size is available only at the fetch stage. However...

I'm slowly working on changing the interactions between the fetcher and
protocol plugins to use FetchListEntry data instead of just the URL (this
is needed to implement a dynamic re-fetch interval). In other words, a
Protocol would use:

Content getContent(FetchListEntry fle)

instead of the current:

Content getContent(String url)

because the protocol plugins will need to make protocol-dependent
decisions about whether to fetch the content, based on metadata available
during fetching (like Last-Modified or If-Modified-Since).
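
A minimal sketch of how such a signature lets a protocol skip unchanged
content. All types here are hypothetical stand-ins, not the real Nutch
classes: the lastFetchedMillis field, the Optional return type and
ToyProtocol are assumptions made only for illustration.

```java
import java.util.Optional;

class ConditionalFetchSketch {

    /** Hypothetical stand-in for FetchListEntry: only the fields this sketch needs. */
    static final class FetchListEntry {
        final String url;
        final long lastFetchedMillis; // assumed field: time of the previous fetch

        FetchListEntry(String url, long lastFetchedMillis) {
            this.url = url;
            this.lastFetchedMillis = lastFetchedMillis;
        }
    }

    /** Hypothetical stand-in for Content. */
    static final class Content {
        final String url;
        final byte[] bytes;

        Content(String url, byte[] bytes) {
            this.url = url;
            this.bytes = bytes;
        }
    }

    /** Assumed shape of the interface; Optional.empty() means "not fetched". */
    interface Protocol {
        Optional<Content> getContent(FetchListEntry fle);
    }

    /** Toy protocol serving one in-memory resource with a known modification time. */
    static final class ToyProtocol implements Protocol {
        private final long resourceModifiedMillis;
        private final byte[] body;

        ToyProtocol(long resourceModifiedMillis, byte[] body) {
            this.resourceModifiedMillis = resourceModifiedMillis;
            this.body = body;
        }

        @Override
        public Optional<Content> getContent(FetchListEntry fle) {
            // Same idea as an HTTP If-Modified-Since check: skip unchanged resources.
            if (resourceModifiedMillis <= fle.lastFetchedMillis) {
                return Optional.empty();
            }
            return Optional.of(new Content(fle.url, body));
        }
    }

    public static void main(String[] args) {
        Protocol p = new ToyProtocol(1_000L, new byte[] {1, 2, 3});
        // Resource changed after the last fetch, so it is fetched again.
        System.out.println(p.getContent(new FetchListEntry("http://example.com/", 500L)).isPresent());
        // Resource unchanged since the last fetch, so it is skipped.
        System.out.println(p.getContent(new FetchListEntry("http://example.com/", 2_000L)).isPresent());
    }
}
```

With only a String url, the protocol would have no previous-fetch time to
compare against, which is the point of passing the whole entry.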

If/when I complete this change, then it will be easier to put all
protocol-dependent decisions into protocol plugins (IMO a new factory,
e.g. ProtocolFilterFactory, should be used for that), and to put
content-dependent decisions, using a ContentFilter, into FetcherThread
(Fetcher.java:108).

Some of these decisions could be delayed to the stage of updating the
database or building indexes, but then you would have to either
re-filter all segment data (a trivial but time- and space-consuming task)
to delete unwanted content, or use some logic to "hide" it from the WebDB
update stage and from the segment indexing stage. So it seems the
Fetcher is still the best place to do it...
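
A rough sketch of what a size-based ContentFilter called from the fetcher
could look like. The interface shape, the null-means-discard convention
and the MinSizeFilter name are assumptions for illustration; no such
interface exists in Nutch yet.

```java
class MinSizeFilterSketch {

    /** Hypothetical stand-in for Content. */
    static final class Content {
        final String url;
        final byte[] bytes;

        Content(String url, byte[] bytes) {
            this.url = url;
            this.bytes = bytes;
        }
    }

    /** Assumed filter contract: return the content to keep, or null to discard it. */
    interface ContentFilter {
        Content filter(Content content);
    }

    /** Discards any content smaller than a configured number of bytes. */
    static final class MinSizeFilter implements ContentFilter {
        private final int minBytes;

        MinSizeFilter(int minBytes) {
            this.minBytes = minBytes;
        }

        @Override
        public Content filter(Content content) {
            return content.bytes.length < minBytes ? null : content;
        }
    }

    public static void main(String[] args) {
        ContentFilter f = new MinSizeFilter(4);
        Content tiny = new Content("http://example.com/a", new byte[2]);
        Content ok = new Content("http://example.com/b", new byte[8]);
        System.out.println(f.filter(tiny) == null ? "discard" : "keep");
        System.out.println(f.filter(ok) == null ? "discard" : "keep");
    }
}
```

The fetcher thread would run each fetched Content through the filter
before writing it to the segment, so discarded pages never reach the
WebDB update or indexing stages.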

>
> 2. Content - HTML tag to content ratio < threshold -- discard and mark

Well, this is format-specific, so it could be put into the parse plugin
specific for this format. But perhaps it would be simpler to centralize
these kinds of decisions in Fetcher, so it could be implemented as a
ContentFilter in Fetcher. But this adds a new requirement to the
ContentFilter interface: that it should also consider Parse results. Or
we could provide a separate hook to call some other type of filter,
let's say ExtendedContentFilter, after the Content has been parsed:

Content filter(Content content, Parse parse);

This approach also has the benefit that you could replace the original
content with something more suitable for web interface preview (e.g.
replace PDF with HTML - currently Nutch doesn't let you view cached
copies of non-HTML formats out of the box).

I'm not sure which way is better...
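
To make the second option concrete, here is a minimal sketch of an
ExtendedContentFilter that uses the Parse result. The stand-in Content
and Parse classes, the TextRatioFilter name and the null-means-discard
convention are assumptions. The "tag to content ratio" is interpreted
here as extracted text length divided by raw content length, which is
only one possible reading.

```java
class TagRatioFilterSketch {

    /** Hypothetical stand-in for Content: the raw fetched bytes. */
    static final class Content {
        final byte[] raw;

        Content(byte[] raw) {
            this.raw = raw;
        }
    }

    /** Hypothetical stand-in for Parse: the plain text extracted by a parser. */
    static final class Parse {
        final String text;

        Parse(String text) {
            this.text = text;
        }
    }

    /** Signature from the message above; returning null to discard is an assumption. */
    interface ExtendedContentFilter {
        Content filter(Content content, Parse parse);
    }

    /** Discards pages whose extracted text is too small a share of the raw content. */
    static final class TextRatioFilter implements ExtendedContentFilter {
        private final double minRatio;

        TextRatioFilter(double minRatio) {
            this.minRatio = minRatio;
        }

        @Override
        public Content filter(Content content, Parse parse) {
            if (content.raw.length == 0) {
                return null; // nothing was fetched, so nothing is worth keeping
            }
            double ratio = (double) parse.text.length() / content.raw.length;
            return ratio < minRatio ? null : content;
        }
    }

    public static void main(String[] args) {
        ExtendedContentFilter f = new TextRatioFilter(0.1);
        Content mostlyMarkup = new Content(new byte[100]);
        Content mostlyText = new Content(new byte[20]);
        System.out.println(f.filter(mostlyMarkup, new Parse("ab")) == null ? "discard" : "keep");
        System.out.println(f.filter(mostlyText, new Parse("0123456789")) == null ? "discard" : "keep");
    }
}
```

Because the method returns a Content, the same hook could also hand back
a replacement (for example an HTML rendering) instead of the original.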

> 3. Link analysis - incoming to outgoing link ratio is too low

You only know the outgoing links after you have parsed the Content. So
it's the same situation as with the case above.

>
> 4. File Size - the max file size to fetch based on type. Example, a file of
> 64k for HTML maybe fine, but not for a PDF -- this currently in Nutch will
> cause a "Fetched but cannot parse error". Thus it would be nice to have a
> property in the plugin xml file that specifies the max fetch bytes, and the
> action if this is hit (parse or discard)

I agree. Currently there is only a single value for all types of
plugins, which as you say is often inappropriate.

The PluginManifestParser supports the use of arbitrary attributes in
definitions of <implementation> elements - these values are then passed
to the plugin implementation (see for example the plugin in
language-identifier/plugin.xml).

So, it's possible even now to modify individual plugins and set their
limits separately, in their plugin.xml files. However, I'm a bit afraid
of the configuration hassle this could bring - instead of one central
config file (nutch-site.xml) that defines your runtime parameters, you
would now need to check many files... Perhaps a better way would be to
put these limits in the nutch-default.xml config file?
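
A small sketch of the central-config idea: one default limit plus
optional per-type overrides, all read from a single configuration
object. The property names ("fetch.max.bytes" and
"fetch.max.bytes.<type>") and the 65536 fallback are hypothetical and do
not correspond to existing Nutch properties; java.util.Properties stands
in for the Nutch configuration.

```java
import java.util.Properties;

class PerTypeLimitSketch {

    // Hypothetical property name for the limit shared by all content types.
    static final String DEFAULT_KEY = "fetch.max.bytes";

    // Assumed fallback when the configuration defines no limit at all.
    static final String FALLBACK = "65536";

    /** Returns the per-type limit if one is configured, else the shared default. */
    static int maxBytesFor(Properties conf, String mimeType) {
        String shared = conf.getProperty(DEFAULT_KEY, FALLBACK);
        String value = conf.getProperty(DEFAULT_KEY + "." + mimeType, shared);
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("fetch.max.bytes", "65536");
        conf.setProperty("fetch.max.bytes.application/pdf", "1048576");

        System.out.println(maxBytesFor(conf, "text/html"));
        System.out.println(maxBytesFor(conf, "application/pdf"));
    }
}
```

This keeps every limit visible in one file while still letting a PDF
parser get a larger budget than an HTML one.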

--
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


