
nutch-developers@lists.sourceforge.net 17 January 2005 • 11:19AM -0500

RE: [Nutch-dev] Implementing geography-by-IP filtering?
by Chirag Chaman

Matt:

This is a great addition. We do this external to Nutch right now, but your
code is going to make it a breeze to integrate all the rules we have to keep
spam out!

Thanks

CC-



-----Original Message-----
From: nutch-developers-admin@list...
[mailto:nutch-developers-admin@list...] On Behalf Of Matt
Kangas
Sent: Sunday, January 16, 2005 4:04 PM
To: dev@nutc...
Subject: Re: [Nutch-dev] Implementing geography-by-IP filtering?

Stefan and/or Doug,

Here's a followup to my Jan 3 diff. This time I added two hooks to the
Fetcher, for URLFilter and also for a new interface, ContentFilter.
These allow one to:
- filter out URLs prior to fetching, and
- filter out fetched content prior to writing to a segment
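
For readers without the diff at hand, here is a rough sketch of how two such hooks might sit in a fetch loop. Everything in it is an assumption made for illustration: UrlFilterSketch mimics the "return the URL to keep it, null to drop it" convention, ContentFilterSketch is a guessed shape for the proposed ContentFilter (the real signature is in the diff, not here), and FetcherSketch uses an in-memory map in place of the network and a list in place of a segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a URL filter: return the URL to keep it, null to drop it.
interface UrlFilterSketch {
    String filter(String url);
}

// Hypothetical shape for the proposed ContentFilter; the actual interface may differ.
interface ContentFilterSketch {
    boolean accept(String url, String content);
}

// Hypothetical fetcher: "pages" replaces the network, "segment" replaces a segment on disk.
class FetcherSketch {
    private final Map<String, String> pages;
    private final UrlFilterSketch urlFilter;
    private final ContentFilterSketch contentFilter;
    final List<String> segment = new ArrayList<>();

    FetcherSketch(Map<String, String> pages,
                  UrlFilterSketch urlFilter,
                  ContentFilterSketch contentFilter) {
        this.pages = pages;
        this.urlFilter = urlFilter;
        this.contentFilter = contentFilter;
    }

    void process(String url) {
        // Hook 1: filter the URL before any fetch happens.
        String kept = urlFilter.filter(url);
        if (kept == null) {
            return;
        }
        String content = pages.get(kept);
        if (content == null) {
            return; // treated as a failed fetch in this sketch
        }
        // Hook 2: filter the fetched content before it is written out.
        if (!contentFilter.accept(kept, content)) {
            return;
        }
        segment.add(kept);
    }
}
```

A geography-by-IP rule would plug in as the first hook and a spam rule as the second; in this sketch both are plain lambdas.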

This should provide a lot of flexibility for people who don't want to index
the entire web. The only drawback I see is that the interface is too simple
to be leveraged from the command-line; you'd have to make your own custom
CrawlTool and plug in filters at the appropriate point in the crawl cycle.

Speaking of CrawlTool, I think it'd be great if end users could customize
specific steps of the crawl cycle, in Java, w/o having to cut-and-paste the
whole class. Template method is the pattern I'm thinking of here. Does this
sound useful to anybody else?
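
To make the Template Method idea concrete, a minimal sketch follows. The class and step names (CrawlTemplateSketch, inject, generate, fetch, update, index) are illustrative assumptions, not the actual CrawlTool code; each step only records its name in a log so the ordering is visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical skeleton of a crawl cycle; not the real CrawlTool.
class CrawlTemplateSketch {
    final List<String> log = new ArrayList<>();

    // The template method: it fixes the order of the steps and is final,
    // so subclasses can change individual steps but not the cycle itself.
    public final void crawl(int depth) {
        inject();
        for (int round = 0; round < depth; round++) {
            generate(round);
            fetch(round);
            update(round);
        }
        index();
    }

    // Default steps; each one only records that it ran.
    protected void inject() { log.add("inject"); }
    protected void generate(int round) { log.add("generate " + round); }
    protected void fetch(int round) { log.add("fetch " + round); }
    protected void update(int round) { log.add("update " + round); }
    protected void index() { log.add("index"); }
}

// A user customizes one step without copying the rest of the class.
class FilteredCrawlSketch extends CrawlTemplateSketch {
    @Override
    protected void fetch(int round) {
        log.add("filtered fetch " + round);
    }
}
```

With something like this, plugging filters into the fetch step would mean overriding one method rather than cutting and pasting the whole class.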

--Matt

On Wed, 12 Jan 2005 10:50:15 -0800, Doug Cutting <cutting@nutc...> wrote:
> Good point.  I meant thread-safe, not re-entrant.
>
> Doug
>
> Kragen Sitaker wrote:
> > On Fri, 2005-01-07 at 11:34 -0800, Doug Cutting wrote:
> >
> >>It's usually pretty easy to replace fields that must be synchronized
> >>with ThreadLocals in order to make a class re-entrant.  Perhaps we
> >>should do this to RegexURLFilter?
> >
> >
> > Nitpick --- as far as I know, ThreadLocals don't make things
> > re-entrant, only thread-safe, which is a strictly weaker property.  
> > RegexURLFilter probably doesn't need to be re-entrant, because it's
> > not very likely that it's going to call some client-provided code in
> > the middle of filtering a URL and have that client-provided code
> > call RegexURLFilter again --- right?
> >
> > I'd hate to have to argue with someone who thinks ThreadLocals make
> > things re-entrant in some context where re-entrancy matters, having
> > gotten the idea from a trusted source.
> >
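
The distinction in the quoted exchange can be shown with a small sketch. ScratchNormalizerSketch is a made-up class, not RegexURLFilter: it keeps a per-thread scratch buffer in a ThreadLocal, which keeps different threads from sharing state, yet a callback that calls back into the same object on the same thread still overwrites the buffer mid-operation.

```java
import java.util.function.UnaryOperator;

// Hypothetical class, for illustration only.
class ScratchNormalizerSketch {
    // One buffer per thread: callers on different threads never share it.
    private final ThreadLocal<StringBuilder> scratch =
        ThreadLocal.withInitial(StringBuilder::new);

    // "hook" plays the role of client-provided code invoked in the middle of the work.
    String normalize(String url, UnaryOperator<String> hook) {
        StringBuilder sb = scratch.get();
        sb.setLength(0);
        sb.append(url.toLowerCase());
        // If the hook calls normalize() again on this thread, the nested call
        // resets and reuses the same buffer, discarding what was appended above.
        String suffix = hook.apply(url);
        sb.append(suffix);
        return sb.toString();
    }
}
```

Calling normalize("OUTER", u -> "/") returns "outer/", while a hook that itself calls normalize("INNER", x -> "") on the same instance makes the outer call return "innerinner" rather than "outerinner". That is the thread-safe but not re-entrant case.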




_______________________________________________
Nutch-developers mailing list
Nutch-developers@list...
https://lists.sourceforge.net/lists/listinfo/nutch-developers
