nutch-developers@lists.sourceforge.net, 7 December 2004, 8:19AM -0500

Re: [Nutch-dev] Implementing geography-by-IP filtering?
by Stefan Groschupf

Hi Matt,
Very interesting; we are working on a similar issue.
We use a German commercial 'Zip Code to Geo Coordinates' DB.
We don't have that many URLs yet, so performance hasn't been a big
problem so far; however, we are still in development and haven't
tested much.

We extract the zip code from the DNS WHOIS lookup. We do this lookup
at indexing time, since we index the coordinates as well.
We cache all of this information in a MySQL database, and check that
database first before doing the WHOIS query.
This saves a lot of DNS and WHOIS traffic, but you may not have the
latest information; in our case that is secondary.
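
The cache-first lookup described above could be sketched roughly as
follows. This is only an illustration, not our actual code: the class
name CachingZipLookup, the remoteLookup function (a stand-in for the
WHOIS query) and the in-memory map (a stand-in for the MySQL table) are
all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Cache-first lookup: consult the local store before the slow remote query. */
class CachingZipLookup {
    // Stand-in for the MySQL table; a real setup would query the database here.
    private final Map<String, String> cache = new HashMap<>();
    // Hypothetical slow lookup (e.g. a WHOIS query): host -> zip code, or empty.
    private final Function<String, Optional<String>> remoteLookup;
    // Counts how often the slow path was taken.
    private int remoteCalls = 0;

    CachingZipLookup(Function<String, Optional<String>> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    /** Returns the zip code for a host, asking the remote service only on a cache miss. */
    synchronized Optional<String> zipFor(String host) {
        String cached = cache.get(host);
        if (cached != null) {
            return Optional.of(cached); // may be stale; entries are never refreshed here
        }
        remoteCalls++;
        Optional<String> fresh = remoteLookup.apply(host);
        fresh.ifPresent(zip -> cache.put(host, zip)); // only successful lookups are cached
        return fresh;
    }

    synchronized int remoteCalls() {
        return remoteCalls;
    }
}
```

Entries are never expired in this sketch, which matches the trade-off
mentioned above: less lookup traffic in exchange for possibly outdated
data.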

I also think refactoring the fetch list tool to a multithreaded style
would be useful in other scenarios as well.
Of course there must be a singleton that writes the list, but having a
multithreaded filter would be good.
I would also love to change the interface-based UrlFilter to an
extension-point-based filter, since this would allow multiple
filters to be installed (location-based and content-based, e.g. all
restaurants in NY).
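
To illustrate what having several filters installed at once might look
like, here is a minimal sketch. It does not use the Nutch plugin or
extension-point API; SimpleUrlFilter and UrlFilterChain are hypothetical
names, and the contract (return the URL to keep it, null to reject it)
is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical filter contract: return the URL to keep it, or null to reject it. */
interface SimpleUrlFilter {
    String filter(String url);
}

/** Applies several filters in order; a URL survives only if every filter accepts it. */
class UrlFilterChain implements SimpleUrlFilter {
    private final List<SimpleUrlFilter> filters;

    UrlFilterChain(SimpleUrlFilter... filters) {
        this.filters = Arrays.asList(filters);
    }

    @Override
    public String filter(String url) {
        String current = url;
        for (SimpleUrlFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // rejected by this filter; later filters are skipped
            }
        }
        return current;
    }
}
```

A location-based filter and a content-based filter could then be
combined as new UrlFilterChain(locationFilter, contentFilter), where
both arguments are whatever implementations are installed.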

Stefan


On 07.12.2004 at 00:46, Matt Kangas wrote:

> Hi folks,
>
> A few weeks ago, I decided to create a Nutch extension that would
> allow one to crawl URLs only within a certain geographic area. It
> could be handy for a Canadian to build a Nutch setup that crawls all
> Canadian sites, including the .com and .orgs. Or, since I'm in New
> York, I'd like to search local content in the NYC area w/o needing the
> disk space to crawl the entire web.
>
> One way to do this is an IP-to-location lookup, using something like
> the MaxMind.com GeoIP database. The free version resolves to the
> country level, and pay versions resolve down to metro area. So I
> implemented a subclass of net.nutch.net.RegexURLFilter that does this.
> (see attached)
>
> The result, IPRegexURLFilter, works as advertised: it filters by regex
> *and* country-netblock. It's also very, very slow. The reason is quite
> simple. To do an IP-to-country lookup from a URL, I first have to do a
> DNS lookup on the hostname, which has high latency. So the
> single-threaded sections of code that call URLFilter.filter()
> implementations spend most of their time waiting for the lookup to
> complete.
>
> My instincts tell me there are two ways to improve this situation:
> 1) Move the IP-based filter into the multithreaded parts of Fetcher,
> e.g. FetcherThread
> 2) Or, push it all the way down to where the Fetcher does its own DNS
> lookup, so we eliminate duplicate lookups for each non-filtered URL
>
> (2) would require hooking into each Protocol implementation that deals
> with hostnames, e.g. protocol-http AND protocol-ftp. That seems like a
> bad idea. Considering that the JVM will cache DNS requests, perhaps
> it's not worth going this far to eliminate the double-lookup.
>
> So, if (1) is a better course of action, I would need to hook into
> FetcherThread.run() and run a filter before the call to
> protocol.getContent(url).
>
> What's the best way to achieve this? More importantly, what's the
> Nutch way? Since FetcherThread is an inner class, subclassing it isn't
> the answer. A delegate of some kind seems more appropriate. Perhaps
> Fetcher could gain a URLFilter ivar, which if not null, FetcherThread
> calls before protocol.getContent(url)?
>
> I think this would be a generally-useful extension to the crawler, and
> am willing to write it & submit as a patch.
>
> Nutch committers, what do you think?
>
> (ps: I don't work for MaxMind, I just think their product is useful.
> The DB access API and GeoIP Free DB are both GPL'd)
>
> --Matt
> <ipregexurlfilter.java>
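
The latency problem in the quoted message, and option (1) of running the
lookup-based filter on worker threads, could be sketched along these
lines. This is not the attached IPRegexURLFilter and it uses neither the
Nutch nor the MaxMind API: CountryUrlFilter is a hypothetical class, and
hostToCountry is an assumed stand-in for "resolve the hostname via DNS,
then look the address up in a GeoIP database".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Keeps only URLs whose host maps to an allowed country code. */
class CountryUrlFilter {
    // Hypothetical slow lookup: host -> country code, or null if unknown.
    private final Function<String, String> hostToCountry;
    private final Set<String> allowedCountries;

    CountryUrlFilter(Function<String, String> hostToCountry, Set<String> allowedCountries) {
        this.hostToCountry = hostToCountry;
        this.allowedCountries = allowedCountries;
    }

    /** Returns the URL if its host is in an allowed country, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
        if (host == null) {
            return null;
        }
        String country = hostToCountry.apply(host);
        return (country != null && allowedCountries.contains(country)) ? url : null;
    }

    /** Filters on a thread pool so slow lookups overlap; only the caller builds the result list. */
    List<String> filterAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> filter(url);
                pending.add(pool.submit(task));
            }
            List<String> kept = new ArrayList<>();
            for (Future<String> f : pending) { // input order is preserved
                String result = f.get();
                if (result != null) {
                    kept.add(result);
                }
            }
            return kept;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while filtering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("filter task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the lookup is injected, the same class can be exercised with a
plain map instead of live DNS. The single collecting thread plays the
role of the one writer that assembles the list while the worker threads
wait on lookups.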


