arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


dynamic-ish configs

Answered (Verified) This post has 1 verified answer | 4 Replies | 2 Followers

Top 25 Contributor
19 Posts
offbored posted on Fri, Dec 11 2009 2:02 PM

I have some questions. Ok, maybe a lot of questions. Since I'm not sure I'd even be asking the right ones if I got too specific, maybe the best way to go about it is to just lay out what I'd like to accomplish in a general way and ask you nicely to point me in the right direction. 

I'd like to be able to define several configurable "content streams" by identifying one or more sites and keyword sets to restrict the content pulled, plus a refresh schedule. Ultimately I'd like to enable chaining, i.e., using one config as a parameter in another. E.g., ConfigA gets the top 10 Yahoo! News/GoogleTrends/etc. results for the keyword "tiger" (yeah, I know), then extracts a new set of keywords from those results to use against other sites identified in ConfigB. Some basic advice on the best way to accomplish this using AN would be appreciated.

I'd also like to be able to define 1) which content on a page to pull (e.g., the top post), 2) some basic ordinal logic (e.g., the top 5 posts), and 3) (optimally) some advanced logic (e.g., the top 5 posts where the "Posted By" line contains "Bob Abooey", or whatever). The Templater seems like the thing, but I don't want to assume. Is that correct, and do you have some sample code that uses it?

All Replies

Top 10 Contributor
1,905 Posts

OK, always glad to answer, so ask away! :D

#1 - Sounds like you want two distinct crawls. First, crawl the sites you are using for your keywords. You can either call the (newly added) methods in Crawler.cs that clear the Discoveries (which keep track of where AN has been)...

 public void ClearDiscoveries()
 public void ClearUncrawledCrawlRequests()
 public void ClearPolitenesses()

...or, create a new Crawler. You can then use DAO methods to get the WebPages. You could even switch the ConnectionString, so CrawlerA points to Database instance A and CrawlerB points to Database instance B. This way, you can be sure your data stays separate.
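From memory, the shape of that two-crawl setup is something like the sketch below. It's untested, and the parameterless constructor and DAO calls are my shorthand, so double-check the actual signatures against Crawler.cs and ArachnodeDAO.cs in your copy of the source:

// Untested sketch: verify signatures against Crawler.cs/ArachnodeDAO.cs.
Crawler crawlerA = new Crawler();   // connects per its ConnectionString

// Start from a clean slate so CrawlerA re-crawls everything.
crawlerA.ClearDiscoveries();
crawlerA.ClearUncrawledCrawlRequests();
crawlerA.ClearPolitenesses();

// ... submit CrawlRequests for the keyword-source sites and crawl ...

// When CrawlerA finishes, read the stored WebPages back with the DAO,
// extract the second-stage keywords, and hand them to a CrawlerB whose
// ConnectionString points at Database instance B.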

AN is a class library, and not an application (even though it has a search project), so you tell AN what to do.

#2 - Templater.cs is used for extracting the 'relevant' or 'article' portion from a WebPage. It isn't finished as it exists in the source now. Do you have any experience with xpaths? Here's a quick taste of the idea before we set it aside.
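With an xpath per site, extraction boils down to something like this. I'm using HtmlAgilityPack directly, purely for illustration, and the xpaths and class names are made-up stand-ins for whatever a template would define (they match your "Bob Abooey" example):

// Illustration only: the xpaths and class names are hypothetical;
// a real template would supply them per site.
using System.IO;
using System.Linq;
using HtmlAgilityPack;

string webPageSource = File.ReadAllText("page.htm");   // or from the WebPages table

HtmlDocument document = new HtmlDocument();
document.LoadHtml(webPageSource);

HtmlNodeCollection postNodes = document.DocumentNode.SelectNodes("//div[@class='post']");

if (postNodes != null)
{
    // The top 5 posts whose byline contains "Bob Abooey".
    var topFive = postNodes
        .Where(post =>
        {
            HtmlNode postedBy = post.SelectSingleNode(".//span[@class='posted-by']");
            return postedBy != null && postedBy.InnerText.Contains("Bob Abooey");
        })
        .Take(5);
}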

Let's focus on #1 first, though: Data Collection. The first step is to get AN crawling some SERPs. By default, Google.com is excluded from crawling (check cfg.DisallowedDomains)... if you are going to crawl Google, please change the UserAgent string in cfg.Configuration. :D Submit a link like 'http://www.google.com/search?q=tiger+woods&rls=com.microsoft:en-us:IE-SearchBox&ie=UTF-8&oe=UTF-8&sourceid=ie7&rlz=1I7ADBF_en' and, when the Crawler finishes, retrieve the contents from the database. If you don't elect to store the source ('cfg.Configuration -> InsertWebPageSource'), check DiscoveryManager.cs to get the path on disk from the AbsoluteUri and ApplicationSettings.DownloadedWebPagesDirectory.
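The fallback retrieval looks roughly like this; the DiscoveryManager method name and parameters are from memory and untested, so verify them against DiscoveryManager.cs before relying on this:

// Untested, from memory: verify the method name and parameters
// against DiscoveryManager.cs. 'webPage' stands in for a WebPages
// row you've already fetched via the DAO.
using System.IO;

// Map the AbsoluteUri to the file the crawler wrote to disk.
string pathOnDisk = DiscoveryManager.GetDiscoveryPath(
    ApplicationSettings.DownloadedWebPagesDirectory,
    webPage.AbsoluteUri,
    webPage.FullTextIndexType);

string webPageSource = File.ReadAllText(pathOnDisk);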

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts
Verified by offbored

Another example of the Templater.


Top 25 Contributor
19 Posts

Thanks, Mike, this helps. I think I'll keep the "content" DB central and create a secondary DB/App to manage configs so that I don't end up processing/creating too much redundant data. That should also give me some better options if I end up needing to scale up/out. Figure I'll manage the xpath templates out of the same solution.

Hopefully I'll have a chance to play with this a little later today or tomorrow. I think I have enough to move forward, so I'll go ahead and mark this answered. I'll post an update if I think it could be useful.

Thanks again, have a great weekend.

Top 10 Contributor
1,905 Posts

Great!

If it gets tough, hang in there... crawling the internet isn't an easy problem... :)

Feel free to ask questions whenever you have them...

Mike



copyright 2004-2017, arachnode.net LLC