arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Setting the Depth in Program.cs in the Console project

Answered (Verified) This post has 1 verified answer | 11 Replies | 2 Followers

Top 75 Contributor
6 Posts
Aleksandar_e posted on Sat, Jan 23 2010 3:24 AM

Hi to all,

Can I please get a short explanation of this parameter?

The explanation given in the code:

 
//Setting the Depth to int.Max means to crawl the first page, and then int.MaxValue - 1 hops away from the initial CrawlRequest AbsoluteUri - so, the entire site.

//The higher the value for 'Priority', the higher the Priority.

confuses me a bit about "Depth" and what its values mean.

Also, I don't understand this OR here:

//You can logically OR the UriClassificationTypes to set what a CrawlRequest crawls!!!

 

Please advise so I can continue with my crawler project :)

Best regards,

Aleksandar.

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Depth of 1 means that page and all content on that page.  Depth of 2 means that page and every page found from the first page.

The higher the priority for a CrawlRequest, the sooner it will be crawled.

(OR) If you submitted http://arachnode.net/Home.aspx and set the RestrictCrawlTo parameter to UriClassificationType.Host | UriClassificationType.FileExtension you would only crawl .aspx pages from arachnode.net.
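
To make the OR concrete, here is a small hedged sketch in C#. The UriClassificationType values and the parameter names come straight from this thread; the CrawlRequest call at the end is only an assumed shape, since the exact constructor differs between arachnode.net versions (check Program.cs in the Console project for the one your build uses).

// Depth, Priority and the logically OR'd restriction, as described above.
int depth = int.MaxValue;                        // the first page plus (int.MaxValue - 1) hops away = the entire site.
int priority = 1;                                // the higher the Priority, the sooner the CrawlRequest is crawled.

UriClassificationType restrictCrawlTo =
    UriClassificationType.Host |                 // stay on arachnode.net...
    UriClassificationType.FileExtension;         // ...and only follow links with the same file extension (.aspx).

// Assumed usage -- the real constructor signature varies by version:
// crawler.Crawl(new CrawlRequest("http://arachnode.net/Home.aspx", depth, restrictCrawlTo, priority));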

Which version are you using?

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 75 Contributor
6 Posts

Hi Mike,

thanks a lot for your fast response,

I'm using version 1.4.

The goal of my project is to make a service that crawls ALL of the HTML from a single site (download only the HTML, not pictures, files...). Which depth should I use to crawl the whole site?

Also, I don't understand how this crawler works if it should crawl continuously (let's say, every 20 minutes). After it crawls for the first time, what will it crawl the second time it is started? Only new and updated pages, or will it overwrite everything from the beginning? If it can handle this, what are the configuration parameters? Are there any performance issues?

Best Regards,

Aleksandar.

Top 10 Contributor
1,905 Posts

No problem!

To crawl an entire site, and ensure that you crawl the entire site, use a depth of Int.Max.

If you don't want to download Images and Files, turn off 'AssignFileAndImageDiscoveries' in cfg.Configuration.

What is your desired method of crawling?  Do you want to start from the beginning each time, or stop, perform analysis and then continue to crawl?  AN can crawl however you'd like... you just have to flip a switch here and there.
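
For the "every 20 minutes" schedule mentioned in the question, a plain timer loop around the crawl is one option. This is only a minimal sketch: RunCrawl() is a hypothetical placeholder for however your service builds its CrawlRequests and starts the arachnode.net crawl.

using System;
using System.Threading;

class CrawlService
{
    static void Main()
    {
        TimeSpan interval = TimeSpan.FromMinutes(20);

        while (true)
        {
            RunCrawl();             // hypothetical: configure the Crawler, submit CrawlRequests, wait for completion.
            Thread.Sleep(interval); // wait 20 minutes before the next pass.
        }
    }

    static void RunCrawl()
    {
        Console.WriteLine("Crawl started at {0:u}", DateTime.UtcNow);
        // arachnode.net-specific calls would go here.
    }
}

In a real Windows service you would likely use a timer callback instead of a blocking loop, but the idea is the same.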

Going out for a bike ride... BB in about 6 hours.


Top 75 Contributor
6 Posts

HI Mike,

did you have a nice ride? :)

 

my goal is to set the crawler to work like this:

- when I first start it, I want to download all the HTML and put it in the database (I suppose in the table dbo.webpagemetadata, converted to XML),

- then every subsequent time the crawler starts, download only the new and updated HTML (not everything again from the beginning).

 

Mike, I'm very grateful for your help. Thanks a lot again.

Br,

Aleksandar.

Top 10 Contributor
1,905 Posts

I did.  Was a little shorter than expected but all in all a nice venture.  (20 miles)

So, if there are 1,000 pages in a site and one of those pages changes at, say, Depth 15, you only want to download that one page and skip the rest?

Which site are you looking to crawl?

Thanks!  Always glad to help.

::Mike


Top 75 Contributor
6 Posts

Good morning Mike (it is morning here in Macedonia :) ),

Yes, I would like to crawl exactly that way. I'll put that crawler in a service that will be started every 20 minutes.

I would like to crawl a site with advertisements, www.pazar3.com.mk.

Thanks,

BR,

Aleksandar.

Top 10 Contributor
1,905 Posts

Unfortunately, I don't know how you would do this - or if it's even possible...

Explanation: out of, say, 1,000 pages in a site, page 667 is new and is found at depth 8 from your initial crawling point... how would you know to go to that page without downloading/processing the other pages in the site?  This is precisely why Google implemented its Sitemaps program.

Mike


Top 75 Contributor
6 Posts

Hi Mike,

is it possible to download all the pages the first time I crawl, and then, every subsequent time the crawler is started, just PROCESS all the pages and download ONLY the updated ones, not all of them again from the beginning? That is my idea for updated pages; I don't have any idea about deleted ones :(.

BR,

Aleksandar.

Top 10 Contributor
1,905 Posts

This depends on what your definitions of 'PROCESS' and 'DOWNLOAD' are.  :)

AN, by default, will not DOWNLOAD a page if it hasn't changed, but will allow you to PROCESS the content, if you desire.  But if your 'NEW' page is somewhere in the pile of webpages in a site, you have to crawl through the existing pages to get to that new page.  You should consider looking at the site's RSS feed, if it has one.

If you can find the Google sitemap for the website, then this will tell you what pages have been updated.  But, this really could be anywhere and isn't really public information.
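
To illustrate the sitemap idea, here is a rough, generic .NET sketch (not part of arachnode.net) that reads a sitemap.xml, if the site publishes one, and lists the URLs whose <lastmod> is newer than the last crawl; those could then be submitted as individual CrawlRequests. The sitemap URL and the lastCrawl value are assumptions for illustration.

using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

class SitemapCheck
{
    static void Main()
    {
        DateTime lastCrawl = DateTime.UtcNow.AddMinutes(-20);   // when the service last ran (assumed).

        // Assumed location -- sitemaps can live anywhere and may not exist at all.
        string xml = new WebClient().DownloadString("http://www.pazar3.com.mk/sitemap.xml");

        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        var updated = XDocument.Parse(xml)
            .Descendants(ns + "url")
            .Where(u => (DateTime?)u.Element(ns + "lastmod") > lastCrawl)
            .Select(u => (string)u.Element(ns + "loc"));

        foreach (string absoluteUri in updated)
        {
            Console.WriteLine(absoluteUri);   // candidates to submit as CrawlRequests with a small Depth.
        }
    }
}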

Mike


Top 75 Contributor
6 Posts

Hi Mike,

I'll try that and test it. I hope to get what I want :).

Thanks for your support, I will inform you.

Br,

Aleksandar.

Top 10 Contributor
1,905 Posts

Great!  Let me know!  :)



copyright 2004-2017, arachnode.net LLC