arachnode.net

robots.txt rule

Answered (Verified) | 3 Replies | 2 Followers

megetron posted on Sat, Jan 9 2010 10:01 PM

When crawling I receive this message:

If you're spidering the site please ensure, that you are following robots.txt, that you do not initiate more than one page load at a time, and that consecutive page loads are no more frequent than one per ten seconds.

 

How do I implement this effectively?

Verified Answer

Verified by arachnode.net

Check cfg.CrawlRules.  Robots.txt should be enabled by default.
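
Since you asked how to implement it: purely for illustration, a bare-bones robots.txt check looks something like the sketch below. This is not the built-in Robots.txt rule (enable that instead); it only handles Disallow lines for User-agent: * and ignores wildcards, Allow and Crawl-delay.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;

    // Illustration only - the built-in Robots.txt crawl rule already does this
    // (and more) once enabled.  Wildcards, Allow and Crawl-delay are ignored.
    public static class RobotsTxtSketch
    {
        // Returns the Disallow prefixes that apply to all user-agents ("*").
        public static List<string> GetDisallowedPrefixes(Uri site)
        {
            var disallowed = new List<string>();
            string text;

            using (var webClient = new WebClient())
            {
                try
                {
                    text = webClient.DownloadString(new Uri(site, "/robots.txt"));
                }
                catch (WebException)
                {
                    return disallowed; // no robots.txt - nothing is disallowed
                }
            }

            bool appliesToUs = false;

            foreach (string rawLine in text.Split('\n'))
            {
                string line = rawLine.Split('#')[0].Trim();

                if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
                {
                    appliesToUs = line.Substring("User-agent:".Length).Trim() == "*";
                }
                else if (appliesToUs && line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                {
                    string path = line.Substring("Disallow:".Length).Trim();

                    if (path.Length != 0)
                    {
                        disallowed.Add(path);
                    }
                }
            }

            return disallowed;
        }

        // True if the URI's path does not start with any disallowed prefix.
        public static bool IsAllowed(Uri uri, List<string> disallowedPrefixes)
        {
            return !disallowedPrefixes.Any(p => uri.AbsolutePath.StartsWith(p, StringComparison.Ordinal));
        }
    }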

Also, check the Frequency rule, which is tied to the Politeness.cs class and is used in Cache.cs.
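
The one-page-load-per-ten-seconds request boils down to a minimum delay between requests to the same host, which is what the Frequency rule enforces. A standalone sketch of that idea (again, not the actual Politeness.cs/Cache.cs code):

    using System;
    using System.Collections.Generic;
    using System.Threading;

    // Standalone sketch of a per-host politeness delay - not the actual
    // Politeness.cs/Cache.cs implementation, just the idea it enforces.
    public class HostThrottle
    {
        private readonly TimeSpan _minimumDelay;
        private readonly Dictionary<string, DateTime> _nextAllowedRequestByHost = new Dictionary<string, DateTime>();
        private readonly object _lock = new object();

        public HostThrottle(TimeSpan minimumDelay)
        {
            _minimumDelay = minimumDelay;
        }

        // Blocks until at least _minimumDelay has elapsed since the previous
        // request to the same host, then reserves the next slot.
        public void WaitForTurn(Uri uri)
        {
            TimeSpan wait = TimeSpan.Zero;

            lock (_lock)
            {
                DateTime nextAllowed;

                if (_nextAllowedRequestByHost.TryGetValue(uri.Host, out nextAllowed) && nextAllowed > DateTime.UtcNow)
                {
                    wait = nextAllowed - DateTime.UtcNow;
                }

                _nextAllowedRequestByHost[uri.Host] = DateTime.UtcNow + wait + _minimumDelay;
            }

            if (wait > TimeSpan.Zero)
            {
                Thread.Sleep(wait);
            }
        }
    }

    // Usage: one page load per host every ten seconds, as the site asks.
    // var throttle = new HostThrottle(TimeSpan.FromSeconds(10));
    // throttle.WaitForTurn(new Uri("http://example.com/page.html"));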

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

megetron replied:

OK, I enabled the robots.txt rules and the Frequency rule, and also set ThreadSleepTimeInMillisecondsBetweenWebRequests = 10000 for the ten-second rule.

I will test it and observe the behaviour; please let me know if I am missing something.

Another issue is that from time to time I receive a 503 error message: The remote server returned an error: (503) Server Unavailable.

I wonder why; I will keep trying to figure this out.

Mike replied:

Hey -

On vacation right now.  For now, try reducing the number of CrawlThreads.  You are likely being throttled by your ISP or hardware.
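
Independently of the CrawlThreads setting, the usual way to cope with a 503 is to back off and retry with an increasing delay. A rough sketch, not tied to arachnode.net's request pipeline:

    using System;
    using System.Net;
    using System.Threading;

    // Rough sketch of retrying with exponential backoff when a server answers
    // 503 (Service Unavailable).  Not part of arachnode.net's request pipeline.
    public static class BackoffDownloader
    {
        public static string DownloadWithBackoff(Uri uri, int maxAttempts = 4)
        {
            TimeSpan delay = TimeSpan.FromSeconds(10);

            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    using (var webClient = new WebClient())
                    {
                        return webClient.DownloadString(uri);
                    }
                }
                catch (WebException webException)
                {
                    var response = webException.Response as HttpWebResponse;

                    // Only retry on 503; rethrow anything else or give up after maxAttempts.
                    if (response == null || response.StatusCode != HttpStatusCode.ServiceUnavailable || attempt == maxAttempts)
                    {
                        throw;
                    }
                }

                Thread.Sleep(delay);
                delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // back off: 10s, 20s, 40s, ...
            }
        }
    }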

Back on Tuesday. :)

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
