An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Troubleshooting differing crawl environments and differing crawl results...

1.) Ensure that your crawling code and configurations are identical.  After crawling, use SQL Data Compare to compare your results.

2.) Ensure that your operating systems are identical.  If they cannot be, strive to use x64, Server-class operating systems.

3.) Ensure that your network connections are identical.  This includes NICs, Gateways, Routers, and ISP Connections.

4.) Run your comparison crawls at the same time of day.

  • Many sites experience usage peaks, which can make it hard to obtain similar results when crawls run at different times of the day.

5.) Be respectful.

  • Many sites will only allow you to make a limited number of connections per period.  While AN does have measures in place to crawl politely, these can be disabled, and certain crawl configurations can hamper the politeness effect.  If you submit a crawl of 1 domain with 100 threads, and turn the Robots.txt rule off, you will likely experience many 'Unable to connect to the remote server' exceptions in the Exceptions data table.  This exception means you are being throttled by the website, or by your ISP.

6.) Examine the DisallowedAbsoluteUris and Exceptions tables thoroughly.

  • Every explanation for why an AbsoluteUri was not crawled is found in these two tables.

7.) Run your comparisons on a single domain with a limited depth, increase the depth until the domain is crawled according to your business requirements, and then expand to additional domains.

  • AN creates a LOT of data, and it can be difficult to sift through domain after domain, and AbsoluteUri after AbsoluteUri, trying to compare differences.

8.) If crawl result differences persist between crawling environments, run AN in a single thread.

  • Running AN in a single thread alleviates 99.9% of throttling issues presented by websites and ISPs.
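
For steps 1 and 7, if SQL Data Compare is not to hand, a rough first pass is to export the crawled AbsoluteUris from each environment and diff the two sets. The sketch below is only an illustration in Python: the function name `diff_crawls` and the sample URIs are made up for this example, and it assumes each environment's results have already been exported as plain strings, one AbsoluteUri per entry.

```python
from typing import Iterable, List, Tuple


def diff_crawls(env_a: Iterable[str], env_b: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (only_in_a, only_in_b) for two collections of crawled AbsoluteUris."""
    # Strip whitespace and drop blank entries so export formatting
    # does not show up as a false difference.
    a = {uri.strip() for uri in env_a if uri.strip()}
    b = {uri.strip() for uri in env_b if uri.strip()}
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    # Hypothetical exports; real data would come from each environment's database.
    env_a = ["http://example.com/", "http://example.com/about", "http://example.com/contact"]
    env_b = ["http://example.com/", "http://example.com/contact", "http://example.com/news"]
    only_a, only_b = diff_crawls(env_a, env_b)
    print("Only in A:", only_a)
    print("Only in B:", only_b)
```

Any AbsoluteUri that appears on only one side is a candidate to look up in the DisallowedAbsoluteUris and Exceptions tables of the environment that missed it.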
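
To make the "limited number of connections per period" idea from step 5 concrete, here is a minimal sliding-window sketch in Python. It is not how AN or any particular website implements throttling; the class name `ConnectionBudget` and its parameters are hypothetical, and the clock is injectable only so the behavior can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class ConnectionBudget:
    """Allow at most `limit` connections within any `period`-second window."""

    def __init__(self, limit: int, period: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.period = period
        self.clock = clock
        self._stamps: Deque[float] = deque()  # times of recent accepted connections

    def try_acquire(self) -> bool:
        """Record a connection and return True if the budget allows one now."""
        now = self.clock()
        # Forget connections that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            self._stamps.append(now)
            return True
        return False


if __name__ == "__main__":
    # Hypothetical limit: 2 connections per second.
    budget = ConnectionBudget(limit=2, period=1.0)
    for attempt in range(3):
        print("attempt", attempt + 1, "allowed:", budget.try_acquire())
```

Under a limit like this, 100 threads hitting one domain exhaust the budget almost immediately, and the refused attempts are what surface as connection exceptions. A single thread stays far closer to the allowed rate.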

Posted Thu, Apr 29 2010 9:05 AM by

copyright 2004-2017, LLC