An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub


General Questions / Threading / Abstraction


Posted: Mon, Apr 6 2015 8:17 PM

[I received these questions in email and these are great to share]

I've been reviewing all the code… Questions:

1) Am I able to upgrade my license from personal to business and apply the original cost of the personal towards the business license?

Yes, absolutely. I appreciate your honesty. Commercial applications of AN usually come with increased support requirements, and thus more of my time.

2) As far as business cases go, I am trying to identify possible revenue ideas for this. My original thoughts were around gathering product info from Amazon and finding price gaps between different regions of the world, but I am sure this is just the tip of the iceberg. Is there a market for selling the info, or other ideas for spidering the web to generate revenue? Any ideas/thoughts on how other people are using your spider?

The use cases are extremely varied... some use AN to scrape price data, some use it to scrape business location data, some use AN to monitor for brand compliance, to analyze links, to validate partner site compliance, to validate EU cookie policies, to find new Twitter users, to find subject matter experts, to create search for their own sites...

3) Originally I was looking at this as an example for WinForms development and threading, which I am weak in. Is there any reason why you create local references of all your variables in each class? I.e., the crawler creates all the manager classes but then passes them into all the other classes' constructors, like the engine, so they can have access to the local class variables. Is there a reason for this in terms of threading/performance? Are they not all just pointers to the original classes created in the crawler? Just curious if there is a reason for this that I am not grasping, or was it just a matter of convenience. Also, is there any reason/benefit to using the old threading model vs. the .NET 4.0 threading patterns? Sometimes new is not better/faster... wondering if this falls into that category, or the "if it is not broken, don't fix it" category, or the "no time, working on better features" category.

Using a local (private) variable reference is a programming convention whereby the internals of the class (not to be confused with the internal access modifier) may be accessed without regard to the public/private access modifier of the getter/setter, allowing for the inevitable change of the access modifier as the class library evolves. In short, it's a style thing: I prefer _privateVariable because I know the variable belongs to the class, or is inherited; how it is accessed is another level of abstraction, and whether it is public or private matters not to the private variable. When compiled, the compiler creates a private backing field for auto-properties anyway, so there is no performance difference. Why do most classes receive references to the classes they use in the constructor? AN is a plugin, entirely. The principle is called dependency injection. Based upon this, you may change/override ANY of AN's classes to suit your own custom behavior: don't like the way it renders? Override just one method in the class which handles the rendering, tell the Crawler to use your new class, and the rest of the behavior is untouched.
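A minimal sketch of that constructor-injection/override pattern (the Renderer and Crawler names below are illustrative, not AN's actual classes):

```csharp
using System;

// The dependency: a class whose behavior plugin authors may override.
public class Renderer
{
    // Virtual so a single method can be replaced without touching anything else.
    public virtual string Render(string absoluteUri)
    {
        return "Default rendering of " + absoluteUri;
    }
}

public class CustomRenderer : Renderer
{
    public override string Render(string absoluteUri)
    {
        // Call the base implementation, then post-process the result.
        string result = base.Render(absoluteUri);
        return result + " (custom)";
    }
}

public class Crawler
{
    private readonly Renderer _renderer;

    // The dependency is injected; the Crawler never news-up a concrete Renderer,
    // so any subclass can be substituted.
    public Crawler(Renderer renderer)
    {
        _renderer = renderer;
    }

    public string Process(string absoluteUri)
    {
        return _renderer.Render(absoluteUri);
    }
}

public static class Program
{
    public static void Main()
    {
        // Swap in the custom behavior; the rest of the Crawler is untouched.
        var crawler = new Crawler(new CustomRenderer());
        Console.WriteLine(crawler.Process("http://example.com/"));
        // prints: Default rendering of http://example.com/ (custom)
    }
}
```
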

Or, in this example, should you want to change the way HyperLinks are parsed, you may do so and leave the rest of the Processing alone. The same concepts apply to generalizing/abstracting DB storage. Inside the Custom* class you are free to call the base implementation of the Custom* class and validate/override those results, or process the pipeline in your own custom way. This feature is here to fully support the paradigm of class abstraction/IoC for the ArachnodeDAO and AN as a whole.

I'm not sure what the 'new' threading model is - perhaps this is the Task Parallel Library? AN's code, at the core, dates back to .NET 2.0, long before the threading helpers were added. Under the covers, the TPL spawns threads just like AN does. There isn't any reason to change the code, as it works and has been tested/verified/validated by thousands of users, and introducing a change to the core would likely break things - so, in short, it just works and does the same things under the hood that the TPL does. Also, there are a few cases where we need to change the ApartmentState: we need explicit control over the threads, selectively pausing/stopping threads and letting others pass. I am sure the TPL would be fine, but AN works as is, and without the 'safety checks' the TPL likely implements to help those new to threading, AN is very likely as fast or faster.
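The explicit pause/stop control described above can be sketched with the classic (pre-TPL) threading primitives AN's era of .NET provides: a ManualResetEvent acting as a gate that worker threads wait on. The PausableWorker class here is a hypothetical illustration, not AN's implementation:

```csharp
using System;
using System.Threading;

public class PausableWorker
{
    // Signaled = run; non-signaled = workers block at the gate.
    private readonly ManualResetEvent _gate = new ManualResetEvent(true);
    private volatile bool _stop;
    public volatile int Iterations;

    public void Work()
    {
        while (!_stop)
        {
            _gate.WaitOne();   // blocks here while paused; others may pass their own gates
            Iterations++;
            Thread.Sleep(1);   // simulate a unit of crawling work
        }
    }

    public void Pause()  { _gate.Reset(); }
    public void Resume() { _gate.Set(); }
    public void Stop()   { _stop = true; _gate.Set(); } // release a paused worker so it can exit
}

public static class Program
{
    public static void Main()
    {
        var worker = new PausableWorker();
        var thread = new Thread(worker.Work);
        thread.Start();

        Thread.Sleep(50);
        worker.Pause();
        Thread.Sleep(20);                // let any in-flight iteration finish
        int paused = worker.Iterations;
        Thread.Sleep(50);
        Console.WriteLine(worker.Iterations == paused); // no progress while paused

        worker.Resume();
        Thread.Sleep(50);
        worker.Stop();
        thread.Join();
        Console.WriteLine(worker.Iterations > paused);  // progress after resume
    }
}
```

This is the kind of selective control (pause one thread, let others continue) that is awkward to express with the TPL's task abstraction but trivial with raw threads.
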

4) I saw a post from you a while back that you have crawled over a billion pages… that is killer. Any chance you know what the size of the DB was, and how long it was running, etc.? Looking at sizing.

It was ~245GB, but this will vary greatly depending upon what you crawl and how fast your hardware/network is...  it took ~two weeks from a seed list of internet domains - from a single machine - ~two years ago.

5) Last but not least… If I understand correctly, discoveries are cached in RAM. So if you are running multiple instances of the crawler, each instance is storing its own discoveries, and you could potentially be processing duplicate entries since each spider has its own cache? Is that correct? Or are you hitting the DB each time a discovery is added, which would create a performance degradation? Is there any way to share the cache in memory across multiple spiders? I hope that makes sense.

Yes, each crawling instance stores Discoveries in RAM in a sliding-window cache. Each Discovery is checked against the DB for a match before being allowed to proceed; the SQL Server BufferManager handles the shared cache between/among crawling instances. SQL Server has its own caching mechanisms and will use up to the maximum RAM under load; for AN's usage, most of what is in the BufferManager will be Discoveries. So, no, you won't process duplicate entries.

It is tempting to think of 'hitting the DB' as a performance hit, and when I was writing the DB caching code I struggled with this... how do I avoid it? The answer is, "you don't." You either cache everything to RAM (remember, not every machine has 192GB of RAM, and not every site can be crawled using the available RAM on the crawling machine) and then crawling stops when you run out of RAM; or you use a BloomFilter and accept that you may receive false positives/negatives, so your crawling may never stop, or may miss links (depending upon the size of the filter, in RAM); or you figure out an intelligent way to use the DB.

As you appear to be really thinking about this (a GREAT THING, BTW!), experiment with the settings 'ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes' and 'ApplicationSettings.InsertDiscoveries = false', and look at public override bool IsUsingDesiredMaximumMemoryInMegabytes(bool isForCrawlRequestConsideration) in SiteCrawler\Managers\MemoryManager.cs. Set ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes to something low, like 50-75MB, and AN will use the DB exclusively. Crawl the Web Test Site (Web\Test) and note the times taken to crawl when using:

  1. RAM only (ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 4096)
  2. Both RAM and the DB (ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 50-75)
  3. The DB only (ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 25-50)
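The RAM-first, DB-backed duplicate check described above can be sketched as follows. The DiscoveryCache and IDiscoveryStore names are assumptions for illustration (AN's real lookup goes through the ArachnodeDAO and SQL Server); the point is the order of checks - sliding-window cache first, then the DB - so duplicates are never processed even after the cache evicts an entry:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the DB lookup (ArachnodeDAO/SQL Server in AN).
public interface IDiscoveryStore
{
    bool Exists(string absoluteUri);
    void Insert(string absoluteUri);
}

public class InMemoryStore : IDiscoveryStore
{
    private readonly HashSet<string> _rows = new HashSet<string>();
    public bool Exists(string absoluteUri) { return _rows.Contains(absoluteUri); }
    public void Insert(string absoluteUri) { _rows.Add(absoluteUri); }
}

// A bounded "sliding window" cache: recent Discoveries are answered from RAM,
// older ones fall through to the DB, which remains the source of truth.
public class DiscoveryCache
{
    private readonly int _capacity;
    private readonly LinkedList<string> _window = new LinkedList<string>();
    private readonly Dictionary<string, LinkedListNode<string>> _index =
        new Dictionary<string, LinkedListNode<string>>();
    private readonly IDiscoveryStore _store;

    public DiscoveryCache(int capacity, IDiscoveryStore store)
    {
        _capacity = capacity;
        _store = store;
    }

    // Returns true if the URI is new; false if it was already discovered.
    public bool TryAdd(string absoluteUri)
    {
        if (_index.ContainsKey(absoluteUri))
            return false;                       // hot: answered from RAM, no DB hit

        if (_store.Exists(absoluteUri))
            return false;                       // cold: the DB catches the duplicate

        _store.Insert(absoluteUri);
        _index[absoluteUri] = _window.AddLast(absoluteUri);

        if (_window.Count > _capacity)          // slide the window
        {
            _index.Remove(_window.First.Value);
            _window.RemoveFirst();
        }
        return true;
    }
}

public static class Program
{
    public static void Main()
    {
        var cache = new DiscoveryCache(2, new InMemoryStore());

        Console.WriteLine(cache.TryAdd("http://a/"));  // True  (new)
        Console.WriteLine(cache.TryAdd("http://a/"));  // False (RAM hit)
        cache.TryAdd("http://b/");
        cache.TryAdd("http://c/");                     // evicts http://a/ from RAM
        Console.WriteLine(cache.TryAdd("http://a/"));  // False (DB catches it)
    }
}
```
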
There is a slight performance hit for accessing a disk (about 8%, if I recall correctly), of course, but without a DB backing, the options are either a lossy method of "where have I been" (a BloomFilter) or virtual memory (really, really, really slow)... As AN is a general crawling engine, a lossy method isn't appropriate, as the number one support question would become, "why am I missing results?" Most other crawlers stop short of providing scalable DB/stateful storage, as it is difficult to tune, scale, and implement locking at the proper locations, and it takes FOREVER of testing to validate that your 1M-page crawl completes without crawling in a loop. This particular topic is extremely involved and I am glad you are interested in it - let's keep this thread going!
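To make the BloomFilter trade-off concrete, here is a minimal sketch (toy sizing and a simple hand-rolled hash, purely for illustration). Items that were added are always reported present, but items that were never added can come back as false positives - which is why a general crawler can't rely on one as its only "where have I been" record:

```csharp
using System;
using System.Collections;

public class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashes;

    public BloomFilter(int bits, int hashes)
    {
        _bits = new BitArray(bits);
        _hashes = hashes;
    }

    // Toy seeded string hash; real filters use better hash families.
    private int Index(string item, int seed)
    {
        unchecked
        {
            int h = 17 + seed * 31;
            foreach (char c in item)
                h = h * 31 + c;
            return (h & 0x7FFFFFFF) % _bits.Length;
        }
    }

    public void Add(string item)
    {
        for (int i = 0; i < _hashes; i++)
            _bits[Index(item, i)] = true;
    }

    public bool MightContain(string item)
    {
        for (int i = 0; i < _hashes; i++)
            if (!_bits[Index(item, i)])
                return false;   // definitely never added (no false negatives for added items)
        return true;            // probably added -- or a false positive
    }
}

public static class Program
{
    public static void Main()
    {
        var filter = new BloomFilter(1024, 3);
        filter.Add("http://example.com/a");

        Console.WriteLine(filter.MightContain("http://example.com/a")); // True
        // Likely False here, but a small or heavily loaded filter could say True:
        Console.WriteLine(filter.MightContain("http://example.com/z"));
    }
}
```

The false-positive rate grows as the filter fills relative to its bit count, which is the "depending upon the size of the filter, in RAM" caveat above.
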

Also, take a look at this:

Thank you,

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, LLC