arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub


questions

Answered (Verified) | 6 Replies | 2 Followers

posted on Mon, Jan 25 2010 11:37 AM

Hi, this is professional, solid work :)

My bunch of questions: if I want to deploy it on multiple servers, where should I start, how many servers can I use to maximize its performance, and which components should go on each server? And do you know of a place, like a website directory, where we can find sites and domains to add to the crawler database?

Thank you :)


All Replies

Top 10 Contributor
1,905 Posts

Thank you!

Would you register so I know who you are, please?  This question is a bit involved, and if you register you will be notified when the thread is updated.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 200 Contributor
1 Posts

Hi,

I registered in the forums, and as you requested, I will ask my questions again :)

If I want to deploy it on multiple servers, where should I start, how many servers can I use to maximize its performance, and which components should go on each server? And do you know of a place, like a website directory, where we can find sites and domains to add to the crawler database? Thank you :)

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Important: The biggest limiting factor in AN, using the default configuration, is the speed of your database disks.

That said, how AN performs depends on what you have turned on.

And if you aren't taxing the DB, the biggest limiting factor may very well be your internet connection and connection H/W... specifically, the number of simultaneous connections you can make.
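
For reference, raising .NET's default outbound connection cap looks something like this. This is plain .NET, not AN-specific (AN may already configure it for you), and the value of 64 is only an assumption to tune against your hardware:

    using System.Net;

    public static class CrawlerTuning
    {
        public static void RaiseConnectionLimit()
        {
            // .NET defaults to 2 simultaneous HTTP connections per host,
            // which will starve a crawler long before bandwidth does.
            // 64 is an arbitrary starting point; tune it to what your
            // connection H/W can actually sustain.
            ServicePointManager.DefaultConnectionLimit = 64;
        }
    }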

AN currently supports one DB machine but multiple crawlers, and it can distribute each of the DownloadedImages/DownloadedFiles/DownloadedWebPages directories across any number of servers, provided you use DFS or another FS clustering technology.

So, the crawl code (the solution files) goes on the crawling machines, and the DB is restored to the DB server.

You could have three additional machines that do nothing other than provide file shares for the Discoveries (Files, Images, WebPages), thereby offloading this work from, say, the DB server.
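
As a sketch of that layout (every server name and path below is hypothetical; wire the values into wherever your AN version stores its directory settings):

    using System.IO;

    public static class DiscoveryShares
    {
        // Hypothetical layout: each Discoveries directory lives on a UNC
        // share hosted by a different file server, so write I/O is spread
        // across three machines instead of the DB server's disks.
        public const string DownloadedFiles = @"\\fileserver1\Discoveries\Files";
        public const string DownloadedImages = @"\\fileserver2\Discoveries\Images";
        public const string DownloadedWebPages = @"\\fileserver3\Discoveries\WebPages";

        public static void VerifySharesAreReachable()
        {
            foreach (string path in new[] { DownloadedFiles, DownloadedImages, DownloadedWebPages })
            {
                // Fail fast at startup if a share is down, rather than mid-crawl.
                if (!Directory.Exists(path))
                {
                    throw new DirectoryNotFoundException("Share not reachable: " + path);
                }
            }
        }
    }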

Again, the balance of resources will depend on what you want to crawl... (it wouldn't make sense to have a killer DB machine if you aren't storing tons of data...)

Does this answer your question?

You can check http://directory.google.com/ for sites to crawl.  AN comes pre-configured with about 1 million Priorities for WebPages, to crawl by priority, of course.
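
If you export a list of AbsoluteUris from a directory like that, seeding looks roughly like the sketch below. EnqueueCrawlRequest is a placeholder for whichever seeding mechanism your AN version exposes:

    using System;
    using System.IO;

    public static class SeedLoader
    {
        // Hypothetical sketch: read one absolute URI per line (e.g. a list
        // exported from a web directory) and hand each to the crawler.
        public static void LoadSeeds(string path, Action<Uri> enqueueCrawlRequest)
        {
            foreach (string line in File.ReadAllLines(path))
            {
                Uri uri;

                // Skip blank lines and anything that isn't an absolute URI.
                if (Uri.TryCreate(line.Trim(), UriKind.Absolute, out uri))
                {
                    enqueueCrawlRequest(uri);
                }
            }
        }
    }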

If you purchase a license (or licenses), I am more than happy to help you set AN up across multiple machines.

 


replied on Sat, Jan 30 2010 5:45 AM

Thank you for your wonderful reply! How can I purchase AN 1.4? And BTW, the buy link doesn't work!! Should I browse the site using Firefox? And how much will it cost?

Top 10 Contributor
1,905 Posts

You are very welcome! 

Try this direct link: https://checkout.google.com/view/buy?o=shoppingcart&shoppingcart=973929896308267

Try using Firefox. (It's really surprising that the Google Checkout link doesn't show up...)

Which version you need depends on how you will use it: Commercial or Personal.

Question: which browser/version are you using? (Thanks in advance for telling me...)


Top 10 Contributor
1,905 Posts

Also, v2.0 should be out today.



copyright 2004-2017, arachnode.net LLC