arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Can we delete unnecessary projects from the AN crawler?

Answered (Verified) This post has 1 verified answer | 3 Replies | 3 Followers

Top 25 Contributor
25 Posts
Dinesh posted on Tue, Jul 2 2013 12:43 AM

Hi Mike,

We are only using a few projects, like Console, Configuration, DataAccess, DataSource, Plugins, SiteCrawler, etc. We don't want to run all of the projects in the AN crawler, but I suspect there are dependencies between the projects.

1. How do we delete unnecessary projects like Web, Test, Administrations, Application, DemoFiles, Documentation, etc. from the AN crawler? Will it affect AN if we delete these unnecessary projects?

2. Are all projects necessary to run AN successfully?

3. One more thing we noticed: SQL Server is creating many secondary data files, like 'arachnode.net_1.ndf', which are causing memory problems. How do we avoid creating multiple secondary data files?

 

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

1.) Just unload them if you find the others distracting or you want to improve build time.  If you delete them and I update the solution, your solution may not merge.  If you do decide to delete projects, make a copy of your source before you do, just in case you find you need something in the source at a later date.

2.) No.  (example: Documentation isn't required...)

3.) There isn't a way to change this; it is a SQL Server feature, unrelated to AN, and I doubt that the best practice of splitting logical table groups into SQL Server FILEGROUPS is causing memory problems.
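
If you want to see what those .ndf files actually are, here is a rough sketch that lists each data file, the FILEGROUP it belongs to, and its size.  The connection string and database name are assumptions; point them at your own instance.

    using System;
    using System.Data.SqlClient;

    class ListDataFiles
    {
        static void Main()
        {
            // Assumption: adjust the server and database name to your installation.
            const string connectionString = "Server=.;Database=arachnode.net;Integrated Security=true;";

            const string sql =
                "SELECT f.name, f.physical_name, fg.name AS filegroup_name, f.size * 8 / 1024 AS size_mb " +
                "FROM sys.database_files f " +
                "LEFT JOIN sys.filegroups fg ON f.data_space_id = fg.data_space_id " +
                "ORDER BY size_mb DESC;";

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                connection.Open();

                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // sys.database_files reports size in 8 KB pages; the query converts to MB.
                        Console.WriteLine("{0} ({1}) -> {2}, {3} MB",
                            reader.GetString(0),
                            reader.GetString(1),
                            reader.IsDBNull(2) ? "(log file)" : reader.GetString(2),
                            reader.GetInt32(3));
                    }
                }
            }
        }
    }

The secondary files should line up with the FILEGROUPS the schema creates; their on-disk size is separate from SQL Server's memory use.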

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 25 Contributor
25 Posts

Hi Mike,

We have unloaded all of the projects except Cache, Configuration, Console, DataAccess, DataSource, Functions, Library, Performance, Plugins, Proxy, Renderer, SiteCrawler, and Structures. After unloading the rest, AN works fine, but I am concerned: does unloading the projects other than the ones above affect performance?

I have a few more questions about AN:

1. For instance, when we give 100 websites to the crawler, we can see the uncrawled requests count is 100, but some of the requests are not crawled. In the database we can see data for only 70 websites, which means 30 websites (AbsoluteUris) are missed. What could be the reason?

2. If AbsoluteUris are missed, where are they saved in the database, so that we can identify them and re-crawl them?

3. What are the color specifications on the console screen? I know green means the AbsoluteUri crawled successfully with a status code of 'OK'. What about the other colors? Sometimes we get a Forbidden status, but when we open the same URL in a browser it opens properly.

4. For many requests we get errors like "The remote name could not be resolved: www.trollandtoad.com". We thought this might be due to a missing http or https in front of the AbsoluteUri, but when we open the same link in a browser it opens correctly. What could be the reason?

5. For instance, one of the AbsoluteUris is http://www.BRMi.com, but when we open this page it redirects to http://public.brmiconsulting.com/. In this case, will AN crawl content from the http://public.brmiconsulting.com website, or will it give us an error like "The remote name could not be resolved"?

Top 10 Contributor
1,905 Posts

1.) You are probably crawling too fast for your internet connection.  Try looking in the Exceptions table, and in the DisallowedAbsoluteUris table too.  Look at the perfmon counters while you are crawling and compare against how fast your internet connection actually is; a quick way to sample that is sketched below.
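
A rough sketch of sampling NIC throughput during a crawl.  The "Network Interface" category and "Bytes Received/sec" counter are standard on Windows; taking instance [0] is an assumption, so pick the NIC you actually crawl through.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class NetworkThroughput
    {
        static void Main()
        {
            // Assumption: the first network interface is the one carrying the crawl traffic.
            string instance = new PerformanceCounterCategory("Network Interface").GetInstanceNames()[0];

            using (var bytesReceived = new PerformanceCounter("Network Interface", "Bytes Received/sec", instance))
            {
                bytesReceived.NextValue(); // the first sample is always 0; prime the counter

                for (int i = 0; i < 10; i++)
                {
                    Thread.Sleep(1000);
                    Console.WriteLine("{0:N0} KB/s received on {1}", bytesReceived.NextValue() / 1024, instance);
                }
            }
        }
    }

If the sustained number is already close to what your connection can deliver, the "missing" 30 sites are almost certainly failing under load rather than being skipped.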

2.) In the Exceptions table.
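
A rough sketch for pulling them out; the connection string and the dbo schema are assumptions, and SELECT * is used deliberately so no column names are guessed at.

    using System;
    using System.Data.SqlClient;

    class DumpExceptions
    {
        static void Main()
        {
            // Assumption: adjust the server and database name to your installation.
            const string connectionString = "Server=.;Database=arachnode.net;Integrated Security=true;";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("SELECT TOP 50 * FROM dbo.Exceptions;", connection))
            {
                connection.Open();

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Print every column so the failing AbsoluteUris and their messages are visible.
                        for (int i = 0; i < reader.FieldCount; i++)
                        {
                            Console.Write("{0}={1}  ", reader.GetName(i), reader.GetValue(i));
                        }

                        Console.WriteLine();
                    }
                }
            }
        }
    }

The AbsoluteUris you find there can be fed back into the crawl request list for another pass.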

3.) Look at ConsoleManager.cs; it will tell you what the colors mean.  Yellow means an exception of some sort, usually a WebException.  Red means a serious error, obviously.  :)  If you are getting Forbidden, you are trying to make too many connections to one web server at a time; use proxy servers.
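
Purely as an illustration of that convention (this is not the actual ConsoleManager.cs code, just the idea: green for a successful crawl, yellow for a web exception or a non-200 status, red for anything serious):

    using System;
    using System.Net;

    static class CrawlConsoleSketch
    {
        // Illustration only -- not arachnode.net source.
        public static void Report(string absoluteUri, HttpStatusCode statusCode, Exception exception)
        {
            if (exception is WebException)
            {
                Console.ForegroundColor = ConsoleColor.Yellow;   // web exception (timeouts, 403s, DNS failures, ...)
            }
            else if (exception != null)
            {
                Console.ForegroundColor = ConsoleColor.Red;      // serious, unexpected error
            }
            else
            {
                Console.ForegroundColor = statusCode == HttpStatusCode.OK
                    ? ConsoleColor.Green                         // crawled successfully
                    : ConsoleColor.Yellow;                       // crawled, but not a 200
            }

            Console.WriteLine("{0} -> {1}", absoluteUri, exception == null ? statusCode.ToString() : exception.Message);
            Console.ResetColor();
        }
    }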

4.) You are crawling too fast for your internet connection.
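
If you want to rule out a genuine DNS problem, here is a rough sketch; the host name is the one from your post, so swap in whichever host is failing.

    using System;
    using System.Net;
    using System.Net.Sockets;

    class DnsCheck
    {
        static void Main()
        {
            try
            {
                IPHostEntry entry = Dns.GetHostEntry("www.trollandtoad.com");

                foreach (IPAddress address in entry.AddressList)
                {
                    Console.WriteLine("Resolved: {0}", address);
                }
            }
            catch (SocketException exception)
            {
                Console.WriteLine("Resolution failed: {0}", exception.Message);
            }
        }
    }

If this resolves instantly on the crawl machine, the name itself is fine and the errors are load-related.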

5.) If 'ApplicationSettings.AllowAutoRedirect' is set to true, AN will gracefully follow the redirect and continue crawling.
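
To see what that looks like at the HTTP level, here is a rough sketch using plain HttpWebRequest (this is not arachnode.net code; its AllowAutoRedirect property controls the same behavior the AN setting turns on):

    using System;
    using System.Net;

    class RedirectCheck
    {
        static void Main()
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.BRMi.com");
            request.AllowAutoRedirect = true; // the behavior ApplicationSettings.AllowAutoRedirect enables

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine("Requested: http://www.BRMi.com");
                Console.WriteLine("Landed on: {0}", response.ResponseUri); // expected: http://public.brmiconsulting.com/, per the redirect described above
                Console.WriteLine("Status:    {0}", response.StatusCode);
            }
        }
    }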

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC