Regarding partial crawling - indexing


Top 10 Contributor, 83 Posts
InvestisDev posted on Thu, Oct 10 2013 11:23 PM

Hi,

We are currently looking into partial crawling and indexing. I also saw a previous post from this account at http://arachnode.net/forums/p/1747/16842.aspx. The problem is that we cannot store the already-crawled data in the database and on the filesystem, because of the number of sites we crawl and the size of the downloaded data.

My question is: is it possible to store the last-modified date in the index file and, based on that, crawl and download only the updated pages, updating the existing index with just the changed documents? Maybe I am being optimistic here, but any solution close to this would work great for us, given that we can't store the downloaded data on our own resources.
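For illustration, here is a minimal sketch of the kind of last-modified-driven, incremental index update described above, using the stock Lucene.NET 3.x API rather than AN's own indexing service. The field names, index path, and helper class are assumptions for the sake of the example, not AN's actual schema:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IncrementalIndexer
{
    // Keeps one Lucene document per URI, keyed by a "uri" term, with the
    // page's Last-Modified value stored alongside it.
    public static void UpsertPage(string absoluteUri, DateTime lastModifiedUtc, string text)
    {
        var directory = FSDirectory.Open(@"C:\LuceneIndex");
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var document = new Document();
            document.Add(new Field("uri", absoluteUri, Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.Add(new Field("lastmodified",
                DateTools.DateToString(lastModifiedUtc, DateTools.Resolution.SECOND),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.Add(new Field("text", text, Field.Store.NO, Field.Index.ANALYZED));

            // Replaces the existing document for this URI (or adds it if absent),
            // so only the pages that actually changed are re-indexed.
            writer.UpdateDocument(new Term("uri", absoluteUri), document);
            writer.Commit();
        }
    }
}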

Thanks

Verified Answer

Top 10 Contributor, 1,905 Posts | Verified by arachnode.net

AN won't crawl a page if the Last-Modified header indicates the page hasn't changed.
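A minimal sketch of that check, using plain System.Net rather than AN's internal crawler code; the helper name and the use of a HEAD request are assumptions for illustration:

using System;
using System.Net;

public static class ConditionalFetcher
{
    // Returns true when the server reports the page has changed since the last crawl.
    public static bool HasChangedSince(string absoluteUri, DateTime lastCrawledUtc)
    {
        var request = (HttpWebRequest)WebRequest.Create(absoluteUri);
        request.Method = "HEAD";
        request.IfModifiedSince = lastCrawledUtc; // sent as If-Modified-Since

        try
        {
            using (request.GetResponse())
            {
                return true; // 200 OK: the page changed, or the server ignored the header.
            }
        }
        catch (WebException ex)
        {
            var response = ex.Response as HttpWebResponse;

            if (response != null && response.StatusCode == HttpStatusCode.NotModified)
            {
                return false; // 304 Not Modified: nothing to re-download or re-index.
            }

            throw;
        }
    }
}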

Continuous indexing/searching is supported and provided out of the box. As I understand your solution, you have changed the service that does this - perhaps something has changed on your end?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor, 1,905 Posts

You can turn off the Console for production systems.

If you haven't already, take a look at the DB in the _RecycleBin folder in the 3.5 code. Use something like Red Gate SQL Compare to compare your DB with the 3.5 one; there is a significant performance upgrade I made to the DB when storing Discoveries that have an associated _Discoveries table in the DB.

You can probably increase the Max RAM. You can probably turn off 'InsertDisallowedAbsoluteUris' on production systems.
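As a rough sketch of that production profile - the settings object and property names below are stand-ins for whatever your AN version actually exposes, not a confirmed arachnode.net API:

public class CrawlerProductionSettings
{
    public bool EnableConsoleOutput { get; set; }
    public bool InsertDisallowedAbsoluteUris { get; set; }
    public int MaximumMemoryUsageInMB { get; set; }
}

public static class ProductionProfile
{
    // Mirrors the advice above: console output off, DisallowedAbsoluteUris
    // inserts off, and more RAM available to the crawler.
    public static void Apply(CrawlerProductionSettings settings)
    {
        settings.EnableConsoleOutput = false;          // no Console on production systems
        settings.InsertDisallowedAbsoluteUris = false; // skip these inserts in production
        settings.MaximumMemoryUsageInMB = 8192;        // example value only
    }
}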

Use an SSD for anything _Discovery-related, and for the HyperLinks, CrawlRequests, and Discoveries tables. Setting MaxThreads to anything higher than ~200 usually chokes TCP/IP.
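A small sketch of keeping outbound concurrency in step with the thread count on the crawling machine, using plain System.Net rather than AN internals; the numbers are examples only:

using System.Net;

public static class CrawlThrottle
{
    public static void Configure(int maxCrawlThreads)
    {
        // Allow roughly one outbound connection per crawl thread; the .NET
        // default of 2 connections per host would otherwise serialize requests.
        ServicePointManager.DefaultConnectionLimit = maxCrawlThreads;

        // Recycle idle sockets instead of holding them open indefinitely.
        ServicePointManager.MaxServicePointIdleTime = 30000; // milliseconds
    }
}

// Example: CrawlThrottle.Configure(30); // stays well under the ~200 ceiling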


Top 10 Contributor, 83 Posts

We will try the above suggestions and let you know how it goes.

arachnode.net:
Setting the MaxThreads to anything higher than ~200 usually chokes TCP/IP. 

Currently we are using 30 threads and are checking performance. Also, the TCP/IP choking you mentioned: is that on the server side, or on the machine where the crawler is installed?

Thanks

Top 10 Contributor, 1,905 Posts

The crawling machine. Sorry - it depends on the OS and a few TCP/IP settings in Windows.

30 threads is fine. If you are inserting HyperLinks, take a look at the DB in _RecycleBin in the 3.5 code; there is a tweak to the DB setup that greatly improves insert performance.

Also, look at the transactions per second; it is very easy to overrun just about any single-disk setup with AN as you switch on more and more features.
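One way to watch that figure from code, using the standard SQL Server performance counters; the counter category shown assumes a default SQL Server instance, and the database name is a placeholder:

using System;
using System.Diagnostics;
using System.Threading;

public static class TransactionMonitor
{
    public static void Sample(string databaseName)
    {
        // For a named instance, the category is "MSSQL$InstanceName:Databases".
        using (var counter = new PerformanceCounter("SQLServer:Databases", "Transactions/sec", databaseName, true))
        {
            counter.NextValue();   // the first read always returns 0; prime the counter
            Thread.Sleep(1000);

            Console.WriteLine("{0} transactions/sec: {1:N0}", databaseName, counter.NextValue());
        }
    }
}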

