arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

The document count differs on each crawl

This post has 1 verified answer | 6 Replies | 2 Followers

Top 10 Contributor
83 Posts
InvestisDev posted on Wed, Oct 23 2013 4:39 AM

Hi,

I crawled the same site (say, abc.com) twice, and I get a different number of documents in the index each time when I check it in Luke.

I tried this many times, but the count appears random; it always varies by roughly 10-20 documents. I also tried crawling with plain arachnode.net 2.6 + AN.Next and saw the same issue.

Can you please explain why this would happen?

Thanks

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Hehe, of course... :)

With a large site like abc.com the content is always changing.

Try running it against something like http://test.arachnode.net or the local test site http://localhost:56830/Test/1.htm and the Document counts will be the same.

Aloha,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
83 Posts

:)

Sorry for missing this, but abc.com is a site on my local machine, so it is guaranteed not to change between the two crawls.

I do get your point, though, that the document count shouldn't change.

Thanks

Top 10 Contributor
1,905 Posts

Any exceptions in the Exceptions table?

Top 10 Contributor
83 Posts

There are some 404s in the Exceptions table, but the count still varies every time I crawl. One thing we noticed is that for small sites the count stays the same, while for huge sites it varies. I believe that, according to you, this should not happen, right? If so, we need to look into the issue on our side and get back to you with some findings.
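
One way to narrow this down is to compare which URLs ended up in the index after each crawl, rather than only comparing the totals. The sketch below is a minimal illustration in Python, not part of arachnode.net: it assumes the indexed URLs from each run have already been exported to plain lists (for example, from Luke), and the function name `diff_crawls` and the sample URLs are hypothetical.

```python
def diff_crawls(first_run, second_run):
    """Return the URLs that appear in only one of two crawl runs."""
    first, second = set(first_run), set(second_run)
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
    }


# Hypothetical sample data: URLs exported from the index after each crawl.
run_a = ["http://abc.com/", "http://abc.com/a.htm", "http://abc.com/b.htm"]
run_b = ["http://abc.com/", "http://abc.com/b.htm", "http://abc.com/c.htm"]

result = diff_crawls(run_a, run_b)
print(result)
```

If the same handful of URLs keeps appearing on one side only, those are the pages worth checking against the Exceptions and DisallowedAbsoluteUris tables.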

Top 10 Contributor
1,905 Posts

Hopefully I did not misunderstand...

If you are crawling large sites like 'abc.com' or 'nbc.com' or 'msn.com' you will see broken pages...

The smaller the site, usually the smaller the churn.  Pages are added/removed all the time on larger sites.  If you were crawling 'abc.com' I would expect there would be 404's. To check a 404, load the page in a browser.

Some sites will return 404's when you are crawling too fast.
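
To tell a page that is really missing from a 404 caused by crawl speed, the failing URLs can be requested again one at a time with a pause between requests. The following Python sketch illustrates the idea under stated assumptions: `recheck_urls`, `still_missing`, and the one-second default delay are hypothetical names and values, not arachnode.net settings, and the status lookup is passed in as a function so the example runs without network access.

```python
import time


def recheck_urls(urls, fetch_status, delay_seconds=1.0, sleep=time.sleep):
    """Request each URL sequentially, pausing between requests.

    fetch_status is any callable that takes a URL and returns an HTTP
    status code. Returns a dict mapping each URL to the status it got.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # slow down so the server is not overwhelmed
        results[url] = fetch_status(url)
    return results


def still_missing(results):
    """Return the URLs that still answer 404 on the slow retry."""
    return sorted(url for url, status in results.items() if status == 404)


# Hypothetical stand-in for real HTTP requests.
fake_statuses = {"http://abc.com/a.htm": 200, "http://abc.com/gone.htm": 404}
results = recheck_urls(list(fake_statuses), fake_statuses.get, delay_seconds=0.0)
missing = still_missing(results)
print(missing)
```

URLs that return 200 on the slow retry but 404 during the crawl point toward crawl rate; URLs that stay 404 are more likely broken links on the site itself.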

Aloha,
Mike 

Top 10 Contributor
1,905 Posts

Local sites can still have the same error conditions; it's still a website (of course :)). If it is local, is the page really there?  Is dynamic content not being served properly?  Try test.arachnode.net; this won't change, and it has 14K+ static pages to test against.

Aloha,
Mike 

copyright 2004-2017, arachnode.net LLC