arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop


404 error, but can't find URLs

samar posted on Thu, May 24 2012 6:03 AM

Hi

When I run my crawler against a site, it runs as it should. But in the Exceptions table I can see that it is logging some 404 (Not Found) errors.

For example:

The AbsoluteUri1 column shows a URL: www.parenturl.com/something/?id=2222

And the AbsoluteUri2 column shows: www.childurl.com/something/?id=4444

I would have expected that if I went to www.parenturl.com/something/?id=2222 and looked at its source code, I would find the second URL, www.childurl.com/something/?id=4444. But this URL is nowhere to be found in the source of the first URL. (By the way, the second URL is in fact a dead link.)

So why can't I find the second link in the source code of the first link?

And where did the crawler get the second URL from?

Verified Answer (verified by arachnode.net)

Many sites serve different content to different User-Agent strings, so the source the crawler downloaded may differ from what you see in your browser.

It is possible that this site uses JavaScript to remove the dead links as a method of detecting spiders.
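
One way to check the User-Agent theory outside the crawler is to request the parent page twice, once with a crawler-style User-Agent and once with a browser-style one, and compare the links in each response. The Python sketch below is only an illustration and is not part of arachnode.net. The helper names (extract_links, fetch, links_unique_to) and both User-Agent strings are made up for the example, so substitute the User-Agent your crawler is actually configured to send. It only inspects the raw HTML, so links added or removed by JavaScript will not show up in it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Placeholder User-Agent strings (assumptions, not arachnode.net defaults).
CRAWLER_UA = "ExampleCrawler/1.0"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


class LinkCollector(HTMLParser):
    """Collects href/src attribute values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.add(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the set of absolute link targets found in the HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fetch(url, user_agent, timeout=10):
    """Download a page while sending the given User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def links_unique_to(html_a, html_b, base_url):
    """Return links that appear in html_a but not in html_b."""
    return extract_links(html_a, base_url) - extract_links(html_b, base_url)
```

To use it, call fetch on the AbsoluteUri1 value (with its scheme, for example http://) once with CRAWLER_UA and once with BROWSER_UA, then pass both results to links_unique_to. If the AbsoluteUri2 value shows up in the result, the site is most likely serving that link only to the crawler.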

Thanks,
Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC