arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Crawl not working.

Answered (Verified) This post has 1 verified answer | 13 Replies | 3 Followers

Top 50 Contributor
11 Posts
VE6CPU posted on Tue, Jul 21 2009 8:37 AM

I figured out how to get arachnode.net to crawl my site.  The problem now is it keeps adding /images to the URI, like this:

dt:07/21/2009 9:33:58 AM|ot:ProcessCrawlRequest|tn:1|crd:0|AbsoluteUri:http://skircr.com/images/images/images/images/new_sidemenu/section_about.gif

It found the original, which is at http://skircr.com/images/new_sidemenu/section_about.gif, and I'm not sure why it continues to add more and more /images to the URI.

Anybody have a clue as to what it's doing?

Stephen

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

I'll get a crawl going and see what I can see with your site.

So, I did and I'm not seeing the repeating AbsoluteUris.  Quick question: Which version are you using?  1.1?  1.2?

Also, you can turn off breaking on WebExceptions in the Debug > Exceptions menu.  This is a Visual Studio option, not an arachnode.net one.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

You have a page on your site that is invalid, but serves up valid HTML and relative links, thereby creating a spider trap.

I'll crawl your site, if you don't mind, and see if you've found a bug.  Let me know, OK?

Mike
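
The trap Mike describes can be sketched in a few lines.  The actual link on the broken page is unknown; a "../"-style relative link is assumed here because it reproduces the logged pattern exactly, with one more /images segment appearing on each round of resolution:

```csharp
using System;

class SpiderTrapDemo
{
    static void Main()
    {
        // Hypothetical: the invalid page serves HTML containing this
        // relative link (the real link on skircr.com is an assumption).
        const string relativeLink = "../images/new_sidemenu/section_about.gif";

        var pageUri = new Uri("http://skircr.com/images/new_sidemenu/section_about.gif");

        // Each time the crawler fetches the "page" and resolves the same
        // relative link against it, another /images segment accumulates.
        for (int i = 0; i < 3; i++)
        {
            pageUri = new Uri(pageUri, relativeLink);
            Console.WriteLine(pageUri.AbsoluteUri);
        }
        // http://skircr.com/images/images/new_sidemenu/section_about.gif
        // http://skircr.com/images/images/images/new_sidemenu/section_about.gif
        // http://skircr.com/images/images/images/images/new_sidemenu/section_about.gif
    }
}
```

Because the server answers every such URI with the same HTML instead of a 404, the crawler never runs out of "new" links.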

 


Top 50 Contributor
11 Posts
VE6CPU replied on Wed, Jul 22 2009 9:06 AM

Go right ahead and thanks for the help.  I did comment out the

throw new WebException(webException.Message, webException);

in WebClient.cs line 170, as it kept throwing 404 and 406 errors and I would have to keep hitting F5 to continue execution.

Stephen
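
Commenting out the rethrow works, but it also hides genuinely unexpected failures.  A gentler alternative (illustrative names only, not arachnode.net's actual WebClient code) is to catch the WebException, skip the expected statuses, and rethrow everything else:

```csharp
using System;
using System.Net;

static class CrawlHelper
{
    // Sketch: download a page, treating 404/406 as skippable rather
    // than fatal, so the crawl continues without breaking into the
    // debugger on every missing resource.
    public static byte[] TryDownload(string absoluteUri)
    {
        using (var webClient = new System.Net.WebClient())
        {
            try
            {
                return webClient.DownloadData(absoluteUri);
            }
            catch (WebException webException)
            {
                var response = webException.Response as HttpWebResponse;
                if (response != null &&
                    (response.StatusCode == HttpStatusCode.NotFound ||
                     response.StatusCode == HttpStatusCode.NotAcceptable))
                {
                    // Expected for missing/refused resources: log and move on.
                    Console.WriteLine("Skipped {0}: {1}", absoluteUri, response.StatusCode);
                    return null;
                }
                throw; // unexpected failures still surface
            }
        }
    }
}
```

The same end is reached non-invasively with the Debug > Exceptions setting Mike mentions, which leaves the crawler source untouched.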


Top 50 Contributor
11 Posts
VE6CPU replied on Thu, Jul 23 2009 8:18 AM

I am now using the CVS version.  Just downloaded this morning and will be trying it out today.

I'll turn it off and see what happens.

Top 10 Contributor
1,905 Posts

Cool.  Let me know - I didn't see any repeating AbsoluteUris in the DB using 1.2.

Mike


Top 50 Contributor
11 Posts
VE6CPU replied on Fri, Jul 24 2009 9:42 AM

Everything seems to crawl just fine now.  Thanks for your help.

Top 10 Contributor
1,905 Posts

You are very welcome.

Mike


Top 50 Contributor
7 Posts

I have followed the instructions, but I still get a WebException.  What should I do?

How can I turn off this exception?

Thanks for any help!

 

 

 

Top 50 Contributor
7 Posts

I find that when "robots.txt" does not exist, a WebException is produced, for example on www.taobao.com.  What should I do?

Top 10 Contributor
1,905 Posts

This is the correct behavior for robots.txt.

Did you follow the instructions for the solution listed here: http://arachnode.net/forums/p/321/10290.aspx ?

(Debug > Exceptions...)

-Mike
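
A quick way to see why this is expected behavior rather than a bug (an illustrative check, not arachnode.net's actual robots.txt code):

```csharp
using System;
using System.Net;

class RobotsTxtCheck
{
    static void Main()
    {
        // Requesting a robots.txt that does not exist yields a 404,
        // which .NET surfaces as a WebException.  The crawler catches
        // this; with break-on-WebException enabled in Visual Studio,
        // the debugger stops here even though nothing is wrong.
        var request = (HttpWebRequest)WebRequest.Create("http://www.taobao.com/robots.txt");
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine("robots.txt found: " + response.StatusCode);
            }
        }
        catch (WebException webException)
        {
            // A 404 here is normal: no robots.txt means no restrictions.
            Console.WriteLine("No robots.txt: " + webException.Message);
        }
    }
}
```

Disabling break-on-WebException under Debug > Exceptions lets the crawl run through these without stopping.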


Top 50 Contributor
7 Posts

Oh, I forgot that.  Thank you very much!

 

Top 10 Contributor
1,905 Posts

You are very welcome!


replied on Wed, Sep 30 2009 3:27 PM

Mike you are a genius!! thanks so much for all your help!!


copyright 2004-2017, arachnode.net LLC