arachnode.net

how to add a new link during the crawl?

Answered (Verified) This post has 1 verified answer | 5 Replies | 2 Followers

Top 10 Contributor
229 Posts
megetron posted on Mon, Jan 4 2010 3:39 PM

Hello,

It's been a bad few months for me, but I'm happy to announce I am back to business, and the crawling is on course again.

There is a page called video.asp?id=100, and on this page there are page-navigation links. These links look like this: video.asp?id=100?page=2 ; video.asp?id=100?page=3 and so on.

Now, to crawl these pages, what I do in the console application is add all pages from id=0 to id=1000, so I can be sure I crawl every page without crawling unnecessary ones. The depth of this crawl is 0, so no other CrawlRequest is added by AN, which is exactly what I need.
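
For reference, my seeding loop in the console application looks roughly like this (I'm quoting it from memory, so the Crawler setup and the CrawlRequest constructor arguments may not match the exact overloads in the current release; example.com stands in for the real site):

    // Rough sketch of the seeding loop (from memory): "crawler" is the Crawler
    // instance the console app creates, example.com stands in for the real site,
    // and the CrawlRequest constructor arguments may not match the exact overload
    // in the current release.
    for (int id = 0; id <= 1000; id++)
    {
        string absoluteUri = "http://www.example.com/video.asp?id=" + id;

        // Depth 0 so AN does not add any further CrawlRequests on its own.
        crawler.Crawl(new CrawlRequest(absoluteUri, 0));
    }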

The problem begins when there are a few video.asp pages that contain page=1, page=2 and so on, and so I will have to add new CrawlRequests from inside the plugin.

So, my first question is how to detect how many pages there are for each video.asp?id={0} page.

The second question: is it possible to add a CrawlRequest from inside the plugin?

Thanks for the support.

Best regards,

Verified Answer

Top 10 Contributor
1,905 Posts

Welcome back, megetron!

1.) Each CrawlRequest will come through the Plugin system, even if the Crawl returns a 404.  So, if you are crawling video.asp?id=1900, check the WebResponse/WebException (crawlRequest.WebClient.WebException != null means the request failed); if the response is OK, you know you can create another CrawlRequest.

2.) Yes, you can add additional Crawls from a plugin.  Use: crawlRequest.Crawl.Crawler.Crawl(...); see the sketch below.
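
Something along these lines inside your plugin's per-request hook should cover both points. I'm writing it from memory, so treat the hook name, the Discovery property path and the CrawlRequest constructor arguments as assumptions, and check the stock CrawlActions and the Console project for the exact signatures in your build:

    // Sketch only (written from memory): the hook name, the Discovery property
    // path and the CrawlRequest constructor arguments are assumptions; verify
    // them against the stock CrawlActions in your build.
    // (Needs using System.Text.RegularExpressions; for the Regex call.)
    public override void PerformAction(CrawlRequest crawlRequest)
    {
        // 1.) A WebException means the request failed (404, timeout, etc.), so stop paging.
        if (crawlRequest.WebClient.WebException != null)
        {
            return;
        }

        string absoluteUri = crawlRequest.Discovery.Uri.AbsoluteUri;

        if (!absoluteUri.Contains("video.asp?id="))
        {
            return;
        }

        // Figure out which page was just crawled and build the next one, keeping
        // the same URI shape the site uses (video.asp?id=100?page=2, ?page=3, ...).
        string nextPageUri;
        Match match = Regex.Match(absoluteUri, @"\?page=(\d+)$");

        if (match.Success)
        {
            int nextPage = int.Parse(match.Groups[1].Value) + 1;
            nextPageUri = absoluteUri.Substring(0, match.Index) + "?page=" + nextPage;
        }
        else
        {
            nextPageUri = absoluteUri + "?page=2";
        }

        // 2.) Submit an additional Crawl from inside the plugin.
        crawlRequest.Crawl.Crawler.Crawl(new CrawlRequest(nextPageUri, 0));
    }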

Happy crawling, and, again, welcome back!

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 10 Contributor
229 Posts

arachnode.net:

1.) Each CrawlRequest will come through the Plugin system, even if the Crawl returns a 404.  So, if you are crawling video.asp?id=1900, check the WebResponse/WebException (crawlRequest.WebClient.WebException != null means the request failed); if the response is OK, you know you can create another CrawlRequest.

What do you do if a page like video.asp?id=1900 is handled by the webserver, so no 404 error is returned? How do you then detect that the page does not really exist? I once solved this by checking the content length in the rules section, but that was only relevant to one specific site.
Is there a better way to detect this?

Top 10 Contributor
1,905 Posts

Yeah - I think you would have to write something specific for each site.  Basically, this is the case where the dynamic page doesn't exist in the database, but the webserver returns a friendly page instead of a 404... Look for a specific string, or the absence of a specific string, to determine whether you have a page that really exists.  I can't think of a better way, honestly...
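
If it helps, the check itself can be as small as the snippet below. It's plain C# and not tied to the AN API; the two marker strings are just placeholders you would replace with whatever your target site prints on its friendly "not found" page, and how you get the page HTML inside your plugin depends on which hook you're in:

    // Plain C# helper, not tied to the AN API.  The two marker strings are
    // examples only; replace them with text that appears on your site's friendly
    // error page (or only on real video pages).
    private static bool PageReallyExists(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return false;
        }

        const string notFoundMarker = "Sorry, this video could not be found"; // only on the friendly error page
        const string videoMarker = "<div class=\"video-player\"";             // only on real video pages

        if (html.IndexOf(notFoundMarker, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            return false;
        }

        return html.IndexOf(videoMarker, StringComparison.OrdinalIgnoreCase) >= 0;
    }

Call it from the same plugin hook that queues the next CrawlRequest, and just return without queuing anything when it comes back false.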

Did you get your download?


Top 10 Contributor
229 Posts

Thanks Mike, I am not familiar with any other trick either. I am adding the relevant CrawlRequest from the plugin as you suggested, and it works fine.

 

About the download, yes, I got it. I am still checking out the new features of 1.3.

Where can I find the new 1.4 release features?

 

Top 10 Contributor
1,905 Posts
These are the new features since 1.3:

  • dynamic data administration
  • .doc/.pdf/.ppt/.xls indexing - check the Web project, and you'll see you can now search Files and Images.

