Quick Start Guide


Top 150 Contributor
2 Posts
gondreus posted on Tue, Apr 14 2015 7:30 PM
Hi Mike,
Thank you for the support. I've managed to download and run the console app. This is really a great product! But the info/documentation is really overwhelming; I just wish you could put up a Quick Start guide. I've been searching the forums but got information overload. :D Hope you can help me with that.
Basically what I'm trying to do is:

1. Crawl a website, let's just call it mywebsite.com.
2. Don't crawl subdomain.mywebsite.com or other external links.
3. Don't crawl a URL whose path contains '/dont/', e.g. mywebsite.com/dont/.
4. (And/or) crawl only URLs with a specific path like '/this/', e.g. mywebsite.com/(.*)/this/(.*)/.
5. For every page that is crawled, gather the info and put it in the DB. I'm trying to extract:
   a. Title
   b. HTTP response code (500, 503, 404, 301, 302, 410, etc.)
   c. The URL
   d. Don't download the CSS, JS, images, icons, etc.
   e. The page (HTML)
   f. Some other variables, each in its own column (I know how to use HtmlAgilityPack)
6. Limit the crawl to x pages.
7. (And/or) limit the crawl to x depth.
8. Stop and resume.
The end goal is to build a sitemap, and I'm also planning to learn about NLP, classification, categorization, and context analysis for each page (if you have any resources to share, I'd really appreciate it).
If you already have this quick start guide somewhere, maybe you can point me to where it is.

Thank you,

Verified Answers

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Thank you very much for the compliments!  I really appreciate hearing them... there IS a lot to AN, and it can take a bit to get through everything it offers.

I am going to answer your question in stages, as AN is very popular and I have quite a few people to craft answers/support for. :)  If you subscribe to this post via email, the site will notify you as I make updates to this forum post.

Feature (and documentation) development is largely driven by customer requests.

Things like this: http://arachnode.net/blogs/arachnode_net/archive/2015/04/01/ajax-dynamic-content.aspx and http://arachnode.net/blogs/arachnode_net/archive/2015/02/18/arachnode-net-mysql-xamarin-port-in-development-now.aspx are largely customer driven.

The basics of understanding how AN works:

I will be adding comments to this file shortly - check SVN in an hour or two. This file, Console\CrawlRequests.txt, is how the initial experience is fed CrawlRequests - the C# class that is passed through the Crawl pipeline and contains all information about the Crawl and the result of the request.
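If you would rather create CrawlRequests in code than through CrawlRequests.txt, the shape is roughly the following. This is a sketch only - the constructor parameters and member/enum names are assumptions; Console\Program.cs shows the canonical setup:

    // Sketch only - constructor parameters and enum/member names are assumptions;
    // see Console\Program.cs for the real signatures.
    Crawler<ArachnodeDAO> crawler = new Crawler<ArachnodeDAO>(CrawlMode.DepthFirstByPriority, false);

    CrawlRequest<ArachnodeDAO> crawlRequest = new CrawlRequest<ArachnodeDAO>(
        new Discovery<ArachnodeDAO>("http://mywebsite.com/"),   // the seed AbsoluteUri
        3,                                                      // maximum crawl depth
        UriClassificationType.Host,                             // restrict the crawl to the Host
        UriClassificationType.Host,                             // restrict discoveries to the Host
        1);                                                     // priority

    crawler.Crawl(crawlRequest);   // queue the request into the Crawl pipeline
    crawler.Engine.Start();        // member name assumed - Program.cs shows how crawling is actually kicked off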

Also, look at Console\Program.cs - these settings control the initial state and behavior of AN, and are named according to what they actually do.
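For example, two of the settings referenced later in this post, as they might appear there (these two property names come from this post; verify anything else against the XML comments in ApplicationSettings.cs):

    // Only download/keep the HTML - skip the CSS, JS, images, icons, etc.
    ApplicationSettings.ExtractFilesAndImages = false;

    // Parse each WebPage into ((ManagedWebPage)crawlRequest.ManagedDiscovery).HtmlDocument
    // (an HtmlAgilityPack.HtmlDocument) so the title and other metadata can be read directly.
    ApplicationSettings.ExtractWebPageMetaData = true;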

OK, more in a bit...

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

1.) OK.

2. & 3. & 4.) Not crawling external links is covered by setting the RestrictCrawlTo parameter in the CrawlRequest constructor to UriClassificationType.Domain.  Since you want to restrict to the Host (the Host usually means a specific subdomain, like cars.msn.com, as opposed to the Domain, msn.com) along with other sets of rules, you will want to look at creating a Plugin - AN can filter on everything except 'mywebsite.com/(.*)/this/(.*)/' out of the box - and for combining your business logic with AN's 'IsDisallowed' functionality, it is best to create a Plugin: https://arachnode.net/Content/CreatingPlugins.aspx  There are existing/complete/tested examples of Plugins at Plugins\CrawlRules - applicable to you are 'AbsoluteUri.cs' and 'StatusCode.cs'.
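To give rules 3 and 4 a concrete shape, a custom CrawlRule could look roughly like this. The base class name and override signature below are assumptions - mirror the real ones (and the usings) from Plugins\CrawlRules\AbsoluteUri.cs:

    using System.Text.RegularExpressions;

    // Sketch of a CrawlRule Plugin; 'ACrawlRule' and the 'IsDisallowed' override are assumptions -
    // copy the actual base class and signature from Plugins\CrawlRules\AbsoluteUri.cs.
    public class PathCrawlRule : ACrawlRule<ArachnodeDAO>
    {
        public override bool IsDisallowed(CrawlRequest<ArachnodeDAO> crawlRequest)
        {
            // crawlRequest.Discovery.AbsoluteUri is 'your url' (treated as a string here).
            string absoluteUri = crawlRequest.Discovery.AbsoluteUri.ToString();

            // 3.) Never crawl a path containing '/dont/'.
            if (absoluteUri.Contains("/dont/"))
            {
                return true;
            }

            // 4.) Only allow URIs matching 'mywebsite.com/(.*)/this/(.*)/'.
            // Note: applied strictly, this also filters the seed page - relax as needed.
            return !Regex.IsMatch(absoluteUri, @"mywebsite\.com/.*/this/.*");
        }
    }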

Plugins are enabled either from the cfg.CrawlActions, cfg.CrawlRules and/or cfg.EngineActions database tables, or through code.

Here is how to easily enable them through code:
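Roughly like this - the indexer and IsEnabled members below are assumptions; Console\Program.cs contains the canonical calls for toggling Plugins without editing the cfg.* tables:

    // Sketch only - member names are assumptions; see Console\Program.cs for how
    // CrawlRules/CrawlActions are actually enabled from code.
    crawler.CrawlRules["AbsoluteUri"].IsEnabled = true;     // Plugins\CrawlRules\AbsoluteUri.cs
    crawler.CrawlRules["StatusCode"].IsEnabled = true;      // Plugins\CrawlRules\StatusCode.cs
    crawler.CrawlRules["PathCrawlRule"].IsEnabled = true;   // the custom rule sketched above (hypothetical)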


5.) If the WebPage doesn't eventually resolve to a '200' (it can be redirected), it will be logged in the Exceptions table with the StatusCode.

If you already know HtmlAgilityPack then extracting the title is easy.  :)  Grab it by creating your own HtmlAgilityPack.HtmlDocument, or set ApplicationSettings.ExtractWebPageMetaData = true; and look at ((ManagedWebPage)crawlRequest.ManagedDiscovery).HtmlDocument in a PostRequest CrawlAction (not a CrawlRule) - CrawlActions are intended to act on the downloaded data, CrawlRules are intended to direct crawling.
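A sketch of such a PostRequest CrawlAction - the base class name and override are assumptions, so copy the real signatures (and usings) from an existing CrawlAction Plugin in the solution:

    // Sketch only - 'ACrawlAction'/'PerformAction' are assumptions; mirror an existing CrawlAction Plugin.
    public class ExtractTitleCrawlAction : ACrawlAction<ArachnodeDAO>
    {
        public override void PerformAction(CrawlRequest<ArachnodeDAO> crawlRequest)
        {
            // Requires ApplicationSettings.ExtractWebPageMetaData = true;
            HtmlAgilityPack.HtmlDocument htmlDocument = ((ManagedWebPage)crawlRequest.ManagedDiscovery).HtmlDocument;

            HtmlAgilityPack.HtmlNode titleNode = htmlDocument.DocumentNode.SelectSingleNode("//title");
            string title = titleNode != null ? titleNode.InnerText.Trim() : null;

            // a./c./f.) Store the title (and any other custom columns) against
            // crawlRequest.Discovery.AbsoluteUri, e.g. in the 'Documents' table.
        }
    }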

If you want to bite this knowledge off one chunk at a time, look at Console\Program.cs -> private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAO> crawlRequest) - the same information is available there too.
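For example (a sketch - the AbsoluteUri and HtmlDocument members come from this post; where the StatusCode and raw HTML live varies, so inspect the CrawlRequest in the debugger rather than trusting names here):

    private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAO> crawlRequest)
    {
        // c.) the url
        string absoluteUri = crawlRequest.Discovery.AbsoluteUri.ToString();

        // a./e.) the parsed page - the same HtmlDocument as in the CrawlAction sketch above
        // (requires ApplicationSettings.ExtractWebPageMetaData = true;).
        ManagedWebPage managedWebPage = crawlRequest.ManagedDiscovery as ManagedWebPage;

        // b.) the StatusCode (and anything else): inspect crawlRequest in the debugger, or use
        // Functions.ExtractResponseHeader(...) after the fact - member names vary by AN version.
    }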

If you don't want to download anything but HTML, set ApplicationSettings.ExtractFilesAndImages = false;  Most things (hopefully everything) in AN are named exactly according to what they do - also, check the XML comments in ApplicationSettings.cs - there should be a comment for every setting.

AbsoluteUri is a thematic/persistent concept throughout the AN world: It is stored in the DB and properly referenced (DB ForeignKey) throughout.  Look at crawlRequest.Discovery.AbsoluteUri - this is your 'url'.

Look at the 'Documents' table - it is an open storage table already in place for you to store your custom properties.

Look at Functions.ExtractResponseHeader(...); - you can extract any piece of information after the fact if you prefer to crawl first, then extract and store.

6.) The Crawler has all of these properties:

7.) You can crawl by Breadth or Depth - set this in the Crawler(...); constructor.  The CrawlRequest parameter [Depth] sets the maximum depth (hops away - 1) that AN will crawl.
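For example (the CrawlMode value names are assumptions - check the actual enum in the solution):

    // Breadth-first instead of depth-first - chosen when the Crawler is constructed.
    Crawler<ArachnodeDAO> crawler = new Crawler<ArachnodeDAO>(CrawlMode.BreadthFirstByPriority, false);

    // The [Depth] parameter on each CrawlRequest (see the earlier sketch) caps how far
    // AN follows links away from that seed.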

8.) Run the Crawler in a loop - you can get to every property available through the Crawler or through the CrawlRequest, e.g. crawlRequest.Crawl.Crawler.Engine.Stop();
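As a sketch, 6.) and 8.) can be combined in Engine_CrawlRequestCompleted - Engine.Stop() comes from above; the counter, and whether unfinished CrawlRequests are persisted for a later resume, are assumptions to verify against your ApplicationSettings:

    private static int _pagesCrawled;

    private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAO> crawlRequest)
    {
        // 6.) Limit the crawl to x pages (Interlocked because the Engine is multi-threaded).
        if (System.Threading.Interlocked.Increment(ref _pagesCrawled) >= 1000)
        {
            // 8.) Stop the Engine; start it again later to resume - verify whether unfinished
            // CrawlRequests are persisted between runs for your settings before relying on this.
            crawlRequest.Crawl.Crawler.Engine.Stop();
        }
    }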

9.) NLP Tools: https://arachnode.net/blogs/arachnode_net/archive/2009/03/13/sentiment-text-mining-tools.aspx

And now you have your Quick Start guide.  :)

Thanks,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 150 Contributor
2 Posts

Thanks Mike for the guide, this will definitely help me get started. I'll try it.

 
