arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Crawl a site without its subdomains

InvestisDev posted on Sun, Mar 4 2012 11:41 PM

Hello,

I want to crawl a site such as "http://www.xyz.com", which has subdomains like

http://login.xyz.com

http://blog.xyz.com, etc.

When I crawl "http://www.xyz.com", I want to exclude its subdomains from the crawl.

I checked your forum thread "http://arachnode.net/forums/p/106/401.aspx#401". Is that the proper way, or does the upgraded version have a flag or similar option to stop subdomains from being crawled along with the main site?

Thanks,


Verified Answer

Yes, use the UriClassificationType.Host value.
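
To illustrate what restricting by host means in practice, here is a minimal sketch in plain C#. It is not arachnode.net code: HostFilter and IsSameHost are hypothetical names, and the only library call used is System.Uri.Host. Where UriClassificationType.Host is actually passed depends on your arachnode.net version, so check the crawl request setup in your copy of the source.

```csharp
using System;

// Hypothetical helper, NOT part of arachnode.net.
// It only illustrates host-level matching: "www.xyz.com" and
// "login.xyz.com" share a domain but are different hosts.
public static class HostFilter
{
    // Returns true only when both absolute URLs have exactly the same host name.
    public static bool IsSameHost(string seedUrl, string candidateUrl)
    {
        var seed = new Uri(seedUrl);
        var candidate = new Uri(candidateUrl);

        // Host names are case-insensitive, so compare them that way.
        return string.Equals(seed.Host, candidate.Host, StringComparison.OrdinalIgnoreCase);
    }
}
```

With "http://www.xyz.com" as the seed, IsSameHost accepts "http://www.xyz.com/about" and rejects "http://login.xyz.com" and "http://blog.xyz.com", which is the behavior asked for in the question.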

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

copyright 2004-2017, arachnode.net LLC