An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Dynamic list of websites to be crawled

Answered (Not Verified) This post has 0 verified answers | 18 Replies | 2 Followers

Top 200 Contributor
1 Posts
Sergii posted on Mon, Apr 30 2012 7:23 PM


I am thinking of using your product to crawl up to a million websites. My goal is to find the top websites that abound in PDF documents. I am not really interested in indexing, or in downloading content and storing it to a hard drive. I just need PDFs discovered within only the sites I specify, but with unlimited crawl depth.

The list of websites will be dynamic. Is it possible to add new crawl requests dynamically, without changing source code as is done in the demo console application?

And the last question: can I run the crawler on a scheduled basis?

P.S: - not found

Thank you,


All Replies

Top 10 Contributor
1,905 Posts


Yes, AN can help you.  You can easily filter for .pdf documents and nothing else: simply crawl pages, validate the content type of suspected .pdf's, and only download and process those documents.
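The content-type check described above might look something like this; this is a hedged, self-contained sketch (the class and method names `PdfFilter` / `LooksLikePdf` are illustrative, not part of AN's API), deciding whether a resource is worth downloading by its Content-Type header or URI extension:

```csharp
using System;

class PdfFilter
{
    // Returns true when a crawled resource looks like a PDF document,
    // judged by the HTTP Content-Type header first, then the URI extension.
    public static bool LooksLikePdf(string absoluteUri, string contentType)
    {
        if (!string.IsNullOrEmpty(contentType) &&
            contentType.StartsWith("application/pdf", StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }

        return absoluteUri != null &&
               absoluteUri.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(LooksLikePdf("http://example.com/a.pdf", "application/pdf")); // True
        Console.WriteLine(LooksLikePdf("http://example.com/page.html", "text/html"));   // False
    }
}
```

Checking the header before the extension matters because many PDFs are served from extensionless URIs (e.g. download handlers), while the extension check catches servers that send a generic Content-Type.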

It is possible to add them dynamically.  You could add them to the CrawlRequests database table while crawling, or have your process read from CrawlRequests.txt, as the Service (Service project) does.
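The file-based route might be sketched as follows; the file name and the one-URI-per-line format are assumptions modeled on the Service project's CrawlRequests.txt, so check the shipped file for the exact format AN expects:

```csharp
using System;
using System.IO;

class CrawlRequestWriter
{
    // Appends one absolute URI per line; a polling service can pick up
    // new lines on its next pass without any source-code change.
    public static void AddCrawlRequest(string filePath, string absoluteUri)
    {
        File.AppendAllText(filePath, absoluteUri + Environment.NewLine);
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "CrawlRequests.txt");
        AddCrawlRequest(path, "http://example.com/");
        Console.WriteLine(File.ReadAllText(path));
    }
}
```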

Yes, it's easily possible to run AN on a schedule.
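For scheduling, the simplest option is usually Windows Task Scheduler pointed at the console or Service executable, but an in-process timer also works; a minimal sketch, where `StartCrawl` is a hypothetical placeholder for kicking off a crawl pass:

```csharp
using System;
using System.Threading;

class ScheduledCrawl
{
    // Placeholder callback: start or resume a crawl pass here.
    static void StartCrawl(object state)
    {
        Console.WriteLine("Crawl pass started at {0:u}", DateTime.UtcNow);
    }

    static void Main()
    {
        // Fire once immediately, then repeat every 24 hours.
        using (var timer = new Timer(StartCrawl, null, TimeSpan.Zero, TimeSpan.FromHours(24)))
        {
            Console.ReadLine(); // keep the process alive until Enter/EOF
        }
    }
}
```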

Thanks for the broken link.  Looks like CommunityServer needs to re-index the Media galleries.


For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, LLC