An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

New Features/Fixes in 2.5

Hello all:

It is release time again, and time to list everything that is new and improved.

  • Fixed a bug in IsDisallowed.cs where Disallowed* wasn't being processed properly under certain conditions.
  • Added Web\Test.aspx for use in performance testing and crawl verification.
  • Fixed a bug in EXIFExtractor.cs which would throw a NullReferenceException (NRE) under certain conditions.
  • Added Utilities.Web functions for resetting the ASP.NET web servers as well as IIS.
  • Updated ManageLuceneDotNetIndexes.cs to use Lucene.NET
  • Added a CrawlMode option (BreadthFirst or DepthFirst) to the Crawler.
  • Added additional tests for Crawler.cs.
  • Added IEnumerable to PriorityQueue.cs, enabling foreach.
  • Added HttpWebRequestRetries, thereby allowing AN to retry unresponsive WebPages at the end of the normal crawl cycle.
  • Changed the 'Platform Target' to x86 for DEBUG to allow 'Edit and Continue' on x64-based systems.
  • Updated API Documentation using the latest version of NDoc.
  • Added 'Renderer', allowing complete rendering of AJAX/JavaScript-based sites and content, and allowing dynamic DOM interaction.
  • Added EngineActions for populating CrawlRequests from alternate sources.
  • Added Storable.cs to illustrate how to use the new AStorable functionality.
  • Updated the Lucene.Net Highlighter functionality to provide better results when search text is found in HTML tags or SCRIPT.
  • Added CustomManageLuceneDotNetIndexes.cs, illustrating how to implement custom fields in the Lucene.Net indexes.
  • Added DiscoveryChain.cs, which shows explicitly how a Discovery was ultimately found.
  • Improved Lucene.Net indexing speed and AutoCommit functionality.
  • Added ExceptionSeverity designation for configuration exceptions.
  • Improved the Service, and added facilities to recrawl a seed list of AbsoluteUris from the Service.
  • Added additional Application Log events for Engine State changes.
  • Added helper code to Console\Program.cs to enhance the DEMO experience.
  • Added SiteCrawler\Actions\Renderer.cs, illustrating how to interact with the DOM.
  • Improved Cache handling in WebClient.cs.
  • Improved Cookie handling in WebClient.cs, allowing the currently logged-on user's cookies to be submitted with each HttpWebRequest.
  • Improved Cache/Discovery handling under low-memory conditions.
  • Allowed CrawlActions/CrawlRules/EngineActions to be enabled or disabled dynamically at crawl time.
  • Implemented AStorable.cs, which allows for selective storage of content while still allowing crawling to proceed.
  • Improved Cache handling when AN cannot locate cached content due to user configuration error or data loss.
  • Improved handling of malformed ContentTypes returned from HttpWebRequests.
  • Improved RegEx handling for HyperLinks.
  • Improved Encoding detection when an improper set of Encoding detection attributes is returned from HttpWebRequests.
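
For anyone wondering what "Added IEnumerable to PriorityQueue.cs, enabling foreach" means in practice, here is a rough sketch. It is not the PriorityQueue.cs from the release: SketchPriorityQueue, its Enqueue signature, the integer priorities, and the example.com AbsoluteUris are all made up for illustration. It only shows the general idea that once a queue implements IEnumerable<T>, it can be walked with foreach without dequeuing anything.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical sketch; NOT the PriorityQueue.cs that ships with the crawler.
public sealed class SketchPriorityQueue<T> : IEnumerable<T>
{
    // Buckets are kept sorted by priority; each bucket is a FIFO queue.
    private readonly SortedDictionary<int, Queue<T>> _buckets =
        new SortedDictionary<int, Queue<T>>();

    public int Count { get; private set; }

    // Assumption for this sketch: a lower number means "crawl sooner".
    public void Enqueue(T item, int priority)
    {
        Queue<T> bucket;
        if (!_buckets.TryGetValue(priority, out bucket))
        {
            bucket = new Queue<T>();
            _buckets.Add(priority, bucket);
        }

        bucket.Enqueue(item);
        Count++;
    }

    // Yields items in ascending priority order without removing them.
    public IEnumerator<T> GetEnumerator()
    {
        foreach (KeyValuePair<int, Queue<T>> pair in _buckets)
        {
            foreach (T item in pair.Value)
            {
                yield return item;
            }
        }
    }

    // Non-generic enumerator required by IEnumerable.
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

public static class Program
{
    public static void Main()
    {
        var queue = new SketchPriorityQueue<string>();
        queue.Enqueue("http://example.com/deep/page", 2);
        queue.Enqueue("http://example.com/", 0);
        queue.Enqueue("http://example.com/about", 1);

        // foreach works because the queue implements IEnumerable<T>.
        foreach (string absoluteUri in queue)
        {
            Console.WriteLine(absoluteUri);
        }
    }
}
```

Because enumeration in this sketch does not dequeue, the queue can be inspected (for logging or tests, say) while leaving its contents untouched.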

Posted Thu, Jun 17 2010 4:53 PM by

copyright 2004-2017, LLC