An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

Overview
A C# class library that downloads content from the internet, optionally renders it and allows interaction like a browser, indexes the content and provides methods to customize the process.

Technologies: Lucene.NET, HtmlAgilityPack, NClassifier, OpenTextSummarizer, RSS.NET, iTextSharp

Usages: C# web crawling, C# site crawling, C# screen scraping, C# data analysis, C# data mining, C# site mirroring

Key Features

.NET Architecture
The most comprehensive open source C#/.NET web crawler available.  Use it from any .NET language.

Configurable Rules and Actions
Implement custom pre- and post-request crawl rules and actions without source recompilation.  The existing crawl rules and actions architecture easily enables crawling enhancements such as federation, partitioning and distributed caching.
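
To make the idea of a pre-request crawl rule concrete, here is a minimal sketch. The names `CrawlRequest`, `ICrawlRule`, `MaxDepthRule`, `AllowedHostsRule` and `RulePipeline` are hypothetical and are not the library's actual API. Loading rules from configuration without recompilation is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; not the library's real API.
public sealed record CrawlRequest(Uri Address, int Depth);

public interface ICrawlRule
{
    // Returns true when the request must NOT be crawled.
    bool IsDisallowed(CrawlRequest request);
}

// Rejects requests that are deeper than a configured crawl depth.
public sealed class MaxDepthRule : ICrawlRule
{
    private readonly int _maxDepth;

    public MaxDepthRule(int maxDepth) => _maxDepth = maxDepth;

    public bool IsDisallowed(CrawlRequest request) => request.Depth > _maxDepth;
}

// Rejects requests whose host is not on an allow list (address filtering).
public sealed class AllowedHostsRule : ICrawlRule
{
    private readonly HashSet<string> _hosts;

    public AllowedHostsRule(IEnumerable<string> hosts) =>
        _hosts = new HashSet<string>(hosts, StringComparer.OrdinalIgnoreCase);

    public bool IsDisallowed(CrawlRequest request) =>
        !_hosts.Contains(request.Address.Host);
}

public static class RulePipeline
{
    // A request is crawled only if no rule disallows it.
    public static bool ShouldCrawl(CrawlRequest request, IEnumerable<ICrawlRule> rules) =>
        !rules.Any(rule => rule.IsDisallowed(request));
}
```

In this sketch each rule is a small independent class, so adding a new filter means adding a class rather than editing the crawl loop. A post-request rule could follow the same shape, taking the downloaded response instead of the request.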

Lucene.NET Integration
Lucene.NET integration allows for full-text search through a familiar web interface.  Easily integrate your search results with Solr or other Lucene-based solutions, whether they are written in .NET, Java or any other language that supports Lucene.

SQL Server 2008/2012 and full-text indexing
SQL Server 2008/2012 full-text indexing is configured at all appropriate content storage locations for files, images and web pages.

.DOC/.PDF/.PPT/.XLS Indexing
Crawl, index and search Microsoft Word, PowerPoint and Excel and Adobe PDF documents.  The indexing architecture is easily understood and customization is simple.

Downloaded web pages can be converted to XML through the HtmlAgilityPack and stored in SQL Server 2005/2008.  Use XPath to extract common elements from downloaded content using the pre-configured XML indexes.
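
As an illustration of XPath extraction over a page that has already been converted to well-formed XML, the sketch below uses only the .NET base class library (`System.Xml.Linq` and `System.Xml.XPath`). It does not use HtmlAgilityPack or SQL Server, the class name `XPathExtraction` is hypothetical, and it assumes the XML carries no default namespace.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper for illustration; assumes well-formed XML without a default namespace.
public static class XPathExtraction
{
    // Returns the href value of every anchor element that has one, in document order.
    public static List<string> ExtractLinks(string xml)
    {
        XDocument document = XDocument.Parse(xml);
        return document
            .XPathSelectElements("//a[@href]")
            .Select(anchor => (string)anchor.Attribute("href"))
            .ToList();
    }

    // Returns the page title, or an empty string when the document has none.
    public static string ExtractTitle(string xml)
    {
        XElement title = XDocument.Parse(xml).XPathSelectElement("/html/head/title");
        return title?.Value ?? string.Empty;
    }
}
```

Real-world HTML is rarely well-formed, which is why a conversion step comes first. XHTML that declares a default namespace would need a namespace resolver passed to the XPath calls.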

Full JavaScript/AJAX Functionality
Render dynamic content like a browser does without requiring multiple instances of Internet Explorer or the .NET WebBrowser control.  The crawler instead uses the lower-level mshtml.dll component, which powers AxShDocVw.dll, which in turn powers the .NET WebBrowser control.  Interpretation and rendering are performed out of process, thereby bypassing the limit of two concurrent downloads and renders imposed by AxShDocVw.dll and the .NET WebBrowser control.  Additionally, as the controls do not need to be rendered to a form-based control, CPU and RAM usage is 1/10 of what the other two approaches consume.

Multi-threading and Throttling
The crawler can be configured to run any number of threads and to use as much or as little processor time and memory as required.
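
One common way to cap concurrent work in .NET is a `SemaphoreSlim` gate. The sketch below is an assumption about how such throttling could look, not the crawler's own implementation, and it uses async tasks rather than dedicated threads. The name `ThrottledRunner` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not the crawler's threading model.
public static class ThrottledRunner
{
    // Runs work(item) for every item with at most maxConcurrency running at once.
    // Results are returned in the same order as the input items.
    public static async Task<TResult[]> RunAsync<TItem, TResult>(
        IEnumerable<TItem> items,
        Func<TItem, Task<TResult>> work,
        int maxConcurrency)
    {
        if (maxConcurrency < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));
        }

        using var gate = new SemaphoreSlim(maxConcurrency);

        async Task<TResult> RunOne(TItem item)
        {
            // Wait for a free slot, then always release it, even if work throws.
            await gate.WaitAsync();
            try
            {
                return await work(item);
            }
            finally
            {
                gate.Release();
            }
        }

        return await Task.WhenAll(items.Select(item => RunOne(item)));
    }
}
```

A caller would pass the list of addresses, a download delegate and the desired limit, for example `await ThrottledRunner.RunAsync(addresses, DownloadAsync, 4)`, where `DownloadAsync` is a placeholder for whatever performs the request.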

Respectful Crawling
The crawler provides pre- and post-request rules governing address and content filtering, robots.txt behavior, request frequency and crawl depth.  The default crawling environment is respectful, courteous and kind.
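
Request frequency is the easiest of these rules to illustrate. The following sketch spaces out requests to the same host by a minimum delay. `HostRateLimiter` is a hypothetical helper, not part of the crawler, and the sketch is single-threaded by assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not thread-safe and not the crawler's own rule.
public sealed class HostRateLimiter
{
    private readonly TimeSpan _minimumDelay;
    private readonly Dictionary<string, DateTimeOffset> _nextAllowed =
        new Dictionary<string, DateTimeOffset>(StringComparer.OrdinalIgnoreCase);

    public HostRateLimiter(TimeSpan minimumDelay) => _minimumDelay = minimumDelay;

    // Returns how long the caller should wait before requesting the address,
    // and reserves that time slot for the address's host.
    public TimeSpan Reserve(Uri address, DateTimeOffset now)
    {
        string host = address.Host;
        DateTimeOffset slot = now;

        // If an earlier reservation pushes this host into the future, queue behind it.
        if (_nextAllowed.TryGetValue(host, out DateTimeOffset next) && next > now)
        {
            slot = next;
        }

        _nextAllowed[host] = slot + _minimumDelay;
        return slot - now;
    }
}
```

Passing the current time in as a parameter keeps the logic deterministic and easy to test. A crawl loop would call `Reserve` before each request and delay for the returned duration.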

Analysis Services
The crawler comes with over 250 stored procedures, views and functions designed for use with Analysis Services and other business intelligence software.  These procedures and views address trending, popularity, term extraction, phrase extraction and many other common analysis and reporting needs.

SQL Server 2005/2008 and SSIS
The crawler comes pre-configured with several SSIS procedures to extract and prepare key information from collected data for text mining and analysis.

EXIF Data Extraction
The crawler can extract, store and index all discoverable EXIF data fields from discovered images.

WebService Interface
All search operations are supported both through a traditional 'Google-like' search interface and through a WebService for programmatic application consumption.



Content Aggregation:
Use the crawler for personal content aggregation, for crawling intranets of any size or for crawling the Internet as a whole.  Discovered content is parsed and stored into multiple configurable forms and locations.

Research and Analysis:
Extract, collect and parse downloaded content into multiple forms, including XML. SSIS packages and CLR functions extract terms and phrases from text content, and provide over 250 stored procedures, views and functions to jumpstart Analysis Services or other text mining applications.

Discovered content is indexed and stored in Lucene.NET indexes (and optionally, SQL Full-text indexing) and can be searched through a familiar Web interface, or through the provided WebService interface. 

Text Mining:
Extract words, phrases, tags and text from discovered content using the included SSIS procedures.  Two plugins for Bayes' classification and automated content extraction are also provided.

Learn crawling techniques from introductory to advanced, along with features of the .NET Framework and SQL Server 2005/2008, including full-text indexing, multi-threading, caching, reflection, interfaces, object-oriented concepts, SQL common language runtime functions and regular expressions.



copyright 2004-2017, LLC