arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

About arachnode.net

http://arachnode.net 1.0 release +lucene.net

What is arachnode.net?
arachnode.net is an open source Web crawler for downloading, indexing and storing Internet content, including e-mail addresses, files, hyperlinks, images and Web pages. arachnode.net is written in C# and uses SQL Server 2005.

Applications
Content Aggregation:
Use for personal content aggregation, crawling intranets of any size or crawling the Internet as a whole. Discovered content is parsed and stored in multiple configurable forms and locations.

Research and Analysis:
Extract, collect and parse downloaded content into multiple forms, including XML. SSIS packages and CLR functions extract terms and phrases from text content, and over 250 stored procedures, views and functions are provided to jumpstart Analysis Services or other text mining applications.

Search:
Discovered content is indexed and stored in Lucene.NET indexes and can be searched through a familiar Web interface.


Text Mining:
Extract words, phrases, tags and text from discovered content.

Education:
Learn introductory to advanced crawling techniques, and features of the .NET Framework and SQL Server 2005, including full-text indexing, multi-threading, caching, reflection, interfaces, object-oriented concepts, SQL common language runtime functions and regular expressions.

Key Features
.NET architecture
Arachnode.net is the most complete open source .NET site crawler available to the general public.

Configurable Rules and Actions
Implement your own custom pre- and post-request crawl rules and actions without source recompilation.  The existing crawl rules and actions architecture easily enables crawling enhancements such as federation, partitioning and distributed caching.
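The idea of pluggable pre-request rules can be sketched as a list of small predicates that each get a chance to reject a request. This is a language-neutral illustration in Python, not arachnode.net's actual C# API: `CrawlRequest`, `disallowed_scheme`, `too_deep` and `is_allowed` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urlparse


@dataclass
class CrawlRequest:
    """Hypothetical request record; not arachnode.net's actual type."""
    url: str
    depth: int


# A rule returns True when the request should be rejected.
Rule = Callable[[CrawlRequest], bool]


def disallowed_scheme(request: CrawlRequest) -> bool:
    # Reject anything that is not plain HTTP or HTTPS.
    return urlparse(request.url).scheme not in ("http", "https")


def too_deep(max_depth: int) -> Rule:
    # Build a rule that rejects requests beyond max_depth.
    def rule(request: CrawlRequest) -> bool:
        return request.depth > max_depth
    return rule


def is_allowed(request: CrawlRequest, rules: List[Rule]) -> bool:
    # A request is crawled only if no rule rejects it.
    return not any(rule(request) for rule in rules)


# The active rule set is plain data, so it could be assembled from
# configuration instead of being hard-coded.
pre_request_rules = [disallowed_scheme, too_deep(max_depth=3)]
```

Because the rule set is just a list, adding or removing a rule does not require changing the code that evaluates it, which is the property the rules and actions architecture is described as providing.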

Lucene.NET Integration
Lucene.NET integration allows for full-text searching through a familiar web interface. 
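Full-text search engines such as Lucene are built around an inverted index, which maps each term to the documents that contain it. The following is a minimal Python sketch of that idea only; it does not use or reproduce the Lucene.NET API, and `TinyIndex` and `tokenize` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase the text and split it into alphanumeric terms.
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Record the document under every term it contains.
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return the documents that contain every term in the query.
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add("page1", "Open source web crawler")
index.add("page2", "Lucene search index")
```

A real engine adds analysis, ranking and on-disk storage on top of this structure; the sketch covers only the term lookup.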

SQL Server 2005 and full-text indexing
SQL Server 2005 full-text indexing is configured at all appropriate content storage locations.

HTML to XML and XHTML
Downloaded Web pages can be converted to XML through the HtmlAgilityPack and stored in SQL Server 2005. Use XPath to extract common elements from downloaded content using the pre-configured XML indexes.
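Once a page is well-formed XML, common elements can be pulled out with path expressions. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, on a hand-written XHTML sample; it is not the HtmlAgilityPack or SQL Server XML API, and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# A small, already well-formed sample page (invented for this sketch).
xhtml = """<html>
  <head><title>Example page</title></head>
  <body>
    <a href="http://example.com/a">A</a>
    <p>Text with <a href="http://example.com/b">B</a></p>
    <img src="logo.png" />
  </body>
</html>"""

root = ET.fromstring(xhtml)

# Page title: a fixed path from the root element.
title = root.findtext("./head/title")

# All hyperlinks: any <a> element that has an href attribute.
links = [a.get("href") for a in root.findall(".//a[@href]")]

# All image sources, at any depth.
images = [img.get("src") for img in root.findall(".//img")]
```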

Multi-threading and Throttling
Arachnode.net can be configured to run any number of threads and to use as much or as little processor time and memory as you require.
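A configurable thread count combined with throttling can be sketched as a worker pool whose workers share a rate limiter. This Python illustration is an assumption about the general technique, not arachnode.net's implementation; `Throttle`, `crawl_all`, and the `fetch` callback are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Allow at most one call per `interval` seconds across all threads."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock,
        # then sleep outside the lock so other threads can reserve theirs.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def crawl_all(urls, fetch, max_threads=4, interval=0.05):
    # max_threads bounds concurrency; interval bounds the request rate.
    throttle = Throttle(interval)

    def worker(url):
        throttle.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in the same order as the input urls.
        return list(pool.map(worker, urls))
```

Raising `max_threads` increases parallelism, while raising `interval` lowers the overall request rate regardless of how many threads are running.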

Respectful Crawling
arachnode.net provides pre- and post-request rules governing address and content filtering, robots.txt behavior, request frequency and crawl depth.  The default crawling environment is respectful, courteous and kind.
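As an illustration of two of these checks, the sketch below combines a robots.txt test with a crawl depth limit using Python's standard `urllib.robotparser`. The robots.txt content, the `example-crawler` user agent, and the `may_fetch` helper are assumptions made for the example, not arachnode.net's configuration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (invented); a crawler would download this
# from the target site before requesting any other page.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_fetch(url, depth, max_depth=3, user_agent="example-crawler"):
    # Fetch only if the page is within the depth limit
    # and robots.txt permits it for this user agent.
    return depth <= max_depth and parser.can_fetch(user_agent, url)


# Seconds to wait between requests, if the site specifies a delay.
delay = parser.crawl_delay("example-crawler")
```

The `delay` value could feed a request-frequency limit, so that the site's stated preference governs how often it is contacted.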

Analysis Services
arachnode.net comes with over 250 stored procedures, views and functions designed for use with Analysis Services and other business intelligence software. These procedures and views address trending, popularity, term extraction, phrase extraction and many other common analysis and reporting needs.

SQL Server 2005 and SSIS
arachnode.net comes pre-configured with several SSIS packages to extract and prepare key information from collected data for text mining and analysis.

EXIF data extraction
Arachnode.net can extract, store, and index all discoverable EXIF data fields from discovered images.

How can I get it?
Source code and a database backup are available on this site and at SourceForge.net. Arachnode.net is released under the GNU General Public License.


Posted Mon, Jan 5 2009 8:35 PM by arachnode.net

copyright 2004-2017, arachnode.net LLC