arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Installation Instructions

Summary for arachnode.net (Requires MS-SQL):

(Always disable IntelliTrace.)

1.) Restore the arachnode.net database from arachnode.net.bak_2008.zip using SQL Server Management Studio (SSMS, http://en.wikipedia.org/wiki/SQL_Server_Management_Studio), or execute RestoreDatabaseWithNames.sql, located at the root of the solution (using SSMS or from within Visual Studio).

2.) Run the stored procedure '[dbo].[arachnode_usp_arachnode.net_RESET_DATABASE]'.

3.) Open the solution (arachnode.net.sln).  Visual Studio may prompt you to modify your connection strings.  If the solution doesn't build/deploy, check the connection string in the 'Database' tab of the 'Functions' project properties.  If you restored the database to a named instance, update the server name in Configuration\ConnectionStrings.config to '[YOUR_SERVER_NAME]\[INSTANCE_NAME]'.  '.\[INSTANCE_NAME]' is also acceptable.  Build and publish the 'Functions' project.

  • Default Instance: <add name="arachnode_net_ConnectionString" connectionString="Data Source=.;Initial Catalog=arachnode.net;Integrated Security=True;Connection Timeout=3600;" providerName="System.Data.SqlClient" />
  • Named Instance: <add name="arachnode_net_ConnectionString" connectionString="Data Source=.\SQLExpress;Initial Catalog=arachnode.net;Integrated Security=True;Connection Timeout=3600;" providerName="System.Data.SqlClient" />
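
The two entries above differ only in the 'Data Source' value. As a minimal sketch of that difference (Python is used purely for illustration, since arachnode.net itself is C#, and the helper name build_connection_string is hypothetical and not part of the product), the string can be assembled from a server name and an optional instance name:

```python
def build_connection_string(server=".", instance=None, database="arachnode.net"):
    """Assemble a connection string in the shape shown above.

    Hypothetical helper for illustration only; not part of arachnode.net.
    """
    # A named instance is addressed as SERVER\INSTANCE; a default instance as SERVER.
    data_source = f"{server}\\{instance}" if instance else server
    parts = [
        ("Data Source", data_source),
        ("Initial Catalog", database),
        ("Integrated Security", "True"),
        ("Connection Timeout", "3600"),
    ]
    return "".join(f"{key}={value};" for key, value in parts)


if __name__ == "__main__":
    print(build_connection_string())                       # default instance
    print(build_connection_string(instance="SQLExpress"))  # named instance
```

The resulting value belongs in the connectionString attribute of the <add /> element; the name and providerName attributes stay as shown in the entries above.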

4.) Set 'Console' as the startup project.  Run and wait (2 minutes for the demo) for the crawl to complete.  If the Console appears to hang on 'Resetting the database...' or does nothing after displaying the arachnode.net or AN.Next version, see step #3.

5.) Set 'Web' as the startup project.  Search.

6.) Done.

Summary for AN.Next (Optional MS-SQL):

1.) Open the arachnode.net solution (arachnode.net.sln).

2.) Press 'F5'.

If SQL Server is not installed you may receive the following prompt:

Click 'Yes' to continue.  This prompt originates from the 'Functions' project and deployment is not necessary for AN.Next.


Full instructions for arachnode.net (Requires MS-SQL):

I. Restore the arachnode.net database:

(AN.Next requires the following installation steps only if using MS-SQL Server) 

1.) Navigate to the source and open arachnode.net.bak_2008.zip.  Your operating system should have the latest service pack applied.

2.) Select the zipped database backup.

3.) Copy the selected database backup.

4.) Return to the source.

5.) Paste the copied database backup.

6.) Wait for the paste operation to complete.

7.) Start SQL Server Management Studio.

8.) Enter your database server connection configuration.

9.) Click 'Connect'.

10.) Right-click on 'Databases' and select 'Restore Database...'

11.) Select 'From device:'

12.) Click the ellipsis button.

13.) Click 'Add'.

14.) Select the database backup. Click 'OK'.

15.) Click 'OK'.

16.) Click 'Restore'.

17.) Select 'arachnode.net' from the 'To database:' drop down list.

18.) Click 'Options' in the left-hand pane.

19.) Check 'Overwrite the existing database (WITH REPLACE)'.

20.) Click 'OK'.

21.) Click 'OK'.

22.) Navigate to 'Stored Procedures'.

23.) Locate '[dbo].[arachnode_usp_arachnode.net_RESET_DATABASE]', right-click and select 'Execute Stored Procedure...'

24.) Click 'OK'.

25.) Navigate to 'Tables'. Locate 'cfg.Configuration', right-click and select 'Edit Top 200 Rows'. Perform the same steps for cfg.CrawlActions and cfg.CrawlRules. These three tables control common configuration options. The file 'Program.cs' in the 'Console' project in the Visual Studio 2008 solution overrides the options in these three tables through the ApplicationSettings static object as an example of controlling configuration at crawl time.

26.) Observe the configuration options in cfg.Configuration and the associated help.

II. Open the arachnode.net solution:

1.) Open Visual Studio 2008 as an Administrator.

2.) Choose 'Open > Project/Solution...' from the 'File' menu.

3.) Select the 'arachnode.net' solution and click 'Open'.

4.) The 'Functions' project will prompt you to update its database connection settings. Click 'Yes'.

5.) Select 'Microsoft SQL Server' and click 'Continue'.

6.) Supply your database server name.

7.) Select 'arachnode.net' from the drop down list.

8.) Click 'OK'.

9.) Click 'OK'.

10.) Select the 'Functions' project and open 'Functions.csproj.user' and locate the connection string.

11.) Select the 'Configuration' project, open 'ConnectionStrings.config' and locate the connection string. Ensure that it is compatible with the connection string from the previous step.

12.) Select the 'Console' project.

13.) Right-click the 'Console' project and select 'Set as StartUp Project'.

14.) Double-click 'Program.cs' in the 'Console' project.

15.) Scroll down to examine how database configuration settings in cfg.Configuration, cfg.CrawlActions and cfg.CrawlRules may be modified at crawl time.

16.) Scroll down to examine how CrawlRequests are submitted to the Crawler for crawling. Before continuing, read 'Program.cs' in its entirety and visit http://arachnode.net/Content/FrequentlyAskedQuestions.aspx.

17.) Press 'F5' or click the green arrow in the tool strip to begin crawling.

18.) The DEMO version of arachnode.net presents this warning as no debug information is present for SiteCrawler.dll, yet the current solution configuration is 'debug', and all other projects contain debug information. The LICENSED version of arachnode.net contains complete source code and provides complete debugging symbols. Click 'OK'.

19.) Click the 'Console' icon in the taskbar.

20.) Provide the following answers for the following prompts:

  • Reset Database and perform initial setup tasks: y - Executes '[dbo].[arachnode_usp_arachnode.net_RESET_DATABASE]', which resets the database, all user data, and enables CLR functions.
  • Reset Directories: y - Deletes all files in 'ConsoleOutputLogsDirectory', 'Downloaded[Files/Images/WebPages]Directory' and 'LuceneDotNetIndexDirectory'.
  • Reset Crawler: y - If the Crawler is stopped before crawling is complete, all CrawlRequests and Discoveries are saved to the database, enabling the Crawler to resume crawling at the point of interruption.  Not applicable for DEMO installations.
  • Reset IIS: n - If testing Crawler behavior using [Web]\Test.aspx, resetting IIS resets static variables in [Web]\Global.asax, resetting the initial state of [Web]\Test.aspx.  As the DEMO installation uses ASP.NET's WebServer, this setting is beyond the scope of this tutorial.
  • Start perfmon: n - arachnode.net provides a complete set of Performance Counters.  Illustration of the Performance Counters is beyond the scope of this tutorial.

If the Crawler appears to hang after resetting the database, your ConnectionStrings are likely incorrect.

Using a default instance?

connectionString="Data Source=.;Initial Catalog=arachnode.net;Integrated Security=True;Connection Timeout=3600;"

Using a named instance?

connectionString="Data Source=YOURSERVER\SQLEXPRESS;Initial Catalog=arachnode.net;Integrated Security=True;Connection Timeout=3600;"
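
When the Crawler hangs, a quick check is whether the connection string used by the 'Functions' project and the one in 'ConnectionStrings.config' point at the same server/instance and database. The following is a rough sketch of such a check, written in Python for illustration only; the function names are hypothetical, and the parser assumes plain 'Key=Value;' pairs without quoted values that contain semicolons:

```python
def parse_connection_string(connection_string):
    """Split 'Key=Value;Key=Value;' into a dict with lower-cased keys."""
    settings = {}
    for pair in connection_string.split(";"):
        if not pair.strip():
            continue  # skip the empty piece after a trailing ';'
        key, _, value = pair.partition("=")
        settings[key.strip().lower()] = value.strip()
    return settings


def same_database(first, second):
    """True when both strings name the same Data Source and Initial Catalog.

    The comparison is textual and case-insensitive, so '.' and an explicit
    machine name are reported as different even if they are the same host.
    """
    a, b = parse_connection_string(first), parse_connection_string(second)
    keys = ("data source", "initial catalog")
    return all(a.get(k, "").lower() == b.get(k, "").lower() for k in keys)


if __name__ == "__main__":
    functions = "Data Source=.\\SQLEXPRESS;Initial Catalog=arachnode.net;Integrated Security=True;"
    config = "Data Source=.;Initial Catalog=arachnode.net;Integrated Security=True;Connection Timeout=3600;"
    print(same_database(functions, config))  # prints False: one names an instance, the other does not
```

A mismatch like the one in the usage example (named instance in one place, default instance in the other) is the kind of difference worth looking for first.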

21.) The Console window shows the current state of all CrawlActions, CrawlRules and EngineActions loaded into the Crawler. In this configuration, the code in 'Program.cs' (see step 15) has disabled all CrawlActions, CrawlRules and EngineActions except ManageLuceneDotNetIndexes.cs.

22.) Crawling... if using the DEMO, wait for 5 minutes for the Crawler to stop.

23.) The Crawl has stopped.

24.) Select the 'Web' project.

25.) Right-click the 'Web' project and select 'Set as StartUp Project'.

26.) Select 'Search.aspx' in the 'Web' project.

27.) Right-click 'Search.aspx' and select 'Set as Start Page'.

28.) Press 'F5' or click the green arrow in the tool strip to start the 'Web' project.

29.) The DEMO version of arachnode.net presents this warning as no debug information is present for SiteCrawler.dll, yet the current solution configuration is 'debug', and all other projects contain debug information. The LICENSED version of arachnode.net contains complete source code and provides complete debugging symbols. Click 'OK'.

30.) Select 'Internet Explorer' from the task bar.

31.) Enter a search term and click 'Search'.

32.) Examine the results. Congratulations. You have successfully installed arachnode.net.

III. Examine your data:

1.) Select the 'Administration' project.

2.) Right-click the 'Administration' project and select 'Set as StartUp Project'.

3.) Select 'Default.aspx' in the 'Administration' project.

4.) Right-click 'Default.aspx' in the 'Administration' project and select 'Set as Start Page'.

5.) Press 'F5' or click the green arrow in the tool strip to start the 'Administration' project.

6.) Examine 'Default.aspx'. The dynamic data administration project provides an easy way to drill through your data without requiring knowledge of T-SQL.

7.) Scroll down and click 'WebPages'.

8.) The data from the 'WebPages' table is displayed. Most data in arachnode.net is associated by foreign key, allowing you to drill down/through many levels of the data hierarchy.

9.) Switch to SQL Server Management Studio. Select the database 'arachnode.net' and click 'New Query'.

10.) Enter the text as shown in the new query window. Click 'Execute'.

11.) Examine the results. Notice that the current configuration does not submit the File, Image or WebPage Source to the database. Instead, File, Image and WebPage Source are saved to disk. This configuration is many times more performant than inserting large binary objects into SQL Server. The setting is controlled by 'Insert[File/Image/WebPage]Source', and may be set from cfg.Configuration or from ApplicationSettings.cs.

12.) Enter the text shown in the new query window and press 'F5' or click 'Execute' to execute the query. Examine the disk location of Files, Images and WebPages. These locations were set automatically by the first !RELEASE helper code section in 'Program.cs' in the 'Console' project.

13.) Examine the on disk location of Files, Images and WebPages.

14.) Examine a Discovery, which is a WebPage as shown. Rather than storing all downloaded Discoveries in a single directory, the on-disk location is segmented by AbsoluteUri directory, which suits DFS or other distributed file systems. The file name is calculated from a SHA1 hash. The Discovery may be retrieved programmatically through DiscoveryManager.cs using the methods GetFileSource(...), GetImageSource(...) and GetWebPageSource(...). These methods are found only in the LICENSED version of arachnode.net.

15.) Finally, examine the two most important tables for troubleshooting why a CrawlRequest or Discovery was not found where expected: Discoveries and Exceptions.
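
Step 14 mentions on-disk names calculated from a SHA1 hash and directories segmented by AbsoluteUri. The exact scheme is not spelled out here, so the following is only a sketch under stated assumptions: that the hash is taken over the UTF-8 bytes of the full AbsoluteUri, and that the directory is built from the host plus the directory portion of the URI path. The function name, the root directory and the layout are hypothetical; DiscoveryManager.cs remains the reference for the real behavior. Python is used for illustration only.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def discovery_path(root, absolute_uri):
    """Map a URI to a segmented directory plus a hash-derived file name.

    Hypothetical illustration; not arachnode.net's actual layout.
    """
    parts = urlsplit(absolute_uri)
    # ASSUMPTION: one directory level for the host, then one per directory
    # segment of the URI path (the final path component is left out).
    directories = [segment for segment in parts.path.split("/")[:-1] if segment]
    # ASSUMPTION: the file name is the hex SHA1 digest of the UTF-8 encoded URI.
    name = hashlib.sha1(absolute_uri.encode("utf-8")).hexdigest()
    return PurePosixPath(root, parts.netloc, *directories, name)


if __name__ == "__main__":
    print(discovery_path("DownloadedWebPages", "http://example.com/a/b.html"))
```

Spreading files across per-host and per-directory folders keeps any single directory from growing without bound, and the fixed-length hash avoids characters in the URI that are not valid in file names.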


copyright 2004-2017, arachnode.net LLC