More Friday Afternoon questions

JCrawl posted on Fri, Apr 10 2015 7:19 PM
I am really impressed by the amount of thought put into this application and how extensible it is. After another week of looking at what it can do, I have a few more questions on features / concepts:

1) Can you give me an example of where and when you would use the CookieManager? It seems like an interesting concept, but I am trying to clearly identify the use case.

2) What was the advantage of the SQL user functions instead of using stored procedures etc.? More of a design question, but I am curious about the benefits. I have searched the forum for info on this, but most of the comments are concerned with compiling the Functions DLL.

3) The QueryProcessor: it seems to be used to dissect the query string and make decisions based upon formats etc. I thought this was more like a custom plugin feature, but it is in the core design. Just wondering where this is going in the future.

4) Identifying new developments: is there a section on the site that identifies what new features are coming and what has been released recently? I know the MySQL DAO is coming, but I am not sure where to look to see if it has arrived.

5) Is there anywhere that identifies when to consider using certain features? E.g., I get the CrawlPeer when setting up multiple spiders, but when would you use multiple DatabasePeers?

Thank you
Jason

Verified Answer

Verified by arachnode.net

Jay:

Thank you very much. It took an incredible amount of time to get portions of AN to where I wanted them to be. Since I started AN in 2004, it has also been an incredible way to learn how to code and to learn about nearly all of the .NET technologies.

Almost everything .NET is represented in AN:

     * ASP.NET
     * SQL Server
     * Remoting
     * TableAdapters
     * Linq2Sql
     * LINQ
     * WinForms
     * COM Interaction
     * P/Invoke
     * Multi-threading
     * SQLCLR
     * Interfaces/Abstract Classes/IoC
     * Windows Services
     * Analysis Services / Integration Services
     * And probably plenty of other things...

1.) CookieManager is used by WebClient.cs (SiteCrawler) - you shouldn't have to interact with this class directly - use the CookieContainer as shown in Console\Program.cs :: http://arachnode.net/blogs/arachnode_net/archive/2011/06/02/how-to-crawl-facebook-com-twitter-com-and-linkedin-com.aspx
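
For reference, here is a minimal plain-.NET sketch of the idea (the URLs are placeholders and this is not the AN-specific wiring from Console\Program.cs): a single CookieContainer shared across requests keeps the session cookies a site sets and replays them on subsequent requests, which is what allows authenticated crawling.

    // Minimal plain-.NET sketch, not the AN-specific wiring from Console\Program.cs.
    // A single CookieContainer shared across requests carries the session cookies
    // set by a login/landing page into later requests.
    using System;
    using System.Net;

    class CookieContainerSketch
    {
        static void Main()
        {
            var cookies = new CookieContainer();

            // First request (e.g. a login or landing page) populates the container.
            var request1 = (HttpWebRequest)WebRequest.Create("http://example.com/login");
            request1.CookieContainer = cookies;
            using (var response1 = (HttpWebResponse)request1.GetResponse())
            {
                Console.WriteLine("Cookies received: " + cookies.Count);
            }

            // Later requests reuse the same container, so the site sees the same session.
            var request2 = (HttpWebRequest)WebRequest.Create("http://example.com/members");
            request2.CookieContainer = cookies;
            using (var response2 = (HttpWebResponse)request2.GetResponse())
            {
                Console.WriteLine(response2.StatusCode);
            }
        }
    }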

2.) There are a good number of functions in the Functions project which are shared between C# and SQL (e.g. dbo.ExtractHost). It made sense to share them because the configuration data for the various parsing functions is lengthy and spans multiple database tables. The stored procedure below shows dbo.ExtractHost being called from SQL:

USE [arachnode.net]
GO
/****** Object:  StoredProcedure [rpt].[arachnode_rsp_WebPages_Hosts_Discoveries_INSERT]    Script Date: 04/10/2015 09:50:51 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [rpt].[arachnode_rsp_WebPages_Hosts_Discoveries_INSERT]
AS
    SET NOCOUNT ON

    INSERT  WebPages_Hosts_Discoveries
            SELECT  wp.ID,
                    hd.ID
            FROM    WebPages AS wp
                    LEFT OUTER JOIN WebPages_Hosts_Discoveries AS wphd ON wp.ID = wphd.WebPageID
                    INNER JOIN Hosts AS h ON dbo.ExtractHost(wp.AbsoluteUri) = h.Host
                    INNER JOIN Hosts_Discoveries AS hd ON h.ID = hd.HostID
                                                          AND hd.DiscoveryTypeID = 7
            WHERE   wphd.WebPageID IS NULL
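
As a hypothetical sketch only (the real dbo.ExtractHost in the Functions project may differ; the class name, signature and logic here are illustrative), a function shared between C# and SQL via SQLCLR might look like this - the C# side calls the method directly, while the SQL side registers the assembly with CREATE ASSEMBLY and exposes the method with CREATE FUNCTION:

    // Hypothetical sketch of a function shared between C# and SQL via SQLCLR.
    // Illustrative only - the actual Functions project implementation may differ.
    using System;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlString ExtractHost(SqlString absoluteUri)
        {
            if (absoluteUri.IsNull)
            {
                return SqlString.Null;
            }

            Uri uri;
            // Uri.TryCreate avoids throwing inside the SQLCLR AppDomain for malformed input.
            if (Uri.TryCreate(absoluteUri.Value, UriKind.Absolute, out uri))
            {
                return new SqlString(uri.Host);
            }

            return SqlString.Null;
        }
    }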

3.) The QueryProcessor was one of the very first classes. When AN was just a few files, I used it to allow input such as "http://somesite.com/page?=[1-100]", and AN would create links by parsing the [1-100] portion. I never really found a use for it after adding all of the RestrictCrawlTo options (see the CrawlRequest constructor)...
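
To illustrate the idea only (this is not the actual QueryProcessor code; the method and pattern below are made up for the example), expanding a [1-100] token into individual AbsoluteUris could look roughly like this:

    // Illustrative only - not the actual QueryProcessor implementation.
    // Expands "http://somesite.com/page?=[1-5]" into one AbsoluteUri per number.
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class RangeExpansionSketch
    {
        static IEnumerable<string> ExpandRange(string pattern)
        {
            Match match = Regex.Match(pattern, @"\[(\d+)-(\d+)\]");

            if (!match.Success)
            {
                yield return pattern;
                yield break;
            }

            int start = int.Parse(match.Groups[1].Value);
            int end = int.Parse(match.Groups[2].Value);

            for (int i = start; i <= end; i++)
            {
                // Replace the [start-end] token with the current number.
                yield return pattern.Substring(0, match.Index) + i + pattern.Substring(match.Index + match.Length);
            }
        }

        static void Main()
        {
            foreach (string absoluteUri in ExpandRange("http://somesite.com/page?=[1-5]"))
            {
                Console.WriteLine(absoluteUri);
            }
        }
    }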

4.) My blog is the best place - I try to detail what I am working on as best I can. The check-in notes in SVN are also good to look at.

5.) Database peers are REALLY advanced and intended for use only if you are a SQL Server master who understands the ins and outs of SQL Server's Transactional Replication configuration. Using them well takes a good understanding of what you are crawling and of the limitations of your hardware/network. They were originally intended to help with slower disk subsystems; with the advent of SSDs and massive vertical scaling, multiple DBs aren't really necessary. Prefer Queue-based instances and a final merge when the crawl is done, or use distributed transactions/linked servers to join the results.

What are you crawling, specifically?  (I know you have mentioned it in email.)  Feel free to "anonymize" your domains.

Are you crawling one site?  Are you starting from a list of sites?  Are the crawls restricted to these domains?  Does it run on a timed cycle (grab what you can grab in the allotted time) or are you aiming for complete coverage?

From this I can detail some of the challenges/motivations behind how one creates a general purpose crawler...

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
