arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Tagging webpage discoveries

ildb posted on Sun, Sep 1 2013 5:28 PM

Hi Mike,

I want to be able to submit multiple CrawlRequests, each with an individual "identifier tag", and then have all discoveries and webpages found by the resulting crawl tagged with the same - or at least I need to be able to walk the discovery tree back to the webpage and know which identifier tag(s) generated that webpage result.

  • There is a tag field on the CrawlRequest object, so I assume that will serve my purpose. However, this field does not seem to be persisted in the database, which worries me somewhat. Thoughts on using this field?
  • Is there a difference in usage between CrawlRequest.Tag and CrawlRequest.UserData?
  • How do I pass this tag field on to each automatically generated child CrawlRequest?
  • Is there an event called when the last CrawlRequest generated from an originator CrawlRequest has finished, i.e., once the whole website or folder has finished crawling?

Many thanks,

Richard.

Verified Answer

Richard:

There is a tag field on the CrawlRequest object, so I assume that will serve my purpose. However, this field does not seem to be persisted in the database, which worries me somewhat. Thoughts on using this field?

No need to worry.  :)  This is the HTML tag that is captured by DiscoveryManager.cs.

Is there a difference in usage between CrawlRequest.Tag and CrawlRequest.UserData?

Tag is the HTML tag.  UserData is an open field to be used for, well, anything...
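
To make the distinction concrete, here is a minimal sketch - the constructor and the _crawler instance are illustrative (check CrawlRequest.cs in your version for the exact signature):

CrawlRequest crawlRequest = new CrawlRequest(new Discovery("http://example.com/"), 2);

// Tag is populated by DiscoveryManager.cs with the HTML tag, so don't repurpose it.
// UserData is yours to use:
crawlRequest.UserData = "identifier-tag-001";

_crawler.Crawl(crawlRequest);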

How do I pass this tag field on to each automatically generated child CrawlRequest?

There are a few ways CRs can be created.  In code, you would modify CrawlRequest.cs to accept a 'UserData' parameter and modify the dbo.CrawlRequests table and associated SPs to accept a 'UserData' parameter.  Additionally, you would want to modify the EmailAddresses, Files, HyperLinks, Images and WebPages tables (and their associated SPs) to accept a 'UserData' parameter.  Or, you could have a plugin that knows the Originator and/or has knowledge of what the UserData should be per CR.  It would be a PreRequest plugin - CrawlAction would be most appropriate.
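
As a rough sketch of the plugin route - the member names below (ACrawlAction, PerformAction, Originator) are drawn from the description above and may differ in your version, so treat this as illustrative:

public class UserDataPropagator : ACrawlAction
{
    public override void PerformAction(CrawlRequest crawlRequest)
    {
        // Stamp each automatically generated child CR with the UserData of
        // the CR that originated the crawl, so the whole discovery tree
        // shares one identifier tag.
        if (crawlRequest.UserData == null && crawlRequest.Originator != null)
        {
            crawlRequest.UserData = crawlRequest.Originator.UserData;
        }
    }
}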

Is there an event called when the last CrawlRequest generated from an originator CrawlRequest has finished, i.e., once the whole website or folder has finished crawling?

There isn't.  I'm trying to think of a case for this and how to (easily) achieve it, even considering multi-server crawling/caching and knowing the changes that are coming next.  I am very close to releasing a refactor to the main AN codebase that allows for generalizing DB storage - so you can use MySQL, Cassandra, whatever...  Included in this is the promotion of some static objects to instance members, to make crawling (batching) within a crawl easier.  Also, I am pulling up abstract classes for just about everything in the SiteCrawler project, so if there is something that you'd like to change, you can simply replace the reference in the _crawler.  (To be released soon: _crawler.DiscoveryManager = YourDiscoveryManager, where YourDiscoveryManager : ADiscoveryManager.)
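
Once that refactor is released, swapping in your own manager would look roughly like this (ADiscoveryManager comes from the description above and isn't published API yet):

public class YourDiscoveryManager : ADiscoveryManager
{
    // override whichever discovery-handling members you need to change...
}

_crawler.DiscoveryManager = new YourDiscoveryManager();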

However... what you really want to do is keep your crawls separate and decide whether to allow the cache to be shared.  Does one crawl know what the other is doing?  (My answers assume some additional knowledge from private emails...)

I want to be able to submit multiple CrawlRequests, each with an individual "identifier tag", and then have all discoveries and webpages found by the resulting crawl tagged with the same - or at least I need to be able to walk the discovery tree back to the webpage and know which identifier tag(s) generated that webpage result.

About 7 years ago I added sourcing to each Discovery: HyperLink -> HyperLink_Discovery -> CrawlIdentifier.  So, if a specific HyperLink was found by multiple WebPages, there are rows in the HyperLinks_Discoveries table linking to the HyperLinks table, and there are rows in another table, linked to HyperLinks_Discoveries, which track which Crawl (batch) each particular Discovery belongs to.  What I found was that the best way to source the crawled data was NOT to add this additional association following the way that Discoveries are sourced, but to implement a bit of logic around the process instead.

Adding the additional normal-form extension to HyperLinks_Discoveries begins to inflate the (say) HyperLinks_Discoveries_Crawls table rather quickly.  Knowing that there are 20-50 HyperLinks per WebPage (average range), and that ~25 of these are found on each WebPage, this means 25 HyperLinks are created, with 625 HyperLink_Discoveries and 625 Crawl references created for each WebPage at depth 2.  If you crawl with another Crawl Identifier you will generate 625 more Crawl references.  This is fine, and really the best way in SQL to balance storage and JOIN performance - however - there are benefits to de-normalizing...  more on this in a bit...
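
To spell out the arithmetic (a hypothetical illustration using the averages above):

int uniqueHyperLinks = 25;        // ~25 unique HyperLinks per WebPage (from the 20-50 range)
int webPagesReferencingEach = 25; // each HyperLink is found on ~25 WebPages
int discoveryRows = uniqueHyperLinks * webPagesReferencingEach; // 625 HyperLinks_Discoveries rows
int crawlReferenceRows = discoveryRows; // 625 Crawl references - and 625 more per additional Crawl Identifier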

One of the earliest lessons in AN development came about after a large crawl latched onto a number of adult sites and wouldn't let go.  I let AN free-crawl and the index/content was to my liking until I discovered images that I didn't want to see.  I tried to delete the content but there was just too much of it.  I had something around 400M WebPages (+ all of the files and images), back in 2009, and there was just too much of it to KNOW that I had deleted all of the adult content.  I was aiming at hosting the index - ultimately I chose to write an adult filter and start the index from scratch.  The resulting lesson was that I no longer put data into the index that I MIGHT delete later.  Only data that I KNOW will stay is allowed to enter.  As MS-SQL (and a lot of other storage formats) grows, so does the time to modify/update/delete.

Around that time I also added sourcing, per Crawl.  I found that when I had 100's of GB's of data, deleting unwanted data took a lot longer than selecting and inserting ONLY the data that I knew I wanted to keep.  Still in line with the adult content example...

I wondered, "Does it really matter where content comes from?"  If I restrict crawls to a host/domain, keep track of the starting AbsoluteUris, and turn on ApplicationSettings.InsertHyperLinks = true, then I know the sources and the DiscoveryTree (also a plugin - look at DiscoveryTree.cs) for each Crawl batch (a sketch of this setup follows the list below).  In a free-crawl scenario, does it matter where things come from?  What is the benefit of knowing that WebSite X was found in Crawl 1 and WebSite X was also found in Crawl 2, where the Crawls were started from WebSite A and WebSite B?  I could never come up with a benefit, knowing...

1.) The datetime stamps in the DB help with knowing whether a subsequent Crawl found the same content.

2.) The HyperLinks table can source everything.  (The Files and Images tables too - and a handful of others...)

3.) If you save the Discoveries table after a crawl (rather than letting the Engine clear it - the Stop(...); method), then you'll know what the Crawl found - this is especially handy as you don't have to create additional FK checks, stats, and indexes on a HyperLinks_Discoveries_Crawls table.
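
A minimal sketch of the restricted-crawl setup mentioned above, assuming the ApplicationSettings member named in the post and an illustrative CrawlRequest constructor (check your version for the exact signatures and for the host/domain restriction options):

// Persist HyperLink rows so each Crawl batch can be traced back to its sources.
ApplicationSettings.InsertHyperLinks = true;

// Keep track of the starting AbsoluteUris yourself, per identifier tag.
Dictionary<string, string> startingAbsoluteUrisByCrawlIdentifier = new Dictionary<string, string>
{
    { "Crawl 1", "http://websitea.example/" },
    { "Crawl 2", "http://websiteb.example/" },
};

foreach (KeyValuePair<string, string> startingPoint in startingAbsoluteUrisByCrawlIdentifier)
{
    // Restrict each CrawlRequest to its host/domain (per your AN version's
    // CrawlRequest options) so the batches stay separate.
    CrawlRequest crawlRequest = new CrawlRequest(new Discovery(startingPoint.Value), int.MaxValue);
    crawlRequest.UserData = startingPoint.Key; // the identifier tag for this batch

    _crawler.Crawl(crawlRequest);
}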

Anyway - this is a bit of a ramble, something that DOES require a bit of thinking...

Perhaps tell me more about the business logic behind what you want to do with the data in the end?  (and we can go from there...)

Thanks!
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
