arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

newbie questions

Top 200 Contributor
1 Posts
vitorsilva posted on Sat, Oct 24 2009 6:38 AM

.crawling pages with a specific url
I want to limit the crawl to pages with a specific URL pattern, like domain.com/page.aspx?id=NUMBER, where NUMBER is a value I know in advance, such as a number from 10 to 20. Is there any way I can specify this?

.reports
I noticed some tables and sprocs in the rpt schema, but every table I open seems to be empty. Do I need to execute something to populate those tables?

.how/where is the WebPages_MetaData table populated?

Answered (Verified) Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

1.) Find AbsoluteUri.cs.  You can either modify this plugin, or create a copy and follow the pattern.

2.) Execute the 'Update Reporting' stored procedure ([rpt].[arachnode_rsp_UpdateReporting]).

3.) Check the cfg.Configuration table.  Set 'ExtractWebPageMetaData' to 'true' and 'InsertWebPageMetaData' to 'true'.
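
Step 3 could be sketched in T-SQL roughly as follows. The column names [Key] and [Value] are assumptions that this thread does not confirm, so the first statement simply lists the table so the real column names can be checked before anything is changed.

```sql
-- Inspect the table first to confirm its actual column names.
SELECT * FROM [cfg].[Configuration];

-- ASSUMPTION: settings are stored in columns named [Key] and [Value].
-- Adjust both names to match what the SELECT above returns.
UPDATE [cfg].[Configuration]
SET [Value] = 'true'
WHERE [Key] IN ('ExtractWebPageMetaData', 'InsertWebPageMetaData');
```

Running the SELECT again afterwards is a cheap way to confirm that both rows now read 'true'.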

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

Victor -

Just about to start reorganizing my living space for the day - I will answer your questions later on...

Thanks!
Mike

Top 150 Contributor
2 Posts

Hi,


The 'Update Reporting' stored procedure requires two parameters. What should I put there? I still wasn't able to populate the report tables.


Thanks,

Ori

Top 10 Contributor
1,905 Posts

ALTER PROCEDURE [rpt].[arachnode_rsp_UpdateReporting]
    @NumberOfRecords [int] = 1000,
    @NTileGroups [int] = 10
    WITH EXECUTE AS CALLER
AS

 

http://msdn.microsoft.com/en-us/library/ms175126.aspx

 

@NumberOfRecords is the number of rows written to each reporting table.
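
Both parameters in the header above carry default values, so as a minimal sketch the procedure can be called either with no arguments or with the two parameters named explicitly. The values shown are just the declared defaults, not a recommendation.

```sql
-- Relies on the defaults declared in the procedure header
-- (@NumberOfRecords = 1000, @NTileGroups = 10).
EXEC [rpt].[arachnode_rsp_UpdateReporting];

-- The same call with both parameters passed explicitly by name.
EXEC [rpt].[arachnode_rsp_UpdateReporting]
    @NumberOfRecords = 1000,
    @NTileGroups = 10;
```

Passing the parameters by name keeps the call readable and independent of their declared order.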

Did you crawl with ClassifyAbsoluteUris on?  If not, check out the PostProcessing project to reprocess your WebPages.

copyright 2004-2017, arachnode.net LLC