arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop


Questions about the Templater, etc.


Anat posted on Thu, Sep 3 2009 2:13 AM

Hi Mike,

Thanks for the quick reply on mail.

As I said, I already have the environment up and running, crawling my sites and searching.

My main goal is to have a list of sites that the user can enter and save (which I assume can be stored in the CrawlRequests table, plus enabling CreateCrawlRequestsFromDatabaseFiles?), and also a list of words to search for, which will also be entered and saved by the user. So I created a new table to hold the list of words.
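
Just for reference, here is roughly how I read the word list back out of my new table (the Keywords table and its Word column are my own naming, not part of arachnode.net):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class KeywordStore
    {
        // Loads the user-entered word list from my own Keywords table.
        // The table and column names are mine, not arachnode.net's.
        public static List<string> LoadKeywords(string connectionString)
        {
            var keywords = new List<string>();

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("SELECT Word FROM dbo.Keywords", connection))
            {
                connection.Open();

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        keywords.Add(reader.GetString(0));
                    }
                }
            }

            return keywords;
        }
    }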

Finally, only the pages from those sites that contain any of the words in my list should be downloaded; then I need to work on them and extract the relevant text. For that, I thought of using the Templater, which does some text extraction and the like.

As I understand it, arachnode.net currently works by first crawling, downloading, and indexing, and only then searching. I want to search while crawling, download only the relevant pages, and extract the text once I have them.

Hope I was clear so far.

Any thoughts on how I should proceed from here? Is the Templater relevant for my needs, and if so, how do I use it? I also saw in one of the posts last month that there were some bugs in that class; is there a newer version of it?

Thanks again,

Anat.

All Replies

Mike replied:

The Templater class is a bit of AI used for programmatically extracting the "meat" of a page... for example, how can a web spider tell which part of a page is the main blog post, or which posts on a blog page are comments?

I don't think the Templater is relevant for what you are trying to do.

Once a CrawlRequest is processed, it is removed from the CrawlRequests table, so I would create another table to hold the CR's you want to save.
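
Something along these lines; the SavedCrawlRequests table and its columns are just for illustration, so define whatever you need:

    using System;
    using System.Data.SqlClient;

    public static class CrawlRequestArchive
    {
        // Copies a URI into your own holding table, since the row disappears
        // from CrawlRequests once the request has been processed.
        // "SavedCrawlRequests" is an illustrative name, not an AN table.
        public static void Save(string connectionString, string absoluteUri)
        {
            const string sql =
                "INSERT INTO dbo.SavedCrawlRequests (AbsoluteUri, Created) " +
                "VALUES (@AbsoluteUri, @Created);";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@AbsoluteUri", absoluteUri);
                command.Parameters.AddWithValue("@Created", DateTime.Now);

                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }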

If you have a list of words that you want to filter sites by, create a custom CrawlRule and set the 'IsDisallowed' property in the rule.  Look at the DisallowedAbsoluteUris table... :)
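
A rough sketch of the idea follows; I'm writing it from memory, so treat the base class name, the override signature, and the DecodedHtml property as assumptions and compare against the rules that ship with AN:

    using System;

    // Sketch of a keyword rule: disallow any page that does not contain
    // one of the user's words. The base class, the override signature, and
    // the DecodedHtml property are assumptions - check the CrawlRules that
    // ship with AN for the exact plugin contract.
    public class KeywordCrawlRule : CrawlRule
    {
        // In practice, load these from the user's keyword table at startup.
        private static readonly string[] _keywords = { "afghanistan", "kabul" };

        public override bool IsDisallowed(CrawlRequest crawlRequest)
        {
            string text = crawlRequest.DecodedHtml ?? string.Empty;

            foreach (string keyword in _keywords)
            {
                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    // A keyword matched, so the page is allowed through.
                    return false;
                }
            }

            // No keyword matched - disallow; the URI will then show up in
            // the DisallowedAbsoluteUris table.
            return true;
        }
    }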

You are correct in your assessment of how AN works.  :)

CreateCrawlRequestsFromDatabaseFiles will create CR's from the Files database table.
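
If you go that route, it's just a configuration flag; assuming it is surfaced on the ApplicationSettings class (the property name here mirrors the setting name, so verify it in your build):

    // Assumption: the flag is exposed on ApplicationSettings under the
    // same name as the configuration setting.
    ApplicationSettings.CreateCrawlRequestsFromDatabaseFiles = true;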

Mike


For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Anat replied on Thu, Nov 5 2009 1:25 AM

Hi Mike,

Thanks for your last answer and advice.

I'm currently trying to understand how creating a CrawlRule can help me filter web pages while crawling. I've searched and read the posts about rules that I found on the forum, but I still haven't completely figured out how to define one.

Can you please give an example of the basic steps needed to define a new rule that would filter my results (including which parameters go in which tables)? For example, if I'm crawling the CNN world news site (http://edition.cnn.com/WORLD/), I would like to get back only the pages that contain stories related to Afghanistan, so that when the crawler finishes, only pages like those listed at http://edition.cnn.com/search/?query=afganistan&primaryType=mixed&sortBy=date&intl=true would appear in my folder.


Thanks again,

Anat.

Mike replied:

Anat -

We have a basic set of help documents coming very soon.

Additionally, consider purchasing a license and/or a support contract.  This is now the best way to receive directed help on your issues.

Thanks!
Mike



Copyright 2004-2017, arachnode.net LLC