arachnode.net

Setting arachnode up for RSS collection


Top 50 Contributor
14 Posts
JohnnyJr posted on Thu, Aug 20 2009 5:48 AM

Hi,

I would like arachnode to crawl around looking for blogs and store the URLs of their RSS feeds. How would I set it up to achieve this?

As of now I can't even get it crawling. I have tried to insert a CrawlRequest, both directly into the dbo.CrawlRequests table and through arachnodeDAO.InsertCrawlRequest (sketched below). Both attempts fail because my URL violates the CK (check constraint); the URL is http://www.bloggtoppen.se/kategori/alla/, a Swedish blog catalog that would be a suitable place to start crawling from.

Kind regards,

Johan
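
A minimal sketch of that DAO path, for reference. The namespace, the constructor, and the InsertCrawlRequest parameter list are all assumptions here, so check the actual overloads in your copy of arachnode.net before using this:

    using System;
    using Arachnode.DataAccess;   // assumed namespace for ArachnodeDAO

    class InsertFeedCrawlRequest
    {
        static void Main()
        {
            // The parameterless constructor and the parameter list below are
            // assumptions about this arachnode.net version's API.
            ArachnodeDAO arachnodeDAO = new ArachnodeDAO();

            // The CHECK constraint rejects the 'www.' form of the host
            // (see the verified answer below), so it is omitted here.
            arachnodeDAO.InsertCrawlRequest(
                DateTime.Now,                             // created
                "http://bloggtoppen.se/kategori/alla/",   // absoluteUri to crawl
                1,                                        // depth (assumed meaning)
                1);                                       // priority (assumed meaning)
        }
    }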

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

JohnnyJr:

I have tried to insert a CrawlRequest, both directly into the dbo.CrawlRequests table and through arachnodeDAO.InsertCrawlRequest. Both attempts fail because my URL violates the CK (check constraint); the URL is http://www.bloggtoppen.se/kategori/alla/.

(Remove the 'www.')

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 50 Contributor
14 Posts

OK, I managed to insert the CrawlRequest by removing the 'www.', but it still won't find any pages, so I am doing something wrong here...

Top 50 Contributor
14 Posts

OK, I'm kind of having a conversation with myself here, but I discovered that it's because that domain prohibits crawling in its robots.txt.

Anyway, my initial question remains: how would arachnode best be configured to gather the URLs of RSS feeds?

Top 10 Contributor
1,905 Posts

Just for others who read this post... when troubleshooting AN, start by looking in the database tables 'Exceptions' and 'DisallowedAbsoluteUris'. (A query sketch follows.)
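
A quick sketch of that check from code; a plain SELECT in SQL Server Management Studio works just as well. The connection string and the TOP 20 cutoff are arbitrary:

    using System;
    using System.Data.SqlClient;

    class TroubleshootingDump
    {
        static void Main()
        {
            // Adjust the connection string to point at your AN database.
            string connectionString = "Server=.;Database=arachnode.net;Integrated Security=true";
            string[] tables = { "dbo.Exceptions", "dbo.DisallowedAbsoluteUris" };

            using (SqlConnection connection = new SqlConnection(connectionString))
            {
                connection.Open();

                foreach (string table in tables)
                {
                    Console.WriteLine("--- " + table + " ---");

                    using (SqlCommand command = new SqlCommand("SELECT TOP 20 * FROM " + table, connection))
                    using (SqlDataReader reader = command.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            // Print the first two columns of each row; adjust
                            // to whatever your schema actually contains.
                            Console.WriteLine(reader[0] + " | " + reader[1]);
                        }
                    }
                }
            }
        }
    }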

For your question: Take a look at cfg.AllowedDataTypes.

This table controls what you are allowed to crawl. The first step is to make sure that you have DataTypes configured for all of the 'Content-Types' you want to collect, as listed in cfg.ContentTypes, covering every RSS/ATOM format you are interested in. Try searching on Google, etc. for the association between RSS feeds and their 'Content-Type' headers to ensure that all of the RSS/ATOM formats are properly represented in cfg.AllowedDataTypes. (A seeding sketch follows below.)

Then send out a crawl or ten and examine what is present in DisallowedAbsoluteUris to ensure that you haven't missed any. If you see entries like 'http://technorati.com/main.rss' in the DisallowedAbsoluteUris table, then cfg.AllowedDataTypes is missing a DataType for the AbsoluteUris/WebPages you are trying to crawl. After you get this going we can tweak according to your specific use case.
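
A hedged sketch of that seeding step. The column names (ContentTypeID, Extension) are assumptions about the cfg schema, so compare against your own cfg.AllowedDataTypes and cfg.ContentTypes layout before running it:

    using System.Data.SqlClient;

    class SeedAllowedDataTypes
    {
        static void Main()
        {
            string connectionString = "Server=.;Database=arachnode.net;Integrated Security=true";

            // Insert a row per common feed Content-Type, skipping any that
            // are already present. Column names are assumptions.
            const string sql = @"
                INSERT INTO cfg.AllowedDataTypes (ContentTypeID, Extension)
                SELECT ct.ContentTypeID, '.xml'
                  FROM cfg.ContentTypes ct
                 WHERE ct.ContentType IN ('application/rss+xml', 'application/atom+xml',
                                          'application/xml', 'text/xml')
                   AND NOT EXISTS (SELECT 1 FROM cfg.AllowedDataTypes adt
                                   WHERE adt.ContentTypeID = ct.ContentTypeID);";

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }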

Crawling assumes the default configuration, which can be restored by running the 'RESET' stored procedure with '1'.

-Mike

 


Top 50 Contributor
14 Posts

Thank you!

Now it seems to be crawling and fetching URLs all right. No RSS feeds end up in DisallowedAbsoluteUris, and the Hyperlinks table is filling up with links to RSS feeds. (Is this really the correct table to be looking in? How can I determine that a hyperlink points to an RSS feed?)

So how can we tweak this? Is there a way to collect ONLY RSS feeds, but crawl all types of pages?

/Johan

Top 10 Contributor
1,905 Posts

Check the Files table - your RSS feeds should be there!  :)

Mike


Top 50 Contributor
14 Posts

That is what I initially thought, but the only entries in that table are a few .swf files from YouTube.

Top 10 Contributor
1,905 Posts

Are your RSS files listed in DisallowedAbsoluteUris?


Top 50 Contributor
14 Posts
JohnnyJr replied on Sat, Aug 22 2009 10:15 AM

No, they are not.

Top 50 Contributor
14 Posts

Do you have any further insights? As of now I query the Hyperlinks table for all URIs containing "feed" or "rss" (see the sketch below), but that is not a very neat way of doing it: I get results I don't want, and probably miss some that I would want.

/Johan
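
The stopgap described above looks roughly like this. The table name matches what the thread mentions, but the AbsoluteUri column is an assumption about the schema, and the LIKE patterns are exactly why it over- and under-matches:

    using System;
    using System.Data.SqlClient;

    class FeedLookingHyperlinks
    {
        static void Main()
        {
            string connectionString = "Server=.;Database=arachnode.net;Integrated Security=true";

            const string sql = @"
                SELECT AbsoluteUri
                  FROM dbo.Hyperlinks
                 WHERE AbsoluteUri LIKE '%rss%'
                    OR AbsoluteUri LIKE '%feed%';";

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                connection.Open();

                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Matches non-feeds whose URL happens to contain 'rss',
                        // and misses feeds whose URL mentions neither substring.
                        Console.WriteLine(reader.GetString(0));
                    }
                }
            }
        }
    }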

Top 10 Contributor
1,905 Posts

Could you take a screenshot of the values in cfg.AllowedDataTypes and post it, please?


Top 50 Contributor
14 Posts

Thank you! That was enough for me to discover what I'd done wrong. I had mistyped the ID... (doh!)

Now the RSS feeds end up in Files after the crawl is done. Is that how it is supposed to be? Or should the Files table be updated as the crawl goes along, the way the Hyperlinks table is?

If it isn't, is there a way to get it to do that? I want the spider to search widely and deeply, and have another application harvest the feeds and read them.

Top 10 Contributor
1,905 Posts

You are correct on both counts. RSS will be placed in the Files table and will be updated as updated content is found.

I'm not 100% sure what you are expecting, but I think AN crawls the way you want it to.  :)


Top 50 Contributor
14 Posts

I see now that I might have been a little unclear in my question. So to be more concise:

First I tried a crawl, starting with a blog that I knew contained feeds, with a depth of 1. When it was finished there were RSS feeds in the Files table.

I then started a broader crawl, starting from a blog portal, with a depth of 10. Since it would not be finished for several hours I expected the Files table to get entries as the crawl went along, but it didn't - even though there are feeds in the Hyperlinks table.

My guess is that I still haven't set up the AllowedDataTypes table correctly. At the moment there are four entries for different Content-Types that could be feeds (application/atom+xml, etc.). All four entries have the Extension set to ".xml" (because my guess was that this was the extension the files would be saved with?). In reality a feed could have any extension, depending on the script creating it (php, aspx, etc.). Should this be taken into account in any way? (See the Content-Type sketch below.)

I will be stepping through the arachnode code to see if I can answer my own questions, but I would be very happy if you could assist...
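
On the extension question: a feed can live behind .php, .aspx, or no extension at all, so the reliable signal is the response's Content-Type header rather than the URL. A generic sketch (not AN-specific) of checking it:

    using System;
    using System.Net;

    class FeedSniffer
    {
        // Returns true when the server labels the resource with a feed-like
        // Content-Type, regardless of the URL's extension.
        static bool LooksLikeFeed(string absoluteUri)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);
            request.Method = "HEAD";   // headers only; not every server supports HEAD

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                string contentType = response.ContentType ?? string.Empty;

                return contentType.IndexOf("rss", StringComparison.OrdinalIgnoreCase) >= 0
                    || contentType.IndexOf("atom", StringComparison.OrdinalIgnoreCase) >= 0
                    || contentType.IndexOf("xml", StringComparison.OrdinalIgnoreCase) >= 0;
            }
        }

        static void Main()
        {
            Console.WriteLine(LooksLikeFeed("http://technorati.com/main.rss"));
        }
    }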

Top 50 Contributor
14 Posts

The answer is simple: the feeds have been added to CrawlRequests but have not yet been processed - which leads me to a new question:

Is there a way to automatically set the priority of presumed feeds to a higher number, so that they are processed as soon as they are found? Since the sole purpose of my crawl is to find them, this would be very desirable. (A sketch of one approach follows below.)
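
One way to approximate that from outside the crawler, assuming the CrawlRequests table carries Priority and AbsoluteUri columns (both assumptions - verify against your schema), is to periodically bump the priority of feed-looking requests before they are processed:

    using System.Data.SqlClient;

    class BumpFeedPriority
    {
        static void Main()
        {
            string connectionString = "Server=.;Database=arachnode.net;Integrated Security=true";

            // Table and column names are assumptions about the AN schema;
            // 100 is an arbitrary 'high' priority value.
            const string sql = @"
                UPDATE dbo.CrawlRequests
                   SET Priority = 100
                 WHERE AbsoluteUri LIKE '%rss%'
                    OR AbsoluteUri LIKE '%feed%';";

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }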
