arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop


ArachNode with Facebook and Twitter

Senthamil posted on Tue, Aug 12 2014 9:22 PM

Hi,

  I am new to arachnode and am currently evaluating it for my use. I have a few questions:

  • Is there any sample or option for crawling Facebook and Twitter? Is any add-in or plug-in recommended? I saw one thread here about it, but I get "page not found" when I click Next on the thread.
  • Does the crawler have any image processing capability?
  • How do I get the crawled content from the DB? Does it store only a few lines of content or the complete content from the URL?
  • Is there support for incremental crawling, fetching only what has changed on the web? If so, does it save the whole page or just the delta?

Verified Answer (verified by arachnode.net)

Hello:

  • Is there any sample or option for crawling Facebook and Twitter? Is any add-in or plug-in recommended? I saw one thread here about it, but I get "page not found" when I click Next on the thread.

(answering this one offline...)

  • Does the crawler have any image processing capability?

Yes, images are first-class citizens in the AN DB: Exif extraction, storage, DB retrieval, and FullTextType if you have iFilters to apply.

  • How do I get the crawled content from the DB? Does it store only a few lines of content or the complete content from the URL?

There are a number of functions to retrieve content from the DB, as well as from disk if you have stored your content there. You can store the entire website if you elect to.

  • Is there support for incremental crawling, fetching only what has changed on the web? If so, does it save the whole page or just the delta?

AN honors the 'Last-Modified' HTTP header, and you can configure the caching behavior for outside access. To retrieve just the changes to a page, the destination web server would need to honor HTTP Range requests, and we'd have to know what that range was. Then the Html and/or Script would have to be merged, of course... This is further compounded by Gzip.
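
For readers who want to see what header-based change detection looks like on the wire, here is a minimal sketch of a conditional GET using the standard .NET HttpClient. It is not arachnode.net's own code or API: the class name ConditionalFetch and its method names are hypothetical, and it assumes you have kept the Last-Modified and/or ETag values from an earlier crawl of the same URL. Note that it answers "changed or not?" for the whole page; it does not fetch a delta.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical helper, not part of arachnode.net.
public static class ConditionalFetch
{
    // Builds a GET that asks the server to send the body only if the page
    // changed since the validators we stored on a previous crawl.
    public static HttpRequestMessage BuildRequest(Uri uri, DateTimeOffset? lastModified, string etag)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, uri);
        if (lastModified.HasValue)
        {
            // Sent as the If-Modified-Since header.
            request.Headers.IfModifiedSince = lastModified;
        }
        if (!string.IsNullOrEmpty(etag))
        {
            // Sent as the If-None-Match header; the ETag is expected exactly
            // as the server returned it, including the double quotes.
            request.Headers.IfNoneMatch.Add(EntityTagHeaderValue.Parse(etag));
        }
        return request;
    }

    // 304 Not Modified means the stored copy is still current.
    public static bool IsUnchanged(HttpResponseMessage response)
    {
        return response.StatusCode == HttpStatusCode.NotModified;
    }

    // Returns the full page body if it changed, or null if the server
    // answered 304 and the stored copy can be reused.
    public static async Task<string> FetchIfChangedAsync(
        HttpClient client, Uri uri, DateTimeOffset? lastModified, string etag)
    {
        using (var request = BuildRequest(uri, lastModified, etag))
        using (var response = await client.SendAsync(request))
        {
            if (IsUnchanged(response))
            {
                return null;
            }
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

A server that ignores both headers simply returns 200 with the full body, so the sketch degrades to an ordinary fetch. Fetching only a byte range of a changed page would additionally require the server to support Range requests, which is the difficulty described above.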

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC