arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Renderer Functionality

arachnode.net Posted: Thu, Apr 9 2015 9:14 PM

I received this excellent question via Skype and thought it was a good one to share and to answer publicly with images.

Ok. I will just post my question, pls answer it whenever u have time, shouldn't take too long I hope. Will the crawler detect URLs which are activated through ajax / buttons / whatever other interaction. E.g.

[7:17:26 AM] Skype User: <a href="http://www.whitehouse.gov">The Government..</a>

<button onclick="window.location.href='http://www.nsa.gov'">..makes us safe!</button>

[7:17:33 AM] Skype User: Will both links be crawled?

[7:17:44 AM] Skype User: I found so far that they will not, out of the box.

[7:18:49 AM] Skype User: So, the question is, if we have buttons on a site, which contain javascript code, and when clicked take us someplace.. can we make Arachnode 'press' the buttons and crawl the resulting sites?

[7:20:33 AM] Skype User: But please note this is not limited to buttons, I just use them as an example, the question is more general in nature, it concerns any way any site may navigate to some URI (or post an ajax request), e.g. with a mouse-over, scroll (I saw that in your blog), or something else...

My reply:

As a precursor to understanding AN's capabilities, read this first: http://arachnode.net/blogs/arachnode_net/archive/2015/04/01/ajax-dynamic-content.aspx

By default, Rendering is not enabled in the demo configuration. It requires a few additional setup steps to allow cross-process communication (IPC/MQ), and this sometimes confuses newer users of Visual Studio and .NET technologies.  (The Renderers functionality takes care of dynamic/AJAX content and runs out of process so that crashes in the COM context don't bring down the entire crawling process; a .NET try/catch does not catch ALL unhandled COM exceptions.)
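To illustrate why running the renderer out of process protects the crawler, here is a minimal sketch in Python. It is not AN's code: AN's Renderer is a .NET project that communicates over IPC/MQ, while this stand-in uses a plain child process and its standard output. The function name `render_out_of_process` and the two sample "renderers" are hypothetical.

```python
import subprocess
import sys


def render_out_of_process(child_code, timeout_seconds=10.0):
    """Run child_code in a separate Python process and report how it ended.

    Hypothetical helper: the child stands in for a renderer that may crash
    in a way the parent could not catch if it ran in the same process.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # A hung renderer is reported as a failure instead of blocking the crawl.
        return {"ok": False, "output": "", "reason": "timeout"}
    if completed.returncode != 0:
        # The child died, but the parent (the crawler) is still running.
        return {"ok": False, "output": "",
                "reason": f"exit code {completed.returncode}"}
    return {"ok": True, "output": completed.stdout.strip(), "reason": ""}


# Assumption: a healthy renderer prints the rendered markup and exits normally.
HEALTHY_RENDERER = "print('<html>rendered</html>')"
# Assumption: os._exit simulates a hard crash that skips all exception handling.
CRASHING_RENDERER = "import os; os._exit(3)"
```

The crawler side only ever sees a result dictionary, so a crashed or hung child becomes one failed page rather than a dead crawl.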

Yes, AN can interact with web pages in any way a human would.  You can click buttons, click dynamic links, intercept AJAX calls to web services (think Pinterest scrolling) using the local proxy server, intercept WebFonts, and so on.

I have recently started adding RendererActions because requests for, and questions about, rendering capabilities have skyrocketed, so this is now my main focus.

First, take a look at the OUT-OF-PROCESS Rendering by setting the Renderer project as the startup project and applying the following settings:

After each page is downloaded and processed for dynamic content, the resulting structures are passed to the RendererActions plugins.

The 'Hrefs' plugin captures all href attribute properties, including 'document.location.href'.
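As a rough illustration of the difference between plain anchor links and script-driven navigation, the sketch below collects both kinds of URL from the markup in the question. It is a static approximation written in Python with the standard library, not the 'Hrefs' plugin itself: AN works on the rendered page, whereas this only pattern-matches inline event handlers. The names `HrefCollector`, `collect_hrefs` and `LOCATION_ASSIGNMENT` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches quoted assignments such as window.location.href='...' or
# document.location.href = "..." inside an inline event handler.
LOCATION_ASSIGNMENT = re.compile(
    r"""\blocation(?:\.href)?\s*=\s*(['"])(.+?)\1"""
)


class HrefCollector(HTMLParser):
    """Collects href attributes and URLs assigned to location in on* handlers."""

    def __init__(self):
        super().__init__()
        self.static_hrefs = []   # values of href="..." attributes
        self.script_hrefs = []   # URLs found in onclick, onmouseover, etc.

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "href":
                self.static_hrefs.append(value)
            elif name.startswith("on"):
                for match in LOCATION_ASSIGNMENT.finditer(value):
                    self.script_hrefs.append(match.group(2))


def collect_hrefs(html):
    """Return (static_hrefs, script_hrefs) for the given markup."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.static_hrefs, collector.script_hrefs


# The two elements from the question above.
SAMPLE = (
    '<a href="http://www.whitehouse.gov">The Government..</a>'
    '<button onclick="window.location.href=\'http://www.nsa.gov\'">'
    '..makes us safe!</button>'
)
```

Calling `collect_hrefs(SAMPLE)` finds the anchor's URL in the first list and the button's URL in the second. Pattern matching like this misses URLs that are computed at run time or set from external scripts, which is the case a real renderer is needed for.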

How do we interact with a WebPage?  Look at the Inputs RendererAction.

Simply grab the elements you wish to modify, set their values and invoke their members.

EDIT: An AN user suggested looking at this as a possible third option for the Rendering functionality: https://github.com/cefsharp/CefSharp


For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

copyright 2004-2017, arachnode.net LLC