arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Is this the tool for me?

Answered (Verified) This post has 1 verified answer | 2 Replies | 3 Followers

CMSLoader posted on Mon, Nov 23 2009 8:27 PM

Hi All,

I've had a quick look through the site, but I'm not sure if this is the tool for me. I need something I can use to scrape client sites and kick out XML for a tool I built that ports the XML into our standard content management system. This would happen each time we deploy.

I want to build a tool that will scrape a site, present a business user with a set of current pages in a tree that resembles their current site information architecture (as best I can, usually just going by the URL folder structure), and download all the binary assets required for the content (images, PDFs, etc.).

The user would then be able to rearrange their content into whatever information architecture they want (in a tree), apply a bunch of filters (regex, DOM selectors, whatever) to clean the HTML down to just the appropriate body content, and export to XML with a parent-child structure from the home page down through the sections of the site. The XML indicates both content and site structure.
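To make the tree-building step concrete, here is roughly what I have in mind (just my own rough sketch in plain C#, not anything from AN — the names are mine): split each crawled URL on its path segments, hang pages off a common root, and serialize the result as nested XML.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public class SiteNode
{
    public string Segment;
    public SortedDictionary<string, SiteNode> Children = new SortedDictionary<string, SiteNode>();
}

public static class UrlTreeBuilder
{
    // Build a tree from absolute URIs, keyed on URL folder structure.
    public static SiteNode Build(IEnumerable<string> absoluteUris)
    {
        var root = new SiteNode { Segment = "home" };
        foreach (var uriString in absoluteUris)
        {
            var uri = new Uri(uriString);
            var node = root;
            foreach (var segment in uri.AbsolutePath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!node.Children.TryGetValue(segment, out var child))
                {
                    child = new SiteNode { Segment = segment };
                    node.Children[segment] = child;
                }
                node = child;
            }
        }
        return root;
    }

    // Serialize the tree as nested <page> elements: parent-child structure
    // from the home page down through the sections of the site.
    public static XElement ToXml(SiteNode node)
    {
        var element = new XElement("page", new XAttribute("name", node.Segment));
        foreach (var child in node.Children.Values)
            element.Add(ToXml(child));
        return element;
    }
}
```

The business user would then rearrange nodes in this tree before export, rather than the raw crawl order.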

So my question is, how far does arachnode get me to that goal?
I'm an intermediate-level C# coder - how much effort do you think it would take to implement something like this given the set of tools arachnode gives you?

Any/all comments appreciated!

Thanks,
Wayne.

Verified Answer

Kevin replied on Tue, Nov 24 2009 9:47 AM
Verified by arachnode.net

Wayne,

I echo what Mike said.  It sounds like AN has all the core tools you need to make things happen.  It can pull down all content and assets for you.  It has parsing opportunities: you can use the HtmlAgilityPack, and you can write custom plug-ins to work with data as it is visited and retrieved.
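As a rough illustration of the filtering Wayne described (my own sketch, not code that ships with AN — the `div[@id='content']` selector is a placeholder for whatever DOM filter a given client site needs), HtmlAgilityPack makes the "clean the HTML down to body content" step fairly short:

```csharp
using HtmlAgilityPack;

public static class ContentFilter
{
    // Reduce a fetched page to just its body content via XPath selectors.
    public static string ExtractBody(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Strip script and style noise before selecting the content region.
        var noise = document.DocumentNode.SelectNodes("//script|//style");
        if (noise != null)
        {
            foreach (var node in noise)
                node.Remove();
        }

        // Prefer a site-specific content region; fall back to <body>.
        var content = document.DocumentNode.SelectSingleNode("//div[@id='content']")
                      ?? document.DocumentNode.SelectSingleNode("//body");
        return content != null ? content.InnerHtml : html;
    }
}
```

You would wire something like this into a custom plug-in so it runs as each WebPage is retrieved.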

There is definitely some coding involved in building your particular features on top of the AN core features.  But, as an intermediate C# developer, I bet you would really like the overall architecture of the code.  Mike has done a great job.

We are active on the forums, and we also have commercial support available if you need additional help with your project.

It sounds like most of the work you'll have is in designing and coding the interface on top of AN, to accomplish the processes you want.  We are available for contract work, based on workload and next release priorities.

I definitely encourage you to pull down the personal version and get a look at the code, if you have not already.  Check out the help and getting started video as well.  And post here in the forums as you have done!

Kevin

 

All Replies


AN can collect the data, no problem... AN also parses WebPages to XML.  Just about anything you would want to do with the incoming byte streams is either present in AN as a plugin or has been implemented by third parties.

For a GUI, I can put you in touch with an AN team member who does EXACTLY what you are looking for.

And, as always, I am available every day to answer questions.

- Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet



copyright 2004-2017, arachnode.net LLC