arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Crawling Facebook and twitter

posted on Sun, Jan 17 2010 10:21 AM

Hi Mike

 

I downloaded arachnode.net two days ago and I am impressed by your work. My question: can I crawl Facebook and Twitter using arachnode.net?

 

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Twitter shouldn't be any problem at all.  You may have to ignore the robots.txt file to crawl ALL of Twitter, but you CAN turn robots.txt checking off in AN.
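
To make the robots.txt point concrete, here is a minimal sketch in Python (not arachnode.net code) of what obeying or ignoring robots.txt amounts to for a single URL. The sample rules, the example.com URLs, the "MyCrawler" user agent and the obey_robots switch are all assumptions made up for illustration; they are not Twitter's actual robots.txt or AN's actual setting name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt, user_agent, url, obey_robots=True):
    """Return True if the crawler may fetch url.

    With obey_robots=False the rules are skipped entirely, which is
    roughly what switching the check off in a crawler's config does.
    """
    if not obey_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    search_url = "http://example.com/search?q=haiti"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", search_url))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", search_url, obey_robots=False))
```

With the check on, only paths outside the disallowed prefixes get crawled; with it off, everything is fair game, so keep the request rate polite.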

I ran a test crawl from Facebook, while logged out, and downloaded a ton of pages.  So, tell me - what do you want to crawl on Facebook?

I started crawls from here: http://www.facebook.com/people/Mike-Anderson/773002299 and here: http://twitter.com/arachnode_net

...and got these results (SUCCESS, to me, anyway...)

Also, if you register, you can receive email notifications when there are replies to your posts.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Jan 18 2010 7:54 AM

Twitter links are available whether you are logged in or not.  So, you can submit a crawl request for something like http://twitter.com/search?q=haiti and walk it, no problem.  The rules/templates for walking Twitter content are likely quite different from those for HTML pages, though ;)

Regarding Facebook, I'm not sure about that one.  I believe you have to be logged in with a valid Facebook account to see most anything.

Hope that helps a bit.

 

replied on Mon, Jan 18 2010 8:02 AM

Hey!  Thanks!

I am coming back from vacation today and can answer your question when I get back... (just wanted to say 'Hello' though...)

Mike

replied on Mon, Jan 18 2010 11:57 AM

Thanks, Kevin, for your info.

replied on Mon, Jan 18 2010 11:58 AM

Waiting for you, Mike.

Top 10 Contributor
1,905 Posts

Cool.  Just got back - weary and worn but had an amazing time.

Will write more in the morning.

Thanks!

replied on Wed, Jan 20 2010 1:56 PM

Aha, I managed to crawl Facebook. I had to comment out the Politness.cs code... oops :S, and I was logged on to my Facebook account. Is this the correct way to do it for any social site?

Top 10 Contributor
1,905 Posts

You can get a lot of mileage out of using the CredentialCache in the Crawler.
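
For readers unfamiliar with the idea, a credential cache maps URI prefixes to login details so that requests under a prefix are authenticated automatically. Below is a rough Python analogue using the standard library's password manager. It is a sketch of the concept only, not the .NET CredentialCache API and not arachnode.net code; the function name, example.com URLs and credentials are hypothetical. Note that this style of cache covers HTTP authentication schemes such as Basic; sites that log you in through a web form and cookies generally need a cookie-based session instead.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)


def build_authenticated_opener(base_url, username, password):
    """Return (opener, password_mgr) that answer HTTP Basic challenges
    for any URL under base_url with the stored credentials."""
    password_mgr = HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm under base_url".
    password_mgr.add_password(None, base_url, username, password)
    opener = build_opener(HTTPBasicAuthHandler(password_mgr))
    return opener, password_mgr


if __name__ == "__main__":
    # Hypothetical site and credentials, for illustration only.
    opener, mgr = build_authenticated_opener("http://example.com/", "user", "secret")
    # Look up which credentials would be used for a page under the prefix.
    print(mgr.find_user_password(None, "http://example.com/private/page"))
```

The opener's open() method can then be used in place of a plain URL fetch; credentials are only sent when the server issues an authentication challenge for a matching URL.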

Which version are you using, BTW?

Mike


copyright 2004-2017, arachnode.net LLC