Jay:
Thank you! arachnode.net (AN) has been around in one form or another since mid-2004... there are thousands of users and thousands upon thousands of hours invested in the work. I really appreciate the kind words.
1.) It is - if you are planning to share a DB, the configuration depends on whether you intend to share the DB simultaneously with unrelated Crawls. If so, look at ApplicationSettings.UniqueIdentifier. This setting appends a QueryString to each AbsoluteUri, so be sure to select something unique for each crawling instance or group of machines that will logically crawl together. If you intend for each machine to participate in the same Crawl, then just point each machine at the central database. You'll want to make one small change to the Engine...
...always return 'false' from this and the Crawler instances will keep trying to pull from the DB. They will always be running.
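Conceptually, the change just short-circuits the Engine's "is the crawl finished?" check so it never reports completion. Something along these lines - the method name and body here are my paraphrase, not the exact AN source:

// In the Engine (method name/location paraphrased): instead of reporting completion
// when no CrawlRequests remain, always report "not finished" so every Crawler instance
// keeps polling the central database for new CrawlRequests.
private bool HasTheCrawlCompleted()
{
    // return _crawlRequestsInProgress == 0 && _arachnodeDAO.GetCrawlRequestCount() == 0;
    return false;   // never complete: keep pulling from the shared DB
}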
Or, depending upon your source, you may elect to create an EngineAction, which can load CrawlRequests from a non-SQL source (should you choose)...
What are you trying to crawl? Please give as much detail as possible.
2.) How to restrict? You want to set RestrictCrawlTo in the CrawlRequest constructor to Domain | Host, depending upon exactly where in a site you want to restrict the Crawl to. Also look at creating a Plugin - a PreAndPostRequest CrawlRule - this way you can set IsDisallowed on everything that doesn't match your site's Domain | Host or the CDN. Look at UserDefinedFunctions.ExtractDomain and UserDefinedFunctions.ExtractHost, and work through this tutorial: https://arachnode.net/Content/CreatingPlugins.aspx
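A rough sketch of such a rule is below - treat the base class name, the override signature and the allowed hosts as approximations on my part; the CreatingPlugins tutorial has the exact types:

using System;
using System.Collections.Generic;
// using Arachnode...;   // exact namespaces are covered in the CreatingPlugins tutorial

public class RestrictToSiteAndCdnCrawlRule : ACrawlRule<ArachnodeDAOMySQL>   // base class name is approximate
{
    // The site's Host plus its CDN's Host - both placeholders.
    private static readonly HashSet<string> _allowedHosts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "www.example.com",
            "cdn.example.com"
        };

    public override bool IsDisallowed(CrawlRequest<ArachnodeDAOMySQL> crawlRequest)   // signature is approximate
    {
        // UserDefinedFunctions.ExtractHost does the same job on the DB side.
        string host = crawlRequest.Discovery.Uri.Host;

        // Disallow everything that isn't the site itself or its CDN.
        return !_allowedHosts.Contains(host);
    }
}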
3.) Yes, just drop them into ProxyServers.txt - Address:Port - then look at private static void LoadProxyServers() in Console\Program.cs. There they go through an initialization routine, including checking for an expected string after fulfilling an HttpWebRequest against the intended Crawl target.
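ProxyServers.txt is just one proxy per line in Address:Port form (these addresses are placeholders):
192.0.2.10:8080
198.51.100.22:3128

Conceptually, the loading step boils down to something like this - the verification against the Crawl target is what the real routine adds:

// Requires System.IO and System.Net. A conceptual sketch only - the real LoadProxyServers()
// in Console\Program.cs also verifies each proxy before use.
private static void LoadProxyServers()
{
    foreach (string line in File.ReadAllLines("ProxyServers.txt"))
    {
        string[] parts = line.Split(':');                          // "Address:Port"
        var proxy = new WebProxy(parts[0], int.Parse(parts[1]));
        // ... verify the proxy (fulfill an HttpWebRequest against the Crawl target and check
        // for the expected string), then hand it to the Crawler/Engine ...
    }
}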
4.) Yes, look at private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAOMySQL> crawlRequest) in Console\Program.cs - Forbidden is the most common status returned.
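As a sketch (the property chain used to read the status code is an assumption on my part - check the real handler in Console\Program.cs):

// Requires System and System.Net. Property names after 'crawlRequest.' are approximate.
private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAOMySQL> crawlRequest)
{
    HttpStatusCode statusCode = crawlRequest.WebClient.HttpWebResponse.StatusCode;

    if (statusCode == HttpStatusCode.Forbidden)   // 403 - the most common failure status
    {
        Console.WriteLine("Forbidden: " + crawlRequest.Discovery.Uri.AbsoluteUri);
        // e.g. rotate to a different proxy and re-submit this CrawlRequest...
    }
}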
Thanks for the tip: Yes, they are probably set to a non-standard setup - just run through the DEMO setup when you get a fresh checkout from SVN and all paths will be fixed up as they should be.
Thanks,
Mike
For best service when you require assistance:
Skype: arachnodedotnet
Hello Mike, thank you for all the information...
1) What is the difference between CrawlRequests and Discoveries?
2) When setting up CrawlRequests in the DB, what is the difference between AbsoluteUri0, AbsoluteUri1 and AbsoluteUri2?
3) Do you know of a good tutorial on Lucene?
As for crawling: I am attempting to crawl Amazon for data and images... however, they make it really hard to stay within a category. Do you know of any plugins that assist in crawling Amazon?
Lastly... I really like the Scraper. The Browser tab is great for identifying sections of a site, and the second tab, "Path Filter", looks very promising. However, after I grab the paths and set up the crawl and scrape actions, the version I have does nothing. Is there a more recent version, or am I missing something? The last two tabs are also blank.
Thank you
Jason
1.) A CrawlRequest is anything you wish to Crawl, and a Discovery is a piece of internet content - look in the DB: anything with _Discoveries after its name qualifies as a Discovery. (WebPages do too...)
Here are some good links: https://arachnode.net/Content/FrequentlyAskedQuestions.aspx | https://arachnode.net/forums/p/739/11292.aspx#11292
2.) AbsoluteUri0 is where the request originally came from, and this may or may not be set depending upon what RestrictCrawlTo (in the CrawlRequest(...) constructor) is set to. AbsoluteUri1 is the parent, and AbsoluteUri2 is the child. For example, if a crawl discovers page A, and page A links to page B, then when B is crawled A is the parent (1) and B is the child (2), with 0 recording where the request originally came from.
3.) This is the official reference for Lucene syntax: http://lucene.apache.org/core/3_6_2/queryparsersyntax.html Other than this, I find stackoverflow.com really helpful: http://stackoverflow.com/questions/2297794/how-would-one-use-lucene-net-to-help-implement-search-on-a-site-like-stack-overf?rq=1
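Beyond the syntax reference, a minimal Lucene.Net 3.x search against an existing index looks roughly like this - the index path and the field names ("text", "absoluteuri") are placeholders, since AN's own index schema may name fields differently:

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using LuceneVersion = Lucene.Net.Util.Version;

public static class LuceneSearchExample
{
    public static void Search(string indexPath, string queryText)
    {
        // Open an existing index (read-only) and run a query against the "text" field.
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        using (var searcher = new IndexSearcher(directory, true))
        {
            var parser = new QueryParser(LuceneVersion.LUCENE_30, "text", new StandardAnalyzer(LuceneVersion.LUCENE_30));
            Query query = parser.Parse(queryText);   // e.g. "\"digital camera\" AND host:amazon.com"

            TopDocs topDocs = searcher.Search(query, 10);
            foreach (ScoreDoc scoreDoc in topDocs.ScoreDocs)
            {
                Document document = searcher.Doc(scoreDoc.Doc);
                Console.WriteLine(document.Get("absoluteuri"));   // field name is an assumption
            }
        }
    }
}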
Crawling Amazon: I don't provide Plugins for specific sites, as 1.) the site layouts change all the time - I would spend time every few days updating them... and 2.) AN is a general-purpose tool for data collection and analysis - I can't provide specific code to parse Amazon.com, if you know what I mean. ;)
The Scraper: Not finished.
Thanks!
Mike