arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


Regarding the site encoding other than utf-8

Answered (Verified) This post has 1 verified answer | 7 Replies | 2 Followers

Top 10 Contributor
83 Posts
InvestisDev posted on Mon, Sep 30 2013 1:38 AM

Hi Mike,

We are having an issue with a site whose content encoding is windows-1252 or ISO-8859-1. Some characters are corrupted in the downloaded documents as well as in the index files. Can you please look into this? Or let us know if a configuration option for this already exists that we are not aware of.

Thanks

Answered (Verified) Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Fix is checked in. :)

EncodingManager.cs

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

Strange - international encoding functionality has been in place for about four years - curious...

1.) Which sites?

2.) Which characters?

Mike

Top 10 Contributor
83 Posts

The following site is causing the issue:

http://www.arcelormittalbioflorestas.com.br

For now, we have worked around the problem by passing UTF-8 as the encoding in the GetResponseStream() method.

Again, the issue is not with the characters themselves, since the same characters are indexed correctly for other sites that declare utf-8 in their content-encoding header. BTW, characters like "ãçõ" are the ones causing issues.

Please let us know if there is a better or configurable way to do this.

Thanks
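
For readers following along, the kind of corruption described in this post can be reproduced in a few lines. This is a minimal sketch in Python, not the crawler's C# code, and it assumes the cause is a mismatch between the charset the bytes were written in and the charset used to read them. The thread does not confirm which direction the mismatch ran for this site, so both are shown.

```python
# Illustration only: decoding bytes with the wrong charset.
text = "ãçõ"

# In windows-1252 each of these characters is a single byte (E3, E7, F5).
cp1252_bytes = text.encode("cp1252")

# Decoding with the matching charset restores the original text.
decoded_ok = cp1252_bytes.decode("cp1252")

# Decoding the same bytes as UTF-8 is invalid; with errors="replace"
# the undecodable bytes turn into U+FFFD replacement characters.
decoded_bad = cp1252_bytes.decode("utf-8", errors="replace")

# The reverse mismatch: UTF-8 bytes read as windows-1252 give mojibake.
mojibake = text.encode("utf-8").decode("cp1252")

print(decoded_ok, decoded_bad, mojibake)
```

Either mismatch would leave damaged characters in both the stored document and any index built from it, which matches the symptom reported here.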

Top 10 Contributor
1,905 Posts

This fix isn't correct.  Forcing UTF-8 isn't a universal solution.  A page served in a different encoding (a Hebrew page in a legacy code page, for example) won't display properly when decoded as UTF-8.  I will take a look.

Mike
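
The objection above suggests picking the decoder per response instead of hard-coding one. The following is a minimal sketch of that idea in Python, under stated assumptions: it is not the content of EncodingManager.cs, the function names `declared_charset`, `pick_charset` and `decode_body` are hypothetical, and it only looks at the charset parameter of a Content-Type header (a real crawler would likely also consider a byte order mark and meta tags).

```python
import codecs


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    for part in (content_type or "").split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip("\"'") or None
    return None


def pick_charset(content_type, default="utf-8"):
    """Use the declared charset if Python knows it, else fall back to default."""
    charset = declared_charset(content_type)
    if charset is None:
        return default
    try:
        codecs.lookup(charset)  # raises LookupError for unknown names
    except LookupError:
        return default
    return charset


def decode_body(body, content_type):
    """Decode response bytes using the charset chosen by pick_charset."""
    return body.decode(pick_charset(content_type), errors="replace")


# Usage: bytes in a single-byte encoding decode correctly when the header is honoured.
page = "ãçõ".encode("cp1252")
print(decode_body(page, "text/html; charset=windows-1252"))
```

The fallback to UTF-8 applies only when no usable charset is declared, so a page that correctly declares another encoding is never forced through UTF-8.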

Top 10 Contributor
83 Posts

>> This fix isn't correct.

Agreed. But we needed this change for a single use, so we went ahead with it.

Awaiting your response.

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Fix is checked in. :)

EncodingManager.cs

Top 10 Contributor
83 Posts

Thanks Mike!

Top 10 Contributor
1,905 Posts

It may be checked into the 3.5 code - sorry if it is.


copyright 2004-2017, arachnode.net LLC