Question
Thursday, March 6, 2014 8:45 AM
Hello all,
I have some text files in Japanese (code page 932, SHIFT-JIS), and I would like to detect this code page programmatically in C#.
I have tried StreamReader, but it detects the wrong code page, so when I write the text back to a file it comes out as garbage characters.
Does anyone know how to do this?
Thanks.
All replies (6)
Friday, March 7, 2014 6:45 AM ✅ Answered
"I don't know if it guesses or it has a proper solution to detect the encoding of the file."
It guesses, and you got lucky. Or you have changed your system's code page to 932, and VS uses that by default.
"BTW, does anyone know of a tool that converts Japanese files (or any files) to Unicode and can also convert a whole folder?"
Not that I know of. But it's trivial to write such a tool for the case where the encoding of the file(s) is known:
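using System.IO;
using System.Text;

static class Program {
    static void Main(string[] args) {
        // Code page 932 is Shift-JIS. On .NET Core and later, call
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
        var shiftJis = Encoding.GetEncoding(932);
        // Re-encode, in place, every .txt file in the folder passed as the first argument.
        foreach (var path in Directory.EnumerateFiles(args[0], "*.txt")) {
            var text = File.ReadAllText(path, shiftJis);
            File.WriteAllText(path, text, Encoding.UTF8); // writes UTF-8 with a BOM
        }
    }
}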
Thursday, March 6, 2014 10:04 AM
There's no reliable way of detecting the encoding of a text file. At best you can try to do some statistical analysis of the file contents and guess the (possibly wrong) encoding. Notepad tries to do something like this to detect Unicode files and gets it wrong sometimes.
StreamReader never attempts to detect the encoding from the file's contents. Apart from an optional check for a byte order mark, it assumes UTF-8 unless otherwise specified.
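To illustrate the kind of guessing described above, here is a minimal sketch of one common heuristic: decode the bytes as strict UTF-8 and, if that fails, fall back to an encoding you supply. The names EncodingGuess, LooksLikeUtf8 and ReadUtf8OrFallback are made up for this example. It is still a guess: pure ASCII passes the UTF-8 check, and a legacy file could in principle happen to be valid UTF-8 as well.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: true if the bytes form a valid UTF-8 sequence.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for malformed input.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Hypothetical helper: read as UTF-8 when the content validates,
    // otherwise decode with the caller's fallback (for example code page 932).
    public static string ReadUtf8OrFallback(string path, Encoding fallback)
    {
        byte[] bytes = File.ReadAllBytes(path);
        Encoding encoding = LooksLikeUtf8(bytes) ? new UTF8Encoding(false) : fallback;
        // StreamReader(Stream, Encoding) also strips a leading BOM if present.
        using (var reader = new StreamReader(new MemoryStream(bytes), encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

You would call it as EncodingGuess.ReadUtf8OrFallback(path, Encoding.GetEncoding(932)), which only makes sense if you already know that anything that is not UTF-8 is SHIFT-JIS.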
Thursday, March 6, 2014 3:16 PM
At least one of the StreamReader constructors allows you to tell it to detect the encoding from the byte order mark (BOM), but this will only work if a BOM was used. But it's better than nothing.
As Mike says, there is no reliable way in the absence of a BOM.
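For reference, a small sketch of the BOM-based detection mentioned above, using the StreamReader(Stream, Encoding, bool) constructor. The class and method names (BomDetection, DetectFromBom) are made up for this example. CurrentEncoding is only meaningful after the first read, and without a BOM the reader simply keeps the fallback you passed in.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetection
{
    // Hypothetical helper: returns the encoding StreamReader settles on
    // after looking for a byte order mark at the start of the stream.
    public static Encoding DetectFromBom(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(
            stream, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // Detection happens on the first read, so read one character.
            reader.Read();
            return reader.CurrentEncoding;
        }
    }
}
```

For a SHIFT-JIS file this always returns the fallback, because such files have no BOM.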
Thursday, March 6, 2014 4:05 PM
"At least one of the StreamReader constructors allows you to tell it to detect the encoding from the byte order mark (BOM)"
The BOM is only useful for distinguishing between Unicode encodings: UTF-8, UTF-16 LE and UTF-16 BE. The idea of a BOM simply doesn't exist in non-Unicode encodings like SHIFT-JIS, so the BOM can be useful only in certain situations.
For example, suppose you have a bunch of files and you know in advance that some are SHIFT-JIS and the rest are Unicode with a BOM. Then you can "detect" the SHIFT-JIS files by the absence of a BOM because, as far as I can tell, the bytes used for the BOM appear to be unused in SHIFT-JIS.
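A rough sketch of that mixed-folder case, checking the first bytes by hand. BomSniffer and PickEncoding are hypothetical names, and the whole thing rests on the assumption stated above: every file without a BOM really is in the legacy encoding you pass in. UTF-32 is ignored here, so a UTF-32 LE file (whose BOM starts with FF FE) would be misreported as UTF-16 LE.

```csharp
using System;
using System.Text;

static class BomSniffer
{
    // Hypothetical helper: pick a Unicode encoding from the BOM, or return
    // the assumed legacy encoding (for example code page 932) if none is found.
    public static Encoding PickEncoding(byte[] head, Encoding assumedLegacy)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;             // UTF-8 BOM
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 LE BOM
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 BE BOM
        return assumedLegacy;                 // no BOM: trust the caller's assumption
    }
}
```

The result can be passed straight to File.ReadAllText(path, encoding), for example with BomSniffer.PickEncoding(File.ReadAllBytes(path), Encoding.GetEncoding(932)).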
Thursday, March 6, 2014 5:42 PM
"There's no reliable way of detecting the encoding of a text file. At best you can try to do some statistical analysis of the file contents and guess the (possibly wrong) encoding. Notepad tries to do something like this to detect Unicode files and gets it wrong sometimes."
Internet Explorer is also fairly decent at guessing the encoding.
But overall it is true: you must know the encoding, because guessing cannot be relied on. Or, to cite this article:
http://www.joelonsoftware.com/articles/Unicode.html
"It does not make sense to have a string without knowing what encoding it uses."
Friday, March 7, 2014 4:23 AM
Guessing may not be correct anyway. I see that Visual Studio 2012 correctly shows the encoding of the file (SHIFT-JIS); I don't know if it guesses or it has a proper solution to detect the encoding of the file.
BTW, does anyone know of a tool that converts Japanese files (or any files) to Unicode and can also convert a whole folder?
Thanks.