Question
Tuesday, October 16, 2007 7:43 PM
Hi all,
I am trying to learn everything related to TCHARs so I can write programs that can be compiled in Unicode if need be. I understand the differences between narrow and wide characters, char and wchar_t (and the generic TCHAR), the L prefix and the generic _T() macro, and how to use preprocessor defines to tell the compiler which to use.
However, I haven't gotten very far in development and have already run into a problem. I would like to know the number of characters in a TCHAR array or TCHAR*. Unfortunately, every length function I can find (_tcslen(), even wstring::length()) seems to be returning the number of BYTES, not characters, which is no good for the string manipulation I need to do.
I've looked through a lot of material that talks about using _tcslen() instead of strlen(), but none of it addresses this question. Can anyone help a newbie out -- how do I get the number of characters in a Unicode string?
TIA for any advice.
All replies (4)
Wednesday, October 17, 2007 10:22 AM ✅ Answered
Viorel is right about _tcslen returning count in TCHARs.
If you want to count the Unicode code points, i.e. the Unicode "characters", be aware that in UTF-16 some code points are made up of two 16-bit values (so two WCHARs): these are called "surrogate pairs". A surrogate pair consists of two 16-bit values, but it actually represents one code point, i.e. one Unicode "character".
I believe that both _tcslen and CString::GetLength return the count in TCHARs, without accounting for surrogate pairs.
I don't know whether the C++ library or ATL/MFC has a method like Java's String.codePointCount, which would return the number of code points (i.e. characters), properly accounting for surrogate pairs:
http://java.sun.com/mailers/techtips/corejava/2006/tt0822.html
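For reference, counting code points in a UTF-16 string can be sketched by hand in portable C++. This is only an illustration, not a library facility: count_code_points is a hypothetical helper name, and the sketch uses char16_t/std::u16string so that each element is a 16-bit unit regardless of platform (on Windows, wchar_t plays that role). An unpaired surrogate is simply counted as one code point here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any library): counts Unicode code points
// in a UTF-16 string, treating each valid surrogate pair as one code point.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool is_high = (c >= 0xD800 && c <= 0xDBFF);
        // A high surrogate followed by a low surrogate forms one code point,
        // so skip the second half of the pair.
        if (is_high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;
        }
        ++count;  // lone surrogates are still counted once each
    }
    return count;
}
```

With this sketch, a string holding the letter "a" followed by U+1D11E (a character outside the Basic Multilingual Plane) has size() == 3 but two code points.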
Giovanni
Wednesday, October 17, 2007 6:26 AM
In that case simply divide the return value from _tcslen by sizeof(TCHAR).
You can even make a macro for that like this:
#define mystrlen(a) (_tcslen(a) / sizeof(TCHAR))
But be aware that if your Unicode string contains characters that consist of two 16-bit values (I don't know the proper terminology for this), you will get errors, because those characters will be counted twice.
Wednesday, October 17, 2007 6:56 AM
In reply to Abtakha:
According to the documentation, functions like _tcslen, lstrlen and others, applied to TCHAR strings, return the length in characters, not in bytes.
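A small sketch of that distinction, assuming a Unicode build in which _tcslen maps to wcslen. To keep it portable it calls std::wcslen directly; length_in_chars and size_in_bytes are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper: length in wchar_t elements, which is what wcslen
// reports. No division by sizeof is needed.
std::size_t length_in_chars(const wchar_t* s) {
    return std::wcslen(s);
}

// Hypothetical helper: storage size in bytes, excluding the terminator.
// Going from characters to bytes is a multiplication by the element size.
std::size_t size_in_bytes(const wchar_t* s) {
    return std::wcslen(s) * sizeof(wchar_t);
}
```

For L"hello" the first function gives 5 and the second gives 5 * sizeof(wchar_t), so dividing the result of the length function by the element size would undercount.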
Thursday, October 18, 2007 12:56 AM
Thank you all for replying! I was referred elsewhere to the function StringCchLength(), which seems to do the trick.
For others' reference: use the above function! Whatever _tcslen() and basic_string<>::length() return, a Japanese character counts as three and a Roman character counts as one. I'm not sure whether it's surrogate pairs or something else that causes the confusion.
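One possible explanation for a count of three, which the thread does not confirm, is that the text was held in a narrow char string encoded as UTF-8, where many Japanese characters occupy three bytes. The sketch below assumes that scenario; the constant and function names are hypothetical, and the character U+65E5 is written with explicit byte escapes so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// U+65E5 encoded as UTF-8 in a narrow string: three bytes (E6 97 A5).
const char* const kNarrowUtf8 = "\xE6\x97\xA5";

// The same character in UTF-16: a single 16-bit unit.
const std::u16string kWideUtf16 = u"\u65E5";

// strlen counts char elements, i.e. bytes of the UTF-8 encoding.
std::size_t narrow_units() { return std::strlen(kNarrowUtf8); }

// size() counts char16_t elements of the UTF-16 encoding.
std::size_t wide_units() { return kWideUtf16.size(); }
```

Under that assumption the narrow string reports 3 and the UTF-16 string reports 1, which matches the "Japanese counts as three, Roman counts as one" observation without involving surrogate pairs.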
PS. Giovanni -- thanks for passing along that link. That will come in handy when I work in Java.