

Getting the number of characters in a TCHAR string

Question

Tuesday, October 16, 2007 7:43 PM

Hi all,

I am trying to learn everything related to TCHARs so I can write programs that can be compiled as Unicode if need be. I understand the differences between narrow and wide characters, char and wchar_t (and the generic TCHAR), the L prefix and the generic _T() macro, and how to use preprocessor defines to tell the compiler which to use.

However, I haven't gotten very far in development and have already run into a problem. I would like to know the number of characters in a TCHAR array or TCHAR*. Unfortunately, every length function I can find (_tcslen(), even wstring::length()) seems to be returning the number of BYTES, not characters, which is no good for the string manipulation I need to do.

I've looked through a lot of material that talks about using _tcslen() instead of strlen(), but none of it addresses this question. Can anyone help a newbie out -- how do I get the number of characters in a Unicode string?

TIA for any advice.

All replies (4)

Wednesday, October 17, 2007 10:22 AM ✅ Answered

Viorel is right about _tcslen returning count in TCHARs.

 

If you want to count the Unicode code points, i.e. the Unicode "characters", you must be aware that in UTF-16 some code points are encoded as two 16-bit values (so two WCHARs), called a "surrogate pair". A surrogate pair occupies two 16-bit values, but it represents a single code point, i.e. one Unicode "character".

 

I believe that both _tcslen and CString::GetLength return the count in TCHARs, so neither accounts for surrogate pairs: each pair is counted as two.

 

I don't know if the C++ library or ATL/MFC have a method like Java's String.codePointCount, which returns the number of code points (i.e. characters), properly considering surrogate pairs:

 

http://java.sun.com/mailers/techtips/corejava/2006/tt0822.html
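For illustration, here is a minimal sketch of what such a count could look like in C++. CountCodePoints is a hypothetical helper, not a CRT, ATL, or MFC function. It uses std::u16string so that it compiles anywhere. On Windows, where wchar_t is 16 bits wide, the same loop should apply to a std::wstring. Like codePointCount, it counts an unpaired surrogate as one code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the CRT, ATL, or MFC): counts Unicode
// code points in a UTF-16 string. A high surrogate followed by a low
// surrogate is counted once; an unpaired surrogate also counts as one.
std::size_t CountCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t c = s[i];
        const bool isHigh = (c >= 0xD800 && c <= 0xDBFF);
        if (isHigh && i + 1 < s.size())
        {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
                ++i; // skip the low half of the pair
        }
        ++count;
    }
    return count;
}
```

For example, the string u"a\U0001D11Eb" holds four 16-bit units (size() returns 4), but CountCodePoints reports 3, because U+1D11E needs a surrogate pair.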

 

Giovanni

 


Wednesday, October 17, 2007 6:26 AM

In that case simply divide the return value from _tcslen by sizeof(TCHAR).

You can even make a macro for that like this:

#define mystrlen(a) (_tcslen(a) / sizeof(TCHAR))

 

But be aware that if your Unicode string contains characters that consist of two 16-bit values (I don't know the proper terminology for this), you will get wrong results, because those characters will be counted twice.


Wednesday, October 17, 2007 6:56 AM

 Abtakha wrote:
[...] I would like to know the number of characters in a TCHAR array or TCHAR*. Unfortunately, every length function I can find (_tcslen(), even wstring::length()) seems to be returning the number of BYTES, not characters, which is no good for the string manipulation I need to do. [...]

 

According to the documentation, functions like _tcslen, lstrlen and others, applied to TCHAR strings, return the length in characters, not in bytes.
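A quick way to check this is to compare the reported length with the byte size. The sketch below calls wcslen directly, which is what _tcslen maps to when _UNICODE is defined, so it compiles without <tchar.h>. LengthInChars and LengthInBytes are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helpers for illustration only.

// Number of wchar_t elements before the terminating null. This is the
// value _tcslen returns in a Unicode build.
std::size_t LengthInChars(const wchar_t* s)
{
    return std::wcslen(s);
}

// Storage used by those elements, in bytes (terminator not included).
std::size_t LengthInBytes(const wchar_t* s)
{
    return std::wcslen(s) * sizeof(wchar_t);
}
```

For L"hello", LengthInChars returns 5, while LengthInBytes returns 5 * sizeof(wchar_t), which is 10 on Windows. Dividing the result of _tcslen by sizeof(TCHAR) would therefore undercount.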

 


Thursday, October 18, 2007 12:56 AM

Thank you all for replying! I was referred elsewhere to the function StringCchLength(), which seems to do the trick.

For others' reference: use the above function! With _tcslen() and basic_string<>::length(), what I see is that a Japanese character counts as three and a Roman character counts as one. I'm not sure if it's surrogate pairs or something else that causes the confusion.

PS. Giovanni -- thanks for passing along that link. That will come in handy when I work in Java.