Any clever trick to get strtok() to handle a null token

Article
2010-04-12

Question

_{Monday, April 12, 2010 11:40 PM}

I've got a file of lines of the format

SomeKey: "somedata"
OtherKey: "otherdata"
etc.

I parse it using strtok(), first using delimiters of blank and colon to get the key, and then twice with a delimiter of " to get the data.

It fails or at least does not behave well (as is well documented) on Key: ""

Is there any trick to get strtok() to return a definite indicator of a zero-length token? I don't need it to return a "proper" null string; I'm just looking for some sequence of calls that would let me recognize a null string definitively.

Thanks in advance,

Charles

All replies (15)

_{Tuesday, April 13, 2010 4:19 AM ✅Answered | 1 vote}

tsrCharles -

You might be able to make use of this:

Stptok.C Improved tokenizing function

from here:

http://cpp.snippets.org/code/

From the code comments:

"stptok() -- public domain by Ray Gardner,
modified by Bob Stout

You pass this function a string to parse,
a buffer to receive the "token" that gets
scanned, the length of the buffer, and a
string of "break" characters that stop the
scan. It will copy the string into the
buffer up to any of the break characters,
or until the buffer is full, and will always
leave the buffer null-terminated. It will
return a pointer to the first non-breaking
character after the one that stopped the scan."

It *may* give you a zero-length string in the
receiving buffer when you have consecutive
delimiters (untested). Or you may be able
to tweak the code to achieve the desired
outcome.

Wayne

_{Monday, April 12, 2010 11:52 PM}

I took the following example from MSDN and modified it a bit -

#include <string.h>
#include <stdio.h>

// Sample input (assumed): one line in the format from the question
char string[] = "SomeKey: \"somedata\"";
char seps[]   = " :\"";
char *token;

int main( void )
{
   printf( "Tokens:\n" );

   // Establish string and get the first token:
   token = strtok( string, seps ); // C4996
   // Note: strtok is deprecated; consider using strtok_s instead
   while( token != NULL )
   {
      // While there are tokens in "string"
      printf( " %s\n", token );

      // Get next token:
      token = strtok( NULL, seps ); // C4996
   }
   return 0;
}

See if this works.

«_Superman_»
Microsoft MVP (Visual C++)

_{Monday, April 12, 2010 11:54 PM | 2 votes}

Check out boost::tokenizer<>. It handles your situation quite elegantly.

_{Tuesday, April 13, 2010 12:02 AM}

First of all, use strtok_s() instead as strtok() is deprecated.

Second, I don't see the problem. strtok() and strtok_s() simply ignores any leading tokens. Since you have two consecutive tokens ("") just before the end of the string, the token is null. Can't you just say:

if (token == NULL)
{
    //This means an empty string after the token name.
}

It looks simple enough to me. Or am I missing something?

MCP

_{Tuesday, April 13, 2010 12:29 AM}

@Superman and @webJose, I'm actually using wcstok_s() on wide characters; I just simplified the story a little.

The loop from MSDN works, but, as documented, when it encounters a "" sequence it returns a pointer to the first character following the second quote. There are typically blanks following the second quote, so the resulting string is something like " \n" -- not great as a proxy for a null token.

I'd really like to be able to handle multiple Key: "data" sequences on a single line; that's out of the question unless I can recognize the "" reliably. Otherwise Key1: "" Key2: "foo" will return < Key2: > as the first data token (I think -- have not tried yet.)

boost::tokenizer looks interesting. I'd just as soon avoid taking on a whole new component but if no one else has a better idea I may well take a look at it.

Charles

_{Tuesday, April 13, 2010 12:41 AM}

I just tried the loop and it seems to work okay.

I guess I'm missing something here.

«_Superman_»
Microsoft MVP (Visual C++)

_{Tuesday, April 13, 2010 12:43 AM}

Don't remember right now, but I believe strtok_s() will replace the tokens with NULL. If "" is the only possible 2-token sequence, before deciding anything about the key value, see if token[-1] and token[-2] are both null (of course, be careful not to read memory not allocated, etc. etc.). If they are both null, then you found an empty string before the token being returned.

MCP

_{Tuesday, April 13, 2010 12:45 AM}

If you give it a string of <Foo: "" \n> your first token will be <Foo>, your second will be < > (no problem, discard it) and your third will be < \n> -- not a great proxy for a null token.

if you give it a string of <Foo: "" Bar: "whatever"> your third token will be < Bar: > which is totally incorrect.

Charles
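The behavior described above can be reproduced with a minimal sketch. The helper name split_pair is hypothetical; it only runs the call sequence from the question (blank and colon for the key, then the quote delimiter twice) on one line.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: runs the question's call sequence on one line.
   'line' is modified in place, as strtok always does. */
static void split_pair(char *line, char **key, char **gap, char **value)
{
    assert(line != NULL && key != NULL && gap != NULL && value != NULL);

    *key   = strtok(line, " :");  /* key: blank and colon as delimiters */
    *gap   = strtok(NULL, "\"");  /* text in front of the opening quote */
    *value = strtok(NULL, "\"");  /* intended to be the quoted data */
}
```

Given <Foo: "" Bar: "whatever">, the third call skips the second quote as a leading delimiter, so value comes back as < Bar: > rather than an empty string.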

_{Tuesday, April 13, 2010 1:11 AM}

This was exactly my point. The boost::char_separator<> class is used in conjunction with boost::tokenizer<> to deal with (that is, to identify) the situation when two adjacent delimiters are encountered. strtok() essentially treats them as a single delimiter.

_{Tuesday, April 13, 2010 1:25 AM}

Overly simplified, I'm saying this:

            token = strtok_s(val, seps, &nextTkn);  // seps: delimiter string; nextTkn: char * context
            if (!(token[-1] || token[-2]))
            {
                //Two consecutive tokens.
                //Variable "token" to be used as key for next pair.
            }
            else
            {
                //Variable "token" to be used as the pair's value.
            }

Sounds good to me, even in the example that you pose. I am not discarding the token. I must use it to move on to the next value pair. In other words, if the condition in the IF above is true, then do two things: Assign an empty string as the value of the current value pair, and use token as the value name of the next pair.

MCP

_{Tuesday, April 13, 2010 3:57 AM}

>strtok() and strtok_s() simply ignores any
>leading tokens.

... leading delimiters/separators.

>Since you have two consecutive tokens ("")

... consecutive delimiters/separators.

>Don't remember right now, but I believe strtok_s()
>will replace the tokens with NULL ...

... will replace the delimiters/separators with
a nul character

>If "" is the only possible 2-token sequence ...

... 2-delimiter/separator sequence

A word about terminology:

"token" refers to the substring(s) which strtok()
parses a string into.

"delimiters" or "separators" refers to the
characters used to separate the tokens (substrings).

http://www.cplusplus.com/reference/clibrary/cstring/strtok/

http://www.dinkumware.com/manuals/default.aspx?manual=compleat&page=string.html#strtok

NULL is defined as a null pointer . It is the nul
character with which strtok() replaces delimiters.

http://c-faq.com/null/nullor0.html

Wayne
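The distinction can be seen by inspecting the buffer after two calls. This is a small sketch, not library code; second_field is a hypothetical helper that assumes a comma delimiter.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: tokenizes buf on ',' and returns the second token,
   or NULL (a null pointer) if there is none. Afterwards the caller can
   inspect buf to see which bytes strtok overwrote. */
static char *second_field(char *buf)
{
    assert(buf != NULL);

    if (strtok(buf, ",") == NULL)
        return NULL;
    return strtok(NULL, ",");
}
```

For a buffer holding "a,,b", the first comma (buf[1]) is overwritten with the nul character '\0', the second comma (buf[2]) is left untouched, and the returned pointer is buf + 3.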

_{Tuesday, April 13, 2010 7:06 PM}

Thanks, Wayne. That looks like a possibility. It looks like a "smaller" solution than boost. (This work is for a client and I don't want to have to "explain" boost.) With the source code at least I may be able to tweak it.

Charles

_{Tuesday, April 13, 2010 9:13 PM | 1 vote}

>With the source code at least I may be
>able to tweak it.

It looks like it will skip consecutive break
characters, like strtok does. However, if you
remove the inner loop:

if (*s == *b)
{
// ++s;
// b = brk;
}

I think it will give a zero-length string in
the buffer when a break character is followed
immediately by another break character. So given
a string of "123,456,,789" and a break character
of "," it will yield four tokens in the buffer:

"123" (length 3)
"456" (length 3)
"" (length 0)
"789" (length 3)

(This is based on a cursory examination of the
code, so vet it carefully.)

I leave it to you to massage the code as needed
for wide characters.

Wayne
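For comparison, here is a minimal sketch of a scanner that behaves the way the modified version is expected to. It is not the snippets.org source: copy_token and its parameters are hypothetical, and unlike the description quoted earlier it keeps scanning (and drops the excess) when the buffer is full.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies characters from s into tok until a break
   character or the end of s, always nul-terminating tok. Returns a pointer
   just past the single break character that stopped the scan, or NULL when
   s is exhausted. Characters that do not fit in tok are dropped. */
static const char *copy_token(const char *s, char *tok, size_t toklen,
                              const char *brk)
{
    size_t n = 0;

    assert(tok != NULL && toklen > 0 && brk != NULL);
    tok[0] = '\0';
    if (s == NULL)
        return NULL;

    while (*s != '\0' && strchr(brk, *s) == NULL) {
        if (n + 1 < toklen)      /* keep room for the terminator */
            tok[n++] = *s;
        ++s;
    }
    tok[n] = '\0';

    /* Step over exactly one break character, so "x,,y" yields "" between. */
    return (*s != '\0') ? s + 1 : NULL;
}
```

Calling it repeatedly on "123,456,,789" with a break string of "," fills the buffer with "123", "456", "" and "789" in turn; the NULL return arrives together with the last token, so the caller should use the buffer before testing the return value.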

_{Wednesday, April 14, 2010 6:50 PM}

I haven't had a chance to implement the http://cpp.snippets.org/code/ solution but I am going to go ahead and mark it as the answer. I will probably try to make it into a template so it can handle any kind of "character" -- seems better than just changing every reference to char to WCHAR.

If I get a chance I will report back.

With regard to some of the debate above, yes, the things like commas and quotes are "separators." The meaningful stuff in between are tokens. The inability to handle null tokens is a well-documented behavior of strtok() -- it's not me going "hey, the C library doesn't work!" Strtok() strips leading delimiters on each call, so if you are pulling off quote delimited tokens as I am, then call "n" leaves you pointing to the first character after the leading quote, and if call "n+1" sees a quote as the next character, it considers it a leading delimiter and strips it off, leaving you with the stuff after your null token as the returned value. That's just the way it works.

Thanks all.

Charles
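Since the leading-delimiter stripping is built into strtok(), another option is to skip it for the quoted part and locate the quotes with strchr(). The following is only a sketch under stated assumptions: next_pair is a hypothetical helper, keys contain no colon, and values contain no embedded or escaped quotes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reads one  Key: "value"  pair starting at p.
   Copies the key and value into the caller's buffers and returns a pointer
   just past the closing quote, or NULL if no well-formed pair is found. */
static const char *next_pair(const char *p, char *key, size_t keylen,
                             char *val, size_t vallen)
{
    const char *colon, *open, *close;
    size_t n;

    assert(p != NULL && key != NULL && val != NULL);

    p += strspn(p, " \t\r\n");          /* skip leading blanks */
    colon = strchr(p, ':');
    if (colon == NULL) return NULL;
    open = strchr(colon + 1, '"');
    if (open == NULL) return NULL;
    close = strchr(open + 1, '"');
    if (close == NULL) return NULL;

    n = (size_t)(colon - p);
    if (n >= keylen) return NULL;       /* key does not fit */
    memcpy(key, p, n);
    key[n] = '\0';

    n = (size_t)(close - open - 1);     /* 0 when the value is "" */
    if (n >= vallen) return NULL;       /* value does not fit */
    memcpy(val, open + 1, n);
    val[n] = '\0';

    return close + 1;
}
```

Because the value length is computed from the two quote positions, <Key1: "" Key2: "foo"> yields Key1 with a zero-length value and then Key2 with "foo"; the input string is never modified. A wide-character variant would swap in the wcs* equivalents.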

_{Friday, June 17, 2011 11:29 PM}

I just wanted to add my contribution to this discussion in case anyone is looking for a quick fix to this issue:

webJose's suggestion is a quick hacky fix that I decided to run with because my file would only give me the issue of two consecutive delimiters once per line in a CSV file. You could easily extrapolate this technique to loop and search for any number of consecutive delimiters that appeared before the current token but I'll leave that to you :) The problem I encountered was that only the FIRST of these delimiters is actually changed to the null character '\0'. The second remains as the delimiter and strtok just returns a pointer to the next non-delimiter character in the string. Here's my code in case you're interested (just don't try to copyright it :P):

#include <string.h>

// Parses up to n data that are separated by any of the characters in string delim
// and stores them in the caller-allocated array 'data'. Returns number of words stored.
int parse_data(char * str,int n,char ** data,const char * delim = ",") {
char * ptr = strtok(str,delim);

int i = 0;
while (i<n && ptr != NULL) {
    // Hack to fix bug where ONE blank datum entry shifts all data over.
    // strtok leaves the second of two consecutive commas in place.
    if (i>0 && ptr[-1] == ',' && ptr[-2] == '\0') {
      data[i++][0] = '\0';
      if (i >= n) break;     // no room left for the token itself
    }
    strcpy(data[i++],ptr);   // strcpy, not sprintf: ptr is data, not a format string
    ptr = strtok(NULL,delim);
}
return i;
}
