Converting

Converting Strings to Numbers

C offers a few means of converting a sequence of bytes to numbers. The simplest way is to use one of the scanf functions.

There are three main scanf functions:

    int
    scanf(const char * restrict format, ...);
    int
    fscanf(FILE * restrict stream, const char * restrict format, ...);
    int
    sscanf(const char * restrict str, const char * restrict format, ...);

There are variable-argument versions of these functions as well (vscanf, vfscanf, and vsscanf, which take a va_list), but they are beyond the scope of this article. All three functions are used the same way; the only difference is where each gets its input: scanf reads from stdin, fscanf from a stream, and sscanf from a string. scanf uses mostly the same format specifiers as printf, with one notable difference: scanf needs %lf to read a double, because %f reads a float. So to get an int, float, and double you could do:

    scanf("%d", &myint);
    scanf("%f", &myfloat);
    scanf("%lf", &mydouble);

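The scanf functions have various problems, however. They generally do not handle invalid input in a very useful way: the return value tells you how many items were assigned, but not why a conversion stopped, and the behavior is undefined if a converted number does not fit in the target type. People tend to avoid the scanf functions because they cause many other issues (again outside the scope of this article).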
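The one check that is always available is the return value, which is the number of input items successfully assigned. Here is a minimal sketch using sscanf; the helper name parse_int_pair is hypothetical and only serves as illustration:

```c
#include <stdio.h>

/*
 * Hypothetical helper: read two ints from a string.
 * Returns 1 only if sscanf assigned both values, 0 otherwise.
 */
int parse_int_pair(const char *s, int *a, int *b)
{
    /* sscanf returns the number of items assigned, or EOF */
    return sscanf(s, "%d %d", a, b) == 2;
}
```

Even with this check, input such as "12 34xyz" is accepted, because sscanf simply stops at the first character it cannot use.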


Another means of converting a string to a number is by using atoi, atol, atoll, or atof. The name atof is perhaps misleading, since it converts a string to a double, not a float. The prototypes are:

    int
    atoi(const char *nptr);
    long int
    atol(const char *nptr);
    long long int
    atoll(const char *nptr);
    double
    atof(const char *nptr);

Example usage:

    int x = atoi("666");
    long y = atol("42000");
    long long w = atoll("4300000000");
    double z = atof("12.5");

These functions look nice, but have a terrible issue with error detection. If the string does not contain a valid number, they have no way to tell you so. Since these functions evaluate to an int, long, long long, or double, you will always get a valid integer or double, but you will not know whether it is correct. Most atoi implementations that I have seen return 0 on error, but 0 is still a valid number, so you cannot be sure that 0 is really the number represented by the string. Given the input "zfefefe", if atoi converts it to 0 then you have no idea that it is not actually 0, and you have to do extra work to be sure it is valid.

We run into more problems if we have an actual number that is outside the range of an int on your particular machine, such as "123123213213213413211312". The behavior of atoi is undefined in this case. atoi also does not allow you to specify which base you are working with, which can be very limiting. Wouldn't it be nice if C had standard functions that solved both of these problems, letting us know when an error has occurred in processing the string and letting us specify a base? Thankfully, standard C does.


These miracle functions are:

    long
    strtol(const char * restrict nptr, char ** restrict endptr, int base);
    long long
    strtoll(const char * restrict nptr, char ** restrict endptr, int base);
    unsigned long
    strtoul(const char * restrict nptr, char ** restrict endptr, int base);
    unsigned long long
    strtoull(const char * restrict nptr, char ** restrict endptr, int base);
    double
    strtod(const char * restrict nptr, char ** restrict endptr);
    float
    strtof(const char * restrict nptr, char ** restrict endptr);
    long double
    strtold(const char * restrict nptr, char ** restrict endptr);

Just from looking at the prototypes, these functions appear to offer more control over processing strings. The first thing to note is the second parameter of all of these functions. endptr is used for error detection. People often make the assumption that endptr is used to tell the function where to stop processing the string. This is not true. endptr is an output: the function stores in *endptr the location of the first character it did not convert. So if you pass "123zzfee" to strtol and are converting in base 10, *endptr will point to the first 'z'.

endptr becomes crucial when deciding whether the return value of the function indicates an error. On invalid input which is not an overflow (covered shortly), the functions return 0 and set *endptr to nptr, the start of the string, showing that nothing was converted. If the string is actually "0", then *endptr will point to the end of the string, which is the null character '\0', telling you that the function processed the entire string and its actual value is 0. Standard C does not set errno in the no-conversion case; POSIX allows an implementation to set errno to EINVAL there, so portable code should test endptr rather than rely on errno.
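The three cases that endptr lets you tell apart can be sketched as follows. The helper name classify_long and its return codes are hypothetical, and range errors are deliberately ignored here:

```c
#include <stdlib.h>

/*
 * Hypothetical helper: returns
 *   0 if no digits were converted (*out is left untouched),
 *   1 if the whole string was a number,
 *   2 if a number was converted but other characters follow.
 * Range errors are deliberately not checked in this sketch.
 */
int classify_long(const char *s, long *out)
{
    char *end;
    long n = strtol(s, &end, 10);

    if (end == s)
        return 0;          /* nothing converted: not a number */
    *out = n;
    return *end == '\0' ? 1 : 2;
}
```

With this helper, "zfefefe" gives 0, "0" gives 1 with a value of 0, and "123zzfee" gives 2 with a value of 123, so a real zero is no longer confused with garbage.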

Another difference in these functions is how they handle numbers which are valid, but outside the range of what a long/long long/double/float/long double can represent. When the string is an integer which is too large or too small to be represented in the corresponding datatype, then the _MAX or _MIN value of the corresponding datatype is returned. (On overflow, the floating-point functions return plus or minus HUGE_VAL, HUGE_VALF, or HUGE_VALL instead.) The C99 final draft says:

 The strtol, strtoll, strtoul, and strtoull functions return the converted value, if any.  
 If no conversion could be performed, zero is returned. If the correct value is outside the range 
 of representable values, LONG_MIN, LONG_MAX, LLONG_MIN, LLONG_MAX, ULONG_MAX, or ULLONG_MAX is
 returned (according to the return type and sign of the value, if any), and the value of the 
 macro ERANGE is stored in errno. 


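This means that if a number is too large to fit in the datatype being converted to, the function returns *_MAX; if it is too small, the function returns *_MIN (the unsigned variants have no *_MIN and return *_MAX). In both cases errno is set to ERANGE to tell you that there really was an error. Because a successful call leaves errno untouched, set errno to 0 before the call.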
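A minimal sketch of the range check follows. The helper name long_in_range is hypothetical; it only looks at ERANGE and ignores endptr for brevity:

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 1 if s converts to a long without a
 * range error, 0 if strtol reported ERANGE. *out receives the return
 * value of strtol either way (LONG_MAX or LONG_MIN on a range error).
 */
int long_in_range(const char *s, long *out)
{
    errno = 0;             /* strtol never resets errno to 0 itself */
    *out = strtol(s, NULL, 10);
    return errno != ERANGE;
}
```

Clearing errno first matters because LONG_MAX and LONG_MIN are also valid results, so the return value alone cannot signal the error.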

These functions offer a more robust means of converting a string to some sort of numeric type, and are generally recommended over atoi/scanf, even if they appear to be more complex.

  • Note that the unsigned variants like strtoul() will accept a negative number as input, without signaling an error. The value is negated in the unsigned return type, so "-1" yields ULONG_MAX.
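If negative input should be rejected, the check has to be done by hand. One possible sketch is shown below; parse_ulong_strict is a hypothetical helper, not a standard function:

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: like strtoul in base 10, but rejects input
 * with a leading minus sign, trailing characters, or a range error.
 * Returns 1 on success, 0 on failure (*out is left untouched).
 */
int parse_ulong_strict(const char *s, unsigned long *out)
{
    char *end;
    unsigned long n;

    while (isspace((unsigned char)*s))
        s++;                       /* strtoul would skip this anyway */
    if (*s == '-')
        return 0;                  /* strtoul would accept and negate it */

    errno = 0;
    n = strtoul(s, &end, 10);
    if (end == s || *end != '\0' || errno == ERANGE)
        return 0;
    *out = n;
    return 1;
}
```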


test_strtol

Here's an example which illustrates all possible return conditions of strtol:

   #include <stdlib.h>
   #include <stdio.h>
   #include <limits.h>
   #include <errno.h>
   #include <assert.h>
   
   void test_strtol(const char *input)
   {
       long n;
       char *endptr;
       int err;
       
       errno = 0;
       n = strtol(input, &endptr, 10);
       err = errno;    /* save it: the printf below may change errno */
       
       printf("strtol(\"%s\", &endptr, 10) returns %ld with errno == %d\n",
           input, n, err);
       
       switch (err) {
       
       case ERANGE:
           switch (n) {
           case LONG_MIN:
               puts("UNDERFLOW");
               break;
           case LONG_MAX:
               puts("OVERFLOW");
               break;
           default:
               assert(!"ERANGE with invalid return value!?");
               break;
           }
           break;
       
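       default:
           /*
            * Standard C only sets ERANGE here, but POSIX permits
            * EINVAL when no conversion is performed.
            */
           puts("Unexpected errno; input treated as invalid");
           break;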
       
       case 0:
           if (endptr == input)
               puts("Invalid input; no conversion took place");
           else if (*endptr == '\0')
               puts("Input is complete; successful conversion");
           else
               printf("Successful conversion with possible garbage at end of input: >>%s<<\n", endptr);
           break;
       }
   }
   
   int
   main(int argc, char **argv)
   {
       while (*++argv)
           test_strtol(*argv);
       return 0;
   }


Converting sequences of numbers using strtol()

The "end-pointer" argument of strtol() can also be used to parse a list of numerals. After each conversion it points just past the characters that were consumed, so the next call can resume from there.

   #include <stdlib.h>
   #include <ctype.h>

   size_t
   parselist(int a[], size_t max, const char *line)
   {
       size_t n;
       const char *p;
       char *e;
       for (n = 0, p = line; n < max && *p; p = e) {
           int x = strtol(p, &e, 0);
           if (e == p) {
               /*
                * e == p means that strtol did not find anything
                * to convert.   Here, we handle this error
                * condition by ignoring it and bumping the
                * pointer along past the next non-whitespace
                * character.  A "real" program might do something
                * entirely different.
                */
               /* the cast avoids undefined behavior for negative char values */
               while (isspace((unsigned char)*e)) e++;
               if (*e != '\0') e++;
           }
           else
               a[n++] = x;
       }
       return n;
   }

   #include <stdio.h>
   #include <assert.h>

   int
   main(int argc, char **argv)
   {
       size_t i, n;
       int x[40];
       assert(argv[1] != 0);
       n = parselist(x, sizeof x / sizeof *x, argv[1]);
       for (i = 0; i < n; i++)
           printf("%d, ", x[i]);
       putchar('\n');
       return EXIT_SUCCESS;
   }

Converting Numbers to Strings

After seeing atoi and strtol, people often expect ltostr or itoa in order to go in the other direction. They are surprised to find that these functions are not provided. Note that while itoa does exist on some systems, it is not part of standard C, so portable code should not use it.

So how does one convert a number to a string? The answer is to use the printf family of functions.

    int
    printf(const char * restrict format, ...);
    int
    fprintf(FILE * restrict stream, const char * restrict format, ...);
    int
    sprintf(char * restrict str, const char * restrict format, ...);
    int
    snprintf(char * restrict str, size_t size, const char * restrict format, ...);

You should already know how to use printf and fprintf, while sprintf and snprintf may be new to you. These work exactly like printf and fprintf, except that they write to a string rather than a FILE *. People often complain that the formatted output functions are too much overhead just to convert a number to a string. However, unless converting numbers to strings is the main part of your program, processing format strings is generally not all that much overhead. And since people usually want to include text other than just numbers in their output, formatted output is often what you need anyway. For example:

 sprintf(buffer, "Your age in 1000 years will be %d.", age + 1000);

In most cases you should not use sprintf at all, but one of the alternatives such as snprintf(). sprintf has no way to know how large the destination buffer is, so a function that takes a size argument helps prevent buffer overflows.
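As a minimal sketch of the safer pattern, the hypothetical helper int_to_string below uses the return value of snprintf to detect truncation:

```c
#include <stdio.h>

/*
 * Hypothetical helper: format an int into buf (of size bytes).
 * Returns 1 if the whole number fit, 0 if it was truncated or
 * an encoding error occurred.
 */
int int_to_string(char *buf, size_t size, int value)
{
    int len = snprintf(buf, size, "%d", value);

    /* snprintf returns the length the full output would have had */
    return len >= 0 && (size_t)len < size;
}
```

When the buffer is too small, snprintf still null-terminates what it wrote (for a nonzero size), so the caller gets a shortened string rather than an overflow.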

The standard printf family of functions does have a minor flaw. Although it is usually not a problem, these functions are limited to converting integer values to decimal, hexadecimal, or octal representations; conversions to arbitrary bases are not possible.
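If another base is really needed, the conversion has to be written by hand. The following is one possible sketch; ulong_to_base is a hypothetical helper, and the choice of lowercase letters for digits above 9 is an assumption made here:

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: write value in the given base (2..36) into buf.
 * Returns buf on success, or NULL if base is out of range or buf is
 * too small. Digits above 9 are written as lowercase letters.
 */
char *ulong_to_base(unsigned long value, unsigned base, char *buf, size_t size)
{
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    char tmp[sizeof(unsigned long) * CHAR_BIT];  /* enough for base 2 */
    size_t i = 0, j;

    if (base < 2 || base > 36)
        return NULL;

    /* Generate digits, least significant first */
    do {
        tmp[i++] = digits[value % base];
        value /= base;
    } while (value != 0);

    if (i + 1 > size)              /* digits plus terminating '\0' */
        return NULL;

    /* Reverse into the caller's buffer */
    for (j = 0; j < i; j++)
        buf[j] = tmp[i - 1 - j];
    buf[i] = '\0';
    return buf;
}
```

Since strtoul accepts bases 2 through 36, a string produced this way can be read back with strtoul(buf, NULL, base).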

Converting to and from Octet Sequences (binary encodings)

Many times, integer values are encoded as byte (octet) sequences. Some people refer to this as "binary" data. The easiest way to think about this is that the number is represented in base 256 (bicentisexquinquagesimal), where each octet is a digit in the sequence.

Here is the same value represented in base 10, base 16, and base 256:

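$123456789_{10} = 1\times10^{8} + 2\times10^{7} + 3\times10^{6} + 4\times10^{5} + 5\times10^{4} + 6\times10^{3} + 7\times10^{2} + 8\times10^{1} + 9\times10^{0}$

$75\mathrm{BCD}15_{16} = 7\times16^{6} + 5\times16^{5} + 11\times16^{4} + 12\times16^{3} + 13\times16^{2} + 1\times16^{1} + 5\times16^{0}$

$[7][91][205][21]_{256} = 7\times256^{3} + 91\times256^{2} + 205\times256^{1} + 21\times256^{0}$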

Note that since the "size" of each digit in base 256 is one octet, the last example may be encoded as an octet sequence:

    unsigned char encoding[4] = { 7, 91, 205, 21 };
    unsigned char encoding[4] = { 0x07, 0x5b, 0xcd, 0x15 };
    unsigned char encoding[4] = { 07, 0133, 0315, 025 };

Note that these three definitions are equivalent. Each base-256 digit is itself represented in source as either decimal, hexadecimal, or octal.
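Converting between such an octet sequence and an integer is the usual positional arithmetic, with 256 as the base. Below is a minimal sketch; both helper names are hypothetical, and it assumes the most significant octet comes first, as in the example above:

```c
#include <stddef.h>

/*
 * Hypothetical helpers, assuming the most significant octet comes
 * first and that len does not exceed sizeof(unsigned long).
 */

/* Combine len octets into one value: each step is "times 256, plus digit" */
unsigned long octets_to_ulong(const unsigned char *p, size_t len)
{
    unsigned long value = 0;
    size_t i;

    for (i = 0; i < len; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Split the low 32 bits of value into four octets, most significant first */
void ulong_to_octets(unsigned long value, unsigned char out[4])
{
    out[0] = (unsigned char)((value >> 24) & 0xFF);
    out[1] = (unsigned char)((value >> 16) & 0xFF);
    out[2] = (unsigned char)((value >> 8) & 0xFF);
    out[3] = (unsigned char)(value & 0xFF);
}
```

Because the helpers work with shifts on the value rather than by copying memory, the result does not depend on how the machine itself stores integers.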

