String functions in SAS

SAS offers plethora of functions for string manipulations. In this tutorial we shall learn about them in detail.

Let us firstly assign our LIBNAME:

libname mylib '/home/u50133827/My_datasets';

Concatenating strings

There are majorly 3 functions used for joining multiple string in SAS:

CAT: Concatenate strings without any separator, retains leading and trailing spaces.

CATS: Concatenate strings without any separator, removing leading and trailing spaces.

CATX: Concatenate strings with a pre-defined separator.

In the following code we are trying to concatenate Region and Product columns using various concatenation functions.

In Region column we have trailing spaces thus in CAT function these spaces are retained.

In CATX function our separator is "_" which is the first argument of the CATX function.

DATA MYLIB.SHOES(Keep= Region Product CAT_output CATS_output CATX_output);
SET SASHELP.SHOES;
Cat_output =  CAT(Region,Product);
Cats_output =  CATS(Region,Product);
Catx_output =  CATX("_",Region,Product);
RUN;

Converting strings to lowercase or uppercase

There are majorly 3 functions used for changing the lowercase or uppercase in SAS:

LOWCASE: Converts the strings to lowercase (all alphabets are small)

UPCASE: Converts the strings to uppercase(all alphabets are in capitals)

PROPCASE: Converts the strings to proper case(all alphabets are small but the first letter is in capital)

DATA MYLIB.SHOES(Keep= Region Lowcase_output Upcase_output Propcase_output);
SET SASHELP.SHOES;
Lowcase_output = LOWCASE(Region);
Upcase_output = UPCASE(Region);
Propcase_output = PROPCASE(Region);
RUN;

Subsetting and reversing a string

SUBSTR( string, start position, number of letters to retrieve): It retrieves the part of a substring. User needs to define the start position and number of letters needed.
REVERSE(string): It returns the reversed string

DATA _NULL_;
String1 = "Hello I am a newbie to learning SAS!";
Substr_output = Substr(String1,6,7);
Reverse_output = REVERSE(String1);
PUT String1;
PUT 'Substr output is:' Substr_output;
PUT 'Reverse output is:' Reverse_output;
RUN;

String replacement

In SAS to replace a phrase in the string with some other phrase we can use TRANWRD function:

TRANWRD(string, 'old string to be replaced', 'new string which will be replaced')

DATA _NULL_;
String1 = "This string will be used for TRANWRD function string";
Tranwrd_output = TRANWRD(String1,'string','CODE');

PUT String1;
PUT 'TRANWRD output is:' Tranwrd_output;
RUN;

Note that in the output 'string' has been replaced by 'CODE'

Fetching Nth word in a string

In SAS SCAN function is used to fetch Nth word.

SCAN(string,position of the word to be fetched, <delimiter>)

If we do not specify any delimiter then SAS considers a blank space as a default delimiter.

In the following code:

SCAN_1_OUTPUT: No delimiter is provided by the user.

SCAN_2_OUTPUT: Delimiter is comma

SCAN_3_OUTPUT: Multiple delimiters (comma and white space) are defined

SCAN_4_OUTPUT: SCAN with position as -1 returns the last word in the string.

DATA _NULL_;
String1 = "This string, will be used, for SCAN, function";

Scan_1_output = SCAN(String1,3);
Scan_2_output = SCAN(String1,3,",");
Scan_3_output = SCAN(String1,3, ", ");
Scan_4_output = SCAN(String1,-1);

PUT String1;
PUT 'Scan output (delimiter: white space) is:' Scan_1_output;
PUT 'Scan output (delimiter: comma) is:' Scan_2_output;
PUT 'Scan output (delimiter: multiple)is:' Scan_3_output;
PUT 'Scan output (delimiter: white space), pisition -1 is:' Scan_4_output;
RUN;

Finding the position of a pattern in the string

INDEX function is used to find the first occurence of a pattern in a string

INDEX(string,'pattern')

If the pattern appears multiple times (like INDEX_2_output in following code) then INDEX will return the position of first occurence.

DATA _NULL_;

String1 = "This string will be used for INDEX function string";
Index_output = INDEX(String1,'will');
Index_2_output = INDEX(String1,"string");

FIND_output = FIND(String1,'will');
FIND_2_output = FIND(String1, 'string',14);

PUT String1;
PUT 'Index output is:' Index_output;
PUT 'Index 2nd output (with repetition of the keyword) is:' FIND_output;
PUT 'Find output is:' Index_output;
PUT 'Find 2nd output (with repetition of the keyword) is:' FIND_2_output;

RUN;

FIND function also returns the position of the pattern but has an optional parameter of specifying the position.

FIND(string,pattern, <position to start seaching onwards>)

In FIND_2_output we are telling SAS to search for the pattern string in our string1 , provided that it should start the search from 14th character onwards.

Note that INDEX always returns 13 (i.e. the first occurence) while if you don't specify any third parameter in FIND then both FIND and INDEX will return the same output.

Counting the number of words or patterns in a string

COUNT function returns the number of times a particular pattern appears in the string.

COUNT(string, pattern whose occurrence is to be counted, "i")

Letter "i" as third parameter denotes that the pattern is not case sensitive.

COUNTW function counts the number of words in a string separated by a white space.

DATA _NULL_;

String1 = "This string will be used for COUNT function string";
Count_OUTPUT = COUNT(String1,'o');
Count_2_output = COUNT(String1,'o',"i");
CountW_output = COUNTW(String1);

PUT String1;
PUT 'Count output is:' Count_OUTPUT;
PUT 'Count 2nd output (ignoring case sensitivity) is:' Count_2_output;
PUT 'COUNTW output is:' CountW_output;

RUN;

Note that in String1 Letter 'o' occurs only twice (considering case sensitivity) while Count_2_output returns 3 because it ignores case sensitivity.

Removing leading and trailing spaces or junk values from a string!

Our data might not always be clean and hence it can have junk values or leading or trailing spaces which can later on pose a problem while filtering or joining the datasets. There are several string functions to clean the text data!

LEFT: Moves the spaces from the left side of the string (leading spaces) to the right

TRIM: Removes the spaces from the right side of the string(trailing spaces)

LEFT and TRIM together: Moves the leading spaces to right and then removes the trailing spaces

STRIP: Removes both leading and trailing spaces

COMPBL: Converts multiple white spaces to a single white space

COMPRESS: Removes the specified characters from the string.

COMPRESS(string,<pattern to be removed>)

In COMPRESS if you do not specify the characters then by default COMPRESS removes all the white spaces.

DATA _NULL_;
String1 = "    This string     has -a- lot of   %@$#%$#! spaces!!!!     ";

Left_output = LEFT(String1);
Trim_output = TRIM(String1);
Trim_Left_output = TRIM(LEFT(String1));
Strip_output = STRIP(String1);
COMPBL_Output = COMPBL(String1);
COMPRESS_Output = COMPRESS(String1);
COMPRESS_Output2 = COMPRESS(String1,"-!@#$%^&*()-");

PUT String1;
PUT 'Left output is:' Left_output ;
PUT 'Trim output is:' Trim_output ;
PUT'Trim_Left output is:' Trim_Left_output;
PUT 'Strip output is:' Strip_output ;
PUT 'Compbl output is:' COMPBL_Output;
PUT 'Compress output is:' COMPRESS_Output;
PUT 'Compress output is:' COMPRESS_Output2;
RUN;