Both IF and WHERE keywords are used to filter the datasets but not both of them can be used in all the situations. In this tutorial we shall try to understand which keyword can be used in which situations.
Following is a brief summary about which keyword to use in which situation:
Let us firstly define our LIBNAME:
libname mylib '/home/u52922927/My_datasets';
WHERE is more efficient than IF: When you are using already existing columns for filtering:
Although in such a scenario where you only need to filter an already existing column, both where and if lead to same results, but on working with large datasets WHERE is more efficient than IF.
Following is the comparison of the log, when WHERE and IF statements are used.
WHERE firstly subsets the data (BASEBALL) and then reads it, while IF firstly reads all of the observations from the dataset (BASEBALL) and then filters it.
Thus using WHERE statement in such scenarios is preferable.
IF statement wins: When you are using newly created columns for filtering:
Let us create a new column 'NEWCOL' = NHITS + 10 and try to fetch the rows where NEWCOL < 85:
Using where statement in such a scenario will cause an error because SAS searches for the variable 'NEWCOL' in the dataset BASEBALL to filter, but it could not find it.
However, if you run the following command, it shall publish the output, because you are not creating any new column.
WHERE statement wins: While printing output using PROC PRINT.
If we try to filter for rows while using PROC PRINT IF statement throws an error while WHERE keyword will successfully retrieve the desired rows.
WHERE statement wins: While using special filtering operators LIKE and CONTAIN
If we try to filter for rows while using LIKE or CONTAIN keywords then IF statement will throw an error while WHERE keyword will successfully retrieve the desired rows.
IF statement wins: While subsetting data when data is read using INPUT statement.
When we are reading raw data using INPUT statement we can filter the dataset using IF statement to save our processing time.
In the following data we are filtering for rows where number of cases is more than 200.
DATA MYLIB.Hosp_data;
INPUT Department : $30. Month $ No_of_cases ;
IF no_of_cases > 200;
cards;
E&A Jan 234
E&A Feb 136
Cardiology Jan 754
Cardiology Feb 195
Neurology Jan 932
Neurology Feb 249
;
RUN;
Since INPUT statement is there only in the Program Data Vector (PDV) and not exists in the dataset thus filtering using WHERE keyword will lead to errors:
DATA MYLIB.Hosp_data;
INPUT Department : $30. Month $ No_of_cases ;
WHERE no_of_cases > 200;
cards;
E&A Jan 234
E&A Feb 136
Cardiology Jan 754
Cardiology Feb 195
Neurology Jan 932
Neurology Feb 249
;
RUN;
IF statement wins: While subsetting data with temporary variables like FIRST. or LAST. or _N_ or _ERROR_
Let us create a data hosp_data as follows:
DATA MYLIB.Hosp_data;
INPUT Department : $30. Month $ No_of_cases ;
cards;
E&A Jan 234
E&A Feb 136
Cardiology Jan 754
Cardiology Feb 195
Neurology Jan 932
Neurology Feb 249
;
RUN;
When we sort the data by PROC SORT temporary variables like FIRST. and LAST. get created in the PDV. To filter using these temporary variables we use IF statement.
PROC SORT DATA = MYLIB.HOSP_DATA;
BY Department;
RUN;
DATA MYLIB.TEMP;
SET MYLIB.HOSP_DATA;
BY Department;
IF FIRST.Department;
RUN;
Using WHERE keyword for filtering these temporary variables will lead to an error.
DATA MYLIB.TEMP;
SET MYLIB.HOSP_DATA;
BY Department;
WHERE FIRST.Department;
RUN;
WHERE statement wins: Subsetting data in DATA or SET statement.
While subsetting the data at DATA or SET step only WHERE keyword works and not IF.
Comments