PROC SORT in SAS

Ekta Aggarwal

Jan 22, 2021

PROC SORT is a very powerful procedure in SAS to sort the dataset at multiple levels and removing the duplicates.

In this tutorial we shall learn about:

Sorting data
Sorting data in descending order
What are FIRST. and LAST. variables?
Saving the output of FIRST. and LAST.
Sorting data by multiple variables
FIRST. and LAST. when data is sorted by multiple variables.
DUPOUT, NODUPKEY and NODUP

Let us firstly define our LIBPATH:

libname mylib '/home/u50132927/My_datasets';

Syntax:

PROC SORT DATA = data_to_be_sorted OUT = output_dataset;

BY ASCENDING (or DESCENDING) variables to be sorted;

RUN;

DATA: Specifies the data which needs be sorted. If DATA is not mentioned then SAS will sort the recently processed dataset by default.

OUT: The location and name of the new dataset where sorted data will be stored. If it is not mentioned then the changes are made in the original dataset.

BY: Contains the list of variables which are to be sorted. You can sort as many variables as possible.

ASCENDING/ DESCENDING: Sorting can be done in ascending or descending order by mentioning the keyword before the variable name in BY statement. By default SAS sorts the data in ascending order.

Missing values are considered as the smallest values thus in ascending (descending) order they will appear first (last).

Learning via Examples!

Let us sort SAS' inbuilt dataset SASHELP.SHOES on the basis of variable Region and save our output in a new data MYLIB.SHOES_SORTED.

PROC SORT DATA = SASHELP.SHOES OUT = MYLIB.SHOES_SORTED;
BY Region;
RUN;

Task: Sort SHOES data by regions in descending order.

PROC SORT DATA = SASHELP.SHOES(keep = region product sales) OUT = MYLIB.SHOES_SORTED;
BY DESCENDING Region;
RUN;

FIRST. and LAST. variables

When you sort a dataset SAS creates temporary variables with prefix FIRST. and LAST. which you won't be able to view in the dataset but they are there in the program data vector (PDV) i.e. SAS' memory. Let us firstly understand what are these variables.

FIRST. : When you sort the data ,for the first occurrence of distinct values in BY keyword First.XXX is 1 and for duplicate values it is 0. In the following example, for first occurrence of Alex, Bella, Charles, Henry and Minnie FIRST.NAME = 1 and for all others it is 0.
LAST. : For the last occurrence of distinct values in BY keyword LAST.XXX is 1 and for other occurrence it is 0. In the example given below, for last occurrence of Alex, Bella, Charles, Henry and Minnie LAST.NAME = 1 and for all others it is 0.

When both FIRST.NAME and LAST.NAME = 1 for a row it means that variable does not have any duplicates (eg. Henry and Minnie)

Task: Let us sort our data by Region

PROC SORT DATA = SASHELP.SHOES OUT(keep = region product sales) = MYLIB.SHOES_SORTED;
BY Region;
RUN;

Can we save the output of FIRST.Region and LAST.Region?

Certainly. In the following code we have saved these temporary variables as a new variable in our data named First_region and Last_region.

Note that since First_region and Last_region are a part of the data thus we shall be able to view them as well!

DATA MYLIB.SHOES_SORTED ;
SET MYLIB.SHOES_SORTED;
BY Region;
First_region  = FIRST.REGION;
Last_region = LAST.REGION;
RUN;

Why FIRST.Region and LAST.Region are important?

They are extremely useful in filtering for duplicate values in our data!

PROC SORT DATA = SASHELP.SHOES(keep = region product sales) OUT = MYLIB.SHOES_SORTED;
BY Region;
RUN;

In the below code we have written IF FIRST.REGION THEN OUTPUT which means whenever FIRST.REGION = 1 then only those rows should be saved in the dataset.

Note to use FIRST. and LAST. a BY statement is a must!

DATA MYLIB.SHOES_SORTED ;
SET MYLIB.SHOES_SORTED;
BY Region;
IF FIRST.REGION THEN OUTPUT;
RUN;

Alternatively

DATA MYLIB.SHOES_SORTED ;
SET MYLIB.SHOES_SORTED;
BY Region;
IF FIRST.REGION;
RUN;

Sorting data by multiple variables

We can also sort the data by multiple variables.

Task: Sort SASHELP.SHOES by Region and product

PROC SORT DATA = SASHELP.SHOES(keep = region product sales) OUT = MYLIB.SHOES_SORTED;
BY Region Product;
RUN;

How do FIRST.XX and LAST.YY variables look when data is sorted by multiple variables?

When we are sorting the data by to levels: say Name and Department we have got 4 temporary variables: First.Name , Last.Name , First.Department and Last.Department.

First.Name and Last.Name : At first level of sorting, FIRST.XXX and LAST.XXX take value as 1 on their first and last occurence respectively.

First.Department: At second level of sorting: First.Department = 1 when SAS encounters first combination of Name and Department, otherwise it is 0.

Last.Department: At second level of sorting: Last.Department = 1 when SAS encounters last combination of Name and Department, otherwise it is 0.

In the example below, for first occurence of Alex and department A our First.Department = 1 and for last occurence it is 1, similarly for first occurence of Alex and department B both are 1.

Task: Save the values of First.XXX and Last.XXX as variables in the dataset.

DATA MYLIB.SHOES_SORTED ;
SET MYLIB.SHOES_SORTED;
BY Region PRODUCT;
First_region  = FIRST.REGION;
Last_region = LAST.REGION;
First_product = FIRST.PRODUCT;
Last_product = LAST.PRODUCT;
RUN;

Filtering for duplicate or unique cases in our data on the basis of two variables.

Let us firstly sort our dataset by region and product

PROC SORT DATA = SASHELP.SHOES(keep = region product sales) OUT = MYLIB.SHOES_SORTED;
BY Region Product;
RUN;

Task: Filter for first occurrence of region and product

In the following code IF statement means when both FIRST.REGION and FIRST.PRODUCT = 1 then only save the output in resultant dataset.

DATA MYLIB.SHOES_SORTED ;
SET MYLIB.SHOES_SORTED;
BY Region Product;
IF FIRST.REGION & FIRST.PRODUCT;
RUN;

Task: Filter for unique combinations of region and product

DATA MYLIB.SHOES_SORTED ;
SET MYLIB.SHOES_SORTED;
BY Region Product;
IF FIRST.REGION & FIRST.PRODUCT & LAST.REGION & LAST.PRODUCT;
RUN;

NODUPKEY and DUPKEY in PROC SORT

There are 3 special keywords used with PROC SORT which we shall understand in detail:

DUPOUT: Location and name of the new dataset where duplicate values will be stored. If it is not mentioned then duplicate values are discarded.

NODUPKEY: Filters the dataset for unique occurrence of the BY variables (our keys) and all other are considered as duplicates.

PROC SORT DATA = SASHELP.SHOES OUT = MYLIB.SHOES_SORTED DUPOUT = MYLIB.DUPLICATES NODUPKEY;
BY Region;
RUN;

NODUP: Filters the dataset for unique occurrence on the basis of all the columns (irrespective of variables mentioned in BY statement).

Warning: NODUP considers only those rows as duplicates which are adjacent to each other (i.e. if row 5 and row 7 are duplicate rows, NODUP won't be able to treat them as duplicates)

PROC SORT DATA = SASHELP.SHOES OUT = MYLIB.SHOES_SORTED NODUP;
BY Region;
RUN;

How to deal with rows which are not adjacent to each other but have duplicate values in all the columns?

To deal this in BY we write _ALL_ i.e. we are telling SAS to sort the data on the basis of all the variables. Due to this all of our duplicate rows will be below one another and then NODUP would be able to treat them as duplicates!

PROC SORT DATA = SASHELP.SHOES OUT = MYLIB.SHOES_SORTED NODUP;
BY _ALL_;
RUN;