This post covers how to deal with missing values in terms of data cleaning.
1. Preparation
Missing values are represented by the NA symbol in R.
As an example of a data frame, we will use 'airquality' which is a built-in data frame in R.
Checking the first 10 rows, we can see some missing values (marked as NA).
#loading data 'airquality'
data("airquality")
head(airquality, 10)
#Consloe window
> data("airquality")
> head(airquality, 10)
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
2. Finding Missing Values
1) is.na()
Among several ways to find any NA, is.na() function is the most basic, yet primitive one.
It literally means "Is it NA?" and it tells us whether each value is a NA or not.
It comes in handy when you want to know any specific "location" of NA, but it is less efficient as the size of data increases.
#is.na() function is.na(airquality)
#Consloe window
> is.na(airquality)
       Ozone Solar.R  Wind  Temp Month   Day
  [1,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [2,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [3,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [4,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
  [6,] FALSE    TRUE FALSE FALSE FALSE FALSE
  [7,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [8,] FALSE   FALSE FALSE FALSE FALSE FALSE
  [9,] FALSE   FALSE FALSE FALSE FALSE FALSE
 [10,]  TRUE   FALSE FALSE FALSE FALSE FALSE
(The rest is omitted)
2) table(is.na())
Often, we just want to know whether there is any NA or not.
In this case, we can use either table(is.na()) or sum(is.na()) function.
table(is.na()) displays the number of both non-NA (FALSE, 874) and NA (TRUE, 44).
#table(is.na()) function table(is.na(airquality))
#Consloe window > table(is.na(airquality)) FALSE TRUE 874 44
3) sum(is.na())
sum(is.na()) does a similar job but it only shows us the number of NA (44).
#sum(is.na()) function sum(is.na(airquality))
#Consloe window > sum(is.na(airquality)) [1] 44
4) Finding NA for specific variables
We can also detect NA for specific variable.
In this case, we can use the two functions again, but this time, with designating the variable.
We do that by putting a '$' between the data frame name and the variable name.
As an example, if we test the 'Ozone' variable, we can figure out that there are some NAs and even specific number of them.
#table(is.na(data_frame$variable)) function table(is.na(airquality$Ozone)) #sum(is.na(data_frame$variable)) function sum(is.na(airquality$Ozone))
#Consloe window > table(is.na(airquality$Ozone)) FALSE TRUE 116 37 > sum(is.na(airquality$Ozone)) [1] 37
5) Finding cases with NA for specific variable(s)
If you want to see only the cases with NA for specifica variable(s), you can apply is.na() function to filter() function.
(The filter() function requires installation and loading of dplyr package)
For example, finding cases that have NA for Ozone variable, we get a toal of 37 cases.
#filter(is.na()) function library(dplyr) airquality %>% filter(is.na(Ozone))
#Consloe window > library(dplyr) > airquality %>% filter(is.na(Ozone)) Ozone Solar.R Wind Temp Month Day 1 NA NA 14.3 56 5 5 2 NA 194 8.6 69 5 10 3 NA 66 16.6 57 5 25 (The rest is omitted) 35 NA 255 12.6 75 8 23 36 NA 153 5.7 88 8 27 37 NA 145 13.2 77 9 27
3. Removing Missing Values
1) filter(!is.na())
By using the filter() funtion explained above, we can extract the cases that do not have any NAs.
To do so, we only have to put '!' in front of is.na() function, to reverse the meaning.
The first exmaple below shows how to extract cases that have no NA for 'Ozone' variable.
You may take it as a new data frame.
The second exmaple demonstrates how to extract cases that have no NA for both 'Ozone' and 'Solar.R' variable.
It is done by using '&', and you can pick more than two variables with it.
#filter(!is.na()) function library(dplyr) airquality %>% filter(!is.na(Ozone)) df_noNA2 <- airquality %>% filter(!is.na(Ozone)) airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R)) df_noNA3 <- airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R))
#Consloe window
> library(dplyr)
> airquality %>% filter(!is.na(Ozone))
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
(The rest is omitted)
114    14     191 14.3   75     9  28
115    18     131  8.0   76     9  29
116    20     223 11.5   68     9  30
> df_noNA2 <- airquality %>% filter(!is.na(Ozone))
> airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R))
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
(The rest is omitted)
109    14     191 14.3   75     9  28
110    18     131  8.0   76     9  29
111    20     223 11.5   68     9  30
> df_noNA3 <- airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R))
2) na.omit()
na.omit() function removes every case that has any missing value.
As you can see below, the case number 5 6 10 11 ... were deleted since they included missing value for at least one variable.
If this is what you wanted, you can have it as a new data frame.
#na.omit(data_frame$variable) function na.omit(airquality) df_noNA1 <- na.omit(airquality) df_noNA1
#Consloe window
> na.omit(airquality)
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
12     16     256  9.7   69     5  12
(The rest is omitted)
> df_noNA <- na.omit(airquality)
> df_noNA
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
12     16     256  9.7   69     5  12
(The rest is omitted)
4. Replacing Missing Values
Sometimes we need to replace the NAs, especially when there are too many NAs.
Here, we will take the mean value as a replacement for NAs.
First, we need to know how to calcualte the mean value without counting the NAs.
This can be done by using na.rm parameter, which helps you compute something by without considering (removing) the NAs.
The example below shows how to get the mean value of the Ozone variable, withoout considering the NAs.
#mean(variable, na.rm = T) function mean(airquality$Ozone, na.rm = T)
#Consloe window > mean(airquality$Ozone, na.rm = T) [1] 42.12931
Now, here comes the ifelse() function for the replacement
ifelse(A, B, C) does the job if A is true then it gives B and if A is not true, then C.
Therefore, we will put is.na() function for A, the mean value for B, and the original value for C.
As yuo can see, all the values of 42.12931 were originally NAs but they were replaced by the mean value.
Finally, you may designate it to the original variable name, which lets you keep the name of the variable but have the values changed.
#ifelse() function ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone) airquality$Ozone <- ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone)
#Consloe window > ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone) [1] 41.00000 36.00000 12.00000 18.00000 42.12931 [6] 28.00000 23.00000 19.00000 8.00000 42.12931 [11] 7.00000 16.00000 11.00000 14.00000 18.00000 (The rest is omitted) [141] 13.00000 24.00000 16.00000 13.00000 23.00000 [146] 36.00000 7.00000 14.00000 30.00000 42.12931 [151] 14.00000 18.00000 20.00000 > airquality$Ozone <- ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone)
(Thumbnail image: photo by Pierre Bamin on Unsplash)
 
![[R] Data Import](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5c_zi7m6ac3-R1bYIJyT3W6OQT7hc_EM4VLG7AGHx4KTh34WM1TNnGdU5Ft1mTclDmN1_U91hHixDjM1FFnGa8bZYnpykFWAMGv_EBKXyrWWjrPcxOc25YzY50igIPuznUsEyWRlvjKJpbO6EOCAn3GPlHE3wg2QSOLiqbkLfUazVALtPSH93WE3CTw/w72-h72-p-k-no-nu/mika-baumeister-Wpnoqo2plFA-unsplash.jpg) 
0 Comments