[R] Data Cleaning (Missing Values)

This post covers how to deal with missing values in terms of data cleaning.

    1. Preparation

    Missing values are represented by the NA symbol in R.

    As an example of a data frame, we will use 'airquality' which is a built-in data frame in R.

    Checking the first 10 rows, we can see some missing values (marked as NA).

    #loading data 'airquality'
    data("airquality")
    head(airquality, 10)
    
    #Consloe window
    > data("airquality")
    > head(airquality, 10)
       Ozone Solar.R Wind Temp Month Day
    1     41     190  7.4   67     5   1
    2     36     118  8.0   72     5   2
    3     12     149 12.6   74     5   3
    4     18     313 11.5   62     5   4
    5     NA      NA 14.3   56     5   5
    6     28      NA 14.9   66     5   6
    7     23     299  8.6   65     5   7
    8     19      99 13.8   59     5   8
    9      8      19 20.1   61     5   9
    10    NA     194  8.6   69     5  10
    


    2. Finding Missing Values

    1) is.na()

    Among several ways to find any NA, is.na() function is the most basic, yet primitive one.

    It literally means "Is it NA?" and it tells us whether each value is a NA or not.

    It comes in handy when you want to know any specific "location" of NA, but it is less efficient as the size of data increases.

    #is.na() function
    is.na(airquality)
    
    #Consloe window
    > is.na(airquality)
           Ozone Solar.R  Wind  Temp Month   Day
      [1,] FALSE   FALSE FALSE FALSE FALSE FALSE
      [2,] FALSE   FALSE FALSE FALSE FALSE FALSE
      [3,] FALSE   FALSE FALSE FALSE FALSE FALSE
      [4,] FALSE   FALSE FALSE FALSE FALSE FALSE
      [5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
      [6,] FALSE    TRUE FALSE FALSE FALSE FALSE
      [7,] FALSE   FALSE FALSE FALSE FALSE FALSE
      [8,] FALSE   FALSE FALSE FALSE FALSE FALSE
      [9,] FALSE   FALSE FALSE FALSE FALSE FALSE
     [10,]  TRUE   FALSE FALSE FALSE FALSE FALSE
    (The rest is omitted)
    

    2) table(is.na())

    Often, we just want to know whether there is any NA or not.

    In this case, we can use either table(is.na()) or sum(is.na()) function.

    table(is.na()) displays the number of both non-NA (FALSE, 874) and NA (TRUE, 44).

    #table(is.na()) function
    table(is.na(airquality))
    
    #Consloe window
    > table(is.na(airquality))
    
    FALSE  TRUE 
      874    44 
    

    3) sum(is.na())

    sum(is.na()) does a similar job but it only shows us the number of NA (44).

    #sum(is.na()) function
    sum(is.na(airquality))
    
    #Consloe window
    > sum(is.na(airquality))
    [1] 44
    

    4) Finding NA for specific variables

    We can also detect NA for specific variable.

    In this case, we can use the two functions again, but this time, with designating the variable.

    We do that by putting a '$' between the data frame name and the variable name.

    As an example, if we test the 'Ozone' variable, we can figure out that there are some NAs and even specific number of them.

    #table(is.na(data_frame$variable)) function
    table(is.na(airquality$Ozone))
    
    #sum(is.na(data_frame$variable)) function
    sum(is.na(airquality$Ozone))
    
    #Consloe window
    > table(is.na(airquality$Ozone))
    
    FALSE  TRUE 
      116    37 
      
    > sum(is.na(airquality$Ozone))
    [1] 37
    

    5) Finding cases with NA for specific variable(s)

    If you want to see only the cases with NA for specifica variable(s), you can apply is.na() function to filter() function.

    (The filter() function requires installation and loading of dplyr package)

    For example, finding cases that have NA for Ozone variable, we get a toal of 37 cases.

    #filter(is.na()) function
    library(dplyr)
    airquality %>% filter(is.na(Ozone))
    
    #Consloe window
    > library(dplyr)
    > airquality %>% filter(is.na(Ozone))
       Ozone Solar.R Wind Temp Month Day
    1     NA      NA 14.3   56     5   5
    2     NA     194  8.6   69     5  10
    3     NA      66 16.6   57     5  25
    (The rest is omitted)
    35    NA     255 12.6   75     8  23
    36    NA     153  5.7   88     8  27
    37    NA     145 13.2   77     9  27
    


    3. Removing Missing Values

    1) filter(!is.na())

    By using the filter() funtion explained above, we can extract the cases that do not have any NAs.

    To do so, we only have to put '!' in front of is.na() function, to reverse the meaning.

    The first exmaple below shows how to extract cases that have no NA for 'Ozone' variable.

    You may take it as a new data frame.

    The second exmaple demonstrates how to extract cases that have no NA for both 'Ozone' and 'Solar.R' variable.

    It is done by using '&', and you can pick more than two variables with it.

    #filter(!is.na()) function
    library(dplyr)
    airquality %>% filter(!is.na(Ozone))
    df_noNA2 <- airquality %>% filter(!is.na(Ozone))
    airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R))
    df_noNA3 <- airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R))
    
    #Consloe window
    > library(dplyr)
    > airquality %>% filter(!is.na(Ozone))
        Ozone Solar.R Wind Temp Month Day
    1      41     190  7.4   67     5   1
    2      36     118  8.0   72     5   2
    3      12     149 12.6   74     5   3
    (The rest is omitted)
    114    14     191 14.3   75     9  28
    115    18     131  8.0   76     9  29
    116    20     223 11.5   68     9  30
    > df_noNA2 <- airquality %>% filter(!is.na(Ozone))
    > airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R))
        Ozone Solar.R Wind Temp Month Day
    1      41     190  7.4   67     5   1
    2      36     118  8.0   72     5   2
    3      12     149 12.6   74     5   3
    (The rest is omitted)
    109    14     191 14.3   75     9  28
    110    18     131  8.0   76     9  29
    111    20     223 11.5   68     9  30
    > df_noNA3 <- airquality %>% filter(!is.na(Ozone) & !is.na(Solar.R))
    

    2) na.omit()

    na.omit() function removes every case that has any missing value.

    As you can see below, the case number 5 6 10 11 ... were deleted since they included missing value for at least one variable.

    If this is what you wanted, you can have it as a new data frame.

    #na.omit(data_frame$variable) function
    na.omit(airquality)
    df_noNA1 <- na.omit(airquality)
    df_noNA1
    
    #Consloe window
    > na.omit(airquality)
        Ozone Solar.R Wind Temp Month Day
    1      41     190  7.4   67     5   1
    2      36     118  8.0   72     5   2
    3      12     149 12.6   74     5   3
    4      18     313 11.5   62     5   4
    7      23     299  8.6   65     5   7
    8      19      99 13.8   59     5   8
    9       8      19 20.1   61     5   9
    12     16     256  9.7   69     5  12
    (The rest is omitted)
    > df_noNA <- na.omit(airquality)
    > df_noNA
        Ozone Solar.R Wind Temp Month Day
    1      41     190  7.4   67     5   1
    2      36     118  8.0   72     5   2
    3      12     149 12.6   74     5   3
    4      18     313 11.5   62     5   4
    7      23     299  8.6   65     5   7
    8      19      99 13.8   59     5   8
    9       8      19 20.1   61     5   9
    12     16     256  9.7   69     5  12
    (The rest is omitted)
    


    4. Replacing Missing Values

    Sometimes we need to replace the NAs, especially when there are too many NAs.

    Here, we will take the mean value as a replacement for NAs.

    First, we need to know how to calcualte the mean value without counting the NAs.

    This can be done by using na.rm parameter, which helps you compute something by without considering (removing) the NAs.

    The example below shows how to get the mean value of the Ozone variable, withoout considering the NAs.

    #mean(variable, na.rm = T) function
    mean(airquality$Ozone, na.rm = T)
    
    #Consloe window
    > mean(airquality$Ozone, na.rm = T)
    [1] 42.12931
    

    Now, here comes the ifelse() function for the replacement

    ifelse(A, B, C) does the job if A is true then it gives B and if A is not true, then C.

    Therefore, we will put is.na() function for A, the mean value for B, and the original value for C.

    As yuo can see, all the values of 42.12931 were originally NAs but they were replaced by the mean value.

    Finally, you may designate it to the original variable name, which lets you keep the name of the variable but have the values changed.

    #ifelse() function
    ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone)
    airquality$Ozone <- ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone)
    
    #Consloe window
    > ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone)
      [1]  41.00000  36.00000  12.00000  18.00000  42.12931
      [6]  28.00000  23.00000  19.00000   8.00000  42.12931
     [11]   7.00000  16.00000  11.00000  14.00000  18.00000
    (The rest is omitted)
    [141]  13.00000  24.00000  16.00000  13.00000  23.00000
    [146]  36.00000   7.00000  14.00000  30.00000  42.12931
    [151]  14.00000  18.00000  20.00000
    > airquality$Ozone <- ifelse(is.na(airquality$Ozone), mean(airquality$Ozone, na.rm = T), airquality$Ozone)
    

    (Thumbnail image: photo by Pierre Bamin on Unsplash)

    Post a Comment

    0 Comments