Tidy data structures, summaries, and visualisations for missing data

Overview

naniar

naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data. It does this by providing:

  • Shadow matrices, a tidy data structure for missing data:
    • bind_shadow() and nabular()
  • Shorthand summaries for missing data:
    • n_miss() and n_complete()
    • pct_miss() and pct_complete()
  • Numerical summaries of missing data in variables and cases:
    • miss_var_summary() and miss_var_table()
    • miss_case_summary() and miss_case_table()
  • Statistical tests of missingness:
    • mcar_test()
  • Visualisation for missing data:
    • geom_miss_point()
    • gg_miss_var()
    • gg_miss_case()
    • gg_miss_fct()

For more details on the workflow and theory underpinning naniar, read the vignette Getting started with naniar.

For a short primer on the data visualisation available in naniar, read the vignette Gallery of Missing Data Visualisations.

Installation

You can install naniar from CRAN:

install.packages("naniar")

Or you can install the development version from GitHub using remotes:

# install.packages("remotes")
remotes::install_github("njtierney/naniar")

A short overview of naniar

Visualising missing data might sound a little strange: how do you visualise something that is not there? One approach comes from ggobi and manet, which replace NA values with values 10% lower than the minimum value in that variable. This visualisation is provided by the geom_miss_point() ggplot2 geom, which we illustrate by exploring the relationship between Ozone and Solar radiation in the airquality dataset.

library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()
#> Warning: Removed 42 rows containing missing values (geom_point).

ggplot2 does not display these missing values; it drops them and warns that rows were removed.

We can instead use geom_miss_point() to display the missing data:

library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

geom_miss_point() has shifted the missing values to be 10% below the minimum value. The missing values are a different colour so that missingness becomes pre-attentive. Because it is a ggplot2 geom, it supports standard ggplot2 features such as faceting:

p1 <-
ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_miss_point() + 
  facet_wrap(~Month, ncol = 2) + 
  theme(legend.position = "bottom")

p1
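
The shifting idea behind geom_miss_point() can be sketched in base R. The helper below, shift_below_min(), is a hypothetical name and not part of naniar. It assumes that "10% below the minimum" means 10% of the observed range below the minimum, which may differ in detail from what naniar does internally.

```r
# Hypothetical helper (not a naniar function): replace NA values with a
# value placed `prop` of the observed range below the observed minimum.
shift_below_min <- function(x, prop = 0.1) {
  rng <- range(x, na.rm = TRUE)
  shifted_value <- rng[1] - prop * diff(rng)
  ifelse(is.na(x), shifted_value, x)
}

ozone_shifted <- shift_below_min(airquality$Ozone)

# All 37 missing Ozone values now sit below the observed minimum
sum(ozone_shifted < min(airquality$Ozone, na.rm = TRUE))
#> [1] 37
```

Plotting ozone_shifted against a similarly shifted Solar.R would place the missing values in the margins of the scatterplot, which is roughly what the geom does for you.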

Data Structures

naniar provides a data structure for working with missing data, the shadow matrix (Swayne and Buja, 1998). The shadow matrix has the same dimensions as the data and consists of binary indicators of missingness, where missing is represented as “NA” and not missing is represented as “!NA”. Variable names are kept the same, with the suffix “_NA” added.

head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

as_shadow(airquality)
#> # A tibble: 153 x 6
#>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
#>    <fct>    <fct>      <fct>   <fct>   <fct>    <fct> 
#>  1 !NA      !NA        !NA     !NA     !NA      !NA   
#>  2 !NA      !NA        !NA     !NA     !NA      !NA   
#>  3 !NA      !NA        !NA     !NA     !NA      !NA   
#>  4 !NA      !NA        !NA     !NA     !NA      !NA   
#>  5 NA       NA         !NA     !NA     !NA      !NA   
#>  6 !NA      NA         !NA     !NA     !NA      !NA   
#>  7 !NA      !NA        !NA     !NA     !NA      !NA   
#>  8 !NA      !NA        !NA     !NA     !NA      !NA   
#>  9 !NA      !NA        !NA     !NA     !NA      !NA   
#> 10 NA       !NA        !NA     !NA     !NA      !NA   
#> # … with 143 more rows
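
To make the structure concrete, a shadow matrix can be approximated in a few lines of base R. This is only a sketch of the idea described above: make_shadow() is a hypothetical name, and the real as_shadow() returns a tibble and may differ in other details.

```r
# Hypothetical base R sketch of a shadow matrix (not naniar's implementation).
# Each column becomes a factor with levels "!NA" (present) and "NA" (missing),
# and "_NA" is appended to the column name.
make_shadow <- function(data) {
  shadow <- lapply(data, function(col) {
    factor(ifelse(is.na(col), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  names(shadow) <- paste0(names(data), "_NA")
  data.frame(shadow, check.names = FALSE)
}

shadow <- make_shadow(airquality)
dim(shadow)
#> [1] 153   6
```

Note that the level "NA" here is the character string "NA", not a missing value, so the shadow matrix itself contains no missing values.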

Binding the shadow data to the data helps you keep better track of the missing values. This format is called “nabular”, a portmanteau of NA and tabular. You can bind the shadow to the data using bind_shadow() or nabular():

bind_shadow(airquality)
#> # A tibble: 153 x 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
#> # … with 143 more rows, and 2 more variables: Month_NA <fct>, Day_NA <fct>

nabular(airquality)
#> # A tibble: 153 x 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
#> # … with 143 more rows, and 2 more variables: Month_NA <fct>, Day_NA <fct>

Using the nabular format helps you manage where missing values are in your dataset and makes it easy to do visualisations where you split by missingness:

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             fill = Ozone_NA)) + 
  geom_density(alpha = 0.5)

And even visualise imputations:

airquality %>%
  bind_shadow() %>%
  as.data.frame() %>%
  simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()
#> Warning: Removed 7 rows containing missing values (geom_point).

Or create an upset plot, which shows the combinations of missingness across cases, using the gg_miss_upset() function:

gg_miss_upset(airquality)

naniar does this while following consistent principles that are easy to read, thanks to the tools of the tidyverse.

naniar also provides handy visualisations for each variable:

gg_miss_var(airquality)

Or the number of missings in a given variable at a repeating span:

gg_miss_span(pedestrian,
             var = hourly_counts,
             span_every = 1500)

You can read about all of the visualisations in naniar in the vignette Gallery of missing data visualisations using naniar.

naniar also provides handy helpers for calculating the number, proportion, and percentage of missing and complete observations:

n_miss(airquality)
#> [1] 44
n_complete(airquality)
#> [1] 874
prop_miss(airquality)
#> [1] 0.04793028
prop_complete(airquality)
#> [1] 0.9520697
pct_miss(airquality)
#> [1] 4.793028
pct_complete(airquality)
#> [1] 95.20697
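
For comparison, the same quantities can be computed in base R. This sketch only shows what the helpers above measure; it is not a description of how naniar implements them.

```r
# Base R counterparts of the helpers above, for comparison only
n_missing    <- sum(is.na(airquality))    # same count as n_miss()
n_present    <- sum(!is.na(airquality))   # same count as n_complete()
prop_missing <- mean(is.na(airquality))   # same value as prop_miss()
pct_missing  <- 100 * prop_missing        # same value as pct_miss()

c(n_missing, n_present)
#> [1]  44 874
```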

Numerical summaries for missing data

naniar provides numerical summaries of missing data that follow a consistent naming rule: they all begin with miss_. Summaries focusing on variables, or on a single selected variable, start with miss_var_, and summaries for cases (the rows of the data, in their original order) start with miss_case_. All of these functions that return data frames also work with dplyr’s group_by().

For example, we can look at the number and percent of missings in each variable and case with miss_var_summary() and miss_case_summary(), which both return output ordered by the number of missing values.

miss_var_summary(airquality)
#> # A tibble: 6 x 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <dbl>
#> 1 Ozone        37    24.2 
#> 2 Solar.R       7     4.58
#> 3 Wind          0     0   
#> 4 Temp          0     0   
#> 5 Month         0     0   
#> 6 Day           0     0
miss_case_summary(airquality)
#> # A tibble: 153 x 3
#>     case n_miss pct_miss
#>    <int>  <int>    <dbl>
#>  1     5      2     33.3
#>  2    27      2     33.3
#>  3     6      1     16.7
#>  4    10      1     16.7
#>  5    11      1     16.7
#>  6    25      1     16.7
#>  7    26      1     16.7
#>  8    32      1     16.7
#>  9    33      1     16.7
#> 10    34      1     16.7
#> # … with 143 more rows
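
As a cross-check, the per-variable and per-case counts can be reproduced with colSums() and rowSums() on the logical matrix returned by is.na(). The object names below are illustrative, and this is a sketch rather than naniar's own code.

```r
# Logical matrix: TRUE where a value is missing
na_flags <- is.na(airquality)

# Missing values per variable, largest first (compare miss_var_summary())
n_miss_per_var <- sort(colSums(na_flags), decreasing = TRUE)

# Missing values and percent missing per case (compare miss_case_summary())
n_miss_per_case   <- rowSums(na_flags)
pct_miss_per_case <- 100 * rowMeans(na_flags)

# Cases with two missing values
unname(which(n_miss_per_case == 2))
#> [1]  5 27
```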

You could also use group_by() to work out the number of missings in each variable within each level of a grouping variable.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
airquality %>%
  group_by(Month) %>%
  miss_var_summary()
#> # A tibble: 25 x 4
#> # Groups:   Month [5]
#>    Month variable n_miss pct_miss
#>    <int> <chr>     <int>    <dbl>
#>  1     5 Ozone         5     16.1
#>  2     5 Solar.R       4     12.9
#>  3     5 Wind          0      0  
#>  4     5 Temp          0      0  
#>  5     5 Day           0      0  
#>  6     6 Ozone        21     70  
#>  7     6 Solar.R       0      0  
#>  8     6 Wind          0      0  
#>  9     6 Temp          0      0  
#> 10     6 Day           0      0  
#> # … with 15 more rows
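
The grouped counts for a single variable can be checked with tapply(). This base R sketch covers only Ozone by Month and is meant as a sanity check on the grouped output above, not as a replacement for it.

```r
# Number and percent of missing Ozone values within each Month
ozone_miss_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)
ozone_pct_by_month  <- 100 * tapply(is.na(airquality$Ozone), airquality$Month, mean)

# May has 5 missing Ozone values and June has 21, as in the table above
unname(ozone_miss_by_month[c("5", "6")])
#> [1]  5 21
```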

You can read more about all of these functions in the vignette “Getting Started with naniar”.

Statistical tests of missingness

naniar provides mcar_test() for Little’s (1988) statistical test for missing completely at random (MCAR) data. The null hypothesis in this test is that the data is MCAR, and the test statistic is a chi-squared value. Given the high statistic value and low p-value, we reject the null hypothesis and conclude that the airquality data is not missing completely at random:

mcar_test(airquality)
#> # A tibble: 1 x 4
#>   statistic    df p.value missing.patterns
#>       <dbl> <dbl>   <dbl>            <int>
#> 1      35.1    14 0.00142                4
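
The reported p-value is the upper tail of a chi-squared distribution evaluated at the test statistic. As a sketch, it can be approximately recovered from the rounded statistic and degrees of freedom printed above. The result matches only to the precision that the rounded statistic allows.

```r
# Upper-tail chi-squared probability for the rounded statistic shown above
statistic <- 35.1
df <- 14
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)

# Agrees with the printed p.value (0.00142) to about three significant figures
signif(p_value, 2)
#> [1] 0.0014
```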

Contributions

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Future Work

  • Extend the geom_miss_* family to include categorical variables and bivariate plots: scatterplots, density overlays
  • SQL translation for databases
  • Big Data tools (sparklyr, sparklingwater)
  • Work well with other imputation engines / processes
  • Provide tools for assessing goodness of fit for classical approaches of MCAR, MAR, and MNAR (graphical inference from nullabor package)

Acknowledgements

Firstly, thanks to Di Cook for giving the initial inspiration for the package and laying down the rich theory and literature that the work in naniar is built upon. Naming credit (once again!) goes to Miles McBain. Among various other things, Miles also worked out how to overload the missing data and make it work as a geom. Thanks also to Colin Fay for helping me understand tidy evaluation and for features such as replace_to_na, miss_*_cumsum, and more.

A note on the name

naniar was previously named ggmissing and initially provided a ggplot geom and some other visualisations. ggmissing was changed to naniar to reflect the fact that this package is going to be bigger in scope, and is not just related to ggplot2. Specifically, the package is designed to provide a suite of tools for generating visualisations of missing values and imputations, and for manipulating and summarising missing data.

…But why naniar?

Well, I think it is useful to think of missing values in data being like this other dimension, perhaps like C.S. Lewis’s Narnia - a different world, hidden away. You go inside, and sometimes it seems like you’ve spent no time in there but time has passed very quickly, or the opposite. Also, NAniar = na in r, and if you so desire, naniar may sound like “noneoya” in an nz/aussie accent. Full credit to @MilesMcbain for the name, and @Hadley for the rearranged spelling.

Comments
  • create other flavours of missing values

    Building on issues #25 and #31, and discussions with @rgayler, there needs to be a way to create different flavours of missing values to indicate different mechanisms.

    An example of this could be where a weather station records -99 as a missing value, but missing specifically because the weather was so cold the instruments stop working.

    Currently in R there is only one kind of NA value (ignoring NA_integer_ ... and friends).

    So there needs to be a way to specify your own missing value NA_this (or something).

    This might be a function like tidyr::replace_na, perhaps instead called replace_na_why or something.

    This might look like

    
    data %>%
    replace_na_why(.condition = var == -99,
                  .why = "weather station too cold",
                  .suffix = "TC")
    

    This would then create a value NA_TC, which then has a mechanism recorded.

    Since R does not treat these as missing, we would incorporate this into the shadow matrix values

    !NA, NA, and NA_.why
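
A minimal base R sketch of this idea follows. It is an illustration only: the proposed replace_na_why() function is hypothetical, and the vector temps and the "NA_TC" label are made-up example values following the weather station scenario above.

```r
# Made-up example: -99 means "weather station too cold"
temps <- c(12.1, -99, 8.4, NA, -99)

# Flag the special code before overwriting it
is_too_cold <- !is.na(temps) & temps == -99

# Shadow column with a third level recording the mechanism
temps_shadow <- factor(
  ifelse(is_too_cold, "NA_TC", ifelse(is.na(temps), "NA", "!NA")),
  levels = c("!NA", "NA", "NA_TC")
)

# The data column itself holds ordinary NA values
temps[is_too_cold] <- NA

as.character(temps_shadow)
#> [1] "!NA"   "NA_TC" "!NA"   "NA"    "NA_TC"
```

The data column stays a plain numeric vector with ordinary NA values, while the reason for each missing value lives in the shadow column.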

    opened by njtierney 15
  • add .common_na_strings

    There are many other ways to represent missing values - I would like to include a bit of a mega string of values that naniar can search for, and also replace. This would cover things like:

    • "N A "
    • "NA "
    • " NA"
    • "N/A"
    • "N / A"

    etc.
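
One way this could look is sketched below. The names common_na and strings_to_na() are hypothetical, and the list of strings is a small illustrative subset rather than the proposed "mega string".

```r
# Hypothetical, deliberately short list of strings to treat as missing
common_na <- c("NA", "N A", "N/A", "N / A", "na", "n/a")

# Hypothetical helper: trim surrounding whitespace, then match against the list
strings_to_na <- function(x, na_strings = common_na) {
  x[trimws(x) %in% na_strings] <- NA
  x
}

strings_to_na(c("a", " NA", "N/A", "b", "NA "))
#> [1] "a" NA  NA  "b" NA
```

Trimming whitespace before matching means that variants such as "NA " and " NA" do not each need their own entry in the list.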

    opened by njtierney 14
  • Replacing selected values as NA

    Following on from excellent discussion with @ColinFay about replacing values with missing values.

    Whilst there is a way to replace NA values with a specified value in the tidyverse with tidyr::replace_na():

    library(tidyverse)
    
    dat_ms <- tibble::tribble(~x,  ~y,
                              1,   "A",
                              3,   "N/A",
                              NA,  NA,
                              -99, "E")
                              
    dat_ms %>%
      replace_na(list(x = -99, y = c("N/A")))
    

    There is currently not a convenient way to move in the opposite direction - replace certain values with NA.

    One approach is to do something like this:

    dat_ms %>% 
      mutate_all(.funs = function(x) replace(x, which(x == -99 | x == "N/A"), NA))
    

    However, the user still needs to specify all possible missing values. Making the intention to replace values with NA clear to the user would be very useful. This function could be called replace_with_na, and would have similar behaviour to replace_na.

    data %>% 
      replace_to_na(list(x = -99, "NAn", "Missing", "missing")) %>% 
      miss_var_summary()
    

    This indicates to the user that you are replacing these missing values, then doing a summary on them.

    To save the user from specifying all of the variable names to replace NA values, we can explore adding the additional suffixes, _at, _if, or _all - borrowing from the scoped variants of summarise, mutate, transmute, and rename, giving us:

    • replace_to_na_if
    • replace_to_na_at, and
    • replace_to_na_all

    These should then follow the same rules as the other scoped variants from dplyr, where:

    • _all affects every variable
    • _at affects variables selected with a character vector or vars()
    • _if affects variables selected with a predicate function
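
The _all behaviour can be sketched in base R. The function name replace_values_with_na() is hypothetical and this is not the proposed implementation. It takes the values to replace as a list so that numeric and character codes can be mixed.

```r
# Same example data as above, built as a base R data frame
dat_ms <- data.frame(
  x = c(1, 3, NA, -99),
  y = c("A", "N/A", NA, "E"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: in every column, set entries equal to any of
# `values` to NA, leaving existing NA values untouched
replace_values_with_na <- function(data, values) {
  data[] <- lapply(data, function(col) {
    for (v in values) {
      col[!is.na(col) & col == v] <- NA
    }
    col
  })
  data
}

cleaned <- replace_values_with_na(dat_ms, list(-99, "N/A"))
colSums(is.na(cleaned))
#> x y 
#> 2 2
```

An _at or _if variant would apply the same inner function to a subset of columns, chosen by name or by a predicate such as is.character.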

    I think we are stretching this idea a bit, as what we are effectively writing is a customised case_when command. However, if this works out as I think it can, it would be quite powerful.

    We can then take this idea further and extend it to shadow values, allowing us to specify directly the different flavours of missings, providing the verbs:

    • replace_shadow
    • replace_shadow_all
    • replace_shadow_if
    • replace_shadow_at

    data %>%
      bind_shadow() %>%
      replace_shadow_all(-99, "NAn", "Missing", "missing")
    

    This code would then only alter the shadow matrix and leave the data intact, allowing us to leverage other features of the shadow matrix, and possibly to add an additional factor level to it that describes the missingness mechanism (!NA, NA, NA_, NA_). This has been described in the github issue https://github.com/njtierney/narnia/issues/50 , but there needs to be a way to store the "codebook / data dictionary" of missingness mechanisms, so that the user has a way to look up / describe what a value like NA_rainfall really means.

    So I think that the idea of specifying these different missingness mechanisms still needs more work and experimentation. There are currently ways in haven to store the "different" values of missingness using tagged_na(), but I don't really like hiding important features from the user, and here I think that shadow matrix can be more practical.

    We can then make the miss_ functions have special behaviour for the _NA columns, so that gg_miss_var(data) could provide the summary of the number of pure NA values, and then also the number of values that are coded as NA_<reason_1>.

    enhancement help wanted V0.2.0 Idea 
    opened by njtierney 12
  • Work on perf in gg_miss for large dataset

    The gg_miss() family seems to take a (very) long time to compute on large datasets.

    Here's an example w/ diamonds:

    mb <- microbenchmark::microbenchmark(gg_miss_case(diamonds), times = 10)
    mb
    

    [Screenshot, 2017-08-24 14:52:04]

    Is this ggplot or naniar related?

    V0.2.0 
    opened by ColinFay 10
  • Use rowSums / rowMeans to compute row-wise summaries

    This simplifies the code and is 4-5 orders of magnitude faster

    Description

    An alternative to https://github.com/njtierney/naniar/pull/111, in case you do not want the added complexity. This uses the base functions rowSums() and rowMeans() and the fact that is.na() is vectorized to do the calculations. It is 4-5 orders of magnitude faster than the original implementation, and seems to be ~ 2-10x faster than the Rcpp version in that PR (although I did not directly compare them).

    microbenchmark::microbenchmark(add_n_miss(airquality), current_add_n_miss(airquality), unit = "ms")
    
    #> Unit: milliseconds
    #>                            expr       min         lq         mean     median          uq        max neval cld
    #>          add_n_miss(airquality)  0.048766   0.060757   0.07926442   0.083558   0.0895555   0.160938   100  a
    #>  current_add_n_miss(airquality) 92.369034 100.457525 106.24287966 104.383850 108.0468730 294.761924   100   b
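
The core of the approach can be sketched in a few lines. This shows the rowSums() / rowMeans() technique on a small data frame and is not the code from the pull request. The column names n_miss_all and prop_miss_all are chosen for illustration.

```r
test_df <- data.frame(x = c(NA, 2, 3),
                      y = c(1, NA, 3),
                      z = c(1, 2, 3))

# is.na() on a data frame returns a logical matrix, so the row-wise
# count and proportion of missing values are each a single vectorised call
na_flags <- is.na(test_df)
test_df$n_miss_all    <- rowSums(na_flags)
test_df$prop_miss_all <- rowMeans(na_flags)

test_df$n_miss_all
#> [1] 1 1 0
```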
    
    opened by jimhester 7
  • c++ back end for add_n_miss

    Description

    I've added a (parallel) C++ back end for counting the number of NA in each row, and simplified the associated tidyeval logic.

    Example

    No interface change. But the function is now split into smaller bits.

    > test_df <- data.frame(x = c(NA,2,3),
    +                       y = c(1,NA,3),
    +                       z = c(1,2,3))
    > 
    > add_n_miss(test_df)
       x  y z n_miss_all
    1 NA  1 1          1
    2  2 NA 2          1
    3  3  3 3          0
    > add_n_miss(test_df, x)
       x  y z n_miss_vars
    1 NA  1 1           1
    2  2 NA 2           0
    3  3  3 3           0
    > ( naniar:::count_na(test_df, 1:2 ) )
    [1] 1 1 0
    

    Tests

    No additional tests

    opened by romainfrancois 7
  • behaviour of is.na

    is.na will return true for NA

    is.na(NA)
    #> [1] TRUE
    
    

    But for a quoted character, "NA", this is apparently not missing

    is.na("NA")
    #> [1] FALSE
    

    Somewhat surprisingly, NaN values are also regarded as missing values.

    is.na(NaN)
    #> [1] TRUE
    
    is.na("NaN")
    #> [1] FALSE
    

    I think that the quoted characters "NA", "na", "Na", etc. should be regarded as missing, or there should be a specific function to coerce them to NAs. This function should also allow for some user-specified NA characters. Perhaps something like coerce_na.

    There might also be scope here for another NA function where people can specify different factors / structure for the NA values. For example, -99, -98, might be missing values, but could have specific reasons / mechanisms for being missing, and so should be recorded differently. This might be a function called label_na.

    I'm not really sure if it is a problem that NaN values are considered missing. Perhaps it might be useful to provide some specific handlers for _na type objects, which also handle NaNs in a potentially more opinionated way.
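
For anyone who wants to treat the two differently, base R already allows NA and NaN to be separated by combining is.na() with is.nan(). A short sketch:

```r
x <- c(1, NA, NaN)

# is.na() is TRUE for both NA and NaN
is.na(x)
#> [1] FALSE  TRUE  TRUE

# is.nan() is TRUE only for NaN, so this isolates "pure" NA values
is.na(x) & !is.nan(x)
#> [1] FALSE  TRUE FALSE
```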

    opened by njtierney 7
  • another way to display info. about missing

    One simple way to display missing data in bivariate plots is to create a separate variable for where the values are missing and to plot that information. See script and an example plot here:

    https://gist.github.com/soodoku/36fecfc442342c0e01aad6742b8ee47e

    help wanted Idea 
    opened by soodoku 7
  • naniar::recode_shadow() throwing error: "`idx` must contain one integer for each level of `f`"

    I have been using recode_shadow() to label missing values which are due to questionnaire skips as 'NA_skip'. The function runs for one column without error, but when run over the second and subsequent columns it persistently throws an error:

    Error: `idx` must contain one integer for each level of `f`

    I thought the error message referred to empty factor levels, so I created dummy rows with values of NA and NA_skip, but this did not fix the problem.

    
    `
    library(tidyverse)
    library(naniar)
    
    df <- data.frame(Q1 = c("yes", "no", "no", NA), Q2 = c("a", NA, NA, NA), Q3 = c(1, NA, NA, 4)) %>% 
        mutate(Q1 = factor(Q1)) %>% 
        mutate(Q2 = factor(Q2))
    
    df_sh <- bind_shadow(df)
    
    df_sh_recode <- df_sh
    
    # Q1 is a filter question - people who answer no should skip Q2 and Q3
    df_sh_recode <- df_sh_recode %>% 
        recode_shadow(Q2 = .where(Q1 %in% "no" ~ "skip"))
    
    df_sh_recode <- df_sh_recode %>% 
        recode_shadow(Q3 = .where(Q1 %in% "no" ~ "skip"))
    #> Error: `idx` must contain one integer for each level of `f`
    # throws Error: `idx` must contain one integer for each level of `f`
    
    # there are empty factor levels ..
    df_sh_recode$Q1_NA %>% table(., useNA = "always")
    #> .
    #>     !NA      NA NA_skip    <NA> 
    #>       3       1       0       0
    
    # .. so: kludge: add dummy rows 5 and 6, to fill up factor levels in an attempt to solve the 'idx must contain one integer .." error
    df_sh_recode[5, ] <- NA
    df_sh_recode[5, 4:6] <- "NA_skip"
    df_sh_recode[6, ] <- NA
    df_sh_recode[6, 4:6] <- "NA"
    
    # and re-run the recode_shadow for Q3:
    df_sh_recode <- df_sh_recode %>% 
        recode_shadow(Q3 = .where(Q1 %in% "no" ~ "skip"))
    #> Error: `idx` must contain one integer for each level of `f`
    # still throws same error
    `
    Created on 2020-09-21 by the reprex package (v0.3.0)
    
    
    opened by benifex 6
  • Changes after revdep checks of dplyr 0.8.0 RC

    This fixes two problems that were identified as part of reverse dependency checks of dplyr 0.8.0 release candidate. https://github.com/tidyverse/dplyr/blob/revdep_dplyr_0_8_0_RC/revdep/problems.md#naniar

    • n() must be imported or prefixed like any other function. In the PR, I've changed 1:n() to dplyr::row_number() as naniar seems to prefix all dplyr functions.

    • update_shadow was only restoring the class attributes, changed so that it restores all attributes, this was causing problems when data was a grouped_df. This likely was a problem before too, but dplyr 0.8.0 is stricter about what is a grouped data frame.

    opened by romainfrancois 6
  • Refactor capture of arguments

    • Pass ... directly to other functions rather than quote and splice.

    • Use as_string() rather than expr_text() to cast symbols to strings. The latter is a multi-line deparser for arbitrary expressions. It might add backticks to deparsed symbols and allow unwanted input types like complex expressions.

    • Use ensyms() before coercing to strings. This guarantees only symbols can be passed by user. quos() allows stuff like starts_with(), but the code generally assumes symbols, not calls.

    • Remove bare_to_chr() as part of this refactoring. It was making the assumption that exprs() unwraps quosures, which was a bug in rlang. This causes a revdep failure for the upcoming rlang 0.3.0.
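
The ensyms() plus as_string() pattern described above can be sketched as follows. The function vars_to_chr() is a hypothetical stand-in for illustration and is not the refactored naniar code. It assumes the rlang package is installed.

```r
library(rlang)

# Hypothetical helper: capture bare variable names passed through `...`
# as symbols, then convert each symbol to a string.
# ensyms() raises an error if an argument is a call such as starts_with("x").
vars_to_chr <- function(...) {
  syms <- ensyms(...)
  unname(vapply(syms, as_string, character(1)))
}

vars_to_chr(Ozone, Solar.R)
#> [1] "Ozone"   "Solar.R"
```

Because ensyms() accepts only symbols (or strings), complex expressions are rejected at capture time rather than producing unexpected deparsed text later.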

    opened by lionel- 6
  • re fixing referenced JSS links

    As per the email from Achim,

    the Journal of Statistical Software (JSS, https://www.jstatsoft.org/) recently migrated to a new server and editorial system, resulting in a change of the URLs being used for publications. Hence we checked all CRAN packages using JSS URLs in the documentation or citation files etc. This includes some of your packages: naniar.
    
    In general we recommend to use DOIs instead of URLs to link to JSS publications. These use the following pattern for articles: 10.18637/jss.vXXX.iYY where XXX is the three-digit volume and YY the two-digit issue. (For code snippets a "cYY" instead of "iYY" is used.) The DOIs are also shown on the web pages of the JSS articles.
    
    For including these in a package you typically use:
    - \doi{...} markup in .Rd files
    - <doi:...> in DESCRIPTION/Description fields
    - bibentry(..., doi = ...) in CITATION files (or citEntry)
    
    We would recommend to change all JSS references in your package correspondingly (even if redirections for the URLs are still working). The corresponding files in the package are:
    naniar/inst/doc/getting-started-w-naniar.html
    
    maintenance 
    opened by njtierney 0
  • Explore options to make `nabular` data smaller

    library(naniar)
    library(lobstr)
    obj_size(riskfactors)
    #> 49,232 B
    obj_size(nabular(riskfactors))
    #> 99,992 B
    

    Created on 2022-04-06 by the reprex package (v2.0.1)

    Session info
    sessioninfo::session_info()
    #> ─ Session info  🤱🏻  🏇🏼  👲🏿   ─────────────────────────────────────────────────
    #>  hash: breast-feeding: light skin tone, horse racing: medium-light skin tone, person with skullcap: dark skin tone
    #> 
    #>  setting  value
    #>  version  R version 4.1.3 (2022-03-10)
    #>  os       macOS Big Sur 11.2.2
    #>  system   aarch64, darwin20
    #>  ui       X11
    #>  language (EN)
    #>  collate  en_AU.UTF-8
    #>  ctype    en_AU.UTF-8
    #>  tz       Australia/Melbourne
    #>  date     2022-04-06
    #>  pandoc   2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
    #> 
    #> ─ Packages ───────────────────────────────────────────────────────────────────
    #>  package     * version date (UTC) lib source
    #>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
    #>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.1)
    #>  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.1)
    #>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.1.1)
    #>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
    #>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
    #>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.1)
    #>  dplyr         1.0.8   2022-02-08 [1] CRAN (R 4.1.1)
    #>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
    #>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.1.1)
    #>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.1)
    #>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
    #>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.1)
    #>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.1)
    #>  ggplot2       3.3.5   2021-06-25 [1] CRAN (R 4.1.1)
    #>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.1)
    #>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.1)
    #>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
    #>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
    #>  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.1)
    #>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
    #>  lobstr      * 1.1.1   2019-07-02 [1] CRAN (R 4.1.0)
    #>  magrittr      2.0.2   2022-01-26 [1] CRAN (R 4.1.1)
    #>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
    #>  naniar      * 0.6.1   2021-05-14 [1] CRAN (R 4.1.1)
    #>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.1)
    #>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
    #>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
    #>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
    #>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
    #>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
    #>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
    #>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
    #>  Rcpp          1.0.8   2022-01-13 [1] CRAN (R 4.1.1)
    #>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
    #>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.1)
    #>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
    #>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
    #>  scales        1.1.1   2020-05-11 [1] CRAN (R 4.1.0)
    #>  sessioninfo   1.2.1   2021-11-02 [1] CRAN (R 4.1.1)
    #>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.1)
    #>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.1)
    #>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.1)
    #>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.1)
    #>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.1.1)
    #>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
    #>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
    #>  visdat        0.5.3   2019-02-15 [1] CRAN (R 4.1.1)
    #>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.1)
    #>  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.1)
    #>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.1)
    #> 
    #>  [1] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
    #> 
    #> ──────────────────────────────────────────────────────────────────────────────
    

    It might be possible, for example, to make nabular data have a print method that looks like it does currently, and then to somehow store the references to those columns in a more compact way? Although perhaps thinking about that, it could perhaps just be delaying computation to later...some sort of lazy/JIT printing / computation

    planning 
    opened by njtierney 0
  • Add helper function to add random missingness


    rather than needing to do something like:

    x <- 1:10
    x
    #>  [1]  1  2  3  4  5  6  7  8  9 10
    x[sample(x = length(x), size = 5)] <- NA
    x
    #>  [1] NA NA  3 NA NA  6  7  8  9 NA
    
    add_n_na <- function(x, n_na){
      x[sample(x = vctrs::vec_size(x), size = n_na)] <- NA
      x
    }
    
    x <- 1:10
    x
    #>  [1]  1  2  3  4  5  6  7  8  9 10
    add_n_na(x, 3)
    #>  [1]  1  2  3  4 NA  6  7 NA NA 10
    

    Created on 2022-04-05 by the reprex package (v2.0.1)

    Session info
    sessioninfo::session_info()
    #> ─ Session info  🇻🇺  ⏺️  🕰️   ───────────────────────────────────────────────────
    #>  hash: flag: Vanuatu, record button, mantelpiece clock
    #> 
    #>  setting  value
    #>  version  R version 4.1.3 (2022-03-10)
    #>  os       macOS Big Sur 11.2.2
    #>  system   aarch64, darwin20
    #>  ui       X11
    #>  language (EN)
    #>  collate  en_AU.UTF-8
    #>  ctype    en_AU.UTF-8
    #>  tz       Australia/Melbourne
    #>  date     2022-04-05
    #>  pandoc   2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
    #> 
    #> ─ Packages ───────────────────────────────────────────────────────────────────
    #>  package     * version date (UTC) lib source
    #>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.1)
    #>  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.1)
    #>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
    #>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.1)
    #>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
    #>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.1.1)
    #>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.1)
    #>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
    #>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.1)
    #>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.1)
    #>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
    #>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
    #>  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.1)
    #>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
    #>  magrittr      2.0.2   2022-01-26 [1] CRAN (R 4.1.1)
    #>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.1)
    #>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
    #>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
    #>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
    #>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
    #>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
    #>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
    #>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
    #>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.1)
    #>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
    #>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
    #>  sessioninfo   1.2.1   2021-11-02 [1] CRAN (R 4.1.1)
    #>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.1)
    #>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.1)
    #>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.1)
    #>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.1)
    #>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
    #>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
    #>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.1)
    #>  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.1)
    #>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.1)
    #> 
    #>  [1] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
    #> 
    #> ──────────────────────────────────────────────────────────────────────────────
    
    maintenance 
    opened by njtierney 0
  • revamp internals to preserve data type and consider renaming `shadow_long` and `shadow_gather` functions


    To be more in line with pivot_longer and pivot_wider functions.

    There is also an issue with shadow_long where the data type is not preserved. In this case, one of the variables is coerced from numeric to character.
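A minimal base R sketch of one plausible reason for this kind of coercion, assuming it comes from stacking columns of different types into a single value column when reshaping to long format. The data frame and column names below are hypothetical, and this is not naniar's internal code.

```r
# Hypothetical example data: one numeric and one character column.
df <- data.frame(
  temp = c(21.5, NA, 19.0),
  site = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# A long format stacks every column into a single value column.
# c() must settle on one common type, so the numbers become strings.
long_value <- c(df$temp, df$site)

class(df$temp)     # type of the column in the wide data
class(long_value)  # type after stacking it with `site`
```

Under that assumption, preserving types in long form would need either one value column per type or a list column.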


    planning 
    opened by njtierney 0
  • Unable to install because of `norm` package



    Naniar has norm as an import and I'm unable to install the source. Now, I can install gfortran, it's not a big deal. I wanted to write up this issue since I couldn't find anything in the norm documentation about gfortran or in this package readme. It may be worth considering changing norm from import to suggests but I'm not sure if that's possible without breaking something. Looking at the norm package, it was last published in 2013.

    install.packages("naniar")
    

    ===== Output ====

    Installing package into ‘/home/joshuapark/R/x86_64-pc-linux-gnu-library/4.1’
    (as ‘lib’ is unspecified)
    also installing the dependency ‘norm’
    
    trying URL 'https://cloud.r-project.org/src/contrib/norm_1.0-9.5.tar.gz'
    Content type 'application/x-gzip' length 18173 bytes (17 KB)
    ==================================================
    downloaded 17 KB
    
    trying URL 'https://cloud.r-project.org/src/contrib/naniar_0.6.1.tar.gz'
    Content type 'application/x-gzip' length 2781177 bytes (2.7 MB)
    ==================================================
    downloaded 2.7 MB
    
    * installing *source* package ‘norm’ ...
    ** package ‘norm’ successfully unpacked and MD5 sums checked
    ** using staged installation
    ** libs
    gfortran -fno-optimize-sibling-calls  -fpic  -g -O2 -ffile-prefix-map=/build/r-base-crNhFx/r-base-4.1.2=. -flto=auto -ffat-lto-objects -fstack-protector-strong  -c norm.f -o norm.o
    /bin/bash: line 1: gfortran: command not found
    make: *** [/usr/lib/R/etc/Makeconf:191: norm.o] Error 127
    ERROR: compilation failed for package ‘norm’
    * removing ‘/home/joshuapark/R/x86_64-pc-linux-gnu-library/4.1/norm’
    ERROR: dependency ‘norm’ is not available for package ‘naniar’
    * removing ‘/home/joshuapark/R/x86_64-pc-linux-gnu-library/4.1/naniar’
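A small sketch for checking from R whether a Fortran compiler is visible before attempting a source install. It only tests whether a `gfortran` executable is on the PATH, which is an assumption about how the build locates the compiler on this system; it does not install anything.

```r
# Sys.which() returns "" when the executable is not found on the PATH.
gfortran_path <- unname(Sys.which("gfortran"))
has_gfortran <- nzchar(gfortran_path)

if (has_gfortran) {
  message("gfortran found at: ", gfortran_path)
} else {
  message("gfortran not found; source installs of 'norm' are likely to fail.")
}
```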
    
    planning 
    opened by jpark-ehc 2
Releases(0.6.0)
  • 0.6.0(Sep 2, 2020)

    naniar 0.6.0 (2020/08/17) "Spur of the lamp post"

    • Provide a warning for replace_with_na when columns are provided that don't exist (see #160). Thank you to michael-dewar for their help with this.

    Breaking Changes

    • Drop the "nabular" and "shadow" classes (#268) used in nabular() and bind_shadow(). Doing so removes the functions as_shadow(), is_shadow(), is_nabular(), new_nabular(), and new_shadow(). These were mostly used internally and it is not expected that users would have used these functions. If you did use them, please file an issue and I can implement them again.
  • 0.5.2(Jun 29, 2020)

  • 0.5.1(May 14, 2020)

    naniar 0.5.1 (2020/04/10) "Uncle Andrew's Applewood Wardrobe"

    Minor Changes

    • Fixes warnings and errors from tibble and subsequent downstream impacts on simputation.
  • 0.5.0(Mar 3, 2020)

    naniar 0.5.0 (2020/02/20) "The End of this Story and the Beginning of all of the Others"

    Breaking Changes

    • The following functions related to calculating the proportion/percentage of missingness were made Defunct and will no longer work:
      • miss_var_prop()
      • complete_var_prop()
      • miss_var_pct()
      • complete_var_pct()
      • miss_case_prop()
      • complete_case_prop()
      • miss_case_pct()
      • complete_case_pct()

    Instead use: prop_miss_var(), prop_complete_var(), pct_miss_var(), pct_complete_var(), prop_miss_case(), prop_complete_case(), pct_miss_case(), pct_complete_case(). (see 242)

    • replace_to_na() was made defunct, please use replace_with_na() instead. (see 242)
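As a rough guide to what the renamed summaries report, here is a base R sketch that assumes the usual definitions: the proportion (or percentage) of variables, or of cases, that contain at least one missing value. The variable names are illustrative, and naniar's own functions may differ in detail.

```r
# Base R sketch, not naniar's implementation.
na_flags <- is.na(airquality)

# Proportion / percentage of variables that contain at least one NA
prop_var_with_na <- mean(colSums(na_flags) > 0)
pct_var_with_na  <- 100 * prop_var_with_na

# Proportion / percentage of cases (rows) that contain at least one NA
prop_case_with_na <- mean(rowSums(na_flags) > 0)
pct_case_with_na  <- 100 * prop_case_with_na
```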

    Minor changes

    • miss_var_cumsum and miss_case_cumsum are now exported
    • use map_dfc instead of map_df
    • Fix various extra warnings and improve test coverage

    Bug Fixes

    • Address bug where the number of missings in a row is not calculated properly - see 238 and 232. The solution involved using rowSums(is.na(x)), which was 3 times faster.
    • Resolve bug in gg_miss_fct() where warning is given for non explicit NA values - see 241.
    • skip vdiffr tests on github actions
    • use tibble() not data_frame()
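The rowSums(is.na(x)) approach mentioned in the bug fix above can be tried directly in base R. This is a sketch of the idea on the airquality data, not the package's internal code.

```r
# Count the missing values in each row (case) of a data frame.
n_miss_per_case <- rowSums(is.na(airquality))

# Inspect the first few cases and tabulate how many cases have 0, 1, 2, ... NAs.
head(n_miss_per_case)
table(n_miss_per_case)
```

is.na() returns a logical matrix for a data frame, so the row sums count the TRUE values in a single vectorised step rather than looping over rows.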
  • 0.4.2(Feb 19, 2019)

    Improvements

    • The geom_miss_point() ggplot2 layer can now be converted into an interactive web-based version by the ggplotly() function in the plotly package. In order for this to work, naniar now exports the geom2trace.GeomMissPoint() function (users should never need to call geom2trace.GeomMissPoint() directly -- ggplotly() calls it for you).
    • adds WORDLIST for spelling thanks to usethis::use_spell_check()
    • fix documentation @seealso bug (#228) (@sfirke)

    Dependency fixes

    • Thanks to a PR (#223) from @romainfrancois:

      • This fixes two problems that were identified as part of reverse dependency checks of dplyr 0.8.0 release candidate. https://github.com/tidyverse/dplyr/blob/revdep_dplyr_0_8_0_RC/revdep/problems.md#naniar

      • n() must be imported or prefixed like any other function. In the PR, I've changed 1:n() to dplyr::row_number() as naniar seems to prefix all dplyr functions.

      • update_shadow was only restoring the class attributes; it now restores all attributes. This was causing problems when the data was a grouped_df, and likely was a problem before too, but dplyr 0.8.0 is stricter about what is a grouped data frame.

  • 0.4.1(Dec 3, 2018)

  • 0.4.0(Sep 11, 2018)

    New Feature

    • Add custom label support for missings and not missings with the functions add_label_missings(), add_label_shadow(), and add_any_miss(). So you can now do `add_label_missings(data, missing = "custom_missing_label", complete = "custom_complete_label")`.

    • impute_median() and scoped variants

    • any_shade() returns a logical TRUE or FALSE depending on if there are any shade values

    • nabular() an alias for bind_shadow() to tie the nabular term into the work.

    • is_nabular() checks if input is nabular.

    • geom_miss_point() now gains the arguments from shadow_shift()/impute_below() for altering the amount of jitter and proportion below (prop_below).

    • Added two new vignettes, "Exploring Imputed Values", and "Special Missing Values"

    • miss_var_summary and miss_case_summary now no longer provide the cumulative sum of missingness in the summaries - this summary can be added back to the data with the option add_cumsum = TRUE. #186

    • Added gg_miss_upset to replace workflow of:
      data %>% 
        as_shadow_upset() %>%
        UpSetR::upset()
      

    Major Change

    • recode_shadow now works! This function allows you to recode your missing values into special missing values. These special missing values are stored in the shadow part of the dataframe, which ends in _NA.
    • implemented shade where appropriate throughout naniar, and also added verifiers, is_shade, are_shade, which_are_shade, and removed which_are_shadow.
    • as_shadow and bind_shadow now return data of class shadow. This will feed into recode_shadow methods for flexibly adding new types of missing data.
    • Note that in the future shadow might be changed to nabble or something similar.

    Minor feature

    • Functions add_label_shadow() and add_label_missings() gain arguments so you can only label according to the missingness / shadowy-ness of given variables.
    • new function which_are_shadow(), to tell you which values are shadows.
    • new function long_shadow(), which converts data in shadow/nabular form into a long format suitable for plotting. Related to #165
    • Added tests for miss_scan_count

    Minor Changes

    • gg_miss_upset gets a better default presentation by ordering by the largest intersections, and also an improved error message when data with only 1 or no variables have missing values.
    • shadow_shift gains a more informative error message when it doesn't know the class.
    • Changed common_na_string to include escape characters for "?", "*", "." so that if they are used in replacement or searching functions they don't return the wildcard results from the characters "?", "*", and ".".
    • miss_case_table and miss_var_table now have final column names pct_vars and pct_cases instead of pct_miss - fixes #178.
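To illustrate why escaping matters for the common NA strings, here is a base R sketch of wildcard versus literal matching. The candidates vector is made up for the example.

```r
candidates <- c("?", ".", "N/A", "42")

# Unescaped, "." is a regex wildcard and matches every non-empty string.
unescaped_hits <- grepl(".", candidates)

# Escaped, "\\." matches only a literal full stop.
escaped_hits <- grepl("\\.", candidates)

# fixed = TRUE is another way to ask for a literal match.
fixed_hits <- grepl(".", candidates, fixed = TRUE)
```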

    Breaking Changes

    • Deprecated old names of the scalar missingness summaries, in favour of a more consistent syntax #171. The old and new names are:

    | old_names          | new_names          |
    |:-------------------|:-------------------|
    | miss_case_pct      | pct_miss_case      |
    | miss_case_prop     | prop_miss_case     |
    | miss_var_pct       | pct_miss_var       |
    | miss_var_prop      | prop_miss_var      |
    | complete_case_pct  | pct_complete_case  |
    | complete_case_prop | prop_complete_case |
    | complete_var_pct   | pct_complete_var   |
    | complete_var_prop  | prop_complete_var  |

    These old names will be made defunct in 0.5.0, and removed completely in 0.6.0.

    • impute_below has changed to be an alias of shadow_shift - that is it operates on a single vector. impute_below_all operates on all columns in a dataframe (as specified in #159)
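A simplified sketch of the single-vector idea behind shadow_shift / impute_below: missing values are replaced with a value a fixed proportion of the range below the observed minimum, so they can be plotted. The helper name shift_below and its prop_below default are assumptions for illustration only. Unlike the real functions, this sketch adds no jitter and does not handle vectors with no variance.

```r
# Hypothetical helper, not part of naniar: place NAs a fixed proportion
# of the observed range below the observed minimum (no jitter).
shift_below <- function(x, prop_below = 0.1) {
  observed <- x[!is.na(x)]
  x_range <- max(observed) - min(observed)
  x[is.na(x)] <- min(observed) - prop_below * x_range
  x
}

ozone_shifted <- shift_below(airquality$Ozone)

# The imputed values all sit below the smallest observed Ozone value.
range(ozone_shifted[is.na(airquality$Ozone)])
```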

    Bug fix

    • Ensured that miss_scan_count actually return'd something.
    • gg_miss_var(airquality) now prints the ggplot - a typo meant that this did not print the plot
  • V0.3.1(Jun 8, 2018)

  • V0.3.0(Jun 6, 2018)

    New Features

    • Added all_miss() / all_na() equivalent to all(is.na(x))

    • Added any_complete() equivalent to any(complete.cases(x))

    • Added any_miss() equivalent to anyNA(x)

    • Added common_na_numbers and finalised common_na_strings - to provide a list of commonly used NA values #168

    • Added miss_var_which, to list the variable names with missings

    • Added as_shadow_upset which gets the data into a format suitable for plotting as an UpSetR plot:

      airquality %>%
        as_shadow_upset() %>%
        UpSetR::upset()
      
    • Added some imputation functions to assist with exploring missingness structure and visualisation:

      • impute_below performs as for shadow_shift, but on all columns. This means that it imputes missing values 10% below the range of the data (powered by shadow_shift), to facilitate graphical exploration of the data. Closes #145. There are also scoped variants that work for specific named columns: impute_below_at, and for columns that satisfy some predicate function: impute_below_if.
      • impute_mean, imputes the mean value, and scoped variants impute_mean_at, and impute_mean_if.
    • impute_below and shadow_shift gain arguments prop_below and jitter to control the degree of shift, and also the extent of jitter.

    • Added complete_{case/var}_{pct/prop}, which complement miss_{var/case}_{pct/prop} #150

    • Added unbind_shadow and unbind_data as helpers to remove shadow columns from data, and data from shadows, respectively.

    • Added is_shadow and are_shadow to determine if something contains a shadow column. Similar to rlang::is_na and rlang::are_na, is_shadow returns a logical vector of length 1, and are_shadow returns a logical vector with one element per name in a data.frame. This might be revisited at a later point (see any_shade in add_label_shadow).

    • Aesthetics now map as expected in geom_miss_point(). This means you can write things like geom_miss_point(aes(colour = Month)) and it works appropriately. Fixed by Luke Smith in Pull request #144, fixing #137.
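The base R expressions that the notes above give as equivalents for all_miss() / all_na() and any_miss() can be checked on small vectors. This sketch uses base R only and made-up inputs; it does not call naniar.

```r
x_all_na  <- c(NA, NA)
x_some_na <- c(1, NA, 3)
x_no_na   <- c(1, 2, 3)

# all(is.na(x)): TRUE only when every value is missing
all_missing <- c(all(is.na(x_all_na)), all(is.na(x_some_na)), all(is.na(x_no_na)))

# anyNA(x): TRUE when at least one value is missing
any_missing <- c(anyNA(x_all_na), anyNA(x_some_na), anyNA(x_no_na))

# complete.cases() flags the non-missing elements of a vector
complete_flags <- complete.cases(x_some_na)
```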

    Minor Changes

    • miss_var_summary and miss_case_summary now use order = TRUE by default, so cases and variables with the most missings are presented in descending order. Fixes #163

    • Changes for Visualisation:

      • Changed the default colours used in gg_miss_case and gg_miss_var to lorikeet purple (from ochRe package: https://github.com/ropenscilabs/ochRe)
      • gg_miss_case
        • The y axis label is now ...
        • Default presentation is with order_cases = TRUE.
        • Gains a show_pct option to be consistent with gg_miss_var #153
      • gg_miss_which is rotated 90 degrees so it is easier to read variable names
      • gg_miss_fct uses a minimal theme and tilts the axis labels #118.
    • imported is_na and are_na from rlang.

    • Added common_na_strings, a list of common NA values #168.

    • Added some detail on alternative methods for replacing with NA in the vignette "replacing values with NA".

  • v0.1.0(Aug 9, 2017)

    "The Founding of naniar": the first version on CRAN! The name is taken from Chapter 9 of The Magician's Nephew. Below is the updated NEWS file.

    naniar 0.1.0 (2017/08/09) "The Founding of naniar"

    =========================

    • This is the first release of naniar onto CRAN. Updates to naniar will happen reasonably regularly after this, approximately every 1-2 months.

    naniar 0.0.9.9995 (2017/08/07)

    =========================

    Name change

    • After careful consideration, I have changed back to naniar

    Major Change

    • three new functions : miss_case_cumsum / miss_var_cumsum / replace_to_na
    • two new visualisations : gg_var_cumsum & gg_case_cumsum

    New Feature

    • group_by is now respected by the following functions:
      • miss_case_cumsum()
      • miss_case_summary()
      • miss_case_table()
      • miss_prop_summary()
      • miss_var_cumsum()
      • miss_var_run()
      • miss_var_span()
      • miss_var_summary()
      • miss_var_table()

    Minor changes

    • Reviewed documentation for all functions and improved wording, grammar, and style.
    • Converted roxygen to roxygen markdown
    • updated vignettes and readme
    • added a new vignette "naniar-visualisation", to give a quick overview of the visualisations provided with naniar.
    • changed label_missing* to label_miss to be more consistent with the rest of naniar
    • Add pct and prop helpers (#78)
    • removed miss_df_pct - this was literally the same as pct_miss or prop_miss.
    • break larger files into smaller, more manageable files (#83)
    • gg_miss_var gets a show_pct argument to show the percentage of missing values (Thanks Jennifer for the helpful feedback! :))

    Minor changes

    • miss_var_summary & miss_case_summary now have consistent output (one was ordered by n_missing, not the other).
    • prevent error in miss_case_pct
    • enquo_x is now x (as advised by Hadley)
    • Now has ByteCompile to TRUE
    • add Colin to auth

    narnia 0.0.9.9400 (2017/07/24)

    =========================

    new features

    • replace_to_na is a complement to tidyr::replace_na and replaces a specified value from a variable to NA.
    • gg_miss_fct returns a heatmap of the number of missings per variable for each level of a factor. This feature was very kindly contributed by Colin Fay.
    • gg_miss_ functions now return a ggplot object, which behaves as such. gg_miss_ basic themes can be overridden with ggplot functions. This fix was very kindly contributed by Colin Fay.
    • removed defunct functions as per #63
    • made add_* functions handle bare unquoted names where appropriate as per #61
    • added tests for the add_* family
    • got the svgs generated from vdiffr, thanks @karawoo!
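For the replace_to_na idea listed above (turning a specified value into NA, the reverse of tidyr::replace_na), a base R sketch of the operation on a single vector might look like this. The sentinel value -99 is an assumption for the example, and this is not the package's implementation.

```r
# Hypothetical readings where -99 was used as a missing-value code.
readings <- c(10, -99, 12, NA, -99)

# %in% never returns NA, so existing NAs are left alone by the subsetting.
readings_clean <- readings
readings_clean[readings_clean %in% -99] <- NA

sum(is.na(readings_clean))  # count of missing values after recoding
```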

    breaking changes

    • changed geom_missing_point() to geom_miss_point(), to keep consistent with the rest of the functions in naniar.

    narnia 0.0.8.9100 (2017/06/23)

    =========================

    new features

    • updated datasets brfss and tao as per #59

    narnia 0.0.7.9992 (2017/06/22)

    =========================

    new features

    • add_label_missings()

    • add_label_shadow()

    • cast_shadow()

    • cast_shadow_shift()

    • cast_shadow_shift_label()

    • added github issue / contribution / pull request guides

    • ts generic functions are now miss_var_span, miss_var_run, and gg_miss_span, and work on data.frames, as opposed to just ts objects.

    • add_shadow_shift() adds a column of shadow_shifted values to the current dataframe, adding "_shift" as a suffix

    • cast_shadow() - acts like bind_shadow() but allows for specifying which columns to add

    • shadow_shift now has a method for factors - powered by forcats::fct_explicit_na() #3

    bug fixes

    • shadow_shift.numeric works when there is no variance (#37)

    name changes

    • changed is_na function to label_na
    • renamed most files to have tidy-miss-[topic]
    • gg_missing_* is changed to gg_miss_* to fit with other syntax

    Removed functions

    • Removed old functions miss_cat, shadow_df and shadow_cat, as they are no longer needed, and have been superseded by label_missing_2d, as_shadow, and is_na.

    minor changes

    • drastically reduced the size of the pedestrian dataset, consider 4 sensor locations, just for 2016.

    New features

    • New dataset, pedestrian - contains hourly counts of pedestrians
    • First pass at time series missing data summaries and plots:
      • miss_ts_run(): return the number of missings / complete in a single run
      • miss_ts_summary(): return the number of missings in a given time period
      • gg_miss_ts(): plot the number of missings in a given time period
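The notion of a "run" of missing or complete values can be sketched with base R's rle(). This is an illustration of the concept under a made-up series, not the code behind miss_ts_run().

```r
# Hypothetical series with two stretches of missing values.
series <- c(1, NA, NA, 4, 5, NA, 7)

# Run-length encode the missingness indicator.
runs <- rle(is.na(series))

# One row per run: how long it is and whether it is missing or complete.
run_summary <- data.frame(
  run_length = runs$lengths,
  is_na = ifelse(runs$values, "missing", "complete")
)
run_summary
```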

    Name changes

    • renamed package from naniar to narnia - I had to explain the spelling a few times when I was introducing the package and I realised that I should change the name. Fortunately it isn't on CRAN yet.

    naniar 0.0.6.9100 (2017/03/21)

    =========================

    • Added prop_miss and the complement prop_complete. Where n_miss returns the number of missing values, prop_miss returns the proportion of missing values. Likewise, prop_complete returns the proportion of complete values.
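In base R terms, the counts and proportions described above correspond roughly to the following. This is a sketch on a made-up vector; the variable names are illustrative and the naniar functions themselves are not called.

```r
values <- c(1, NA, 3, NA, 5)

n_missing     <- sum(is.na(values))    # number of missing values
n_complete    <- sum(!is.na(values))   # number of complete values
prop_missing  <- mean(is.na(values))   # proportion missing
prop_complete <- mean(!is.na(values))  # proportion complete
```

The two proportions always sum to 1, which makes either one a quick sanity check on the other.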

    Defunct functions

    • As stated in 0.0.5.9000, to address Issue #38, I am moving towards the format miss_type_value/fun, because it makes more sense to me when tabbing through functions.

    The left hand side functions have been made defunct in favour of the right hand side:

      • percent_missing_case() --> miss_case_pct()
      • percent_missing_var() --> miss_var_pct()
      • percent_missing_df() --> miss_df_pct()
      • summary_missing_case() --> miss_case_summary()
      • summary_missing_var() --> miss_var_summary()
      • table_missing_case() --> miss_case_table()
      • table_missing_var() --> miss_var_table()

    naniar 0.0.5.9000 (2016/01/08)

    =========================

    Deprecated functions

    • To address Issue #38, I am moving towards the format miss_type_value/fun, because it makes more sense to me when tabbing through functions.
    • miss_* = I want to explore missing values
    • miss_case_* = I want to explore missing cases
    • miss_case_pct = I want to find the percentage of cases containing a missing value
    • miss_case_summary = I want to find the number / percentage of missings in each case
    • miss_case_table = I want a tabulation of the number / percentage of cases missing

    This is more consistent and easier to reason with.

    Thus, I have renamed the following functions:

      • percent_missing_case() --> miss_case_pct()
      • percent_missing_var() --> miss_var_pct()
      • percent_missing_df() --> miss_df_pct()
      • summary_missing_case() --> miss_case_summary()
      • summary_missing_var() --> miss_var_summary()
      • table_missing_case() --> miss_case_table()
      • table_missing_var() --> miss_var_table()

    These will be made defunct in the next release, 0.0.6.9000 ("The Wood Between Worlds").

    naniar 0.0.4.9000 (2016/12/31)

    =========================

    New features

    • n_complete is a complement to n_miss, and counts the number of complete values in a vector, matrix, or dataframe.

    Bug fixes

    • shadow_shift now handles cases where there is only 1 complete value in a vector.

    Other changes

    • added much more comprehensive testing with testthat.

    naniar 0.0.3.9901 (2016/12/18)

    =========================

    After a burst of effort on this package I have done some refactoring and thought hard about where this package is going to go. This meant that I had to make the decision to rename the package from ggmissing to naniar. The name may strike you as strange but it reflects the fact that there are many changes happening, and that we will be working on creating a nice utopia (like Narnia by CS Lewis) that helps us make it easier to work with missing data

    New Features (under development)

    • add_n_miss and add_prop_miss are helpers that add columns to a dataframe containing the number and proportion of missing values. An example has been provided to use decision trees to explore missing data structure as in Tierney et al

    • geom_miss_point() now supports transparency, thanks to @seasmith (Luke Smith)

    • more shadows. These are mainly around bind_shadow and gather_shadow, which are helper functions to assist with creating

    Bug fixes

    • geom_missing_point() broke after the new release of ggplot2 2.2.0, but this is now fixed by ensuring that it inherits from GeomPoint, rather than just a new Geom. Thanks to Mitchell O'hara-Wild for his help with this.

    • missing data summaries table_missing_var and table_missing_case also now return more sensible numbers and variable names. It is possible these function names will change in the future, as these are kind of verbose.

    • semantic versioning was incorrectly entered in the DESCRIPTION file as 0.2.9000, so I changed it to 0.0.2.9000, and then to 0.0.3.9000 now to indicate the new changes, hopefully this won't come back to bite me later. I think I accidentally did this with visdat at some point as well. Live and learn.

    Other changes

    • gathered related functions into single R files rather than leaving them in their own.

    • correctly imported the %>% operator from magrittr, and removed a lot of chaff around @importFrom - really don't need to use @importFrom that often.

    ggmissing 0.0.2.9000 (2016/07/29)

    =========================

    New Feature (under development)

    • geom_missing_point() now works in a way that we expect! Thanks to Miles McBain for working out how to get this to work.

    ggmissing 0.0.1.9000 (2016/07/29)

    =========================

    New Feature (under development)

    • tidy summaries for missing data:
      • percent_missing_df returns the percentage of missing data for a data.frame
      • percent_missing_var the percentage of variables that contain missing values
      • percent_missing_case the percentage of cases that contain missing values.
      • table_missing_var table of missing information for variables
      • table_missing_case table of missing information for cases
      • summary_missing_var summary of missing information for variables (counts, percentages)
      • summary_missing_case summary of missing information for cases (counts, percentages)
    • gg_missing_col: plot the missingness in each variable
    • gg_missing_row: plot the missingness in each case
    • gg_missing_which: plot which columns contain missing data.
  • v0.0.4.9000(Dec 31, 2016)

    naniar 0.0.4.9000 (2016/12/31)

    New features

    • n_complete is a complement to n_miss, and counts the number of complete values in a vector, matrix, or dataframe.

    Bug fixes

    • shadow_shift now handles cases where there is only 1 complete value in a vector.

    Other changes

    • added much more comprehensive testing with testthat.

    naniar 0.0.3.9901 (2016/12/18)

    New features

    • add_n_miss and add_prop_miss are helpers that add columns to a dataframe containing the number and proportion of missing values. An example has been provided to use decision trees to explore missing data structure as in Tierney et al
    • geom_miss_point() now supports transparency, thanks to @seasmith (Luke Smith)

    naniar 0.0.3.9000 (2016/12/18)

    After a burst of effort on this package I have done some refactoring and thought hard about where this package is going to go. This meant that I had to make the decision to rename the package from ggmissing to naniar. The name may strike you as strange but it reflects the fact that there are many changes happening, and that we will be working on creating a nice utopia (like Narnia by CS Lewis) that helps us make it easier to work with missing data

    New Features (under development)

    • more shadows. These are mainly around bind_shadow and gather_shadow, which are helper functions to assist with creating

    Bug fixes

    • geom_missing_point() broke after the new release of ggplot2 2.2.0, but this is now fixed by ensuring that it inherits from GeomPoint, rather than just a new Geom. Thanks to Mitchell O'hara-Wild for his help with this.
    • missing data summaries table_missing_var and table_missing_case also now return more sensible numbers and variable names. It is possible these function names will change in the future, as these are kind of verbose.
    • semantic versioning was incorrectly entered in the DESCRIPTION file as 0.2.9000, so I changed it to 0.0.2.9000, and then to 0.0.3.9000 now to indicate the new changes, hopefully this won't come back to bite me later. I think I accidentally did this with visdat at some point as well. Live and learn.

    Other changes

    • gathered related functions into single R files rather than leaving them in their own.
    • correctly imported the %>% operator from magrittr, and removed a lot of chaff around @importFrom - really don't need to use @importFrom that often.

    ggmissing 0.0.2.9000 (2016/07/29)

    New Feature (under development)

    • geom_missing_point() now works in a way that we expect! Thanks to Miles McBain for working out how to get this to work.

    ggmissing 0.0.1.9000 (2016/07/29)

    New Feature (under development)

    • tidy summaries for missing data:
      • percent_missing_df returns the percentage of missing data for a data.frame
      • percent_missing_var the percentage of variables that contain missing values
      • percent_missing_case the percentage of cases that contain missing values.
      • table_missing_var table of missing information for variables
      • table_missing_case table of missing information for cases
      • summary_missing_var summary of missing information for variables (counts, percentages)
      • summary_missing_case summary of missing information for cases (counts, percentages)
    • gg_missing_col: plot the missingness in each variable
    • gg_missing_row: plot the missingness in each case
    • gg_missing_which: plot which columns contain missing data.
Owner
Nicholas Tierney
|| Research Software Engineer | Rockclimber | Hiker | Coffee Geek | He/Him | orcid.org/0000-0003-1460-8722 ||

Konsta Vesterinen 670 Jan 09, 2023
This is a place where I'm playing around with pandas to analyze data in a csv/excel file.

pandas-csv-excel-analysis This is a place where I'm playing around with pandas to analyze data in a csv/excel file. 0-start A very simple cheat sheet

Chuqin 3 Oct 05, 2022
plotly scatterplots which show molecule images on hover!

molplotly Plotly scatterplots which show molecule images on hovering over the datapoints! Required packages: pandas rdkit jupyter_dash ➡️ See example.

150 Dec 28, 2022
Visualization Library

CamViz Overview // Installation // Demos // License Overview CamViz is a visualization library developed by the TRI-ML team with the goal of providing

Toyota Research Institute - Machine Learning 67 Nov 24, 2022
Focus on Algorithm Design, Not on Data Wrangling

The dataTap Python library is the primary interface for using dataTap's rich data management tools. Create datasets, stream annotations, and analyze model performance all with one library.

Zensors 37 Nov 25, 2022
Python histogram library - histograms as updateable, fully semantic objects with visualization tools. [P]ython [HYST]ograms.

physt P(i/y)thon h(i/y)stograms. Inspired (and based on) numpy.histogram, but designed for humans(TM) on steroids(TM). The goal is to unify different

Jan Pipek 120 Dec 08, 2022
Runtime analysis of code with plotting

Runtime analysis of code with plotting A quick comparison among Python, Cython, and the C languages A Programming Assignment regarding the Programming

Cena Ashoori 2 Dec 24, 2021
Project coded in Python using Pandas to look at changes in chase% for batters facing a pitcher first time through the order vs. thrid time

Project coded in Python using Pandas to look at changes in chase% for batters facing a pitcher first time through the order vs. thrid time

Jason Kraynak 1 Jan 07, 2022
These data visualizations were created as homework for my CS40 class. I hope you enjoy!

Data Visualizations These data visualizations were created as homework for my CS40 class. I hope you enjoy! Nobel Laureates by their Country of Birth

9 Sep 02, 2022