Title: Getting NJ assessment data into R: part 2 in a series
Date: 2015-04-15
Category: education
Tags: NJ, assessment, NJASK, HSPA, data_management, tutorial
Slug: reading-nj-assess-data-2
Author: Andrew Martin

```{r, echo=FALSE}
#SET THIS TO TRUE WHEN READY TO PUBLISH
ready_to_ship = TRUE

library(knitr)
hook_plot <- knit_hooks$get('plot')

knit_hooks$set(plot=function(x, options) {
    if (!is.null(options$pelican.publish) && options$pelican.publish) {
        x <- paste0("{filename}", x)
    }
    hook_plot(x, options)
})
opts_chunk$set(dev='Cairo_svg')
opts_chunk$set(pelican.publish=ready_to_ship)

```

In my [last post]({filename}/04_njask-data-1.Rmd), I showed how to read a fixed width file into R.  In this post, I'll refactor that code into a function that takes year/grade as a parameter.

First, read in the layout data frame.  This contains the column headers for NJASK fixed with files.<!-- PELICAN_END_SUMMARY -->

```{r load_layout}

load(file = 'datasets/njask_layout.rda')
#also read in the old NJASK layouts
load(file = 'datasets/njask05_layout.rda')
load(file = 'datasets/njask04_layout.rda')
load(file = 'datasets/njask07gr3_layout.rda')
load(file = 'datasets/njask06gr3_layout.rda')
load(file = 'datasets/njask06gr5_layout.rda')

head(layout_njask)

```

Our final call to `read_fwf` in the last post looked like this:
  
```{r final_fwf}

sample_file = "http://www.state.nj.us/education/schools/achievement/14/njask8/state_summary.txt"

njask14_gr8 <- readr::read_fwf(
  file = sample_file,
  col_positions = readr::fwf_positions(
    start = layout_njask$field_start_position,
    end = layout_njask$field_end_position,
    col_names = layout_njask$final_name
  ),
  na = "*"
)

```

A script that has a bunch of copy/paste versions of that call would probably get the job done, but we're writing for our [future selves](https://xkcd.com/1421/) here, and those url paths are easy to build.  The function should:

1) build the target url

2) use hadley's `readr` to fetch the fixed with file, using the provided field definitions.

```{r as_function}

get_raw_njask <- function(year, grade, layout=layout_njask) {
  require(readr)
    
  #url paths changed after the 2012 assessment
  years <- list(
    "2014"="14", "2013"="13", "2012"="2013", "2011"="2012", "2010"="2011", "2009"="2010", 
    "2008"="2009", "2007"="2008", "2006"="2007", "2005"="2006", "2004"="2005"
  )
  parsed_year <- years[[as.character(year)]]
  
  #2008 follows a totally unique pattern
  grade_str <- if (year==2008 & grade >=5) {
    paste0('58/g', grade)
  } else if (year %in% c(2006, 2007) & grade %in% c(5, 6, 7)) {
    '57'
  } else {
    grade
  }
  
  #filenames are also inconsistent 
  filename <- list(
    "2014"="state_summary.txt", "2013"="state_summary.txt", "2012"="state_summary.txt",
    "2011"="state_summary.txt", "2010"="state_summary.txt", "2009"="state_summary.txt",
    "2008"="state_summary.txt", "2007"=if(grade >= 5) {
        paste0('G', grade, 'state_summary.txt')
      } else {
        "state_summary.txt"
      }, 
      "2006"=if(grade >= 5) {
        paste0('G', grade, 'state_summary.txt')
      } else {
        "state_summary.txt"
      }, 
      "2005"= if(grade==3) {
        "njask005_state_summary3.txt"
      } else if (grade==4) {
        "njask2005_state_summary4.txt"
      }, 
    "2004"=paste0("njask", grade, "04state_summary.txt")   
  )
  parsed_filename <- filename[[as.character(year)]]
    
  #build url
  target_url <- paste0(
    "http://www.state.nj.us/education/schools/achievement/", parsed_year, 
    "/njask", grade_str, "/", parsed_filename
  )
      
  #read_fwf
  df <- readr::read_fwf(
    file = target_url,
    col_positions = readr::fwf_positions(
      start = layout$field_start_position,
      end = layout$field_end_position,
      col_names = layout$final_name
    ),
    na = "*"
  )
  
  #return df
  return(df)
  
}

```

Let's give it a try!

```{r test_get_raw_njask} 
library(magrittr)

ex <- get_raw_njask(2014, 6)

dplyr::sample_n(ex[, sample(c(1:551), 10)], 10) %>% as.data.frame()

```

Great - we can definitely pull down all the data files we need.  Now let's make sure that we are doing the necessary cleanup.  In particular, there are 228 columns in our layout file that have the comment 'One implied decimal.'  That really matters - it means that `255` _actually_ means 25.5.  

For each of those columns, we want to apply a function to each column.  Hadley's `dplyr` has a nice utility for doing this - `mutate_each`.

We'll also need to make sure that we only mutate the columns that are tagged with 'One implied decimal.'

```{r limit_df}
library(dplyr)

process_njask <- function(df, mask=layout_njask$comments) {
  #keep the names to put back in the same order
  all_names <- names(df)
  
  #replace any line breaks in last column
  df$Grade <- gsub('\n', '', df$Grade, fixed = TRUE)
  
  mask_boolean <- mask == 'One implied decimal'
  #put some columns aside
  ignore <- df[, !mask_boolean]
  
  #process the columns that have an implied decimal
  processed <- df[, mask_boolean] %>%
    dplyr::mutate_each(
      dplyr::funs(implied_decimal = . / 10)  
    )
  
  #put back together 
  final <- cbind(ignore, processed)
  
  #reorder and return
  final %>%
    select(
      one_of(names(df))
    )
}

ex_process <- process_njask(ex)

head(ex_process[, 1:15])

```

Now we put it all together, with a little wrapper function around the `get_raw_njask` and `process_njask` functions:

```{r wrapper}

fetch_njask <- function(year, grade) {
  if (year == 2004) {
    df <- get_raw_njask(year, grade, layout=layout_njask04)  %>% 
      process_njask(mask=layout_njask04$comments)
  }
  else if (year == 2005) {
    df <- get_raw_njask(year, grade, layout=layout_njask05)  %>% 
      process_njask(mask=layout_njask05$comments) 
  } else if (year %in% c(2007, 2008) & grade %in% c(3, 4)) {
    df <- get_raw_njask(year, grade, layout=layout_njask07gr3)  %>% 
      process_njask(mask=layout_njask07gr3$comments) 
  }
  else if (year == 2006 & grade %in% c(3, 4)) {
    df <- get_raw_njask(year, grade, layout=layout_njask06gr3)  %>% 
      process_njask(mask=layout_njask06gr3$comments)
  } else if (year == 2006 & grade >= 5) {
    #fetch
    df <- get_raw_njask(year, grade, layout=layout_njask06gr5)  
    #inexplicably 2006 data has no Grade column
    df$Grade <- grade
    df <- df %>% process_njask(mask=layout_njask06gr5$comments)    
  }
  else {
    df <- get_raw_njask(year, grade) %>% process_njask()    
  }

  return(df)
}

fetch_njask(2014, 6) %>% head() %>% select(CDS_Code:TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean)

```

There's a more data out there than just the grade 3 NJASK, though.  In my[next post]({filename}/06_njask-data-3.Rmd), we'll tackle the HSPA, and dive into some older data - the 'Grade Eight Proficiency Assessment' (GEPA) that used to be offered to students at the end of middle school, with the ultimate goal of stitching everything together into a simplified R interface for NJ assessment data.