Title: Getting NJ assessment data into R: part 3 in a series.
Date: 2015-04-16
Category: education
Tags: NJ, assessment, NJASK, HSPA, data_management, tutorial
Slug: reading-nj-assess-data-3
Author: Andrew Martin

```{r pelican_conf, echo=FALSE}
#SET THIS TO TRUE WHEN READY TO PUBLISH
ready_to_ship = TRUE

library(knitr)

hook_plot <- knit_hooks$get('plot')

knit_hooks$set(plot=function(x, options) {
  if (!is.null(options$pelican.publish) && options$pelican.publish) {
    x <- paste0("{filename}", x)
  }
  hook_plot(x, options)
})

opts_chunk$set(dev='Cairo_svg')
opts_chunk$set(pelican.publish=ready_to_ship)
```

In my [last post]({filename}/05_njask-data-2.Rmd), I talked about how to programmatically process and clean up NJASK data. In this post, we'll extend the NJASK functions to the High School Proficiency Assessment (HSPA) and to the old Grade Eight Proficiency Assessment (GEPA). With functions that can access each of those data sources, we'll be ready to write a general wrapper that simplifies access to relevant state assessment data.

# HSPA

Much like the NJASK data in posts [1]({filename}/04_njask-data-1.Rmd) and [2]({filename}/05_njask-data-2.Rmd), we're going to read a fixed-width file from the state website, use a layout file to name the variables, and do some post-processing. I also [wrote up]({filename}pages/06a_hspa-layout.Rmd) how to process the HSPA metadata, if data processing is your thing. Load in those processed files:

```{r libraries, message=FALSE, warning=FALSE}
library(readr)
library(dplyr)
library(magrittr)
```

```{r hspa1}
load(file = 'datasets/hspa_layout.rda')
load(file = 'datasets/hspa2010_layout.rda')

head(layout_hspa)
```

Use the layout file to process an example HSPA data file:

```{r hspa2}
hspa_url <- 'http://www.state.nj.us/education/schools/achievement/14/hspa/state_summary.txt'

hspa_ex <- readr::read_fwf(
  file = hspa_url,
  col_positions = readr::fwf_positions(
    start = layout_hspa$field_start_position,
    end = layout_hspa$field_end_position,
    col_names = layout_hspa$final_name
  ),
  na = "*"
)

hspa_ex %>%
  as.data.frame() %>%
  select(CDS_Code:TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean) %>%
  head()
```

That gets us to a similar state to where we were with the NJASK data: all the columns are identified, but the file still needs post-processing, especially the percentage columns, which have 'One implied decimal.'

We can take the function we wrote to process NJASK data frames and generalize it, so that it can handle both NJASK and HSPA data.

```{r generalize_processing}
process_nj_assess <- function(df, layout) {
  #build a mask of the columns that have an implied decimal
  mask <- layout$comments == 'One implied decimal'

  #keep the names so we can put the columns back in the same order
  all_names <- names(df)

  #make sure df is a base data frame (not a dplyr data frame) so that normal subsetting works
  df <- as.data.frame(df)

  #get the name of the last column and kill \n characters
  last_col <- names(df)[ncol(df)]
  df[, last_col] <- gsub('\n', '', df[, last_col], fixed = TRUE)

  #put the columns we don't need to touch aside
  ignore <- df[, !mask]

  implied_decimal_fix <- function(x) {
    #strip out anything that's not a number
    x <- as.numeric(gsub("[^\\d]+", "", x, perl = TRUE))
    x / 10
  }

  #process the columns that have an implied decimal
  processed <- df[, mask] %>%
    dplyr::mutate_each(
      dplyr::funs(implied_decimal_fix)
    )

  #put back together
  final <- cbind(ignore, processed)

  #reorder and return
  final %>%
    select(
      one_of(all_names)
    )
}

process_nj_assess(hspa_ex, layout_hspa) %>%
  select(CDS_Code:TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean) %>%
  head()
```

Yep, that totally works.
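
Before moving on, it's worth seeing the 'one implied decimal' convention on its own. Below is a standalone copy of the helper that lives inside `process_nj_assess`, run on a few made-up raw values (these are not pulled from a real state file): a stored value of `543` really means 54.3.

```{r implied_decimal_example}
#standalone copy of the helper from process_nj_assess, for illustration only
implied_decimal_fix <- function(x) {
  #strip out anything that's not a digit, then shift the decimal one place
  x <- as.numeric(gsub("[^\\d]+", "", x, perl = TRUE))
  x / 10
}

#made-up example values, not taken from an actual state file
implied_decimal_fix(c("543", "879", "1000"))
```

That returns 54.3, 87.9, and 100, which is exactly the shift the layout file's 'One implied decimal' comment describes.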
Following from the NJASK example, we'll write a function to simplify fetching the HSPA data, and a final wrapper around the fetch/process steps.

```{r fetch_hspa}
get_raw_hspa <- function(year, layout=layout_hspa) {
  require(readr)

  #url paths changed in 2012
  years <- list(
    "2014"="14", "2013"="13", "2012"="2013", "2011"="2012", "2010"="2011",
    "2009"="2010", "2008"="2009", "2007"="2008", "2006"="2007", "2005"="2006",
    "2004"="2005"
  )
  parsed_year <- years[[as.character(year)]]

  #filenames are screwy
  parsed_filename <- if (year > 2005) {
    "state_summary.txt"
  } else if (year == 2005) {
    "2005hspa_state_summary.txt"
  } else if (year == 2004) {
    "hspa04state_summary.txt"
  }

  #build url
  target_url <- paste0(
    "http://www.state.nj.us/education/schools/achievement/",
    parsed_year, "/hspa/", parsed_filename
  )

  #read_fwf
  df <- readr::read_fwf(
    file = target_url,
    col_positions = readr::fwf_positions(
      start = layout$field_start_position,
      end = layout$field_end_position,
      col_names = layout$final_name
    ),
    na = "*"
  )

  #return df
  return(df)
}

#final wrapper
fetch_hspa <- function(year) {
  if (year >= 2011) {
    hspa_df <- get_raw_hspa(year) %>%
      process_nj_assess(layout=layout_hspa)
  } else if (year >= 2004) {
    hspa_df <- get_raw_hspa(year, layout=layout_hspa2010) %>%
      process_nj_assess(layout=layout_hspa2010)
  }

  return(hspa_df)
}

fetch_hspa(2010) %>%
  select(CDS_Code:TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean) %>%
  head()
```

Nice! NJASK and HSPA down, GEPA data to go.

# GEPA

Load in the processed GEPA layout file and the old NJASK layout file.

```{r gepa1}
load(file = 'datasets/gepa_layout.rda')
load(file = 'datasets/njask05_layout.rda')

head(layout_gepa)
```

A function to get GEPA data:

```{r gepa2}
get_raw_gepa <- function(year, layout=layout_gepa) {
  require(readr)

  #gepa urls use the year after the test was given
  years <- list(
    "2007"="2008", "2006"="2007", "2005"="2006", "2004"="2005"
  )
  parsed_year <- years[[as.character(year)]]

  #filenames are screwy here too
  filename <- list(
    "2007"="state_summary.txt", "2006"="state_summary.txt",
    "2005"="2005njgepa_state_summary.txt", "2004"="gepa04state_summary.txt"
  )
  parsed_filename <- filename[[as.character(year)]]

  #build url
  target_url <- paste0(
    "http://www.state.nj.us/education/schools/achievement/",
    parsed_year, "/gepa/", parsed_filename
  )

  #read_fwf
  df <- readr::read_fwf(
    file = target_url,
    col_positions = readr::fwf_positions(
      start = layout$field_start_position,
      end = layout$field_end_position,
      col_names = layout$final_name
    ),
    na = "*"
  )

  #return df
  return(df)
}

gepa_ex <- get_raw_gepa(2007)

gepa_ex %>%
  as.data.frame() %>%
  select(CDS_Code:TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean) %>%
  head()
```

Can we process the GEPA df using our existing function?

```{r gepa3}
process_nj_assess(gepa_ex, layout_gepa) %>%
  select(CDS_Code:TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean) %>%
  head()
```

Yes, totally. Final step: wrap the fetch and process steps into one function:

```{r gepa_wrapper}
#final wrapper
fetch_gepa <- function(year) {
  get_raw_gepa(year) %>%
    process_nj_assess(layout=layout_gepa)
}

fetch_gepa(2007) %>%
  as.data.frame() %>%
  select(CDS_Code:TOTAL_POPULATION_LANGUAGE_ARTS_Scale_Score_Mean) %>%
  head()
```

In the [next post]({filename}/07_njask-data-4.Rmd) in this series, we'll take these individual NJASK, HSPA, and GEPA functions and write one wrapper to rule them all, allowing data to be easily fetched for any year/grade.
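
Just to make that destination concrete, here's a rough sketch of what such a wrapper might look like. It assumes a `fetch_njask(year, grade)` function along the lines of the earlier posts, and the year/grade cutoffs shown are illustrative only; the real version gets worked out in the next post, so this chunk isn't evaluated here.

```{r fetch_nj_sketch, eval=FALSE}
#hypothetical dispatcher - a sketch of where this series is headed
fetch_nj_assess <- function(year, grade) {
  if (grade == 11) {
    #high school students took the HSPA
    fetch_hspa(year)
  } else if (grade == 8 && year <= 2007) {
    #through 2007, 8th graders took the GEPA
    fetch_gepa(year)
  } else {
    #everyone else took the NJASK (assumes fetch_njask from the earlier posts)
    fetch_njask(year, grade)
  }
}

fetch_nj_assess(year = 2007, grade = 8) %>% head()
```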