Renaming multiple columns and a tidy dataframe transpose

Oftentimes when you get some data, especially time series from some data bases, it’s in the wrong “format”, and it needs to be reshaped. Let’s have a look!

library(tidyverse)

First, let’s look at some data. This is data from a swedish opinion poll, which measures sympathy for the green party, in row “party sympathisers”, and how the share that ranks it as their second best party. However, as you can see this is in a wide format based on rows, and needs to be transposed.

poll <- tribble(~year, ~`2011`, ~`2012`,~`2013`,~`2014`,~`2015`,~`2016`,~`2017`,~`2018`, ~`2019`,
    "Party sympathisers", 8.5, 8.7, 10.4, 8.2, 8.1, 8.3, 8.5, 8.2, 7.9,
    "Second best party", 13.1, 11.3, 10.3, 10.7, 10.2, 10.6, 10.9, 10.7, 9.8)
poll
## # A tibble: 2 x 10
##   year            `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
##   <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Party sympathi…    8.5    8.7   10.4    8.2    8.1    8.3    8.5    8.2    7.9
## 2 Second best pa…   13.1   11.3   10.3   10.7   10.2   10.6   10.9   10.7    9.8

One way is to simply use a transpose function. But this turns it into a matrix instead of a data frame.

t(poll)
##      [,1]                 [,2]               
## year "Party sympathisers" "Second best party"
## 2011 " 8.5"               "13.1"             
## 2012 " 8.7"               "11.3"             
## 2013 "10.4"               "10.3"             
## 2014 " 8.2"               "10.7"             
## 2015 " 8.1"               "10.2"             
## 2016 " 8.3"               "10.6"             
## 2017 " 8.5"               "10.9"             
## 2018 " 8.2"               "10.7"             
## 2019 "7.9"                "9.8"

Setting it into a data frame also does not help, since this does not use the first row as names, it has simply assign the column names as row names and so on. Secondly, everything is transformed into character variables. All of this is fixable, by assigning the row names into a new variable, removing the first row, and setting the row names to NULL, but it’s messy. How to fix it?

as.data.frame(t(poll))
##                      V1                V2
## year Party sympathisers Second best party
## 2011                8.5              13.1
## 2012                8.7              11.3
## 2013               10.4              10.3
## 2014                8.2              10.7
## 2015                8.1              10.2
## 2016                8.3              10.6
## 2017                8.5              10.9
## 2018                8.2              10.7
## 2019                7.9               9.8

It’s actually fairly simple in this case, you just have to use a two stage pivot.

cleaner_poll <- poll %>% 
    pivot_longer(-year) %>%
        pivot_wider(names_from=year, values_from=value)
cleaner_poll
## # A tibble: 9 x 3
##   name  `Party sympathisers` `Second best party`
##   <chr>                <dbl>               <dbl>
## 1 2011                   8.5                13.1
## 2 2012                   8.7                11.3
## 3 2013                  10.4                10.3
## 4 2014                   8.2                10.7
## 5 2015                   8.1                10.2
## 6 2016                   8.3                10.6
## 7 2017                   8.5                10.9
## 8 2018                   8.2                10.7
## 9 2019                   7.9                 9.8

This still isn’t perfect, since we still have to change the name of the first var. This is because when we pivot longer, we use all cols except the one named year, which isn’t year at all but the future column names. This is because of the data quality from the beginning, and not much to do about that.

It’s cleaner but not clean yet, since the names still are bad, containing spaces. But we can fix that! The quick and easy version, especially in a setting such as this with few variables, is renaming it all manually. But what if you have 50 columns? One way, is using the package Janitor.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
cleaner_poll %>% clean_names()
## # A tibble: 9 x 3
##   name  party_sympathisers second_best_party
##   <chr>              <dbl>             <dbl>
## 1 2011                 8.5              13.1
## 2 2012                 8.7              11.3
## 3 2013                10.4              10.3
## 4 2014                 8.2              10.7
## 5 2015                 8.1              10.2
## 6 2016                 8.3              10.6
## 7 2017                 8.5              10.9
## 8 2018                 8.2              10.7
## 9 2019                 7.9               9.8

The other, and more flexible way, is doing it using rename_with. It takes a function, in this case an anonymous function, and applies it to selected columns.

cleaner_poll %>% rename_with(~tolower(gsub(" ", "_", .x,)), where(is_numeric))
## Warning: Deprecated

## Warning: Deprecated

## Warning: Deprecated
## # A tibble: 9 x 3
##   name  party_sympathisers second_best_party
##   <chr>              <dbl>             <dbl>
## 1 2011                 8.5              13.1
## 2 2012                 8.7              11.3
## 3 2013                10.4              10.3
## 4 2014                 8.2              10.7
## 5 2015                 8.1              10.2
## 6 2016                 8.3              10.6
## 7 2017                 8.5              10.9
## 8 2018                 8.2              10.7
## 9 2019                 7.9               9.8

If we change it to only integer columns, nothing changes.

cleaner_poll %>% rename_with(~tolower(gsub(" ", "_", .x,)), where(is_integer))
## # A tibble: 9 x 3
##   name  `Party sympathisers` `Second best party`
##   <chr>                <dbl>               <dbl>
## 1 2011                   8.5                13.1
## 2 2012                   8.7                11.3
## 3 2013                  10.4                10.3
## 4 2014                   8.2                10.7
## 5 2015                   8.1                10.2
## 6 2016                   8.3                10.6
## 7 2017                   8.5                10.9
## 8 2018                   8.2                10.7
## 9 2019                   7.9                 9.8

The selector uses tidy selection, so you can use most dplyr verbs to select the columns, such as starts_with.

cleaner_poll %>% rename_with(~tolower(gsub(" ", "_", .x,)), starts_with("P"))
## # A tibble: 9 x 3
##   name  party_sympathisers `Second best party`
##   <chr>              <dbl>               <dbl>
## 1 2011                 8.5                13.1
## 2 2012                 8.7                11.3
## 3 2013                10.4                10.3
## 4 2014                 8.2                10.7
## 5 2015                 8.1                10.2
## 6 2016                 8.3                10.6
## 7 2017                 8.5                10.9
## 8 2018                 8.2                10.7
## 9 2019                 7.9                 9.8

Hopefully this helps with the data cleaning!

Leo Carlsson
Leo Carlsson

My research interests include science, statistics and politics.