Introduction to R Workshop

3.4 Dataframe

A dataframe is a two-dimensional object in R. It can be viewed as a collection of vectors of the same length, similar to a matrix.

The advantage of a dataframe over a matrix is that its vectors, or columns, can be of different types.
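
As a minimal sketch of this difference, the block below builds the same two columns first as a matrix and then as a dataframe. The names mixed_mat and mixed_df are illustrative and not part of the workshop data, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# A matrix holds a single type: mixing numbers and text
# coerces every element to character.
mixed_mat <- cbind(id = 1:3, label = c("a", "b", "c"))
typeof(mixed_mat)

# A dataframe keeps the type of each column separately.
mixed_df <- data.frame(id = 1:3,
                       label = c("a", "b", "c"),
                       stringsAsFactors = FALSE)
sapply(mixed_df, class)
```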

Dataframes are generally used to store data tables, where columns represent variables and rows represent observations. This is similar to how data is loaded in statistical packages such as SAS and SPSS.

3.4.1 Creating a dataframe

In R, dataframes are created with the function data.frame().

One way to create a dataframe is by passing vectors as its columns.

muestra_df <- data.frame(secuencia = 1:5,
                         aleatorio = rnorm(5),
                         letras = c("a", "b", "c", "d", "e"))
muestra_df

##   secuencia  aleatorio letras
## 1         1  0.4663137      a
## 2         2 -1.0043280      b
## 3         3 -0.6544573      c
## 4         4 -0.2315481      d
## 5         5 -0.2093135      e

Alternatively, a matrix can be converted with the same function. Take the matrix sales_mat, which holds the box-office grosses of the HP film saga, and convert it into a dataframe.

sales_df <- data.frame(sales_mat)
sales_df

##                                          total release_date
## 1. HP and the Sorcerer's Stone       497066400    141823200
## 8. HP and the Deathly Hallows Part 2 426630300    189432500
## 4. HP and the Goblet of Fire         401608200    142414700
## 2. HP and the Chamber of Secrets     399302200    135197600
## 5. HP and the Order of the Phoenix   377314200     99635700
## 6. HP and the Half-Blood Prince      359788300     92756000
## 3. HP and the Prisoner of Azkaban    357233500    134119300
## 7. HP and the Deathly Hallows Part 1 328833900    138752100
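
The conversion can be sketched on a small stand-alone matrix. The figures below are the first two rows of the output above; the object names demo_mat and demo_df and the row labels film_1 and film_8 are placeholders chosen for this illustration.

```r
# matrix() fills column by column: first both totals, then both openings.
demo_mat <- matrix(c(497066400, 426630300,
                     141823200, 189432500),
                   nrow = 2,
                   dimnames = list(c("film_1", "film_8"),
                                   c("total", "opening")))

# data.frame() carries over the row and column names of the matrix.
demo_df <- data.frame(demo_mat)
rownames(demo_df)
colnames(demo_df)
```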

3.4.2 Dimension names

As with matrices, the functions rownames() and colnames() set the row and column names of the object.

colnames(sales_df) <- c("total_grosses", "opening_grosses")
sales_df

##                                      total_grosses opening_grosses
## 1. HP and the Sorcerer's Stone           497066400       141823200
## 8. HP and the Deathly Hallows Part 2     426630300       189432500
## 4. HP and the Goblet of Fire             401608200       142414700
## 2. HP and the Chamber of Secrets         399302200       135197600
## 5. HP and the Order of the Phoenix       377314200        99635700
## 6. HP and the Half-Blood Prince          359788300        92756000
## 3. HP and the Prisoner of Azkaban        357233500       134119300
## 7. HP and the Deathly Hallows Part 1     328833900       138752100

3.4.3 Selecting elements

For dataframes, in addition to selecting rows and columns by position with [ , ], a column can be selected by name with the $ sign.

sales_df$total_grosses

## [1] 497066400 426630300 401608200 399302200 377314200 359788300 357233500
## [8] 328833900
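
To compare both forms of selection, here is a short sketch on a three-row dataframe. The values are the first three rows of sales_df; the name demo_df is only used for this illustration.

```r
demo_df <- data.frame(total_grosses = c(497066400, 426630300, 401608200),
                      opening_grosses = c(141823200, 189432500, 142414700))

# A whole column as a vector: by name with $ or with [ , ].
by_dollar  <- demo_df$total_grosses
by_bracket <- demo_df[ , "total_grosses"]

# A whole row: the result is still a dataframe.
second_row <- demo_df[2, ]

# A single cell: row 2 of the column opening_grosses.
one_value <- demo_df[2, "opening_grosses"]
```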

The same sign can be used to add new columns to the object.

For example, take the titles that were inherited from the matrix as row names, and add them to the dataframe as a new variable of type factor.

sales_df$title <- factor(rownames(sales_df))
sales_df

##                                      total_grosses opening_grosses
## 1. HP and the Sorcerer's Stone           497066400       141823200
## 8. HP and the Deathly Hallows Part 2     426630300       189432500
## 4. HP and the Goblet of Fire             401608200       142414700
## 2. HP and the Chamber of Secrets         399302200       135197600
## 5. HP and the Order of the Phoenix       377314200        99635700
## 6. HP and the Half-Blood Prince          359788300        92756000
## 3. HP and the Prisoner of Azkaban        357233500       134119300
## 7. HP and the Deathly Hallows Part 1     328833900       138752100
##                                                                     title
## 1. HP and the Sorcerer's Stone             1. HP and the Sorcerer's Stone
## 8. HP and the Deathly Hallows Part 2 8. HP and the Deathly Hallows Part 2
## 4. HP and the Goblet of Fire                 4. HP and the Goblet of Fire
## 2. HP and the Chamber of Secrets         2. HP and the Chamber of Secrets
## 5. HP and the Order of the Phoenix     5. HP and the Order of the Phoenix
## 6. HP and the Half-Blood Prince           6. HP and the Half-Blood Prince
## 3. HP and the Prisoner of Azkaban       3. HP and the Prisoner of Azkaban
## 7. HP and the Deathly Hallows Part 1 7. HP and the Deathly Hallows Part 1

The film titles are now a factor with the following levels:

levels(sales_df$title)

## [1] "1. HP and the Sorcerer's Stone"      
## [2] "2. HP and the Chamber of Secrets"    
## [3] "3. HP and the Prisoner of Azkaban"   
## [4] "4. HP and the Goblet of Fire"        
## [5] "5. HP and the Order of the Phoenix"  
## [6] "6. HP and the Half-Blood Prince"     
## [7] "7. HP and the Deathly Hallows Part 1"
## [8] "8. HP and the Deathly Hallows Part 2"
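
Note that the levels appear sorted, not in the row order of the dataframe: by default factor() sorts the unique values to build the levels and stores each element as an integer code pointing to its level. A minimal sketch with three of the titles (the names titles and title_fct are illustrative):

```r
titles <- c("8. HP and the Deathly Hallows Part 2",
            "1. HP and the Sorcerer's Stone",
            "4. HP and the Goblet of Fire")
title_fct <- factor(titles)

# Levels are the sorted unique values.
levels(title_fct)

# Each element is stored as the position of its level.
as.integer(title_fct)
```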

Since the titles are now stored as a variable, we can delete the row names by assigning NULL.

rownames(sales_df) <- NULL
sales_df

##   total_grosses opening_grosses                                title
## 1     497066400       141823200       1. HP and the Sorcerer's Stone
## 2     426630300       189432500 8. HP and the Deathly Hallows Part 2
## 3     401608200       142414700         4. HP and the Goblet of Fire
## 4     399302200       135197600     2. HP and the Chamber of Secrets
## 5     377314200        99635700   5. HP and the Order of the Phoenix
## 6     359788300        92756000      6. HP and the Half-Blood Prince
## 7     357233500       134119300    3. HP and the Prisoner of Azkaban
## 8     328833900       138752100 7. HP and the Deathly Hallows Part 1

Exercise: Movie theaters

Add a column with the number of theaters in which each film was shown, using the vector theaters_vec that we created earlier.

sales_df$theaters <- 
sales_df

3.4.4 Ordering positions

The function order() does not sort the vector itself; it returns the positions of its elements arranged from smallest to largest.

Continuing with the example of the saga's grosses, let's obtain the vector of film positions ordered by total grosses.

The vector of total grosses is the following:

sales_df$total_grosses

## [1] 497066400 426630300 401608200 399302200 377314200 359788300 357233500
## [8] 328833900

The vector of ordered positions:

total_order <- order(sales_df$total_grosses)
total_order

## [1] 8 7 6 5 4 3 2 1

We index the total grosses with the position vector total_order to obtain the grosses sorted from smallest to largest.

sales_df$total_grosses[total_order]

## [1] 328833900 357233500 359788300 377314200 399302200 401608200 426630300
## [8] 497066400
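
The relation between order() and sort() can be checked on a short vector. This is a sketch with three of the totals in a scrambled order; the name x is illustrative.

```r
x <- c(401608200, 497066400, 328833900)

# Positions of the elements from smallest to largest.
asc_pos <- order(x)                      # 3 1 2

# Indexing with those positions gives the same result as sort().
identical(x[asc_pos], sort(x))           # TRUE

# decreasing = TRUE returns the positions from largest to smallest.
desc_pos <- order(x, decreasing = TRUE)  # 2 1 3
```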

In the same way, the rows of the dataframe can be sorted. The column index c(3, 1, 2) also moves the title to the first column:

sales_order_df <- sales_df[ total_order , c(3, 1, 2)]
sales_order_df

##                                  title total_grosses opening_grosses
## 8 7. HP and the Deathly Hallows Part 1     328833900       138752100
## 7    3. HP and the Prisoner of Azkaban     357233500       134119300
## 6      6. HP and the Half-Blood Prince     359788300        92756000
## 5   5. HP and the Order of the Phoenix     377314200        99635700
## 4     2. HP and the Chamber of Secrets     399302200       135197600
## 3         4. HP and the Goblet of Fire     401608200       142414700
## 2 8. HP and the Deathly Hallows Part 2     426630300       189432500
## 1       1. HP and the Sorcerer's Stone     497066400       141823200
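
As the output shows, the sorted rows keep their original row names (8, 7, ..., 1). The sketch below, on a small dataframe with placeholder titles "a", "b" and "c", sorts from largest to smallest instead and then resets the row names; the object names are assumptions made for this example.

```r
demo_df <- data.frame(title = c("a", "b", "c"),
                      total = c(401608200, 497066400, 328833900),
                      stringsAsFactors = FALSE)

# Sort the rows by total, largest first.
sorted_df <- demo_df[order(demo_df$total, decreasing = TRUE), ]

# Row names still refer to the positions in demo_df.
old_names <- rownames(sorted_df)

# Assigning NULL renumbers the rows from 1.
rownames(sorted_df) <- NULL
sorted_df
```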

Exercise: Release dates

Add another column to the dataframe sales_order_df with the release dates from the vector shown below.

release_hp <- c("11/16/01", "7/15/11", "11/18/05", "11/15/02", "7/11/07", "7/15/09", "6/4/04", "11/19/10")
names(release_hp) <-  titles_hp
release_hp

##       1. HP and the Sorcerer's Stone 8. HP and the Deathly Hallows Part 2 
##                           "11/16/01"                            "7/15/11" 
##         4. HP and the Goblet of Fire     2. HP and the Chamber of Secrets 
##                           "11/18/05"                           "11/15/02" 
##   5. HP and the Order of the Phoenix      6. HP and the Half-Blood Prince 
##                            "7/11/07"                            "7/15/09" 
##    3. HP and the Prisoner of Azkaban 7. HP and the Deathly Hallows Part 1 
##                             "6/4/04"                           "11/19/10"

There is a problem with this vector: it follows the row order of the original matrix, not the order of sales_order_df.

Using the function order(), rearrange the vector to match the order of the titles, and then add the rearranged vector to the dataframe.

sales_order_df$release_date <- release_hp[]
sales_order_df

##                                  title total_grosses opening_grosses
## 8 7. HP and the Deathly Hallows Part 1     328833900       138752100
## 7    3. HP and the Prisoner of Azkaban     357233500       134119300
## 6      6. HP and the Half-Blood Prince     359788300        92756000
## 5   5. HP and the Order of the Phoenix     377314200        99635700
## 4     2. HP and the Chamber of Secrets     399302200       135197600
## 3         4. HP and the Goblet of Fire     401608200       142414700
## 2 8. HP and the Deathly Hallows Part 2     426630300       189432500
## 1       1. HP and the Sorcerer's Stone     497066400       141823200
##   release_date
## 8     11/19/10
## 7       6/4/04
## 6      7/15/09
## 5      7/11/07
## 4     11/15/02
## 3     11/18/05
## 2      7/15/11
## 1     11/16/01

3.4.5 Useful functions for dataframes

Several functions help when working with dataframes.

head() and tail() show the first and last rows of a dataframe (six by default):

head(sales_order_df)

##                                  title total_grosses opening_grosses
## 8 7. HP and the Deathly Hallows Part 1     328833900       138752100
## 7    3. HP and the Prisoner of Azkaban     357233500       134119300
## 6      6. HP and the Half-Blood Prince     359788300        92756000
## 5   5. HP and the Order of the Phoenix     377314200        99635700
## 4     2. HP and the Chamber of Secrets     399302200       135197600
## 3         4. HP and the Goblet of Fire     401608200       142414700
##   release_date
## 8     11/19/10
## 7       6/4/04
## 6      7/15/09
## 5      7/11/07
## 4     11/15/02
## 3     11/18/05

tail(sales_order_df)

##                                  title total_grosses opening_grosses
## 6      6. HP and the Half-Blood Prince     359788300        92756000
## 5   5. HP and the Order of the Phoenix     377314200        99635700
## 4     2. HP and the Chamber of Secrets     399302200       135197600
## 3         4. HP and the Goblet of Fire     401608200       142414700
## 2 8. HP and the Deathly Hallows Part 2     426630300       189432500
## 1       1. HP and the Sorcerer's Stone     497066400       141823200
##   release_date
## 6      7/15/09
## 5      7/11/07
## 4     11/15/02
## 3     11/18/05
## 2      7/15/11
## 1     11/16/01
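
Both functions accept an argument n to change the number of rows returned. A minimal sketch on a made-up dataframe of eight rows (demo_df and its columns are hypothetical):

```r
demo_df <- data.frame(id = 1:8, value = seq(10, 80, by = 10))

# First three rows.
first_rows <- head(demo_df, n = 3)

# Last two rows.
last_rows <- tail(demo_df, n = 2)
```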

str() shows the structure of the object: its dimensions and the type of each column:

str(sales_order_df)

## 'data.frame':    8 obs. of  4 variables:
##  $ title          : Factor w/ 8 levels "1. HP and the Sorcerer's Stone",..: 7 3 6 5 2 4 8 1
##  $ total_grosses  : num  3.29e+08 3.57e+08 3.60e+08 3.77e+08 3.99e+08 ...
##  $ opening_grosses: num  1.39e+08 1.34e+08 9.28e+07 9.96e+07 1.35e+08 ...
##  $ release_date   : chr  "11/19/10" "6/4/04" "7/15/09" "7/11/07" ...

dim(), nrow() and ncol() return the dimensions, the number of rows and the number of columns:

nrow(sales_order_df)

## [1] 8
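
The three functions are related: dim() returns the number of rows followed by the number of columns. A short sketch on a made-up dataframe (demo_df is hypothetical):

```r
demo_df <- data.frame(id = 1:8, value = seq(10, 80, by = 10))

dim(demo_df)    # rows, then columns: 8 2
nrow(demo_df)   # 8
ncol(demo_df)   # 2
```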

subset() keeps the rows that satisfy a condition:

avg_total_gr <- mean(sales_order_df$total_grosses)
subset(sales_order_df, total_grosses > avg_total_gr)

##                                  title total_grosses opening_grosses
## 4     2. HP and the Chamber of Secrets     399302200       135197600
## 3         4. HP and the Goblet of Fire     401608200       142414700
## 2 8. HP and the Deathly Hallows Part 2     426630300       189432500
## 1       1. HP and the Sorcerer's Stone     497066400       141823200
##   release_date
## 4     11/15/02
## 3     11/18/05
## 2      7/15/11
## 1     11/16/01
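
subset() also takes a select argument to keep only some columns, and the same filter can be written with logical indexing inside [ , ]. The sketch below uses a toy dataframe with invented values; demo_df, its columns and the other object names are assumptions made for this example.

```r
demo_df <- data.frame(title = c("a", "b", "c", "d"),
                      total = c(100, 250, 300, 150),
                      opening = c(40, 90, 120, 60),
                      stringsAsFactors = FALSE)
avg_total <- mean(demo_df$total)   # 200

# Rows above the mean, keeping only two columns.
above_avg <- subset(demo_df, total > avg_total, select = c(title, total))

# The same selection with logical indexing.
same_rows <- demo_df[demo_df$total > avg_total, c("title", "total")]
```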