For marketers
who love technology
Home » , , » A command cheatsheet for R - most useful commands

A command cheatsheet for R - most useful commands

This post helps beginners who want to start using R for data science purposes. It provides tips and tricks and an R command cheatsheet.

Acknowledgement: the content of this post is largely inspired from my work on the Coursera lectures "getting and cleaning data", which I can't praise enough.

Feedback is welcome and will be taken into account. Thanks !

First steps with R

Install R data science software from http://www.r-project.org/
Install R studio, an IDE for R, from http://www.rstudio.com/

In RStudio, you are going to want a few libraries to learn how to use R (swirl library from John Hopkins University) and to manipulate various data formats (xlsx, SQL, and XML libraries).

To install these libraries, start RStudio, and type following commands at RStudio's prompt:
install.packages("swirl")  #for learning to use R 
install.packages("data.table") #for manipulating large datasets 
install.packages("xlsx")   #for downloading and handling Excel docs 
install.packages("XML")    #for downloading and handling XML docs 
install.packages('RMySQL') #for interacting with MySQL servers
install.packages('httr')   #for downloading and handling html docs
install.packages('dplyr')  #for manipulating data frames easily


To load any of these libraries, for instance swirl, use library command:
library("swirl") 

Then you can use the commands provided by the library, for instance:
swirl()

When you will be at ease with R, you will also want to install ggplot2 for graphics, and Rcpp for R and C++ integration.

R most useful commands (cheatsheet)

To play with R, you will need to know a few commands. A cheatsheet is useful if you already played with the commands in the cheatsheet but do not remember exactly the syntax and subtleties. I strongly suggest you have a look to the great courses provided by Coursera on these subjects. It's, in my opinion, the best and fastest way to learn the basics.

I keep the following list of command examples:
 
# Playing with basics variable types
## Determine the type of a variable 
class(x)  # class (type) of an object
attributes(x) #attributes of an object
## Determine the dimensions of a vector / matrix / frame
nrow(x) 
ncol(x)  
## create a vector (objects all of the same type) 
a <-c(0.5,0.6) #concatenate 
as.numeric(vector) #type conversions. see as.logical(), as.complex(), etc.
## create a 2 by 3 matrix containing 1:6
m <- matrix(1:6, nrow=2, ncol=3) 
#transform a vector into a matrix by specifying its dimensions:
dim(a) <- c(2,1) 
dimnames(a) <- list(c("a","b"),c("c")) #add labels to rows and cols
## concanete vectors into a matrix or add rows / cols to a frame
cbind(a,b) #every vector forms a column of the matrix
rbind(a,b) #every vector forms a row of the matrix


## Play with frames (matrices with labels, like a database)
x <- data.frame(foo= 1:3, bar=c("a","b","c"))
y <- 1:3
names(y)=c("pies","trees","cars")

# Subsetting 
## subset a vector 
y[1:2] 
y[y > 2]
## subset a matrix
a[1,2, drop=FALSE] #row 1, col 2
## remove NA values
clean_a <- a[!is.na(a)] # on a single vector
good_idx <- complete.cases(a,b) # columns with no NA in 2 vectors
a[good_idx] 
good <- complete.cases(frame) # index of the rows with all fields non NA
matrix[good,] #subset with good rows, all cols
## subset a frame
frame[frame$y>5] # rows for which the value at col y is greater than 5
good <- subset(r_week1,"Ozone">31 & Temp >90) #rows matching a set of constraints
best_idx <- complete.cases(good) #idx of rows without NA
best <- good[best_idx,] 
subset(r_week1,Month==6,select=c(Temp,Month) ) #subsetting rows and cols
X[(X$var1 > 3 & X$var2 <5),] #subset row with multiple conditions. "&"= and, "|"=or 
x[x$col %in% c("3","17"),] #subset rows with a value "3" or "17" in "col" column
##Subset with dplyr package
select(cran, size:ip_id) #subset columns size to ip_id
select(cran, -toto:titi) #subset all columns except the ones from toto to titi
filter(cran, !is.na(toto)) #subset rows that are not NA
filter(cran, (!is.na(toto)), titi="5") #subset rows with multiple conditions (and)
filter(cran, (!is.na(toto)) | titi="5") #subset rows with multiple conditions (or)
 
# Downloading and reading various data formats
## download a file :
getwd() # display the working directory  
download.file("https://example.com/NGAP.xlsx",destfile="./natural_gas.xlsx")
dateDownloaded <- date()
if (! file.exists("./natural_gas.xlsx")){NULL} #check if a file exists 
 
## read a csv file and put it into a frame
housingData <- read.table("./housing_microdata.csv",sep=",",header= TRUE) 
 
## read a few lines from a gzipped file 
con <- gzfile("./housing_microdata.gz")
x <- readLines(con, 10)
 
## display first lines of a file: 
head(housingData)
 
## read an xlsx file and put it into a frame
library("xlsx")
dat <- read.xlsx("./natural_gas.xlsx",sheetIndex = 1,colIndex = 7:15, rowIndex = 18:23)
 
## read an XML file and put it into a frame
library(XML)
doc <- xmlTreeParse("./restos.xml",useInternalNodes = TRUE)
### extract a type of tag from an XML file
MySQLxpathSApply(xmlRoot(doc),"//zipcode",xmlValue) 
 
## Store/read a variable to/from a text file
dput(objectName, file = filename)
dget(filename)
### Store/read multiple objects 
dump(vectorContainingObjectNames, file = filename) 
source(filename) 
 
## Read from a fixed width format text file 

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for", destfile='data.for')
# specify the column widths and names explicitly. 
data <- read.fwf(file,widths=c(15,4, ...), header = FALSE, skip= 4, col.names=c("var1","var2" ... )) 
 

## connect to a MySQL database
con <- dbConnect(MySQL(),user="toto",host="example.com",db="dbID")
res <- dbGetQuery(con,"show databases;") # pass any SQL command on the DB and get the result
quer <- dbSendQuery(con,"show databases;") # pass any SQL command on the DB but do not get the result
res <- fetch(quer, n= 10) # get first 10 lines of the result of a the SQL query 
dbClearResult(quer) # clear the result on the server
dbListTables(con) # list tables of a DB 
dbListFields(con,"mytable") #list columns names of a table
myFrame <- dbReadTable("mydb","mytable") 
disconnect(con) 
 
## Read an html webpage 
### without authentication
con <-  url("http://example.com")
htmlCode <- readLines(con) #long line without parsing, use XML or httr library to parse it.
close(con)
### with authentication
#You should use httr handles
### through APIs
#You should use httr  

#Sort variables
sort(X$var1, decreasing = TRUE, na.last=TRUE) #sort an array with NA values at the end
X[order(X$var1)] #sort a frame rows according the values of var1 column (increasing order)
# Arrange command in plyr library can also be handy to sort frames
arrange(cran2, country, desc(r_version), ip_id)) #sort by according to multiple rows
 
 

# Calculate basic statistics from a frame data
mean(x$col) # mean of a given column
colMeans(x, na.rm = FALSE) # means of every column 
colSums(x, na.rm = TRUE)
rowMeans(x) #means of every row
quantile(x$col, probs= c(0.2,0.8)) #20% and 80% quantiles

# Check the correctness of data
## Preview data (e.g., a frame) to check its correctness 
str(x) # classes of the data and first elements  
summary(x) #various stats (mean, min, max...) per column 
head(x, n=12)  # first 12 lines of the frame 
tail(x,n=12)  # last 12 lines of the frame 
### Use str(), summary(), head() and tail() commands after every data transformation 
### to check that the command does exactly what you want. 
## Detect error types 
is.na(x) 
is.nan(x)  
## Check if data is of the correct type and in the correct range
sum(is.na(x$mytransformedcol)) #counts the number of NA values in a column
colSums(is.na(x)) #counts the number of NA values in all columns
all(x$mycol >0 & x$mycol <100) #checks is all values are in the 0-100 range (boolean)
all(colSums(is.na(x))==0) #check that there is no NA in the frame
x$col %in% c("3","17") #returns a boolean vector TRUE if value is in {"3","17"} set
table(x$col %in% c("3","17")) #numbers of elements in and out the set {"3","17"}
## Group data having the same value of a variable (dplyr)
by_package <- group_by(cran, package) #group by "package"
summarize(by_package, mean(size)) #calculate the mean of "size" for every "package"
summarize(by_package, count = n(), unique = n_distinct(ip_id), avg_bytes =  mean(size)) 
 
#Manipulate text
tolower(str) #converts uppercase letters to lowercase 
strsplit("toto.titi","\\.") #split string at a given characer
sub('titi','tata',names(x)) #replace the first occurence of titi by tata in column names
gsub('titi','tata',names(x)) #replace all occurences not just the first one
grep("toto",x) # returns the indices of all elements of x that contain the string "toto" 
grep("toto",x, value=TRUE) # returns the elements of x that contain the string "toto" 
table(grep("toto",x)) #counts the number of elements containing "toto"  
nchar("toto") #counts the number of characters in a string 
substr("toto",2,3) #substring between index 2 and 3 (ot)
 
# Manipulate dates
d <-date() #returns a string with the date / time 
format(d,"%d/%m/%Y") #reformat it as dd/mm/yyyy
weekdays(d) #the full text day name (e.g., Monday)
months(d) #month name full name (e.g., July)
d14 - d2 #difference between 2 dates 
 
# Other commands worth to be known
mutate(cran3, size_mb = size / 2^20, size_gb = size_mb/ 2^10) #add two new columns on the right  
xtabs #reformat a frame according to cols / rows 
ifelse(condition, value if true, value if false) #test on variables
melt #reshape the dataset as tall & skinny (more rows, less cols)
dcast #quite similar to table; have a look to the doc
tapply #calculations on subsets of a frame 
split #segment the data according to the values of a col

 Other tips and best practices

  1. If performance matters (large datasets), always define all arguments of the read.table function explicitly.
  2. dplyr is a library that permits simpler operations on data frames, and it is quick.
  3. Data.table library is faster than usual data frames
  4. Use vectorized operations (like: x>2 to get test all items of a vector and see if they are above 2), they are very handy and elegant. It's a strength of R to avoid the use of 'for' loops.
  5. Get only the data you need, not more, not less. For instance, restrict the data you query from a SQL server thanks to a select statement, and check the data with a size-limited fetch() command. 
  6. Check every step of your R scripts. Use str(), summary(), head() and tail() commands after every data transformation to check that the command does exactly what you want. Also use sum(is.na(x$mytransformedcol)) to check if any NA value are present. all (x$mycol >0) checks is
    • Beware of conversions of imported data to numeric format, they often produce impracticable results.
  7. Sometimes another tool is more practical than R. I use SQLlite quite a lot for simple database operations like querying data with SQL. You do not need to be an expert in SQL to perform simple queries under the model select/from/where. A quick and simple example:
    1. select avg(Ozone)
      from 'quizz 1'
      where Ozone<>'NA';
Have fun!


SHARE

About Gilles

0 comments :

Post a Comment