Course:Cons452/UsingR
What is R
R is a programming language and an open-source software environment for statistical computing, and it is widely used for data analysis. R supports an extraordinary range of statistical calculations. It is free software, developed largely through voluntary contributions from statisticians around the world. R has its home page at https://www.r-project.org/.
In this course, we will run our statistical analyses in R through RStudio, an Integrated Development Environment (IDE). Other IDEs for R are available, but RStudio is the most widely used. The picture below depicts the main difference between the two.
Installing R and RStudio Desktop
You need to install both R and RStudio on your computer. Install R first, then RStudio. We will rely on the RStudio Desktop version (there is also a browser-based version called RStudio Server).
R | RStudio |
---|---|
Download: | Download: |
After installing, open RStudio as you would any other application on your computer. It may be useful to add a desktop shortcut for easy access.
Getting familiar
Data types and Data structures[1]
Everything in R is an object.
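R has 6 basic data types. The five listed here are the ones you will normally meet (the sixth, raw, is rarely needed):
- character: "cons452", "lab"
- numeric: 2, 13.4
- integer: 3L (the L suffix tells R to store the value as an integer)
- logical: TRUE, FALSE (must be written in upper case)
- complex: 3+4i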
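As a quick check of the data types listed above, the sketch below stores one value of each type and asks R what it is with class(). The variable names are made up for illustration.

```r
# One example value for each of the five data types listed above
course <- "cons452"   # character
height <- 13.4        # numeric
count  <- 3L          # integer (note the L suffix)
is_lab <- TRUE        # logical (upper case, no quotes)
z      <- 3+4i        # complex

class(course)   # "character"
class(height)   # "numeric"
class(count)    # "integer"
class(is_lab)   # "logical"
class(z)        # "complex"
```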
A simple object in R could be a collection of elements, e.g. a sequence of numbers. When all elements are of the same data type, the collection is called a vector (more specifically, an atomic vector). The vector is the simplest data structure in R. R data structures include:
- atomic vector
- list
- matrix
- data frame
- factors
For the purposes of CONS 452, the data frame is the most relevant data structure. A typical data file (a spreadsheet in which columns represent variables and rows represent observations) resembles a data frame in R.
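A minimal sketch of an atomic vector and a data frame follows. The tree data are invented for illustration and are not course data.

```r
# An atomic vector: every element has the same data type
heights <- c(12.5, 15.1, 9.8)
class(heights)          # "numeric"

# Mixing types in c() forces everything to a single type
mixed <- c(1, "a", TRUE)
class(mixed)            # "character"

# A data frame: columns are variables, rows are observations
trees <- data.frame(
  species  = c("fir", "cedar", "pine"),
  height_m = heights
)

str(trees)              # compact overview of columns and their types
nrow(trees)             # 3 observations
trees$height_m          # pull out one column as a vector
```

Each column of a data frame is itself a vector, which is why all values within a column share one data type.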
The RStudio interface has 4 main sections
- The Editor: this is where you write your code. What you write here is saved as a script file on your computer.
- The Console: when you run a script, its commands are sent to the console, and the output is printed there.
- Environment & History: lists the datasets and other objects currently loaded, and the history of commands used.
- Files, Plots, Packages & Help: this section helps you keep track of your files, installed packages, and the plots you produce.
Setting A Working Directory
- Before you start importing data, installing packages, and exploring your data, you will have to set your working directory.
- The files on your computer are organized hierarchically into folders, or “directories.” It is convenient to tell R at the beginning of a session which directory to look in for files, to minimize typing later. This is essentially setting up a folder path.
- To set the working directory in RStudio, open the “Session” menu in the menu bar, choose “Set Working Directory”, and then “Choose Directory...” This opens a dialog box that lets you find and select the directory you want. You can also set the working directory with code typed in the editor.
- !! WARNING !! On a Mac, file paths are written with forward slashes (/); on Windows, they are written with backslashes (\). Inside R code, forward slashes work on both systems, whereas each backslash in a Windows path must be doubled (\\). Keep this in mind when switching computers or working on scripts with other people!
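Setting the working directory from code might look like the sketch below. It uses tempdir() as a stand-in for a real project folder so that it runs on any computer; the Windows paths in the comments are hypothetical.

```r
# tempdir() stands in for a real project folder (assumption for this sketch)
project_dir <- tempdir()

old_dir <- setwd(project_dir)   # setwd() returns the previous directory
getwd()                         # confirm where R is now looking for files

# Two valid ways to write a (hypothetical) Windows path in R code:
#   setwd("C:/Users/me/cons452")
#   setwd("C:\\Users\\me\\cons452")

# file.path() builds a path with forward slashes, which R accepts everywhere
data_file <- file.path("data", "trees.csv")
data_file                       # "data/trees.csv"

setwd(old_dir)                  # go back to the original directory
```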
Installing Packages
- R usually offers several approaches to a particular task, and these approaches are provided by different packages. Follow these instructions to install a package of your choice. You can install a package either with code written in your script or by navigating the interface manually.
- Follow this tutorial to install a package from the editor with a written script: How to Install Packages in R
- Follow these steps to install a package using the RStudio interface:
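Installing from a script could look like the sketch below. It installs a package only when it is missing and then attaches it. The package name "stats" is used because it ships with R, so nothing is downloaded here; the CRAN mirror URL is one common choice, not a course requirement.

```r
pkg <- "stats"   # swap in e.g. "dplyr" to install a package from CRAN

# Install only if the package is not already on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

# Attach the package so its functions can be used in this session
library(pkg, character.only = TRUE)
```

A package needs to be installed once per computer but must be attached with library() in every new R session. In everyday scripts this is usually written simply as install.packages("dplyr") followed by library(dplyr).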
Functions and Variables
- Most of the work in R is done by functions. A function has a name and takes one or more arguments. For example, log(4) calls the function log, which calculates the logarithm in base e of the value 4 given as input. Different packages give you access to new functions!
- In R, we can store information of various sorts by assigning it to variables. For example, to create a variable called x and give it a value of 4, we would write "x <- 4". After running this command, whenever we use x in a command it is replaced by its value, 4.
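The two ideas above, calling a function and assigning a variable, can be tried together in a few lines. The variable y is added only for illustration.

```r
x <- 4               # assign the value 4 to the variable x

log(x)               # natural log (base e) of 4, about 1.386
log(x, base = 2)     # a second argument changes the base: 2

y <- x * 10          # x is replaced by its value, so y is 40
y
```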
Naming Variables
Naming variables and functions in R is pretty flexible. Here is a list of important things to remember when naming variables:
- A name should start with a letter, and can then contain letters, numbers, periods (.), and underscores (_).
- A name cannot contain spaces.
- Names in R are case-sensitive. This means that Weights and weights are completely different things to R. Unfortunately, this is a common and frustrating error many of us make while using R.
- It’s a good idea to have your names be as descriptive as possible, so that you will know what you meant later on when looking at it. (However, if they get too long, it becomes painful and error prone to type them each time we use them, so this, as with all things, requires moderation.) Underscores often become useful.
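The naming rules above can be checked directly in the console. All names in this sketch are invented for illustration.

```r
# Descriptive name using underscores instead of spaces
tree_height_m <- c(12.5, 15.1, 9.8)

# Names are case-sensitive: these are two separate variables
weights <- 10
Weights <- 20
identical(weights, Weights)    # FALSE

# make.names() shows how R would repair an invalid name
make.names("tree height")      # "tree.height"
```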
Get familiar with basic coding using variables and vectors by watching these tutorials:
Reading Data
Importing Data Cheatsheet : Data Import Cheatsheet
- A quick tutorial on importing data: Importing Data on R
- Data come in many different file formats, so it helps to know how to convert between them. Converting data to a preferred format is usually quite easy.
- Exporting Data: There are many different ways to export data in R. Here are a few you might use for data tables.
- In this R stats video, learn to export data using R functions such as "write.table", "write.csv", and "write.csv2" (all part of base R), along with the “write.command” and "write.delim" functions named in the video, and save the data in formats such as CSV, tab-delimited, and space-delimited.
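A minimal export and import round trip with base R functions might look like this. The data frame and file names are invented, and the files are written to a temporary folder so nothing on your computer is overwritten.

```r
# Small made-up data table
trees <- data.frame(
  species  = c("fir", "cedar", "pine"),
  height_m = c(12.5, 15.1, 9.8)
)

# Export as CSV (row.names = FALSE avoids an extra column of row numbers)
csv_path <- file.path(tempdir(), "trees.csv")
write.csv(trees, csv_path, row.names = FALSE)

# Export the same table as a tab-delimited text file
txt_path <- file.path(tempdir(), "trees.txt")
write.table(trees, txt_path, sep = "\t", row.names = FALSE)

# Import the CSV again and inspect it
trees_in <- read.csv(csv_path)
str(trees_in)
```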
Data Transformation
Important packages: tidyr, dplyr
Data Cleaning with tidyr
Turn messy data into clean data! The functions in tidyr let you deal with missing values, nest data, separate or unite columns, pivot your data, and more. Refer to the second page of the Data Import Cheatsheet for an overview of some of these useful tools.
- Cleaning your data is important to ensure it is easier to work with once you start analyzing and modelling.
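As a small illustration of cleaning with tidyr, the sketch below pivots a wide table into long format and drops rows with missing values. It assumes tidyr (version 1.0 or later) is installed; the site counts are invented.

```r
library(tidyr)

# Made-up wide table: one column per year, with one missing count
counts_wide <- data.frame(
  site  = c("A", "B", "C"),
  y2018 = c(10, NA, 7),
  y2019 = c(12, 5, 9)
)

# Pivot to long format: one row per site-year combination
counts_long <- pivot_longer(
  counts_wide,
  cols      = c(y2018, y2019),
  names_to  = "year",
  values_to = "count"
)

# Remove rows where the count is missing
counts_clean <- drop_na(counts_long, count)

nrow(counts_long)    # 6
nrow(counts_clean)   # 5
```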
Data Transformation with dplyr
The dplyr cheatsheet: dplyr Cheatsheet
- dplyr provides a grammar for manipulating tables in R. This cheat sheet will guide you through the grammar, reminding you how to select, filter, arrange, mutate, summarise, group, and join data frames and tibbles.
- dplyr allows you to perform Excel-like analysis. It provides the functions to add or change columns, delete or subset variables, and perform simple algebra!
Popular dplyr functions:
- mutate(): adds new variables
- recode(): recodes values in the variable
- rowSums(): computes a sum score across columns (a base R function, often used together with mutate())
- filter(): selects cases based on condition
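The functions listed above can be combined as in the sketch below, which assumes dplyr is installed. The plot data and column names are invented for illustration.

```r
library(dplyr)

# Made-up survey data: animal counts per plot
plots <- data.frame(
  plot_id = 1:4,
  habitat = c("forest", "forest", "meadow", "meadow"),
  birds   = c(5, 8, 2, 3),
  mammals = c(1, 0, 2, 1)
)

plots <- plots %>%
  mutate(
    total        = rowSums(cbind(birds, mammals)),            # sum score per row
    habitat_code = recode(habitat, forest = "F", meadow = "M") # recode values
  )

# Keep only plots with more than 5 animals in total
busy_plots <- plots %>% filter(total > 5)

plots$total          # 6 8 4 4
busy_plots$plot_id   # 1 2
```

The %>% operator passes the result on its left to the function on its right, which lets several steps be read from top to bottom.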
Data Manipulation with data table Cheatsheet
- Explore this "Basic Data Manipulation" Tutorial here: https://ourcodingclub.github.io/2017/01/06/data-manip-intro.html to experiment with some of these functions. It guides users through the basics of tidyr and dplyr. Users will learn how to subset, modify and shape data.
Data Visualization
Data Visualization using ggplot2
- The ggplot2 Cheatsheet is great to have by your side when you begin to visualize your data!
Our favourite R YouTuber, MarinStats, provides a handy tutorial covering many kinds of graphs:
A helpful summary tool of graphs produced using practice R datasets:
R does not automatically produce polished plots; you will need to write code to modify and customize your graphs. Here are a few more helpful tutorials:
Explore your options with ggplot. This package becomes really important when it comes to presenting data.
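A minimal customized scatter plot is sketched below, assuming ggplot2 is installed. It uses iris, one of the practice datasets that ships with R; the title and axis labels are illustrative choices.

```r
library(ggplot2)

# Scatter plot of the built-in iris data, coloured by species
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  labs(
    title = "Petal length vs. sepal length",
    x     = "Sepal length (cm)",
    y     = "Petal length (cm)"
  ) +
  theme_minimal()   # a cleaner look than the default grey theme

print(p)            # draw the plot
```

Each layer is added with +, so a plot can be built up and customized one step at a time.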
Data Analysis
Descriptive statistics
Descriptive statistics provide a summary of your data. Exploring them will help you:
- Check whether the data loaded properly
- Explore the data to identify potential group differences and associations between variables.
- Create sample descriptions by looking at percentages, means, and standard deviations.
Use the summary() function to get a quick overview of your data table!
A helpful guide to basic summary statistics: Calculating mean, standard deviation, frequencies and more in R
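The summaries mentioned above can be produced with base R functions, sketched here on the built-in iris dataset (not course data).

```r
# Quick overview of every column
summary(iris)

# Mean and standard deviation of one variable
mean_sepal <- mean(iris$Sepal.Length)
sd_sepal   <- sd(iris$Sepal.Length)

# Frequencies and percentages of a categorical variable
species_counts  <- table(iris$Species)
species_percent <- prop.table(species_counts) * 100

# Mean of a variable within each group
group_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

mean_sepal
sd_sepal
species_counts
species_percent
group_means
```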
Basic Inferential Statistics:
- Linear Regression in R
- T distribution and T scores in R: Use the paired t-test to test for a difference between group means with paired data, or use the 2-sample t-test to test for a difference between the means of two independent groups.
- Conducting a one-way ANOVA in R: Use R to perform analysis of variance (ANOVA) to compare the means of multiple groups.
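The three analyses listed above can each be run in a line or two of base R. This sketch uses datasets that ship with R (cars, sleep, ToothGrowth, PlantGrowth) purely as stand-ins for your own data.

```r
# Linear regression: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)
summary(fit)

# Paired t-test: the same 10 patients measured under two drugs
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
paired_result <- t.test(drug1, drug2, paired = TRUE)
paired_result

# 2-sample t-test: two independent groups
two_sample_result <- t.test(len ~ supp, data = ToothGrowth)
two_sample_result

# One-way ANOVA: compare mean weight across three treatment groups
anova_fit <- aov(weight ~ group, data = PlantGrowth)
summary(anova_fit)
```

In each formula, the response variable is on the left of ~ and the explanatory variable is on the right.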
Cheatsheets
Cheatsheets are pamphlet-like reference documents. Each contains shortcut instructions either for the many functions in a particular R package or for a certain category of useful functions. We encourage you to keep these cheatsheet PDF files handy.
Organizing your workspace and files
- To minimize the code revisions your team has to make while working on your project, save your files on the course drive.
- Keep a folder with your project name on the course drive.
- Write the instruction for reading your .csv file like the following
- Find a "Coding Etiquette Tutorial" here : https://ourcodingclub.github.io/2017/04/25/etiquette.html
Extra Resources
- Coding Club Online Tutorials : Offers intro, intermediate and advanced tutorials for R as well as introductory tutorials for Google Earth Engine! Link to website: https://ourcodingclub.github.io/tutorials/?fbclid=IwAR1f91lXrVPkvNS12CEI8NVev1z9IE8zvwGGiPSnccofb1pJuuANWw3sbHU
Open Source R Guide Books:
- "The R Guide" by W.J Owen : https://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
- "R for Beginners" by Emmanuel Paradis: https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
- "R for Data Science" by Garrett Grolemund & Hadley Wickham: https://r4ds.had.co.nz/?fbclid=IwAR15o7QIbva-WXwwbDh-IPyAaJ7ijQ045xiiSQ80SuuMCTWVBMqesRGJlfQ
References
- [1] "Data Types and Structures". Retrieved November 7, 2019.