This chapter outlines one suggested way to organize the files in a code repository. Adhering to a standard format makes it easier for us to collaborate efficiently across projects, but because most CTML members also work on other projects, your organization may take on a personalized flavor. This page is largely based on Andrew Mertens’ personal experience and preferences, so talk to your direct supervisor about their standards.
Each study has at least one code repository that typically holds R (or other) code, and research outputs (results .RDS files, tables, figures). Repositories may also include datasets, but be careful not to host any private data on public repositories.
We recommend the following directory structure (mock example):
0-run-project.sh
0-config.R
1 - Data-Management/
    0-prep-data.sh
    1-prep-cdph-fluseas.R
    2a-prep-absentee.R
    2b-prep-absentee-weighted.R
    3a-prep-absentee-adj.R
    3b-prep-absentee-adj-weighted.R
2 - Analysis/
    0-run-analysis.sh
    1 - Absentee-Mean/
        1-absentee-mean-primary.R
        2-absentee-mean-negative-control.R
        3-absentee-mean-CDC.R
        4-absentee-mean-peakwk.R
        5-absentee-mean-cdph2.R
        6-absentee-mean-cdph3.R
    2 - Absentee-Positivity-Check/
    3 - Absentee-P1/
    4 - Absentee-P2/
3 - Figures/
    0-run-figures.sh
    ...
4 - Tables/
    0-run-tables.sh
    ...
5 - Results/
    1 - Absentee-Mean/
        1-absentee-mean-primary.RDS
        2-absentee-mean-negative-control.RDS
        3-absentee-mean-CDC.RDS
        4-absentee-mean-peakwk.RDS
        5-absentee-mean-cdph2.RDS
        6-absentee-mean-cdph3.RDS
    ...
.gitignore
.Rproj
For brevity, not every directory is "expanded", but we can glean some important takeaways from what we do see.
.Rproj files
An "R Project" can be created within RStudio by going to File >> New Project. Depending on where you are with your research, choose the most appropriate option. This will save preferences, working directories, and even the results of running code/data (though I'd recommend starting from scratch each time you open your project, in general). Then, whenever you work on that specific research project, open the project you created to get the full utility of .Rproj files. Opening the project also automatically sets the working directory to the top level of the project.
The 0-config.R file is the single most important file for your project. It is responsible for a variety of common tasks: declaring global variables, loading functions, declaring paths, and more. Every other file in the project will begin with source("0-config.R"), and its role is to reduce redundancy and create an abstraction layer that allows you to make changes in one place (0-config.R) rather than in five different files. To this end, paths that are referenced in multiple scripts (e.g., a merged_data_path) can be declared in 0-config.R and simply referred to by variable name in scripts. If you ever want to change things, rename them, or even switch from a downsample to the full data, all you then need to do is modify the path in one place, and the change will automatically propagate throughout your project. See the example config file for more details. The paths defined in the 0-config.R file assume that users have opened the .Rproj file, which sets the working directory to the top level of the project.
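As an illustration of the idea, here is a minimal sketch of what a 0-config.R could contain. It is not the lab's actual config file: apart from merged_data_path, which the text uses as an example, every directory, file, and variable name below is a hypothetical placeholder.

```r
# 0-config.R: a minimal sketch, not the lab's actual config file.
# All names are hypothetical except merged_data_path (used as an example in the text).

# Global variables ---------------------------------------------------------
# Hypothetical switch between a downsample and the full data
use_downsample <- TRUE

# Paths --------------------------------------------------------------------
# Paths are relative to the project root, which is the working directory
# once the .Rproj file has been opened.
data_dir    <- "data"                                          # hypothetical
results_dir <- file.path("5 - Results", "1 - Absentee-Mean")

# Hypothetical file names; switching datasets means changing one line here
merged_data_file <- if (use_downsample) "merged-downsample.RDS" else "merged-full.RDS"
merged_data_path <- file.path(data_dir, merged_data_file)

# An analysis script would then start along these lines:
#   source("0-config.R")
#   d <- readRDS(merged_data_path)
```

With this layout, moving from the downsample to the full data is a one-line change to use_downsample, and no analysis script needs to be edited.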
Prefixing file and directory names with numbers that reflect the order in which they are run makes the jumble of alphabetized filenames much more coherent and places similar code and files next to one another. It also helps us understand how data flows from start to finish and allows us to easily map a script to its output (e.g., 2 - Analysis/1 - Absentee-Mean/1-absentee-mean-primary.R => 5 - Results/1 - Absentee-Mean/1-absentee-mean-primary.RDS). If you take nothing else away from this guide, this is the single most helpful suggestion to make your workflow more coherent. Often the particular order of files will be in flux until an analysis is close to completion. At that time, it is important to review file order and naming and to reproduce everything before drafting a manuscript.
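The script-to-output mapping above can be made explicit in code. The helper below is a hypothetical sketch (script_to_result_path is not part of any package or of the lab's code) and assumes the results tree mirrors the analysis tree exactly, as in the mock example.

```r
# Hypothetical helper: derive a results path from an analysis script path.
# Assumes "5 - Results/" mirrors the layout of "2 - Analysis/".
script_to_result_path <- function(script_path,
                                  analysis_dir = "2 - Analysis",
                                  results_dir  = "5 - Results") {
  # Refuse paths outside the analysis directory rather than guess
  prefix <- paste0(analysis_dir, "/")
  stopifnot(startsWith(script_path, prefix))

  # Drop the analysis directory prefix, then swap the .R extension for .RDS
  relative_path <- substring(script_path, nchar(prefix) + 1)
  result_file   <- sub("\\.R$", ".RDS", relative_path)

  file.path(results_dir, result_file)
}

script_to_result_path("2 - Analysis/1 - Absentee-Mean/1-absentee-mean-primary.R")
# returns "5 - Results/1 - Absentee-Mean/1-absentee-mean-primary.RDS"
```

A script could pass the returned path to saveRDS(), so that each output's name and location always match the script that produced it.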