Semester start – data scientist’s toolbox

For the first full month of the semester, the Directed Study students will be learning about various tools in the data scientist’s toolbox. They will work through a self-guided online tutorial (course) on this topic by Jeff Leek, PhD, a professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health. They will learn about the reasons for data analysis and the basics of R (getting it, installing it, and a bit about using it), and they will learn about version control and online repositories that include version control, such as GitHub – for example:

For the directed study I have started a group GitHub repository:

Students will share their public GitHub site with me so that I can follow their repositories, which will grow as they learn to perform statistical programming with R. They will meet with me as they need, when they are able, to get help with their self-guided learning. A huge part of doing statistical programming is learning to learn new techniques – being resourceful with the tools of the trade. This includes knowing about the tools but, more importantly, learning how to access them and develop skills independently. People often ask me how I learned SPSS (when I was using SPSS; I have since switched to R for just about all my data analysis) and why I could do so much with it. I learned it by using it to solve problems: first small problems and then, as my abilities grew, larger ones. My seeming proficiency with SPSS was not really with SPSS at all; it was with an approach to learning on my own, using the tools I had available to solve the problems at hand.

I hope for the same developmental process for the directed study students. Eventually (later this semester and next semester) we can start doing team projects, which will be housed on the group repository so that we can each contribute, see what others are doing for the project, and have version control for the code produced. Finally, when we have completed a project, we will have done so following the principles of “reproducible research”: our de-identified data (from public sources) will be available, as will the code used to process and analyze it, and the results. All of it will be free for people to see, manipulate, and confirm; that is, to reproduce if they have questions about how the data were used to generate the results presented.