Today’s post on my other blog might be of interest to developing data scientists. It is basically about the limitations of data: it covers Deflategate, inference and concerns about hypothesis testing.

# Monthly Archives: January 2015

# Sharp Sight Labs

DS students: I just want to point you to Sharp Sight Labs.

A founder / author from Sharp Sight Labs (who wrote the post about learning R first for data science) has been very helpful in pointing me to several new developments relevant to exercise physiology, data science and health analytics.

On the user end – what people wear:

Athos: quite a wired shirt, and just another step in the direction of wearable computing. If you decide to take the course on Getting and Cleaning Data, you will learn techniques in R for getting data from various sources and integrating them into a clean and tidy data set for analysis. In fact, the course project for that class is based on data collected from wearable computing sensors (accelerometers).

Apple Watch: it integrates with the iPhone, of course, and has the potential to provide a very powerful system for health analytics, given the opportunity for data about context, causes and effects, as Sharp Sight Labs pointed out to me.

On the backend – where the data exists and data analysis starts:

https://developer.apple.com/healthkit/

http://www.samsung.com/us/globalinnovation/pdf/Samsung_SAMI_Backgrounder.pdf

These are both APIs (well, Samsung’s is an AMI, I suppose; to be honest, I am not completely sure what the difference is). An API is an application programming interface. You can learn more about getting data from an API in the Getting and Cleaning Data course, at least enough to get you started if you want to go in that direction.

# GitHub cheat sheet

Original Article: http://andrewgelman.com/2015/01/20/github-cheat-sheet/

“Mike Betancourt pointed us to this page. Maybe it will be useful to you too.”

The post Github cheat sheet appeared first on Statistical Modeling, Causal Inference, and Social Science.

# Why you should start with R for data science

I found this post on R-bloggers helpful and affirming given the upcoming semester. Recall the directed study students will be learning the statistical programming language R: http://www.r-project.org

Here is a link to the post:

http://www.r-bloggers.com/why-you-should-learn-r-first-for-data-science/

From Sharp Sight Labs:

## Jawbone Summer Internships in data science

Jawbone is looking for summer interns for data science. Jawbone makes the UP, a wrist-worn device that helps people track activity, exercise and sleep.

https://jawbone.com/careers/ooRJZfwH

Last year’s intern projects are posted here:

https://jawbone.com/blog/jawbone-up-food-data/

https://jawbone.com/blog/circadian-rhythm/

## Law of large numbers – central limit theorem

I just came up with a simple example of the central limit theorem after teaching this concept to the DPT research methods course and getting a post-class question. In class I had used a normal distribution of HRs from about 270,000 subjects (a large dataset of resting HR and BP that I have from people getting pre-employment testing). We took this sample to be our population for example purposes. The initial purpose was to demonstrate the law of large numbers: as we randomly drew larger and larger samples from this population of 274,000, the sample mean and sample distribution came closer and closer to the population mean and population distribution. If the underlying population distribution is not normal, then that non-normal distribution is what a large sample shows. So if a population distribution has a right skew, we expect larger samples to show that same right skew. That is the law of large numbers.
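The law of large numbers idea above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the "population" below is simulated, right-skewed, made-up data (not the actual resting HR dataset), and the names `population`, `skewness` and `lln` are hypothetical.

```r
set.seed(2015)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the resting HR population (NOT the real data):
# 274,000 right-skewed values from a shifted gamma distribution.
population <- 50 + rgamma(274000, shape = 4, scale = 5)
pop_mean <- mean(population)

# Simple moment-based skewness (hypothetical helper, base R only).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Draw random samples of increasing size and compare each sample's mean
# and skewness with the population values.
sample_sizes <- c(10, 100, 1000, 10000, 100000)
draws <- lapply(sample_sizes, function(n) sample(population, n))

lln <- data.frame(
  n           = sample_sizes,
  sample_mean = sapply(draws, mean),
  sample_skew = sapply(draws, skewness)
)
lln$abs_error <- abs(lln$sample_mean - pop_mean)

print(c(pop_mean = pop_mean, pop_skew = skewness(population)))
print(lln)
```

As the sample size grows, the `abs_error` column should generally shrink, and the skewness of the larger samples should settle near the population skewness rather than drifting toward zero.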

The central limit theorem (sometimes confused with the law of large numbers) is about the mean and distribution of “sample means.” You can think of it as being about the mean of means (though it also applies to sums). If we have a population with a “non-normal” distribution, sample from it many times with the same sample size each time, take the mean of each sample, and then make a histogram of those “sample means,” we end up with an approximately normal distribution, provided each sample is reasonably large and the population has finite variance. This holds no matter what the shape of the underlying population distribution might be.

To demonstrate, I wrote R code that first creates a population with a log distribution and 500,000 “subjects.” A function then samples from that population with parameters you can vary (the number of samples drawn from the population and the number of subjects per sample) and creates a new data vector of the “sample means.” The only R add-on package used is dplyr.

The code is attached, and below is the underlying population distribution as well as the sample means distribution. This is the central limit theorem at work.
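For readers without the attachment, the simulation described above can be sketched roughly as follows. This is not the attached code. It is a minimal base R version (no dplyr) that assumes "log distribution" means log-normal, and the function name `sample_means` and its parameter names are hypothetical.

```r
set.seed(2015)  # fixed seed so the run is repeatable

# Assumption: the "log distribution" is taken to be log-normal.
population <- rlnorm(500000, meanlog = 0, sdlog = 1)

# Hypothetical helper: draw `n_samples` random samples of `sample_size`
# subjects each and return the vector of their means.
sample_means <- function(pop, n_samples, sample_size) {
  replicate(n_samples, mean(sample(pop, sample_size)))
}

means <- sample_means(population, n_samples = 2000, sample_size = 200)

# The sample means centre on the population mean, and their spread is
# close to the population SD divided by sqrt(sample size).
print(c(pop_mean = mean(population), mean_of_means = mean(means)))
print(c(expected_se = sd(population) / sqrt(200), sd_of_means = sd(means)))

# Side-by-side histograms; only drawn in an interactive R session.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  hist(population, breaks = 200, main = "Population", xlab = "value")
  hist(means, breaks = 40, main = "Sample means", xlab = "mean of 200 subjects")
  par(op)
}
```

Varying `n_samples` and `sample_size` shows how the histogram of sample means tightens and looks more bell-shaped as the sample size grows, even though the population itself stays strongly right-skewed.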

# Semester start – data scientists toolbox

For the first full month of the semester the Directed Study students will be learning about various tools in the data scientist’s toolbox. They will work through a self-guided online tutorial (course) on this topic by Jeff Leek, PhD, a professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health. They will learn about the reasons for data analysis and the basics of R (getting it, installing it and a bit about using it; see www.r-project.org), and they will learn about version control and online repositories that include version control, such as GitHub. For example: https://github.com/scollinspt

For the directed study I have started a group GitHub repository:

https://github.com/HFP-analytics

Students will share their public GitHub site with me so that I can follow their repositories, which will grow as they learn to perform statistical programming with R. They will meet with me as they need, when they are able, to get help with their self-guided learning. A huge part of doing statistical programming is learning to learn new techniques and being resourceful with the tools of the trade. This includes knowing about them but, more importantly, learning how to access them and develop skills independently. People often ask me how I learned SPSS (when I was using SPSS; I have switched to R for just about all my data analysis) and why I could do so much with SPSS. I learned it by using it to solve problems: first small problems and then, as my abilities grew, bigger ones. My seeming proficiency with SPSS was not really with SPSS at all. It was with an approach to learning on my own, with the tools I had available, to solve the problems at hand.

I hope for the same developmental process for the directed study students. Eventually (later this semester and next semester) we can start doing team projects, which will be housed on the group repository so we can each contribute, see what others are doing for the project, and have version control for the code produced. Finally, when we have completed a project we will have done so following the principles of “reproducible research”: our de-identified data (from public sources) will be available, as will the code used to process and analyze it, and the results. All of it will be free for people to see, manipulate and confirm, that is, to reproduce if they have questions about how the data was used to generate the results presented.

# Reasons we do analysis

We can consider several reasons that we “do analysis” – some that have been listed include:

- Description (descriptive analysis)
- Exploration (exploratory analysis)
- Inference (as in inductive inference – inferential analysis)
- Prediction (a form of inferential – predictive analysis)
- Causation (a form of inferential – causal analysis)
- Mechanism (a form of inferential – mechanistic analysis)

But at the end of the day there is really one reason we “do analysis.” We analyze for knowledge. Analysis is a bridge between the empirical and the rational, from sensory observations to mental models.

# Read here about the popularity of R amongst data scientists

# Exercise Physiology data analysis directed study – spring 2015

This spring five students will embark on a new opportunity, a directed study geared toward equipping them with skills quite different from, yet complementary to, those they learn in the Exercise Physiology program at UMass Lowell. The five students have agreed to participate in two semesters (a full academic year) of directed study. In the first semester they will learn about data analysis with hands-on programming using the R statistical programming language (www.r-project.org). Despite the steep learning curve of using R, it is popular, powerful, open source and free! So students will be able to use the skills they obtain anywhere they end up going, not bound by expensive licenses to use the tools. If they decide to, they could get so advanced with R that they can develop their own customized R packages for obtaining data, cleaning data for analysis, or running analyses in whatever area of exercise physiology they pursue. During the second semester they will complete a data analysis project using open source data from a variety of possible sources (as part of this semester’s exercise, the students will seek out and build a list of links to open source data for their projects next semester).

Data science (analysis) skills are highly sought after (see here, and here, and here), and there is an increasing demand for people with cross-disciplinary skills. With mobile technologies allowing measurement of so many health, fitness and performance metrics, there is a data revolution creating a need for people who understand those metrics (such as exercise physiologists) to also know how to analyze the data and generate knowledge from it (using hacking skills and statistics). Companies such as Garmin, Nike, MapMyFitness (owned by Under Armour), Zephyr, Polar and Wahoo are all competing not only for the hardware systems to measure important metrics, but also for the analysts who can help develop the proper tools for going from data to understanding to action.

The Exercise Physiology program provides the “substantive expertise,” and the core curriculum requirements provide an introduction to quantitative reasoning and statistics. This directed study will build on those foundations and further develop the statistics skills while adding the hacking skills (computer programming and self-directed problem solving).

Source: Drew Conway, Sept 2010. Reproduced under a Creative Commons License.

During this semester I will regularly update the blog to keep anyone interested up to date with our progress. Also, if you have a data analysis project in mind, something you could use help with – send me an email and let me know (see the About page).