Sharp Sight Labs

DS students, I just want to point you to Sharp Sight Labs:

http://www.sharpsightlabs.com

A founder / author from Sharp Sight Labs (who wrote the post about learning R first for data science) has been very helpful in pointing me to several new developments relevant to exercise physiology, data science, and health analytics.

On the user end – what people wear:

Athos: quite a wired shirt, and just another step in the direction of wearable computing. If you decide to take the course on Getting and Cleaning Data, you will learn techniques in R for getting data from various sources and integrating them into a clean and tidy data set for analysis. In fact, the course project for that class is based on data collected from wearable computing sensors (accelerometers).

Apple Watch: it integrates with the iPhone, of course, and has the potential to provide a very powerful system for health analytics, given the opportunity for data about context, causes, and effects, as pointed out to me by Sharp Sight Labs.

On the backend – where the data exists and data analysis starts:

https://developer.apple.com/healthkit/

http://www.samsung.com/us/globalinnovation/pdf/Samsung_SAMI_Backgrounder.pdf

These are both APIs (well, Samsung’s is called SAMI, and to be honest I am not completely sure what the difference is); an API is an application programming interface. You can learn more about getting data from an API in the Getting and Cleaning Data course, at least enough to get you started if you want to go in that direction.

Why you should start with R for data science

I found this post on R-bloggers helpful and affirming given the upcoming semester. Recall that the directed study students will be learning the statistical programming language R: http://www.r-project.org

Here is a link to the post:

http://www.r-bloggers.com/why-you-should-learn-r-first-for-data-science/

From Sharp Sight Labs:

http://www.sharpsightlabs.com/learn-r-data-science/

Law of large numbers – central limit theorem


I just came up with a simple example of the central limit theorem after teaching this concept to the DPT research methods course and getting a post-class question. In class I had used a normal distribution of HRs from about 270,000 subjects (a large dataset of resting HR and BP that I have from people getting pre-employment testing). We took this sample to be our population for example purposes. The initial purpose was to demonstrate the law of large numbers: as we randomly drew larger and larger samples from this population of 274,000, the sample mean and sample distribution got closer and closer to the population mean and population distribution. The sample takes on the shape of the population, whatever that shape is. If the underlying population distribution is not normal, a large sample will not be normal either; if the population has a right skew, we expect a large sample to show that right skew too. Strictly speaking, the law of large numbers is the statement about the sample mean; the sample's shape matching the population's is a closely related result.
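The idea can be tried out with a few lines of base R. This is only a minimal sketch: the HR dataset from class is not included here, so a simulated right-skewed population (gamma-distributed values, an assumption chosen purely for illustration) stands in for it, and the names `population`, `sizes`, and `sample_mean_at` are made up for this example.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Assumption: simulated right-skewed values stand in for the real HR data
population <- rgamma(274000, shape = 2, scale = 10)

# Sample sizes to compare, from small to large
sizes <- c(10, 100, 1000, 10000, 100000)

# Mean of one random sample (without replacement) at each size
sample_mean_at <- sapply(sizes, function(n) mean(sample(population, n)))

# Compare each sample mean with the population mean
results <- data.frame(
  n = sizes,
  sample_mean = sample_mean_at,
  abs_error = abs(sample_mean_at - mean(population))
)
print(results)

# A large sample keeps the right skew of the population
hist(sample(population, 100000), breaks = 100,
     main = "Large sample from a right-skewed population")
```

The error for any single small sample can be large or small by chance, but for the largest samples it should sit very close to zero, and the histogram of a large sample should look like the population, skew included.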

The central limit theorem (sometimes confused with the law of large numbers) is about the mean and distribution of “sample means.” You should think of it as the mean of means (though it also applies to sums). If we have a population with a “non-normal” distribution, sample from it many times with the same sample size each time, take the mean of each sample, and then make a histogram of those “sample means,” we end up with an approximately normal distribution, provided the samples are reasonably large and the population has a finite variance. This holds whatever the shape of the underlying population distribution.

To demonstrate, I wrote the following R code, which first creates a population with a log distribution and 500,000 “subjects.” A function then samples from that population, with parameters you can vary (the number of samples drawn from the population and the number of subjects per sample), and creates a new data vector of the “sample means.” The only R add-on package used is dplyr.

The code is attached, and below is the underlying population distribution as well as the sample means distribution. This is the central limit theorem at work.

CLT_code

population

sampleMeans
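For readers without the attachment, the procedure described above can be sketched roughly as follows. This is not the attached code: it uses base R only rather than dplyr, it assumes a log-normal population (via `rlnorm`) as one reading of “log distribution,” and the function name `sample_means` and its parameters `n_samples` and `sample_size` are made up for this example.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Assumption: a log-normal population of 500,000 "subjects"
population <- rlnorm(500000, meanlog = 0, sdlog = 1)

# Draw n_samples random samples of sample_size subjects each
# and return the vector of their means
sample_means <- function(pop, n_samples = 1000, sample_size = 50) {
  replicate(n_samples, mean(sample(pop, size = sample_size)))
}

means <- sample_means(population, n_samples = 2000, sample_size = 100)

# Skewed population next to the roughly bell-shaped sample means
par(mfrow = c(1, 2))
hist(population, breaks = 100, main = "population")
hist(means, breaks = 40, main = "sampleMeans")

# The sample means center on the population mean, with a spread
# close to the population SD divided by sqrt(sample_size)
c(mean_of_means = mean(means), population_mean = mean(population))
c(sd_of_means = sd(means), expected_sd = sd(population) / sqrt(100))
```

Varying `sample_size` is the interesting part: with very small samples from a population this skewed, the histogram of means still leans to the right, and it becomes more symmetric as the sample size grows.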