This website has two primary objectives: first, to illustrate how statistics is applied in scientific contexts, including both *designed experiments* and *observational studies*; and second, to demonstrate the principles of *Reproducible Research*.

We are very concerned about the general public’s growing distrust of conclusions drawn from scientific studies. The public’s skepticism is understandable. In 2005, Dr. John P. A. Ioannidis published an essay in *PLoS Medicine* titled *Why Most Published Research Findings Are False*, in which he explained the scope of the irreproducibility problem and offered recommendations, grounded in sound statistical principles, for addressing this serious issue. We believe strongly that one very important way to increase the quality of scientific studies, so that they yield reproducible results, is to apply the principles of *Reproducible Research* (see the next item for more information).

The fundamental principle of *Reproducible Research* is that every published scientific study should make available all of the raw data and numerical processing routines used by its authors. The goal is to enable anyone reading the resulting publication to reproduce *exactly*, and with minimal effort, every numerical result and figure in the publication that relies upon data.

The following quote is taken from Lecture 2 of a Coursera course titled *Reproducible Research*.

“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually conducted the analysis are available.”

The target audience is scientists who want to learn more about applying principles of *Reproducible Research* in their daily work.

There are many websites dedicated to explaining statistical concepts. For example, Wikipedia contains high-quality entries for virtually any common statistical concept. However, many of these entries are written by statisticians for statisticians. For scientists with no formal statistical training, using these entries to try to gain an understanding of fundamental statistical concepts applicable to their own field is daunting.

For most scientists, the introduction to statistics came through a *crash course* incorporated into their field of study. For example, chemistry students are universally introduced to the concepts of *sample* and *population* *means* and *standard deviations* in their first chemistry course.
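As a brief sketch of that distinction, the short R snippet below computes a sample mean and contrasts the *sample* standard deviation (which divides by n - 1) with the *population* standard deviation (which divides by n). The five measurement values are hypothetical, chosen purely for illustration:

```r
# Five hypothetical replicate measurements (made-up values, for illustration only)
x <- c(10.1, 10.3, 9.8, 10.0, 10.2)
n <- length(x)

sampleMean <- mean(x)                            # arithmetic mean: sum(x) / n
sampleSD   <- sd(x)                              # sample SD: divides by (n - 1)
popSD      <- sqrt(sum((x - sampleMean)^2) / n)  # population SD: divides by n

sampleMean  # 10.08
sampleSD    # ~0.192
popSD       # ~0.172; always smaller than the sample SD for the same data
```

Note that R’s built-in `sd()` always uses the n - 1 (sample) divisor, which is why the population version is computed by hand here.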

Because these introductory courses have so much material to cover, the instructors cannot dedicate substantial time to teaching statistical concepts, so instead, they teach just the minimal amount needed to solve the problem at hand. When statistical concepts are introduced in this manner, they are presented in an algorithmic fashion, where students are told *which* formulas to use for a limited problem set, but the underlying concepts are not taught. Without such knowledge, a scientist’s toolbox is extremely limited.

Traditional books have served humanity well in conveying knowledge. However, with the rise of the web, websites can offer interactive pages that make complex concepts understandable in ways static pages never could. If a picture is worth a thousand words, then a well-designed, interactive web page is worth a thousand pictures.

Many of the pages on this website will use dynamically generated figures. Initially, these figures will be rendered using the R Markdown package, because we have experience with that framework. However, we are not opposed to using other frameworks for *Reproducible Research*, such as Python’s Jupyter Notebook. Every example will contain the actual code used to generate the figure. Interested readers wishing to replicate the figures can simply copy the code on the page. Alternatively, the complete codebase can be obtained from a GitHub repository located at https://github.com/ScientificProgrammer/WillItReplicate.git. If you are interested in contributing and know how to use GitHub, feel free to `clone` the repository, make changes, and submit a `pull` request. Obviously, we cannot guarantee that all of your changes will be accepted, but we will certainly be happy to incorporate content that we feel enhances the site.

The block of code shown below was used to generate the figure that follows it. If you would like to run the code yourself, simply copy and paste it into R. We recommend RStudio, which is free and can be downloaded at https://www.rstudio.com/products/rstudio/download/; before installing RStudio, you need a current version of R, which can be downloaded at https://www.r-project.org/.

```r
# Define constants
xBegin <- -4
xEnd <- 4
vNumDataPoints <- 501
mean1 <- -0.5
mean2 <- 0.5
sd1 <- 0.25
sd2 <- 0.25

# Define functions
ComputeYVals <- function(vXVals = NULL, vMean = 0, vSD = 1) {
  if (is.null(vXVals)) stop("Error: parameter 'vXVals' must not be NULL")
  dnorm(x = vXVals, mean = vMean, sd = vSD)
}

DrawPolygon <- function(xValsForward = NULL, xValsReverse = NULL,
                        yValsForward = NULL, yValsReverse = NULL,
                        pColor = "red", pAlpha = 0.5, plty = 0) {
  if (is.null(xValsForward) || is.null(yValsForward)) {
    stop("Error: xValsForward and yValsForward must not be NULL\n")
  }
  if (is.null(xValsReverse)) {
    xValsReverse <- rev(xValsForward)
  }
  if (is.null(yValsReverse)) {
    yValsReverse <- rep(0, length(yValsForward))
  }
  polygon(x = c(xValsForward, xValsReverse),
          y = c(yValsForward, yValsReverse),
          col = adjustcolor(col = pColor, alpha.f = pAlpha),
          lty = plty)
}

# Working code
xvals <- seq(from = xBegin, to = xEnd, length.out = vNumDataPoints)
y1_vals <- ComputeYVals(vXVals = xvals, vMean = mean1, vSD = sd1)
y2_vals <- ComputeYVals(vXVals = xvals, vMean = mean2, vSD = sd2)
set1 <- cbind.data.frame(xvals, yvals = y1_vals)
set2 <- cbind.data.frame(xvals, yvals = y2_vals)

# Set up an empty plot region, then draw both density curves
plot(x = NULL,
     y = NULL,
     xlim = range(xBegin, xEnd),
     ylim = extendrange(x = range(y1_vals, y2_vals), f = 0.10),
     main = "Conceptual Illustration of Two Separate Normal Distributions",
     xlab = "x",
     ylab = "Probability Density")
points(xvals, y1_vals, lty = "dashed", type = "l")
points(xvals, y2_vals, lty = "dashed", type = "l")

# Color the portion of the left distribution that doesn't overlap with the right
DrawPolygon(xValsForward = set1$xvals[set1$xvals <= 0],
            xValsReverse = rev(set2$xvals[set2$xvals <= 0]),
            yValsForward = set1$yvals[set1$xvals <= 0],
            yValsReverse = rev(set2$yvals[set2$xvals <= 0]),
            pColor = "red")

# Color the portion of the right distribution that doesn't overlap with the left
DrawPolygon(xValsForward = set2$xvals[set2$xvals >= 0],
            xValsReverse = rev(set1$xvals[set1$xvals >= 0]),
            yValsForward = set2$yvals[set2$xvals >= 0],
            yValsReverse = rev(set1$yvals[set1$xvals >= 0]),
            pColor = "yellow")

# Color the region where the two distributions overlap
DrawPolygon(xValsForward = c(set2$xvals[set2$xvals <= 0], set1$xvals[set1$xvals >= 0]),
            xValsReverse = NULL,
            yValsForward = c(set2$yvals[set2$xvals <= 0], set1$yvals[set1$xvals >= 0]),
            yValsReverse = NULL,
            pColor = "green")

# Mark the two means with dashed vertical lines
abline(v = c(mean1, mean2), lty = "dashed")
```

Dr. Eric Milgram holds a BS in chemistry and PhD in analytical chemistry, both from the University of Florida’s Department of Chemistry. His PhD research focused on designing and building atmospheric pressure ionization sources for ultra high resolution mass spectrometry applications.

After earning his PhD, he split most of his professional career among the pharmaceutical, biotechnology, and food & beverage industries. He has worked for numerous well-known organizations, including the US Centers for Disease Control and Prevention, Pfizer, Metabolon, and PepsiCo.

Dr. Milgram’s professional interests include applications of data science to analytical chemistry, especially machine learning. He is a big fan of the RStudio environment.

Dr. S. Stanley Young is a retired researcher who worked at Eli Lilly, GlaxoSmithKline, and the National Institute of Statistical Sciences.

Dr. Young graduated from North Carolina State University with a BS, an MES, and a PhD in Statistics and Genetics. He worked in the pharmaceutical industry on all phases of pre-clinical research. He has authored or co-authored over 60 papers, including six *best paper* award winners, co-authored the highly cited book *Resampling-Based Multiple Testing*, and holds three issued patents.

Dr. Young is interested in all aspects of applied statistics, with special interest in chemical informatics and biological informatics. His current research interest is in the area of data mining.

Dr. Young is a *fellow* of the American Statistical Association and the American Association for the Advancement of Science. He is also an adjunct professor of statistics at North Carolina State University, the University of Waterloo, and the University of British Columbia, where he has co-directed thesis work.