# tocID <- "FND-STA-Significance.R" # # # Purpose: A Bioinformatics Course: # R code accompanying the FND-STA-Significance unit. # # Version: 1.3 # # Date: 2017-09 - 2020-09 # Author: Boris Steipe (boris.steipe@utoronto.ca) # # Versions: # 1.3 2020 Maintenance. Add sample solution. # 1.2 Update set.seed() usage # 1.1 Corrected treatment of empirical p-value # 1.0 First contents # # TODO: # # # == DO NOT SIMPLY source() THIS FILE! ======================================= # # If there are portions you don't understand, use R's help system, Google for an # answer, or ask your instructor. Don't continue if you don't understand what's # going on. That's not how it works ... # ============================================================================== #TOC> ========================================================================== #TOC> #TOC> Section Title Line #TOC> ------------------------------------------------------------------ #TOC> 1 Significance and p-value 49 #TOC> 1.1 Significance levels 60 #TOC> 1.2 probability and p-value 77 #TOC> 1.2.1 p-value illustrated 109 #TOC> 2 One- or two-sided 165 #TOC> 3 Significance by integration 209 #TOC> 4 Significance by simulation or permutation 215 #TOC> 5 Final tasks 327 #TOC> 6 Sample solutions 336 #TOC> 6.1 338 #TOC> 6.2 342 #TOC> 6.3 346 #TOC> #TOC> ========================================================================== # = 1 Significance and p-value ============================================ # The idea of the probability of an event has a precise mathematical # interpretation, but how is it useful to know the probability? Usually we are # interested in whether we should accept or reject a hypothesis based on the # observations we have. A rational way to do this is to say: if the probability # of observing the data is very small under the null-hypothesis, then we will # assume the observation is due to something other than the null-hypothesis. But # what do we mean by the "probability of our observation"? And what is "very # small"? # == 1.1 Significance levels =============================================== # A "very small" probability is purely a matter of convention - a cultural # convention. In the biomedical field we usually call probabilities of less then # 0.05 (5%) small enough to reject the null-hypothesis. Thus we call # observations with a probability of less than 0.05 "significant" and if we want # to highlight this in text or in a graph, we often mark them with an asterisk # (*). Also we often call observations with a probability of less than 0.01 # "highly significant" and mark them with two asterisks (**). But there is no # special significance in these numbers, the cutoff point for significance could # also be 0.0498631, or 0.03, or 1/(pi^3). 0.05 is just the value that the # British statistician Ronald Fisher happened to propose for this purpose in # 1925. Incidentally, Fisher later recommended to use different cutoffs for # different purposes (cf. # https://en.wikipedia.org/wiki/Statistical_significance). # == 1.2 probability and p-value =========================================== # But what do we even mean by the probability of an observation? # Assume I am drawing samples from a normal distribution with a mean of 0 and a # standard deviation of 1. The sample I get is ... set.seed(sqrt(5)) x <- rnorm(1) set.seed(NULL) print(x, digits = 22) # [1] -0.8969145466249813791748 # So what's the probability of that number? Obviously, the probability of # getting exactly this number is very, very, very small. 
# But, also obviously, the fact that this exact number is improbable does not
# mean that observing it is in any way significant - we always observe some
# number. That's not what we mean in this case. There are several implicit
# assumptions when we speak of the probability of an observation:
#   1: the observation can be compared to a probability distribution;
#   2: that distribution can be integrated between any specific value
#      and its upper and lower bounds (or +- infinity).

# Then what we really mean by the probability of an observation in the context
# of that distribution is: the probability of observing that value, or a value
# more extreme than the one we have. We call this the p-value. Note that we
# are not talking about an individual number anymore, we are talking about the
# area under the curve between our observation and the upper (or lower) bound
# of the curve, as a fraction of the whole.


# ===  1.2.1  p-value illustrated

# Let's illustrate. First we draw a million random values from our standard
# normal distribution:

N <- 1e6                      # one million
set.seed(112358)              # set RNG seed for repeatable randomness
r <- rnorm(N)                 # N values from a normal distribution
set.seed(NULL)                # reset the RNG

# Let's see what the distribution looks like:

(h <- hist(r))

# The histogram details are now available in the list h - e.g. h$counts

# Where is the value we have drawn previously?
abline(v = x, col = "#EE0000")

# How many values are smaller?
sum(r < x)

# Let's color the bars:
#  - first, make a vector of red and green colors for the bars with breaks
#    smaller and larger than x, white for the bar that contains x ...

hCol <- rep("#EE000044", sum(h$breaks < x) - 1)
hCol <- c(hCol, "#FFFFFFFF")
hCol <- c(hCol, rep("#00EE0044", sum(h$breaks > x) - 1))

#  - ... then plot the histogram, with colored bars ...

hist(r, col = hCol)

#  - ... add two colored rectangles into the white bar ...

idx <- sum(h$breaks < x)
xMin <- h$breaks[idx]
xMax <- h$breaks[idx + 1]
y <- h$counts[idx]

rect(xMin, 0, x,    y, col = "#EE000044", border = TRUE)
rect(x,    0, xMax, y, col = "#00EE0044", border = TRUE)

#  - ... and a red line for our observation.

abline(v = x, col = "#EE0000", lwd = 2)

# The p-value of our observation is the red area as a fraction of the
# whole histogram (red + green).

# Task:
#   Explain how the expression  sum(r < x)  works to give us a count of values
#   with the property we are looking for. E.g., examine  -4:4 < x

# Task:
#   Write an expression to estimate the probability that a value drawn from
#   the vector r is less-or-equal to x. The result you get will depend on the
#   exact values that went into the vector r, but it should be close to 0.185.
#   That expression is the p-value associated with x.
#   (Sample solution 6.1)


# =    2  One- or two-sided  ===============================================

# The shape of our histogram confirms that the rnorm() function has returned
# values that appear distributed according to a normal distribution. In a
# normal distribution, readily available tables tell us that 5% of the values
# (i.e. our significance level) lie more than 1.96 (or approximately 2)
# standard deviations away from the mean. Is this the case here? How many
# values in our vector r are larger than 1.96?

sum(r > 1.96)
# [1] 24589

# Wait - that's about 2.5% of 1,000,000, not 5% as expected. Why?

# The answer is: we have to be careful to distinguish one-sided from two-sided
# questions. "Two standard deviations away from the mean" means either larger
# than 1.96 or smaller than -1.96. This can give rise to errors.
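# To see both tails explicitly, here is a minimal check (assuming r still
# holds the 1e6 standard-normal samples drawn above):

sum(abs(r) > 1.96)                 # values in either tail ...
sum(abs(r) > 1.96) / length(r)     # ... together make up close to 5%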
# If we are simply interested in outliers, no matter whether larger or
# smaller, then the 1.96 SD cutoff for significance is correct. But if we are
# specifically interested in, say, larger values, because a smaller value is
# not meaningful, then the significance cutoff, expressed as standard
# deviations, is relaxed. We can use the quantile function to see what the
# cutoff values are:

quantile(r)
quantile(r, probs = c(0.025, 0.975)) # for the symmetric 2.5% boundaries
# close to ± 1.96, as expected

quantile(r, probs = 0.95)            # for the single 5% boundary
# close to 1.64 . Check counts to confirm:
sum(r > quantile(r, probs = 0.95))
# [1] 50000
# which is 5%, as expected.

# Task:
#   Use abline() to add the p = 0.05 boundary for smaller values to the
#   histogram.
#   (Sample solution 6.2)

# To summarize: when we evaluate the significance of an event, we divide a
# probability distribution into two parts at the point where the event was
# observed. We then ask whether the integral over the more extreme part is
# less or more than 5% of the whole. If it is less, we deem the event to be
# significant.


# =    3  Significance by integration  =====================================

# If the underlying probability distribution can be analytically or
# numerically integrated, the significance of an observation can be directly
# computed.


# =    4  Significance by simulation or permutation  =======================

# But whether the integration is correct, or relies on assumptions that may
# not be warranted for biological data, can be a highly technical question.
# Fortunately, we can often simply run a simulation, a random resampling, or a
# permutation, and then count the number of outcomes, just as we did with our
# rnorm() samples. We call this an empirical p-value. (Actually, the
# "empirical p-value" is defined as (Nobs + 1) / (N + 1). )

# Here is an example. Assume you have a protein sequence and you speculate
# that positively charged residues are close to negatively charged residues,
# to balance charge locally. A statistic that would capture this is the mean
# minimum distance between all D,E residues and the closest R,K,H residue.
# Let's compute this for the sequence of yeast Mbp1.

MBP1 <- paste0("MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLK",
               "ETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHA",
               "SKVDRKKAIRSASTSAIMETKRNNKKAEENQFQSSKILGNPTAAPRKRGRPVGSTRGSRR",
               "KLGVNLQRSQSDMGFPRPAIPNSSISTTQLPSIRSTMGPQSPTLGILEEERHDSRQQQPQ",
               "QNNSAQFKEIDLEDGLSSDVEPSQQLQQVFNQNTGFVPQQQSSLIQTQQTESMATSVSSS",
               "PSLPTSPGDFADSNPFEERFPGGGTSPIISMIPRYPVTSRPQTSDINDKVNKYLSKLVDY",
               "FISNEMKSNKSLPQVLLHPPPHSAPYIDAPIDPELHTAFHWACSMGNLPIAEALYEAGTS",
               "IRSTNSQGQTPLMRSSLFHNSYTRRTFPRIFQLLHETVFDIDSQSQTVIHHIVKRKSTTP",
               "SAVYYLDVVLSKIKDFSPQYRIELLLNTQDKNGDTALHIASKNGDVVFFNTLVKMGALTT",
               "ISNKEGLTANEIMNQQYEQMMIQNGTNQHVNSSNTDLNIHVNTNNIETKNDVNSMVIMSP",
               "VSPSDYITYPSQIATNISRNIPNVVNSMKQMASIYNDLHEQHDNEIKSLQKTLKSISKTK",
               "IQVSLKTLEVLKESSKDENGEAQTNDDFEILSRLQEQNTKKLRKRLIRYKRLIKQKLEYR",
               "QTVLLNKLIEDETQATTNNTVEKDNNTLERLELAQELTMLQLQRKNKLSSLVKKFEDNAK",
               "IHKYRRIIREGTEMNIEEVDSSLDVILQTLIANNNKNKGAEQIITISNANSHA")

# first we split this string into individual characters:
v <- unlist(strsplit(MBP1, ""))

# and find the positions of our charged residues
ED  <- grep("[ED]", v)
RKH <- grep("[RKH]", v)

sep <- numeric(length(ED))    # this vector will hold the distances
for (i in seq_along(ED)) {
  sep[i] <- min(abs(RKH - ED[i]))
}

# Task: read and explain this bit of code
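# Aside - for comparison, the same distances can also be computed without an
# explicit loop. This is only a sketch; the name sepV is arbitrary, and the
# result should match the sep vector computed above:

sepV <- vapply(ED, function(pos) min(abs(RKH - pos)), numeric(1))
all(sep == sepV)   # expect TRUE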
# Now that sep is computed, what does it look like?

table(sep)   # these are the minimum distances:
             # 24 of the D,E residues are adjacent to an R,K,H residue;
             # the longest separation is 28 residues.

# What is the mean separation?
mean(sep)

# The value is 4.1 . Is this significant? Honestly, I would be hard pressed
# to solve this analytically. But by permutation it's soooo easy.

# First, we combine what we have done above into a function:

chSep <- function(v) {
  # computes the mean minimum separation of oppositely charged residues
  # Parameter: v (char)       a vector of amino acids in the one-letter code
  # Value:     msep (numeric) mean minimum separation

  ED  <- grep("[EDed]", v)
  RKH <- grep("[RKHrkh]", v)

  sep <- numeric(length(ED))
  for (i in seq_along(ED)) {
    sep[i] <- min(abs(RKH - ED[i]))
  }
  return(mean(sep))
}
# Execute the function to define it.

# Confirm that the function gives the same result as the number we
# calculated above:
chSep(v)

# Now we can produce a random permutation of v, and recalculate:

set.seed(pi)                  # set RNG seed for repeatable randomness
w <- sample(v, length(v))     # This shuffles the vector v. Memorize this
                              # code paradigm. It is very useful.
set.seed(NULL)                # reset the RNG

chSep(w)

# 3.773 ... that's actually less than what we had before.

# Let's do this 10000 times and record the results (takes a few seconds):

N <- 10000
chs <- numeric(N)
for (i in 1:N) {
  chs[i] <- chSep(sample(v, length(v)))   # charge separation of one shuffle
}

hist(chs, breaks = 50)
abline(v = chSep(v), col = "#EE0000")

# Contrary to our expectations, the actual observed mean minimum charge
# separation seems to be larger than what we observe in randomly permuted
# sequences. But is this significant? Your task is to find out.

# Task:
#   Calculate the empirical p-value for chSep(v)
#   (Sample solution 6.3)


# =    5  Final tasks  =====================================================

# From chs, compute the empirical p-value of a mean minimum charge separation
# being larger than or equal to the value observed for the yeast MBP1
# sequence. Note the result in your journal. Is it significant? Also note the
# result of the following expression for validation:

seal(sum(chs))


# =    6  Sample solutions  ================================================

# ==   6.1  ================================================================

# sum(r <= x) / length(r)

# ==   6.2  ================================================================

# abline(v = quantile(r, probs = c(0.05)))

# ==   6.3  ================================================================

# ( x <- (sum(chs >= chSep(v)) + 1) / (length(chs) + 1) )


# [END]