Extensive additions on Poisson and Binomial distributions and NLS fits

This commit is contained in:
hyginn 2020-11-03 21:24:56 +10:00
parent 7bbd19078f
commit 1f2ec21546

View File

@ -6,10 +6,12 @@
#
# Version: 1.4
#
# Date: 2017-10 - 2020-09
# Date: 2017-10 - 2020-11
# Author: Boris Steipe (boris.steipe@utoronto.ca)
#
# Versions:
# 1.5 Extensive additions on Poisson and binomial distributions
# and regression fits
# 1.4 2020 Maintenance
# 1.3 Change from require() to requireNamespace(),
# use <package>::<function>() idiom throughout,
@ -31,22 +33,33 @@
#TOC> ==========================================================================
#TOC>
#TOC> Section Title Line
#TOC> -----------------------------------------------------------------------------
#TOC> 1 Introduction 54
#TOC> 2 Three fundamental distributions 117
#TOC> 2.1 The Poisson Distribution 120
#TOC> 2.2 The uniform distribution 174
#TOC> 2.3 The Normal Distribution 194
#TOC> 3 quantile-quantile comparison 235
#TOC> 3.1 qqnorm() 245
#TOC> 3.2 qqplot() 311
#TOC> 4 Quantifying the difference 328
#TOC> 4.1 Chi2 test for discrete distributions 363
#TOC> 4.2 Kullback-Leibler divergence 454
#TOC> 4.2.1 An example from tossing dice 465
#TOC> 4.2.2 An example from lognormal distributions 588
#TOC> 4.3 Kolmogorov-Smirnov test for continuous distributions 631
#TOC> Section Title Line
#TOC> --------------------------------------------------------------------------
#TOC> 1 Introduction 67
#TOC> 2 Three fundamental distributions 130
#TOC> 2.1 The Poisson Distribution 133
#TOC> 2.2 The hypergeometric distribution 227
#TOC> 2.2.1 Digression: qqplot() and residuals 375
#TOC> 2.2.2 Fitting functions 433
#TOC> 2.2.2.1 Fit a normal distribution (using nls() ) 503
#TOC> 2.2.2.2 Fit a normal distribution (using nlrob()) ) 520
#TOC> 2.2.2.3 Extreme Value distributions: Gumbel 547
#TOC> 2.2.2.4 Extreme Value distributions: Weibull 570
#TOC> 2.2.2.5 Logistic distribution 613
#TOC> 2.2.2.6 Log-Logistic distribution 643
#TOC> 2.2.2.7 Fitting a negative binomial distribution 673
#TOC> 2.2.2.8 Fitting a binomial distribution 726
#TOC> 2.3 The uniform distribution 774
#TOC> 2.4 The Normal Distribution 794
#TOC> 3 quantile-quantile comparison 835
#TOC> 3.1 qqnorm() 845
#TOC> 3.2 qqplot() 911
#TOC> 4 Quantifying the difference 928
#TOC> 4.1 Chi2 test for discrete distributions 963
#TOC> 4.2 Kullback-Leibler divergence 1054
#TOC> 4.2.1 An example from tossing dice 1065
#TOC> 4.2.2 An example from lognormal distributions 1188
#TOC> 4.3 Continuous distributions: Kolmogorov-Smirnov test 1231
#TOC>
#TOC> ==========================================================================
@ -124,9 +137,46 @@ abline(v = qnorm(c(0.01, 0.99)), lwd = 0.5, col = "#CCAAFF")
# if the mean probability of an event is known. Assume we know that there are
# 256 transcription factors among the 6091 protein-coding genes of yeast, then
# the probability of picking a transcription factor at random from all ORFs is
# 256/6091 ~= 4.2%. How many do we expect if we look e.g. at 250 differentially
# expressed genes? This means the mean number of transcription factors we would
# expect in that sample of differentially expressed genes is (250 * 256)/6091.
# 256/6091 ~= 4.2%. How many transcription factors do we expect in a sample of
# 250 genes - like, for example, the top 250 differentially expressed genes in
# an assay? This is a frequently encountered type of question, converging on the
# information contained in a "list of genes". Once an assay has yielded a list
# of genes, what can we learn from looking at staistical features of its
# membership. In the transcription factor model, the question is: how many
# transcription factors do we expect as members of our list - and did we
# actually observe more or less than that?
#
# It would seem that the the probability of encountering _one_ transcription
# factor among our genes is 256/6091 and therefore the number of transcription
# factors we expect to choose at random is (250 * 256)/6091 i.e. 10.50731 in the
# average over many trials. Let's repeat our experiment a million times and see what we get:
N <- 1000000 # One million trials
genes <- numeric(6091) # All genes are 0
genes[1:256] <- 1 # TFs are 1
hAssays <- numeric(N) # initialize vector
set.seed(112358)
for (i in 1:N) {
pBar(i, N)
hAssays[i] <- sum(sample(genes, 250)) # sum of TFs in our sample in this trial
}
set.seed(NULL)
# And the average is:
mean(hAssays) # 10.50293
# ... which is superbly close to our expectation of 10.50731
# All good - but we won't get 10.5 transcription factors in our assay. We'll
# observe five. Or thirteen. Or thirtysix. Or none at all ... and then we ask
# ourselves: is the number of observed transcription factors significantly
# different from what we would have expected if our experiment identified a
# transcription factor just as likely as it identified any other gene? To answer
# this, we need to consider the probability distribution of possible outcomes of
# our assay. Back to the Poisson distribution. In R it is implemented as dpois()
# and its parameters are: the number of observed events, and the probability of
# observing an event:
dpois(0, (250 * 256) / 6091) # Probability of seeing no TFs
dpois(1, (250 * 256) / 6091) # Probability of seing one ...
@ -134,44 +184,594 @@ dpois(2, (250 * 256) / 6091) # Probability of seing two ...
dpois(3:10, (250 * 256) / 6091) # Probability of seing from three to ten ...
sum(dpois(0:4, (250 * 256) / 6091)) # Probability of seeing four or less ...
# Lets plot this
N <- 25
x <- dpois(0:N, (250 * 256) / 6091)
names(x) <- as.character(0:N)
midPoints <- barplot(x, col = "#E6FFF6",
axes = TRUE,
ylim = c(0, 0.15),
xlab = "# of TF in set",
ylab = "p")
sum(dpois(0:250, (250*256)/6091)) # The sum over all possibilities is (for
# any probability distribution) exactly one.
# Confirm that our understanding of dpois() is correct, by simulating actual
# trials:
# Lets plot these probabilities ...
nMax <- 28
N <- 1000
x <- dpois(0:nMax, (250 * 256) / 6091)
barMids <- barplot(x,
col = "#FCF3CF",
names.arg = as.character(0:nMax),
cex.names = 0.5,
axes = TRUE,
ylim = c(0, 0.15),
xlab = "# of TF in set",
ylab = "p")
# ... and add our simulated assays:
(th <- table(factor(hAssays, levels = 0:nMax))/N)
points(barMids - 0.55, th, type = "s", col = "firebrick")
abline(v = (((250 * 256) / 6091) * (barMids[2] - barMids[1])) + barMids[1],
col = "#5588FF") # scale to the correct position in the barplot
legend("topright",
legend = c("Poisson", "simulated", "expectation"),
pch = c(22, NA, NA), # one box, two lines only
pt.cex = 1.4, # larger, to match bar-width better
pt.bg = c("#FCF3CF", NA, NA), # bg color only for the box
lty = c(0, 1, 1), # no line for the box
col = c("#000000", "firebrick", "#5588FF"),
bty = "n") # no frame around the legend
# NOTE: The simulation shows us that our expectations about the number of
# transcription factors in a sample are almost, but not exactly Poisson
# distributed. Can you figure out where we went wrong? Maybe the
# Poisson distribution is not quite the correct distribution to
# model our expectations?
# == 2.2 The hypergeometric distribution ===================================
# With that suspicion in mind, we reconsider what we were trying to achieve at
# each point of the Poisson distribution. Assume we have observed seven
# transcription factors in our set of 250 genes. So we asked: what is the
# probability of observing seven? And concluded:
dpois(7, (250 * 256) / 6091) # Probability of seeing seven ...
# ... which is the probability of seven events in 250 choices if the underlying
# probability is equal to fraction of transcription factors among genes. But
# wait: we weren't careful in our simulation. Assume that our first observed
# gene was a transcription factor. Is the probability of the next sample the
# same? Not quite: one transcription factor has been observed, 249 more samples
# need to be considered, and there are now 6090 genes to choose from. I.e. the
# mean probability changes with every observation:
(250 * 256) / 6091 # First sample, a TF: p = 10.50731
(249 * (256 - 1)) / (6091 - 1) # Second sample: p = 10.42611 (-0.992 %)
# This is actually noticeable: the mean probability for transcription factors
# drops by about one percent after a transcription factor is observed. But what
# if we would have observed the first transcription factor as the tenth gene?
(239 * (256 - 1)) / (6091 - 10) # Eleventh sample: p = 10.0222 (-0.954 %)
# This is getting complicated and we need a different way to think about this if
# we don't want to enumerate all possibilities by hand. (There are far too
# many!) Generally, the probability of observing a certain number of events in a
# series is the number of ways to realize the desired outcome, divided by the
# number of possible outcomes. So, if we code transcription factors by "1" and
# other genes by "0", seven transcription factors could be
# 01010100100100100100000000000000000000000 ..., or
# 01110111100000000000000000000000000000000 ..., or
# 11101001000010000000100000000000000000000 ..., or any other combination.
# But crucially, our number of trials is limited and every "success" changes the
# probability of future successes. This is sampling without replacement! And
# sampling without replacement is modeled by the so-called "hypergeometric
# distribution". This is the big difference: when we are sampling WITH
# replacement, we can model the process with a Poisson distribution. When we are
# sampling WITHOUT replacement, we use a hypergeometric distribution instead.
# Let's first re-run our simulated assays. We put the previous run into the
# vector "hAssays" which I prefixed "h" as in "hypergeometric" because I knew
# what was coming, and that I had sampled WITHOUT replacement because that is
# the default for the sample() function. Accordingly, we call the new samples
# "pAssays", where "p" stands for Poisson:
N <- 1000000 # One million trials
genes <- numeric(6091) # All genes are 0
genes[1:256] <- 1 # TFs are 1
x <- numeric(N) # initialize vector
pAssays <- numeric(N) # initialize vector
set.seed(112358)
for (i in 1:N) {
x[i] <- sum(sample(genes, 250)) # sum of TFs in our sample in this trial
pBar(i, N)
pAssays[i] <- sum(sample(genes, 250, replace = TRUE))
}
set.seed(NULL)
(t <- table(x)/N)
# Now the average is:
mean(pAssays) # 10.50312 which is essentially the same as 10.50293
# And the plot ...
nMax <- 28
barMids <- barplot(dpois(0:nMax, (250 * 256) / 6091),
names.arg = as.character(0:nMax),
cex.names = 0.5,
axes = TRUE,
ylim = c(0, 0.15),
xlab = "# of TF in set",
ylab = "p",
col = "#FCF3CF")
abline(v = (((250 * 256) / 6091) * (barMids[2] - barMids[1])) + barMids[1],
col = "#5588FF") # scale to the correct position in the barplot
th <- table(factor(hAssays, levels = 0:nMax))/N
points(barMids - 0.55, th, type = "s", col = "firebrick")
# Here are the new assays:
tp <- table(factor(pAssays, levels = 0:nMax))/N
points(barMids - 0.55, tp, type = "s", col = "seagreen")
# Add these values to the plot
y <- numeric(26) # initialize vector with 26 slots
y[as.numeric(names(t)) + 1] <- t # put the tabled values there (index + 1)
points(midPoints - 0.55, y, type = "s", col = "firebrick")
legend("topright",
legend = c("theoretical", "simulated"),
pch = c(22, 22),
pt.bg = c("#E6FFF6", "firebrick"),
legend = c("Poisson",
"no replacement",
"with replacement",
"expectation"),
pch = c(22, NA, NA, NA),
pt.cex = 1.4,
pt.bg = c("#FCF3CF", NA, NA, NA),
lty = c(0, 1, 1, 1),
col = c("#000000", "firebrick", "seagreen", "#5588FF"),
bty = "n")
# Clearly .. the "correct" simulation, the simulation that is actually
# appropriate for the Poisson distribution, matches the theoretical values
# better. Now let's see how well the hypergeometric distribution matches our
# original simulated assays.
# The function dhyper(x, m, n, k) expects the following parameters:
# x: the number of observed positive events for which to
# compute the probability
# m: the number of positive events in the population
# n: the number of negative eventsw in the population
# k: the number of observations (or trials)
# == 2.2 The uniform distribution ==========================================
dhyper(0, 256, 6091 - 256, 250) # Probability of seeing no TFs
dhyper(1, 256, 6091 - 256, 250) # Probability of seing one ...
dhyper(2, 256, 6091 - 256, 250) # Probability of seing two ...
dhyper(3:10, 256, 6091 - 256, 250) # Probability of three to ten ...
sum(dhyper(0:4, 256, 6091 - 256, 250)) # Probability of seeing four or less ...
sum(dhyper(0:250, 256, 6091-256, 250)) # The sum over all possibilities is (for
# any probability distribution)
# exactly one.
# Lets plot these probabilities like we did above ...
nMax <- 28
x <- dhyper(0:nMax, 256, 6091 - 256, 250)
barMids <- barplot(x, col = "#E6FFF6",
names.arg = as.character(0:nMax),
cex.names = 0.5,
axes = TRUE,
ylim = c(0, 0.15),
xlab = "# of TF in set",
ylab = "p")
abline(v = (mean(hAssays) * (barMids[2] - barMids[1])) + barMids[1],
col = "#5588FF") # scale to the correct position in the barplot
points(barMids - 0.55, th, type = "s", col = "firebrick")
legend("topright",
legend = c("Hypergeometric",
"no replacement",
"expectation"),
pch = c(22, NA, NA),
pt.cex = 1.4,
pt.bg = c("#E6FFF6", NA, NA),
lty = c(0, 1, 1, 1),
col = c("#000000", "firebrick", "#5588FF"),
bty = "n")
# This!
# This is what a correctly simulated distribution looks like.
# === 2.2.1 Digression: qqplot() and residuals
# The indication that something was wrong with our simulation versus the
# theoretical expectation came from the observation that the differences between
# the barplot and theory were not quite random: the values near the mode were
# systematically too high, the values in the tails were systematically too low.
# This is a general principle: when you see a SYSTEMATIC deviation between
# simulation and theory, something is wrong with your understanding. Either
# there is a subtle error in how you set up the simulation, or a subtle
# misunderstanding about the requirements for the particular theory to apply. R
# has a general way to examine such differences: qqplot() plots the deviations
# between theory and observation ordered as ranks, i.e. not affected by absolute
# scale.
oPar <- par(mfrow = c(1, 2))
qqplot( dpois(0:nMax, (250 * 256) / 6091),
th,
xlab = "Poisson distribution",
ylab = "sampling with replacement",
pch = 22, bg = "#FCF3CF")
abline(lm(th ~ dpois(0:nMax, (250 * 256) / 6091)), col = "seagreen")
qqplot( dhyper(0:nMax, 256, 6091 - 256, 250),
th,
xlab = "hypergeometric distribution",
ylab = "sampling with replacement",
pch = 22, bg = "#E6FFF6")
abline(lm(th ~ dhyper(0:nMax, 256, 6091 - 256, 250)), col = "seagreen")
par(oPar)
# Similar information can be obtained from a residual plot, plotting differences
# between prediction and observation:
oPar <- par(mfrow = c(1, 2))
plot( dpois(0:nMax, (250 * 256) / 6091) - as.numeric(th),
type = "b",
ylim = c(-0.006, 0.006),
xlab = "# observations",
ylab = "Poisson density - obs.",
pch = 22,
bg = "#FCF3CF",
col = "#000000")
abline(h = 0, col = "seagreen")
plot( dhyper(0:nMax, 256, 6091 - 256, 250) - as.numeric(th),
type = "b",
ylim = c(-0.006, 0.006),
xlab = "# observations",
ylab = "Hypergeometric density - obs.",
pch = 22,
bg = "#E6FFF6",
col = "#000000")
abline(h = 0, col = "seagreen")
par(oPar)
# === 2.2.2 Fitting functions
# Note that there is a further subtle catch: We did not actually ask about seven
# transcription factors! We asked about the probability of seven _different_
# transcription factors - because, implicit in the assumptions I made about the
# assay (and reasonable for most gene lists), we count duplicates only once!
# Does this actually make a difference? We can model the situation by giving
# each transcription factor a name (or a number), sampling at random, and
# counting how many unique() factors we found in our sample:
N <- 1000000 # One million trials
genes <- numeric(6091) # All genes are initially 0
genes[1:256] <- 1:256 # TFs get a unique "name"
# example sample:
set.seed(11235)
x <- sample(genes, 250, replace = TRUE) # Pick 250 random genes.
x[x != 0] # Note that 189 appears twice.
length(x[x != 0]) # We picked 8 transcription factors ...
unique(x) # but only 7 are unique (plus zero).
length(unique(x)) - 1 # Ignore the zero.
# Do this a million times
uAssays <- numeric(N) # initialize vector
set.seed(112358)
for (i in 1:N) {
pBar(i, N)
uAssays[i] <- length(unique(sample(genes, 250, replace = TRUE))) - 1
}
set.seed(NULL)
# plot the poisson distribution (with replacement) as our baseline
nMax <- 28
barMids <- barplot(dpois(0:nMax, (250 * 256) / 6091),
col = "#FCF3CF",
names.arg = as.character(0:nMax),
cex.names = 0.5,
axes = TRUE,
ylim = c(0, 0.15),
xlab = "# of TF in set",
ylab = "p")
tu <- table(factor(uAssays, levels = 0:nMax))/N
points(barMids - 0.55, tu, type = "s", col = "#EE22FF")
legend("topright",
legend = c("Poisson",
"unique genes"),
pch = c(22, NA),
pt.cex = 1.4,
pt.bg = c("#FCF3CF", NA),
lty = c(0, 1),
col = c("#000000", "#EE22FF"),
bty = "n")
# Clearly, the distribution does not match our model exactly.
# So what is the "correct" distribution that we could apply in this case? There
# may or may not be one readily available. What we can do instead is to use a
# more general model and fit parameters. This takes us to the domain of
# regression analysis and curve fitting. The general approach is as follows:
# - Decide on a statistical model;
# - Express it in parametrized form;
# - Fit the parameters;
# - Analyze the coefficients;
# Insight into the underlying process that generated our data can be obtained
# by analyzing the fitted parameters, or simply plotting the results. Let's
# look at examples of fits to the sampled distribution above:
# ==== 2.2.2.1 Fit a normal distribution (using nls() )
x <- 0:28
plot(0:28, tu, type="s")
fit <- nls(tu ~ ( a / (sig*sqrt(2*pi)) ) * exp( (-1/2)*((x-mu)/sig)^2 ), start = c(a = 1, mu = 10, sig = 3)) # starting values
points(x, predict(fit, list(x = x)), col = "#CC00CC55", lwd = 2,
type = "s")
coef(fit)
# a mu sig
# 0.9990932 10.0717835 3.0588930
sum(resid(fit)^2)
# [1] 0.0001613488
# ==== 2.2.2.2 Fit a normal distribution (using nlrob()) )
# There's a bit of an art to chosing starting parameters correctly and if the
# nls() fit does not converge, more robust methods are called for.
pkg <- "robustbase"
if (! requireNamespace(pkg, quietly = TRUE)) { install.packages(pkg) }
x <- 0:28
plot(0:28, tu, type="s")
fit <- robustbase::nlrob(tu ~ ( a / (sig*sqrt(6.2831853072)) ) *
exp( (-1/2)*((x-mu)/sig)^2 ),
data = data.frame(tu = tu,
x = 0:28),
start = c(a = 1, mu = 10, sig = 3)) # starting values
points(x, predict(fit, list(x = x)), col = "#CC00CC55", lwd = 2,
type = "s")
coef(fit)
# a mu sig
# 1.002162 10.059189 3.071217
sum(resid(fit)^2)
# [1] 0.0001630868
# ==== 2.2.2.3 Extreme Value distributions: Gumbel
# Many processes that involve best-of choices are better modelled with so called extreme-value distributions: here is the Gumbel distribution from the evd package.
pkg <- "evd"
if (! requireNamespace(pkg, quietly = TRUE)) { install.packages(pkg) }
x <- 0:28
plot(0:28, tu, type="s")
fit <- robustbase::nlrob(tu ~ evd::dgumbel(x, loc = L, scale = S),
data = data.frame(tu = tu, x = 0:28),
start = c(L = 7.3, S = 2.82))
points(x, predict(fit, list(x = x)), type = "s", col = "#55DD88")
coef(fit)
# L S
# 9.322110 2.818266
sum(resid(fit)^2)
# [1] 0.001027447
# ==== 2.2.2.4 Extreme Value distributions: Weibull
# Weibull distributions are common in reliabilty analysis. I found the
# distribution particularly hard to fit as it is quite sensitive to inital
# parameter estimates. https://en.wikipedia.org/wiki/Weibull_distribution
# NOTE: the parameter TH is > X
idx <- 4:28
x <- idx
y <- as.numeric(tu)[idx]
plot(x, y, type="s")
dWei <- function(x, th, l, k) {
a <- k/l
b <- ((x-th)/l)^(k-1)
c <- exp(-((x-th)/l)^k)
y <- a * b * c
return(y)
}
set.seed(112358)
fit <- robustbase::nlrob(y ~ dWei(x, th = TH, l = L, k = K),
data = data.frame(y = y,
x = x),
lower = c(TH = 3.5,
L = 8,
K = 2.5),
upper = c(TH = 3.7,
L = 9,
K = 3),
method = "mtl")
points(x, predict(fit, list(x = x)),
col = "#CC00CC55", lwd = 2, type = "s")
coef(fit)
# TH L K
#3.630807 8.573898 2.795116
sum(resid(fit)^2)
# [1] 7.807073e-05
# ==== 2.2.2.5 Logistic distribution
# Similar to normal distribution, but with heavier tails
# https://en.wikipedia.org/wiki/Logistic_distribution
x <- 0:28
y <- as.numeric(tu)
plot(0:28, y, type="s")
dLogi <- function(x, mu, s) {
y <- (exp(-(x-mu)/s)) / (s * (1+exp(-(x-mu)/s))^2)
return(y)
}
fit <- robustbase::nlrob(y ~ dLogi(x, mu = M, s = S),
data = data.frame(y = y,
x = 0:28),
start = c(M = 10, S = 3))
points(x, predict(fit, list(x = x)),
col = "#CC00CC55", lwd = 2, type = "s")
coef(fit)
# M S
# 10.088968 1.891654
sum(resid(fit)^2)
# [1] 0.0004030595
# ==== 2.2.2.6 Log-Logistic distribution
# Spercial case of logistic, often used in survival analysis
# https://en.wikipedia.org/wiki/Log-logistic_distribution
x <- 0:28
y <- as.numeric(tu)
plot(0:28, y, type="s")
dLogLogi <- function(x, a, b) {
y <- ((b/a)*(x/a)^(b-1)) / (1+((x/a)^b))^2
return(y)
}
fit <- robustbase::nlrob(y ~ dLogLogi(x, a = A, b = B),
data = data.frame(y = y,
x = 0:28),
start = c(A = 9.5, B = 4.8))
points(x, predict(fit, list(x = x)),
col = "#CC00CC55", lwd = 2, type = "s")
coef(fit)
# A B
# 10.310181 5.288385
sum(resid(fit)^2)
# [1] 0.0006343927
# ==== 2.2.2.7 Fitting a negative binomial distribution
# The negative binomial distribution is related to the Poisson distribution.
# Assume you are observing events and and counting how many successes of a
# Bernoulli process (essentially a coin-flip) we encounter before we encounter a
# failure. Unlike the Poisson, it models mean and variance separately and is therefore especially useful for overdispersed functions. (cf. https://en.wikipedia.org/wiki/Negative_binomial_distribution)
#
x <- 0:28
plot(x, tu, type="s")
# Negative binomial
dNB <- function(x, r, p) {
# Note: r is an integer!
y <- choose((x + r - 1), (r - 1)) * (1-p)^x * p^r
return(y)
}
set.seed(112358)
RR <- 104
fit <- robustbase::nlrob(tu ~ dNB(x, r = RR, p = P),
data = data.frame(x = x, RR = RR),
lower = c(P = 0.01),
upper = c(P = 0.99),
method = "mtl")
points(x, predict(fit, list(x = x)),
col = "#CC00CC55", lwd = 2, type = "s")
coef(fit)
# P
# 0.9100729
sum(resid(fit)^2)
# [1] 0.0005669086
# Nb. the parameter R is not continuous: to optimize it as an integer, we try
# reasonable choices and record the best fit.
N <- 150
R <- numeric(N)
for (thisR in 1:N) {
set.seed(112358)
fit <- robustbase::nlrob(y ~ dNB(x, r = thisR, p = P),
data = data.frame(x = x,
thisR = thisR),
lower = c(P = 0.01),
upper = c(P = 0.99),
method = "mtl")
R[thisR] <- sum(resid(fit)^2)
}
plot(R)
which(R == min(R))
# ==== 2.2.2.8 Fitting a binomial distribution
# The workhorse distribution for Bernoulli proceeses
# cf. https://en.wikipedia.org/wiki/Binomial_distribution
x <- 0:28
y <- as.numeric(tu)
plot(x, y, type="s")
dBinom <- function(x, s, p) {
y <- choose(s, x) * (p^x) * ((1-p)^(s-x))
return(y)
}
fit <- robustbase::nlrob(y ~ dBinom(x, s = S, p = P),
data = data.frame(y = y,
x = 0:28),
start = c(S = 240, P = 256/6091))
points(x, predict(fit, list(x = x)),
col = "#CC00CC55", lwd = 2, type = "s")
coef(fit)
# S P
# 122.77746436 0.08376922
sum(resid(fit)^2)
# [1] 1.114272e-06
# Here we go: near perfect fit. But note the parameters! Can you identify the relationship of S and P to our model of choosing unique transcription factors in a random sampling process. Because, honestly: I can't.
# What is the take-home message here?
# - The standard Poisson and hypergeometric distributions apply to
# very specific models of choice processes (with/without replacement);
# - It is not trivial to choose the right statistical model for any particular
# real-world specification of a sampling process. Our model of unique
# choices is neither Poisson nor hypergeometric distributed. Once we have
# identified the parameters of a binomial distribution that models the
# process perfectly, we nevertheless note that we don't seem to be
# able to interpret the parameters easily.
# - Stochastic modeling allows us to validate whether the model is correct,
# but in case there are discrepancies it is not obvious what might be
# a more appropriate model and how to parametrize it.
# - If building an explicit stochastic model is not possible, we'll have to
# to do the best we can, in this case that would mean: choose a more
# general distribution and parametrize it.
# == 2.3 The uniform distribution ==========================================
# The uniform distribution has the same probability over its entire support. R's
# runif() function takes the desired number, the min and the max as arguments.
@ -191,7 +791,7 @@ runif(10) < 1/3
sum(runif(1e6) < 1/3)/1e6 # should be close to 0.33333...
# == 2.3 The Normal Distribution ===========================================
# == 2.4 The Normal Distribution ===========================================
# The king of probability distributions. Why? That's because of the Central
# Limit Theorem (CLT) that essentially says that a process that is subject to
@ -628,7 +1228,7 @@ sum(divs < KLdiv(pmfL1, pmfL2)) # 933
# we used above.
# == 4.3 Continuous distributions: Kolmogorov-Smirnov test ==============
# == 4.3 Continuous distributions: Kolmogorov-Smirnov test =================
# The Kolmogorov-Smirnov (KS) test is meant for continuous distributions, i.e.
# the probability it calculates assumes that the function values are all