Motivation

Note: This post is a work in progress for a manuscript under preparation. I hope to document these adventures more fully in the (supplemental) methods of the final manuscript.

We are interested in finding host quantitative trait loci (QTL) controlling pathology in the lung following influenza virus infection in mice. We have a Collaborative Cross Recombinant Inbred Cross (CC-RIX, or CC-F1) panel that has been infected with influenza A virus (CA04), and we have 12 histology scores measured on day 10 (D10) post-infection.

However, before we do any QTL mapping, we observe that the distribution of the data we have for the histology tissues is not normal–it is distributed as a categorical (ordinal) set of scores, ranging from very mild to very severe. These scores are represented by integer (and half-integer) values, but are not necessarily best modeled as numeric. One option is to “deordinalize” the phenotypes, using ordinal logistic regression, generating a normally distributed data set.

Deordinalization

I summarize the following model we are using for our data, from the following reference:

Additional resources include:

Our data can be described as arising from a multinomial distribution, where, for $y \in \{ 1, 2, \dots, K \} $, we have a set of probabilities:

\begin{equation} Pr(y > 1) = \text{logit}^{-1}(X \beta) \end{equation}
\begin{equation} Pr(y > 2) = \text{logit}^{-1}(X \beta - c_2) \end{equation}
\begin{equation} Pr(y > 3) = \text{logit}^{-1}(X \beta - c_3) \end{equation}
\begin{equation} \dots \nonumber \end{equation}
\begin{equation} Pr(y > K - 1) = \text{logit}^{-1}(X \beta - c_{K-1}) \end{equation}

The parameters $c_1$ through $c_{K-1}$ are the cutpoints. For any possible given $k$,

\begin{equation} Pr(y=k) = Pr(y > k-1) - Pr(y>k). \end{equation}

There is an alternative formulation for logistic regression which uses a latent variable which is used to define the cutpoints for classification of each of the $y$ outcomes. For each individual, $i$, a latent variable, $z_i$ is given, and the value of that variable determines the value of the $y_i$ outcome. The latent variable is modeled using a linear model where the errors are distributed in a logistic fashion.

McCullagh (1980, J Royal Stat Soc B-Methodological) describes the continuous latent variable as a “tolerance” value, which may be an unobservable, latent biological quantity of interest.

A help page for a Bayesian proportional odds logistic regression function, bayespolr(), from the arm package, is here: bayespolr help. It uses BFGS and the optim() package. In contrast to polr(), which has no intercept, bayespolr() will force an intercept into the model. There is also vglm() from the VGAM library. There is the clmm() function in the ordinal package that allows cumulative logit models with random effects (Laplace approx.).

Some useful R packages and functions

library(MCMCpack)
MCMCoprobit()

library(VGAM)
vglm()

library(ordinal)
clm(); clmm()

library(MASS)
polr()

library(ARM)
bayespolr()

library(brms)
brm()

Cumulative Probit model with BRMS

\begin{equation} y_i \sim D(f(\eta_i), \theta) \end{equation}
\begin{equation} \eta_i = \boldsymbol{X}\beta + \boldsymbol{Z}_{\text{cross}}u_{\text{cross}} + \boldsymbol{Z}_{\text{batch}}u_{\text{batch}} \end{equation}
\begin{equation} u_{\text{cross}} \sim \text{student-t}(0, \boldsymbol{\sigma}_{\text{cross}}) \end{equation}
\begin{equation} u_{\text{batch}} \sim \text{student-t}(0, \boldsymbol{\sigma}_{\text{batch}}) \end{equation}
\begin{equation} P(c_k | \mu, d) = \phi(d*(c_k - \mu)) - \phi(d*(c_{k-1} - \mu)) \end{equation}

where $\phi()$ is the cumulative unit normal distribution function.

Key reference:

  • Peter McCullagh, 1980, J Royal Stat Soc B-Methodological, “Regression Models for Ordinal Data”