Deordinalization
Motivation
Note: This post is a work in progress for a manuscript under preparation. I hope to document these adventures more fully in the (supplemental) methods of the final manuscript.
We are interested in finding host quantitative trait loci (QTL) controlling pathology in the lung following influenza virus infection in mice. We have a Collaborative Cross Recombinant Inbred Cross (CC-RIX, or CC-F1) panel that has been infected with influenza A virus (CA04), and we have 12 histology scores measured on day 10 (D10) post-infection.
However, before we do any QTL mapping, we observe that the distribution of the data we have for the histology tissues is not normal–it is distributed as a categorical (ordinal) set of scores, ranging from very mild to very severe. These scores are represented by integer (and half-integer) values, but are not necessarily best modeled as numeric. One option is to “deordinalize” the phenotypes, using ordinal logistic regression, generating a normally distributed data set.
Deordinalization
I summarize the following model we are using for our data, from the following reference:
- Gelman and Hill (2007) Data Analysis Using Regression and Multilevel/Hierarchical Models: § 6.5 “Multinomial regression”; § 15.2 “Ordered categorical regression: storable votes”; and § 17.6 “Multilevel ordered categorial regression”.
Additional resources include:
-
Faraway (2016) Extending the Linear Model with R: § 5.3 “Ordinal Multinomial Responses”
-
Venables and Ripley (2002) Modern Applied Statistics with S: § 7.3 “Poisson and Multinomial Models”
-
Agresti (2012) Categorical Data Analysis: § 7 “Logit Models for Multinomial Responses” and § 12.4 “Random Effects Models for Multinomial Data”; also ICDA: software supplement and R and S-PLUS Manual to Accompany CDA.
-
McCullagh and Nelder (1989) Generalized Linear Models § 5.1 “Models for polytomous data”
-
Fullerton and Xu (2016) Ordered Regression Models: § 2 “Parallel Models”
Our data can be described as arising from a multinomial distribution, where, for $y \in \{ 1, 2, \dots, K \} $, we have a set of probabilities:
The parameters $c_1$ through $c_{K-1}$ are the cutpoints. For any possible given $k$,
There is an alternative formulation for logistic regression which uses a latent variable which is used to define the cutpoints for classification of each of the $y$ outcomes. For each individual, $i$, a latent variable, $z_i$ is given, and the value of that variable determines the value of the $y_i$ outcome. The latent variable is modeled using a linear model where the errors are distributed in a logistic fashion.
McCullagh (1980, J Royal Stat Soc B-Methodological) describes the continuous latent variable as a “tolerance” value, which may be an unobservable, latent biological quantity of interest.
A help page for a Bayesian proportional odds logistic regression function, bayespolr()
, from the arm
package, is here: bayespolr help. It uses BFGS and the optim()
package. In contrast to polr()
, which has no intercept, bayespolr()
will force an intercept into the model. There is also vglm()
from the VGAM
library. There is the clmm()
function in the ordinal
package that allows cumulative logit models with random effects (Laplace approx.).
Some useful R packages and functions
library(MCMCpack)
MCMCoprobit()
library(VGAM)
vglm()
library(ordinal)
clm(); clmm()
library(MASS)
polr()
library(ARM)
bayespolr()
library(brms)
brm()
Cumulative Probit model with BRMS
where $\phi()$ is the cumulative unit normal distribution function.
Key reference:
- Peter McCullagh, 1980, J Royal Stat Soc B-Methodological, “Regression Models for Ordinal Data”