\documentclass[12pt,oneside]{book}
\usepackage[utf8]{inputenc}
\usepackage[a4paper, total={6.4in, 9.7in},top=1in]{geometry}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{hyperref}
\usepackage{parskip}
\usepackage{xcolor}
\usepackage[pdftex]{graphicx}
\graphicspath{{./Figures/}}
\usepackage{enumerate}
\usepackage{flafter}
\usepackage{caption}
\captionsetup{justification = raggedright, singlelinecheck = false}
\usepackage[section,above,below]{placeins}
\usepackage{listings}
\lstset{
language=R, % the language of the code
basicstyle=\scriptsize\ttfamily, % the size of the fonts that are used for the code
numbers=none, % where to put the line-numbers
backgroundcolor=\color{white}, % choose the background color.
showspaces=false, % show spaces adding underscores
showstringspaces=false, % underline spaces within strings
showtabs=false, % show tabs within strings
frame=single, % adds a frame around the code
rulecolor=\color{black}, % if not set, the frame-color may be changed on line-breaks within not-black text
tabsize=2, % sets default tabsize to 2 spaces
breaklines=true, % sets automatic line breaking
breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace
keywordstyle=\color{blue!70!black}, % keyword style
commentstyle=\color{green!50!black}, % comment style
stringstyle=\color{orange!70!black} % string literal style
}
\usepackage[backend=biber, style=authoryear, natbib=true, maxbibnames=6, maxcitenames=2, giveninits=true, doi=false]{biblatex}
\addbibresource{SeanPhD.bib}
\setlength{\bibparsep}{6pt}
\DeclareNameAlias{sortname}{last-first}
\renewcommand{\newunitpunct}{,\ }
\renewcommand{\intitlepunct}{}
\DefineBibliographyStrings{english}{in = {{}{}}}
\newcommand{\mytitle}{Topics in the analysis of composition data}
\hypersetup{pdfinfo={
Title={\mytitle},
Subject={\mytitle},
Author={Sean van der Merwe},
Keywords={Bayes, Dirichlet, Goodness-of-fit, regression, Posterior, Simulation} } }
\renewcommand{\chapterautorefname}{Chapter}
\renewcommand{\sectionautorefname}{Section}
\renewcommand{\subsectionautorefname}{Section}
\renewcommand{\subsubsectionautorefname}{Section}
\newlength{\figwidth} \setlength{\figwidth}{10cm} % I use this to standardise figure sizes
\usepackage{soul}
\newcommand{\com}[1]{{\color{red}#1}} % I'm going to use this command to highlight my changes, remove the red as you see fit. I'll use [] for comments inside it.
\newcommand{\new}[1]{{\color{blue}#1}}
\newcommand{\bx}{\ensuremath{\mathbf{x}}}
\newcommand{\by}{\ensuremath{\mathbf{y}}}
\renewcommand{\baselinestretch}{1.15}
\usepackage{fancyhdr}
\fancypagestyle{C1}{\fancyhf{} \fancyhead[L]{Chapter 1 --- Introduction} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{C2}{\fancyhf{} \fancyhead[L]{Chapter 2 --- Fitting} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{C3}{\fancyhf{} \fancyhead[L]{Chapter 3 --- Discrimination} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{C4}{\fancyhf{} \fancyhead[L]{Chapter 4 --- Classification} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{C5}{\fancyhf{} \fancyhead[L]{Chapter 5 --- MGAM} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{C6}{\fancyhf{} \fancyhead[L]{Chapter 6 --- Regression} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{C7}{\fancyhf{} \fancyhead[L]{Chapter 7 --- Conclusion} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{plain}{\fancyhf{} \fancyhead{} \fancyfoot[C]{\thepage} \setlength{\headheight}{15pt} \renewcommand{\headrulewidth}{0pt}}
\fancypagestyle{preamble}{\fancyhf{} \fancyhead[L]{Preamble} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\fancypagestyle{contentsstuff}{\fancyhf{} \fancyhead[L]{\leftmark} \fancyhead[R]{\thepage} \fancyfoot{} \setlength{\headheight}{15pt}}
\usepackage{animate}
\begin{document}
\hfuzz=32pt
\hbadness=4800
\begin{refsection}
\pagenumbering{gobble}
\begin{center}
\includegraphics[width=1.2\figwidth]{UFSmathstatsLogo.png}
\begin{Huge}
\mytitle
\end{Huge}
\vspace{1cm}
\begin{LARGE}
Sean van der Merwe
2001007828
\end{LARGE}
\begin{Large}
\vspace{2cm}
Supervisors: Daniel de Waal, Michael von Maltitz
\end{Large}
\vspace{2cm}
Submitted in fulfilment of the requirements for the degree:\\
Ph.D. (Mathematical Statistics)\\
in the Department of Mathematical Statistics and Actuarial Science in the Faculty of Natural and Agricultural Sciences at the University of the Free State.
\vspace{1cm}
\begin{Large}
\today
\end{Large}
\end{center}
\newpage
\frontmatter
\setcounter{page}{1}
\pagestyle{preamble}
\section*{Declaration}
I, Sean van der Merwe, hereby declare that this work, submitted for the degree Ph.D. (Mathematical Statistics), at the University of the Free State, is original work and has not previously been submitted, for degree purposes or otherwise, to any other institution of higher learning. I further declare that all sources cited or quoted are indicated and acknowledged by means of a comprehensive list of references. Copyright hereby cedes to and is invested in the University of the Free State. I declare that all royalties as regards to intellectual property that was developed during the course of and/or in connection with the study at the University of the Free State, will accrue to the University.
\includegraphics[width=60mm]{SeanvdMsignature1.png}
\newpage
\section*{Acknowledgement}
I want to thank Jesus first, for keeping my spirits up through failures and rejections, and reminding me that research is a service to the world, not for any personal gain.
I want to thank those that directly helped me with this work, for guidance and direction, even when I stubbornly refused to listen. In particular, I'm referring to Daan de Waal, Roelof Coetzer, Michael von Maltitz, and Robert Schall.
I want to thank my whole department for being such amazingly pleasant people to work with and always supporting me, even when they didn't know what I was doing or weren't supportive of it. They are all role models for me.
I want to thank my friends and family, for helping me maintain balance, and for their interest in my work.
\newpage
\section*{Abstract}
Many scientific and industrial processes produce data that is best analysed as vectors of relative values, often called compositions or proportions. Any data collected as vectors where the proportion measured in each category is of more importance than the quantity surveyed falls into this area of analysis. Such data is multivariate in nature but should not be analysed using standard multivariate analysis methods directly. Techniques specific to composition data are investigated and recommended in this work.
The Dirichlet distribution is a natural distribution for composition or proportion data. It has the advantage of a low number of parameters, making it a parsimonious choice in many cases. This work primarily explores the Dirichlet distribution from various angles. These include discriminating between populations, classifying observations, and regression of compositions on data of other types. Other models for compositions are also discussed.
Most topics are investigated from the Bayesian perspective. This work particularly contributes to the field of Bayesian analysis using the Dirichlet distribution, expanding the theory of fitting and applying the Dirichlet type I and Dirichlet type II distributions using Bayesian techniques. Prediction and imputation are also explored.
A key contribution of this work is the development of a general approach to building advanced models for composition data, in the Bayesian framework, using the Dirichlet distribution as a starting point. These models may include different types of explanatory variables for both expected values and precisions, as well as a mixture of fixed and random effects.
\vspace{4mm}
\noindent%
\textbf{Keywords:} Composition data, proportions, Dirichlet distribution, Bayes, simulation
\vspace{4mm}
%\noindent%
%{\it Note on media:} This document is meant to be viewed in Portable Document Format (.pdf). If you did not receive it in this format then you may obtain the latest version here: \url{https://goo.gl/yBhSUu}.
\newpage
\section*{List of Abbreviations}
\begin{tabular}[t]{|l|l|} \hline
\textbf{Abbreviation}&\textbf{Meaning}\\ \hline
D&Dirichlet type I (distribution)\\ \hline
D2&Dirichlet type II (distribution)\\ \hline
LDA&latent Dirichlet allocation (model)\\ \hline
ln&natural logarithm\\ \hline
LN&logistic-normal (distribution)\\ \hline
log&natural logarithm\\ \hline
MD&matrix Dirichlet (distribution)\\ \hline
MDI&maximal data information (prior)\\ \hline
MDirich2&matrix Dirichlet type II (distribution)\\ \hline
MEV&multivariate extreme value (distribution)\\ \hline
MGAM&multivariate gamma (distribution)\\ \hline
ML&(method of) maximum likelihood\\ \hline
SCE&sum of composition errors\\ \hline
\end{tabular}
\newpage
\pagestyle{contentsstuff}
\tableofcontents
\newpage
\listoffigures
\newpage
\listoftables
\newpage
\pagestyle{preamble}
\section*{Problem Statement}
In many experiments and data-collection settings, measurements are relative, or are of interest to researchers only when expressed relative to each other in such a way that they sum to one. The result is data that is multivariate, but not directly suitable for analysis using the popular multivariate normal methods. Further, sample sizes may be small, so even if the data are transformed to a free (unconstrained) scale, the multivariate normal distribution may have too many parameters, leading to over-fitting. Excessive transformations also affect interpretability.
The Dirichlet and related distributions can be seen as a parsimonious solution, but the published literature is not yet extensive. Most notably, there is a need for Bayesian solutions involving the Dirichlet distribution.
This work seeks to address the problems of:
\begin{itemize}
\item Bayesian analysis of the Dirichlet distribution,
\item discrimination and classification with composition data,
\item and regression of composition data using the Dirichlet distribution.
\end{itemize}
\newpage
\section*{Overview}
In \autoref{ch:intro} the main topics are introduced and illustrated, relevant applications are discussed by means of examples from the literature, and advances in the theory of the topics are mentioned.
In \autoref{ch:p1} the Dirichlet type I and type II distributions are discussed in detail. New theory and objective Bayesian priors are derived. The performance of the derived Bayesian parameter estimation methods is compared to that of existing methods via a simulation study. Prediction and imputation using these distributions are also explained.
In \autoref{ch:p2} the discriminant function is developed for the Dirichlet and matrix Dirichlet distributions. Its uses are explained both in theory and with examples.
In \autoref{ch:p3} two new approaches to classification of Dirichlet observations are developed and discussed in detail. These are presented in the context of coal properties as inputs to a coal gasification process.
In \autoref{ch:p4} an alternative to the existing distributions for composition data is developed, namely a new class of multivariate gamma distribution. Again, classification in the context of coal gasification is used to explain the use of this approach. This alternative relates to the Dirichlet distribution in that the Dirichlet can be created as a transformation of a special case of this multivariate gamma distribution.
In \autoref{ch:p5} the use of Dirichlet distributions in a regression setting is discussed. A new approach is developed and explained. The advantages of the new approach are illustrated via simulation and example.
In \autoref{ch:conclusion} the advances made through the work above are summarised.
It is worth mentioning that many of the approaches discussed are not compared directly, as they fall in different model classes. Specifically, the Dirichlet models can be seen as the parsimonious class, as the number of parameters increases in line with the dimension of the problem; while the other models discussed (\textit{e.g.}\ multivariate gamma and logistic normal) are referred to as the rich class, since the number of parameters in those models increases in line with the square of the dimension. Models are compared to others in their own class only.
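
To make the difference in growth concrete, a rough parameter count can be sketched under stated assumptions: a Dirichlet type I model for a composition with $P$ components is taken to have one parameter per component, while the logistic-normal model is taken here to be a multivariate normal distribution on $P-1$ log-ratios with an unrestricted covariance matrix, giving $P-1$ means and $P(P-1)/2$ variances and covariances. The exact counts for the other models discussed in later chapters may differ from this illustration.
$$
\underbrace{P}_{\text{Dirichlet}} \qquad \text{versus} \qquad \underbrace{(P-1) + \frac{P(P-1)}{2}}_{\text{logistic normal}} = \frac{(P-1)(P+2)}{2}
$$
For $P = 5$ this is 5 parameters against 14, and for $P = 10$ it is 10 against 54.
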
The thread connecting the topics is an expansion of the tools for working with composition data. The focus is on the Dirichlet type I models. Being in the parsimonious class, these models require fewer data points for accurate parameter estimation, and they also have the benefit of not requiring transformation of composition data prior to analysis.
\newpage
\section*{Comment on notation}
Chapters 2 to 6 are based on individual papers that were each written separately, by different combinations of authors, from slightly different viewpoints. As such, they originally contained differing definitions and notation. In order to increase the readability of this document, each paper has undergone minor editing.
In each of these chapters a link is provided to an online version of the relevant paper, in case the reader wishes to view the original. Most changes are related to a consistent notation for the Dirichlet distribution. The main reason for the differences from the original papers is discussed in the first chapter.
\newpage
\section*{Authorship and publication}
Paper 1 (\autoref{ch:p1}) is almost entirely the work of the student, under the direction of Prof.\ Daan de Waal. All derivations were done by the student. All simulation studies were designed and executed by the student. The multiple imputation approach is original to the student, adapted from the multivariate normal imputation approach. The paper is posted as a technical report because it discusses the theory in detail and is intended to act as a reference source.
Paper 2 (\autoref{ch:p2}) is mostly the work of Prof.\ Daan de Waal, with the student providing assistance. The theory was derived by the supervisor, along with initial experiments. Results were checked and edited by the student. The student was responsible for typesetting and editing of the publication. This paper is published in the South African Statistics Journal.
Paper 3 (\autoref{ch:p3}) contains equal contributions from Prof.\ Daan de Waal and the student, with Prof.\ Roelof Coetzer providing leadership and direction. The introduction and interpretations relating to the application were written by Prof.\ Coetzer, who posed the original problem. The first approach to solving the problem was derived by the supervisor. The second approach was derived by the student. The student was responsible for typesetting and editing of the publication, and responding to reviewer feedback. This paper is published in Chemometrics.
Paper 4 (\autoref{ch:p4}) is mostly the work of Prof.\ Daan de Waal, with the student providing assistance and expanding the theory in places. Prof.\ Roelof Coetzer provided valuable assistance. The new distribution was developed by the supervisor. The student was responsible for the Appendix. The student was responsible for typesetting and editing of the publication, and responding to reviewer feedback. This paper is published in the South African Statistics Journal.
Paper 5 (\autoref{ch:p5}) is entirely the work of the student and is currently under review. The work has been accepted for publication in the South African Statistics Journal.
The papers are presented in the order in which they were written; this order nevertheless maintains a logical flow, so the work can be read as a unit.
% >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
\mainmatter
\chapter{Introduction and Literature} \label{ch:intro}
\pagestyle{C1}
\section{Composition data}
Composition data refers to a set of multivariate observations where each individual observation vector sums to exactly one. Alternatively, it is often defined as a vector summing to a positive real number less than one, with a residual `other' component that is considered non-random making up the difference between one and the sum of the random components. Each of these definitions seems more applicable to some practical situations than to others, so examples are discussed in this chapter.
When vectors are measured in whole numbers, for example in a classification process, analysis is often based on the multinomial distribution, with the total count being seen as a nuisance parameter. In cases where the total count is not relevant it seems more natural to work directly with the observed proportions. Sometimes the proportions themselves are observed directly, rather than counts and totals. This class of problem includes any situation where the quantity surveyed does not affect the expectation vector.
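
As a small illustration with made-up numbers (not taken from any data set in this work), suppose a classification process assigns $n = 60$ items to three categories with counts $(12, 30, 18)$. Dividing each count by the total gives the composition
$$
\left(\frac{12}{60},\ \frac{30}{60},\ \frac{18}{60}\right) = (0.2,\ 0.5,\ 0.3), \qquad 0.2 + 0.5 + 0.3 = 1 .
$$
Counts of $(24, 60, 36)$ out of $n = 120$ give exactly the same composition, which is the sense in which the quantity surveyed is treated as irrelevant when working with proportions directly.
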
A good example of composition data is food composition. Most foods have their compositions analysed and reported for health and safety reasons, and being able to model changes in composition can be valuable to regulators. See the book by \textcite{greenfield} for an explanation of the creation and analysis of this type of data. In \textcite{Davis2004} the composition of food was compared between 1950 and 1999 to investigate possible drops in nutritional value over that period. Astrophysicists have considered using the composition of solar wind to determine its origin \parencite{geiss1995}. Another composition example is given by \textcite{wedepohl2011data}, where the elemental composition of glass was used to determine the glass's region of origin. Compositions are of great interest in industrial applications, where the inputs to an industrial process can affect the financial worth of the output.
\subsection{Note on conflicting definitions}
The most popular approaches to analysing composition data involve designating one component of the vector as a reference component. The other components are considered random, while the reference component is not. Analysis then proceeds relative to the reference component. These approaches were regularly applied out of necessity, due to the absence of alternatives. While there may be practical justifications for these approaches in some cases, it is argued here that applying them without clear justification results in a loss of practical interpretability.
In the chapters to come the theory of composition data analysis is expanded in such a way that reliance on a reference category is reduced, freeing the researcher to analyse and interpret all components directly. In accordance with this, definitions and notation have been adapted in places to reflect the idea that all components carry equal weight.
Formally, the following definition is used:
$$
\begin{aligned}
&\text{A random variable }\mathbf{X}\text{, a vector of size }P\text{ by 1, with components }X_1,\ldots,X_P\\
&\text{is called a composition random variable if }\\
&0