In our latest “People in ME/CFS Research” spotlight, I interviewed Alison Motsinger-Reif, Ph.D., an Associate Professor in Statistics at North Carolina State University, and the Biostatistics Lead for the JAX ME/CFS CRC. Alison has an interest in computational genetics, pharmacogenetics, and epistasis, and will be combining all of the microbiome, immune profiling, and metabolome data for the ME/CFS project into an interactome, where she can figure out which variables are relevant to the disease. Read on to learn more about her background, research plans, and how she got her start in science with Derya.
Hi Alison! Thanks for meeting with me and answering some questions. So, you’re an Associate Professor in the North Carolina State University Statistics Department. Can you tell me a little bit about your background?
Hey, Courtney! Yeah, so I’m a statistical geneticist, so I often describe myself as having a totally dry lab. I’m motivated by interesting problems and biology in general – genetics, genomics, metabolomics, and proteomics specifically, and my work is to develop and apply interesting methods that embrace the complexity of the biological system instead of just ignoring it. It’s interesting because some of my background is directly related to the ME/CFS project. I actually got started as a wet lab biologist working for Derya as an undergraduate. I went to Vanderbilt University for my undergraduate studies, where Derya had just started as a new Assistant Professor. I worked for him doing very classical immunology and virology focusing on HIV and AIDS when I was there in the group. So I really started as a wet lab biologist.
I’ve always loved biological questions, but at some point I figured out that the tool kit I enjoyed using to answer those questions was quantitative. My motivation is interesting biological questions, and I work on computational methods for high-throughput and complex data. Some of my methods development deals with complexities like gene-gene and gene-environment interactions, and with integrative -omics: let’s say you have data on the same genes at the DNA level, the RNA level, and maybe at the protein or metabolomic level too. I look at how we can leverage that kind of information to do robust statistical analysis, and also at interactions amongst variables. In modern biology, we have the ability to collect extremely high-throughput data. So with genetics and genomics technologies, I’m interested in how, with limited sample sizes, we can look at not just individual variables, but the interactions between them. And that presents some important statistical challenges.
That’s so interesting! You mentioned you know Derya Unutmaz from your undergraduate studies at Vanderbilt University. How did he get you involved in the current NIH-funded ME/CFS study at Jackson Labs?
Like I mentioned, I’ve known Derya for an extremely long time – probably longer than either of us would like to admit! I worked in his lab while I was an undergrad, and he has always been such an incredible mentor for me in general over the years. But we haven’t had a chance to work together in a long time. He really reached out to me as he was putting this center grant together. Thinking about some of the biostatistical aspects of it, I jumped on it right away and said yes, because I wanted to work with him again and it also sounded like a really interesting biological problem that needs to be solved.
What exactly will you be doing for the ME/CFS project?
As the data is collected, there are a number of exciting aspects of the project. One of the features of the project is the longitudinal design, in that you have repeated measures for patient samples over so many years. So there are a lot of opportunities and challenges in a longitudinal design that need to be taken into account, like individual variation. So I’m hoping I can bring some biostatistical expertise to the longitudinal analysis. I’ll also be looking at the interactome across all of the data – we’re looking at microbiome, metabolomics, and immunological data – so I’m hoping I can help bring new statistical approaches and ways of thinking into this so that we can properly handle that study design, and really take advantage of all of the data that’s there.
What are some of the technologies you use in your work, and what power do they have to help us understand more about ME/CFS?
I use a lot of machine learning approaches to help find reliable signals and validate their potential as predictors in this project. I think a lot about tools for variable selection in building robust statistical models. If we have all of the microbiome, we can characterize that at different levels of specificity. Even the largest studies don’t have the sample size that can fit all of the variables there, and that wouldn’t tell you anything biologically interesting anyway, because we’re trying to get down to the most meaningful variables biologically and as predictors or robust measures of diagnoses – something that consistently could help at a clinical level. I use a variable selection tool that can help us select variables that are the most reliable, as well as other machine learning approaches with training, testing, and validation approaches that get at the reliability of the model we’re building. There’s a difference between fitting models based on what happened in the past versus models that help predict on future unseen data prospectively. Basically I’m trying to use tools that can help us as much as possible to bridge the gap between retrospective and prospective modeling.
Can you explain what variable selection is, and why you would need to figure out which variables are important and omit others?
So, you are trying to find the right number of variables so you don’t overfit the data. If I have ten samples in my dataset and I have nine variables, I can always build you a perfect model that will perfectly discriminate each sample from the others. But most likely, that’s not really going to be meaningful. You’ll collect a lot of variables when you have a sample size that small, that just by chance will discriminate between samples that day, but may not discriminate between them if you looked at those variables in the long-term. For example, let’s say you’re trying to figure out what variables help you discriminate between me and my sister. We look decently similar, but if you collected as much data about us as you could, one thing you would get is the color of my shirt today. And using just that one set of data, I could build a model. I’m wearing red today, so the model would say, if she’s wearing red, that’s Alison and not her sister. But you know if you collect the data again tomorrow, that model won’t work. And there are a lot of other variables that won’t work either – brown hair won’t discriminate between us, because we both have that. But I’ve got 6 inches on her height-wise, so that would discriminate today, tomorrow, and every day moving forward. So what we have to do statistically, is we have the power to collect these really high-throughput variables, and with the good combination of study design and variable selection, we can find the variables that are always consistent and the most informative. Going back to the example of what I’m wearing, that’s not a good variable to identify me long-term, but it’s a good variable to identify me today. It’s a perfect example of why the longitudinal study design that’s planned in this project is a powerful one. If we collect data today, and next week, and two weeks from now, you quickly see that the color I’m wearing is a noise variable in consistent modeling. 
So we’re collecting everything we can collect at the different -omics levels, but we have to get down to what’s the most consistent and reliable predictor, and what’s just noise.
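The ten-samples, nine-variables point above can be seen in a small simulation. This is a minimal sketch of our own, not code from the JAX ME/CFS project: the labels, the pure-noise variables (standing in for shirt color), the height-like variable, and every function name below are illustrative assumptions, and ordinary least squares is used only because it makes the exact fit easy to see. Each "visit" draws fresh measurements for the same ten samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_samples, n_noise = 10, 9
# Hypothetical labels: 0 = one sister, 1 = the other, five samples each.
y = np.array([0.0, 1.0] * (n_samples // 2))


def noise_visit():
    """One day's data: an intercept column plus nine pure-noise variables."""
    noise = rng.normal(size=(n_samples, n_noise))
    return np.column_stack([np.ones(n_samples), noise])


def stable_visit():
    """One day's data: an intercept plus a height-like variable (about 6 apart)."""
    height = 6.0 * y + rng.normal(scale=0.5, size=n_samples)
    return np.column_stack([np.ones(n_samples), height])


def fit(X, target):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


def accuracy(X, coef, target):
    """Share of samples classified correctly when predictions are cut at 0.5."""
    return float(np.mean((X @ coef > 0.5) == (target == 1.0)))


# Noise model: ten coefficients meet ten samples, so today's fit is exact.
noise_today, noise_tomorrow = noise_visit(), noise_visit()
noise_coef = fit(noise_today, y)
noise_train_error = float(np.max(np.abs(noise_today @ noise_coef - y)))
noise_test_error = float(np.max(np.abs(noise_tomorrow @ noise_coef - y)))

# Stable model: one consistent variable, fitted today and checked tomorrow.
stable_today, stable_tomorrow = stable_visit(), stable_visit()
stable_coef = fit(stable_today, y)
stable_test_accuracy = accuracy(stable_tomorrow, stable_coef, y)

print(f"noise model, max error today:    {noise_train_error:.2e}")
print(f"noise model, max error tomorrow: {noise_test_error:.2e}")
print(f"stable model, accuracy tomorrow: {stable_test_accuracy:.2f}")
```

The noise model reproduces today's labels essentially perfectly and then misses badly on the next visit, while the single stable variable keeps separating the two groups. That contrast is why repeated measurements help sort consistent predictors from noise.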
That’s a really great way of explaining variable selection! So, what was your first impression of the ME/CFS field?
This has been my first time working with this disease, but I’ve been really struck by how understudied it seems to be, and how much is still unknown. Having so much that’s unknown is exciting as a researcher, but must be absolutely terrifying as a patient. So I was struck by how much there is to learn, as well as by how engaged and excited the patient community is. It’s exciting to have the people who the research will actually impact contributing to the science and giving their input and knowledge. Sometimes, with what I do with the statistical aspect, it’s easy to become really distant because I sit at the computer all day crunching numbers. Having a community that’s so active is really inspiring to me.
It’s definitely a very unique patient community. It’s awesome how involved they are in the science. So what do you think are the important statistical advances happening right now that are relevant to this disease?
I think there are some really exciting things in areas that touch this project directly. In the field of metabolomics in particular, the methods for bioinformatics and biostatistics are advancing really quickly. I’m working a lot in metabolomics specifically, and it’s starting to be high-throughput enough and people are collecting enough samples that we’re starting to really understand what kind of noise comes off of those platforms. There are a lot of issues with the annotation of the samples, but the number of metabolites whose structure we know, and that we can actually map to the gene they’re the product of, is expanding really rapidly. And also people are understanding what to do with the un-annotated metabolomics space. It’s an exciting time where technology is being used enough that we’re starting to understand what we actually need to model and how to model in a way that it starts to be biologically meaningful. So it’s an exciting time to do metabolomics within this disease space, because things are a little more mature in terms of how we handle the metabolomics variables, and I’m hopeful that this will lead us to some exciting discoveries for ME/CFS.
Yes, it sounds like you will get some very interesting results. Thanks so much for talking with me, Alison!