Mathematician/statistician developing network methods for the analysis of multi-omics data
Before IMforFUTURE: background
I come from Omis, a small town on the coast of Croatia. From an early age, I enjoyed solving mathematical problems and riddles, so I decided to attend a gymnasium with a mathematics programme. I continued on this path and earned my bachelor’s (2014) and master’s (2016) degrees in mathematics at the University of Split, Croatia. During my studies, I discovered a love for probability and statistics and became fascinated by the many applications of mathematics. I was lucky to attend the Research MIMOmics Summer School in Cambridge in the summer of 2017, where I learned a lot about statistics in omics studies. After that, and while reading more and more about omics research, I became sure that I wanted to work on modelling and analyzing different omics datasets.
In November 2017 I started my fellowship at the University of Bologna as one of IMforFUTURE’s 11 Early Stage Researchers. In Bologna, I work at the Department of Physics and Astronomy, where I collaborate with a group of biophysicists that includes my supervisor, Prof. Gastone Castellani. The aim of my PhD is to develop statistical methods for the integration of multi-omics datasets. The first step towards this goal is to study the behaviour of different types of omics data. So far I have studied epigenetics, glycomics and “long-tail distributions” in general, because they are characteristic of different types of biological data.
I studied the process of DNA methylation and analyzed the methylation patterns around some DNA regions of interest. DNA methylation is a process by which methyl groups are added to the DNA molecule. Methylation can change the activity of a DNA segment without changing its sequence, which can lead to differences in various biological processes. For this reason, it is important to better understand the effect of DNA methylation. In mammalian cells, DNA methylation occurs mainly at CpG dinucleotides, and there are tools (e.g. Infinium MethylationEPIC Kit) for extracting information on the presence or absence of DNA methylation at CpG sites across the genome. The datasets I am analyzing consist of methylation levels of ~850,000 CpG sites. For the integration of DNA methylation with other omics (such as genetics), we are usually interested in the DNA methylation of a specific genomic region. To analyze this specific portion of the genome, we use the proximity of DNA coordinates and extract the methylation levels of CpG sites that are in the “neighbourhood” of the target region. Since, in general, we end up with a multidimensional measurement, choosing a statistical approach that can recognize the pattern of methylation is a crucial step in the analysis.
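The “neighbourhood” extraction described above can be illustrated with a minimal Python sketch. It is not the actual analysis pipeline: the function name `cpgs_near_region`, the `flank` window size and the toy coordinates and methylation values are all hypothetical, and the sketch assumes CpG positions on a single chromosome that are already sorted.

```python
from bisect import bisect_left, bisect_right


def cpgs_near_region(positions, start, end, flank=1000):
    """Return indices of CpG sites lying within `flank` base pairs of [start, end].

    `positions` must be a sorted list of CpG coordinates on one chromosome.
    The default window size is an illustrative assumption, not a recommendation.
    """
    lo = bisect_left(positions, start - flank)   # first site >= start - flank
    hi = bisect_right(positions, end + flank)    # one past last site <= end + flank
    return list(range(lo, hi))


# Toy data (made up for illustration): CpG coordinates and methylation levels.
cpg_positions = [100, 950, 1200, 2500, 2600, 4100, 9000]
methylation = [0.10, 0.80, 0.75, 0.20, 0.25, 0.90, 0.05]

# Hypothetical target region spanning coordinates 2000..3000.
idx = cpgs_near_region(cpg_positions, start=2000, end=3000, flank=1000)
neighbourhood = [(cpg_positions[i], methylation[i]) for i in idx]
print(neighbourhood)
```

The result is a vector of methylation levels per sample for the chosen region, which is the multidimensional measurement that the downstream statistical method has to summarise.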
Glycosylation is the process in which carbohydrates are attached to proteins. Since this protein modification affects the function of the protein, the analysis of glycans and glycoproteins is an important part of multi-omics research. I am currently on my secondment at the IMforFUTURE industrial partner glyXera in Magdeburg, where I am learning about glycan and glycoprotein measurements. After this one-month period at glyXera, I will have a better understanding of the data production workflow, which will help me in developing new methods for glycan analysis.
It is well known that gene lengths, like the molecular weights of proteins, gene expression and metagenomics data, are described by “long-tail distributions”. Descriptively, “long-tail distributions” are characterised by a large number of occurrences far from the centre of the distribution. This type of distribution is also found in studies of population dynamics, and there has been a long debate about which type of distribution to use to fit such data. To study this problem, I started with “surrogate data” such as gene lengths, and I found that the Poisson-lognormal distribution describes this type of data well. The Poisson-lognormal distribution is a Poisson distribution whose parameter follows a lognormal distribution. Since there is no closed-form expression for its probability mass function, approximation methods are necessary for fitting. The described method can be easily adapted and used for other “long-tailed” biological data.
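To make the need for approximation concrete, here is a minimal Python sketch that evaluates the Poisson-lognormal probabilities by numerical integration. It is only one possible approach and not necessarily the approximation method used in this work: the trapezoidal rule, the grid size `n_grid`, the cut-off `z_max` and the function names are all assumptions made for illustration.

```python
import math


def poisson_lognormal_pmf(k, mu, sigma, n_grid=2001, z_max=8.0):
    """Approximate P(X = k) where X ~ Poisson(lam) and log(lam) ~ Normal(mu, sigma^2).

    The integral over lam has no closed form, so we integrate over the
    standardised variable z = (log(lam) - mu) / sigma on [-z_max, z_max]
    with the trapezoidal rule.
    """
    h = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        z = -z_max + i * h
        log_lam = mu + sigma * z
        # Log of the Poisson probability of k at rate exp(log_lam).
        log_pois = k * log_lam - math.exp(log_lam) - math.lgamma(k + 1)
        # Log of the standard normal density at z.
        log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * math.exp(log_pois + log_phi)
    return total * h


def log_likelihood(counts, mu, sigma):
    """Log-likelihood of observed counts under the Poisson-lognormal model."""
    return sum(math.log(poisson_lognormal_pmf(k, mu, sigma)) for k in counts)


# Illustrative parameter values (not estimates from real data).
probs = [poisson_lognormal_pmf(k, mu=1.0, sigma=0.5) for k in range(201)]
print(sum(probs))
```

Fitting then amounts to choosing `mu` and `sigma` to maximise `log_likelihood` for the observed counts, for example with a general-purpose numerical optimiser. For very large counts or large `sigma`, the integration grid would need to be adapted.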