Big Data for Genomic Sequencing. Interview with Thibault de Malliard.
“Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations” –Thibault de Malliard.
On the subject of big data for genomic sequencing, I have interviewed Thibault de Malliard, a researcher at the University of Montreal’s Philip Awadalla Laboratory, who is working on bioinformatics solutions for next-generation genomic sequencing.
Q1. What are the main research activities of the University of Montreal’s Philip Awadalla Laboratory?
Thibault de Malliard: The Philip Awadalla Laboratory is the Medical and Population Genomics Laboratory at the University of Montreal. Working with empirical genomic data and modern computational models, the laboratory addresses questions relevant to how genetics and the environment influence the frequency and severity of diseases in human populations. Its research includes work relevant to all types of human diseases: genetic, immunological, infectious, chronic and cancer.
Using genomic data from single-nucleotide polymorphisms (SNPs), next-generation re-sequencing, and gene expression, along with modern statistical tools, the lab is able to locate genome regions that are associated with disease pathology and virulence, as well as study the mechanisms that cause the mutations.
Q2. What is the lab’s medical and population genomics research database?
Thibault de Malliard: The lab’s database gathers all the mutations (SNPs) found by DNA genotyping, DNA sequencing and RNA sequencing for each sample. It also holds annotation data from public databases.
Q3. Why is data management important for the genomic research lab?
Thibault de Malliard: All the data we have is in CSV text files. This is what our software takes as input, and it outputs other CSV text files. So we use a lot of Bash and Perl to extract the information we need and to do some stats. As time goes on, the number of files multiplies by sample and by experiment, and the statistics computed on the whole data set (mutation frequency, mutations per gene, etc.) need recalculating each time we perform a new sequencing or genotyping run.
With this database, we are also preparing for the lab’s future:
• As the amount of data increases, one day an associative array will no longer fit in memory.
• Scanning a 200 GB file to find one specific mutation will not be a good option.
• Adding new data to the current files will take more and more time/space.
• We need to be able to select the data according to every parameter we have, for example grouping by type of mutation and/or by chromosome, and/or by sample information such as gender, ethnicity, age, or pathology.
• We then need to export the result to a file, or count, sum, or average it.
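The selection and aggregation needs listed above map naturally onto SQL `GROUP BY` queries. The following is only a minimal sketch, using Python's built-in `sqlite3` module as a stand-in for the lab's MySQL setup. The table name `calls`, its columns, and the sample rows are hypothetical and are not the lab's actual schema or data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'calls' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls ("
        " sample_id INTEGER, gender TEXT, chromosome TEXT,"
        " mutation_type TEXT, quality INTEGER)"
    )
    rows = [  # made-up rows, for illustration only
        (1, "F", "1", "synonymous", 30),
        (1, "F", "2", "missense", 20),
        (2, "M", "1", "synonymous", 26),
        (3, "F", "1", "synonymous", 40),
    ]
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def count_and_average(conn):
    """Count calls and average quality per (chromosome, type, gender)."""
    query = (
        "SELECT chromosome, mutation_type, gender,"
        " COUNT(*), AVG(quality)"
        " FROM calls"
        " GROUP BY chromosome, mutation_type, gender"
        " ORDER BY chromosome, mutation_type, gender"
    )
    return conn.execute(query).fetchall()
```

Calling `count_and_average(build_demo_db())` returns one row per group. Because the database does the grouping, no per-file associative array has to be held in the script's memory.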
Q4. Could you describe the kind of data the lab’s genomic research database stores and processes? And for what applications?
Thibault de Malliard: We are storing single nucleotide polymorphisms (SNPs), which are the most common form of genetic mutations among people, from sequencing and genotyping. When an SNP is found for a sample, we also look at what we have at the same position for the other samples:
• There is data for the sample but no SNP, so we know this sample does not have the SNP.
• There is no data for the sample, so we cannot assess whether or not there is an SNP for this sample at this position.
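The distinction above means a sample's status at a position has three values rather than two. One way to model this is sketched below. The names `CallState` and `classify`, and the use of `None` for "no data", are illustrative assumptions. Comparing against a single reference nucleotide is also a simplification.

```python
from enum import Enum
from typing import Optional


class CallState(Enum):
    SNP = "snp"              # data available, differs from the reference
    REFERENCE = "reference"  # data available, matches the reference
    NO_DATA = "no_data"      # nothing observed, presence cannot be assessed


def classify(reference: str, observed: Optional[str]) -> CallState:
    """Classify one sample at one position.

    `observed` is the nucleotide read for the sample, or None when the
    experiment produced no data at this position.
    """
    if observed is None:
        return CallState.NO_DATA
    if observed == reference:
        return CallState.REFERENCE
    return CallState.SNP
```

Keeping `NO_DATA` separate from `REFERENCE` prevents an uncovered position from being counted as evidence that the sample lacks the SNP.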
We gather between 1.8 and 2.5 million nucleotide positions per sample (positions where at least one sample has an SNP), depending on the experimental technique. We store them in the database along with some information:
• how damaging the SNP may be for the function of the gene
• its frequency in different populations (African, European, French Canadian…).
The database also contains information about each sample, such as gender, ethnicity, and pathology. This will keep growing with our needs. So, basically, we have a sample table, a mutation table with the mutations' information, an experiment table and a big table linking the three previous tables through one-to-many relationships.
Here is a very slightly simplified example of a single record in our database:
| Type of data | Data | Table |
| --- | --- | --- |
| SNP | T | Mutation information table |
| Damaging for gene function? | synonymous | Mutation information table |
| Present in known database? | yes | Mutation information table |
| Sequencing quality | 26 | Table linking other tables together (information about 1 mutation for 1 sample from 1 sequencing) |
| Validated by another experiment? | no | Table linking other tables together |
| Sample | 345 | Sample table |
| family | 10 | Sample table |
| Sequencing information | Illumina Hiseq 2500 | Sequencing table |
| Sequencing type (DNA RNA…) | RNAseq | Sequencing table |
| Analysis pipeline info | No PCR duplicates only Properly paired | Sequencing table |
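The example record above can be spread over the four tables described earlier. The sketch below uses Python's built-in `sqlite3` module rather than MySQL with TokuDB. All table and column names are hypothetical, and the surrogate keys `mutation_id` and `experiment_id` are assumptions added so the linking table has something to reference. The stored values are the ones from the example record.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    family    INTEGER
);
CREATE TABLE mutation (
    mutation_id INTEGER PRIMARY KEY,
    nucleotide  TEXT,     -- the SNP allele, e.g. 'T'
    effect      TEXT,     -- e.g. 'synonymous'
    in_known_db INTEGER   -- 1 = present in a known database
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    platform      TEXT,   -- sequencing information
    seq_type      TEXT    -- DNA, RNA, ...
);
-- Linking table: one row per mutation, per sample, per sequencing.
CREATE TABLE observation (
    sample_id     INTEGER REFERENCES sample(sample_id),
    mutation_id   INTEGER REFERENCES mutation(mutation_id),
    experiment_id INTEGER REFERENCES experiment(experiment_id),
    quality       INTEGER,
    validated     INTEGER  -- 0 = not validated by another experiment
);
"""


def load_example():
    """Build the schema in memory and store the single example record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sample VALUES (345, 10)")
    conn.execute("INSERT INTO mutation VALUES (1, 'T', 'synonymous', 1)")
    conn.execute(
        "INSERT INTO experiment VALUES (1, 'Illumina Hiseq 2500', 'RNAseq')"
    )
    conn.execute("INSERT INTO observation VALUES (345, 1, 1, 26, 0)")
    return conn


def full_record(conn):
    """Join the linking table back to the three tables it references."""
    return conn.execute(
        "SELECT s.sample_id, s.family, m.nucleotide, m.effect,"
        " e.platform, e.seq_type, o.quality, o.validated"
        " FROM observation AS o"
        " JOIN sample AS s ON s.sample_id = o.sample_id"
        " JOIN mutation AS m ON m.mutation_id = o.mutation_id"
        " JOIN experiment AS e ON e.experiment_id = o.experiment_id"
    ).fetchone()
```

`full_record(load_example())` reassembles the flat record from the normalized tables. Only the linking table grows with every new observation, which is presumably why it is the "big table".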
There are many applications, but here are some that come to mind:
• extracting subsets of data to use with our tools
• doing stats and counts
• finding specific data
• annotating our data with public databases
Q5. Why did you decide to deploy the TokuDB database storage engine to optimize the lab’s medical and population genomics research database?
Thibault de Malliard: We knew that the data could not be managed with MySQL and MyISAM. One big issue is the insert rate, and TokuDB offered a solution up to 50 times faster. Furthermore, TokuDB allows us to manipulate the structure of the database without blocking access to it. As a research team, we always have new information to add, which means column additions.
Q6. Did you consider alternatives from other vendors? If yes, which ones?
Thibault de Malliard: None. This is much too time-consuming.
Q7. What are you specifically using TokuDB for?
Thibault de Malliard: We store only genetic data and the information related to it.
Q8. How many databases do you use? What are the data requirements?
Thibault de Malliard: I had planned to use three databases:
1. A database for data from RNA/DNA sequencing and DNA genotyping (described above);
2. A database for data from well-known reference databases (dbSNP, 1000 Genomes);
3. A last one to store analyzed data from databases 1 and 2.
The data stored is mainly the nucleotide (a character: A, C, G, T) with integer information such as quality and position, and Boolean flags. I avoid using any strings to keep the tables as small as possible.
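To give a feel for how small a record becomes with one character, a few integers and Boolean flags, here is an illustrative sketch using Python's `struct` module. The field layout, the flag names and the example position are assumptions. In MySQL, the closest counterparts would be column types such as `CHAR(1)`, `INT UNSIGNED` and `TINYINT`.

```python
import struct

# Hypothetical layout: 1-byte nucleotide, 4-byte unsigned position,
# 1-byte quality, 1-byte bit field holding the Boolean flags.
# "<" selects little-endian byte order with no padding.
RECORD = struct.Struct("<cIBB")

FLAG_VALIDATED = 0b01
FLAG_IN_KNOWN_DB = 0b10


def pack_record(nucleotide, position, quality, validated, in_known_db):
    """Encode one observation into a fixed-width byte string."""
    flags = 0
    if validated:
        flags |= FLAG_VALIDATED
    if in_known_db:
        flags |= FLAG_IN_KNOWN_DB
    return RECORD.pack(nucleotide.encode("ascii"), position, quality, flags)


def unpack_record(blob):
    """Decode a byte string produced by pack_record."""
    nucleotide, position, quality, flags = RECORD.unpack(blob)
    return (
        nucleotide.decode("ascii"),
        position,
        quality,
        bool(flags & FLAG_VALIDATED),
        bool(flags & FLAG_IN_KNOWN_DB),
    )
```

With this layout each record occupies `RECORD.size` bytes (7), whatever the values are. A text representation of the same fields would be longer and variable in width.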
Q9. In particular, what are the requirements for ingesting records and retrieving data?
Thibault de Malliard: As a research team, we do not have demanding requirements such as real-time insertion from logs. But I would say that, at most, an import should take less than a night. Updating database 1 is critical when a new sequencing or genotyping experiment is added: a batch of 50M records (it can be more than three times higher!) has to be inserted. This has been happening monthly, but the pace should increase this year.
We have a huge amount of data, and we need to get query results as fast as possible. We have been used to one or two days (a weekend) of query time; 10 seconds would be far preferable!
Q10. Could you give some examples of typical research requests that need data ingestion and retrieval?
Thibault de Malliard: We have a table with all the SNPs for 1000 samples. This is currently a 100GB table.
A typical query could be to find the sample that has a mutation different from the 999 others. We also have some samples that are families: a child with its parents. We want to find the SNPs present in this child, but not present in the other family members.
We may want to find mutations common to one group of samples, defined by gender, disease state, or ethnicity.
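Two of the requests described above (SNPs found in a child but in neither parent, and SNPs carried by a single sample) could be written along the following lines. This is a sketch under stated assumptions, run on SQLite through Python's `sqlite3` module. The table `snp_call`, its columns and the tiny data set are hypothetical. It also ignores the "no data" case, so a missing row is treated as absence of the SNP.

```python
import sqlite3


def build_demo_db():
    """In-memory table with a made-up family: child 1, parents 2 and 3."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snp_call ("
        " sample_id INTEGER, position INTEGER, nucleotide TEXT)"
    )
    rows = [
        (1, 100, "T"), (2, 100, "T"),  # shared by the child and a parent
        (1, 200, "G"),                 # seen in the child only
        (3, 300, "A"),                 # seen in one parent only
    ]
    conn.executemany("INSERT INTO snp_call VALUES (?, ?, ?)", rows)
    return conn


def child_only_snps(conn, child_id, parent_ids):
    """SNPs of the child that no listed parent carries."""
    placeholders = ",".join("?" for _ in parent_ids)
    query = (
        "SELECT c.position, c.nucleotide FROM snp_call AS c"
        " WHERE c.sample_id = ?"
        " AND NOT EXISTS ("
        "  SELECT 1 FROM snp_call AS p"
        f"  WHERE p.sample_id IN ({placeholders})"
        "  AND p.position = c.position"
        "  AND p.nucleotide = c.nucleotide)"
        " ORDER BY c.position"
    )
    return conn.execute(query, [child_id, *parent_ids]).fetchall()


def private_snps(conn):
    """SNPs carried by exactly one sample, with that sample's id."""
    query = (
        "SELECT position, nucleotide, MIN(sample_id) FROM snp_call"
        " GROUP BY position, nucleotide"
        " HAVING COUNT(DISTINCT sample_id) = 1"
        " ORDER BY position"
    )
    return conn.execute(query).fetchall()
```

On a table of the size mentioned above, both queries would depend heavily on an index covering `position` and `sample_id`. The sketch leaves indexing out for brevity.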
Q11. What kind of scalability problems did you have?
Thibault de Malliard: The problem is managing this huge amount of data. The number of connections is expected to be very low; most of the time, there is only one user. So I had to choose the data types and the relationships between the tables carefully. Lately, I ran into a very slow join on a range, so I decided to split the position-based tables by chromosome. Now there are 26 tables and some procedures to launch queries across the chromosomes. The time gained is not quantifiable.
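The per-chromosome split can be pictured as one table per chromosome plus small helpers that route a query to the right table or run it on all of them. The sketch below is not the lab's stored procedures. It uses Python's `sqlite3` module, a hypothetical naming scheme (`snp_chr<label>`) and only three chromosome labels instead of the 26 tables mentioned above.

```python
import sqlite3

# Hypothetical subset of labels; the setup described above has 26 tables.
CHROMOSOMES = ["1", "2", "X"]


def table_for(chromosome):
    """Map a chromosome label to its table name, rejecting unknown labels."""
    if chromosome not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    return f"snp_chr{chromosome}"


def create_tables(conn):
    """Create one position-indexed table per chromosome."""
    for chrom in CHROMOSOMES:
        table = table_for(chrom)
        conn.execute(
            f"CREATE TABLE {table} ("
            " sample_id INTEGER, position INTEGER, nucleotide TEXT)"
        )
        conn.execute(f"CREATE INDEX idx_{table}_pos ON {table} (position)")


def snps_in_range(conn, chromosome, start, end):
    """Range query that touches only one chromosome's table."""
    query = (
        f"SELECT sample_id, position, nucleotide FROM {table_for(chromosome)}"
        " WHERE position BETWEEN ? AND ?"
        " ORDER BY position, sample_id"
    )
    return conn.execute(query, (start, end)).fetchall()


def count_all(conn):
    """Run the same count on every table, keyed by chromosome label."""
    return {
        chrom: conn.execute(
            f"SELECT COUNT(*) FROM {table_for(chrom)}"
        ).fetchone()[0]
        for chrom in CHROMOSOMES
    }
```

Table names cannot be bound as SQL parameters, which is why `table_for` validates the label against a fixed list before it is interpolated into the query text.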
Q12. Do you have any benchmark measurements to support the claim that Tokutek’s TokuDB has improved the scalability of your system?
Thibault de Malliard: I populated the database with two billion records in the main table and then ran queries. While I did not see improvements for queries with my particular workload, I did see significant gains in insertion performance. When I tried to add an extra 1M records (LOAD DATA INFILE), it took 51 minutes for MyISAM to load the data, but less than one minute with TokuDB. Extrapolating these figures to an RNA sequencing experiment, the load should take 2.5 days with MyISAM but one hour with TokuDB.
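The extrapolation in this answer can be checked with a few lines of arithmetic. The sketch below assumes load time scales linearly with the number of records, which is an assumption and not something the benchmark establishes. It takes "less than one minute" as an upper bound of one minute per million records.

```python
MYISAM_MIN_PER_MILLION = 51  # measured in the 1M-record test above
TOKUDB_MIN_PER_MILLION = 1   # upper bound: "less than one minute"


def implied_batch_millions(myisam_days):
    """Batch size (in millions of records) that would take MyISAM this long."""
    return myisam_days * 24 * 60 / MYISAM_MIN_PER_MILLION


def tokudb_minutes(batch_millions):
    """Upper bound on TokuDB load time for the same batch, in minutes."""
    return batch_millions * TOKUDB_MIN_PER_MILLION


batch = implied_batch_millions(2.5)  # roughly 70.6 million records
upper_bound = tokudb_minutes(batch)  # roughly 70.6 minutes
```

Under these assumptions, 2.5 days of MyISAM loading corresponds to a batch of about 70 million records, which TokuDB would load in at most about 70 minutes. That is consistent with the one-hour figure and with the batch sizes of 50M records or more mentioned earlier.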
Q13. What are the lessons learned so far in using the TokuDB database storage engine in your application domain?
Thibault de Malliard: We are still developing it and adding data. But inserting data into the two main tables (0.9 billion and 2.3 billion records) was done in fairly good time, less than one day. Adding columns to meet the needs of the team is also very easy: it takes one second to create the column. Updating it is another story, but the table is still accessible during this process.
Another great feature, one I use with every query, is the ability to follow the state of the query.
You can follow in the process list the number of rows that have been queried. So if you have a good estimate of the number of records expected, you know how long the query will take. I cannot count the number of processes I have killed because the expected query time was not acceptable.
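The estimate described here amounts to a simple proportion: rows processed so far, elapsed time, and the expected total number of rows. A minimal sketch follows. The function name and parameters are hypothetical, the row count is assumed to have been read manually from the process list, and the processing rate is assumed to stay constant.

```python
def estimate_remaining_seconds(rows_done, elapsed_seconds, rows_expected):
    """Estimate the time left for a query from its progress so far.

    Assumes a constant processing rate; the result is only as good as
    the estimate of `rows_expected`.
    """
    if rows_done <= 0 or elapsed_seconds <= 0:
        raise ValueError("need some progress before estimating")
    rate = rows_done / elapsed_seconds          # rows per second
    remaining_rows = max(rows_expected - rows_done, 0)
    return remaining_rows / rate
```

For example, a query that has processed 50 million rows in 10 minutes, with 2 billion rows expected, gives `estimate_remaining_seconds(50_000_000, 600, 2_000_000_000)`, about 23,400 seconds or 6.5 hours. An estimate like that is what makes it possible to decide early whether to keep or kill the query.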
Qx. Anything you wish to add?
Thibault de Malliard: The sequencing/genotyping technologies evolve very fast, and evolving means more data from the machines. I expect our data to grow at least threefold each year. We are glad to have TokuDB in place to handle the challenge.
Since 2010, Thibault de Malliard has worked in the University of Montreal’s Philip Awadalla Laboratory, where he provides bioinformatics support to the lab crew and develops bioinformatics solutions for next-generation genomic sequencing. Previously, he worked for the French National Institute for Agricultural Research (INRA) with the MIG laboratory (Mathematics, Informatics and Genomics) where, as part of the European Nanomubiop project, he was tasked with developing software to produce probes for an HPV chip. He holds a master’s degree in bioinformatics (France).