Fast and Efficient Genotype Encoding using Sparse 2D Bitmaps for Database-Driven Genomics Applications

Ahmad Al Kawam1, Aniruddha Datta

  • 1Texas A&M University

Details

18:15 - 20:15 | Mon 5 Mar | Caribbean ABC | MoPO.7

Session: Poster Session # 1 and BSN Innovative Health Technology Demonstrations

Abstract

Data management is a major challenge for many genomics applications. A central goal of genomic research is identifying and storing the genetic variants present in human populations. Recently, there has been increasing interest in adopting a database representation for variant information. However, the massive scale of variant data poses many storage and access-time challenges for database-driven genomic applications, so efficient database-driven variant encoding techniques need to be developed. In this paper, we propose a variant encoding technique based on 2D sparse bitmaps, designed to achieve high compressibility while minimizing access time. Using this approach, we reduced the database storage space of the 1000 Genomes pilot dataset from the 45.24 GB required by a basic implementation to 4.75 GB, while also reducing database access time by a factor of around 100. Furthermore, we compared our approach to the popular Ensembl Variant Database and achieved database size reductions of up to 47.33% without compromising access