Recent technological advances have created challenges for geneticists, who must adapt to a wide range of new bioinformatics tools and an expanding wealth of publicly available data (e.g. mutation databases, software). The wide range of methods and the diversity of file formats used in sequence analysis are a significant issue: a considerable amount of time is spent before anyone can even attempt to analyse the genetic basis of human disorders. In addition, although many researchers possess 'just enough' knowledge to analyse their data, they do not make full use of the available tools and databases, and also do not know how their data were created. The primary aim of this review is to document some of the key approaches and to provide an analysis schema that makes the analysis process more efficient and reliable in the context of discovering highly penetrant causal mutations/genes. This review also compares the methods used to identify highly penetrant variants when data are obtained from consanguineous as opposed to non-consanguineous individuals, and when Mendelian disorders are analysed as opposed to common-complex disorders.
bioRxiv preprint first posted online Nov. 6, 2014; doi: http://dx.doi.org/10.1101/011130. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
INTRODUCTION
Next generation sequencing (NGS) and other high throughput technologies have brought new challenges with them. The colossal amount of information they produce has led researchers to look for ways of reducing the time and effort needed to analyse the resulting data, whilst also keeping up with the storage needs of the resulting files, each of which is on the order of gigabytes. The recently emerged variant call format (VCF) has gone some way towards resolving this issue [1]. A VCF file encodes only the differences between a query sequence and a reference sequence. Not only are VCF files substantially smaller (more than 300 times smaller than BAM files, which store all raw read alignments), they also make the data relatively easy to analyse, since many bioinformatics tools (e.g. for annotation and mutation effect prediction) accept the VCF format as standard input. The Genome Analysis Toolkit (GATK), made available by the Broad Institute, also provides useful suggestions towards a universal standard for the annotation and filtering of VCF files [2]. For these reasons, VCF has become the established format for sharing genetic variation produced by large sequencing projects (e.g. the 1000 Genomes Project and the NHLBI Exome Sequencing Project, also known as EVS). However, the VCF does have some disadvantages. The files can be information dense and initially difficult to understand and parse. Comprehensive information about the VCF and its companion software VCFtools [1] is available online (vcftools.sourceforge.net).
Because of the substantial decrease in the price o...