An Experiment to Whole-genome Resequencing Data Analysis (I)

 

Introduction

This experiment uses high-throughput sequencing to identify de novo somatic and germline mutations, including single-nucleotide variants (SNVs), SNP sites, and structural variation such as rearrangements (deletions, duplications, and copy number variation). The rearrangements and SNPs are then analyzed together for their function. Researchers will analyze how gene function (including miRNA), recombination rate, loss of heterozygosity (LOH), and evolutionary selection relate to mutation, and how these relationships give rise to the corresponding susceptibility mechanisms and functions. Disease genomes and cancer genomes will be explored comprehensively from the perspectives of genomics, comparative genomics, and population genetics.

Experimental design and samples

(1) Case-control design;

(2) Family-based design: parent-child groups (groups of 3, 4, or more people).

Primary data analysis

1. Data output: statistics on the total number of bases, total mapped reads, and uniquely mapped reads, plus an analysis of sequencing depth.

2. Consensus sequence assembly: reads are aligned to the reference genome sequence, a Bayesian statistical model is used to infer the most likely genotype at each base site, and a consensus sequence of the individual genome is assembled.

3. SNP detection and distribution in the genome: all polymorphic sites in the whole genome are extracted, then filtered and screened using quality values, sequencing depth, repeat content, and other factors to obtain a highly reliable SNP data set. The detected variants are annotated according to the reference genome information.

4. InDel detection and distribution in the genome: during mapping, gapped alignment is performed to detect credible short InDels. The gap length considered is 1 to 5 bases, and each InDel call requires support from at least 3 paired-end reads.

5. Structural variation detection and distribution in the genome: the types of structural variation that can be detected include insertion, deletion, duplication, inversion, and translocation. Based on the alignment of the sequenced individual against the reference genome sequence, structural variation is detected at the whole-genome level and the detected variants are annotated.
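
The Bayesian genotype inference mentioned in item 2 can be sketched as follows. This is a minimal illustration and not the pipeline's actual model: the prior values, the `het_rate` parameter, the uniform error model (a wrong base is equally likely to be any of the other three), and all function names are assumptions made for the example.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(p) for p in combinations_with_replacement(BASES, 2)]


def genotype_prior(genotype, ref_base, het_rate=0.001):
    """Illustrative (assumed) prior that favours the homozygous-reference genotype."""
    a, b = genotype
    if a == b == ref_base:
        return 1.0 - 1.5 * het_rate
    if a != b and ref_base in genotype:
        return het_rate / 3          # heterozygous, one reference allele
    if a == b:
        return het_rate / 6          # homozygous non-reference
    return het_rate ** 2             # heterozygous, no reference allele


def genotype_posteriors(pileup, ref_base, het_rate=0.001):
    """pileup is a list of (base, phred_quality) pairs observed at one site."""
    log_scores = {}
    for genotype in GENOTYPES:
        log_score = math.log(genotype_prior(genotype, ref_base, het_rate))
        for base, qual in pileup:
            # Convert the Phred score to an error probability, capped so the
            # likelihood never reaches zero.
            err = min(10 ** (-qual / 10), 0.75)
            per_allele = [(1 - err) if base == allele else err / 3
                          for allele in genotype]
            # Each read is drawn from either allele with probability 1/2.
            log_score += math.log(0.5 * per_allele[0] + 0.5 * per_allele[1])
        log_scores[genotype] = log_score
    # Normalise in log space to avoid underflow at high depth.
    peak = max(log_scores.values())
    weights = {g: math.exp(s - peak) for g, s in log_scores.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(pileup, ref_base):
    """Return the most probable genotype and its posterior probability."""
    posteriors = genotype_posteriors(pileup, ref_base)
    best = max(posteriors, key=posteriors.get)
    return best, posteriors[best]
```

Calling `call_genotype` at every covered site and writing out the most probable genotype would give a consensus sequence in the sense of item 2; a real pipeline would also model mapping quality and correlated errors.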

Advanced data analysis

1. Read Mapping

(1) Mask out the pseudo-autosomal region on the Y chromosome and align the reads to the NCBI36 reference sequence, which includes all chromosomes, unlocated contigs, and the mitochondrial sequence (mtDNA, for which the revised Cambridge Reference Sequence is used as the alternative). Standard sequence alignment is applied to the raw sequence files to produce an initial mapping of reads to the reference genome, and the distribution of average alignment quality scores is reported;

(2) Base quality score recalibration. A base quality recalibration algorithm re-scores each base in each read and corrects significant systematic errors, including those associated with sequencing cycle and dinucleotide context;

(3) Estimation of the sequencing error rate. Pseudo-autosomal contigs and short repeat regions (including segmental duplications and simple repeat sequences identified by a tandem repeat recognition algorithm) are filtered out.
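
The idea behind step (2) can be sketched as an empirical lookup table. This is a simplified illustration under stated assumptions, not the algorithm used in the experiment: it assumes that mismatches against the reference at sites not known to be polymorphic are sequencing errors, it bins bases by reported quality, cycle, and dinucleotide, and the function names and the add-one smoothing are choices made for the example.

```python
import math
from collections import defaultdict


def build_recalibration_table(observations):
    """Build an empirical quality table.

    observations: iterable of (reported_q, cycle, dinucleotide, is_mismatch),
    one entry per sequenced base at a site assumed to be non-variant.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for reported_q, cycle, dinuc, is_mismatch in observations:
        entry = counts[(reported_q, cycle, dinuc)]
        entry[0] += int(is_mismatch)
        entry[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the error rate strictly between 0 and 1.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


def recalibrate(reported_q, cycle, dinuc, table):
    """Return the empirical quality for this bin, or the reported one if unseen."""
    return table.get((reported_q, cycle, dinuc), reported_q)
```

A bin whose reported quality is 30 but whose observed mismatch rate is about 1% would be recalibrated to roughly 20, which is the kind of cycle- or context-specific correction the step describes.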

2. SNP Calling

Integrating the results of several SNP detection algorithms allows SNPs to be identified more comprehensively and accurately. The SNPs called by the different algorithms are compared, and those on which the algorithms agree are retained as the final SNP set; these concordant SNPs are also highly credible. The SNP calling algorithms used in the analysis include methods based on Bayesian and genotype likelihood calculations, together with linkage disequilibrium (LD) or inference techniques that improve the accuracy of SNP calls.
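
The concordance step can be sketched as a vote across call sets. The caller names, the tuple layout `(chrom, pos, ref, alt)`, and the `min_callers` parameter are hypothetical; the source states only that highly consistent SNPs are retained.

```python
from collections import Counter


def consensus_snps(call_sets, min_callers=None):
    """Keep SNPs reported by at least min_callers of the algorithms.

    call_sets: dict mapping a caller name to a set of (chrom, pos, ref, alt).
    By default a SNP must be reported by every caller.
    """
    if min_callers is None:
        min_callers = len(call_sets)
    support = Counter()
    for calls in call_sets.values():
        support.update(set(calls))  # each caller votes at most once per site
    return {site for site, n in support.items() if n >= min_callers}


# Hypothetical call sets from three callers.
calls = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")},
    "caller_c": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")},
}
strict = consensus_snps(calls)                  # reported by all three callers
relaxed = consensus_snps(calls, min_callers=2)  # reported by at least two
```

Requiring agreement from every caller favours precision, while a lower `min_callers` trades some precision for sensitivity.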

Distribution of SNV allele frequencies across the genome

The distribution of the proportion of rare alleles across the different SNV categories (panel a). The main SNV categories considered are: (1) nonsense, (2) non-synonymous with altered chemical structure, (3) all non-synonymous, (4) conservative non-synonymous, (5) non-coding, (6) synonymous, and other types of SNV. In addition, for the conservation analysis, we examine the conservation classes of SNVs in non-coding regions and their distribution.
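
A minimal sketch of how the proportion of rare alleles per SNV category could be tabulated is shown below. The 1% cut-off for "rare", the input layout, and the function name are assumptions for illustration; the source does not state a threshold.

```python
from collections import defaultdict


def rare_fraction_by_category(snvs, rare_threshold=0.01):
    """Fraction of SNVs in each category whose minor allele is rare.

    snvs: iterable of (category, alt_allele_count, total_alleles).
    rare_threshold is an assumed cut-off on minor allele frequency.
    """
    totals = defaultdict(int)
    rare = defaultdict(int)
    for category, alt_count, n_alleles in snvs:
        freq = alt_count / n_alleles
        minor = min(freq, 1 - freq)  # fold to the minor allele frequency
        totals[category] += 1
        if minor < rare_threshold:
            rare[category] += 1
    return {category: rare[category] / totals[category] for category in totals}


# Hypothetical counts among 200 sampled chromosomes.
example = [
    ("nonsense", 1, 200),
    ("nonsense", 1, 200),
    ("synonymous", 1, 200),
    ("synonymous", 50, 200),
]
fractions = rare_fraction_by_category(example)
```

Comparing these fractions between categories such as nonsense and synonymous is one way to summarise the distribution described above.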

3. Short Insertion / Deletion (Indel) Calling

(1) Genome-wide indel calling and genotyping procedure

The procedure has three main steps: (1) detection of candidate indels; (2) calculation of genotype likelihoods through local realignment; (3) genotype inference and calling based on linkage disequilibrium (LD). Indels were not called on the X and Y chromosomes.

(2) Indel filtering
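
The filters applied at this step are not listed here. As one illustration, the criteria stated in the primary analysis (a gap length of 1 to 5 bases and at least 3 supporting paired-end reads) could be expressed as a simple filter. The record layout and field names below are hypothetical.

```python
def passes_indel_filter(indel, min_len=1, max_len=5, min_supporting_pairs=3):
    """Check an indel call against length and read-support thresholds.

    indel: dict with 'ref' and 'alt' allele strings and 'supporting_pairs',
    the number of paired-end reads supporting the call (assumed layout).
    """
    length = abs(len(indel["alt"]) - len(indel["ref"]))
    return (min_len <= length <= max_len
            and indel["supporting_pairs"] >= min_supporting_pairs)


# Hypothetical candidate calls.
candidates = [
    {"ref": "A", "alt": "ATG", "supporting_pairs": 4},      # 2 bp insertion
    {"ref": "ACGTACGT", "alt": "A", "supporting_pairs": 9},  # 7 bp deletion
    {"ref": "AT", "alt": "A", "supporting_pairs": 2},       # too little support
]
kept = [c for c in candidates if passes_indel_filter(c)]
```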

4. Fusion gene discovery

The gene information is taken from the latest versions of the Ensembl Gene database, the RefSeq database, and the Vega Gene database. The accompanying schematic shows how a fusion gene forms: exons from different chromosomes are joined to form a single fusion gene.

5. Structural Variation

Structural variation (SV) is a major source of genomic variation. It consists mainly of copy number variation (CNV) and unbalanced inversion events involving large fragments (generally > 1 kb). About 20,000 SVs have been detected so far by major genomic studies (DGV database). In some regions the rate of SV formation even exceeds that of SNPs, and SVs correlate strongly with the clinical phenotype of disease. Sequencing lets us identify not only previously published SVs but also novel ones. Novel SVs are generally reported in terms of their germline origin and mutation mechanism. Accurate SV analysis still requires better algorithms, however. We also need a deeper understanding of how SVs form, in particular whether an SV arose from an insertion or a deletion at the ancestral genomic locus, rather than simply judging from allele frequency in comparison with the reference genome sequence. The function of SVs is also studied in combination with population genetics and evolutionary biology, and we comprehensively examine the types of SV formation mechanisms.

To be continued in Part II…
