550 likes | 576 Views
Learn about Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES), including alignment, references, and tools for analyzing short reads. Explore the importance of depth, coverage, and variants in sequence analysis.
E N D
BF528 - Sequence Analysis 1 - WGS/WES Lecture 6 02/07/2018
Whole Genome Sequencing Whole Genome Sequencing (WGS) • WGS covers 95% to 98% of the genome. • Depth at least 30X to be able to detect heterozygous SNPs. • Analysis is harder, more data and more complex regions
Whole Exome Sequencing Whole Exome Sequencing (WES) • Exons consist of about 1-2% of the genome that code proteins. • Much higher coverage (~100X) • Sequencing known exons • Variants have high impact
Alignment • NGS produces short reads (30-250bp) • We need to identify where in the genome they came from. • We do this based on similarity, observing that the probability of 2 regions of the genome of a given length being exactly the same is very low. • This is a guided analysis: we need a reference to align to. Used to find variants and haplotypes.
The reference • The reference is a haploid representation of the consensus of multiple individuals. • Human genome: • GRCh37 hg19 • GRCh38 hg38 • Mouse: • GRCm38 mm10 • NCBI37 mm9 • It is in fasta format (.fa or .fasta)
Human genome Genome Reference Consortium human build 38 patch 12 (latest) Assembly GRCh38.p12 from NCBI
Human genome Genome Reference Consortium human build 37 patch 13 Assembly GRCh38.p12 from NCBI
The reference To load samtools use: For the first time you need to index the reference (makes a fai file): samtools faidx module load samtools samtools faidx ref.fa samtools faidx ref.fa ch1:10000-10100
Make a small reference Go to: ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ Download chromosome 22. wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz gunzip chr22.fa.gz samtools faidx chr22.fa samtools faidx chr22.fa chr22:29268316-29300343 | less
Let’s get some reads On SRA look for the reads you want, for example WES of NA12878 with access ID SRR1919605 (took an hour) : Note: always use qsub. Note: you might get errors regarding space. You might need to change it’s cache from your home root. echo '/repository/user/main/public/root = "/projectnb/bf528/..."' > $HOME/.ncbi/user-settings.mkfg module load sratoolkit fastq-dump --split-files --gzip SRR1919605
Don’tdownload now, just make a simlink Location: /projectnb/bf528/06/sample_reads/SRR1919605_first400K_1.fastq /projectnb/bf528/06/sample_reads/SRR1919605_first400K_2.fastq
Quality check with FastQC First thing to do is to check the quality of the reads using FastQC. Do not run this. It will take half 12min. The results are at: /projectnb/bf528/06/fastqc module load fastqc fastqc --help fastqc SRR1919605_1.fastq.gz SRR1919605_2.fastq.gz
Quality check with FastQC Started analysis of SRR1919605_1.fastq.gz Approx 5% complete for SRR1919605_1.fastq.gz Approx 10% complete for SRR1919605_1.fastq.gz Approx 15% complete for SRR1919605_1.fastq.gz Approx 20% complete for SRR1919605_1.fastq.gz Approx 25% complete for SRR1919605_1.fastq.gz Approx 30% complete for SRR1919605_1.fastq.gz Approx 35% complete for SRR1919605_1.fastq.gz Approx 40% complete for SRR1919605_1.fastq.gz Approx 45% complete for SRR1919605_1.fastq.gz Approx 50% complete for SRR1919605_1.fastq.gz Approx 55% complete for SRR1919605_1.fastq.gz Approx 60% complete for SRR1919605_1.fastq.gz Approx 65% complete for SRR1919605_1.fastq.gz Approx 70% complete for SRR1919605_1.fastq.gz Approx 75% complete for SRR1919605_1.fastq.gz Approx 80% complete for SRR1919605_1.fastq.gz Approx 85% complete for SRR1919605_1.fastq.gz Approx 90% complete for SRR1919605_1.fastq.gz Approx 95% complete for SRR1919605_1.fastq.gz real 5m47.401s user 5m42.057s sys 0m4.673s
Quality check with FastQC Can take fastq or bam as input. Output is in html format. Better to run on BAM file.
Alignment of short reads There are various tools to align reads: • First hit aligners: • BWA MEM→ the most common • Bowtie2 • SNAP → forces reads to align close to each other • All possible hits aligners: • MrFAST • MrsFAST
Aligner comparison 50 bp reads with SNP (substitutes) Correctly mapped reads Incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014
Aligner comparison 50 bp reads with inserts Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014
Aligner comparison 50 bp reads with deletion Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014
Aligner comparison 100 bp reads with SNP (substitutes) Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014
Aligner comparison 100 bp reads with inserts Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014
Aligner comparison 100 bp reads with deletion Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014
Alignment of short reads • Each aligner requires the reference to be prepared (indexed) in a certain way. • We will do an example with BWA-MEM • http://bio-bwa.sourceforge.net/bwa.shtml Let’s prepare reference indexes for BWA. module load bwa bwa index chr22.fa
BWA-MEM • Align the reads with the gapless bwa-mem algorithm: Output is a SAM file. SAM format is used to save alignments. bwa mem -R "@RG\tID:test\tSM:NA12878" chr22.fa SRR1919605_1.fq.gz SRR1919605_2.fq.gz > SRR1919605.sam
BWA-MEM • Align the reads with the gapless bwa-mem algorithm: Output is a SAM file. SAM format is used to save alignments. bwa mem-R "@RG\tID:test\tSM:NA12878" chr22.fa SRR1919605_1.fq.gz SRR1919605_2.fq.gz > SRR1919605.sam Read group
SAM format https://samtools.github.io/hts-specs/SAMv1.pdf
SAM format • @ header line • Each line has at least 11 fields • 1 line per each read
SAM format The second field is a flag: You can extract flags using -f and -F in samtools.
SAM format header lines start with @ @HD VN:1.5 GO:none SO:coordinate @SQ SN:1 LN:248956422 @SQ SN:10 LN:133797422 @SQ SN:11 LN:135086622 @SQ SN:12 LN:133275309 @SQ SN:13 LN:114364328 @SQ SN:14 LN:107043718 . . . @RG ID:50 LB:SRR1514950 SM:CHM1 @PG ID:MarkDuplicates VN:2.8.0-SNAPSHOT CL:picard.sam.markduplicates.MarkDuplicates INPUT=[SRR1514950.sorted.bam] OUTPUT=SRR1514950.rmdup.bam METRICS_FILE=marked_dup_metrics50.txt REMOVE_DUPLICATES=true TMP_DIR=[/projectnb/vntrseek/CHM1/illumina/temp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates @PG ID:bwa PN:bwa VN:0.7.12-r1039 CL:/share/pkg/bwa/0.7.12/install/bin/bwa.real mem -t 16 /projectnb/vntrseek/share/GRCh38/GRCh38.fa fastq_SRR1514950_1.tar.gz fastq_SRR1514950_2.tar.gz @PG ID:GATK IndelRealigner VN:3.7-0-gcfedb67 CL:knownAlleles=[] targetIntervals=50.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates-24B543D9 VN:2.8.0-SNAPSHOT CL:picard.sam.markduplicates.MarkDuplicates INPUT=[SRR1514952.sorted.bam] OUTPUT=SRR1514952.rmdup.bam METRICS_FILE=marked_dup_metrics52.txt REMOVE_DUPLICATES=true TMP_DIR=[/projectnb/vntrseek/CHM1/illumina/temp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.jsonPN:MarkDuplicates @PG ID:bwa-136CB686 PN:bwa VN:0.7.12-r1039 CL:/share/pkg/bwa/0.7.12/install/bin/bwa.real mem -t 16 /projectnb/vntrseek/share/GRCh38/GRCh38.fa fastq_SRR1514952_1.tar.gz fastq_SRR1514952_2.tar.gz @PG ID:GATK IndelRealigner-CC92922 VN:3.7-0-gcfedb67 CL:knownAlleles=[] targetIntervals=52.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null version The reference headers and the line each starts at read groups Programs you ran
SAM format • Each line has at least 11 fields • 1 line per each alignment, some aligners produce multi-alignments SRR1514952.11241320147 11000027 41S60M =10034 -26GAACCCTAGCCCTACCCCAACCCCGAACCCTACCCCGAACCATAACCCTAACCCTAACCCTAACCCTATCCCTAACCCTAGCCCTAACCCTAACCCTAACC #############################################################################A2A+;H;F;FB?A+2+:??DDB88 XA:Z:X,+156030660,76M25S,5;Y,+57217180,76M25S,5;22,+50808410,60M41S,2;6,-147845,18S23M1D38M1D22M,7; MC:Z:75M1I25M MD:Z:27A11A20 NM:i:2 MQ:i:27 AS:i:50 XS:i:51 RG:Z:52 PG:Z:MarkDuplicates-24B543D9
samtools • BAM is a compressed sam. • CRAM is a better compressed SAM. • You can read all with samtools module load samtools samtools view SRR1919605.sam # unmapped ones samtools view -f 4 SRR1919605.sam # pair unmapped samtools view -f 8 SRR1919605.sam # only mapped ones samtools view -F 4 SRR1919605.sam # second reads only samtools view -f 128 SRR1919605.sam # quality at least 30 samtools view -q 30 SRR1919605.sam
Sorting and indexing Always sort and compress your SAM file. Sorting makes your file smaller and gives you instant access to alignments. Note: You could use the -@ parameter to run the sorting in parallel, if so, don’t forget to allocate multiple processes in your qsub. samtools sort -o SRR1919605_sorted.bam -O bam SRR1919605.sam # don’t forget to index your sorted bam samtool indexSRR1919605.bam
Sorting and indexing You can extract reads by location on a sorted indexed SAM/BAM/CRAM file with samtools. samtools sort -n sorts reads by name, no tools have been implemented to extract alignments by read name. Few applications require sorted sam in read name order such as: bedtools bamtofastq. samtools viewSRR1919605_sorted.bam chr22:29268316-29300343
Do it all in one PROC=8 bwa mem -R "@RG\tID:test\tSM:NA12878" chr22.fa SRR1919605_1.fastq SRR1919605_2.fastq | samtools sort -o SRR1919605_sorted.bam -O bam # don’t forget to index it samtools indexalignment.bam Check the header of the new bam file, see your command? samtools view -H SRR1919605_rmdup.bam
Duplicated reads • PCR duplicates • Sequencing duplicates (mirror) "A duplicate could be PCR effect or reading same fragment twice, there is no way to tell." HAVE TO DO FOR WGS AND WES Debate for RNA-seq (Parekh, Swati, et al. "The impact of amplification on differential expression analyses by RNA-seq." Scientific reports 6 (2016): 25533.)
Removing duplicates - samtools samtools rmdupSRR1919605_sorted.bam SRR1919605_rmdup.bam # don’t forget to index it samtools index alignment.bam
Removing duplicates - picard You could use GATK and picard-tools too (from the broad), is better, takes longer. module load java/1.8.0_92 picard picard MarkDuplicates \ I=SRR1919605_sorted.bam \ O=SRR1919605_markdup.bam \ REMOVE_DUPLICATES=TRUE \ M=marked_dup_metrics.txt # don’t forget to index it samtools index alignment.bam
Removing duplicates - comparison Count your reads: samtools can only remove duplicates across one chromosome, picard can do over different chromosomes. picard is more stringent.
Realigning around indels Preprocessing: BWA and Bowtie2 are gapless aligners. You will need to make a dictionary file for your reference. You can use samtools or picard-tools: module load java/1.8.0_92 gatk/3.7 samtools dict chr22.fa > chr22.dict picard CreateSequenceDictionary REFERENCE=chr22.fa OUTPUT=chr22.dict
Realigning around indels First find all the bad mapping intervals and indels: gatk -T RealignerTargetCreator \ -R chr22.fa \ -I SRR1919605_markdup.bam -o SRR1919605.intervals \# if you have your own targets # -known indels.vcf \
Realigning around indels Using the indels you found, try to realign around them allowing gaps: gatk -T IndelRealigner \ -R reference.fasta \ -I SRR1919605.markdup.bam \# if you have your own targets # -known indels.vcf \ -targetIntervals SRR1919605.intervals -o SRR1919605_markdup_realigned.bam
Realigning around indels GATK version 4 does not use the same algorithm, has a LeftAlignIndels. I have not tested it, and didn’t find blogs on it. For more information on realigning around indels see: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-3-IndelRealignment.pdf
Realigning around indels GATK version 4 does not use the same algorithm, has a LeftAlignIndels. I have not tested it, and didn’t find blogs on it. For more information on realigning around indels see: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-3-IndelRealignment.pdf Always read blogs, review papers seem to favor their own tool or collaborators’
Check your bam file Check the header of the new bam file, see your commands? Now we are ready to do analysis on this bam file. samtools view -H SRR1919605_rmdup.bam
The overall pipeline input: FastQ Output: indexed sorted bam or cram • FastQC, trimming • Align, don’t forget read groups • Sort and index the sam file • Marking/remove duplicates • Realignment around indels (or STRs…) ------------------------------------------------------------------------------------- Insert: bam Output: vcf 6. Call variants 7. Filter the variants