Introductory Sequence Analysis Lisa Mullan, HGMP-RC, UK
Good overview
Dotplots
Window size = 10
A dot is placed at the point where there is an exact match between nucleotide residues.
Sequences: AGTAGTGTCCCCCCTAGGCGCCATTGACTAAGGAACTGAG and TGTCCCCCCTGTG
Windows of the first sequence: AGTAGTGTCC, GTAGTGTCCC, TAGTGTCCCC, AGTGTCCCCC, GTGTCCCCCC
Window of the second sequence: CCCCCCTGTG
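The exact-match rule can be sketched in a few lines of Python. This is only an illustration: the function name `exact_dotplot` is made up here, and pairing the long sequence with TGTCCCCCCTGTG as the two axes is an assumption read off the slide labels.

```python
def exact_dotplot(seq_x, seq_y, window=10):
    """Return (i, j) positions where the windows starting at i in seq_x
    and at j in seq_y are identical. Positions are 0-based."""
    dots = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.append((i, j))
    return dots


# Sequences taken from the slide; which one goes on which axis is assumed.
seq_x = "AGTAGTGTCCCCCCTAGGCGCCATTGACTAAGGAACTGAG"
seq_y = "TGTCCCCCCTGTG"

for i, j in exact_dotplot(seq_x, seq_y):
    print(i, j, seq_x[i:i + 10])
```

With a window of 10 and exact matching, a single mismatch anywhere in the window removes the dot, which is why a scoring threshold is often preferred.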
Nucleotide scoring matrix including ambiguity codes (magnitudes as extracted; the minus signs were separated from the values):

    A  T  G  C  S  W  R  Y  K  M  B  V  H  D  N  U
A   5  4  4  4  4  1  1  4  4  1  4  1  1  1  2  4
T   4  5  4  4  4  1  4  1  1  4  1  4  1  1  2  5
G   4  4  5  4  1  4  1  4  1  4  1  1  4  1  2  4
C   4  4  4  5  1  4  4  1  4  1  1  1  1  4  2  4
S   4  4  1  1  1  4  2  2  2  2  1  1  3  3  1  4
W   1  1  4  4  4  1  2  2  2  2  3  3  1  1  1  1
R   1  4  1  4  2  2  1  4  2  2  3  1  3  1  1  4
Y   4  1  4  1  2  2  4  1  2  2  1  3  1  3  1  1
K   4  1  1  4  2  2  2  2  1  4  1  3  3  1  1  1
M   1  4  4  1  2  2  2  2  4  1  3  1  1  3  1  4
B   4  1  1  1  1  3  3  1  1  3  1  2  2  2  1  1
V   1  4  1  1  1  3  1  3  3  1  2  1  2  2  1  4
H   1  1  1  4  3  1  1  3  1  3  2  2  2  1  1  1
D   1  1  1  4  3  1  1  3  1  3  2  2  2  1  1  1
N   2  2  2  2  1  1  1  1  1  1  1  1  1  1  1  2
U   4  5  4  4  4  1  4  1  1  4  1  4  1  1  2  5

Simplified matrix for A, T, G and C:

     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5

Dotplots (cont.)
But what if your sequences don't match exactly?
Dotplots (cont.)
Choose a threshold value. This value changes depending on the window size.
If we keep our window size of 10 and define a threshold value of 10, a dot will be placed wherever the summed score of the nucleotide matches in the window is > 10.
So...
Scoring matrix: 5 for identical nucleotides, -4 otherwise.
Window:   AGTAGTGTCC
Compared: GTGTCCCCCC
Scores:   -4 -4 -4 -4 -4 -4 -4 -4  5  5
Total = -22
No similarity, because the total value is less than the threshold value.
Scoring matrix: 5 for identical nucleotides, -4 otherwise.
Window:   GTGGTGTCCC
Compared: GTGTCCCCCC
Scores:    5  5  5 -4 -4 -4 -4  5  5  5
Total = 14
Similar sequences are marked with a line on the graph, as they exceed our set threshold value of 10.
Scoring matrix: 5 for identical nucleotides, -4 otherwise.
Window:   AGTGTCCCCC
Compared: GTGTCCCCCC
Scores:   -4 -4 -4 -4 -4  5  5  5  5  5
Total = 5
No similarity, because the total value is less than the threshold value.
Scoring matrix: 5 for identical nucleotides, -4 otherwise.
Window:   GTGTCCCCCC
Compared: GTGTCCCCCC
Scores:    5  5  5  5  5  5  5  5  5  5
Total = 50
Similar sequences are marked with a line on the graph, as they exceed our set threshold value of 10.
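The window totals on these slides can be checked with a short Python sketch. The helper names `window_score` and `is_similar` are illustrative only; the values 5 and -4 and the threshold of 10 come from the slides.

```python
def window_score(a, b, match=5, mismatch=-4):
    """Sum the per-position scores of two equal-length windows."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def is_similar(a, b, threshold=10):
    """A dot is drawn only when the window total is greater than the threshold."""
    return window_score(a, b) > threshold


reference = "GTGTCCCCCC"
for window in ["AGTAGTGTCC", "GTGGTGTCCC", "AGTGTCCCCC", "GTGTCCCCCC"]:
    total = window_score(window, reference)
    print(window, total, "dot" if is_similar(window, reference) else "no dot")
```

Because the threshold depends on the window size, a longer window would normally need a higher threshold to keep the same stringency.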
BLAST (Basic Local Alignment Search Tool)
Fast similarity searching of the database
Query: ADSFTHGYYKNMDSEGGA
Query words: ADS, FTH, GYY
Using PAM 250:
ADS against ADS = 8 (2+4+2)
ADS against ANS = 6 (2+2+2)
ADS against TDS = 7 (1+4+2)
Scores above a threshold (T)
June 2001 - LJM
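A minimal Python sketch of the word-scoring step follows. It holds only the five PAM 250 values quoted on the slide, so it is not a full matrix, and the names `PAM250_SUBSET`, `word_score` and `words_above_threshold` are hypothetical. The slide does not state a value for T; the 6 used below is an assumption for the demonstration.

```python
# Only the PAM 250 entries that appear in the slide's arithmetic.
PAM250_SUBSET = {
    ("A", "A"): 2,
    ("D", "D"): 4,
    ("S", "S"): 2,
    ("D", "N"): 2,
    ("A", "T"): 1,
}


def pair_score(a, b, table=PAM250_SUBSET):
    """Look up a symmetric substitution score; fail loudly if it is missing."""
    if (a, b) in table:
        return table[(a, b)]
    if (b, a) in table:
        return table[(b, a)]
    raise KeyError(f"no score for {a}/{b} in this partial table")


def word_score(word, candidate, table=PAM250_SUBSET):
    """Score two equal-length words position by position."""
    if len(word) != len(candidate):
        raise ValueError("words must have the same length")
    return sum(pair_score(a, b, table) for a, b in zip(word, candidate))


def words_above_threshold(word, candidates, threshold):
    """Keep the candidate words whose score is above the threshold T."""
    return [c for c in candidates if word_score(word, c) > threshold]


for candidate in ["ADS", "ANS", "TDS"]:
    print("ADS", candidate, word_score("ADS", candidate))

# T = 6 is an assumed value, chosen only to show the filtering step.
print(words_above_threshold("ADS", ["ADS", "ANS", "TDS"], threshold=6))
```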
BLAST (cont.)
Database sequence: GHNEHAAGDFADSFTHGYYEMMDSEGGA
[Figure: the words ADS, FTH and GYY placed on the database sequence; other labels: MDS, K, N; scores shown: 8, 15, 25, 12, 0, -2]
Maximum scoring pairs (MSP) when score (S) > 50
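Extending a word hit into a maximum scoring pair can be sketched in Python as below. This is not BLAST's actual implementation: the identity scoring (5 for a match, -4 for a mismatch) stands in for PAM 250, and the `max_drop` cut-off and the function name `extend_hit` are assumptions introduced for the example.

```python
def extend_hit(query, subject, q_start, s_start, word_len,
               match=5, mismatch=-4, max_drop=20):
    """Extend an ungapped word hit in both directions and return
    (best_score, query_start, query_end) with a half-open query range."""
    def pair(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    seed = sum(pair(query[q_start + i], subject[s_start + i])
               for i in range(word_len))

    # Extend to the right, remembering the best running total.
    best_right = run = right = 0
    q, s, i = q_start + word_len, s_start + word_len, 0
    while q + i < len(query) and s + i < len(subject):
        run += pair(query[q + i], subject[s + i])
        i += 1
        if run > best_right:
            best_right, right = run, i
        elif best_right - run > max_drop:
            break

    # Extend to the left in the same way.
    best_left = run = left = 0
    i = 1
    while q_start - i >= 0 and s_start - i >= 0:
        run += pair(query[q_start - i], subject[s_start - i])
        if run > best_left:
            best_left, left = run, i
        elif best_left - run > max_drop:
            break
        i += 1

    return seed + best_left + best_right, q_start - left, q_start + word_len + right


query = "ADSFTHGYYKNMDSEGGA"
subject = "GHNEHAAGDFADSFTHGYYEMMDSEGGA"
hit = subject.find("ADS")  # position of the word hit in the database sequence
score, start, end = extend_hit(query, subject, 0, hit, 3)
print(score, query[start:end], "MSP" if score > 50 else "below S")
```

Under these assumed scores the extended segment covers the whole query and clears the slide's cut-off of S > 50.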
BLAST (cont.)
                                                                   Score    E
Sequences producing significant alignments:                        (bits)  Value
gi|2501720|sp|Q95153|BRC1_CANFA BREAST CANCER TYPE 1 SUSCEP...      2280   0.0
gi|6552299|ref|NP_009225.1| breast cancer 1, early onset; b...      3213   0.0
gi|2507557|sp|P48754|BRC1_MOUSE BREAST CANCER TYPE 1 SUSCEP...      1569   0.0
gi|12585549|sp|Q61510|Z147_MOUSE ZINC FINGER PROTEIN 147 (E...        57   3e-07
BLAST (cont.)
>gi|13195181|gb|AAK15590.1|AF284003_1 (AF284003) BRCA1 [Glaucomys volans]
          Length = 963

 Score = 1112 bits (2876), Expect = 0.0
 Identities = 641/957 (66%), Positives = 746/957 (76%), Gaps = 27/957 (2%)

Query: 274 CGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRT 333
           CGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQ +RWA SKETCNDR+
Sbjct: 1   CGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQQSRWAKSKETCNDRQI 60

Query: 334 PSTEKKVDLNADPLCERKEWNKQKLPCSENPRDTEDVPWITLNSSIQKVNEWFSRSDELL 393
           PS+EKKVDLNADP  E+KE +KQK PCSEN RDT+DVPWITLNSSI+KVNEWFSRSDE+L
Sbjct: 61  PSSEKKVDLNADPQYEKKEPSKQKHPCSENSRDQDVPWITLNSSIRKVNEWFSRSDEML 120

Query: 394 LLXXXXXXXXXXXXNAKVADVLDVLNEVDEYSGSSEKIDLLASDPHEALICKSERVHSKSVE 453
                       NA++A +L++ NEVD +SGSSEKIDLLA+DPH ALI K +RV SK+V+
Sbjct: 121 TSDDSDDGGSESNAEIAGILEIPNEVDGFSGSSEKIDLLATDPHNALISKCKRVCSKAVK 180
PSI-BLAST
[Figure: panels A, B and C]
FASTA
k-tuple = 2
Query: ADSFTHGYYKNMDSEGGA
k-tuples: AD, SF, TH, GY, YK, NM, DS, EG, GA
Other k-tuples shown: DS, FT, HG, YN, SE
A k-tup of 1 can be used for more sensitive searches.
A k-tup of 4-6 is generally used for nucleotide searches.
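Splitting the query into k-tuples and indexing their positions can be sketched in Python. The names `ktuples` and `lookup_table` are illustrative; the `step` parameter is an addition so that both the non-overlapping split drawn on the slide (`step=2`) and the overlapping split (`step=1`) can be shown.

```python
from collections import defaultdict


def ktuples(seq, k=2, step=1):
    """List the k-tuples of seq, moving `step` residues at a time."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def lookup_table(seq, k=2):
    """Map every overlapping k-tuple to the 0-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return dict(table)


query = "ADSFTHGYYKNMDSEGGA"
print(ktuples(query, k=2, step=2))  # the split drawn on the slide
print(lookup_table(query)["DS"])    # every place the tuple DS starts
```

A smaller k produces more tuples and more chance hits, which is the trade-off behind the sensitivity remark on the slide.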
FASTA (cont.)
Database sequences
k-tuples: YK, AD, SF, GY, NM, DS, EG, GA, TH
FASTA (cont.)
MYTMHGAGGAADSFTHGYYKNMDSEGGAEHHAGGS
Scores for each matching segment
ADSFTHGYGHERYKNMDSEGGGAGT
Scores for gapped segments
[Figure: k-tuples AD, SF, TH, GY, YK, NM, DS, EG, GA placed on the two sequences; scores shown: 80, 6, 86, 94, 6, 39, 41, 7, 86]
The program now chooses the best ungapped matches to align rigorously.
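One way to picture how k-tuple matches are grouped into segments is to count hits per diagonal, as in the Python sketch below. This is a simplified illustration, not the FASTA program's scoring: it only counts matching 2-tuples, and the function name `diagonal_hits` is made up for the example.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, k=2):
    """Count k-tuple matches on each diagonal.
    The diagonal is (subject position - query position), 0-based."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            counts[j - i] += 1
    return counts


query = "ADSFTHGYYKNMDSEGGA"
ungapped = "MYTMHGAGGAADSFTHGYYKNMDSEGGAEHHAGGS"
gapped = "ADSFTHGYGHERYKNMDSEGGGAGT"

print(diagonal_hits(query, ungapped).most_common(1))
print(diagonal_hits(query, gapped).most_common(3))
```

For the first database sequence all matches fall on one diagonal; for the second they split across neighbouring diagonals, which is the situation that gapped scoring is meant to handle.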
 opt    obs   E()
<20     217     0:=
  22      0     0:            one = represents 254 library sequences
  24      0     0:
  26      0     2:*
  28      0    22:*
  30      1   136:*
  32      4   525:= *
  34     28  1423:= *
  36    209  2922:= *
  38   1366  4829:====== *
  40   5192  6736:===================== *
  42  11114  8233:================================*===========
  44  15057  9082:===================================*========================
  46  15184  9250:====================================*=======================
  48  11557  8856:==================================*===========
  50   8138  8081:===============================*=
  52   5579  7105:====================== *
  54   4051  6069:================ *
  56   3206  5069:============= *
  58   2911  4162:============ *
  60   2656  3371:=========== *
  62   2232  2703:========= *
  64   1760  2150:======= *
  66   1541  1699:======*
  68   1228  1336:=====*
  70    979  1047:====*
  72    748   818:===*
  74    628   638:==*
  76    529   497:=*=
  78    420   386:=*
  80    335   300:=*
  82    284   229:*=
  84    262   182:*=
  86    198   141:*
  88    152   109:*          inset = represents 4 library sequences
  90    123    84:*
  92    112    65:*          :================*===========
  94     92    50:*          :============*==========
  96     70    39:*          :=========*========
  98     63    30:*          :=======*========
 100     50    23:*          :=====*=======
 102     50    18:*          :====*========
 104     47    14:*          :===*========
 106     36    11:*          :==*======
 108     39     8:*          :=*========
 110     24     6:*          :=*====
 112     30     5:*          :=*======
 114     25     4:*          :*======
 116     28     3:*          :*======
 118     14     2:*          :*===
>120    170     2:*          :*=======================================

FASTA (cont.)
The best scores are:                                        opt  bits  E(98478)
SW:BRC1_HUMAN P38398 BREAST CANCER TYPE 1 SUSCEPT (1863)   9652  1764  0
SW:BRC1_CANFA Q95153 BREAST CANCER TYPE 1 SUSCEPT (1878)   6446  1183  0
SW:BRC1_MOUSE P48754 BREAST CANCER TYPE 1 SUSCEPT (1812)   2601   486  1.4e-135
SW:MLP1_YEAST Q02455 MYOSIN-LIKE PROTEIN MLP1. (1875)       215    54  1.9e-05
SW:MYS2_DICDI P08799 MYOSIN II HEAVY CHAIN, NON M (2116)    214    54  2.4e-05