Introductory Sequence Analysis

Lisa Mullan, HGMP-RC, UK. An introduction to dotplots: with a window size of 10, a dot is placed at the point where there is an exact match between nucleotide residues.

Presentation Transcript


  1. Introductory Sequence Analysis. Lisa Mullan, HGMP-RC, UK

  2. Dotplots: good overview. Window size = 10. A dot is placed at the point where there is an exact match between nucleotide residues.

         Sequences shown: AGTAGTGTCCCCCCTAGGCGCCATTGACTAAGGAACTGAG
                          TGTCCCCCCTGTG
         Windows shown:   AGTAGTGTCC, GTAGTGTCCC, TAGTGTCCCC, AGTGTCCCCC, GTGTCCCCCC, CCCCCCTGTG
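The exact-match rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's program: the function names are hypothetical, and it assumes that a "match" means the two 10-residue windows are identical and that the two sequences shown on the slide are the two axes of the plot.

```python
def windows(seq, size):
    """Return every overlapping window of length `size` in `seq`."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def exact_dots(seq_x, seq_y, size=10):
    """Return (i, j) start positions whose windows are identical.

    Assumption: a dot requires the whole window to match exactly.
    """
    windows_y = windows(seq_y, size)
    return [
        (i, j)
        for i, wx in enumerate(windows(seq_x, size))
        for j, wy in enumerate(windows_y)
        if wx == wy
    ]


# Sequences taken from the slide; which one is on which axis is an assumption.
long_seq = "AGTAGTGTCCCCCCTAGGCGCCATTGACTAAGGAACTGAG"
short_seq = "TGTCCCCCCTGTG"

first_windows = windows(long_seq, 10)[:5]
dots = exact_dots(long_seq, short_seq)
print(first_windows)
print(dots)
```

The first five windows of the long sequence are the ones listed on the slide, and with these two sequences only one pair of windows is identical.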

  3. Dotplots (cont.) But what if your sequences don’t match exactly?

     Scoring matrix over the IUPAC nucleotide codes (values as given; no minus signs are shown in this table):

             A  T  G  C  S  W  R  Y  K  M  B  V  H  D  N  U
          A  5  4  4  4  4  1  1  4  4  1  4  1  1  1  2  4
          T  4  5  4  4  4  1  4  1  1  4  1  4  1  1  2  5
          G  4  4  5  4  1  4  1  4  1  4  1  1  4  1  2  4
          C  4  4  4  5  1  4  4  1  4  1  1  1  1  4  2  4
          S  4  4  1  1  1  4  2  2  2  2  1  1  3  3  1  4
          W  1  1  4  4  4  1  2  2  2  2  3  3  1  1  1  1
          R  1  4  1  4  2  2  1  4  2  2  3  1  3  1  1  4
          Y  4  1  4  1  2  2  4  1  2  2  1  3  1  3  1  1
          K  4  1  1  4  2  2  2  2  1  4  1  3  3  1  1  1
          M  1  4  4  1  2  2  2  2  4  1  3  1  1  3  1  4
          B  4  1  1  1  1  3  3  1  1  3  1  2  2  2  1  1
          V  1  4  1  1  1  3  1  3  3  1  2  1  2  2  1  4
          H  1  1  1  4  3  1  1  3  1  3  2  2  2  1  1  1
          D  1  1  1  4  3  1  1  3  1  3  2  2  2  1  1  1
          N  2  2  2  2  1  1  1  1  1  1  1  1  1  1  1  2
          U  4  5  4  4  4  1  4  1  1  4  1  4  1  1  2  5

     The A/T/G/C portion used in the following slides:

              A   T   G   C
          A   5  -4  -4  -4
          T  -4   5  -4  -4
          G  -4  -4   5  -4
          C  -4  -4  -4   5

  4. Dotplots (cont.) Choose a threshold value; this value changes depending on the window size. If we keep our window size of 10 and define a threshold value of 10, a dot will be placed at the point where the total score of the window is > 10. So...

  5. Window AGTAGTGTCC scored against GTGTCCCCCC with the A/T/G/C matrix (match = 5, mismatch = -4):

          A  G  T  A  G  T  G  T  C  C
          G  T  G  T  C  C  C  C  C  C
         -4 -4 -4 -4 -4 -4 -4 -4  5  5     total = -22

     No similarity, because the total value is less than the threshold value.

  6. Window GTGGTGTCCC scored against GTGTCCCCCC with the A/T/G/C matrix (match = 5, mismatch = -4):

          G  T  G  G  T  G  T  C  C  C
          G  T  G  T  C  C  C  C  C  C
          5  5  5 -4 -4 -4 -4  5  5  5     total = 14

     Similar sequences are marked with a line on the graph, as they exceed our set threshold value of 10. (The axis label for this window reads GTAGTGTCCC; the scores shown correspond to GTGGTGTCCC.)

  7. Window AGTGTCCCCC scored against GTGTCCCCCC with the A/T/G/C matrix (match = 5, mismatch = -4):

          A  G  T  G  T  C  C  C  C  C
          G  T  G  T  C  C  C  C  C  C
         -4 -4 -4 -4 -4  5  5  5  5  5     total = 5

     No similarity, because the total value is less than the threshold value.

  8. Window GTGTCCCCCC scored against GTGTCCCCCC with the A/T/G/C matrix (match = 5, mismatch = -4):

          G  T  G  T  C  C  C  C  C  C
          G  T  G  T  C  C  C  C  C  C
          5  5  5  5  5  5  5  5  5  5     total = 50

     Similar sequences are marked with a line on the graph, as they exceed our set threshold value of 10.
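The four worked totals in slides 5 to 8 can be reproduced with a short Python sketch. The function names are hypothetical, and the choice of GTGTCCCCCC as the window being compared against is inferred from the per-position scores on the slides rather than stated outright.

```python
MATCH, MISMATCH = 5, -4  # values from the A/T/G/C matrix on the slides


def window_score(a, b):
    """Sum the per-position scores of two equal-length windows."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def is_dot(a, b, threshold=10):
    """A dot is drawn only when the window total exceeds the threshold."""
    return window_score(a, b) > threshold


reference = "GTGTCCCCCC"  # assumption: the window every slide compares against
slide_windows = ["AGTAGTGTCC", "GTGGTGTCCC", "AGTGTCCCCC", "GTGTCCCCCC"]

totals = {w: window_score(w, reference) for w in slide_windows}
dots = {w: is_dot(w, reference) for w in slide_windows}

for w in slide_windows:
    print(w, totals[w], "dot" if dots[w] else "no dot")
```

Because the threshold depends on the window size, a different window length would need a different `threshold` argument.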

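  9. BLAST (Basic Local Alignment Search Tool): fast similarity searching of the database. The query ADSFTHGYYKNMDSEGGA is broken into words (ADS, FTH, GYY), and candidate words are scored against each query word. Using PAM 250, for the word ADS:

          ADS vs ADS = 8 (2+4+2)
          ADS vs ANS = 6 (2+2+2)
          ADS vs TDS = 7 (1+4+2)

     Words with scores above a threshold (T) are taken forward. June 2001 - LJM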
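The word scores on this slide can be checked with a small Python sketch. It is only an illustration: the dictionary holds just the PAM 250 entries that the slide's three examples use, the function names are hypothetical, and the threshold T = 6 is an assumed value, since the slide does not give one.

```python
# Only the PAM 250 entries needed for the slide's three examples.
PAM250_SUBSET = {
    ("A", "A"): 2,
    ("D", "D"): 4,
    ("S", "S"): 2,
    ("D", "N"): 2,
    ("A", "T"): 1,
}


def pair_score(x, y):
    """Look up a residue pair in either order (the matrix is symmetric)."""
    if (x, y) in PAM250_SUBSET:
        return PAM250_SUBSET[(x, y)]
    return PAM250_SUBSET[(y, x)]


def word_score(word, candidate):
    """Sum the pair scores of two words of equal length."""
    return sum(pair_score(x, y) for x, y in zip(word, candidate))


def words_above(word, candidates, threshold):
    """Keep the candidates whose score against `word` exceeds the threshold."""
    return [c for c in candidates if word_score(word, c) > threshold]


scores = {c: word_score("ADS", c) for c in ["ADS", "ANS", "TDS"]}
kept = words_above("ADS", ["ADS", "ANS", "TDS"], threshold=6)  # T = 6 is assumed
print(scores, kept)
```

With the assumed T = 6, ADS and TDS pass and ANS does not, because its score only equals the threshold.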

  10. BLAST (cont.) The words ADS, FTH and GYY are located in the database sequence GHNEHAAGDFADSFTHGYYEMMDSEGGA. [Figure: word hits on the database sequence, labelled with the scores 8, 15, 25, 12, 0, -2 and the labels MDS, K, N.] Maximum scoring pairs (MSP): when score (S) > 50.

  11. BLAST (cont.)

                                                                           Score     E
          Sequences producing significant alignments:                      (bits)  Value

          gi|2501720|sp|Q95153|BRC1_CANFA BREAST CANCER TYPE 1 SUSCEP...    2280   0.0
          gi|6552299|ref|NP_009225.1| breast cancer 1, early onset; b...    3213   0.0
          gi|2507557|sp|P48754|BRC1_MOUSE BREAST CANCER TYPE 1 SUSCEP...    1569   0.0
          gi|12585549|sp|Q61510|Z147_MOUSE ZINC FINGER PROTEIN 147 (E...      57   3e-07

  12. BLAST (cont.)

          >gi|13195181|gb|AAK15590.1|AF284003_1 (AF284003) BRCA1 [Glaucomys volans]
                    Length = 963

           Score = 1112 bits (2876), Expect = 0.0
           Identities = 641/957 (66%), Positives = 746/957 (76%), Gaps = 27/957 (2%)

          Query: 274 CGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQHNRWAGSKETCNDRRT 333
                     CGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQ +RWA SKETCNDR+
          Sbjct: 1   CGTNTHASSLQHENSSLLLTKDRMNVEKAEFCNKSKQPGLARSQQSRWAKSKETCNDRQI 60

          Query: 334 PSTEKKVDLNADPLCERKEWNKQKLPCSENPRDTEDVPWITLNSSIQKVNEWFSRSDELL 393
                     PS+EKKVDLNADP  E+KE +KQK PCSEN RDT+DVPWITLNSSI+KVNEWFSRSDE+L
          Sbjct: 61  PSSEKKVDLNADPQYEKKEPSKQKHPCSENSRDTQDVPWITLNSSIRKVNEWFSRSDEML 120

          Query: 394 XXXXXXXXXXXXNAKVADVLDVLNEVDEYSGSSEKIDLLASDPHEALICKSERVHSKSVE 453
                                 NA++A +L++ NEVD +SGSSEKIDLLA+DPH ALI K +RV SK+V+
          Sbjct: 121 TSDDSDDGGSESNAEIAGILEIPNEVDGFSGSSEKIDLLATDPHNALISKCKRVCSKAVK 180

  13. PSI-BLAST [Figure with panels A, B and C.]

  14. FASTA. k-tuple = 2. The query ADSFTHGYYKNMDSEGGA is broken into k-tuples; those shown are AD, DS, SF, TH, GY, YK, NM, EG, GA, FT, HG, YN, SE. A k-tup of 1 can be used for more sensitive searches. A k-tup of 4-6 is generally used for nucleotide searches.
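Splitting the query into k-tuples, as this slide describes, can be sketched as follows. The function names are hypothetical, and the sketch assumes that the k-tuples overlap, so each start position in the query yields one tuple.

```python
from collections import defaultdict


def ktuples(seq, k=2):
    """Return every overlapping k-tuple of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ktuple_positions(seq, k=2):
    """Map each k-tuple to the list of positions where it starts."""
    positions = defaultdict(list)
    for i, word in enumerate(ktuples(seq, k)):
        positions[word].append(i)
    return dict(positions)


query = "ADSFTHGYYKNMDSEGGA"  # query sequence from the slide

pairs = ktuples(query, k=2)
lookup = ktuple_positions(query, k=2)
print(pairs)
print(lookup["DS"])
```

A smaller k produces more, shorter tuples and therefore more chance matches to follow up, which is why a k-tup of 1 is the more sensitive (and slower) setting.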

  15. FASTA (cont.) [Figure: the k-tuples YK, AD, SF, GY, NM, DS, EG, GA, TH are looked up in the database sequences.]

  16. FASTA (cont.) Scores for each matching segment: MYTMHGAGGAADSFTHGYYKNMDSEGGAEHHAGGS. Scores for gapped segments: ADSFTHGYGHERYKNMDSEGGGAGT. [Figure: k-tuple matches on the two database sequences, labelled with the scores 80, 6, 86, 94, 6, 39, 41, 7, 86.] The program now chooses the best ungapped matches to align rigorously.
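One common way to find matching segments from k-tuple hits is to count hits per diagonal, where the diagonal is the database position minus the query position. The Python sketch below illustrates that idea on the two database sequences from this slide. It is a simplified stand-in with hypothetical function names: it only counts shared k-tuples and does not reproduce the scores shown on the slide, which a real FASTA run derives from a substitution matrix.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple of `seq` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=2):
    """Count shared k-tuples per diagonal (target position - query position)."""
    index = ktuple_index(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[j - i] += 1
    return hits


query = "ADSFTHGYYKNMDSEGGA"
ungapped_target = "MYTMHGAGGAADSFTHGYYKNMDSEGGAEHHAGGS"
gapped_target = "ADSFTHGYGHERYKNMDSEGGGAGT"

ungapped = diagonal_hits(query, ungapped_target)
gapped = diagonal_hits(query, gapped_target)

best_diagonal, best_count = ungapped.most_common(1)[0]
print(best_diagonal, best_count)
print(gapped[0], gapped[4])
```

In the first sequence all hits of the full query fall on a single diagonal, whereas in the second the hits are split between two diagonals, which is the situation the gapped segment scores address.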

  17. FASTA (cont.)

           opt          E()
          < 20    217     0:=
            22      0     0:           one = represents 254 library sequences
            24      0     0:
            26      0     2:*
            28      0    22:*
            30      1   136:*
            32      4   525:= *
            34     28  1423:= *
            36    209  2922:= *
            38   1366  4829:====== *
            40   5192  6736:===================== *
            42  11114  8233:================================*===========
            44  15057  9082:===================================*========================
            46  15184  9250:====================================*=======================
            48  11557  8856:==================================*===========
            50   8138  8081:===============================*=
            52   5579  7105:====================== *
            54   4051  6069:================ *
            56   3206  5069:============= *
            58   2911  4162:============ *
            60   2656  3371:=========== *
            62   2232  2703:========= *
            64   1760  2150:======= *
            66   1541  1699:======*
            68   1228  1336:=====*
            70    979  1047:====*
            72    748   818:===*
            74    628   638:==*
            76    529   497:=*=
            78    420   386:=*
            80    335   300:=*
            82    284   229:*=
            84    262   182:*=
            86    198   141:*
            88    152   109:*          inset = represents 4 library sequences
            90    123    84:*
            92    112    65:*          :================*===========
            94     92    50:*          :============*==========
            96     70    39:*          :=========*========
            98     63    30:*          :=======*========
           100     50    23:*          :=====*=======
           102     50    18:*          :====*========
           104     47    14:*          :===*========
           106     36    11:*          :==*======
           108     39     8:*          :=*========
           110     24     6:*          :=*====
           112     30     5:*          :=*======
           114     25     4:*          :*======
           116     28     3:*          :*======
           118     14     2:*          :*===
          >120    170     2:*          :*=======================================

          The best scores are:                                        opt  bits  E(98478)
          SW:BRC1_HUMAN P38398 BREAST CANCER TYPE 1 SUSCEPT  (1863)  9652  1764  0
          SW:BRC1_CANFA Q95153 BREAST CANCER TYPE 1 SUSCEPT  (1878)  6446  1183  0
          SW:BRC1_MOUSE P48754 BREAST CANCER TYPE 1 SUSCEPT  (1812)  2601   486  1.4e-135
          SW:MLP1_YEAST Q02455 MYOSIN-LIKE PROTEIN MLP1.     (1875)   215    54  1.9e-05
          SW:MYS2_DICDI P08799 MYOSIN II HEAVY CHAIN, NON M  (2116)   214    54  2.4e-05
