Sequencher project

1. If one of the two strands can be called clearly and the other is ambiguous, leave the ambiguous call as it is. Do not use the clear strand to resolve the ambiguous call; the consensus will be determined by the two calls. (example)
2. When you export sequences, export them 5'-3', all in Pearson/FASTA format. (example)

Export as FASTA format
Format example:
>r100705ec2, 667 bases, 983E3ABA checksum.
AGAAGAAGACATAGTAATTAGATCTGAAAATTTTACAAACAATGTTAAAACCATAATAGTACAGCTGAATGAATCTATACAAATT
>r100705ecj, 667 bases, 634529C5 checksum.
AGAAGAAGACATAGTAATTAGATCTGAAAATTTTACAAACAATGTTAAAACCATAATAGTACAGCTGAATGAATCTATACAAATT
>r196807ecl, 667 bases, 79866EC6 checksum.
AGAAGAAGACATAGTAATTAGATCTGAAAATTTTACAAACAATGTTAAAACCATAATAGTACAGCTGAATGAATCTATACAAATT

Database search: to find out if there is any cross-patient or lab-strain contamination
• On Huey, put all sequences in one directory, one sequence per file, nothing else (use r1pol as demo example, r1env as demo results).
• Command line: fasterfasta 10 % UVELF &
• A: GenBank Last Full Release + Updates
• G: GenBank Last Full Release
• U: GenBank Updates
• V: All Viral sequences from GenBank Last Full Release
• Y: All Synthetic (=vector) sequences from GenBank Last Full Release
• E: Mullins Lab Sequences
• F: Frenkel Lab Sequences
• L: LANL HIV Nucleotide Database (May 96)
• O: Other (non-human) retroviral sequences (from LANL) only
• Output: fas-sum in the same directory
• The result is sent as e-mail.
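The search expects one sequence per file. If the sequences were exported as a single multi-record FASTA file, a short script can split them. This is a minimal Python sketch, not part of the lab's tooling; the function names `parse_fasta` and `split_fasta`, the `.fasta` extension, and the choice of the first header word as the file name are all assumptions made for illustration.

```python
import os


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_fasta(text, out_dir):
    """Write each record to its own file in out_dir and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for header, seq in parse_fasta(text):
        # Assumed naming: first word of the header, e.g. "r100705ec2".
        name = header.split(",")[0].split()[0]
        path = os.path.join(out_dir, name + ".fasta")
        with open(path, "w") as handle:
            handle.write(">" + header + "\n" + seq + "\n")
        paths.append(path)
    return paths
```

Calling `split_fasta(open("all.fasta").read(), "r1env")` (both names hypothetical) would leave one file per sequence in the directory, ready for the search step.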

Log on to Huey from outside
• Log on to Valis
• ssh username@blaze.csi.washington.edu
• ssh username@huey.csi.washington.edu

Guide for database search results
• If this is the first time you search for this patient, make sure the sequences are not closely related to those from other patients who are already in the database.
• r1 env fasta result as example
• If sequences from this patient are already in the database, make sure the new sequences are closely related to those from that patient.
• h2 env fasta results as example
• m2 env, pol as contamination example

Database search results 1
R1 env database search results (there are no R1 env sequences in the database yet)

r100705ec1ps vs LANL HIV Nucleotide Database (May 96) library
>>HIV1U36092 Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2251 init1: 1552 opt: 2324 Z-score: 1840.3 expect() 7.4e-95
87.417% identity in 604 nt overlap (1-604:15-615)
>>HIV1U36094 Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2251 init1: 1552 opt: 2324 Z-score: 1840.3 expect() 7.4e-95
87.417% identity in 604 nt overlap (1-604:15-615)
>>HIV1U36096 Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2251 init1: 1552 opt: 2324 Z-score: 1840.3 expect() 7.4e-95
87.417% identity in 604 nt overlap (1-604:15-615)
……

r100705ec2 vs LANL HIV Nucleotide Database (May 96) library
>>HIV1U23138 Human immunodeficiency virus type 1 isola (1446 nt)
initn: 1573 init1: 1573 opt: 2308 Z-score: 1817.6 expect() 6.2e-94
86.452% identity in 620 nt overlap (1-614:723-1336)
>>HIV49957 1 Human immunodeficiency virus type 1 isola (1446 nt)
initn: 1573 init1: 1573 opt: 2308 Z-score: 1817.6 expect() 6.2e-94
86.452% identity in 620 nt overlap (1-614:723-1336)
>>HIVJFL Human immunodeficiency virus type 1 provi (2553 nt)
initn: 1500 init1: 1353 opt: 2302 Z-score: 1809.9 expect() 9.4e-94
86.356% identity in 623 nt overlap (1-614:789-1405)
……

r100705ec3 vs LANL HIV Nucleotide Database (May 96) library
>>377_V09_26B 607 bp DNA 5-JAN-19 (607 nt)
initn: 1549 init1: 1482 opt: 2326 Z-score: 1678.7 expect() 8e-86
87.622% identity in 614 nt overlap (1-604:1-607)
>>HIVU96503 HIV-1 patient 064 clone 064P02 from USA, (664 nt)
initn: 1777 init1: 1046 opt: 2323 Z-score: 1676.1 expect() 1e-85
86.741% identity in 626 nt overlap (1-614:24-643)
>>377V09_26B 606 bp DNA 5-JAN-199 (606 nt)
initn: 1544 init1: 1477 opt: 2321 Z-score: 1675.1 expect() 1.3e-85
87.602% identity in 613 nt overlap (2-604:1-606)
……

Database search results 2

r100705ec4 vs LANL HIV Nucleotide Database (May 96) library
>>HIVSFAAA Human immunodeficiency virus type 1 genom (3954 nt)
initn: 2246 init1: 1645 opt: 2395 Z-score: 1877.2 expect() 1.1e-97
88.651% identity in 608 nt overlap (1-605:1273-1877)
>>HIVSF162 Human immunodeficiency virus type 1 (HIV- (3954 nt)
initn: 2246 init1: 1645 opt: 2395 Z-score: 1877.2 expect() 1.1e-97
88.651% identity in 608 nt overlap (1-605:1273-1877)
>>HIV1U36091 Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2358 init1: 1596 opt: 2401 Z-score: 1890.9 expect() 1.1e-97
88.760% identity in 605 nt overlap (1-605:14-615)
……

r100705eca vs LANL HIV Nucleotide Database (May 96) library
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1664 init1: 1664 opt: 2393 Z-score: 1968.8 expect() 3.8e-103
88.468% identity in 607 nt overlap (1-604:6330-6936)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1664 init1: 1664 opt: 2393 Z-score: 1968.8 expect() 3.8e-103
88.468% identity in 607 nt overlap (1-604:6330-6936)
>>HIV1U36096 Human immunodeficiency virus type 1 sampl (650 nt)
initn: 2379 init1: 1590 opt: 2396 Z-score: 1984.3 expect() 7.1e-103
88.742% identity in 604 nt overlap (1-604:15-615)
……

r100705ecb vs LANL HIV Nucleotide Database (May 96) library
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1678 init1: 1678 opt: 2389 Z-score: 1938.4 expect() 1.9e-101
88.322% identity in 608 nt overlap (1-605:6329-6936)
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1678 init1: 1678 opt: 2389 Z-score: 1938.4 expect() 1.9e-101
88.322% identity in 608 nt overlap (1-605:6329-6936)
>>HIVSFAAA Human immunodeficiency virus type 1 genom (3954 nt)
initn: 2247 init1: 1627 opt: 2386 Z-score: 1939.9 expect() 3.4e-101
88.487% identity in 608 nt overlap (1-605:1273-1877)
……

Database search results 3
H2 env database search results (there are h2 env sequences in the database already)

h200926ec1 vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2512.5 expect() 2.9e-132
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523 HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 1996.3 expect() 1.3e-103
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1992.5 expect() 2.6e-103
88.599% identity in 614 nt overlap (1-611:1-611)
……

h200926ec2 vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3055 init1: 3055 opt: 3055 Z-score: 2519.9 expect() 1.1e-132
100.000% identity in 611 nt overlap (1-611:1-611)
>>AF105523 HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2441 init1: 2281 opt: 2431 Z-score: 2003.7 expect() 4.9e-104
88.871% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2415 init1: 2415 opt: 2425 Z-score: 2000.0 expect() 1e-103
88.762% identity in 614 nt overlap (1-611:1-611)
……

h200926ec3 vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2512.5 expect() 2.9e-132
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523 HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 1996.3 expect() 1.3e-103
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1992.5 expect() 2.6e-103
88.599% identity in 614 nt overlap (1-611:1-611)
……

Database search results 4

>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1785 init1: 1660 opt: 2399 Z-score: 1965.5 expect() 5.7e-103
88.799% identity in 616 nt overlap (1-611:6329-6936)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1785 init1: 1660 opt: 2399 Z-score: 1965.5 expect() 5.7e-103
88.799% identity in 616 nt overlap (1-611:6329-6936)
>>AF204455 HIV-1 clone p2p061-14 country USA envelop (600 nt)
initn: 2406 init1: 2406 opt: 2406 Z-score: 1984.4 expect() 7.6e-103
89.000% identity in 600 nt overlap (2-601:1-600)
……

h200926ec4 vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2512.5 expect() 2.9e-132
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523 HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 1996.3 expect() 1.3e-103
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1992.5 expect() 2.6e-103
88.599% identity in 614 nt overlap (1-611:1-611)
……

h200926ec5 vs LANL HIV Nucleotide Database (May 96) library
>>h297929ec1 611 bp DNA 0 190 (611 nt)
initn: 3046 init1: 3046 opt: 3046 Z-score: 2521.8 expect() 8.6e-133
99.836% identity in 611 nt overlap (1-611:1-611)
>>AF105523 HIV-1 isolate C-DI-10 from Italy, envelop (789 nt)
initn: 2432 init1: 2272 opt: 2422 Z-score: 2003.7 expect() 4.9e-104
88.707% identity in 611 nt overlap (1-611:45-652)
>>MS97109Fc7 611 bp DNA 0 190 (611 nt)
initn: 2406 init1: 2406 opt: 2416 Z-score: 1999.9 expect() 1e-103
88.599% identity in 614 nt overlap (1-611:1-611)
……

Database search results 5
M2 env database search results (they show evidence of cross-patient contamination with H1)

m200712ec2 vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2974 init1: 2974 opt: 2974 Z-score: 2235.4 expect() 7.9e-117
99.336% identity in 602 nt overlap (1-602:1-602)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1859.7 expect() 4.5e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1859.7 expect() 4.5e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
……

m200712ec4 vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2962 init1: 2962 opt: 2962 Z-score: 2212.4 expect() 1.5e-115
99.003% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1806 init1: 1711 opt: 2479 Z-score: 1839.1 expect() 6.3e-96
90.296% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1806 init1: 1711 opt: 2479 Z-score: 1839.1 expect() 6.3e-96
90.296% identity in 608 nt overlap (1-602:6329-6936)
……

m200712ec5 vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2965 init1: 2965 opt: 2965 Z-score: 2252.2 expect() 9.1e-118
99.169% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1818 init1: 1723 opt: 2482 Z-score: 1872.7 expect() 8.5e-98
90.461% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1818 init1: 1723 opt: 2482 Z-score: 1872.7 expect() 8.5e-98
90.461% identity in 608 nt overlap (1-602:6329-6936)
……

Database search results 6

m200712ec6 vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2974 init1: 2974 opt: 2974 Z-score: 2261.3 expect() 2.8e-118
99.336% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1881.4 expect() 2.8e-98
90.625% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1818 init1: 1723 opt: 2491 Z-score: 1881.4 expect() 2.8e-98
90.625% identity in 608 nt overlap (1-602:6329-6936)
……

m200712ec7 vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2956 init1: 2956 opt: 2956 Z-score: 2275.0 expect() 4.9e-119
99.003% identity in 602 nt overlap (1-602:1-602)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 1836 init1: 1741 opt: 2509 Z-score: 1918.2 expect() 2.5e-100
90.954% identity in 608 nt overlap (1-602:6329-6936)
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 1836 init1: 1741 opt: 2509 Z-score: 1918.2 expect() 2.5e-100
90.954% identity in 608 nt overlap (1-602:6329-6936)
……

m200712ec8 vs LANL HIV Nucleotide Database (May 96) library
>>h197514eca 602 bp DNA 0 190 (602 nt)
initn: 2956 init1: 2956 opt: 2956 Z-score: 2216.9 expect() 8.4e-116
99.003% identity in 602 nt overlap (1-602:1-602)
>>HIVU63632 HIV-1 isolate JRFL (USA) gag, pol, vif, v (8896 nt)
initn: 2453 init1: 1723 opt: 2491 Z-score: 1855.6 expect() 7.6e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
>>HIVJRFL Human immunodeficiency virus type 1, isol (8896 nt)
initn: 2453 init1: 1723 opt: 2491 Z-score: 1855.6 expect() 7.6e-97
90.625% identity in 608 nt overlap (1-602:6329-6936)
……
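Reading through the search summaries by eye gets tedious with many clones. The following is a rough Python sketch of how the check could be automated, under stated assumptions: the first two characters of a sequence name are taken as the patient code (r1, h2, m2, h1, as the examples suggest), and the 98% cutoff is an arbitrary illustrative value placed between the roughly 99% same-patient hits and the 86-91% unrelated hits shown above. Neither is a lab rule, and `parse_hits` and `flag_contamination` are hypothetical names.

```python
import re

# One pattern with three alternatives: a query line, a hit name, an identity value.
TOKEN = re.compile(
    r"(?P<query>\S+) vs [^>]*?library"
    r"|>>(?P<hit>\S+)"
    r"|(?P<ident>\d+(?:\.\d+)?)% identity"
)


def parse_hits(text):
    """Return (query, hit, percent_identity) triples from fasta summary text."""
    results = []
    query = hit = None
    for match in TOKEN.finditer(text):
        if match.group("query"):
            query, hit = match.group("query"), None
        elif match.group("hit"):
            hit = match.group("hit")
        elif hit is not None and query is not None:
            results.append((query, hit, float(match.group("ident"))))
            hit = None
    return results


def flag_contamination(hits, threshold=98.0):
    """Return hits at or above threshold whose patient code differs from the query's."""
    flagged = []
    for query, hit, ident in hits:
        # Assumption: characters 1-2 of the name identify the patient.
        same_patient = query[:2].lower() == hit[:2].lower()
        if ident >= threshold and not same_patient:
            flagged.append((query, hit, ident))
    return flagged
```

On the M2 example this flags the h197514eca hits, while the H2 example (query and top hit both from h2) passes. Hits that appear without a preceding query line are skipped.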

Outgroup: to find multiple sequences that can be used as the outgroup for the sequences you are interested in
1. Pick a couple of your sequences.
2. BLAST them against NCBI GenBank and choose 6-12 sequences as your outgroup.
3. Put the outgroup sequences in FASTA format in one file (show example).

Outgroup examples
>HIVU63632 618 bp DNA 14-JUN-1999, 618 bases, 5E9DEF6C checksum.
agaagaagaggtagtaattagatctgacaatttcacgaacaatgctaaaa
ccataatagtacagctgaaagaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccagggagagc
……..
>HIVJRCSF 618 bp DNA 14-JUN-1999, 618 bases, 89D0C733 checksum.
agaagaaaaggttgtaattagatctgacaattttacggacaatgctaaaa
ccataatagtacagctgaatgaatctgtaaaaattaattgtacaaggccc
agcaacaatacaagaaaaagtatacatata------ggaccagggagagc
……...
>HIVU95410 618 bp DNA 14-JUN-1999, 618 bases, C4E18A19 checksum.
agaagaagaggtagtaattagatccgacaatttcacggacaatgctaaaa
tcataatagtacagctgaatgaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccaggcagagc
……….
>HIVU95413 618 bp DNA 14-JUN-1999, 618 bases, E4689196 checksum.
agaagaagaggtagtaattagatccgacaatttcacggacaatgctaaaa
tcataatagtacagctgaatgaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccaggcagagc
……...
>HIVBAL1A 618 bp DNA 14-JUN-1999, 618 bases, 4184AF81 checksum.
agaagaagaggtagtaattagatccgccaatttcgcggacaatgctaaag
tcataatagtacagctgaatgaatctgtagaaattaattgtacaagaccc
aacaacaatacaagaaaaagtatacatata------ggaccaggcagagc
…………..

clustalw alignment
• 1. On Unix, put all sequences in one directory: only individual sequences, including the outgroup sequences, but nothing else (on valis, r1pol as demo example, r1env as demo results).
• 2. Command line:
• cat * > all
• clustalw all &
• (the default output files are called all.aln and all.dnd)
• (the clustalw program is available on valis, huey, watson, crick....)

Steps that are not necessary for preliminary quality checking
• 3. Change the .aln format to .gde format in clustalw.
• 4. Open GDE on valis or sage to adjust the alignment, generate the consensus of the first timepoint, put the consensus at the top of the alignment, and translate to an amino acid alignment. Then export into .ig format.
• 5. Pretty dot picture:
• On the valis command line: pcgdots filename.ig
• pcgdots -l 120 -o outfilename infilename.ig

Play with sequence format
Two methods are introduced here:
1. Use clustalw to convert all.aln into GCG/MSF, PHYLIP, or GDE format.
2. Use the readseq command to convert to FASTA format (needed for the hypermutation test later in our current case).

ClustalW command line: %clustalw

 **************************************************************
 ******** CLUSTAL W (1.75) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice: 1

Sequences should all be in 1 file.
7 formats accepted:
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.

Enter the name of the sequence file: allresult.aln

Clustal example 2

Sequence format is Clustal
Sequences assumed to be DNA

Sequence 1: r100705ec2_ 667 bp
Sequence 2: r100705ecj_ 667 bp
Sequence 3: r196807ecl_ 667 bp
Sequence 4: r198610ecb_ 667 bp
Sequence 5: r198610ecd_ 667 bp
Sequence 6: r196807eca_ 667 bp
Sequence 7: r196807ecj_ 667 bp
Sequence 8: r198610ec2_ 667 bp
Sequence 9: r196807ech_ 667 bp
Sequence 10: r100705ece_ 667 bp
Sequence 11: r199d22ec5_ 667 bp
.
.

Clustal example 3

 **************************************************************
 ******** CLUSTAL W (1.75) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice: 2

Clustal example 4

 ****** MULTIPLE ALIGNMENT MENU ******

     1. Do complete multiple alignment now (Slow/Accurate)
     2. Produce guide tree file only
     3. Do alignment using old guide tree file
     4. Toggle Slow/Fast pairwise alignments = SLOW
     5. Pairwise alignment parameters
     6. Multiple alignment parameters
     7. Reset gaps before alignment? = OFF
     8. Toggle screen display = ON
     9. Output format options

     S. Execute a system command
     H. HELP
     or press [RETURN] to go back to main menu

Your choice: 9

Clustal example 5

 ********* Format of Alignment Output *********

     1. Toggle CLUSTAL format output = ON
     2. Toggle NBRF/PIR format output = OFF
     3. Toggle GCG/MSF format output = OFF
     4. Toggle PHYLIP format output = OFF
     5. Toggle GDE format output = OFF
     6. Toggle GDE output case = LOWER
     7. Toggle CLUSTALW sequence numbers = OFF
     8. Toggle output order = ALIGNED (attention here)
     9. Create alignment output file(s) now?
     0. Toggle parameter output = OFF
     H. HELP

Enter number (or [RETURN] to exit): 5

Clustal example 6

 ********* Format of Alignment Output *********

     1. Toggle CLUSTAL format output = ON
     2. Toggle NBRF/PIR format output = OFF
     3. Toggle GCG/MSF format output = OFF
     4. Toggle PHYLIP format output = OFF
     5. Toggle GDE format output = ON
     6. Toggle GDE output case = LOWER
     7. Toggle CLUSTALW sequence numbers = OFF
     8. Toggle output order = ALIGNED
     9. Create alignment output file(s) now?
     0. Toggle parameter output = OFF
     H. HELP

Enter number (or [RETURN] to exit): 4

Clustal example 7

 ********* Format of Alignment Output *********

     1. Toggle CLUSTAL format output = ON
     2. Toggle NBRF/PIR format output = OFF
     3. Toggle GCG/MSF format output = OFF
     4. Toggle PHYLIP format output = ON
     5. Toggle GDE format output = ON
     6. Toggle GDE output case = LOWER
     7. Toggle CLUSTALW sequence numbers = OFF
     8. Toggle output order = ALIGNED
     9. Create alignment output file(s) now?
     0. Toggle parameter output = OFF
     H. HELP

Enter number (or [RETURN] to exit): 9

Clustal example 8

WARNING: Output file name is the same as input file.
Enter new name to avoid overwriting [allresult.aln]: all.aln
Enter a name for the PHYLIP output file [allresult.phy]: all.phy
Enter a name for the GDE output file [allresult.gde]: all.gde

Consensus length = 667
CLUSTAL-Alignment file created [all.aln]
WARNING: Truncating sequence names to 10 characters for PHYLIP output.
PHYLIP-Alignment file created [all.phy]
GDE-Alignment file created [all.gde]

Clustal example 9

 ********* Format of Alignment Output *********

     1. Toggle CLUSTAL format output = ON
     2. Toggle NBRF/PIR format output = OFF
     3. Toggle GCG/MSF format output = OFF
     4. Toggle PHYLIP format output = ON
     5. Toggle GDE format output = ON
     6. Toggle GDE output case = LOWER
     7. Toggle CLUSTALW sequence numbers = OFF
     8. Toggle output order = ALIGNED
     9. Create alignment output file(s) now?
     0. Toggle parameter output = OFF
     H. HELP

Enter number (or [RETURN] to exit): (press Return)

Clustal example 10

 ****** MULTIPLE ALIGNMENT MENU ******

     1. Do complete multiple alignment now (Slow/Accurate)
     2. Produce guide tree file only
     3. Do alignment using old guide tree file
     4. Toggle Slow/Fast pairwise alignments = SLOW
     5. Pairwise alignment parameters
     6. Multiple alignment parameters
     7. Reset gaps before alignment? = OFF
     8. Toggle screen display = ON
     9. Output format options

     S. Execute a system command
     H. HELP
     or press [RETURN] to go back to main menu

Your choice: (press Return)

Clustal example 11

 **************************************************************
 ******** CLUSTAL W (1.75) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice: x

readseq: to get FASTA format
Command line: %readseq

readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display):
r1e.fasta
    1. IG/Stanford          10. Olsen (in-only)
    2. GenBank/GB           11. Phylip3.2
    3. NBRF                 12. Phylip
    4. EMBL                 13. Plain/Raw
    5. GCG                  14. PIR/CODATA
    6. DNAStrider           15. MSF
    7. Fitch                16. ASN.1
    8. Pearson/Fasta        17. PAUP/NEXUS
    9. Zuker (in-only)      18. Pretty (out-only)
Choose an output format (name or #):
8
Name an input sequence or -option:
all.phy

Readseq 2

Sequences in all.phy (format is 12. Phylip)
 1) r100705ec2
 2) r100705ecj
 3) r196807ecl
 4) r198610ecb
 5) r198610ecd
 6) r196807eca
 7) r196807ecj
 8) r198610ec2
 9) r196807ech
10) r100705ece
.
.
.
Choose a sequence (# or All):
all
Name an input sequence or -option:
(press Return)
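Where readseq is not installed, the same PHYLIP-to-FASTA conversion can be approximated in a few lines. This is a minimal Python sketch, not a replacement for readseq: it assumes strict PHYLIP layout (a 10-character name field, blocks in a fixed sequence order, the kind of file clustalw writes as all.phy), and the function name `phylip_to_fasta` is made up for this example.

```python
def phylip_to_fasta(text):
    """Convert PHYLIP alignment text (interleaved, strict 10-char names) to FASTA."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    # Header line: number of sequences, then alignment length.
    n_seqs, n_sites = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_seqs:
            # First block: 10-character name field followed by sequence.
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:
            # Later blocks: sequence only, same order as the first block.
            seqs[i % n_seqs] += line.replace(" ", "")
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(
                "%s has %d sites, header says %d" % (name, len(seq), n_sites)
            )
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in zip(names, seqs))
```

The length check catches files that do not follow the assumed layout (for example relaxed PHYLIP with longer names) instead of silently producing shifted sequences.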

Hypermutation: definition of Simon Wain-Hobson
1. Monotony of G->A transitions with respect to the viral plus strand, maybe occasionally accompanied by a few (<5%) other substitutions.
2. All parts of the retroviral genome are vulnerable.
3. Overall, the number of G->A transitions per sequence should be >5, while the transition frequency should be >5% of the number of Gs. Up to 60% of Gs can be substituted.
4. The distribution of substitutions may be confined to a very small region, say 50 bp. Equally, they may be distributed in an erratic manner throughout the genome.
5. G->A transitions are associated with dinucleotide context, declining in the order GpA>GpG>GpT>GpC. Occasionally a few examples have GpG>GpA.
6. May be accompanied by small deletions of 1-5 bases. Larger deletions and small insertions (1-3 bases) are rarer.
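Criteria 1, 3 and 5 are simple counts, so they can be tallied from a pairwise alignment. The Python sketch below is only an illustration of that arithmetic and is not the Hypermut program: the function name `hypermutation_summary` is hypothetical, the reference is assumed to be an aligned plus-strand sequence such as the first-timepoint consensus, and the dinucleotide context is taken from the reference base that follows each mutated G, which is a simplifying assumption.

```python
def hypermutation_summary(reference, query):
    """Count G->A transitions in query relative to an aligned reference."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    ref, qry = reference.upper(), query.upper()
    bases = ("A", "C", "G", "T")
    n_g = g_to_a = other = 0
    context = {"GA": 0, "GG": 0, "GT": 0, "GC": 0}
    for i, (r, q) in enumerate(zip(ref, qry)):
        if r not in bases or q not in bases:
            continue  # skip gaps and ambiguous calls
        if r == "G":
            n_g += 1
        if r == q:
            continue
        if r == "G" and q == "A":
            g_to_a += 1
            # Assumed context: next non-gap base in the reference.
            nxt = next((b for b in ref[i + 1:] if b != "-"), None)
            if nxt is not None and "G" + nxt in context:
                context["G" + nxt] += 1
        else:
            other += 1
    frequency = g_to_a / n_g if n_g else 0.0
    return {
        "g_to_a": g_to_a,
        "other_substitutions": other,
        "reference_g": n_g,
        "g_to_a_frequency": frequency,
        "context": context,
        # Criterion 3: more than 5 transitions and more than 5% of Gs.
        "meets_criterion_3": g_to_a > 5 and frequency > 0.05,
    }
```

A sequence that passes `meets_criterion_3` is only a candidate; the other criteria (few other substitutions, GpA/GpG context bias) still need to be looked at before excluding it.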

Hypermut program of Bette Korber (LANL)
• http://hiv-web.lanl.gov/HYPERMUT/hypermut.html
• Copy in the file "r1e.fasta", which is a FASTA-format alignment. (I found that .phy format sometimes does not work well, even though this site accepts .phy format.)
• H2 env as example
• A couple of RT-region examples (h1pro, h2rt)
• R1env. hyperdemo file as example

Distance matrix: to generate a distance matrix for tree making and diversity calculation
We will use the PHYLIP package on Valis (Unix) today. PHYLIP is also available for Mac and PC. It takes .phy format.
On a Mac, start with an alignment file, for example all.phy. To make a distance matrix, put this file in the same directory as the program "dnadist" and double-click the program; it will ask for the infile name. When the program is done, the result is in a file called outfile. Rename it if you want to keep it (otherwise it will be overwritten).

Distance matrix 2
On UNIX:
yang /export/home/mullab/yang/Sequences/PhylogeneticTraining/r1env %dnadist
dnadist: can't read infile
Please enter a new filename>all.phy

Nucleic acid sequence Distance Matrix program, version 3.572c

Settings for this run:
  D  Distance (Kimura, Jin/Nei, ML, J-C)?  Kimura 2-parameter
  T  Transition/transversion ratio?  2.0
  C  One category of substitution rates?  Yes
  L  Form of distance matrix?  Square
  M  Analyze multiple data sets?  No
  I  Input sequences interleaved?  Yes
  0  Terminal type (IBM PC, VT52, ANSI)?  (none)
  1  Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or letter for one to change)
d

Distance matrix 3

Nucleic acid sequence Distance Matrix program, version 3.572c

Settings for this run:
  D  Distance (Kimura, Jin/Nei, ML, J-C)?  Jin and Nei
  T  Transition/transversion ratio?  2.0
  C  One category of substitution rates?  Yes
  L  Form of distance matrix?  Square
  M  Analyze multiple data sets?  No
  I  Input sequences interleaved?  Yes
  0  Terminal type (IBM PC, VT52, ANSI)?  (none)
  1  Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or letter for one to change)
d

Distance matrix 4

Nucleic acid sequence Distance Matrix program, version 3.572c

Settings for this run:
  D  Distance (Kimura, Jin/Nei, ML, J-C)?  Maximum Likelihood
  T  Transition/transversion ratio?  2.0
  C  One category of substitution rates?  Yes
  F  Use empirical base frequencies?  Yes
  L  Form of distance matrix?  Square
  M  Analyze multiple data sets?  No
  I  Input sequences interleaved?  Yes
  0  Terminal type (IBM PC, VT52, ANSI)?  (none)
  1  Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or letter for one to change)
l

Distance matrix 5

Nucleic acid sequence Distance Matrix program, version 3.572c

Settings for this run:
  D  Distance (Kimura, Jin/Nei, ML, J-C)?  Maximum Likelihood
  T  Transition/transversion ratio?  2.0
  C  One category of substitution rates?  Yes
  F  Use empirical base frequencies?  Yes
  L  Form of distance matrix?  Lower-triangular
  M  Analyze multiple data sets?  No
  I  Input sequences interleaved?  Yes
  0  Terminal type (IBM PC, VT52, ANSI)?  (none)
  1  Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or letter for one to change)
y

Distance matrix 6

Distances calculated for species
r100705ec2 ....................................................................
r100705ecj ...................................................................
r196807ecl ..................................................................
r198610ecb .................................................................
r198610ecd ................................................................
r196807eca ...............................................................
r196807ecj ..............................................................
r198610ec2 .............................................................
r196807ech ............................................................
r100705ece ...........................................................
r199d22ec5 ..........................................................
r100705ech .........................................................
r198610ece ........................................................
r198610ec8 .......................................................
r199d22ec3 ......................................................
r100705eci .....................................................
r199d22eca ....................................................
r196807ecd ...................................................
r196807eck ..................................................
r199d22ech .................................................
r198610ecf ................................................
r196807ece ...............................................
r196807eci ..............................................
r199d22ecd .............................................
r199d22ecf ............................................
r196807ecb ...........................................
r199d22ecc ..........................................
r199d22ec4 .........................................
r199d22ecb ........................................
r100705ec1 .......................................
r100705ecb ......................................
r198610ec3 .....................................
r198610ec7 ....................................
r199d22ecg ...................................
r195301eca ..................................
r195301ece .................................
r196807ecc ................................
r100705ecc ...............................
r198610ec1 ..............................

Distances written to file (default name: outfile)
To rename outfile: mv outfile r1edist

Making a tree using the PHYLIP package on valis

yang /export/home/mullab/yang/Sequences/PhylogeneticTraining/r1env %neighbor3.6
neighbor3.6: can't find input file "infile"
Please enter a new file name> r1edist

Neighbor-Joining/UPGMA method version 3.5

Settings for this run:
  N  Neighbor-joining or UPGMA tree?  Neighbor-joining
  O  Outgroup root?  No, use as outgroup species 1
  L  Lower-triangular data matrix?  No
  R  Upper-triangular data matrix?  No
  S  Subreplicates?  No
  J  Randomize input order of species?  No. Use input order
  M  Analyze multiple data sets?  No
  0  Terminal type (IBM PC, ANSI, none)?  ANSI
  1  Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3  Print out tree  Yes
  4  Write out trees onto tree file?  Yes

Y to accept these or type the letter for one to change
l

Neighbor-joining 2

Neighbor-Joining/UPGMA method version 3.5

Settings for this run:
  N  Neighbor-joining or UPGMA tree?  Neighbor-joining
  O  Outgroup root?  No, use as outgroup species 1
  L  Lower-triangular data matrix?  Yes
  R  Upper-triangular data matrix?  No
  S  Subreplicates?  No
  J  Randomize input order of species?  No. Use input order
  M  Analyze multiple data sets?  No
  0  Terminal type (IBM PC, ANSI, none)?  ANSI
  1  Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3  Print out tree  Yes
  4  Write out trees onto tree file?  Yes

Y to accept these or type the letter for one to change
y

Neighbor-joining 3

Cycle 66: OTU 64 ( -0.00020) joins OTU 65 ( 0.00190)
Cycle 65: node 64 ( 0.01700) joins OTU 66 ( 0.02905)
Cycle 64: node 64 ( 0.01058) joins OTU 67 ( 0.01462)
Cycle 63: node 64 ( 0.00750) joins OTU 68 ( 0.04174)
.
.
.
last cycle:
node 1 ( 0.00001) joins node 30 ( 0.00007) joins OTU 34 ( 0.00163)

Output written on output file (there is one output file: outfile)
Tree written on tree file (default output treefile name: outtree)
Done.

• To rename the outtree file: mv outtree r1enjtree
• Use the TreeView program to open the treefile and root it with the outgroup

TreeView program to open the treefile
• Tree->define outgroup
• Tree->root with outgroup
• Tree->order
• Select ladderise left (right), click OK
• Save as a graphic file (pic), open and play with it in Canvas

Calculate intra-timepoint diversity using the distance matrix
1. Open r1edist in Word first.
2. Replace all "paragraph mark, space, space" with "space, space", then save as a text file. (This joins the wrapped continuation lines so that each sequence's distances sit on one row.)
3. Open the text file in Excel (select fixed width).
4. Calculate the average distance within each timepoint using Excel tools.
If you do not use GDE to sort the alignment after the clustalw alignment, you have to run clustalw in the foreground instead of the background. In the foreground you can change the output order to "Input order" instead of "ALIGNED", so that the output follows the input order, which is sorted by name because cat * concatenates the files in name order. You need a name-sorted distance matrix to calculate intra-timepoint population diversity.
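The Word and Excel steps above can also be scripted. The following Python sketch reads a PHYLIP lower-triangular matrix (such as r1edist), joins wrapped continuation lines, and averages the pairwise distances within each timepoint. It rests on assumptions that should be checked against your own naming scheme: the timepoint is taken to be characters 3-7 of the sequence name (for example "00705" in r100705ec2), names occupy the first 10 columns, and continuation lines start with whitespace. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def read_lower_triangular(text):
    """Parse a PHYLIP lower-triangular matrix into (names, pair -> distance)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:]:
        if line[0].isspace():
            # Wrapped continuation of the previous row.
            rows[-1].extend(float(x) for x in line.split())
        else:
            names.append(line[:10].strip())
            rows.append([float(x) for x in line[10:].split()])
    if len(names) != n:
        raise ValueError("expected %d sequences, found %d" % (n, len(names)))
    dist = {}
    for i, row in enumerate(rows):
        if len(row) != i:
            raise ValueError("row for %s has %d values" % (names[i], len(row)))
        for j, value in enumerate(row):
            dist[frozenset((names[i], names[j]))] = value
    return names, dist


def intra_timepoint_diversity(names, dist, timepoint_of=lambda name: name[2:7]):
    """Return {timepoint: mean pairwise distance among its sequences}."""
    groups = defaultdict(list)
    for name in names:
        groups[timepoint_of(name)].append(name)
    result = {}
    for timepoint, members in groups.items():
        pairs = list(combinations(members, 2))
        if pairs:  # a timepoint with one sequence has no pairwise distance
            result[timepoint] = sum(dist[frozenset(p)] for p in pairs) / len(pairs)
    return result
```

Because sequences are grouped by name rather than by position, this approach does not require the matrix to be sorted by name. Outgroup sequences will form their own groups under the default `timepoint_of` and can simply be ignored in the result.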

What do we have here?
• From a group of individual nucleotide sequences, we:
• Get rid of contamination (if any)
• Get rid of hypermutated sequences (if any)
• Find a group of sequences to use as the outgroup
• Make a nucleotide alignment using clustalw
• Change sequence formats as needed for later steps
• Make an F84 neighbor-joining tree and root it with the outgroup
• Calculate nucleotide intra-timepoint diversity
• Plot the change in diversity over time
