
Coursework Sequence Operation




  1. Coursework Sequence Operation Info3021 2004/5

  2. DNA Sequence Characteristics • A, T, C and G • Variable length, and can be huge • Patterns carry important information • There is a relationship with the protein sequence

  3. Required operations • Translate to protein sequence • Compare difference (distance) • Input validation

  4. Standard Genetic Code • A gene sequence can be converted to a protein sequence according to the standard genetic code • TTA → Leu (L) • TTT → Phe (F) • Example: gene TTTTTA → protein FL

  5. Translate to protein • Gene: TTTATTTTAATGGTT • Codons: TTT → F, ATT → I, TTA → L, ATG → M, GTT → V • Protein: FILMV
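The codon-by-codon translation on this slide can be sketched in a few lines. This is an illustrative Python sketch of the logic only; the coursework itself asks for a PL/SQL member function. The names `CODON_TABLE` and `translate` are hypothetical, stop codons are written as `*`, and ignoring a trailing incomplete codon is an assumption that the slides do not state.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate a DNA string into one-letter amino acid codes.

    Assumption: a trailing group of one or two bases is ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop an incomplete final codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("TTTATTTTAATGGTT"))  # FILMV
print(translate("TTTTTA"))           # FL
```

A lookup table keeps the translation a single pass over the sequence, which matters for long inputs such as the SRY sequence.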

  6. DNA to Protein Table http://acrux.igh.cnrs.fr/proteomics/codon.html

  7. Sequence Comparison • Given two sequences: ATCCGGTCCAAGTTT and GTACGGTACATGTTT • How similar are they?

  8. Distance • One simple method is to measure the distance between them • D(a,b) is a distance metric if all of the following hold: • D(a,b) >= 0 for all sequences a, b • D(a,b) = 0 if and only if a = b • D(a,b) = D(b,a) (symmetry) • D(a,b) + D(b,c) >= D(a,c) (triangle inequality)

  9. Matching Distance (1) • In distance calculation, we assume • Identity: 0 • Mismatch: 1 • Matching starts from the first character
     ATCCGGTCCAAGTTT
     GTACGGTACATGTTT
     101000010010000   sum = 4 (distance)

  10. Matching Distance (2) • In distance calculation, we assume • Extra characters in the longer sequence count as mismatches
     ATCCGGTCCAAGTTTACCGTT
     GTACGGTACATGTTT
     101000010010000111111   sum = 10 (distance)
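The two matching-distance slides can be summarised in a short function. This is a Python sketch of the rule only (not the PL/SQL method the coursework requires), and `matching_distance` is a hypothetical name.

```python
def matching_distance(a, b):
    """Position-by-position distance, starting from the first character.

    Identical characters score 0, mismatches score 1, and every extra
    character in the longer sequence also scores 1.
    """
    mismatches = sum(x != y for x, y in zip(a, b))  # overlapping part
    extra = abs(len(a) - len(b))                    # unmatched tail
    return mismatches + extra


print(matching_distance("ATCCGGTCCAAGTTT", "GTACGGTACATGTTT"))        # 4
print(matching_distance("ATCCGGTCCAAGTTTACCGTT", "GTACGGTACATGTTT"))  # 10
```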

  11. Meaning of The Distance • The more similar two sequences are, the smaller their distance • An exact match between two sequences has a distance of 0 • The distance between two sequences can therefore be used to find matching sequences

  12. Precondition for matching distance <= S • The number of characters not matched must be <= S • It follows that, if n1 and n2 are the numbers of characters in the two sequences, and a1 and a2, c1 and c2, g1 and g2, and t1 and t2 are their numbers of A, C, G and T, then |n1-n2| <= S, |a1-a2| <= S, |c1-c2| <= S, |g1-g2| <= S, |t1-t2| <= S

  13. Performance improvement • 1. GATATCCGAATTGGATTCAA (n=20) • 2. GATATACAGATC (n=12) • 3. ATC (n=3) • 4. ATCG (n=4) • List the sequences whose distance to ‘ATC’ is less than 1 • Do we need to consider No. 1 and 2?
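The precondition can act as a cheap filter before any character-by-character comparison. The Python sketch below illustrates the idea on the four sequences from this slide; `passes_precondition` is a hypothetical name, and "distance less than 1" is read as a limit of S = 0.

```python
from collections import Counter


def passes_precondition(a, b, limit):
    """Return False when the distance between a and b must exceed limit.

    Checks the sequence lengths and the counts of A, C, G and T.
    Passing this test does not guarantee that the distance is <= limit.
    """
    if abs(len(a) - len(b)) > limit:
        return False
    count_a, count_b = Counter(a), Counter(b)
    return all(abs(count_a[base] - count_b[base]) <= limit for base in "ACGT")


candidates = ["GATATCCGAATTGGATTCAA", "GATATACAGATC", "ATC", "ATCG"]
survivors = [s for s in candidates if passes_precondition(s, "ATC", 0)]
print(survivors)  # ['ATC']
```

Sequences 1 and 2 are rejected on length alone (differences of 17 and 9), so they never need a full comparison.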

  14. Representation of comparison
     ATGCAATCATATGCTTCT
     TTAGAATTATTC
     +*++***+**++
     Key: * identical, + different
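The match pattern shown on this slide can be generated as follows. This is a Python sketch with the hypothetical name `comparison_pattern`; as in the slide's example, the pattern covers only the overlapping part of the two sequences.

```python
def comparison_pattern(a, b):
    """Return '*' for each identical position and '+' for each different one.

    Only the overlapping prefix is compared, so the pattern is as long
    as the shorter sequence.
    """
    return "".join("*" if x == y else "+" for x, y in zip(a, b))


print(comparison_pattern("ATGCAATCATATGCTTCT", "TTAGAATTATTC"))  # +*++***+**++
```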

  15. Input Validation • The input characters can only be A, T, C or G • The system should reject invalid input • OK: ATCCGG • Not OK: BFCCGG
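A minimal Python sketch of this validation rule is shown below. `is_valid_sequence` is a hypothetical name. Two choices here are assumptions rather than requirements from the slides: lower-case input is accepted by converting it to upper case first, and an empty string is rejected.

```python
import re

# One or more of the four bases, and nothing else.
VALID_DNA = re.compile(r"[ACGT]+")


def is_valid_sequence(text):
    """Return True if text consists only of A, C, G and T.

    Assumptions: case is ignored and the empty string is invalid.
    """
    return VALID_DNA.fullmatch(text.upper()) is not None


print(is_valid_sequence("ATCCGG"))  # True
print(is_valid_sequence("BFCCGG"))  # False
```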

  16. Coursework tasks • Design a suitable set of test data. • Define and implement a new data type Sequence_tsql with the member methods GetProtein_plsql and Distance_plsql. • Prefix your Sequence data type with your Oracle user name, e.g. yyjSequence_tsql. • The Sequence type header must include the field(s) used to store the values representing a sequence. Assume that the maximum size of a sequence is less than 3000 characters.

  17. Type Head
     create or replace type yyjsequence_tsql as object (
       -- enter here your data fields or attributes to hold the sequence
       -- data
       ,
       -- The two methods here must be implemented; you can also define any new methods you feel necessary for your work
       -- replace ...... with suitable data types or parameters
       member function getprotein_plsql return ......,
       member function distance_plsql(……) return number
     );
     /

  18. Methods • getprotein_plsql is a function that translates a DNA sequence into a protein sequence • distance_plsql is a function that calculates the distance between two DNA sequences • You can add more parameters to the argument list if necessary

  19. Type Body • Some prompts are provided in the type body; you must fill in the detailed code • If necessary, add your own methods • Modify the identified assignment statements in the code file provided, replacing the prompts with actual code as required.

  20. Possible performance improvement (1) • At the start of Distance_plsql, calculate the number of characters in each of the two sequences. Check whether the difference between them is greater than a user-specified limit. If it is, return null instead of a number value. • Otherwise, go on to compare the characters one by one
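The early exit described on this slide might look like the Python sketch below, where `None` stands in for the SQL null. The function name `distance_with_limit` and its `limit` parameter are hypothetical, and the actual coursework method must be written in PL/SQL.

```python
def distance_with_limit(a, b, limit):
    """Matching distance with an early exit on the length difference.

    Returns None (standing in for SQL null) when the lengths alone
    already differ by more than limit; otherwise compares the
    characters one by one and returns the distance.
    """
    length_gap = abs(len(a) - len(b))
    if length_gap > limit:
        return None  # no character comparison needed
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + length_gap


print(distance_with_limit("ATC", "GATATACAGATC", 1))  # None
print(distance_with_limit("ATCG", "ATC", 1))          # 1
```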

  21. Possible performance improvement (2) • Implement a technique for pre-computing the 5 character counts (the length and the numbers of A, C, G and T). Then modify the algorithm to use the pre-computed values. Although this is a minimal saving here, consider the case of pre-computing the character counts for a sequence with 2G characters • Use debug output to confirm that your implementation uses the pre-computed values.
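One way to picture the pre-computation is an object that counts its characters once, when it is created, and reuses those counts in every later distance call. The Python sketch below is only an illustration under that assumption. The class `Sequence`, its `profile` attribute and the `profiles_built` debug counter are hypothetical names, not part of the coursework template.

```python
from collections import Counter


class Sequence:
    """DNA sequence that stores its 5 character counts when created."""

    profiles_built = 0  # debug counter: how many times counts were computed

    def __init__(self, text):
        self.text = text
        counts = Counter(text)
        # (length, number of A, number of C, number of G, number of T)
        self.profile = (len(text), counts["A"], counts["C"],
                        counts["G"], counts["T"])
        Sequence.profiles_built += 1

    def distance(self, other, limit):
        """Return the matching distance, or None if it must exceed limit."""
        # Uses only the stored counts; the sequences are not rescanned here.
        if any(abs(x - y) > limit
               for x, y in zip(self.profile, other.profile)):
            return None
        mismatches = sum(x != y for x, y in zip(self.text, other.text))
        return mismatches + abs(len(self.text) - len(other.text))
```

Calling `distance` repeatedly on the same two objects leaves `Sequence.profiles_built` unchanged, which is the kind of debug evidence the slide asks for.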

  22. Possible interface improvement • Design a trigger to do automatic input validation. Only A, C, G and T are valid characters in sequences.

  23. Required DNA: SRY • atgcaatcatatgcttctgctatgttaagcgtattcaacagcgatgattacagtccagctgtgcaagagaatattcccgctctccggagaagctcttccttcctttgcactgaaagctgtaactctaagtatcagtgtgaaacgggagaaaacagtaaaggcaacgtccaggatagagtgaagcgacccatgaacgcattcatcgtgtggtctcgcgatcagaggcgcaagatggctctagagaatcccagaatgcgaaactcagagatcagcaagcagctgggataccagtggaaaatgcttactgaagccgaaaaatggccattcttccaggaggcacagaaattacaggccatgcacagagagaaatacccgaattataagtatcgacctcgtcggaaggcgaagatgctgccgaagaattgcagtttgcttcccgcagatcccgcttcggtactctgcagcgaagtgcaactggacaacaggttgtacagggatgactgtacgaaagccacacactcaagaatggagcaccagctaggccacttaccgcccatcaacgcagccagctcaccgcagcaacgggaccgctacagccactggacaaagctgtag • SRY must be included in your test data

  24. Development Approach (1) • Save all successfully executed SQL in a file to remind you of the objects created and to allow a final script to be prepared. • Remember to drop any objects that you wish to re-create or no longer need • drop table doc1; • drop type yyjDocument_tsql;

  25. Development Approach (2) • The following statements will identify some of the objects created • tables: select * from tab; • types: select type_name from user_types; • columns: describe tablename • A type cannot be dropped if an existing table uses it in a column • A type cannot be dropped if it is used in another type

  26. Development Approach (3) • Check out the debugging handout. • set serveroutput on size 10000 • You can now use "print statements" inside code • DBMS_OUTPUT.PUT_LINE('result = ' || result); • To see the output after execution, you must execute • CALL DBMS_OUTPUT.PUT_LINE('Flush any output');

  27. Submission Requirements (1) • SQL script in hard copy and on disk. This must include all create type, create table and insert statements needed to set up the required tables with your test data. At the end of the script, include the necessary drop statements to clear out all the database objects that your script has created. Drop tables before dropping types.

  28. Submission Requirements (2) • Test data and expected match values compared with the SQL-generated match values. Test each of the result situations, i.e. all identical, partly identical and not identical at all. Give your designed sequences and list the expected distances and your results from Oracle. • You have to prove your improvement in performance by means of test output, including debug information • You can make up virtual DNA sequences to test your code. Each DNA sequence has to consist of only A, C, G and T, and its length should be less than 30, except for the provided SRY • Explanation / justification of your approach, with references to the SQL script and including sample output at various stages. • Critical justification for your approach (not more than 4 pages, excluding your code)

  29. Submission Requirements (3) • Remember that you must work on your own and not as a group. It will be expected that the script submitted is substantially different from others received. • The SRY DNA sequence must be included in your test data • The match pattern does not necessarily include the extra characters in the longer sequence

  30. Coursework Assistance • Send questions to the tutor by email, including extracts from your script in the body of the message if necessary. If required, attach your complete script as a text file. • You will be expected to have tried to solve the problem yourself. Describe your attempts. • Do not just ask how to do the coursework!

  31. Submission Date • Friday 5th November 2004 • Hand in before 4 pm. • Complete and sign the assignment form and include with your coursework documents and disk.
