Coursework Sequence Operation Info3021 2004/5
DNA Sequence Characteristics
• Composed of the characters A, T, C and G
• Variable length, and can be huge
• The pattern carries important information
• Related to the protein sequence
Required operations
• Translate to protein sequence
• Compare difference (distance)
• Input validation
Standard Genetic Code
• A gene sequence can be converted to a protein sequence according to the standard genetic code
• TTA -> Leu (L)
• TTT -> Phe (F)
• Gene TTTTTA -> protein FL
Translate to protein
• Gene: TTTATTTTAATGGTT
• TTT -> F, ATT -> I, TTA -> L, ATG -> M, GTT -> V
• Protein: FILMV
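The translation step can be sketched as follows. This is an illustration in Python, not the PL/SQL the coursework asks for, and it rests on a few assumptions that the slides do not state: reading starts at the first base, stop codons are written as `*`, and one or two leftover bases at the end are ignored. The names `CODON_TABLE` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons ("TTT", "TTC", ...) to its one-letter amino acid.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # assumption: drop an incomplete last codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("TTTTTA"))           # FL
print(translate("TTTATTTTAATGGTT"))  # FILMV
```

The second call uses the five codons from the slide (TTT, ATT, TTA, ATG, GTT) and gives the protein FILMV.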
DNA to Protein Table http://acrux.igh.cnrs.fr/proteomics/codon.html
Sequence Comparison
• Given two sequences
ATCCGGTCCAAGTTT
GTACGGTACATGTTT
• How similar are they?
Distance
• One simple method is to measure the distance between them
• D(a,b) is a distance metric if all of the following hold:
• D(a,b) >= 0 for all sequences a, b
• D(a,b) = 0 if and only if a = b
• D(a,b) = D(b,a) (symmetry)
• D(a,b) + D(b,c) >= D(a,c) (triangle inequality)
Matching Distance (1)
• In the distance calculation, we assume
• Identity: 0
• Mismatch: 1
• Matching starts from the first character
ATCCGGTCCAAGTTT
101000010010000   sum = 4 (distance)
GTACGGTACATGTTT
Matching Distance (2)
• In the distance calculation, we assume
• Extra characters in the longer sequence count as mismatches
ATCCGGTCCAAGTTTACCGTT
101000010010000111111   sum = 10 (distance)
GTACGGTACATGTTT
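The two matching-distance rules can be sketched as follows. This is an illustration in Python, not the required PL/SQL method, and the function name `matching_distance` is made up for this example. Positions are compared from the first character, each mismatch costs 1, and every extra character in the longer sequence also costs 1.

```python
def matching_distance(a, b):
    """Count position-wise mismatches, plus the extra characters of the longer sequence."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)  # zip stops at the shorter one
    extras = abs(len(a) - len(b))                        # unmatched tail counts as mismatches
    return mismatches + extras


print(matching_distance("ATCCGGTCCAAGTTT", "GTACGGTACATGTTT"))        # 4
print(matching_distance("ATCCGGTCCAAGTTTACCGTT", "GTACGGTACATGTTT"))  # 10
```

Both results agree with the distances worked out on the two slides.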
Meaning of The Distance
• The more similar two sequences are, the smaller their distance is
• An exact match between two sequences has a distance of 0
• The distance between two sequences can therefore be used to find matching sequences
Precondition for matching distance <= S
• The number of characters not matched must be <= S
• Each mismatched or extra character changes the length difference, or the count difference of any one base, by at most 1
• So if n1 and n2 are the numbers of characters in the two sequences, and a1 and a2, c1 and c2, g1 and g2, t1 and t2 are their numbers of A, C, G and T, we have |n1-n2| <= S, |a1-a2| <= S, |c1-c2| <= S, |g1-g2| <= S, |t1-t2| <= S
Performance improvement
1. GATATCCGAATTGGATTCAA n=20
2. GATATACAGATC n=12
3. ATC n=3
4. ATCG n=4
• List those sequences with a distance to ‘ATC’ less than 1
• Do we need to consider No. 1 and 2?
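The idea behind this question can be sketched as a cheap pre-check that runs before any character-by-character comparison. The Python below is only an illustration; the name `may_be_within` and its `limit` parameter are made up for this example. A sequence whose length, or whose count of any one base, differs from the target by more than the limit cannot be within that distance, so it can be skipped.

```python
from collections import Counter


def may_be_within(a, b, limit):
    """Return False when the distance between a and b is certain to exceed limit."""
    if abs(len(a) - len(b)) > limit:
        return False
    count_a, count_b = Counter(a), Counter(b)
    return all(abs(count_a[base] - count_b[base]) <= limit for base in "ACGT")


candidates = ["GATATCCGAATTGGATTCAA", "GATATACAGATC", "ATC", "ATCG"]

# "Distance less than 1" means distance 0 for whole-number distances.
print([s for s in candidates if may_be_within(s, "ATC", 0)])  # ['ATC']
print([s for s in candidates if may_be_within(s, "ATC", 1)])  # ['ATC', 'ATCG']
```

Sequences 1 and 2 are rejected on length alone. The check is necessary but not sufficient: a sequence that passes it, such as CTA against ATC with a limit of 0, still needs the full comparison.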
Representation of comparison
ATGCAATCATATGCTTCT
TTAGAATTATTC
+*++***+**++
* Identical
+ Different
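The match pattern can be sketched as follows. This is an illustration in Python, and the function name `match_pattern` is made up for this example. As in the slide's example, the pattern covers only the positions that both sequences share, so the extra characters of the longer sequence are left out.

```python
def match_pattern(a, b):
    """Return '*' for each identical position and '+' for each different one."""
    return "".join("*" if x == y else "+" for x, y in zip(a, b))


print(match_pattern("ATGCAATCATATGCTTCT", "TTAGAATTATTC"))  # +*++***+**++
```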
Input Validation
• The input characters can only be A, T, C or G
• The system should reject wrong inputs
• OK: ATCCGG
• No: BFCCGG
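The validation rule can be sketched as follows. This is an illustration in Python, not the Oracle trigger suggested later in the slides, and the function name `is_valid_dna` is made up for this example. It makes two assumptions that the slides do not state: lower-case letters are accepted, because the SRY sequence is supplied in lower case, and an empty string is rejected.

```python
def is_valid_dna(sequence):
    """Accept only non-empty strings made up entirely of A, C, G and T."""
    # Assumption: case is ignored, since the provided SRY sequence is lower case.
    return len(sequence) > 0 and set(sequence.upper()) <= set("ACGT")


print(is_valid_dna("ATCCGG"))  # True
print(is_valid_dna("BFCCGG"))  # False
```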
Coursework tasks
• Design a suitable set of test data.
• Define and implement a new data type Sequence_tsql with member methods GetProtein_plsql and Distance_plsql.
• Prefix your Sequence data type with your Oracle user name, e.g. yyjSequence_tsql.
• The Sequence type header must include the field(s) used to store the values representing a sequence. Assume that the maximum size of a sequence is less than 3000 characters.
Type Head
create or replace type yyjsequence_tsql as object (
  -- enter here your data fields or attributes to hold the sequence
  -- data
  ,
  -- The two methods here must be implemented, you can also define any new methods you feel necessary for your work
  -- replace ...... with suitable data types or parameters
  member function getprotein_plsql return ......,
  member function distance_plsql(……) return number
);
/
Methods • getprotein_plsql is a function to translate a DNA sequence into Protein sequence • distance_plsql is a function to calculate the distance between two DNA sequences • You can add in more parameters in the argument list if necessary
Type Body
• Some prompts in the type body are provided; you must fill in the detailed code
• If necessary, add your own methods
• Modify the identified assignment statements in the code file provided, replacing the prompts with actual code as required.
Possible performance improvement (1)
• At the start of Distance_plsql, calculate the numbers of characters for the two sequences. Check whether the difference between them is greater than a user-specified limit. If it is, return null instead of a number value.
• Otherwise, go on to compare the characters one by one
Possible performance improvement (2)
• Implement a technique for pre-computing the 5 numbers of characters (the total length and the numbers of A, C, G and T). Then modify the algorithm to use the pre-computed values. Although this is a minimal saving here, consider the case of pre-computing the numbers of characters for a sequence with 2G characters
• Use debug information to confirm that your implementation uses the pre-computed values.
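The two improvements can be sketched together as follows. This is an illustration in Python, not the required Oracle object type; the class name `Sequence`, its attributes and the `limit` parameter are made up for this example. The length and the four base counts are computed once when the object is created. The distance method consults them first and returns `None` (standing in for SQL null) when the limit cannot be met.

```python
from collections import Counter


class Sequence:
    """A DNA sequence with its length and base counts computed once, up front."""

    def __init__(self, bases):
        self.bases = bases
        self.length = len(bases)
        counts = Counter(bases)
        self.counts = {base: counts[base] for base in "ACGT"}

    def distance(self, other, limit):
        """Return the matching distance, or None if the pre-check rules it out."""
        if abs(self.length - other.length) > limit:
            return None
        if any(abs(self.counts[b] - other.counts[b]) > limit for b in "ACGT"):
            return None
        # Pre-check passed: compare the characters one by one.
        mismatches = sum(1 for x, y in zip(self.bases, other.bases) if x != y)
        return mismatches + abs(self.length - other.length)


first = Sequence("ATCCGGTCCAAGTTT")
second = Sequence("GTACGGTACATGTTT")
print(first.distance(second, 5))  # 4
print(first.distance(second, 1))  # None (the counts of C differ by 2)
```

In this sketch a distance that passes the pre-check is returned even when it turns out to be larger than the limit. Whether such a result should also become null is a design decision the slides leave open.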
Possible interface improvement • Design a trigger to do automatic input validation. Only A, C, G and T are valid characters in sequences.
Required DNA--SRY atgcaatcatatgcttctgctatgttaagcgtattcaacagcgatgattacagtccagctgtgcaagagaatattcccgctctccggagaagctcttccttcctttgcactgaaagctgtaactctaagtatcagtgtgaaacgggagaaaacagtaaaggcaacgtccaggatagagtgaagcgacccatgaacgcattcatcgtgtggtctcgcgatcagaggcgcaagatggctctagagaatcccagaatgcgaaactcagagatcagcaagcagctgggataccagtggaaaatgcttactgaagccgaaaaatggccattcttccaggaggcacagaaattacaggccatgcacagagagaaatacccgaattataagtatcgacctcgtcggaaggcgaagatgctgccgaagaattgcagtttgcttcccgcagatcccgcttcggtactctgcagcgaagtgcaactggacaacaggttgtacagggatgactgtacgaaagccacacactcaagaatggagcaccagctaggccacttaccgcccatcaacgcagccagctcaccgcagcaacgggaccgctacagccactggacaaagctgtag SRY must be included into your test data
Development Approach (1)
• Save all successfully executed SQL in a file to remind you of the objects created and to allow a final script to be prepared.
• Remember to drop any objects that you wish to re-create or no longer need
drop table doc1;
drop type yyjDocument_tsql;
Development Approach (2)
• The following statements will identify some of the objects created
• tables: select * from tab;
• types: select type_name from user_types;
• columns: describe tablename
• A type cannot be dropped if an existing table uses it in a column
• A type cannot be dropped if it is used in another type
Development Approach (3)
• Check out the debugging handout.
• set serveroutput on size 10000
• You can now use “print statements” inside code
• DBMS_OUTPUT.PUT_LINE('result = ' || result);
• To see the output after execution, you must execute
• CALL DBMS_OUTPUT.PUT_LINE('Flush any output');
Submission Requirements (1)
• SQL script in hard copy and on disk. This must include all create type, create table and insert statements needed to set up the required tables with your test data. At the end of the script, include the necessary drop statements to clear out all the database objects that your script has created. Drop tables before dropping types.
Submission Requirements (2)
• Test data and expected match values compared with the match values generated by SQL. Test each of the result situations, i.e. all identical, partly identical and not identical at all. Give your designed sequences and list the expected distances and your results from Oracle.
• You have to prove your improvement in performance by means of test output including debug information
• You can make up virtual DNA sequences to test your code. Each DNA sequence has to consist of only A, C, G and T, and its length should be less than 30, except for the provided SRY
• Explanation / justification of your approach with references to the SQL script and including sample output at various stages.
• Critical justification for your approach (not more than 4 pages, excluding your code)
Submission Requirements (3)
• Remember that you must work on your own and not as a group. It will be expected that the script submitted is substantially different from others received.
• The SRY DNA sequence must be included in your test data
• The match pattern does not necessarily include the extra characters in the longer sequence
Coursework Assistance
• Send questions by email to the tutor, including extracts from your script in the body of the message if necessary. If required, attach your complete script as a text file.
• You will be expected to have tried to solve the problem yourself. Describe your attempts.
• Do not just ask how to do the coursework!
Submission Date • Friday 5th November 2004 • Hand in before 4 pm. • Complete and sign the assignment form and include with your coursework documents and disk.