BioInformatics Consultation Practice 3 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary Tel: +36-309-015-488 E-mail: pauler@t-online.hu
Content of the Practice
• Fragment processing:
  • Restriction site database: WebCutter
  • Primer cleaning: SMS2 DNA Pattern
  • Vector cleaning: NCBI VecScreen
  • Fragment assembly: CAP3
• Auxiliary sequence operations: SMS2
  • GUI
  • Conversion operations
  • Sequence analysis
  • Sequence mapping
  • Random sequences
• Uploading sequences: EBI
  • Registration
  • Upload auxiliary data
  • Upload sequence
• Data Import/Export/Conversion operations: Excel, Access
  • Text file formats
  • Converting text file formats
  • HTML-tables and wide text
  • Text to Excel
  • From Excel to Text, HTML, Picture Metafile, Bitmap, Access tables
• Home Assignment 3: Fragment clean and match
• References
Fragment processing: Restriction maps: WebCutter: Input
• The first cloning task where bioinformatics is heavily involved is pre-processing:
  • Selecting restriction enzymes
  • Forecasting restriction sites when the sequence to be cloned is known
• These tasks need a restriction mapping tool backed by a database of restriction enzymes
• We will use WebCutter for this purpose (http://rna.lundberg.gu.se/cutter2/index.html)
• At the Start Screen:
  • Sequence title: Title of the analysis
  • DNA sequence box: Paste the examined nucleotide sequence in FASTA format from the clipboard (max. 50000 chars)
  • Type: Type of analysis
    • Linear: Linear DNA
    • Circular: Circular DNA (e.g. plasmids)
    • Silent mutagenesis: sites that could be introduced by silent mutations, i.e. base changes that leave the encoded protein unchanged
  • Display options: results can be displayed in graphic or tabular format, ordered by nucleotide position or by enzyme name
  • Enzymes: can be filtered by
    • Least and most number of cuts
    • Length of the recognition site in bases (longer sites cut less often and more specifically)
    • Enzyme name list (multiple selection with Ctrl+Click)
  • Press Analyze sequence to run
Fragment processing: Restriction maps: WebCutter: Output
• Character-based restriction map by base positions:
  • This is great for manual processing and for predicting the length of possible fragments
  • However, it is hard to process automatically when many fragment lengths have to be computed
• Tabular list of restriction sites:
  • It contains enzyme names, the number of sites, the list of site coordinates, and the recognized sequence, with GCG ambiguity codes at positions where more than one base is accepted
  • It can be copied into Excel for more detailed fragment length forecasts (see later)
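The fragment length forecast mentioned above can also be sketched in a few lines of Python. This is only an illustration, not how WebCutter works internally: the function names are hypothetical, the `site|cut` notation is borrowed from the SMS2 restriction summary shown later (e.g. `gacgt|c`), only exact palindromic sites without ambiguity codes are handled, and the test sequence is made up.

```python
import re

def parse_site(pattern):
    """Split 'gacgt|c' into the recognition site and the cut offset inside it."""
    return pattern.replace("|", "").upper(), pattern.index("|")

def cut_positions(seq, pattern):
    """Positions (number of bases before the cut) of every site in seq."""
    site, offset = parse_site(pattern)
    # The lookahead also reports overlapping occurrences of the site.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq.upper())]

def fragment_lengths(seq, cuts, circular=False):
    """Fragment lengths after cutting seq at the given positions."""
    cuts = sorted(set(cuts))
    if not cuts:
        return [len(seq)]
    if circular:
        # n cuts in a circle give n fragments; the last one wraps around.
        lengths = [b - a for a, b in zip(cuts, cuts[1:])]
        lengths.append(len(seq) - cuts[-1] + cuts[0])
        return lengths
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

seq = "TTAGGCCTAAGACGTCAA"  # made-up sequence with one AatI and one AatII site
cuts = cut_positions(seq, "agg|cct") + cut_positions(seq, "gacgt|c")
print(cuts, fragment_lengths(seq, cuts), fragment_lengths(seq, cuts, circular=True))
```

The same cut positions give three fragments on linear DNA but only two on circular DNA, which is why the Linear/Circular setting matters.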
Fragment processing: Primer cleaning: SMS2 DNA Pattern
• In post-processing the sequenced fragments, the first task is to remove the primer sequence, as it can confuse further analysis
• As primers sit at the very beginning of fragment sequences, they are usually already removed during chromatogram analysis, because base calling of the initial stretch is uncertain most of the time
• If the primer is still there, we can use SMS2 DNA Pattern (http://www.bioinformatics.org/sms2/dna_pattern.html) to locate it:
• At the Start Screen:
  • Raw sequence: Paste one or more nucleotide sequences in FASTA format (max. 50000 chars)
  • Search pattern: Sequence of the primer. Alternative bases for one position can be given in brackets: [AT]. We assume here that the sequence of the primer(s) used is known!
  • Submit button: Run
• At the Output Screen:
  • It gives the coordinates of matching sequences
  • On both strands of the DNA!
• Example output:
    Results for 180 residue sequence "sample sequence one" starting "ttaaggaccc"
    >match number 1 to "ctt[ca]" start=68 end=71 on the reverse strand
    ctta
    >match number 2 to "ctt[ca]" start=2 end=5 on the reverse strand
    ctta
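A both-strand pattern search of this kind can be sketched with Python regular expressions, which accept the same bracket notation for alternative bases. The function names are hypothetical, and reporting reverse-strand hits in direct-strand coordinates is an assumption of this sketch; the convention SMS2 uses may differ.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_pattern(seq, pattern):
    """Return (strand, start, end) hits, 1-based inclusive, on both strands."""
    regex = re.compile(pattern, re.IGNORECASE)
    n = len(seq)
    hits = [("direct", m.start() + 1, m.end()) for m in regex.finditer(seq)]
    for m in regex.finditer(reverse_complement(seq)):
        # Map reverse-strand coordinates back to direct-strand numbering
        # (assumption of this sketch, see the lead-in).
        hits.append(("reverse", n - m.end() + 1, n - m.start()))
    return hits

print(find_pattern("ccttagg", "ctt[ca]"))   # hit on the direct strand
print(find_pattern("AATAAGCC", "ctt[ca]"))  # hit only on the reverse strand
```

Once the coordinates are known, the primer can be removed by slicing the sequence string at the reported end position.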
Fragment processing: Vector cleaning: NCBI VecScreen: Input
• Compared to primers, cleaning vector sequences from fragments is a more cumbersome task:
  • Vector sequences are longer
  • They can occur at both the beginning and the end of fragments
  • Vectors are usually multi-purpose and contain many functional sites
• So vector contamination can completely confuse any further analysis if it is left in the fragment sequence!
• We will use NCBI’s VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html):
• At the Start Screen:
  • Sequence box: Paste the analyzed nucleotide sequence in FASTA format (max. 50000 chars)
  • Run VecScreen button: it matches the sequence against the vectors stored in NCBI’s UniVec database (ftp://ftp.ncbi.nih.gov/pub/UniVec/)
• At the Output Settings Screen:
  • Graphic output
  • Sequence retrieval: display the cleaned sequence
  • View report button: go to the output
Fragment processing: Vector cleaning: NCBI VecScreen: Output
• At the Output Screen:
  • At the top, we can see a graphic overview map of the matching vector parts
    • Different match strengths are coded as differently colored regions
  • Below it, there is a text list of the matching vector sequences with:
    • Data of the vector
    • Matching statistics: ratio of identities and gaps
    • Detailed character maps of the match
Fragment assembly: Basic definition, CAP3: Input
• A limitation of PCR is that regular DNA polymerases work only on stretches of ca. 500-1000 base pairs, and most sequencing techniques also have serious length limitations
• So, longer sequences can be assembled only from cloned fragments, which usually overlap by 50-100 base pairs at their ends
• However, restriction sites are not distributed evenly in the genome, which may disturb the overlapped assembly. That is why we use restriction maps when designing the cloning.
• Once the cloned fragments are sequenced and cleaned of primer and vector sequences, we need software that assembles the fragments: it looks for ca. 100 matching base pairs between the beginning/end of one fragment and the end/beginning of another fragment or its reverse complement.
• After assembling the fragments, we get the contig: the longest possible consensus sequence assembled from them
• We use the CAP3 software (http://pbil.univ-lyon1.fr/cap3.php) for fragment assembly:
• At the Start screen:
  • Sequence box: paste the fragment DNA sequences here in FASTA format, one after another
  • Submit button: Run
• At the Output screen: we get a menu of outputs:
  • Contigs: sequence of the resulting contig(s) (ideally there should be one) in FASTA format:
    >Contig1
    TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCTATC
    CTTATTTTTTGTTTATATTTTAGTACGAAAGGACCAAGTATTTTAAATAATTTATTTTAT
Fragment assembly: CAP3: Output
• Assembly details:
  • It lists the sequence pairs matched during contig assembly:
    • + denotes a fragment sequence used as given,
    • - denotes a fragment sequence used as its reverse complement
  • Below them it gives the consensus sequence:
    Number of segment pairs = 2; number of pairwise comparisons = 1
    '+' means given segment; '-' means reverse complement
    Overlaps Containments No. of Constraints Supporting Overlap
    ******************* Contig 1 ********************
    2006-ISO-TD1-
    2006-ISO-16S+
    DETAILED DISPLAY OF CONTIGS
    ******************* Contig 1 ********************
                      .    :    .    :    .    :    .    :    .    :    .    :
    2006-ISO-TD1- TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCTATC
    2006-ISO-16S+                                GACCGGCGTGAGCCAGGTCGGTTTCTC-C
                  ____________________________________________________________
    consensus     TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCTATC
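The core idea of overlap-based assembly can be sketched for two fragments as follows. This is a deliberately naive illustration with hypothetical function names: it accepts only exact suffix/prefix overlaps and tries the second fragment in both orientations, whereas CAP3 tolerates mismatches and gaps and handles many fragments at once.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def merge_pair(a, b, min_len=3):
    """Merge two fragments on their best exact overlap, or return None."""
    a, b = a.upper(), b.upper()
    rc_b = reverse_complement(b)
    best_len, best_merge = 0, None
    # Try b as given ('+') and reverse complemented ('-'), on either side of a.
    for left, right in ((a, b), (b, a), (a, rc_b), (rc_b, a)):
        k = longest_overlap(left, right, min_len)
        if k > best_len:
            best_len, best_merge = k, left + right[k:]
    return best_merge

print(merge_pair("ACGTTGCA", "TGCAGGTT"))   # overlap found with b as given
print(merge_pair("ACGTTGCA", "AACCTGCA"))   # overlap found with b reverse complemented
```

Real reads need a much larger `min_len` than the toy value used here, otherwise chance matches of a few bases would be merged.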
Auxiliary sequence operations: SMS2: GUI
• Before uploading and further analysis of the assembled contig sequences, we may need certain transformations and format conversions, called sequence manipulation. There is an easy-to-use, comprehensive toolkit for this:
• Sequence Manipulation Site (SMS2) (http://www.bioinformatics.org/sms2/index.html):
• Graphic User Interface (GUI): all SMS2 tools share a very similar user interface:
  • Left menu: we can choose the required operation from the hierarchically ordered list
  • At the Start screens:
    • Top: explanation of the operation
    • Sequence box: we can paste the input sequence here in FASTA format (or several sequences consecutively, if the current operation requires it)
    • There is always a suitable example nucleotide/protein input sequence in the box, making it easier to try out the tools!
    • Below: parameter settings of the current operation
    • Submit button: Run
  • At the Output Screens:
    • Outputs are graphic, text, or HTML tables, depending on the operation. Example (restriction summary):
        Site:              Positions:
        AatI    agg|cct    none
        AatII   gacgt|c    160
        Acc16I  tgc|gca    none
        AccII   cg|cg      44
Auxiliary sequence operations: SMS2: Format conversion operations
• Split/Combine FASTA: cuts a longer continuous FASTA sequence into rows of standard length, or concatenates several shorter FASTA sequences into one
• EMBL/GenBank-FASTA: converts an EMBL/GenBank record to a FASTA sequence
• EMBL/GenBank Feature Extractor: extracts the exons from an EMBL/GenBank record and assembles them into cDNA, based on the record's feature table
• EMBL/GenBank Trans Extractor: extracts the possible translated proteins from an EMBL/GenBank record in FASTA format (considering alternative splicing)
• Filter DNA/Protein: removes illegal characters from a FASTA formatted DNA/Protein sequence (except N, which denotes uncertain sequencing in DNA)
• OneToThree/ThreeToOne: converts FASTA formatted protein sequences between the 1-char and 3-char coding formats, where * and *** respectively denote uncertain sequencing or translation
• Window Extractor DNA/Protein: extracts a window from a FASTA formatted DNA/Protein sequence, given the center position and the width of the window
• Range Extractor DNA/Protein: extracts multiple ranges from a FASTA formatted DNA/Protein sequence, given as comma separated coordinates or coordinate ranges, e.g.: 1,2,3..15,END
  • It can concatenate them into one FASTA record or split them into equal length FASTA records
• Reverse Complement: computes the reverse (5’-3’ → 3’-5’), the complement (A↔T, C↔G), or the reverse complement of a FASTA formatted DNA sequence
  • Complements of the ambiguity characters denoting uncertain sequencing are taken from the GCG code table!
• Split Codons: a coding DNA sequence given in FASTA format is read as an uninterrupted series of triplets (codons) and split into 3 sequences by in-codon position (1,2,3), e.g.: from the sequence ATGATGATG the 3 sequences AAA, TTT, GGG
  • It is used solely in statistical analysis of codon positions
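Two of these conversions are simple enough to sketch directly in Python. The function names are hypothetical, and the complement table uses the standard IUPAC ambiguity codes, which are assumed here to agree with the GCG table the slide refers to.

```python
# Standard IUPAC nucleotide codes and their complements (assumed to match GCG).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse(seq):
    return seq[::-1]

def complement(seq):
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    return reverse(complement(seq))

def split_codons(seq):
    """Split a coding sequence into three strings by in-codon position."""
    seq = seq.upper()
    seq = seq[: len(seq) - len(seq) % 3]  # drop an incomplete last codon
    return [seq[0::3], seq[1::3], seq[2::3]]

print(reverse_complement("ATGCRN"))
print(split_codons("ATGATGATG"))
```

Note that three codons are needed to get three-letter position strings: `ATGATG` alone would give `AA`, `TT`, `GG`.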
Auxiliary sequence operations: SMS2: Sequence analysis operations
• Restriction Digest: simulates the digestion of a longer DNA sequence given in FASTA format with a restriction enzyme selected from the SMS database (or with a manually given recognition site):
  • It computes the list of possible fragment sequences and writes them into one text file as consecutive FASTA records for further processing
• Restriction Summary: the same as above, except that it gives not the fragments themselves but a statistical summary table of their properties
• PCR Primer Stats: for designed primer sequences given in FASTA format it forecasts:
  • Melting temperature (important for PCR temperature programming)
  • Complementarity or partial complementarity (considerably complementary primers anneal to each other instead of the cloned DNA strand, reducing PCR efficiency)
  • In case of linear or circular DNA
• PCR Products: simulates PCR on a DNA sequence given in FASTA format:
  • Using the selected or manually entered forward/reverse primers
  • Prepares the list of expected PCR product sequences in one text file as consecutive FASTA records
• ORF Finder: in a DNA sequence given in FASTA format, it searches for Open Reading Frames (ORF): stretches uninterrupted by stop codons, on 2 DNA strands × 3 codon starting positions (1,2,3) = 6 reading frames. It is used to find possible coding parts of a DNA
  • Gives the list of ORFs in 1 file as consecutive FASTA records
  • Gives a summary table of their length and position
• CpG Island: in a DNA sequence given in FASTA format, it searches for CG-dimer rich "islands", which usually occur at the 5’-end of genes in vertebrates
  • Gives a summary table of the length and position of the CG-islands
• Translate/Reverse translate: translates FASTA DNA into a FASTA 1-char coded protein, or translates a protein back into the most likely cDNA sequence based on the selected species' Codon Usage Table: the probabilities of the alternative codons of each amino acid in that species
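The six-frame search behind an ORF finder can be sketched as below. The names are hypothetical, and the sketch makes a stricter assumption than the definition above: an ORF must start with ATG and end with the first in-frame stop codon of the standard genetic code. It is not SMS2's actual implementation.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame, min_codons):
    """Yield ATG...stop stretches found when reading seq from offset frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                yield seq[start:i + 3]
            start = None

def find_orfs(seq, min_codons=2):
    """Return (strand, frame 1..3, ORF sequence) for all six reading frames."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for orf in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame + 1, orf))
    return found

print(find_orfs("CCATGAAATAGCC"))
```

In practice `min_codons` would be set much higher (tens of codons) to filter out short chance ORFs.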
Auxiliary sequence operations: SMS2: Other operations
• Sequence mapping operations:
  • Primer map: in a DNA sequence given in FASTA format, it prepares a graphic map of the binding sites of a given list of primers
    • Also gives a summary table of site coordinates and primer names
  • Restriction map: the same as above, but for restriction enzymes
  • Translation map: translates all 6 possible reading frames of a DNA sequence given in FASTA format into FASTA 1-char coded protein sequences
    • The valid codon table can be selected (the default is the nuclear, not the mitochondrial, Standard table for vertebrates)
    • It assumes that the DNA contains only coding parts: no introns should be there
• Random sequence generation operations:
  • Random DNA/cDNA/Protein: random DNA/cDNA/Protein sequences to:
    • Simulate or try out other software, or
    • Make unprepared students really cry at the sequence analysis computer lab exam! Wohahaha, yeah!
  • Mutate/Shuffle DNA/Protein: in a DNA/Protein sequence given in FASTA format, it creates substitution/insertion/shuffle mutations
  • Random DNA/Protein regions: in a DNA/Protein sequence given in FASTA format, it randomizes the regions given as comma separated coordinates or coordinate ranges, e.g.: 1,2,3..15,END
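Random test sequences and simple substitution mutations are easy to generate locally as well. The following is a minimal sketch with hypothetical function names and uniform base frequencies (an assumption; real genomes are not uniform). Passing a seeded `random.Random` makes the output reproducible.

```python
import random

def random_dna(length, rng=None):
    """Random DNA string with uniform base frequencies."""
    rng = rng if rng is not None else random.Random()
    return "".join(rng.choice("ACGT") for _ in range(length))

def point_mutate(seq, n, rng=None):
    """Substitute n distinct positions, each with a different base."""
    rng = rng if rng is not None else random.Random()
    bases = list(seq.upper())
    for pos in rng.sample(range(len(bases)), n):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)

original = random_dna(50, random.Random(1))
mutant = point_mutate(original, 5, random.Random(2))
print(original)
print(mutant)
```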
Uploading sequences: EBI: Registration, Upload auxiliary data 1
• After cleaning and assembling the fragments, we have a nice sequence that we would like to share with other researchers
• For this purpose, we will use EBI’s interface (http://www.ebi.ac.uk/embl/Submission/index.html)
• At the Registration & Login Screen:
  • Register: register with the EBI database first:
    • Enter your personal data and press Save
    • You will then receive a validation e-mail at the given address, where you should click a link to validate your registration
    • After that you can log in by giving your e-mail and password and pressing the Log in button
• At the Function Select Screen: we have to select
  • The Submit sequences option button
  • The Here link opens a utility that checks whether any vector contamination is left in the sequence: it uses EBI’s BLASTN nucleotide alignment tool to check a FASTA formatted DNA sequence against the EMVEC vector database
• At the Sequence Type Select Screen: we can give the type of the uploaded sequence, e.g.:
  • WGS (Unannotated): whole genome shotgun sequence
  • EST: Expressed sequence tags
  • Upload is much faster if the data is preformatted; in that case select EMBL, MIENS, etc.
• At the Valid From Date Screen: we can choose whether the sequence is shown immediately or with a delay. Delayed release is important when you want to be able to prove later that you submitted first, but do not want other researchers to access the sequence until your paper is published
Uploading sequences: EBI: Upload auxiliary data and sequence
• At the Publication Reference Screen:
  • Citation type: published/unpublished journal article, etc.
  • Title, Year, Journal name
  • Authors' initials and surnames
  • In case of multiple publications you can return to this screen and add more
• At the Auxiliary Info Selection Screen: we can select what kind of environmental info will be attached to the sequence:
  • Organism, Organelle
  • Strain, Isolate
  • Contig name
• At the following Auxiliary Info screen: we can enter the previously selected auxiliary data
• At the Validation Screen: it checks the internal logical dependencies of the data. On pressing the Validate button it looks up Organism and Organelle in the EBI databases. If everything is OK, then:
• At the Sequence Upload Screen: you can upload the sequence in FASTA format
Data Import/Export/Conversion: Introduction, Text file formats
• Most bioinformatics software takes its input and returns its output in text files (FASTA, EMBL and GenBank are all text formats)
• The problem is that sizeable table-like results (e.g. restriction site lists) are also output as text files or HTML tables, and we would like to transfer them efficiently into spreadsheets (Excel) or databases (Access) for advanced analysis
• By learning some simple tricks and techniques, one can solve in 5 minutes what would otherwise take days of manual work away from research!
• Text file formats: software uses alternative methods to describe tables in text files:
  • Fixed column width tables: the most popular, but the worst:
    • Every column of the table has its own fixed width in characters
    • Data content cannot be longer than the column width. If it is shorter, the excess space is filled with Space (ASCII 32) chars
    • Viewed in a word processor, the columns look nice and aligned (assuming that the text is set in a fixed width font such as Courier New)
    • Sometimes it contains no column names, or only abbreviated ones, as they may not fit in the same number of characters as the data content
    • Example:
        AA    BB    CC
        6.45  5.5   7.35
        15.6  17.8  3.2
  • Column delimiter symbol-based tables: less frequent, but better:
    • Columns are separated by a given delimiter symbol: space, comma, colon or semicolon
    • So, viewing the file in a word processor, we can see a bunch of them
    • But the columns do not look nice and aligned, as their data content can vary a lot in length
    • In return, the first line can contain column names of any length
    • Example:
        AA,BB,CC
        6.45,5.5,7.35
        15.6,17.8,3.2
Data Import/Export/Conversion: Text file formats 2
• There are different subtypes of delimited format, by delimiter symbol:
  • Comma Separated Values, CSV:
    • Popular among USB-connected instruments
    • However, German and Hungarian use the comma as decimal separator instead of the dot, so it can be confused with the column separator. Text data can also contain commas
    • Therefore text data can be put between text markers (""):
        "AA","This,not delimiter!","CC"
        6.45,5.5,7.35
        15.6,17.8,3.2
  • Space (ASCII 32) delimited format:
    • This is also a very popular format
    • One serious issue is that it is very easy to mix up with the fixed column width format, which prevents automatic processing:
      • If the columns are not aligned with spaces in all rows, the file cannot be processed as fixed width format
      • Meanwhile, the space-delimited format reads two consecutive spaces as an empty (null) field, messing up the columns. E.g. with 2 spaces before 7.35, the value is shifted into a fourth column:
          AA    BB    CC
          6.45  5.5         7.35
          15.6  17.8  3.2
      • Such a messed up text file can be corrected in Word by selecting the text with Shift+Cursor and launching the Edit|Find/Replace (Szerkesztés|Keresés/Csere) menu to replace two consecutive spaces with one, using the Replace all (Összes cseréje) button. Repeating this a few times eliminates all space duplications
  • Colon and semicolon separated formats: better than space delimited, but these characters can also appear in the stored text. This can again be solved with text markers:
        "AA";"This;not delimiter!";"CC"
        6.45;5.5;7.35
        15.6;17.8;3.2
  • Tab (ASCII 9) delimited format: Tab specifically denotes a column break
    • It cannot be mixed up with other characters
    • But ordinary users can get confused, as Tab is invisible unless the Show/Hide (¶) button is pressed in Word
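The pitfalls above are easy to reproduce and avoid in a script. The sketch below uses Python's standard `csv` module for delimited text with text markers, and plain string methods for space-delimited and fixed-width text. The function names and the column widths are assumptions for illustration.

```python
import csv
import io

def read_delimited(text, delimiter=",", quotechar='"'):
    """Parse delimited text; delimiters inside text markers are kept as data."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar))

def read_space_delimited(text):
    """split() without arguments treats any run of spaces as one delimiter."""
    return [line.split() for line in text.splitlines() if line.strip()]

def read_fixed_width(text, widths):
    """Cut every line at the given column widths and strip the padding."""
    rows = []
    for line in text.splitlines():
        row, pos = [], 0
        for w in widths:
            row.append(line[pos:pos + w].strip())
            pos += w
        rows.append(row)
    return rows

print(read_delimited('"AA","This,not delimiter!","CC"\n6.45,5.5,7.35\n'))
print("6.45 5.5  7.35".split(" "))              # naive split: empty field appears
print(read_space_delimited("6.45 5.5  7.35"))    # repeated spaces collapsed
print(read_fixed_width("AA    BB    CC\n6.45  5.5   7.35", [6, 6, 6]))
```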
Data Import/Export/Conversion: Converting text file formats
• A frequent task is to export table-like text outputs (e.g. codon usage frequencies) into Excel, Access or PowerPoint:
• Word text to HTML table:
  • Select the text with Shift+drag, e.g.:
      AA BB CC
      6.45 5.5 7.35
      15.6 17.8 3.2
  • Table|Convert|Text to Table (Táblázat|Konvertálás|Szöveg táblázattá) menu:
    • It tries to autodetect whether the text is in fixed column width or in delimited format
    • If it misjudges (e.g. when the 2 formats are mixed), we can correct it
    • It shows the number of rows/columns to be created
• Properties of an HTML table in Word:
  • Its rows/columns/cells are fully formattable: they can be resized, colored and framed, and the Font/Style/Size/Color of the text can be set
  • Its cells can also contain pictures, while an Excel table cell cannot: there a picture can only be in the background or in an overlay
  • Column width can be set to Manual, Uniform, Fit to content, or Fit to window width
  • One annoying thing in HTML tables is that the default cell margins are huge, eating up a lot of page space. Reduce them to 0:
    • Select the whole table with Shift+drag
    • Table|Table Properties (Táblázat|Táblázat tulajdonságai) menu:
      • Cells (Cella) tab|Settings (Beállítások) button:
      • Uncheck Same as whole table (Teljes táblázattal egyezően)
      • Set Cell Margins (Margók) to 0 cm
  • Another annoying thing is that the default row height is not 0, adding redundant space between rows. Set it to 0:
    • Rows (Sorok) tab
    • Define row height (Magasság megadása):
    • At least (legalább)| 0 cm
Data Import/Export/Conversion: Converting HTML and wide text
• HTML table from Word to Excel/PowerPoint/HTML webpage:
  • It can simply be copied through the clipboard with Edit|Copy (Szerkesztés|Másolás) Ctrl+C and Edit|Paste (Szerkesztés|Beillesztés) Ctrl+V, keeping all the formatting
• HTML table from Word to text: select the whole table with Shift+drag:
  • Table|Convert|Table to Text (Táblázat|Konvertálás|Táblázat szöveggé) menu: writes the table out in delimited text format|Give the delimiter character: Tab
• Text from Word to Picture Metafile:
  • The output of much bioinformatics software is a text file with very wide lines (e.g. restriction or alignment maps drawn with characters) that do not fit into the page body of a Word document or a PowerPoint slide, so the lines wrap and get messed up, e.g.:
      1 cagctggggggaggtggcgaggaagatgacgtggtagttgtcgcggcagctgccaggaga
        1        10        20        30        40        50
      1 gtcgacccccctccaccgctccttctactgcaccatcaacagcgccgtcgacggtcctct
  • How can we solve it:
    • We can reduce the font size, but this reduces legibility
    • Or we can switch from the fixed width font Courier New to a more compact font (e.g. Arial Narrow), but then the alignment of the rows is destroyed, because that font is not fixed width
    • Therefore copy the text to the clipboard, and instead of pasting it normally with Edit|Paste (Szerkesztés|Beillesztés) Ctrl+V, paste it with the Edit|Paste special (Szerkesztés|Irányított beillesztés) menu:
      • Select the Enhanced Metafile (Kép) format
      • The text will be pasted into Word or PowerPoint as an easy-to-resize picture
      • Additionally, using their drawing tool (View|Toolbars|Drawing (Nézet|Eszköztárak|Rajzoló) menu), the picture can still be edited as a set of graphic objects: we can rewrite characters or add further graphics
      • But it cannot be edited as word processor text anymore
Data Import/Export/Conversion: Text to Excel
• Text file table to Excel table:
  • Select the table in the text file, copy it to the clipboard, then paste it into cell A1 of an empty Excel worksheet with the Edit|Paste special (Szerkesztés|Irányított beillesztés) menu|selecting the Plain text (Nem formázott szöveg) format:
    • This will look pretty nasty at first: Excel puts the rows into separate rows, but the columns of each row are melted together as text in one cell
  • Select this single column (A1:A3) with Shift+drag, and make sure that the columns to the right of it are empty
  • Then use the Data|Text to Columns (Adatok|Szövegből oszlopok) menu to start the text breaking wizard:
    • First it asks whether the text data is in fixed or delimited format:
      • If delimited, give the delimiter symbol (e.g. Comma) and the text marker, and set whether consecutive delimiters are merged or create empty fields
      • If fixed, it shows a breaking screen where you can define the column break arrows with Click/drag
    • Then it shows the columns created, and we can set their data types manually or leave them detected automatically:
      • The first problem is that by default Excel recognizes text as dates only if they conform to the regional settings of Windows at Start button|Control panel|Regional settings|Date and numeric format (Start gomb|Vezérlőpult|Területi beállítások|Dátum- és számformátum). Dates in any other format are left as text (not recognized)!
      • You can spot incorrect detection by alignment: text is aligned to the left of the cell, recognized dates/numbers to the right
      • This can be solved by setting the Date (Dátum) format to match the data content (YMD (ÉHN), MDY (HNÉ), etc.)
      • With the Special (Irányított) button we can define the Decimal separator (Tizedesjel) and the Thousand separator (Ezres elválasztó) if they are not detected correctly
    • On pressing the Finish (Bezár) button of the wizard, the table is placed in consecutive columns with correct data formatting
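The same two conversion problems, locale-dependent number separators and date order, appear when such text is processed in a script. A minimal Python sketch follows; the function names, the default separators and the sample values are assumptions for illustration.

```python
from datetime import datetime

def parse_number(text, decimal=",", thousands=" "):
    """Convert text such as '1 234,5' to a float with explicit separators."""
    cleaned = text.strip().replace(thousands, "").replace(decimal, ".")
    return float(cleaned)

def parse_date(text, order="YMD", sep="."):
    """Parse a date whose field order (YMD, MDY, DMY) is stated explicitly."""
    codes = {"Y": "%Y", "M": "%m", "D": "%d"}
    fmt = sep.join(codes[c] for c in order)
    # A trailing separator, as in the Hungarian style '2006.03.15.', is dropped.
    return datetime.strptime(text.strip().rstrip(sep), fmt).date()

print(parse_number("7,35"), parse_number("1 234,5"))
print(parse_number("1,234.5", decimal=".", thousands=","))
print(parse_date("2006.03.15."), parse_date("03/15/2006", order="MDY", sep="/"))
```

Stating the format explicitly, instead of relying on autodetection, is the scripted counterpart of setting the Date format and separators by hand in the wizard.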
Data Import/Export/Conversion: From Excel to Text/HTML/Presentation
• Excel table to text format table:
  • We can copy the selected Excel table, diagram, or both together to the clipboard with Edit|Copy (Szerkesztés|Másolás) Ctrl+C
  • Paste it into Word or PowerPoint with the Edit|Paste special (Szerkesztés|Irányított beillesztés) menu|in Plain text (Nem formázott szöveg) format. It puts Tab (ASCII 9) characters between the columns as delimiters
  • If we would like another delimiter, paste the table as HTML and convert it to text as described earlier, choosing the delimiter character
  • Alternatively, you can concatenate the content of the columns into continuous text in a separate column using a cell formula: =A1&","&B1&","&C1 where & is text concatenation, "" encloses a text constant, and A1 is a cell reference
• Excel table to presentation: HTML table/Picture Metafile/Bitmap:
  • Never ever paste it with Edit|Paste (Szerkesztés|Beillesztés) Ctrl+V into Word or PowerPoint!!! This embeds the WHOLE Excel file invisibly into the document/presentation, as many times as you pasted any part of the table:
    • The embedded Excel table can still compute its cell formulas, but most of the time we do not need that
    • However, it results in a huge document/presentation file, which will frequently freeze Word and PowerPoint
  • Instead, you should paste it with Edit|Paste special (Szerkesztés|Irányított beillesztés), where you have the following options:
    • HTML format: preserves color/font formatting well and the table is fully editable (cell formulas are replaced with numbers). Row/column sizes and margins get messed up, and it is a lot of work to fix them!
    • Picture Metafile: excellent preservation of all formatting. Resizes excellently. Cannot be edited as a table, but can be edited as a drawing with the Word/PPT drawing tool. For simple tables/graphics it consumes less resources than a bitmap
    • Bitmap: it is pasted exactly as you see it on screen. Resizes badly, quality deteriorates rapidly. Very limited editability with Paintbrush. For highly complex diagrams a bitmap consumes less resources than a metafile
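For comparison, writing table rows out as delimited text with a freely chosen delimiter, which the cell formula `=A1&","&B1&","&C1` does by hand, can be sketched in Python with the standard `csv` module. The function name is hypothetical.

```python
import csv
import io

def rows_to_text(rows, delimiter=","):
    """Join each row's cells with the delimiter, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

table = [["AA", "BB", "CC"], [6.45, 5.5, 7.35], [15.6, 17.8, 3.2]]
print(rows_to_text(table))
print(rows_to_text(table, delimiter="\t"))
print(rows_to_text([["a;b", "c"]], delimiter=";"))
```

Unlike the plain concatenation formula, the writer automatically puts text markers around any cell that itself contains the delimiter.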
Data Import/Export/Conversion: Excel diagram to Picture Metafile in PPT
• Re-formatting charts in a presentation: there are some chart features that cannot be set in Excel but can be achieved in a metafile. E.g. in complex 3D area charts it would be great to have semi-transparent function surfaces partially covering each other, but this cannot be done in Excel
• How to do it:
  • Copy the 3D area chart through the clipboard as a metafile
  • Convert the metafile into a PPT drawing with View|Toolbars|Drawing|Drawing menu|Ungroup (Nézet|Eszköztárak|Rajzoló|Rajzoló menü|Csoportbontás), and repeat it as long as it can be done
  • Delete the unnecessary chart background, axis, axis text, etc. elements
  • Select all remaining elements, open their formatting by double-clicking on the selection, and set their color, border, and transparency
  • Group the elements together again
• But a complicated drawing containing 1000s of elements can eat up a lot of resources and freeze the presentation
  • Therefore, cut the metafile to the clipboard with Edit|Cut (Szerkesztés|Kivágás)
  • Paste it as a GIF picture with Edit|Paste special|GIF (Szerkesztés|Irányított beillesztés|GIF). It keeps transparency and reduces resource consumption, but from then on it can be edited only as an image
Data Import/Export/Conversion: From Excel to Access Database Table
• As an Excel worksheet (in .xls format) can hold max. 65536 rows, it is worth putting sizeable data tables into a database before Excel freezes. In Access, the steps are the following:
  • With the File|New|Empty database|{Path/Name.mdb}|Save (Fájl|Új|Üres adatbázis|{Elérési út/Név.mdb}|Mentés) menu we create a new empty *.mdb database file with the given name at the given path
  • With the File|Get external data|Import|Excel + Name.xls|Import (Fájl|Külső adatok átvétele|Importál|Excel fájlok + Név.xls|Importálás) menu, the import wizard is launched (only if Access was installed with the full setup option!):
    • First, we select the worksheet from which the table is imported: it should have a regular row/column structure, with column names in the first line and the same type of data within each column, otherwise Access cannot import it
    • Next, we can see the table to import, and it asks whether there are column names in the first line
    • Next, it asks whether to put the data into a new database table or into an existing one (which must have a compatible column structure to receive the data)
    • Next, we can review the types of the columns
    • Next, it asks whether to assign a primary key to the table: No
    • At Finish, it asks for the name of the new table: Munka1 (the default worksheet name in Hungarian Excel, i.e. Sheet1)
  • After the wizard has finished, the new table can be opened with a double-click on the Tables|Munka1 icon
• Access can handle ca. 10M rows in a table and computes much faster than Excel
• However, programming it is much more difficult: it is done in Structured Query Language (SQL)
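To give a feel for what working in SQL looks like, here is a minimal sketch that loads a small table into a database and queries it. It uses SQLite through Python's standard `sqlite3` module as a stand-in for Access (an assumption made so that the example runs anywhere); the function name, the sample values and the all-numeric column type are also assumptions.

```python
import sqlite3

def load_table(rows, header, table="Munka1"):
    """Create an in-memory database table and fill it with the given rows."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}" REAL' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return conn

conn = load_table([(6.45, 5.5, 7.35), (15.6, 17.8, 3.2)], ["AA", "BB", "CC"])
# SQL query: number of rows and the largest value of column BB.
summary = conn.execute('SELECT COUNT(*), MAX("BB") FROM "Munka1"').fetchone()
print(summary)
```

The query syntax shown (SELECT, COUNT, MAX) is common to most SQL dialects, although details such as identifier quoting differ between SQLite and Access.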
Home Assignment 3: Fragment clean and match
• Clean the following fragments, given in FASTA format, of primer and vector sequences, and try to match them using suitable software! (5 pts)
  • Fragment1: Fragment1.txt
  • Fragment2: Fragment2.txt
  • Fragment3: Fragment3.txt
• Solution: 3-1HomeAssignSolution.doc
References
• Cloning, fragment processing:
  • Restriction site database: WebCutter: http://rna.lundberg.gu.se/cutter2/index.html
  • Primer cleaning: SMS2 DNA Pattern: http://www.bioinformatics.org/sms2/index.html
  • Vector cleaning: NCBI VecScreen: http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
  • Fragment assembly: CAP3: http://pbil.univ-lyon1.fr/cap3.php
• Auxiliary sequence operations: SMS2: http://www.bioinformatics.org/sms2/index.html
• Uploading sequences: EBI: http://www.ebi.ac.uk/embl/Submission/index.html
• Data Import/Export/Conversion operations in Excel/Access:
  • http://www.andrewsexceltips.com/
  • http://www.andypope.info/
  • http://www.dicks-blog.com/