
Reading for the Next Week

Presentation Transcript


  1. Reading for the Next Week • Sequence Analysis and Alignment • Chapter 5, Chapter 8, Chapter 11 • Only about the 1st third of each chapter

  2. Sequence Files • FASTA format has the simplest structure: >Sequence Title [new line] Sequence [new line] • very useful for handling the sequence alone • usually included as one of the formats supported by programs that use sequences

  3. Example of Fasta Format
     >gi|212244|gb|M16260.1|CHKLCAMR…
     AGCTCCGTGCGCAGCGGTACCCGTACCGGTACCGGCCCGGTCCCTGAGCCATGGGCCGGCGGTGGGGTTCCCCCGCCCTGCAGCGCTTCCCCGTGTTGGTGCTGCTGCTGCTGCTCCAGGTGTGCGGCCGGCGGTGCGACGAGGCAGCCCCCTGCCAGCCCGGCTTTGCTGCAGAGACCTTCAGCTTCAGTGTGCCCCAGGACAGCGTGGCGGCGGGCAGGGAGCTGGGACGAGTGAGCTTTGCAGCCTGCAGCGGGCGGCCGTGGGCCGTGTATGTCCCGACTGACA…
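The title-line-then-sequence structure described above can be read with very little code. The following is a minimal Python sketch, not a full-featured parser: the function name `parse_fasta` and the demo title are hypothetical, and the sequence is just the first 40 bases of the example, wrapped over two lines to show that sequence lines are concatenated.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each title line (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            title = line[1:]  # a '>' line starts a new record
            records[title] = ""
        elif title is not None:
            records[title] += line  # sequence may span several lines
    return records


# Hypothetical two-line record built from the first 40 bases of the example.
example = """>demo_sequence hypothetical title
AGCTCCGTGCGCAGCGGTAC
CCGTACCGGTACCGGCCCGG
"""
records = parse_fasta(example.splitlines())
print(records)
```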

  4. GENBANK Flat File • holdover from earlier versions of GENBANK, the US government-supported public database • DNA-centric, sequence based view of data • contains a number of fields with non-sequence information

  5. LOCUS CHKLCAMR 3545 bp mRNA linear VRT 30-NOV-1995
     DEFINITION Chicken liver cell adhesion molecule L-CAM mRNA, complete cds.
     ACCESSION M16260 J04074 M22179
     VERSION M16260.1 GI:212244
     KEYWORDS cadherin; glycoprotein; liver cell adhesion molecule.
     SOURCE Gallus gallus cDNA to mRNA.
       ORGANISM Gallus gallus
         Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archosauria; Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
     REFERENCE 1 (bases 201 to 3545)
       AUTHORS Gallin,W.J., Sorkin,B.C., Edelman,G.M. and Cunningham,B.A.
       TITLE Sequence analysis of a cDNA clone encoding the liver cell adhesion molecule, L-CAM
       JOURNAL Proc. Natl. Acad. Sci. U.S.A. 84 (9), 2808-2812 (1987)
       MEDLINE 87204217

  6. FEATURES Location/Qualifiers
       source 1..3545
         /organism="Gallus gallus"
         /db_xref="taxon:9031"
         /clone="pEC3(20,30,31)"
         /tissue_type="liver"
         /dev_stage="10-11 day old embryo"
       mRNA <1..3545
         /product="L-CAM mRNA"
       CDS 51..2714
         /codon_start=1
         /product="liver cell adhesion protein precursor"
         /protein_id="AAA82573.1"
         /db_xref="GI:212245"
         /translation="MGRRWGSPALQRFPVLVLLLLLQV…GGEDDE"
       sig_peptide 51..128
       mat_peptide 531..2711
         /product="liver cell adhesion protein"
     BASE COUNT 757 a 1125 c 1051 g 612 t
     ORIGIN 20 bp upstream of KpnI site.
       1 agctccgtgc gcagcggtac ccgtaccggt accggcccgg tccctgagcc atgggccggc
       61 ggtggggttc…
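Because each field of the flat file starts with a keyword, the non-sequence information can be pulled out with simple text handling. The sketch below assumes the usual layout in which a top-level keyword starts in the first column and continuation lines are indented; the function name `parse_flat_fields` is hypothetical, the column spacing and line wrapping of the sample text are assumptions, and two-word keywords such as BASE COUNT and repeated keywords such as REFERENCE are deliberately not handled.

```python
def parse_flat_fields(text):
    """Collect top-level keyword fields; indented lines continue the last field."""
    fields = {}
    keyword = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A keyword in the first column opens a new field.
            # Note: a repeated keyword would overwrite the earlier value.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
        elif keyword is not None:
            # Indented text (including sub-keywords) is appended to the field.
            fields[keyword] += " " + line.strip()
    return fields


# Header lines taken from the record above; the spacing is an assumption.
record_text = """\
LOCUS       CHKLCAMR     3545 bp    mRNA    linear   VRT 30-NOV-1995
DEFINITION  Chicken liver cell adhesion molecule L-CAM mRNA, complete
            cds.
ACCESSION   M16260 J04074 M22179
VERSION     M16260.1  GI:212244
"""
fields = parse_flat_fields(record_text)
print(fields["DEFINITION"])
print(fields["ACCESSION"].split())
```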

  7. Other Formats • XML - extensible markup language • similar to HTML only can implement user-defined tags • Graphic • extracts positions from features and creates a graphical output

  8. Database Types Characteristics, Strengths and Weaknesses

  9. What is a Database? • well-defined storage method for digital data • allows for relatively rapid retrieval of data • allows for complex conditional retrieval

  10. Three Main Types Used in Bioinformatics • Flat File • text stored in a file in a stereotyped format • a hierarchical flat file adds a “tree” organization • Relational • a set of tables, with unique identifiers and overlapping content • Object-Oriented • data stored as part of a data structure (the object) that includes methods for manipulating the data

  11. Flat File Database • data is stored as an “unstructured” record • relationships between the data are inherent in the database schema, the description of the syntax of the storage file

  12. Flat File Database • Advantages • low overhead, do not need to have a complex computational superstructure to organize the data and keep track of it in memory • retrieval is not computationally complex • can take advantage of generalized standards for information organization

  13. Disadvantages • no random access, therefore the simplicity of storage imposes a cost on access and manipulation • partially resolved by indexing • change in the schema requires parsing and rewriting the whole database • all linkages between data entries must be explicitly defined either in the schema or by software that accesses the database
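The remark that the lack of random access is partially resolved by indexing can be illustrated as follows. This is a minimal sketch under stated assumptions: the "database" is a tiny in-memory FASTA-style file with made-up records, the index maps the first word of each title line to its byte offset, and the names `build_index` and `fetch` are hypothetical.

```python
import io


def build_index(handle):
    """Scan the file once and record the byte offset of every '>' title line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode("ascii")
            index[name] = offset
    return index


def fetch(handle, index, name):
    """Jump straight to one record instead of reading the whole file."""
    handle.seek(index[name])
    handle.readline()  # skip the title line
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record reached
        chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")


# Made-up two-record flat file held in memory for the demonstration.
flat_file = io.BytesIO(
    b">seqA first demo record\nAGCT\nCCGT\n>seqB second demo record\nGGTACC\n"
)
index = build_index(flat_file)
print(fetch(flat_file, index, "seqB"))
```

The index must be rebuilt whenever the file changes, which is one reason indexing only partially resolves the problem.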

  14. Relational Databases • functionally consist of a set of tables, where each row in the table contains a set of properties of some entity • extensive formal analysis of the relational approach has yielded a set of “normalizations” that maximize the interconnections between information and minimize redundancies

  15. Relational Databases • Advantages • readily available database management systems (DBMSs) that handle the computational overhead invisibly • high interconnectivity of data enhances data mining process • Structured Query Language (SQL) exists to make searching automated and relatively rapid, even complex searches

  16. changing schema does not necessarily involve rewriting whole database; can add new tables or new columns to existing tables • most common commercial database type therefore lots of support available (if you have the money) • wide usage means user skills are generalizable
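The ideas of linked tables and SQL searches can be sketched with Python's built-in `sqlite3` module. The two-table layout (`sequence` and `feature`) and all column names are assumptions made for illustration; the values come from the GENBANK record shown earlier.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    accession TEXT PRIMARY KEY,
    organism  TEXT,
    length_bp INTEGER
);
CREATE TABLE feature (
    id        INTEGER PRIMARY KEY,
    accession TEXT REFERENCES sequence(accession),
    kind      TEXT,
    start_pos INTEGER,
    end_pos   INTEGER
);
""")
conn.execute("INSERT INTO sequence VALUES (?, ?, ?)",
             ("M16260", "Gallus gallus", 3545))
conn.executemany(
    "INSERT INTO feature (accession, kind, start_pos, end_pos) VALUES (?, ?, ?, ?)",
    [("M16260", "CDS", 51, 2714),
     ("M16260", "sig_peptide", 51, 128),
     ("M16260", "mat_peptide", 531, 2711)],
)

# A conditional query across both tables: features longer than 100 bases.
rows = conn.execute("""
    SELECT s.organism, f.kind, f.end_pos - f.start_pos + 1 AS span
    FROM sequence AS s
    JOIN feature AS f ON f.accession = s.accession
    WHERE f.end_pos - f.start_pos + 1 > 100
    ORDER BY span DESC
""").fetchall()
print(rows)
```

Adding a new column or table later would not require rewriting the existing rows, which matches the point about schema changes.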

  17. Disadvantages • overhead (computational and expertise) makes cost high for small databases • content-based query is only rudimentary; complex “fuzzy” queries cannot be done within SQL • not all implementations fully conform to the theoretical criteria, therefore problems arise in large databases and/or complex queries

  18. 5’ Break

  19. Object-Oriented Databases • based on the object concept, a computational entity that consists of data and a set of methods that will perform operations on that data • ACeDB, the core DBMS for the C. elegans sequencing project is object oriented
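The object concept, data bundled with the methods that operate on it, can be sketched in a few lines of Python. This is a generic illustration and not ACeDB's data model; the class name `SequenceObject` and its methods are hypothetical, and the bases are the first 20 of the example record.

```python
class SequenceObject:
    """A computational entity holding data plus methods that act on that data."""

    _COMPLEMENT = str.maketrans("acgt", "tgca")

    def __init__(self, identifier, bases):
        self.identifier = identifier
        self.bases = bases.lower()

    def length(self):
        return len(self.bases)

    def reverse_complement(self):
        # Complement each base, then reverse the string.
        return self.bases.translate(self._COMPLEMENT)[::-1]

    def base_count(self):
        return {base: self.bases.count(base) for base in "acgt"}


obj = SequenceObject("M16260.1", "AGCTCCGTGCGCAGCGGTAC")
print(obj.length(), obj.base_count())
print(obj.reverse_complement())
```

Code that uses the object never touches the string directly; it asks the object to do the work, which is the sense in which less procedural programming is needed.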

  20. Object-Oriented Databases • Advantages • pre-existing schema is already worked out if you use ACeDB (http://www.acedb.org/) • a lot of procedural programming is not needed because methods for data manipulation are intrinsic to the object • natural database for object-oriented languages like C++ and Java

  21. Disadvantages • not easy to tweak; the DBMS is fairly complex, really only the developer community can alter it • if their data model is not adequate for your project, there is no easy way to expand it • therefore, tends to be good for specific genomes, high throughput operations, not databases set up and maintained by small users for idiosyncratic projects

  22. Combined Relational/Object • Relational database (tables) that can hold objects • As implemented, the DBMS simulates the object by creating a set of hidden tables • Larger computational overhead, less user control of database structure

  23. Summary • each database type has strengths and weaknesses • choice of database to use depends on many cost factors (money, computational overhead, learning curve for use, pre-existing support) • there is no single right choice

  24. GENBANK • the core GENBANK archival database is a flat file format • historically that is the way it started • when a major revamping was undertaken in the mid 90s, stayed with flat file format, but introduced a defined hierarchical data model using ASN.1

  25. ASN.1 • the underlying structure of the GENBANK database uses Abstract Syntax Notation as the syntax definition • this is a standard, general syntax definition for holding information in a machine-parseable form • hierarchical structure helps organize data

  26. Seq-entry ::= set {
        level 1 ,
        class nuc-prot ,
        descr {
          title "Chicken liver cell adhesion molecule L-CAM mRNA, and translated products" ,
          update-date std { year 1995 , month 11 , day 30 } ,
          source {
            org {
              taxname "Gallus gallus" ,
              common "chicken" ,
              db { { db "taxon" , tag
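The hierarchical organization of this excerpt can be mirrored with nested Python dictionaries. The sketch below is a simplification rather than an ASN.1 parser: it contains only the fields visible above, drops the `std` choice wrapper around the date, leaves out the truncated `db` part, and uses a hypothetical helper named `walk` to list every leaf with its path through the tree.

```python
# Nested-dict mirror of the visible part of the Seq-entry (a simplification).
seq_entry = {
    "level": 1,
    "class": "nuc-prot",
    "descr": {
        "title": "Chicken liver cell adhesion molecule L-CAM mRNA, "
                 "and translated products",
        "update-date": {"year": 1995, "month": 11, "day": 30},
        "source": {
            "org": {"taxname": "Gallus gallus", "common": "chicken"},
        },
    },
}


def walk(node, path=()):
    """Yield (path, value) for every leaf of the tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, path + (key,))
    else:
        yield path, node


paths = dict(walk(seq_entry))
for path, value in paths.items():
    print("/".join(path), "=", value)
```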

  27. GENBANK Data Model • to implement a database, you must have a data model • for flat files, consists of a set of rules about • the format of data storage • the syntax of the storage • implementation of any data analysis or manipulation is the responsibility of the user

  28. GENBANK Data Model • explicitly defined in an on-line document • http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML • note: although not object oriented, the specification uses much of the terminology of object-oriented programming

  29. BioSeqs • each has at least one Seq-id • contains information about a biological sequence • virtual - contains a molecular type, a size, and topology (e.g. a band on a gel, an intron whose sequence has not been determined)

  30. raw - simple single sequence, which has all the properties of virtual plus actual sequence data • segmented - contains identifiers for other bioseqs and relative positional information, thus yielding a size • map - contains a rough size and co-ordinates that represent some kind of map data
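The relationship between the virtual, raw, and segmented types can be sketched with Python dataclasses. This is an illustration of the ideas on the slides, not the NCBI data structures: the class and field names are hypothetical, the segmented type holds its parts directly rather than by identifier and position, and the map type is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VirtualBioseq:
    """Molecule type, size, and topology are known; no sequence data."""
    seq_id: str
    molecule: str
    length: int
    topology: str = "linear"


@dataclass
class RawBioseq:
    """Everything a virtual bioseq has, plus the actual residues."""
    seq_id: str
    molecule: str
    residues: str
    topology: str = "linear"

    @property
    def length(self) -> int:
        return len(self.residues)


@dataclass
class SegmentedBioseq:
    """Built from other bioseqs; its size follows from its parts."""
    seq_id: str
    molecule: str
    parts: List[object] = field(default_factory=list)

    @property
    def length(self) -> int:
        return sum(part.length for part in self.parts)


# Hypothetical gene: two sequenced exons around an unsequenced intron.
exon1 = RawBioseq("demo|exon1", "dna", "AGCTCC")
intron = VirtualBioseq("demo|intron1", "dna", 100)
exon2 = RawBioseq("demo|exon2", "dna", "GTGC")
gene = SegmentedBioseq("demo|gene", "dna", [exon1, intron, exon2])
print(gene.length)
```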

  31. Bioseq Sets • sets of bioseq entities that are related somehow • nuc-prot set - nucleotide type bioseq and one or more associated protein type bioseqs • population set - set of related bioseqs that are aligned with each other. This is a basic type for population and phylogenetic studies

  32. Seq-Annot • A self-contained annotation that refers to a specific bio-seq entity • Can have multiple seq-annots • These elements hold the annotation data, e.g. positions of start sites, stop site, introns, regulatory sequences
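An annotation that stands apart from the sequence but points at it by identifier can be sketched as below. The class names `Feature` and `SeqAnnot` and the helper `first_codon` are hypothetical; the coordinates and the first 60 bases come from the GENBANK record shown earlier, with positions treated as 1-based and inclusive.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    kind: str
    start: int  # 1-based, inclusive
    stop: int   # 1-based, inclusive


@dataclass
class SeqAnnot:
    """Self-contained annotation that refers to one bioseq by its identifier."""
    seq_id: str
    features: List[Feature] = field(default_factory=list)


def first_codon(bases: str, feature: Feature) -> str:
    """Return the first three bases of a feature (converts to 0-based slicing)."""
    return bases[feature.start - 1:feature.start + 2]


# First 60 bases of the example record (ORIGIN line 1).
first_60 = ("agctccgtgc" "gcagcggtac" "ccgtaccggt"
            "accggcccgg" "tccctgagcc" "atgggccggc")

# Several annotations may refer to the same sequence.
coding = SeqAnnot("M16260.1", [Feature("CDS", 51, 2714)])
peptides = SeqAnnot("M16260.1", [Feature("sig_peptide", 51, 128),
                                 Feature("mat_peptide", 531, 2711)])

start_codon = first_codon(first_60, coding.features[0])
print(start_codon)
```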

  33. Our Lab Project • one of the main elements of the labs in this course is designing a database, populating it, analyzing the sequences in various ways, and annotating the database • we will use a flat file format to store data on a family of proteins • therefore, we need to define a schema
