Reading for the Next Week • Sequence Analysis and Alignment • Chapter 5, Chapter 8, Chapter 11 • Only about the 1st third of each chapter
Sequence Files • FASTA format has the simplest structure: >Sequence Title [new line] Sequence [new line] • very useful for handling sequence alone • usually included as one of the formats supported by programs that use sequence
Example of Fasta Format
>gi|212244|gb|M16260.1|CHKLCAMR…
AGCTCCGTGCGCAGCGGTACCCGTACCGGTACCGGCCCGGTCCCTGAGCCATGGGCCGGCGGTGGGGTTCCCCCGCCCTGCAGCGCTTCCCCGTGTTGGTGCTGCTGCTGCTGCTCCAGGTGTGCGGCCGGCGGTGCGACGAGGCAGCCCCCTGCCAGCCCGGCTTTGCTGCAGAGACCTTCAGCTTCAGTGTGCCCCAGGACAGCGTGGCGGCGGGCAGGGAGCTGGGACGAGTGAGCTTTGCAGCCTGCAGCGGGCGGCCGTGGGCCGTGTATGTCCCGACTGACA…
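The two-part structure (a title line starting with ">" followed by sequence lines) is simple enough to parse in a few lines. The following is a minimal sketch, assuming the sequence may be wrapped over several lines; the function name parse_fasta and the two example records are hypothetical and are not taken from the slides.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {title: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                  # ignore blank lines
        if line.startswith(">"):
            title = line[1:]          # drop the leading '>'
            records[title] = ""
        elif title is not None:
            records[title] += line    # sequence may span several lines
    return records


# Hypothetical two-record input, for illustration only.
example = """>seq1 hypothetical example record
AGCTCCGTGC
GCAGCGGTAC
>seq2 another hypothetical record
ATGGGCCGGC
"""
records = parse_fasta(example.splitlines())
for title, seq in records.items():
    print(title, len(seq))
```

Because the title line is free text, any structure inside it (such as the "|"-separated fields in the example above) has to be interpreted by the program that reads the file.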
GENBANK Flat File • holdover from earlier versions of GENBANK, the US government-supported public database • DNA-centric, sequence based view of data • contains a number of fields with non-sequence information
LOCUS       CHKLCAMR 3545 bp mRNA linear VRT 30-NOV-1995
DEFINITION  Chicken liver cell adhesion molecule L-CAM mRNA, complete cds.
ACCESSION   M16260 J04074 M22179
VERSION     M16260.1 GI:212244
KEYWORDS    cadherin; glycoprotein; liver cell adhesion molecule.
SOURCE      Gallus gallus cDNA to mRNA.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Archosauria; Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
REFERENCE   1 (bases 201 to 3545)
  AUTHORS   Gallin,W.J., Sorkin,B.C., Edelman,G.M. and Cunningham,B.A.
  TITLE     Sequence analysis of a cDNA clone encoding the liver cell adhesion molecule, L-CAM
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 84 (9), 2808-2812 (1987)
  MEDLINE   87204217
FEATURES             Location/Qualifiers
     source          1..3545
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /clone="pEC3(20,30,31)"
                     /tissue_type="liver"
                     /dev_stage="10-11 day old embryo"
     mRNA            <1..3545
                     /product="L-CAM mRNA"
     CDS             51..2714
                     /codon_start=1
                     /product="liver cell adhesion protein precursor"
                     /protein_id="AAA82573.1"
                     /db_xref="GI:212245"
                     /translation="MGRRWGSPALQRFPVLVLLLLLQV…GGEDDE"
     sig_peptide     51..128
     mat_peptide     531..2711
                     /product="liver cell adhesion protein"
BASE COUNT      757 a   1125 c   1051 g    612 t
ORIGIN      20 bp upstream of KpnI site.
        1 agctccgtgc gcagcggtac ccgtaccggt accggcccgg tccctgagcc atgggccggc
       61 ggtggggttc…
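The non-sequence fields in this record can be cross-checked against each other. The sketch below copies three values from the record above (the LOCUS line, the BASE COUNT line and the CDS location) into strings and pulls the numbers out with regular expressions. It assumes the simple "start..end" location form only; real feature locations can be more complex (for example the "<1..3545" partial location on the mRNA feature).

```python
import re

# Values copied from the record shown above.
locus_line = "LOCUS CHKLCAMR 3545 bp mRNA linear VRT 30-NOV-1995"
base_count_line = "BASE COUNT 757 a 1125 c 1051 g 612 t"
cds_location = "51..2714"

# Sequence length declared on the LOCUS line (the number before "bp").
locus_length = int(re.search(r"(\d+) bp", locus_line).group(1))

# BASE COUNT is a series of (count, base) pairs.
base_counts = {
    base: int(count)
    for count, base in re.findall(r"(\d+) ([acgt])", base_count_line)
}

# A simple "start..end" location is 1-based and inclusive at both ends.
start, end = (int(x) for x in cds_location.split(".."))
cds_length = end - start + 1

print(locus_length, sum(base_counts.values()))  # 3545 3545
print(cds_length, cds_length % 3)               # 2664 0
```

The base counts add up to the length on the LOCUS line, and the CDS length is a whole number of codons, which is the kind of consistency check a program reading flat files has to perform for itself.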
Other Formats • XML - Extensible Markup Language • similar to HTML, but allows user-defined tags • Graphic • extracts positions from features and creates a graphical output
Database Types Characteristics, Strengths and Weaknesses
What is a Database? • well-defined storage method for digital data • allows for relatively rapid retrieval of data • allows for complex conditional retrieval
Three Main Types Used in Bioinformatics • Flat File • text stored in a file in stereotyped format • a hierarchical variant adds a “tree” organization • Relational • a set of tables, with unique identifiers, and overlapping content • Object Oriented • data stored as part of a data structure (the object) that includes methods for manipulating the data
Flat File Database • data is stored as an “unstructured” record • relationships between the data are inherent in the database schema, the description of the syntax of the storage file
Flat File Database • Advantages • low overhead, do not need to have a complex computational superstructure to organize the data and keep track of it in memory • retrieval is not computationally complex • can take advantage of generalized standards for information organization
Disadvantages • no random access, therefore the simplicity of storage imposes a cost on access and manipulation • partially resolved by indexing • change in the schema requires parsing and rewriting the whole database • all linkages between data entries must be explicitly defined either in the schema or by software that accesses the database
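The indexing workaround mentioned above can be sketched as follows: scan the flat file once, record the byte offset at which each record starts, and later seek directly to a record instead of reading from the top. This is a minimal illustration, assuming FASTA-style records held in memory with io.BytesIO; the names build_index and fetch and the two records are hypothetical.

```python
import io
from typing import BinaryIO, Dict


def build_index(handle: BinaryIO) -> Dict[str, int]:
    """Map each record identifier to the byte offset of its '>' line."""
    index: Dict[str, int] = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Use the first word of the title line as the identifier.
            identifier = line[1:].split()[0].decode("ascii")
            index[identifier] = offset
    return index


def fetch(handle: BinaryIO, index: Dict[str, int], identifier: str) -> str:
    """Seek straight to one record and return its sequence."""
    handle.seek(index[identifier])
    handle.readline()  # skip the title line
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break      # reached the next record
        parts.append(line.strip().decode("ascii"))
    return "".join(parts)


# Hypothetical flat file with two records.
flat_file = io.BytesIO(b">rec1 first\nAGCT\nTTGA\n>rec2 second\nCCGG\n")
index = build_index(flat_file)
print(index)                            # {'rec1': 0, 'rec2': 22}
print(fetch(flat_file, index, "rec2"))  # CCGG
```

The index has to be rebuilt whenever the file changes, which is one reason indexing only partially resolves the access problem.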
Relational Databases • functionally consist of a set of tables, where each row in the table contains a set of properties of some entity • extensive formal analysis of relational approach has yielded a set of “normalizations” that maximize the interconnections between information, minimize redundancies
Relational Databases • Advantages • readily available database management systems (DBMSs) that handle the computational overhead invisibly • high interconnectivity of data enhances data mining process • Structured Query Language (SQL) exists to make searching automated and relatively rapid, even complex searches
changing schema does not necessarily involve rewriting whole database; can add new tables or new columns to existing tables • most common commercial database type therefore lots of support available (if you have the money) • wide usage means user skills are generalizable
Disadvantages • overhead (computational and expertise) makes cost high for small databases • content-based query is only rudimentary; cannot do complex “fuzzy” queries within SQL • implementations do not fully conform to the theoretical criteria, therefore problems arise in large databases and/or complex queries
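A small sketch of the relational approach, using Python's built-in sqlite3 module with an in-memory database. The table and column names are hypothetical, and the second sequence row is made up for illustration; only the M16260 values come from the GenBank record shown earlier. It shows two tables linked by a unique identifier, a schema change that does not rewrite existing data, and a conditional query written in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE seq_record (
    accession   TEXT PRIMARY KEY,
    length_bp   INTEGER NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")
conn.execute("INSERT INTO organism VALUES (1, 'Gallus gallus')")
conn.execute("INSERT INTO seq_record VALUES ('M16260', 3545, 1)")
conn.execute("INSERT INTO seq_record VALUES ('HYPO0001', 120, 1)")  # hypothetical row

# Schema change: add a column without rewriting the existing rows.
conn.execute("ALTER TABLE seq_record ADD COLUMN molecule TEXT")
conn.execute("UPDATE seq_record SET molecule = 'mRNA' WHERE accession = 'M16260'")

# Conditional retrieval across both tables.
rows = conn.execute("""
    SELECT s.accession, o.name
    FROM seq_record AS s
    JOIN organism AS o ON s.organism_id = o.organism_id
    WHERE s.length_bp > 1000
    ORDER BY s.accession
""").fetchall()
print(rows)  # [('M16260', 'Gallus gallus')]
```

The organism name is stored once and referenced by identifier, which is the kind of redundancy reduction that normalization aims for.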
Object-Oriented Databases • based on the object concept, a computational entity that consists of data and a set of methods that will perform operations on that data • ACeDB, the core DBMS for the C. elegans sequencing project, is object oriented
Object-Oriented Databases • Advantages • pre-existing schema is already worked out if you use ACeDB (http://www.acedb.org/) • a lot of procedural programming is not needed because methods for data manipulation are intrinsic to the object • natural database for object-oriented languages like C++ and Java
Disadvantages • not easy to tweak; the DBMS is fairly complex, really only the developer community can alter it • if their data model is not adequate for your project, there is no easy way to expand it • therefore, tends to be good for specific genomes, high throughput operations, not databases set up and maintained by small users for idiosyncratic projects
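The object concept itself (data bundled with the methods that operate on it) can be illustrated with a small class. This is a generic sketch of the idea only, not ACeDB's data model or API; the class name NucleotideRecord and its methods are hypothetical. The example sequence is the first ten bases of the GenBank record shown earlier.

```python
class NucleotideRecord:
    """Hypothetical object that bundles sequence data with its methods."""

    _COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def __init__(self, identifier: str, sequence: str) -> None:
        self.identifier = identifier
        self.sequence = sequence.upper()

    def length(self) -> int:
        return len(self.sequence)

    def gc_fraction(self) -> float:
        """Fraction of bases that are G or C (0.0 for an empty sequence)."""
        if not self.sequence:
            return 0.0
        gc = self.sequence.count("G") + self.sequence.count("C")
        return gc / len(self.sequence)

    def reverse_complement(self) -> str:
        return self.sequence.translate(self._COMPLEMENT)[::-1]


record = NucleotideRecord("example", "agctccgtgc")
print(record.length())              # 10
print(record.gc_fraction())         # 0.7
print(record.reverse_complement())  # GCACGGAGCT
```

Code that uses the record calls its methods rather than manipulating the stored string directly, which is why less procedural programming is needed on the user's side.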
Combined Relational/Object • Relational database (tables) that can hold objects • As implemented, the DBMS simulates the object by creating a set of hidden tables • Larger computational overhead, less user control of database structure
Summary • each database type has strengths and weaknesses • choice of database to use depends on many cost factors (money, computational overhead, learning curve for use, pre-existing support) • there is no single right choice
GENBANK • the core GENBANK archival database is stored in a flat file format • historically that is the way it started • when a major revamping was undertaken in the mid 90s, it stayed with the flat file format, but introduced a defined hierarchical data model using ASN.1
ASN.1 • the underlying structure of the GENBANK database uses Abstract Syntax Notation One (ASN.1) as the syntax definition • this is a standard, general syntax definition for holding information in a machine-parseable form • hierarchical structure helps organize data
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Chicken liver cell adhesion molecule L-CAM mRNA, and translated products" ,
    update-date std {
      year 1995 ,
      month 11 ,
      day 30 } ,
    source {
      org {
        taxname "Gallus gallus" ,
        common "chicken" ,
        db {
          {
            db "taxon" ,
            tag
GENBANK Data Model • to implement a database, you must have a data model • for flat files, it consists of a set of rules about • the format of data storage • the syntax of the storage • implementation of any data analysis or manipulation is the responsibility of the user
GENBANK Data Model • explicitly defined in on-line document • http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/DATAMODL.HTML • note: although not object oriented, the specification uses much of the terminology of object-oriented programming
Bioseqs • has at least one Seq-id • contains information about a biological sequence • virtual - contains a molecular type, a size, and topology (e.g. a band on a gel, an intron whose sequence has not been determined)
raw - simple single sequence, which has all the properties of virtual plus actual sequence data • segmented - contains identifiers for other bioseqs and relative positional information, thus yielding a size • map - contains a rough size and co-ordinates that represent some kind of map data
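The difference between the virtual, raw and segmented classes comes down to where the size comes from: stated directly, taken from the sequence data, or derived from the referenced pieces. The sketch below is a simplified illustration of that idea using Python dataclasses; the class and field names are hypothetical and are not the definitions used in the NCBI data model, and the map class is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VirtualBioseq:
    seq_id: str
    molecule: str
    topology: str
    length: int        # size is stated; no sequence data is stored


@dataclass
class RawBioseq:
    seq_id: str
    molecule: str
    topology: str
    residues: str      # actual sequence data

    @property
    def length(self) -> int:
        return len(self.residues)


@dataclass
class SegmentedBioseq:
    seq_id: str
    molecule: str
    topology: str
    # Each segment is (seq_id of another bioseq, start, end), 1-based inclusive.
    segments: List[Tuple[str, int, int]]

    @property
    def length(self) -> int:
        return sum(end - start + 1 for _, start, end in self.segments)


# Hypothetical example: a sequenced exon followed by an unsequenced intron.
exon = RawBioseq("exon1", "dna", "linear", "ATGGGCCGGC")
intron = VirtualBioseq("intron1", "dna", "linear", 250)
gene = SegmentedBioseq(
    "gene1", "dna", "linear",
    [(exon.seq_id, 1, exon.length), (intron.seq_id, 1, intron.length)],
)
print(exon.length, intron.length, gene.length)  # 10 250 260
```

The segmented entry stores only identifiers and positions, so its size is computed from the pieces it points to rather than stored with it.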
Bioseq Sets • sets of bioseq entities that are related somehow • nuc-prot set - nucleotide type bioseq and one or more associated protein type bioseqs • population set - set of related bioseqs that are aligned with each other. This is a basic type for population and phylogenetic studies
Seq-Annot • A self-contained annotation that refers to a specific bioseq entity • A bioseq can have multiple seq-annots • These elements hold the annotation data, e.g. positions of start sites, stop sites, introns, regulatory sequences
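A rough sketch of how self-contained annotations can point at a bioseq by identifier, so that several independent annotations can describe the same sequence. The class and function names are hypothetical and much simpler than the real Seq-annot structure; the feature positions are taken from the GenBank record shown earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    kind: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


@dataclass
class SeqAnnot:
    seq_id: str                                  # the bioseq this annotation refers to
    features: List[Feature] = field(default_factory=list)


def features_at(annots: List[SeqAnnot], seq_id: str, position: int) -> List[str]:
    """Names of all features on seq_id that cover the given position."""
    return [
        feature.kind
        for annot in annots
        if annot.seq_id == seq_id
        for feature in annot.features
        if feature.start <= position <= feature.end
    ]


# Two separate annotations that refer to the same sequence.
coding = SeqAnnot("M16260.1", [Feature("CDS", 51, 2714)])
peptides = SeqAnnot("M16260.1", [
    Feature("sig_peptide", 51, 128),
    Feature("mat_peptide", 531, 2711),
])
print(features_at([coding, peptides], "M16260.1", 100))  # ['CDS', 'sig_peptide']
```

Because each annotation carries the identifier of its target, annotations can be added or removed without touching the sequence entry itself.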
Our Lab Project • one of the main elements of the labs in this course is designing a database, populating it, analyzing the sequences in various ways, and annotating the database • we will use a flat file format to store data on a family of proteins • therefore, we need to define a schema