Rice Sequence and Map Analysis Leonid Teytelman
Rice Genome Annotation • Sequence Alignments • Automation • Comparative Maps • Genetic Marker Correspondences • FPC Map • FPC I-Map • EnsEMBL Pipeline • Automated Annotation • Compute Farms
Aligned Data Sets: • Rice Coding Sequences: • Rice Complete CDSs • Rice TIGR GIs • Rice BGI EST Clusters • Rice dbEST ESTs • Rice BGI ESTs • Non-Rice Coding Sequences: • Maize Unigene Clusters • Maize TIGR GIs • Maize dbEST ESTs • Barley dbEST ESTs • Wheat dbEST ESTs • Sorghum dbEST ESTs • Rice CUGI BAC ends • Rice JRGP/Cornell RFLP Markers • Rice Cornell SSRs
Alignment Tools: • BLAT: search & alignment • pslReps: filtering of low-quality matches • e-PCR: matches based on near-identity to the PCR primers and their correct order
Alignment Methods: • Rice Coding Sequences: • BLAT search & alignment • pslReps filtering of repetitive matches • Accept based on percent of EST length matched • Non-Rice Coding Sequences: • BLAT search & alignment • pslReps filtering of repetitive matches • Accept based on hit length and hit frequency • Rice BAC ends: • BLAT search & alignment • Accept based on gap length, percent of BAC end length matched, percent identity, and hit frequency
Alignment Methods: • Rice Markers: • BLAT search & alignment • Accept based on percent of marker length matched and, for genomic markers, the gap length • Utilize genetic map information; accept those whose genetic & physical chromosome assignments are concordant • Rice SSRs: • e-PCR with default parameters, allowing 0 mismatches in the primers
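The idea behind a zero-mismatch primer search can be illustrated with a small exact-match scan. This is not e-PCR itself, only a simplified sketch of the same principle: both primers must be found exactly, in the correct order and orientation, with a product size near the expected one. The function names, the `margin` value, and the restriction to the forward strand are assumptions.

```python
from typing import List, Tuple

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def find_sts(target: str, fwd: str, rev: str,
             expected_size: int, margin: int = 50) -> List[Tuple[int, int]]:
    """Find (start, product_size) pairs where both primers match exactly.

    Only the forward strand of the target is scanned; the reverse primer
    must appear downstream of the forward primer as its reverse complement.
    """
    target = target.upper()
    fwd = fwd.upper()
    rev_rc = reverse_complement(rev)
    hits = []
    start = target.find(fwd)
    while start != -1:
        end = target.find(rev_rc, start + len(fwd))
        while end != -1:
            size = end + len(rev_rc) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = target.find(rev_rc, end + 1)
        start = target.find(fwd, start + 1)
    return hits


# Synthetic example: a 32 bp product starting at offset 2.
clone = "GG" + "ACGTAC" + "T" * 20 + reverse_complement("GGCATC") + "AA"
print(find_sts(clone, "ACGTAC", "GGCATC", expected_size=32, margin=5))
```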
February 2002 BAC/PAC Dataset • Total BACs/PACs: 1,847 • Total bp: 250,879,896 (~251 Mb) • Phase 1: 78 • Phase 2: 1,238 • Phase 3: 531 • Annotated Phase 3: 330 • Annotated Genes: 8,034
Automating Alignments: • For each group of data sets, there is a script to automatically: • Run pslReps • Load results into the database • Discard low-quality matches • Update documentation
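A per-group automation script might be organized along these lines. The sketch covers only the pslReps step; loading into the database and updating documentation depend on a schema not shown here. The file names, the coverage values, and the functions `psl_reps_command`, `plan_group`, and `run_group` are hypothetical. The command follows the usual `pslReps [options] in.psl out.psl out.psr` form with the `-minCover` option.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def psl_reps_command(raw_psl: Path, min_cover: float) -> List[str]:
    """Build the pslReps command line for one BLAT output file."""
    filtered = raw_psl.with_name(raw_psl.stem + ".filtered.psl")
    psr = raw_psl.with_suffix(".psr")
    return ["pslReps", f"-minCover={min_cover}",
            str(raw_psl), str(filtered), str(psr)]


def plan_group(datasets: Dict[str, float]) -> List[List[str]]:
    """One command per data set in the group (file name -> minimum coverage)."""
    return [psl_reps_command(Path(name), cover)
            for name, cover in datasets.items()]


def run_group(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them and stop on the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical group definition; thresholds are illustrative only.
RICE_CODING = {"rice_cds.psl": 0.9, "rice_est.psl": 0.8}
run_group(plan_group(RICE_CODING))
```

Keeping command construction separate from execution makes it possible to review a whole group's plan before anything runs.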
Map Correspondences: the same marker in multiple mapping studies • Name identity • Curated evidence • Sequence-based correspondences for JRGP and Cornell markers: • BLAT search & alignment • Utilize genetic mapping information, accepting matches on the same chromosome and less than 30 cM apart.
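The acceptance rule for sequence-based correspondences (same chromosome, less than 30 cM apart) can be expressed as a simple predicate. This sketch assumes the two positions are directly comparable in cM, which is only approximately true across different genetic maps; the names `MapPosition` and `is_correspondence` are illustrative.

```python
from typing import NamedTuple


class MapPosition(NamedTuple):
    chromosome: int
    cm: float  # genetic position in centimorgans


def is_correspondence(a: MapPosition, b: MapPosition,
                      max_distance_cm: float = 30.0) -> bool:
    """Accept a sequence match only if both markers lie on the same
    chromosome and strictly less than max_distance_cm apart."""
    return (a.chromosome == b.chromosome
            and abs(a.cm - b.cm) < max_distance_cm)


jrgp = MapPosition(chromosome=1, cm=42.0)
cornell = MapPosition(chromosome=1, cm=55.5)
print(is_correspondence(jrgp, cornell))
```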
[Figures: marker correspondences by evidence type (curator, same name, sequence-based)]
Cornell/JRGP markers mapped to sequenced clones were assigned positions on the FPC contigs.
Total: 2,272 4,417
EnsEMBL Pipeline Overview [Diagram stages: RepeatMasker, Genscan, Blast, GenomeBuilder, Hmmer, RepeatMasker, BLAT, GeneWise, Hmmer] • System for automated genome annotation • Executes and keeps track of computational jobs • Analysis job execution is serial, allowing stage dependencies • Jobs are user-defined • Can take advantage of a compute farm
Organization • Utilizes and expands on the EnsEMBL-core modules and database schema • Database stores: • analysis program names and parameters • analysis results • rules for job dependencies • progress status for each job • Perl modules: • access the database • execute specified analysis programs • parse the analysis results and load them into the database
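How stored dependency rules and per-job status can drive execution is sketched below. The real pipeline is implemented in Perl on top of the EnsEMBL schema; this Python illustration only mirrors the idea, and the specific rules shown (Genscan after RepeatMasker, Blast after Genscan) as well as the names `runnable_jobs` and `run_all` are assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical rule table: job name -> jobs that must finish first.
RULES: Dict[str, List[str]] = {
    "RepeatMasker": [],
    "Genscan": ["RepeatMasker"],
    "Blast": ["Genscan"],
}


def runnable_jobs(rules: Dict[str, List[str]],
                  status: Dict[str, str]) -> List[str]:
    """Jobs not yet done whose prerequisites are all done."""
    return [job for job, deps in rules.items()
            if status.get(job) != "done"
            and all(status.get(dep) == "done" for dep in deps)]


def run_all(rules: Dict[str, List[str]],
            execute: Callable[[str], None]) -> List[str]:
    """Run every job in an order that respects the rules; return that order."""
    status = {job: "pending" for job in rules}
    order = []
    while any(state != "done" for state in status.values()):
        ready = runnable_jobs(rules, status)
        if not ready:
            raise RuntimeError("no runnable job: unsatisfiable dependencies")
        for job in ready:
            execute(job)          # in a real system: launch the analysis
            status[job] = "done"  # in a real system: update the database
            order.append(job)
    return order


print(run_all(RULES, execute=lambda job: None))
```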
Cluster Utilization • How to split up tasks? • Contig-by-contig approach • How to execute jobs on slave nodes? • Load management and scheduling (LSF, PBS, etc.) • Management of management: • Automatic job submission • Error/completion checking
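The contig-by-contig approach with automatic submission and error/completion checking can be sketched locally. A thread pool stands in here for a scheduler such as LSF or PBS, which is an assumption made purely for illustration; `analyse_contig` is a hypothetical placeholder for a real analysis job.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def analyse_contig(contig_id: str) -> str:
    """Placeholder analysis; fails for ids ending in 'bad' to show error handling."""
    if contig_id.endswith("bad"):
        raise ValueError(f"analysis failed for {contig_id}")
    return f"{contig_id}: ok"


def run_on_contigs(contig_ids: Iterable[str],
                   task: Callable[[str], str],
                   max_workers: int = 4) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Submit one job per contig and sort the results into completed and failed."""
    completed: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, cid): cid for cid in contig_ids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                completed[cid] = future.result()
            except Exception as exc:  # record the error instead of aborting
                failed[cid] = str(exc)
    return completed, failed


done, errors = run_on_contigs(["ctg1", "ctg2", "ctg_bad"], analyse_contig)
print(sorted(done), sorted(errors))
```

Failed contigs are collected rather than aborting the run, so they can be resubmitted once the cause is understood.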