
Pipeline Introduction

This pipeline codifies the process of creating and managing data sets, reducing the human effort required and minimizing errors and omissions. It consists of sequential steps: plugin calls, script calls, and cluster jobs. There are two pipeline types: a resources pipeline, which downloads external resources and loads them into the database, and an analysis pipeline, which extracts data from the database and analyzes it. A resources repository serves as a cache for downloaded files and enables synchronization of data input across projects.


Presentation Transcript


1. Pipeline Introduction
• Sequential steps of:
  • Plugin calls
  • Script calls
  • Cluster jobs
• Purpose:
  • Codifies the process of creating the data set
  • Reduces the human effort required
  • Reduces human error and omissions

2. Two Pipeline Types
• Resources pipeline
  • Downloads resources from external sources
  • Loads resources into the database
  • Example: NRDB files
• Analysis pipeline
  • Extracts data from the database
  • Runs analysis programs on the data, on the main server or the cluster
  • Puts value-added data back into the database

3. Resource Pipeline
• Invoked by: loadresources xmlfile propfile
• Take a tour of a resources XML file (a hypothetical sketch follows)
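
The actual schema of a resources XML file is defined by loadresources; the element and attribute names below (resource, version, url, wgetArgs) are assumptions chosen to illustrate the idea of one entry per downloadable resource, not the confirmed format:

    <resourcesPipeline repository="/files/externalResources">
      <resource name="NRDB" version="2004-06-01"
                url="ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz"
                wgetArgs="--tries=5 --passive-ftp">
        <!-- unpack step and plugin call to load the file into the database -->
      </resource>
    </resourcesPipeline>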

4. Resources Repository
• Destination of downloads
• Houses files in a file system
• Serves as a cache for files
• Has an API to access files by name and version
  • If you request an existing file by name and version, the repository returns it without downloading
  • But the wget arguments must also match (the repository remembers them), as sketched below
• Particularly useful when multiple projects want to synchronize their data input
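
The repository's real API lives in GUS; the method and field names in this Perl sketch are hypothetical. It only captures the rule stated above: a request is served from the cache when name, version, and the remembered wget arguments all match, and triggers a fresh download otherwise.

    # Hypothetical sketch, not the repository's actual code.
    sub getFile {
        my ($self, $name, $version, $wgetArgs) = @_;
        my $entry = $self->{index}{$name}{$version};
        if ($entry && $entry->{wgetArgs} eq $wgetArgs) {
            return $entry->{path};    # cache hit: name, version, and wget args all match
        }
        # cache miss (or wget args changed): download and remember the args
        my $path = $self->download($name, $version, $wgetArgs);  # assumed helper
        $self->{index}{$name}{$version} = { wgetArgs => $wgetArgs, path => $path };
        return $path;
    }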

5. Analysis Pipeline
• Take a tour of the analysis pipeline file
• Take a tour of the Steps.pm file
• Take a tour of the property file (there's also one for the resource pipeline); an illustrative excerpt follows
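
The property names in a real pipeline property file are declared by the pipeline itself (see the Manager's property declaration on slide 7); this excerpt is a hypothetical illustration of the key=value format, not a list of actual required properties:

    # hypothetical analysis-pipeline properties
    projectDir=/usr/local/projects/myProject
    clusterServer=cluster.example.org
    clusterUser=jsmith
    gusConfigFile=/home/jsmith/gus.config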

6. Pipeline Directory Structure
• The directory that houses all the information for the pipeline, including:
  • Input data
  • Logs
  • Result data
  • Pipeline control information:
    • Which steps have been completed
    • Property files to control the cluster
• Structured for easy comprehension
• Take a tour of the directory structure (an example layout follows)
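
One plausible layout, assuming directory names of my own choosing (the real structure may differ):

    myProject/
      seqfiles/    input data (e.g., sequences fetched by the resources pipeline)
      logs/        one log per pipeline step
      analysis/    result data written by analysis steps
      signals/     control info: a marker file per completed step, so restarts skip it
      cluster/     property files and task directories for cluster jobs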

7. Analysis Pipeline API
• GUS::Pipeline::Manager.pm (usage sketched below)
  • Declares properties
  • Prevents steps from being rerun
  • Calls plugins
  • Executes commands
  • Eases communication with the cluster
• GUS::Pipeline::MakeTaskDirs.pm
  • Helps make the directories expected by distribjob on the cluster
• GUS::Pipeline::TaskRunAndValidate.pm
  • Helps run a series of tasks on the cluster
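
The sketch below shows how a pipeline driver script might use the Manager; the constructor argument and the exact method signatures are assumptions, so treat it as an illustration of the division of labor, not as the confirmed GUS API:

    use strict;
    use GUS::Pipeline::Manager;

    # Hypothetical constructor: point the manager at the property file.
    my $mgr = GUS::Pipeline::Manager->new('pipeline.prop');

    # The manager records completed steps, so on a restart each of these
    # runs only if it has not already succeeded.
    $mgr->runPlugin('loadTaxon',                        # step name (assumed signature)
                    'GUS::Supported::Plugin::LoadTaxon',
                    '--verbose');
    $mgr->runCmd('formatdb -i nr.fsa -p T');            # arbitrary shell command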

8. DJob
• Manages the distribution of tasks across a compute cluster
• Handles the case of a very large number of inputs that are processed independently and uniformly
  • For example, blasting a set of ESTs against a genome (see the sketch below)
• Now available for clusters using the PBS scheduler
• http://core.pcbi.upenn.edu/tools/liniactools.html
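
This is not DJob's actual input format; the sketch only illustrates the workload pattern it manages, splitting a large uniform input into independent subtasks, with hypothetical helpers read_fasta and run_blast_subtask:

    use strict;
    use warnings;

    my @ests = read_fasta('ests.fsa');   # hypothetical helper: one entry per EST
    my $subtask_size = 100;              # sequences handed to each cluster node

    # Because every EST is processed the same way and independently of the
    # others, the chunks can run in any order on any free node.
    while (my @chunk = splice(@ests, 0, $subtask_size)) {
        run_blast_subtask(\@chunk, 'genome.fsa');  # hypothetical helper
    }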
