
Advanced Bioinformatics



Presentation Transcript


  1. Advanced Bioinformatics Medicago Basic project: Study gene expression under a single condition

  2. Team members • Jente • Lifei • Yuebang • Nick

  3. Our chosen eukaryotic organism: Yeast

  4. Input data • Fastq files as sequence data • Genome.fa file as a reference genome • Genes.gtf

  5. TopHat, Cufflinks and Cuffmerge • TopHat uses genes.gtf, the genome.* index files and the FASTQ files to generate .bam files • Cufflinks uses accepted_hits.bam to generate a file called transcripts.gtf • Because the experiment was done in triplicate, we get 3 transcripts.gtf files. These are merged with Cuffmerge.

  6. gtf_to_fasta • With the program gtf_to_fasta we create a FASTA file that contains the sequences of all transcripts. • So now we have a FASTA file and a GTF file from which to extract data with programs and scripts.
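The steps on slides 5 and 6 can be sketched as a small Perl routine that only assembles the command lines, without running anything. This is a minimal sketch, not the team's actual script: the replicate file names, the output directory names, `assemblies.txt` and the sub name `pipeline_commands` are all hypothetical, and the options shown (`-G` and `-o` for TopHat, `-o` for Cufflinks, `-g`, `-s` and `-o` for Cuffmerge) are assumed to be the ones that were used.

```perl
use strict;
use warnings;

# Build (but do not run) the command lines for one TopHat + Cufflinks run per
# replicate, followed by Cuffmerge and gtf_to_fasta.
# All file and directory names are hypothetical placeholders.
sub pipeline_commands {
    my ($fastqs, $index, $genes_gtf, $genome_fa) = @_;
    my (@cmds, @assemblies);
    my $n = 0;
    for my $fastq (@$fastqs) {
        $n++;
        my $tophat_dir    = "tophat_rep$n";
        my $cufflinks_dir = "cufflinks_rep$n";
        # TopHat aligns the reads and writes <dir>/accepted_hits.bam
        push @cmds, ['tophat', '-G', $genes_gtf, '-o', $tophat_dir, $index, $fastq];
        # Cufflinks assembles transcripts and writes <dir>/transcripts.gtf
        push @cmds, ['cufflinks', '-o', $cufflinks_dir, "$tophat_dir/accepted_hits.bam"];
        push @assemblies, "$cufflinks_dir/transcripts.gtf";
    }
    # assemblies.txt must list the transcripts.gtf paths, one per line
    push @cmds, ['cuffmerge', '-g', $genes_gtf, '-s', $genome_fa,
                 '-o', 'merged', 'assemblies.txt'];
    push @cmds, ['gtf_to_fasta', 'merged/merged.gtf', $genome_fa, 'transcripts.fa'];
    return (\@cmds, \@assemblies);
}

my ($cmds, $assemblies) = pipeline_commands(
    ['rep1.fastq', 'rep2.fastq', 'rep3.fastq'], 'genome', 'genes.gtf', 'genome.fa');
print join(' ', @$_), "\n" for @$cmds;
```

Returning each command as a list of arguments means it could later be passed to `system(@$cmd)` without shell quoting problems; the paths in `@$assemblies` would first have to be written to `assemblies.txt`.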

  7. The Big Hash Table • From the FASTA we use/determine: • Gene_id • Sequence length • GC content • Codon usage • From the GTF we use/determine: • Gene_id • Expression level • Inter-transcript size • Intron length

  8. Reading the GTF file: • Sort the genes by expression and keep the top 100 • From the GTF we use/determine: • Gene_id • Expression level • Inter-transcript size • Intron length

  9. Key point: • First sort, then take the top 100 genes. • Build a hash table: gene_id (keys) to FPKM, intron length and inter-transcript size (values).
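The "first sort, then take the top 100" step might look like the sketch below. It assumes Cufflinks-style GTF lines in which the ninth column carries `gene_id "..."` and `FPKM "..."` attributes, and it keeps the highest transcript FPKM per gene, which is a choice made here rather than something the slides state. The sub names and the demo records are hypothetical.

```perl
use strict;
use warnings;

# Parse GTF lines and return a hash: gene_id => highest transcript FPKM.
sub fpkm_by_gene {
    my (@lines) = @_;
    my %fpkm;
    for my $line (@lines) {
        next if $line =~ /^#/;
        my @seq = split /\t/, $line;
        # Only transcript records are used; exon records are skipped.
        next unless @seq >= 9 && $seq[2] eq 'transcript';
        my ($gene_id) = $seq[8] =~ /gene_id "([^"]+)"/;
        my ($value)   = $seq[8] =~ /FPKM "([^"]+)"/;
        next unless defined $gene_id && defined $value;
        $fpkm{$gene_id} = $value
            if !exists $fpkm{$gene_id} || $value > $fpkm{$gene_id};
    }
    return %fpkm;
}

# Sort gene IDs by descending FPKM (ties broken by name) and keep the first $n.
sub top_genes {
    my ($fpkm, $n) = @_;
    my @sorted = sort { $fpkm->{$b} <=> $fpkm->{$a} or $a cmp $b } keys %$fpkm;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Made-up demo records, not real data.
my @demo = (
    join("\t", 'chrI', 'Cufflinks', 'transcript', 100, 900, 1000, '+', '.',
         'gene_id "geneA"; transcript_id "geneA.1"; FPKM "12.5";'),
    join("\t", 'chrI', 'Cufflinks', 'transcript', 100, 950, 1000, '+', '.',
         'gene_id "geneA"; transcript_id "geneA.2"; FPKM "40.0";'),
    join("\t", 'chrI', 'Cufflinks', 'transcript', 2000, 2600, 1000, '-', '.',
         'gene_id "geneB"; transcript_id "geneB.1"; FPKM "7.25";'),
    join("\t", 'chrI', 'Cufflinks', 'exon', 2000, 2600, 1000, '-', '.',
         'gene_id "geneB"; transcript_id "geneB.1"; FPKM "99";'),
);
my %fpkm = fpkm_by_gene(@demo);
my @top  = top_genes(\%fpkm, 1);
print "$_\t$fpkm{$_}\n" for @top;
```

For the real analysis, `top_genes(\%fpkm, 100)` would give the top 100 gene IDs.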

  10. Using an array: Gene_ID and FPKM are in seq[8] • Inter-transcript: use defined($seq2[1]) • Intron length: divided into different conditions (subroutines). After reading the next transcript line, calculate the last intron length.
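One way to compute the two distances named on this slide is sketched below. It assumes 1-based, inclusive GTF coordinates, takes exons (or transcripts) as `[start, end]` pairs on the same chromosome, and counts overlapping transcripts as a gap of 0. The sub names are hypothetical and the slides' own subroutines may handle more cases.

```perl
use strict;
use warnings;

# Intron lengths of one transcript from its exons, given as [start, end] pairs.
sub intron_lengths {
    my (@exons) = sort { $a->[0] <=> $b->[0] } @_;
    my @introns;
    for my $i (1 .. $#exons) {
        # Bases strictly between the end of one exon and the start of the next.
        push @introns, $exons[$i][0] - $exons[$i - 1][1] - 1;
    }
    return @introns;
}

# Gaps between neighbouring transcripts on one chromosome ([start, end] pairs).
sub inter_transcript_gaps {
    my (@tx) = sort { $a->[0] <=> $b->[0] } @_;
    my @gaps;
    for my $i (1 .. $#tx) {
        my $gap = $tx[$i][0] - $tx[$i - 1][1] - 1;
        push @gaps, $gap > 0 ? $gap : 0;   # overlapping transcripts count as 0
    }
    return @gaps;
}

my @introns = intron_lengths([100, 200], [301, 400], [451, 500]);
my @gaps    = inter_transcript_gaps([1, 1000], [1501, 2000], [1900, 2500]);
print "introns: @introns\n";
print "gaps: @gaps\n";
```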

  11. Important: hash table matching!

  12. Why do we need to analyse FPKM, intron length and inter-transcript size (correlation)?

  13. FPKM: gene expression level • Intron length: positively correlated with gene expression level • Inter-transcript size: gene density

  14. Reading the fasta file • The important information is the sequence. • From this GC content, codon usage etc. can be determined. • To couple this info to the gtf output, we analyse the ID as well.

  15. Reading the fasta file • The analysis was performed by reading the file line by line, just as in the exercises. • Then the ID was extracted from the first line and saved in a hash table. • Normally hash tables have only a key and one value, but we managed to store arrays as the values.

  16. Reading the fasta file • Header line: >xxxxx 1:783285 gene_id etc. • Sequence: AGCTGCTAGGCTGCGCATCGTGAGCTGCCTTG • %hesh: ID; seqLength, GC_content, codonUsage
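A minimal sketch of the hash-of-arrays idea from slides 15 and 16: each ID maps to an array reference holding the sequence length and the GC fraction. Codon usage is left out to keep it short. The sub name `read_fasta` and the choice to use the first word after `>` as the ID are assumptions.

```perl
use strict;
use warnings;

# Read FASTA lines and return a hash: ID => [sequence length, GC fraction].
sub read_fasta {
    my (@lines) = @_;
    my (%seq, $id);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {      # header line: ID is the first word
            $id = $1;
            $seq{$id} = '';
        }
        elsif (defined $id) {          # sequence line: append to current record
            $seq{$id} .= uc $line;
        }
    }
    my %table;
    for my $key (keys %seq) {
        my $s   = $seq{$key};
        my $len = length $s;
        my $gc  = ($s =~ tr/GC//);     # tr counts G and C without changing $s
        $table{$key} = [ $len, $len ? $gc / $len : 0 ];
    }
    return %table;
}

# The example record from the slide, split over two sequence lines.
my %fasta = read_fasta(
    ">xxxxx 1:783285 gene_id\n",
    "AGCTGCTAGGCTGCGC\n",
    "ATCGTGAGCTGCCTTG\n",
);
printf "%s\tlength=%d\tGC=%.3f\n", $_, @{ $fasta{$_} } for sort keys %fasta;
```

Storing an array reference as the hash value is what allows several measurements per ID, which is the point made on slide 15.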

  17. Combine the best of both! • The array values from the %gtfhesh table are pushed into the %fastahesh table. • For example: my $newval = $gtf{$i}[0]; my $newval2 = $gtf{$i}[1]; push @{ $fasta{$ID} }, "$newval\t", "$newval2\t";
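The same push can be applied to every ID that occurs in both tables. The sketch below assumes both hashes are keyed by the same gene ID and hold array references; it pushes the raw values and leaves tab formatting to the output step, which differs slightly from the slide's example. The sub name `merge_tables` and the demo values are hypothetical.

```perl
use strict;
use warnings;

# Append the GTF-derived values to the FASTA-derived row of every shared ID.
sub merge_tables {
    my ($fasta, $gtf) = @_;
    for my $id (keys %$fasta) {
        next unless exists $gtf->{$id};   # IDs must match between the tables
        push @{ $fasta->{$id} }, @{ $gtf->{$id} };
    }
    return $fasta;
}

# Made-up values: [length, GC fraction] and [FPKM, intron length].
my %fasta_demo = (geneA => [32, 0.625], geneB => [10, 0.5]);
my %gtf_demo   = (geneA => [40, 120]);
merge_tables(\%fasta_demo, \%gtf_demo);
print join("\t", $_, @{ $fasta_demo{$_} }), "\n" for sort keys %fasta_demo;
```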

  18. # Hash table # • In this way we obtained a table that contains: • ID; length, CUP, GC, TSp, TEp, ITL, intron size(s) • We provide options to show a variable number of genes and to sort on specific parameters. • Now Jente will unleash his package…
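The option to show a variable number of genes sorted on a chosen parameter could be sketched as follows. The column layout of the demo table (length, GC, FPKM) and the sub name `top_rows` are assumptions for illustration only; the real table has more columns.

```perl
use strict;
use warnings;

# Return the first $n rows of a hash-of-arrays table, sorted by descending
# value in column $col (0-based), as tab-separated strings.
sub top_rows {
    my ($table, $col, $n) = @_;
    my @ids = sort {
        $table->{$b}[$col] <=> $table->{$a}[$col] or $a cmp $b
    } keys %$table;
    splice @ids, $n if $n < @ids;   # drop everything after the first $n IDs
    return map { join("\t", $_, @{ $table->{$_} }) } @ids;
}

# Made-up rows: ID => [length, GC fraction, FPKM].
my %table = (
    geneA => [1200, 0.40, 350.5],
    geneB => [800,  0.38, 990],
    geneC => [1500, 0.42, 12],
);
my @rows = top_rows(\%table, 2, 2);   # two most highly expressed genes
print "$_\n" for @rows;
```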

  19. Package: Jente • My Package • Codon Usage Bias • R: correlations

  20. Codon Usage Bias • Relative Synonymous Codon Usage (RSCU) • Effective Number of Codons (NC)

  21. Codon Usage Bias • RSCU • Not in pipeline • Optional subroutine
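For reference, RSCU is usually defined as the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid, so a value of 1 means no bias. A possible shape for the optional subroutine is sketched below. It assumes the standard genetic code, an in-frame coding sequence, and the hypothetical names `%aa_of` and `rscu`; it is not the package's actual code.

```perl
use strict;
use warnings;

# Standard genetic code, built in TCAG order; '*' marks stop codons.
my %aa_of;
{
    my @bases = qw(T C A G);
    my @aa = split //,
        'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
    my $k = 0;
    for my $x (@bases) {
        for my $y (@bases) {
            for my $z (@bases) {
                $aa_of{"$x$y$z"} = $aa[$k++];
            }
        }
    }
}

# RSCU per codon: observed count divided by the mean count of its synonymous
# codons. Amino acids absent from the sequence are skipped.
sub rscu {
    my ($seq) = @_;
    $seq = uc $seq;
    my %count;
    for (my $i = 0; $i + 3 <= length($seq); $i += 3) {
        my $codon = substr $seq, $i, 3;
        $count{$codon}++ if exists $aa_of{$codon};
    }
    my (%total, %family);
    for my $codon (keys %aa_of) {
        my $aa = $aa_of{$codon};
        $family{$aa}++;                       # number of synonymous codons
        $total{$aa} += $count{$codon} // 0;   # observed codons for this amino acid
    }
    my %rscu;
    for my $codon (keys %aa_of) {
        my $aa = $aa_of{$codon};
        next unless $total{$aa};
        $rscu{$codon} = ($count{$codon} // 0) * $family{$aa} / $total{$aa};
    }
    return %rscu;
}

# Four alanine codons: GCT twice, GCC and GCA once, GCG never.
my %r = rscu('GCTGCTGCCGCA');
printf "%s\t%.2f\n", $_, $r{$_} for sort keys %r;
```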

  22. Codon Usage Bias • NC = 2 + [term lost] + [term lost] + [term lost] • Only possible for sequences that use all amino acids • Codon Usage Proportion (CUP) • CUP = [formula lost]
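The fraction terms of the slide's formula did not survive in the transcript. For orientation only, the commonly cited form of the effective number of codons is given below. Whether the slide used exactly this form is an assumption, and the definition of CUP cannot be recovered from the transcript, so it is not reconstructed here.

```latex
% Homozygosity of one amino acid with k synonymous codons,
% n observed codons and codon frequencies p_i:
F = \frac{n \sum_{i=1}^{k} p_i^{2} - 1}{n - 1}

% Effective number of codons, with \bar{F}_k the mean of F over all
% amino acids that have k synonymous codons:
N_C = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}
```

The leading 2 accounts for Met and Trp, which have a single codon each. The value ranges from 20 (one codon per amino acid) to 61 (all synonymous codons used equally), and the averages can only be formed when every amino acid class is represented, which matches the restriction on the slide.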

  23. R: Correlations • FPKM vs. GC: R = -0.1205

  24. R: Correlations • R = -0.1220 • Highly expressed genes have a more extreme codon bias • tRNAs?

  25. R: Correlations • R = -0.1282 • Highly expressed genes are smaller • More efficient?

  26. R: Correlations • R = 0.9588 • Longer genes use more codons...
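The R values on these slides were computed in R. For readers who want to check such a value inside the Perl scripts, a Pearson correlation can be sketched as below. This assumes the reported R is the Pearson coefficient, which the slides do not state; the sub name and the demo numbers are hypothetical.

```perl
use strict;
use warnings;

# Pearson correlation coefficient of two equally long lists of numbers.
# Returns undef when one of the lists has no variance.
sub pearson {
    my ($x, $y) = @_;
    my $n = @$x;
    die "need two equally long lists with at least 2 values\n"
        unless $n >= 2 && $n == @$y;
    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ / $n for @$x;
    $mean_y += $_ / $n for @$y;
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $dx = $x->[$i] - $mean_x;
        my $dy = $y->[$i] - $mean_y;
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    return undef if $sxx == 0 || $syy == 0;
    return $sxy / sqrt($sxx * $syy);
}

# Made-up numbers, only to show the call.
my $r_up   = pearson([1, 2, 3, 4], [2, 4, 6, 8]);
my $r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2]);
printf "%.4f %.4f\n", $r_up, $r_down;
```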

  27. Visualize highly expressed genes in the interaction network • What are networks? • A map of interactions or relationships • A collection of nodes and links (edges) • Why networks? • Predict protein function through identification of partners • A protein’s relative position in a network • Mechanistic understanding of the gene-function and phenotype association

  28. Visualize highly expressed genes in the interaction network

  29. Interaction network (1) Download Yeast Interactome: http://interactome.dfci.harvard.edu/S_cerevisiae/index.php?page=download http://www.yeastnet.org/data/

  30. Interaction network (2) Run Cytoscape and import the yeast interactome

  31. Interaction network (3) Visualize analysis of the interaction network

  32. Interaction network (4) Visualize the highly expressed genes in interaction network

  33. Interaction network (5)

  34. Interaction network (6) Top 100 genes interactome data
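Selecting the top-100 genes from the interactome can also be done outside Cytoscape. The sketch below assumes the interactome is available as a list of interacting gene pairs and that both data sets use the same gene identifiers. It reports which top genes occur in the network and which edges connect two top genes directly. The sub name and gene names are hypothetical.

```perl
use strict;
use warnings;

# Given interaction pairs and a list of top genes, return
# (1) the top genes that occur in the network and
# (2) the edges that directly connect two different top genes.
sub top_gene_subnetwork {
    my ($edges, $top) = @_;
    my %is_top = map { $_ => 1 } @$top;
    my (%present, @direct);
    for my $edge (@$edges) {
        my ($u, $v) = @$edge;
        $present{$u} = 1 if $is_top{$u};
        $present{$v} = 1 if $is_top{$v};
        push @direct, [$u, $v] if $is_top{$u} && $is_top{$v} && $u ne $v;
    }
    return ([sort keys %present], \@direct);
}

# Made-up edges and gene names.
my @edges = (['geneA', 'geneX'], ['geneB', 'geneY'], ['geneX', 'geneY']);
my ($found, $direct) = top_gene_subnetwork(\@edges, ['geneA', 'geneB', 'geneC']);
printf "%d top genes in network, %d direct interactions\n",
    scalar @$found, scalar @$direct;
```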

  35. Interaction network (7)

  36. Interaction network (8)

  37. Interaction network (9)

  38. Interaction network (10) Visualize the highly expressed genes in interaction network

  39. Interaction network (11) Interaction network of top 100 interactome data

  40. Interaction network (12)

  41. GO graph (1) Install BiNGO

  42. GO graph (2) Import the top 100 expressed genes and start BiNGO

  43. GO graph (3)

  44. Conclusion • In the CCSB-YI1 file, 8 of the top 100 highly expressed genes are found, and there are no direct interactions among them in the interaction network. • The GO term analysis confirms that highly expressed genes are related to protein production.

  45. Thank you forever 
