Bioinformatics: Theory and Practice • Jijun Tang (唐继军) • jtang@cse.sc.edu • 13928761660 • www.cse.sc.edu/~jtang/BJFU
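Homework • GTTGCAGCAATGGTAGACTCAACGGTAGCAATAACTGCAGGACCTAGAGGAAAAACAGTAGGGATTAATAAGCCCTATGGAGCACCAGAAATTACAAAAGATGGTTATAAGGTGATGAAGGGTATCAAGCCTGAA • Why does a BLAST search with the default settings return no results for the sequence above? Which options should you choose instead? • What are the most recent PubMed articles on the related species?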
DNA sequencing capability has grown exponentially • [Figure: DNA sequences in GenBank] • Doubling time = 18 months
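To get a feel for what an 18-month doubling time implies, the growth factor after t months is 2^(t/18). The sketch below only does that arithmetic; the function name `growth_factor` is made up for illustration, and it assumes the doubling time stays constant, which real database growth need not do.

```python
def growth_factor(months: float, doubling_time_months: float = 18.0) -> float:
    """Factor by which a quantity grows after `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_time_months)


# How much larger would the database be after 3, 6, and 15 years?
for years in (3, 6, 15):
    factor = growth_factor(years * 12)
    print(f"after {years} years: about {factor:.0f} times larger")
```

Fifteen years is ten doublings, so under this assumption the database grows roughly a thousandfold.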
Bioinformatics Paradigm • Find the data • Download the data • Reformat the data • Collect the samples • Run molecular analysis • Filter the data • Run analysis software • Collect and sort results • Publish / Data sharing
Multi-Sequence FASTA file
>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
LDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCC
ECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD
GPVNNNY
…
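A multi-sequence FASTA file like the one above can be read with a few lines of code: each record starts at a ">" line, and the sequence is everything up to the next ">". The following is a minimal sketch, not a full parser. The function names `read_fasta` and `parse_header` are made up for illustration, the header in the example is shortened from the slide, and the `key=value;` layout is assumed to follow the FlyBase style shown here. For real work a library such as Biopython is the usual choice.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # last record in the file


def parse_header(header: str) -> Tuple[str, Dict[str, str]]:
    """Split a FlyBase-style header into its ID and key=value fields."""
    identifier, _, rest = header.partition(" ")
    fields = {}
    for item in rest.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:
            fields[key] = value
    return identifier, fields


# Second record from the slide, with the header shortened for readability.
example = """>FBpp0082232 type=protein; name=mRpS21-PA; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
""".splitlines()

records = list(read_fasta(example))
for header, sequence in records:
    identifier, fields = parse_header(header)
    # The declared length should match the number of residues actually read.
    print(identifier, fields["name"], len(sequence), fields["length"])
```

Comparing `len(sequence)` with the `length=` field is a cheap sanity check that no lines were lost when the file was copied or reformatted.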
Other Important Databases • Genomes • Proteins • Biochemical & Regulatory Pathways • Gene Expression • Genetic Variation (mutants, SNPs) • Protein-Protein Interactions • Gene Ontology (Biological Function)
UCSC Genome Browser • Search by gene name • Or search by sequence
Lots of additional data can be added as optional "tracks" - anything that can be mapped to locations on the genome
KEGG: Kyoto Encyclopedia of Genes and Genomes • Enzymatic and regulatory pathways • Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists) • Parallel maps of regulatory pathways
Gene Ontology • Genetics is a messy science • Scientists have been working in isolation on individual species for many years, naming genes, mutants, and odd phenotypes • “sonic hedgehog” • Now that we have complete genome sequences, how do we reconcile the names across all species? • Gene Ontology uses a single three-part system • Molecular function (specific tasks) • Biological process (broad biological goals, e.g. cell division) • Cellular component (location)
Filename Extensions • Most Linux filenames start with a lower case letter and end with a dot followed by one, two, or three letters: myfile.txt • However, this is just a common convention and is not required. • It is also possible to have additional dots in the filename. • The part of the name following the dot is called the “extension.” • The extension is often used to designate the type of file.
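As a small illustration of how the extension is just the part of the name after the last dot, the sketch below uses Python's standard `pathlib`. The helper name `extension` and the filename `reads.fastq.gz` are invented for the example; `README` shows that a file need not have an extension at all.

```python
from pathlib import PurePosixPath


def extension(filename: str) -> str:
    """Return the final extension of a filename (including the dot), or ''."""
    return PurePosixPath(filename).suffix


for name in ("myfile.txt", "af151074.fasta", "reads.fastq.gz", "README"):
    path = PurePosixPath(name)
    # .suffix is the last extension only; .suffixes lists every dotted part.
    print(f"{name}: extension={extension(name)!r}, all={path.suffixes}")
```

A name with additional dots, such as `reads.fastq.gz`, has one final extension (`.gz`) but several suffixes, which is why compressed files often keep the original extension in front.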
Some Common Extensions • By convention: • files that end in .txt are text files • files that end in .c are source code in the “C” language • files that end in .html are HTML files for the Web • Compressed files have the .zip or .gz extension • Linux does not require these extensions (unlike Windows), but using them is a sensible convention that you should follow
Working with Directories • Directories are a means of organizing your files on a Linux computer. • They are equivalent to folders on Windows and Macintosh computers • Directories contain files, executable programs, and sub-directories • Understanding how to use directories is crucial to manipulating your files on a Linux system.
Your Home Directory • When you log in to the server, you always start in your Home directory. • Create sub-directories to store specific projects or groups of information, just as you would place folders in a filing cabinet. • Do not accumulate thousands of files with cryptic names in your Home directory
File & Directory Commands • This is a minimal list of Linux commands that you must know for file management: • All of these commands can be modified with many options. Learn to use Linux ‘man’ pages for more information.
Navigation
• pwd (print working directory) shows the name and location of the directory where you are currently working:
> pwd
/home/jtang
• This is a “pathname”; the slashes separate the sub-directories
• The initial slash is the “root” of the whole filesystem
• ls (list) gives you a list of the files in the current directory:
> ls
assembin4.fasta Misc test2.txt bin temp testfile
• Use the ls -l (long) option to get more information about each file:
> ls -l
total 1768
drwxr-x--- 2 browns02 users 8192 Aug 28 18:26 Opioid
-rw-r----- 1 browns02 users 6205 May 30 2000 af124329.gb_in2
-rw-r----- 1 browns02 users 131944 May 31 2000 af151074.fasta
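To make the structure of a pathname concrete, the sketch below splits one into its components with Python's standard `pathlib`. The helper name `components` is invented for illustration; the path is the one from the pwd example.

```python
from pathlib import PurePosixPath


def components(pathname: str) -> tuple:
    """Split a Linux pathname into the root and each directory level."""
    return PurePosixPath(pathname).parts


path = "/home/jtang"
print(components(path))                   # ('/', 'home', 'jtang')
print(PurePosixPath(path).is_absolute())  # True: it starts at the root "/"
print(PurePosixPath(path).parent)         # /home, one level up
```

Each slash marks one step down from the root, so `/home/jtang` is the directory `jtang` inside `home` inside the root.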
Sub-directories
• cd (change directory) moves you to another directory
> cd Misc
> pwd
/u/browns02/Misc
• mkdir (make directory) creates a new sub-directory inside of the current directory
> ls
assembler phrap space
> mkdir subdir
> ls
assembler phrap space subdir
• rmdir (remove directory) deletes a sub-directory, but the sub-directory must be empty
> rmdir subdir
> ls
assembler phrap space
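The rule that rmdir only removes empty directories can be checked from a script. The sketch below is an illustration using Python's standard library, not part of the original slides: it works in a throwaway temporary directory, and the file name `notes.txt` is invented.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as scratch:
    subdir = Path(scratch) / "subdir"
    subdir.mkdir()                                 # like: mkdir subdir
    (subdir / "notes.txt").write_text("not empty\n")

    try:
        subdir.rmdir()                             # like: rmdir subdir
        removed_while_full = True
    except OSError:
        # The directory still contains a file, so removal is refused.
        removed_while_full = False

    (subdir / "notes.txt").unlink()                # like: rm subdir/notes.txt
    subdir.rmdir()                                 # now succeeds
    exists_after = subdir.exists()

print("removed while it still held a file:", removed_while_full)
print("still exists after emptying and removing:", exists_after)
```

The refusal is a safety feature: it prevents deleting a directory full of files by accident.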
Shortcuts • There are some important shortcuts in Linux for specifying directories • . (dot) means "the current directory" • .. means "the parent directory" - the directory one level above the current directory, so cd .. will move you up one level • ~ (tilde) means your Home directory, so cd ~ will move you back to your Home. • Just typing a plain cd will also bring you back to your home directory
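The effect of these shortcuts can be seen by asking Python to simplify paths that contain them. This is only a sketch: the helper name `simplify` is invented, the example paths are hypothetical, and `posixpath.normpath` works purely on the text of the path without looking at the disk.

```python
import os.path
import posixpath


def simplify(pathname: str) -> str:
    """Collapse '.' and '..' in a Linux pathname without touching the disk."""
    return posixpath.normpath(pathname)


print(simplify("/home/jtang/Misc/.."))   # ".." steps up one level
print(simplify("/home/jtang/./Misc"))    # "." stays in the same directory
print(simplify("/home/jtang/Misc/../bin"))

# "~" is expanded to the home directory of whoever runs the script,
# so the result differs from one account to another.
print(os.path.expanduser("~"))
```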
Create new files • pico • nano • vi/vim • emacs
Programming • perl • python • c/c++ • R • Java
Linux File Protections • File protection (also known as permissions) enables the user to set up a file so that only specific people can read (r), write (w), and execute (x) it. • Write permission effectively includes the power to destroy a file, since it allows someone to overwrite the file with different contents. Removing the file from its directory additionally depends on write permission for that directory.
File Owners and Groups • Linux file permissions are defined according to ownership. The person who creates a file is its owner. • You are the owner of files in your Home directory and all its sub-directories • In addition, there is a concept known as a Group. • Members of a group have privileges to see each other's files. • We create groups as the members of a single lab - the students, technicians, postdocs, visitors, etc. who work for a given PI.
View File Permissions
• Use the ls -l command to see the permissions for all files in a directory:
$ ls -l
total 2
-rw-r--r-- 1 jtang None 56 Feb 29 11:21 data.txt
-rwxr-xr-x 1 jtang None 33 Feb 29 11:21 test.pl
• The username of the owner is shown in the third column. (The owner of the files listed above is jtang)
• The group is shown in the fourth column. (Here the group is “None”)
• The access rights for these files are shown in the first column. This column consists of 10 characters known as the attributes of the file: r, w, x, and -
r indicates read permission
w indicates write permission (which includes overwriting the contents)
x indicates execute (run) permission
- indicates no permission for that operation
$ ls -l
total 2
-rw-r--r-- 1 jtang None 56 Feb 29 11:21 data.txt
-rwxr-xr-x 1 jtang None 33 Feb 29 11:21 test.pl
• The first character in the attribute string indicates if a file is a directory (d) or a regular file (-).
• The next 3 characters give the file permissions for the owner of the file.
• The middle 3 characters give the permissions for other members of the owner's group.
• The last 3 characters give the permissions for everyone else (others).
• The default protections assigned to new files on our system are: -rw-r----- (owner = read and write, group = read, others = nothing)
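The 10-character attribute string can be taken apart mechanically, following the rules above: one type character, then three characters each for owner, group, and others. The sketch below is an illustration only; the function name `describe_mode` is invented, and it handles just the two file types mentioned on the slide (anything else is reported as "other").

```python
def describe_mode(attributes: str) -> dict:
    """Break an ls -l attribute string such as '-rw-r-----' into its parts."""
    if len(attributes) != 10:
        raise ValueError("expected 10 characters, e.g. '-rw-r--r--'")

    kinds = {"d": "directory", "-": "regular file"}
    names = {"r": "read", "w": "write", "x": "execute"}

    def rights(triplet: str) -> list:
        # A '-' in any position means that permission is not granted.
        return [names[c] for c in triplet if c in names]

    return {
        "type": kinds.get(attributes[0], "other"),
        "owner": rights(attributes[1:4]),
        "group": rights(attributes[4:7]),
        "others": rights(attributes[7:10]),
    }


for attrs in ("-rw-r--r--", "-rwxr-xr-x", "-rw-r-----"):
    print(attrs, describe_mode(attrs))
```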
Change Protections
• Only the owner of a file can change its protections
• To change the protections on a file use the chmod (change mode) command. [Beware, this is a confusing command.]
• Taken all together, it looks like this:
> chmod 644 data.txt
• This gives the owner read and write permission, and gives the group and everyone else (the world) read permission only
• Other modes: 600, 755, 700
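The numbers given to chmod are less confusing once each digit is read as a sum: 4 for read, 2 for write, 1 for execute, with one digit each for owner, group, and others. The sketch below decodes the modes listed above; the function name `octal_to_rwx` is invented for illustration.

```python
def octal_to_rwx(mode: str) -> str:
    """Translate a three-digit chmod mode such as '644' into rwx notation."""
    parts = []
    for digit in mode:            # one digit each: owner, group, others
        value = int(digit, 8)     # rejects digits outside 0-7
        parts.append(
            ("r" if value & 4 else "-")
            + ("w" if value & 2 else "-")
            + ("x" if value & 1 else "-")
        )
    return "".join(parts)


for mode in ("644", "600", "755", "700"):
    print(mode, "->", octal_to_rwx(mode))
```

So 6 = 4 + 2 means read and write, 5 = 4 + 1 means read and execute, and 7 = 4 + 2 + 1 means all three.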