Instructions for using the SVARAP program

Plan
• Principle of the SVARAP program
• Use of SVARAP:
• GDE alignment
• Formatting the GDE alignment
• Variability analysis
• Activation of « macros »
• Pasting the GDE alignment
• Checking the GDE alignment format
• Raw data of variability analysis by nucleotide site
• Variability analysis by window of 50 nucleotides over a length of 2000 nucleotides
• Variability analysis by nucleotide site over a length of 2000 nucleotides
• ASVARAP program: study of amino acid variability
• Examples
• Download / References
• Contact

Principle of the SVARAP program
• « SVARAP » (Sequence VARiability Analysis Program) analyses, highlights and graphically represents the variability, or genetic diversity, of nucleotide sequences. It uses a Microsoft Excel® file that can analyse simultaneously up to 100 sequences of up to 4000 nucleotides.
• Variability is defined as the proportion of analysed sequences for which the nucleotide at a given position is not the one most frequently found in the studied set of sequences.
• The program generates graphs and calculates the mean, median, minimal and maximal values, and the coefficient of variation, for windows of 50 nucleotides. It also performs a site-by-site analysis.
• Classically, sequence alignment tools identify the sites and nature of nucleotide differences. Quantitative analysis of variability or diversity adds information that helps to find discriminant or conserved regions, which could be targeted by PCR, or highly polymorphic « spots ».
Thompson J. D., Gibson T. J., Plewniak F., Jeanmougin F., Higgins D. G. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 1997, 25(24): 4876-82.

How SVARAP works
• Sequences are aligned, and the alignment in GDE format is copied and pasted into a cell of the program, which formats the sequences to facilitate later analysis. Notably, each nucleotide is placed in a separate cell, so that the nucleotides corresponding to the same nucleotide site fall in the same column.
• The consensus nucleotide at each site (defined as the one most frequently found at this position in the studied set of sequences) is generated automatically.
• The program simultaneously calculates the absolute numbers of each of the 4 nucleotides (G, A, C, T, or deletions or insertions) and their frequencies (in %). Diversity or variability is defined as the proportion of sequences for which, at a given site, the nucleotide differs from the one most frequently found in the studied set of sequences. It is calculated with the formula: 100 - (maximal frequency, in %, among the four nucleotides at the given site). The program also calculates the number of different nucleotides found at a given site. The results are then summarised, for windows of 50 nucleotides, as the median, mean, minimal and maximal values of variability. Concomitantly, a site-by-site analysis is performed and reported over a length of 2000 nucleotides.
• Finally, SVARAP graphically represents the diversity/variability.
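The per-site calculation described above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the Excel implementation: the function names are hypothetical, and it assumes that frequencies are computed over all sequences in the column (gaps included in the denominator) while the maximum is taken over A, C, G and T only.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def site_variability(column):
    """Return (variability in %, number of distinct nucleotides) for one site.

    `column` holds one character per sequence at the same alignment position.
    Assumption: gap characters count in the denominator but are never the
    'most frequent nucleotide' used in the formula.
    """
    counts = Counter(base.upper() for base in column)
    total = len(column)
    if total == 0:
        raise ValueError("empty alignment column")
    max_count = max(counts[n] for n in NUCLEOTIDES)
    variability = 100.0 - 100.0 * max_count / total
    n_distinct = sum(1 for n in NUCLEOTIDES if counts[n] > 0)
    return variability, n_distinct


def alignment_variability(sequences):
    """Apply site_variability to every column of equal-length aligned sequences."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return [site_variability(col) for col in zip(*sequences)]


if __name__ == "__main__":
    aligned = ["ACGT", "ACGA", "ACCA", "AGCA"]
    for position, (var, distinct) in enumerate(alignment_variability(aligned), 1):
        print(position, var, distinct)
```

With four sequences, a site where three carry the same nucleotide has a maximal frequency of 75 % and therefore a variability of 25 %.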

Alignment of sequences in GDE format
• The starting « material » is a set of sequences (100 sequences at most).
• SVARAP uses an alignment in GDE (Genetic Data Environment) format. First, the sequences are aligned with ClustalX v.1.8 [Thompson, 1997], after requesting the creation of a GDE file in the Output Format Options. Then the alignment is copied and pasted into a cell of our Microsoft Excel® file named « AnaVarNuc_Pos… ».

To get an alignment in GDE format using ClustalX v1.8 (1/2)
• Open ClustalX (1.8) and append sequences in FASTA format.
• Select the « Alignment » menu, then Output Format Options...

To get an alignment in GDE format using ClustalX v1.8 (2/2)
• Select GDE format.
• Start the alignment.
• Locate the GDE file.

Formatting the GDE alignment using Microsoft Word®
• As with most sequence analyses, the sequences must first be formatted.
• Copy the alignment and paste it into a Microsoft Word® document, then: 1/ delete all paragraph marks; 2/ replace the « - » with another character (« . » for instance) that does not cause a line break; 3/ add a paragraph mark before the name of each sequence. Then insert a paragraph mark (<enter>) after the name of each sequence (and before the 1st nucleotide).
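For readers who prefer to check the result of this formatting outside Microsoft Word®, the same "one name, one unbroken sequence" layout can be obtained with a short script. The sketch below is not part of SVARAP; `parse_gde` is a hypothetical helper, and it assumes that each sequence name sits on its own line prefixed with `#` (as the search for `#` in the Word® procedure suggests) and that the following lines, up to the next `#`, hold the aligned sequence.

```python
def parse_gde(text):
    """Return a list of (name, sequence) pairs from GDE-formatted text.

    Assumption: a record starts with a line beginning with '#', followed by
    one or more lines of aligned sequence. Line breaks inside the sequence
    are removed; gap characters are left untouched.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("#"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = "#seq1\nacgt-\nacg\n#seq2\nacgtt\nacg\n"
    for seq_name, seq in parse_gde(example):
        print(seq_name, seq)
```

Comparing the number of records and the sequence lengths returned here with what appears in the Excel® sheets is a quick way to spot a badly formatted alignment.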

Activating « macros »
• The Microsoft Excel® file contains « macros ». They must be activated to use the file; it is possible to suppress this step:

Pasting the GDE alignment in SVARAP
• 1. Two files, analysing variability for nucleotides 1 to 2000 or 2001 to 4000, can be downloaded, as 4000 nucleotides cannot be analysed at once.
• 2. When using this program, click on column B, then press <Delete> (<Suppr> on a French keyboard) to delete prior work.
• 3. Paste the GDE alignment formatted with Microsoft Word® into a single cell (white area, cell B2).
• 4. How to analyse > 4000 nucleotides, or > 2000 nucleotides simultaneously.
• Sheet « Paste the alignment »

Verify the format of the GDE alignment (1/2)
• Column A must contain only sequence names, and columns F, I, L and O only sequences. Check that the number of sequences is correct.
• If not, check the GDE alignment.
• Sheet « Sep1000 »

Verify the format of the GDE alignment (2/2)
• Column B must contain only sequence names, and column C only sequences. Check that the number of sequences is correct.
• If not, check the GDE alignment.
• Sheets « Nuc 1-1000 » and « Nuc 1001-2000 »

Analysis of variability
• This sheet and its table contain the main part of the variability analysis: the level of variability (1.) corresponds to the proportion of sequences for which, at a given nucleotide site, the nucleotide differs from the one most frequently found in the studied set of sequences. The positions shown (2.) correspond to those defined in your set of sequences. The number of distinct variations (3.) corresponds to the number of different nucleotides observed at a given site.
• This analysis is done by windows of 200 bases for reasons related to the Microsoft Excel software (4.).
• 5. Analysis in absolute values. 6. Analysis in %.
• Sheets « Var...»

Consensus sequence over a length of 2000 nucleotides
• The consensus nucleotide is calculated for each nucleotide site over the whole length of the studied sequences.
• # (1.) denotes an undetermined site, for example when two nucleotides are equally the most frequent, or when an insertion or deletion is the most frequent state.
• Sheet « Consensus »
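The consensus rule, including the « # » marker, can be illustrated as follows. This is a sketch under stated assumptions rather than the spreadsheet's actual logic: the function names are hypothetical, gaps are assumed to be written « - » or « . », and any tie for the highest count (not only a tie between two nucleotides) is treated as undetermined.

```python
from collections import Counter


def consensus_base(column, gap_chars="-."):
    """Return the most frequent symbol of a column, or '#' if undetermined.

    Assumptions: '#' is returned when several symbols tie for the highest
    count, or when the most frequent symbol is a gap character.
    """
    counts = Counter(symbol.upper() for symbol in column)
    top = max(counts.values())
    winners = [symbol for symbol, n in counts.items() if n == top]
    if len(winners) != 1 or winners[0] in gap_chars:
        return "#"
    return winners[0]


def consensus(sequences):
    """Build the consensus string of equal-length aligned sequences."""
    return "".join(consensus_base(col) for col in zip(*sequences))


if __name__ == "__main__":
    print(consensus(["ACG-", "ACT-", "AGTA"]))
```

In this example the last column contains two gaps and one A, so the consensus at that site is reported as « # ».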

Raw data of variability by nucleotide site over a length of 2000 nucleotides
• The variability is calculated for each nucleotide position over the whole length of the studied sequences.
• Sheet « Consensus »

Analysis by window of 50 nucleotides
• Variability is calculated and analysed by windows of 50 nucleotides over the whole length of the studied sequences. The analysis is available:
• as tables: Sheet « Data fen 50 »
• as a graph: Sheet « Fig 1-2000 fen 50 »
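The window summary can be sketched in Python as below, starting from a list of per-site variability values (in %). The names are hypothetical, and two choices are assumptions: windows are consecutive and non-overlapping, and the coefficient of variation is computed from the population standard deviation (the spreadsheet may use the sample standard deviation instead).

```python
from statistics import mean, median, pstdev


def window_stats(variability, window=50):
    """Summarise per-site variability over consecutive windows.

    Returns one dict per window with 1-based start/end positions, mean,
    median, minimum, maximum and coefficient of variation (in %).
    The coefficient of variation is None when the window mean is zero.
    """
    rows = []
    for start in range(0, len(variability), window):
        chunk = variability[start:start + window]
        window_mean = mean(chunk)
        rows.append({
            "start": start + 1,
            "end": start + len(chunk),
            "mean": window_mean,
            "median": median(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "cv": pstdev(chunk) / window_mean * 100 if window_mean else None,
        })
    return rows


if __name__ == "__main__":
    values = [0.0] * 50 + [10.0, 30.0]
    for row in window_stats(values):
        print(row)
```

A fully conserved window has a mean of zero, so its coefficient of variation is left undefined rather than dividing by zero.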

Analysis by nucleotide site over a length of 2000 nucleotides (1/2)
• A graph of the variability calculated for each nucleotide site over the whole length of the studied sequences is systematically generated.
• Sheet « Fig var par position »
• Each window of 250 nucleotides can be printed separately, or copied and pasted into another program (1.). Alternatively, all 2000 nucleotides can be printed at once:

Analysis by nucleotide site over a length of 2000 nucleotides (2/2)
• Print preview of the variability calculated for each nucleotide position over the whole length of the studied sequences.
• Sheet « Fig var par position »

How to analyse more than 4000 nucleotides
The program is not strictly limited as regards the length of the studied sequences. It can analyse more than 4000 nucleotides, and more than 2000 nucleotides at the same time. To analyse more than 4000 nucleotides:
• Copy the file « AnaVarNuc_Pos 1-2000 ».
• Go to sheet « Paste alignment ».
• Unhide all columns (<Format><Colonnes><Afficher> in the French edition).
• In cells F2 to F201, replace 1 with the first site to analyse in your alignment (e.g. 8000, or 10224); then, in cells G2 to G201, replace 1001 with the value written in column F plus 1000 (e.g. 9000, or 11224).
• The analysis is now set up for nucleotides 8000 to 10000, or 10224 to 12224.

How to analyse more than 2000 nucleotides at the same time
The program is not strictly limited as regards the length of the studied sequences. It can analyse more than 4000 nucleotides, and more than 2000 nucleotides at the same time. To analyse more than 2000 nucleotides at the same time:
• Use the variability values for 2000 nucleotide sites as stored in the sheet called « Consensus ». By copying these values, 2000 nucleotides at a time, from several files into a new Microsoft Excel® file, you can create graphs for the appropriate length.

Applications for SVARAP: an example of use
• SVARAP rapidly produces graphical representations that are easy to interpret.
• As a first step, it analyses the genetic diversity of a set of sequences by windows of 50 nucleotides.
• More precise information is also available through site-by-site analysis.

Contact
• For information or questions, you can contact me at: Philippe.Colson@ap-hm.fr

Download
• Download the instructions for use of SVARAP
• Download SVARAP to analyse nucleotide positions 1 to 2000 (Microsoft Excel® v97)
• Download SVARAP to analyse nucleotide positions 2001 to 4000 (Microsoft Excel® v97)
• Download ASVARAP to analyse amino acid positions 1 to 1000 (Microsoft Excel® v97)
• Link to Clustal X v1.8
References
• URL: http://ifr48.free.fr/recherche/jeu_cadre/jeu_rickettsie.html

1/ Delete the paragraph marks
In Microsoft Word® v97 - French edition:
• <Edition><Remplacer><Plus><Spécial><Marque de paragraphe><Remplacer tout> (Edit, Replace, More, Special, Paragraph mark, Replace all)

2/ Replace dashes
In Microsoft Word® v97 - French edition:
• <Edition><Remplacer>
• In « Rechercher » (Find what): -
• In « Remplacer par » (Replace with): ― (character to copy and paste)
• <Remplacer tout> (Replace all)

3/ Add paragraph marks before and after the names of sequences
In Microsoft Word® v97 - French edition:
• <Edition>
• In « Rechercher » (Find what): #
• In « Remplacer par » (Replace with): a paragraph mark (« Marque de paragraphe ») followed by #

ASVARAP application
• Variability can also be studied in amino acid sequences (amino acids 1 to 1000). The principle and use are the same as for SVARAP.
