DOTPLOT
Version 1.0 for Windows
By Ramin Charles Nakisa

Hardware
~~~~~~~~
 * Anything that runs Windows, i.e. 80n86 where n > 1.
 * Maths coprocessor not required, but preferable!

Files
~~~~~
 a2               2588  Rat nAChR alpha-2 subunit
 a3               2382  Rat nAChR alpha-3 subunit
 charge   mat      780  Charge comparison matrix
 dna35    mat      780  DNA +3/-5 comparison matrix
 dna53    mat      780  DNA +5/-3 comparison matrix
 dotplot  exe    53248  The business!
 dotplot  txt    12258  The file you're reading now
 egfr             5897  Human EGF receptor
 pam250   mat      780  PAM250 protein comparison matrix
 readseq  exe    70625  Quickwin version of Uncle Don's Readseq
        12 file(s)     150118 bytes

Oooooh, A New Program, I Want To Try It NOW!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For fast satisfaction, try comparing two neuronal nicotinic
acetylcholine receptor subunit sequences from Swiss-Prot saved under
the filenames a2 and a3.

The interface is like any other Windows program. Use the File popup
menu to load the horizontal sequence. You get a dialog box listing all
the files in the current directory. Select a2, then the program gives a
SeqInfo dialog box, which tells you a little bit about the sequence you
just loaded, i.e. its name, length, sequence format and whether it's
DNA, RNA or protein. Click OK. Then repeat for the vertical sequence,
choosing a3.

As in the MSDOS version, there are two ways of calculating the dotplot,
described in detail below. For a quick fix, use the Identities option
on the Draw popup menu. The default parameters are a window size of 10
and a threshold of 6. This gives a diagonal line across your screen.
Try tweaking the dotplot by clicking on Parameters and setting the
threshold to 1. Pretty, isn't it? You can even see the three
transmembrane regions in the middle of the protein and the final
transmembrane region at the C-terminus.

Try to find the internal repeats in a human epidermal growth factor
receptor.
To do this, load the sequence called egfr as both the horizontal and
the vertical sequence. Then set the parameters as window=20 and
threshold=1. Load the PAM250.MAT score matrix using the File popup menu
"Open Matrix" option. The feature table gives:

 FT   REPEAT        75    300       APPROXIMATE.
 FT   REPEAT       390    600       APPROXIMATE.

Window Size and Stringency
~~~~~~~~~~~~~~~~~~~~~~~~~~
The program will prompt you for a window size and a stringency. For the
simplest case, where the program puts a dot on the screen for every
identity, the window size is one and the stringency is one. This will
be very NOISY, as can be seen in this dotplot of two well-known
sequences.

     C A P T A I N K I R K
   C *                     C
   A   *     *             A
   P     *                 P
   T       *               T
   A   *     *             A
   I           *     *     I
   N             *         N
   N             *         N
   E                       E
   M                       M
   O                       O
     C A P T A I N K I R K

The real homology we are looking for is CAPTAIN, but there are hits off
this main `diagonal'. We get around this problem by using a window,
where for each diagonal the number of hits within the window must reach
a certain threshold (or stringency). Here is the dotplot above with a
window size of 2 and a stringency of 2.

     C A P T A I N K I R K
   C *                     C
   A   *                   A
   P     *                 P
   T       *               T
   A         *             A
   I           *           I
   N             *         N
   N                       N
   E                       E
   M                       M
   O                       O
     C A P T A I N K I R K

The noise is gone! The same applies to dotplots using a score matrix,
that is, the noise decreases for increasing window size and stringency,
but eventually the signal decreases too. Experiment.

Raison D'être
~~~~~~~~~~~~~
This program fills a niche in the PC molecular biology
freeware/shareware world. I decided to write it because dotplots are
easily implemented on a PC, not being too CPU intensive (unless the
sequences to be compared are large) and being fun to play around with
if made interactive.

The Windows version was fairly easy to write because Windows lends
itself to graphics-oriented programs. The operating system does a lot
of the work for you, such as the mouse movement and the menus. There's
just a lot more setting up to do than there is for MSDOS.
I didn't believe it when I saw the Windows "Hello, World" program in my
Microsoft C manual. Hundreds of lines of code compared with two or
three for MSDOS!

One major drawback of this version is that it does not allow you to
point to the dotplot and see what bits of sequence you are actually
looking at. I'll fix this as soon as my girlfriend gives me another
weekend to myself!

The program owes a great deal to Don Gilbert's amazingly good sequence
reading/writing module UREADSEQ.C, available from his equally amazing
molecular biology server at Indiana. This module allows DOTPLOT to read
the following formats:

     1. IG/Stanford            8. Pearson/Fasta
     2. GenBank/GB             9. Zuker
     3. NBRF/PIR              10. Olsen
     4. EMBL                  11. Phylip3.4/Phylip
     5. GCG                   12. Phylip3.3/Interleaved
     6. DNAStrider            13. Plain/Raw
     7. Fitch

The Section for Computer Bullies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I made the calculation of scores even faster by calculating scores
diagonal-by-diagonal rather than point-by-point. I created a local
variable in which the running score for each diagonal was stored and
then, for each point on the diagonal, added the new score from just
beyond the window and subtracted the score from just before the window.
In C, this was as follows:

    sum -= aa_sim( sequence1[i+k-window], sequence2[k], score_table );
    sum += aa_sim( sequence1[i+k+window+1], sequence2[k+win2+1], score_table );

This was a much more efficient way of doing the window averaging,
because each step along a diagonal costs two score lookups however
large the window is.

You might think that modifying UREADSEQ for Windows was easy. WRONG!
The major difference between programming for Windows and programming in
plain ANSI C is the memory management. Instead of using malloc/calloc,
Windows needs handles to memory. Unfortunately the readseq function
allocates memory itself and returns a pointer to char. I had to change
the readseq function to return a handle to the memory block containing
the sequence. To get a pointer to the sequence you just lock the block
with LocalLock, and unlock it after you've finished with it.
That way Windows can shift the block about as it sees fit. I increased
kStartLength to 10000 so that readseq would never use realloc, which
would cause problems. It's a bit of a bodge, but it seems to work fine.

You can play around with the PAM matrix if you like. The format is
identical to the MSDOS version of dotplot. By default it looks like

X=0
C 12
S  0  2
T -2  1  3
P -3  1  0  6
A -2  1  1  1  2
G -3  1  0 -1  1  5
N -4  1  0 -1  0  0  2
D -5  0  0 -1  0  1  2  4
E -5  0  0 -1  0  0  1  3  4
Q -5 -1 -1  0  0 -1  1  2  2  4
H -3 -1 -1  0 -1 -2  2  1  1  3  6
R -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
K -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
I -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
V -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
W  0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
Y -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  W  Y

This is just a score matrix. For example, if the pair of amino acids to
be compared are leucine and arginine, the score matrix above gives a
score of -3. An identity generally gives a large positive score
(tyrosine-tyrosine gives a score of 17) with the largest scores for the
rare amino acids. The matrix is not calculated according to
physico-chemical properties of amino acids; it is statistically derived
from comparison of many related proteins.

Some people claim that the particular score matrix used makes a great
deal of difference in database searches and alignments, but don't take
their word for it; you should play around with it yourself. If the
score matrix makes that much difference then maybe your sequence
similarity is just a figment of your crazed imagination...

Anyway, you can edit the PAM250.MAT file. Just bear these things in
mind:

 * Don't interchange columns and rows. The letters are there for your
   convenience, so that editing the matrix is easy.
   The program always reads the matrix in the same way regardless of
   the letters.
 * Use integers, preferably in the same range as the above matrix,
   i.e. -8 to +17.
 * Don't forget to back up the original PAM250.MAT, or you could get
   into a pickle!

I've included a CHARGE.MAT, DNA53.MAT and DNA35.MAT. The CHARGE.MAT
file scores identities as +5 if both amino acids are charged (D, E, R,
K). Non-charged residues are scored as 0. Opposite charges are scored
as -3 and identical charges with non-identical residues are scored as
+3. The DNA53.MAT scores +5 for an identity and -3 for different
nucleotides. You can probably guess what the DNA35.MAT does!

Grovelling Credits Section
~~~~~~~~~~~~~~~~~~~~~~~~~~
I think Don Gilbert is a marvellous man. UREADSEQ is FAB. In case you
ever read this, Don, next time you're in London drop in to Imperial and
I'll buy you a pint of Old Rosie at the Phoenix and Firkin.

 * Copyright 1990 by d.g.gilbert
 * biology dept., indiana university, bloomington, in 47405
 * e-mail: gilbertd@bio.indiana.edu
 *
 * This program may be freely copied and used by anyone.
 * Developers are encouraged to incorporate parts in their
 * programs, rather than devise their own private sequence
 * format.
 *
 * This should compile and run with any ANSI C compiler.
 * Please advise me of any bugs, additions or corrections.

Thanks also for feedback on Dotplot 2.0 and 3.0 from

 * Finn Drablos in Norway, who suggested a change in the cursor.
 * Finn Drablos AGAIN for pointing out a bug with self-comparisons
   which revealed a particularly sinister bug in the score matrix
   reader function.
 * Francis Durst in France, who pointed out the emulator bug and
   understood that programming is heavily influenced by a girlfriend's
   trips.

Desperate Plea for Recognition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you enjoyed using dotplot, please DON'T SEND ME ANY MONEY! I don't
want money. If I did I wouldn't have started a PhD. I want PRAISE!
RECOGNITION! FAME! PRAISE (again)!
So, cite dotplot as follows:

    Nakisa, R.C. (1993). DotPlot, a program for graphical comparison
    of nucleic acid and protein sequences for IBM PC. Published
    electronically on the Internet and available by anonymous ftp
    from ftp.bio.indiana.edu.

Substitute the name of the ftp server or mail server that you used for
ftp.bio.indiana.edu, unless you got the program from Uncle Don!

Please send your flattering minutiae, ego boosters, gripes and
suggested improvements by EMAIL to

    ramin@ic.ac.uk ................ for Internet people

Alternatively, SNAILMAIL:

    Ramin Nakisa,
    Biophysics Section,
    The Blackett Laboratory,
    Imperial College of Science, Technology and Medicine,
    Prince Consort Road,
    London SW7 2BZ
    Great Britain.

    Tel: 071 589-5111 x 6729
    FAX: 071 589-0191
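
A postscript for the computer bullies: the window and stringency idea,
combined with the running sum along each diagonal, can be sketched in C
roughly as follows. This is NOT the code from DOTPLOT itself. The names
pair_score and dotplot, the identity-only scoring, and the choice to put
a dot on every cell covered by a window that reaches the stringency are
all assumptions made for illustration (that last choice was picked
because it reproduces the CAPTAIN KIRK picture for window 2,
stringency 2).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scoring function: 1 for an identity, 0 otherwise.
 * A score-matrix lookup could be substituted here. */
static int pair_score(char a, char b)
{
    return a == b ? 1 : 0;
}

/* Hypothetical dotplot routine (not taken from the DOTPLOT source).
 * plot must hold strlen(horiz) * strlen(vert) bytes; the cell for a
 * given row (vertical sequence) and column (horizontal sequence) is
 * plot[row * strlen(horiz) + col]. A cell is set to 1 if it lies in
 * at least one window of `window` consecutive diagonal cells whose
 * summed score is >= threshold. */
void dotplot(const char *horiz, const char *vert,
             int window, int threshold, unsigned char *plot)
{
    int hlen = (int)strlen(horiz);
    int vlen = (int)strlen(vert);
    int d, k, w;

    memset(plot, 0, (size_t)hlen * (size_t)vlen);
    if (window < 1)
        return;

    /* Walk every diagonal; d is (column - row). */
    for (d = -(vlen - 1); d < hlen; d++) {
        int row0 = d < 0 ? -d : 0;          /* first cell on diagonal */
        int col0 = d > 0 ? d : 0;
        int len = vlen - row0 < hlen - col0 ? vlen - row0 : hlen - col0;
        int sum = 0;

        if (len < window)
            continue;                       /* diagonal too short */

        /* Score of the first window on this diagonal. */
        for (k = 0; k < window; k++)
            sum += pair_score(horiz[col0 + k], vert[row0 + k]);

        for (k = 0; ; k++) {
            if (sum >= threshold)
                for (w = 0; w < window; w++)
                    plot[(size_t)(row0 + k + w) * (size_t)hlen
                         + (size_t)(col0 + k + w)] = 1;
            if (k + window >= len)
                break;
            /* Slide the window one cell: drop the score leaving the
             * window, add the score entering it. */
            sum -= pair_score(horiz[col0 + k], vert[row0 + k]);
            sum += pair_score(horiz[col0 + k + window],
                              vert[row0 + k + window]);
        }
    }
}
```

Calling dotplot("CAPTAINKIRK", "CAPTAINNEMO", 2, 2, plot) with an
11 x 11 buffer leaves seven dots, all on the main diagonal. With a
window of 1 and a stringency of 1 the off-diagonal noise comes back.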

