Msearch - Multiple database search and alignment code.

(C) 1993-95 Pittsburgh Supercomputing Center

Pre-Release 3.0 Users Manual

Pittsburgh Supercomputing Center
4400 Fifth Avenue
Pittsburgh, PA 15213
Phone: 412-268-4960
FAX: 412-268-8200
Email: biomed@psc.edu

This code was written by Alexander J. Ropelewski (ropelews@psc.edu), Hugh B. Nicholas Jr. (nicholas@psc.edu), and Sally H. Fish (sfish@psc.edu).

Please send suggestions for improving this code and error reports to biomed@psc.edu. Sorry, but we cannot provide consulting support for this code at other sites. AJR will try to provide help with the installation of this code.



WHAT IS MSEARCH?

Msearch is a powerful sequence database search and alignment tool, designed to run on any parallel computer or workstation cluster that supports release 3.3 of the PVM (*) message passing system available from Oak Ridge National Laboratory. Msearch has been described in an article published in the May-June 1995 issue of the PSCNEWS.

(*) PVM can be obtained via anonymous FTP to netlib2.cs.utk.edu. PVM can also be obtained through the World Wide Web (WWW) and electronic mail. WWW information is available from http://www.epm.ornl.gov/pvm/. Electronic mail information can be obtained by sending the message "help" to netlib@ornl.gov.

Although the Msearch code is designed to be portable, we have only thoroughly tested this release on the Cray T3d. Moderate testing has been performed on Cray Multiprocessor systems (J90 and C90) and SGI INDY workstations. Future releases of this code will be tested on additional platforms.


ALGORITHMS

Msearch allows users to select a number of popular sequence comparison algorithms:

Maxsegs:

The N-Best local sequence alignment algorithm described by Waterman and Eggert. This algorithm is an extension of the Smith-Waterman Algorithm.
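
For readers unfamiliar with the underlying recurrence, the following is a minimal Python sketch of a Smith-Waterman style local alignment score. It uses the gap convention described under the GAP and NEWGAP commands (a gap of length k is charged k times GAP plus NEWGAP). It is not the Msearch implementation: it returns only the single best score rather than the N best alignments, and the function name and the simple match/mismatch scores (standing in for a PAM matrix) are assumptions made for illustration.

```python
def local_alignment_score(a, b, match=2.0, mismatch=-1.0, gap=-1.0, newgap=-2.0):
    """Best local alignment score of strings a and b (similarity is maximized)."""
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    # best_here[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    best_here = [[0.0] * cols for _ in range(rows)]
    # gap_in_a / gap_in_b: best score when the alignment ends inside a gap
    gap_in_a = [[neg_inf] * cols for _ in range(rows)]
    gap_in_b = [[neg_inf] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            # Extend an existing gap (GAP) or open a new one (NEWGAP + GAP).
            gap_in_a[i][j] = max(gap_in_a[i][j - 1] + gap,
                                 best_here[i][j - 1] + newgap + gap)
            gap_in_b[i][j] = max(gap_in_b[i - 1][j] + gap,
                                 best_here[i - 1][j] + newgap + gap)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor of 0.
            best_here[i][j] = max(0.0,
                                  best_here[i - 1][j - 1] + pair,
                                  gap_in_a[i][j],
                                  gap_in_b[i][j])
            best = max(best, best_here[i][j])
    return best


print(local_alignment_score("ACGT", "ACGT"))            # 8.0
print(local_alignment_score("ACGTACGT", "ACGTTACGT"))   # 13.0
```

In the second call, eight matching pairs score 16 and the single one-residue gap is charged NEWGAP + GAP = -3, giving 13.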

References:

Global:

The N-Best global sequence alignment algorithm described by Ropelewski, et al. (Unpublished). This algorithm is an extension of the Needleman-Wunsch Algorithm. (Or more precisely, a similarity version of an extended Sellers algorithm.)

References:

Localprofile:

The N-Best profile alignment algorithm described by Ropelewski, et al. (Unpublished). This algorithm is a combination of the algorithms described by Gribskov et al., and Waterman-Eggert.

References:

Globalprofile:

The N-Best profile alignment algorithm described by Ropelewski, et al. (Unpublished).

References:


DATABASES:

Msearch is designed to search sequence files in the NBRF-PIR, GenBank, EMBL, Swiss-Prot and FASTA(*) formats. Databases are accessed through a database configuration file, which allows the actual location of the sequence files to be hidden from the users. The database configuration file also provides a facility for grouping several sequence files together to form a database entry. However, for the program to work properly, ALL SEQUENCE FILES IN A DATABASE ENTRY MUST BE OF THE SAME DATA TYPE (Protein or Nucleic Acid).

See the "FORMAT OF THE DATABASE FILE" section for more information.

(*) Support for the FASTA format is new in this release. We do not normally keep the databases in FASTA format, so this routine has not been thoroughly tested. We would appreciate receiving problem reports and bug fixes for these routines. Please send the reports/fixes to biomed@psc.edu.

PERFORMANCE CONSIDERATIONS:

There are only two commands in Msearch that take a substantial amount of time to complete. These two commands are "SEARCH" and "ALIGN".

The "SEARCH" command is relatively well optimized, and usually scales linearly until the run time is dominated by I/O. Using additional processors will usually speed up this command. On the other hand, using an extremely low cutoff value will increase the time that this command takes to complete. This is because once the search is completed, the listing file is sorted on a single processor. Thus, the lower the cutoff value, the more items there are to sort.
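
The effect of the cutoff on the final sorting step can be pictured with a small Python sketch. It only mimics the behavior described in this manual (keep the pairs that score above the cutoff, then sort from highest to lowest score on one processor); the function name is hypothetical, and the entry "XYZ1" with its score is invented for illustration.

```python
def rank_hits(hits, cutoff):
    """Keep hits scoring above the cutoff, sorted from highest to lowest score."""
    kept = [(score, name) for score, name in hits if score > cutoff]
    kept.sort(key=lambda hit: hit[0], reverse=True)
    return kept


# (score, library sequence) pairs; "XYZ1" is a made-up low-scoring hit.
hits = [(548.0, "PSRSAT"), (734.0, "PSRSAW"), (120.0, "XYZ1"),
        (567.0, "PSTV"), (712.0, "PSRSAE")]

print(len(rank_hits(hits, 150.0)))  # 4 items to sort
print(len(rank_hits(hits, 100.0)))  # 5 items to sort: a lower cutoff keeps more
```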

The "ALIGN" command usually takes much longer to complete than the database search. The speed of this command is usually not improved dramatically by using additional processors. As with the "SEARCH" command, selecting a higher cutoff value will make this command run faster.


COMMANDS:

Scoring commands:

Scoring commands are used to select the weights used to compare the sequences. Msearch maximizes sequence SIMILARITY rather than minimizing sequence dissimilarity. EACH OF THESE PARAMETERS IS REQUIRED.

GAP
Format: gap= {REAL}
where {REAL} is the penalty charged for extending a gap (usually a negative value)

The gap command sets the gap length penalty. For example, a gap of length 4 will be charged (4 times this value) + NEWGAP.

NEWGAP
Format: newgap={REAL}
where {REAL} is the penalty charged for creating a gap (usually a negative value)

The newgap command sets the gap opening penalty. For example, a gap of length 4 will be charged (4 times GAP) + this value.
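
The arithmetic described for GAP and NEWGAP can be written out as a short Python sketch. The function name is hypothetical, and the NEWGAP value of -2.0 is chosen only for illustration.

```python
def gap_cost(length, gap, newgap):
    """Total charge for one gap: (length times GAP) + NEWGAP."""
    if length < 1:
        raise ValueError("a gap must contain at least one position")
    return length * gap + newgap


# A gap of length 4 with gap=-8.0 and an illustrative newgap=-2.0:
print(gap_cost(4, gap=-8.0, newgap=-2.0))  # -34.0
```

Because Msearch maximizes similarity, both values are normally negative, so longer gaps lower the alignment score.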

SCORING
Format: scoring={MATRIX}
where {MATRIX} is one of the following:
  • PAM40 - The Dayhoff PAM 40 matrix
  • PAM80 - The Dayhoff PAM 80 matrix
  • PAM120 - The Dayhoff PAM 120 matrix
  • PAM200 - The Dayhoff PAM 200 matrix
  • PAM250 - The Dayhoff PAM 250 matrix
  • PAM320 - The Dayhoff PAM 320 matrix

The scoring command sets the matrix used to compare the sequences.

Note that no DNA matrices are available in this release. DNA matrices will be added in a future version.

Output commands:

Output commands are used to control the program's output. NONE OF THESE PARAMETERS IS REQUIRED.

TITLE
Format: title={TEXT}
where {TEXT} is a descriptive label.

This command will let you label your output.

CUTOFF
Format: cutoff
Format: cutoff percent={REAL}
Format: cutoff={REAL}
where {REAL} is the percentage of the maximum score or the absolute cutoff value.

The cutoff command selects the alignment cutoff parameter. Only alignments scoring above this cutoff value will be produced.

The first form of the command indicates that the program should compute its own cutoff value. The computed value is 10% of the score that would be produced if the query sequence were compared with itself. This is the recommended mode.

The second form of the command allows the user to set the cutoff to be a certain percentage of the score that would be produced if the query sequence was compared with itself. For example, if one was dealing with long, multiple domain proteins, one might want to set this value to 5%. To set this value to 5% enter: "cutoff percent=0.05".

The third form of the command is usually not recommended. It sets the cutoff to a specific value. This powerful option should be used only with discretion.
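
The three forms of the cutoff command can be summarized in a short Python sketch. The function and parameter names are hypothetical; the self-comparison score of 734.0 is borrowed from the sample listing file in this manual (PSRSAW x PSRSAW) purely as an illustration.

```python
def resolve_cutoff(self_score, percent=None, absolute=None):
    """Return the cutoff implied by one of the three forms of the command.

    self_score is the score of the query sequence compared with itself.
    """
    if percent is not None and absolute is not None:
        raise ValueError("give either a percentage or an absolute value, not both")
    if absolute is not None:          # third form: cutoff={REAL}
        return float(absolute)
    if percent is not None:           # second form: cutoff percent={REAL}
        return percent * self_score
    return 0.10 * self_score          # first form: program default of 10%


self_score = 734.0
print(round(resolve_cutoff(self_score), 2))                # first form
print(round(resolve_cutoff(self_score, percent=0.05), 2))  # second form
print(resolve_cutoff(self_score, absolute=150.0))          # third form
```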

NUMBER
Format: number={INTEGER}
where {INTEGER} is the number of subalignments to be retrieved.

The number parameter selects the number of subalignments to be retrieved for EACH PAIR OF SEQUENCES. This parameter is particularly useful if one is using a query sequence that has repeats or multiple domains. This parameter is not meaningful when performing a database search.

Data commands:

Data commands are used to tell the program how it should treat the sequence data. SOME OF THESE PARAMETERS ARE REQUIRED

ALPHABET
Format: alphabet library={TYPE} query={TYPE}
where {TYPE} is either PROTEIN or NUCLEIC.

This command tells the program which alphabet to use when reading the files. Generally, an alphabet can be thought of as the "data-type" of the sequence data. Currently, only the PROTEIN and the NUCLEIC alphabets are supported. There are no default values.

Because the Msearch program is capable of translating DNA sequences into protein sequences, it is particularly important the user set these parameters properly.

Note that NUCLEIC-NUCLEIC comparisons are not possible in this version solely because appropriate nucleic acid matrices have not been added to the code yet.

TRANSLATE
Format: translate library={STRING} query={STRING}
where {STRING} is six characters of either 1 (translate) or 0 (don't translate)

This command tells the program which translation frames to use. The default is "000000" for both the library and the query. "000000" means do not translate these sequences.

Of course, if one were comparing a protein to a DNA sequence, translation would be necessary. The DNA sequence will always be the sequence that needs to be translated. Indicate the translation frames with a "1". If you are not interested in having a particular frame translated, use a "0":

         XXXXXX
         ||||||__Third reverse complemented frame
         |||||
         |||||___Second reverse complemented frame
         ||||
         ||||____First reverse complemented frame
         |||
         |||_____Third forward frame
         ||
         ||______Second forward frame
         |
         |_______First forward frame
       
For example "111000" means translate the forward frames one, two, and three.

Because the Msearch program is capable of translating DNA sequences into protein sequences, it is particularly important the user set these parameters properly.
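
The six-character frame string shown in the diagram above can be decoded with a few lines of Python. This is only an illustrative sketch; the function name and the frame labels are assumptions, not part of Msearch.

```python
# Position 1 (leftmost) is the first forward frame; position 6 (rightmost)
# is the third reverse complemented frame, as in the diagram.
FRAME_NAMES = (
    "forward 1", "forward 2", "forward 3",
    "reverse 1", "reverse 2", "reverse 3",
)


def selected_frames(mask):
    """Return the names of the frames marked with '1' in a translate string."""
    if len(mask) != 6 or any(flag not in "01" for flag in mask):
        raise ValueError("expected six characters, each 0 or 1")
    return [name for name, flag in zip(FRAME_NAMES, mask) if flag == "1"]


print(selected_frames("111000"))  # ['forward 1', 'forward 2', 'forward 3']
print(selected_frames("000000"))  # [] (no translation)
```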

Mode commands:

Mode commands are used to tell the program what operation should be performed on the sequence data.

SEARCH
Format: search

This command will perform a database search via the method selected with the SEARCHMETHOD command. This command is very time intensive. Please see the PERFORMANCE CONSIDERATIONS section of this document for ways to improve the speed of this command.

ALIGN
Format: align

This command will align sequences from a database search via the method selected with the ALIGNMETHOD command. You must first SEARCH, then ALIGN. Searches can be saved with the "SAVE" command; however, if the databases have been changed since the search was done, you will need to repeat the search.

END
Format: end

This command is rarely entered by the user. It is used to indicate that no more sequences in a save file are to be aligned.

QUIT
Format: quit

This command ends the program.

Algorithm commands:

Algorithm commands are used to tell the program what comparison algorithm should be used for the various program operations.

SEARCHMETHOD
Format: searchmethod={ALGORITHM}
where {ALGORITHM} is one of MAXSEGS, LOCALPROFILE or GLOBALPROFILE

This sets the searching algorithm.

ALIGNMETHOD
Format: alignmethod={ALGORITHM}
where {ALGORITHM} is one of MAXSEGS, LOCALPROFILE or GLOBALPROFILE

This sets the aligning algorithm.

File Commands:

File commands are used to tell the program what files need to be read, and what files need to be written.

DATABASE
Format: database={NAME}
where {NAME} is any logical name found in the databases file.

This command will tell the program which database you want to search. Upon starting the program, all available databases and names are listed.

LISTING
Format: listing={FILE}
where {FILE} is a legal filename

This command will have the program write an optional listing file when a database search is performed (via the "SEARCH" command). The listing file contains the sequence identifiers and definitions, sorted from highest to lowest score. Only those pairs that score higher than the cutoff value are reported. Below is a sample listing file:

734.00 PSRSAW   x PSRSAW   ...phospholipase A2 (EC 3.1.1.4) Western
712.00 PSRSAW   x PSRSAE   ...phospholipase A2 (EC 3.1.1.4) Eastern
567.00 PSRSAW   x PSTV     ...phospholipase A2 (EC 3.1.1.4) himehabu
548.00 PSRSAW   x PSRSAT   ...phospholipase A2 (EC 3.1.1.4) crotoxin
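
Listing lines like those in the sample above can be split into fields with a short Python sketch. The field layout is inferred from the sample only, and the assumption that the first identifier is the query and the second is the library sequence is ours; the function name is hypothetical.

```python
import re

# score, first identifier, "x", second identifier, "..." then the definition
LISTING_LINE = re.compile(
    r"^\s*(?P<score>-?\d+(?:\.\d+)?)\s+(?P<query>\S+)\s+x\s+"
    r"(?P<library>\S+)\s+\.\.\.(?P<definition>.*)$"
)


def parse_listing_line(line):
    """Return a dict of fields for one listing line, or None if it does not match."""
    found = LISTING_LINE.match(line)
    if found is None:
        return None
    fields = found.groupdict()
    fields["score"] = float(fields["score"])
    fields["definition"] = fields["definition"].strip()
    return fields


sample = "712.00 PSRSAW   x PSRSAE   ...phospholipase A2 (EC 3.1.1.4) Eastern"
print(parse_listing_line(sample))
```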

QUERY
Format: query={FILE}
where {FILE} is a legal filename

This command will tell the program what file the query sequences or the profile is in. Query sequences can be in the NBRF-PIR, GenBank, EMBL or Swiss-Prot file formats. The sequence profile must be in the GCG Wisconsin package profile file format.

ALIGNMENTS
Format: alignments={FILE}
where {FILE} is a legal filename

This command will have the program write an alignment file when alignments are requested (via the "ALIGN" command). Below is a sample alignment produced by the program:

Alignment #  1 between PIR1:PSNJ2M and PSRSAW scored:  268.00

The query sequence (PSRSAW-PSRSAW) is    122 residues long.
...Usr$Temp:[Ropelewski]Psrsaw.Pir1;1 => PSRSAW

The library sequence (PSNJ2M-PSNJ2M) is    118 residues long.
...phospholipase A2 (EC 3.1.1.4) II - Mozambique cobra


          1          *          *         *         *         *       58
PSRSAW => SLVQFETLIM.KIAGRSGLLW.YSAYGCYCGWGGHGLPQDATDRCCFVHDCCYGKA.T.DCN
          :| ||  :|   :::|:   | :: ||||||:|| | : |  |||| ||| ||| |   :|
PSNJ2M => NLYQFKNMIHCTVPSRP..WWHFADYGCYCGRGGKGTAVDDLDRCCQVHDNCYGEAEKLGCW
          1         *           *         *         *         *       60

          59        *          *         *         *              116
PSRSAW => PKTVSYTYSEENGEIIC.GGDDPCGTQICECDKAAAICFRDNIPSYDNKYWLFPPKDCR
          |    | |   :| : | ||:: |:: :|:||  || ||       | :| :     |:
PSNJ2M => PYLTLYKYECSQGKLTCSGGNNKCAAAVCNCDLVAANCF.AGARYIDANYNINLKERCQ
          61        *         *         *          *              118

The alignment contains:
121 pairs,   46 matches,   67 mismatches and   8 insertions/deletions.
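
The summary line that closes each alignment can be checked and turned into a fraction of identical pairs with a short Python sketch. The pattern is inferred from the sample above, and the function name and the identity calculation are illustrative additions, not Msearch output.

```python
import re

SUMMARY = re.compile(
    r"(\d+)\s+pairs,\s+(\d+)\s+matches,\s+(\d+)\s+mismatches\s+and\s+"
    r"(\d+)\s+insertions/deletions"
)


def parse_summary(line):
    """Return the four counts from an alignment summary line as a dict."""
    found = SUMMARY.search(line)
    if found is None:
        raise ValueError("not an alignment summary line")
    pairs, matches, mismatches, indels = (int(n) for n in found.groups())
    return {"pairs": pairs, "matches": matches,
            "mismatches": mismatches, "indels": indels}


counts = parse_summary(
    "121 pairs,   46 matches,   67 mismatches and   8 insertions/deletions."
)
# In the sample, matches + mismatches + insertions/deletions equals pairs.
consistent = (counts["matches"] + counts["mismatches"] + counts["indels"]
              == counts["pairs"])
identity = counts["matches"] / counts["pairs"]
print(consistent, round(identity, 3))
```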

SAVE
Format: save={FILE}
where {FILE} is a legal filename. {FILE} MUST NOT ALREADY EXIST!

This important command is used to tell the program where to store the intermediate search results. The intermediate results are stored in a binary format. THIS FILE MUST BE SPECIFIED BEFORE THE "SEARCH" COMMAND IS ISSUED!


PVM SETUP:

In order to run the Msearch program, the PVM (*) message passing system must be installed on every computer that you will use.

(*) PVM can be obtained via anonymous FTP to netlib2.cs.utk.edu. PVM can also be obtained through the World Wide Web (WWW) and electronic mail. WWW information is available from http://www.epm.ornl.gov/pvm/. Electronic mail information can be obtained by sending the message "help" to netlib@ornl.gov.

Once PVM is installed on your system, you can then start the PVM daemon. There are a variety of ways to start the PVM daemon; instructions for starting it on a variety of platforms are listed below. If the method listed below does not work on your system, or if you desire a more secure method of starting PVM, please read the "starting the PVMD" section in the PVM manual.

Keep in mind that on some machines, such as the Cray T3d, users do not explicitly start the PVM daemon.

Starting PVM on Cray MPP systems:

Starting PVM on Cray parallel-vector systems:

Starting PVM on workstations:

PVM daemon problems


FORMAT OF THE DATABASE FILE:

Each line in the DATABASE file has the following format:

{KEYWORD}: {LOGICAL-NAME} {PARAMETER}

{KEYWORD} can be one of the following:
TYPE
DESCRIPTION
FILENAME
DATABASE

{LOGICAL-NAME} is the name by which the program will refer to either a sequence file or a database entry.

{PARAMETER} contains parameters that are required by the keyword.

How to add a sequence file:

Sequence files require three keywords: "TYPE", "DESCRIPTION" and "FILENAME". All of these keywords must have the same {LOGICAL-NAME}. For example:
   TYPE:        PIR1  PROTEIN
   DESCRIPTION: PIR1  Pir1 (Annotated sequences)
   FILENAME:    PIR1  /afs/psc/common/usr/local/biomed/db/nbrf/pir1
   
is a valid sequence file entry in the DATABASES file. To access this file, the user will simply use the logical name "PIR1". Another valid entry would be:
  
   TYPE:        PIR2  PROTEIN
   DESCRIPTION: PIR2  Pir2 (Partially annotated sequences) 
   FILENAME:    PIR2  /afs/psc/common/usr/local/biomed/db/nbrf/pir2
   
To access this file, the user will simply use the logical name "PIR2".

How to create a database entry:

A database entry can be a convenient way to refer to many logical file names. Database entries require three keywords: "TYPE", "DATABASE" and "DESCRIPTION". All of these keywords must have the same {LOGICAL-NAME}. Database entries must be defined in the DATABASE file AFTER the sequence files have been defined. Below is a valid database entry:
   TYPE:        PIR   DATABASE
   DATABASE:    PIR   PIR1 PIR2  
   DESCRIPTION: PIR   NBRF-PIR database sections 1 and 2
   
To access this database entry, the user will simply use the logical name "PIR".

Example DATABASE file:


  TYPE:        SP    PROTEIN
  DESCRIPTION: SP    Swiss Protein Data Base
  FILENAME:    SP    /afs/psc/common/usr/local/biomed/db/swiss/swiss 

  TYPE:        PIR1  PROTEIN
  DESCRIPTION: PIR1  Pir1 (Annotated sequences)
  FILENAME:    PIR1  /afs/psc/common/usr/local/biomed/db/nbrf/pir1

  TYPE:        PIR2  PROTEIN
  DESCRIPTION: PIR2  Pir2 (Partially annotated sequences) 
  FILENAME:    PIR2  /afs/psc/common/usr/local/biomed/db/nbrf/pir2

  TYPE:        PIR3  PROTEIN
  DESCRIPTION: PIR3  Pir3 (Unannotated sequences) 
  FILENAME:    PIR3  /afs/psc/common/usr/local/biomed/db/nbrf/pir3


  TYPE:        GBBCT  NUCLEIC
  DESCRIPTION: GBBCT  Genbank Bacterial 
  FILENAME:    GBBCT  /afs/psc/common/usr/local/biomed/db/genbank/gbbct.seq

  TYPE:        GBEST  NUCLEIC
  DESCRIPTION: GBEST  Genbank Expressed taged sequences 
  FILENAME:    GBEST  /afs/psc/common/usr/local/biomed/db/genbank/gbest.seq

  TYPE:        SWISS DATABASE
  DATABASE:    SWISS SP
  DESCRIPTION: SWISS Swiss-Protein database 

  TYPE:        PIR   DATABASE
  DATABASE:    PIR   PIR1 PIR2 PIR3 
  DESCRIPTION: PIR   NBRF-PIR database 

  TYPE:        GENBANK DATABASE
  DATABASE:    GENBANK GBBCT GBEST 
  DESCRIPTION: GENBANK GenBank database 
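
The rules in this section (one "{KEYWORD}: {LOGICAL-NAME} {PARAMETER}" per line, database entries defined after their sequence files, and a single data type per database entry) can be checked with a small Python sketch. This is not the parser used by Msearch; the function names are hypothetical, and the MIXED entry in the sample is deliberately invalid to show the check.

```python
KEYWORDS = ("TYPE", "DESCRIPTION", "FILENAME", "DATABASE")


def parse_database_file(text):
    """Return {logical name: {keyword: parameter}} in order of first appearance."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(":")
        keyword = keyword.strip().upper()
        if keyword not in KEYWORDS:
            raise ValueError("unknown keyword: " + keyword)
        parts = rest.split(None, 1)  # logical name, then the parameter
        if not parts:
            raise ValueError("missing logical name: " + line)
        parameter = parts[1].strip() if len(parts) > 1 else ""
        entries.setdefault(parts[0], {})[keyword] = parameter
    return entries


def check_entries(entries):
    """Return a list of problems found in the database entries."""
    problems = []
    seen_files = {}  # sequence file logical name -> PROTEIN or NUCLEIC
    for name, entry in entries.items():
        if entry.get("TYPE") != "DATABASE":
            seen_files[name] = entry.get("TYPE")
            continue
        members = entry.get("DATABASE", "").split()
        missing = [m for m in members if m not in seen_files]
        if missing:
            problems.append(name + ": not defined earlier: " + " ".join(missing))
        types = {seen_files[m] for m in members if m in seen_files}
        if len(types) > 1:
            problems.append(name + ": mixes data types")
    return problems


SAMPLE_DB = """
TYPE:        PIR1  PROTEIN
DESCRIPTION: PIR1  Pir1 (Annotated sequences)
FILENAME:    PIR1  /afs/psc/common/usr/local/biomed/db/nbrf/pir1
TYPE:        GBBCT  NUCLEIC
DESCRIPTION: GBBCT  Genbank Bacterial
FILENAME:    GBBCT  /afs/psc/common/usr/local/biomed/db/genbank/gbbct.seq
TYPE:        MIXED DATABASE
DATABASE:    MIXED PIR1 GBBCT
DESCRIPTION: MIXED Invalid on purpose: protein and nucleic together
"""

entries = parse_database_file(SAMPLE_DB)
print(entries["PIR1"]["FILENAME"])
print(check_entries(entries))  # ['MIXED: mixes data types']
```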
  

INSTALLING THE SOFTWARE:

The software should install trouble-free on most systems that can make use of the PVM message passing system. Although the Msearch code is designed to be portable, we have only thoroughly tested this release on the Cray T3d. Moderate testing has been performed on Cray Multiprocessor systems (J90 and C90) and SGI INDY workstations. Future releases of this code will be tested on additional platforms.

If there is something in this code that causes you installation trouble, please let us know. We cannot test this code on every machine capable of running PVM. Please send your report to biomed@psc.edu.

Installing on a Cray T3d:

Installing on SGI INDY machines:

Installing on "generic" PVM machines:

There are no specific instructions for installing this code on a "generic" PVM machine. However, you will probably have to:

SAMPLE INPUT FILES

Sample T3d input file:

     database=NRL
     cutoff=150.0
     number=1
     gap=-8.0
     newgap=-0.0
     scoring=pam250
     alphabet library=protein,query=protein
     searchmethod=maxsegs
     alignmethod=maxsegs
     listing=file.list
     query=snake.query
     alignments=snake_max.alignments
     save=file.save
     title=comparing sequence vs NRL database.
     translate library=000000 query=000000
     search
     quit
     

Sample PVM input file:

     5
     database=NRL
     cutoff=150.0
     number=1
     gap=-8.0
     newgap=-0.0
     scoring=pam250
     alphabet library=protein,query=protein
     searchmethod=maxsegs
     alignmethod=maxsegs
     listing=file.list
     query=snake.query
     alignments=snake_max.alignments
     save=file.save
     title=comparing sequence vs NRL database.
     translate library=000000 query=000000
     search
     quit
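
Before submitting a long run, an input file like the samples above can be checked against two rules stated in this manual: the scoring commands (gap, newgap, scoring) are required, and a save file must be specified before the search command is issued. The Python sketch below is a hypothetical helper, not part of Msearch. Lines it does not recognize, such as the bare "5" that opens the PVM sample and is not explained in this manual, are passed through untouched.

```python
import re

REQUIRED = ("gap", "newgap", "scoring")


def parse_command(line):
    """Split one input line into (keyword, argument text)."""
    parts = re.split(r"[ =]", line.strip(), maxsplit=1)
    keyword = parts[0].lower()
    argument = parts[1].strip() if len(parts) > 1 else ""
    return keyword, argument


def check_input(text):
    """Return a list of problems found in an Msearch input file."""
    problems = []
    seen = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        keyword, _ = parse_command(raw)
        if keyword == "save" and "search" in seen:
            problems.append("save must be specified before search")
        seen.append(keyword)
    for name in REQUIRED:
        if name not in seen:
            problems.append(name + " is required")
    return problems


SAMPLE = """
database=NRL
cutoff=150.0
gap=-8.0
newgap=-0.0
scoring=pam250
alphabet library=protein,query=protein
searchmethod=maxsegs
query=snake.query
save=file.save
search
quit
"""

print(check_input(SAMPLE))                       # []
print(check_input("gap=-8.0\nsearch\nsave=f"))   # three problems reported
```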
     
