Homology Searching in Biological Sequence Databases

Abstract

Molecular biology is in the middle of a major paradigm shift driven by computing. Although the field is already an informational science in many respects, it is rapidly becoming much more computational and analytical. Bridging the gap between the experimental world of molecular biology and the precise, logical nature of computer science, however, requires an interdisciplinary perspective. The central computational challenge lies in mapping and interpreting the biological information being collected worldwide. In this report we concentrate on one all-important task: homology searching in biological sequence databases.

1. Introduction

Rapid progress in genetics and biochemistry research, combined with the tools provided by modern biotechnology, has generated massive volumes of genetic and protein sequence data. These sequence data are accessible to researchers in a variety of databases, including GenBank, EMBL (European Molecular Biology Laboratory), SWISS-PROT, PIR (Protein Identification Resource), and PDB (Protein Data Bank). These databases contain a large number of deoxyribonucleic acid (DNA) and protein sequences along with their associated biological and bibliographical information. A DNA sequence is a string of nucleotides, each identified by the base it contains; a protein sequence is a string of amino acids. In the past, these sequences and their associated information were gathered from articles published in scientific journals. Presently, an increasing number of unpublished sequences are submitted electronically to these databases as they are discovered. As a result, the number of entries is growing almost exponentially with time.

As these databases grow in size, biomedical researchers need computational tools to retrieve biological information from them, analyze the sequence patterns they contain, predict the three-dimensional structure of the molecules the sequences represent, reconstruct evolutionary trees from the sequenced data, and track the inheritance of chromosomes based on the likelihood of specific sequences occurring in different individuals. These software tools will be used to learn basic facts about biology such as which sequences of DNA are used to code proteins while other combinations of DNA are not used for protein synthesis. They will also be used to understand genes and how they influence diseases.

Many of the biomedical research activities mentioned above are computationally intensive tasks. This report discusses one such area comprehensively: an important sequence pattern analysis activity, inter- and intrasequence homology searching.

When researchers discover new sequences, they search the databases for homologs of the new sequences as a first step toward establishing their origin and function. They also often repeat these searches at regular intervals, since new sequences are continuously being added to the databases. Dynamic programming is the best-known algorithm for establishing homology. Unfortunately, depending both on the volume of data to be processed and on the accuracy of the comparisons, dynamic programming can be time consuming and can even become a performance bottleneck. This constraint is usually addressed by one of several techniques: software heuristics, parallelization on massively parallel computers, distribution over a network of workstations, or dedicated hardware. Each of these techniques has advantages and drawbacks, as summarized below.

2. Homology Searching Approaches

1. Software heuristics can drastically reduce the time of computation: instead of systematically computing all the possible alignments between two sequences, the search starts from areas where strong similarities (identical words) have been found. The length of the words determines the speed of the process: the longer the word, the faster the comparison. At the same time, the longer the word, the greater the chance of missing some significant alignments.

2. Massively parallel computers are not available in biology laboratories and access must be performed through the Internet. The computation is fast, but the remote access restricts the interactivity: results are returned by e-mail and the delay depends on the load of the server.

3. Networks of workstations aim to use local computing power. This idea is attractive (and works), but it is difficult to implement in practice in the laboratories: it often requires a homogeneous set of machines, a distributed software system that accommodates different operating systems, and, typically, a system engineer to install the software.

4. Dedicated machines (or hardware accelerators) are small hardware units connected to a PC or a workstation. They can only support a limited class of computations, but perform very rapidly.

This report is organized as follows: The following section describes the types of biological sequences that are being discovered and stored in the various databases and the relationship between these types. We then describe the contents and storage format of a well-known, publicly available genetic sequence database, GenBank. Next, the residue substitution scoring matrices that define the similarity scores for all possible pairs of residues are discussed. After this, we present some background on preprocessing and some efficient algorithms that can perform selective, sensitive, and rapid homology searches. We also discuss some parallel searching approaches. Finally, we survey some dedicated machines designed specifically for speeding up sequence comparisons.

Note: before proceeding to the next section, a word about homologous sequences. Sequence similarity searching is the most powerful method available today for inferring the biological function of a gene (or of the protein that it encodes), because of the enormous amount of information that is preserved throughout the evolutionary process. Proteins that share a common ancestor are called homologous. Sequence comparison is most informative when it detects homologous proteins: homologous proteins generally share a common three-dimensional folding structure, common active sites or binding domains, and common functions.

3. Origin and Structure of DNA

DNA contains the complete genetic information that defines the structure, function, development, and reproduction of an organism. Genes are regions of DNA and proteins are the products of genes. DNA is usually found in the form of a double helix with two chains wound in opposite directions around a central axis. Each chain consists of a sequence of nucleotide bases which can be one of the following four types: adenine (A), cytosine(C), guanine (G), and thymine (T). The two chains of a DNA molecule have complementary bases, with A-T and C-G being the only pairings that occur, so the sequence of a second chain can be determined if the sequence of the first chain is known. In replication, the chains unwind and each chain is used as a template to form a new companion chain.

The production of a protein from a segment of DNA occurs in two major steps: transcription and translation. A protein is a polymer consisting of a sequence of amino acids; twenty different amino acids are commonly found in proteins. The DNA itself does not act as a template for protein synthesis. In transcription, a complementary RNA copy of one of the two DNA chains is formed from ribose nucleotides. RNA is a single-stranded molecule similar to DNA except that its sugar backbone contains ribose instead of deoxyribose. During transcription, the base thymine (T) is replaced by uracil (U) while the other three bases remain the same. As a result, an RNA sequence is composed of the bases A, C, G, and U.

This RNA sequence is translated into a sequence of amino acids that combine to form a protein. During translation, three bases at a time, referred to as a codon, are read and translated into one amino acid. There are 4^3 = 64 possible codons; sixty-one of them specify the 20 possible amino acids, and the remaining three are stop codons that terminate translation. Thousands of different types of proteins occur in biological organisms. They are responsible for catalyzing and regulating biochemical reactions and transporting molecules, as well as being the basis for the structure of skin, hair, and muscle.
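The translation step described above can be sketched in a few lines. The codon table below covers only a handful of the 64 codons, purely for illustration; a real translator would use the full standard genetic code.

```python
# Toy sketch of translation: read an RNA sequence three bases (one codon)
# at a time and map each codon to an amino acid. Only a few of the 64
# codons are listed here, for illustration only.
CODON_TABLE = {
    "AUG": "M",                          # methionine (the usual start codon)
    "UUU": "F", "UUC": "F",              # phenylalanine
    "GGU": "G", "GGC": "G",              # glycine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def translate(rna: str) -> str:
    """Translate an RNA string codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE.get(rna[i:i + 3], "X")  # X marks a codon missing from the toy table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGUUUGGUUAA"))  # -> MFG
```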

Thus, two types of sequences are of interest.

The first type is the DNA sequence formed from the four-letter base alphabet A, C, G, and T. These sequences are found in GenBank, EMBL, and DDBJ (DNA Data Bank of Japan).

The second type is the protein sequence that consists of the 20-letter amino acid alphabet. Protein sequences are collected in many databases such as SWISS-PROT, PIR, and PDB.

Both types of sequences are discussed in this report. RNA sequences are not discussed separately, as the computational methods used for RNA homology searching are the same as those used for DNA sequences.

4. An Example Database: GenBank

GenBank is an international database containing a large number of DNA sequences and their associated information. Currently, GenBank is maintained and distributed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. The sequences in GenBank are obtained by scanning over 325,000 articles in 3,400 scientific journals annually.

In addition, GenBank includes sequences from the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ). Recently, NCBI has accepted unpublished sequences via electronic submission directly from the scientific community for inclusion in the database. The latest release of GenBank is version 108.0, with 2,532,359 sequences and 1,797,137,713 bases; there is a new release every two months. Over 400 genomes have been completely sequenced, including viruses, bacteria, and organelles, and GenBank has complete sequences for 17 microbial genomes. Over 40,000 species are represented in GenBank, and all are listed in the Taxonomy browser, which is based on phylogenetic relationships. Human sequences in GenBank represent about 5.7% of the complete genome. NCBI (http://www.ncbi.nlm.nih.gov/) makes these data available in a variety of ways to accommodate the needs of the scientific community. GenBank has grown exponentially over the last decade.

The GenBank distribution consists of twenty-two flat files. To ensure portability and accessibility, these files are distributed in ASCII format. Portability is very important because GenBank is distributed to scientists worldwide who have a variety of computing environments. Accessibility is also important, since it allows most scientists to use their existing text processing tools instead of developing special tools just to access the data in this database.

GenBank has four different file types (extensions): text (txt), index (idx), form (frm), and sequence (seq). The two text files contain the release notes, which describe the file contents, formats, statistics, and other information, and a short directory of the sequence entries, which contains brief descriptions of all the sequences in GenBank. Each entry in this directory contains the sequence name (LOCUS), its brief definition, and its length. There are five index files, which are used to index the sequence entries in the database; the five index keys are accession numbers, keywords, authors, journals, and gene symbols, and the indexes are sorted alphabetically. File number 8 is a sequence data submission form, which can be filled out with a text editor and returned to the database via e-mail. The last file type (14 files) comprises the sequence entry files, which are used by sequence analysis software.

5. Residue Substitution Scoring Matrices

To compare two DNA or protein sequences, the similarities between the individual residues in these sequences must first be defined. A residue is a nucleotide base for DNA or an amino acid for a protein. A substitution-scoring matrix that defines the similarities between all possible pairs of residues is needed for optimal sensitivity. In this section, some well-known and application-specific scoring matrices are discussed. These matrices contain the score for substituting or exchanging one residue for another. A pairwise alignment score is the sum of substitution scores and gap penalties over all aligned residues, so the best alignment is the one that obtains the highest score.

A substitution-scoring matrix for proteins must have at least 20 x 20 elements since there are 20 amino acids. One possible way of constructing such a matrix is to base it on the number of common bases between the codons of the two amino acids. Since each amino acid is encoded by a codon of three bases, a similarity score can be zero, one, two, or three units. That is, a score of zero is given if the two codons have no common bases, a score of one unit is given if there is one common base, and so on.

Dayhoff et al. showed that simple scoring methods like the one above are not sensitive enough to detect distant similarity between protein sequences. As a result, they presented a method for constructing superior matrices based on the combined effects of several biological factors, including the nature of the genetic code, the rates of mutation at the nucleotide level, and natural selection.

Score matrices have been based on log-odds ratios derived from a Markov mutational model. Such a model assumes that mutation (substitution) events are random and independent. A matrix M of probabilities for substituting base i by base j after any given amount of evolution can be calculated by successive iteration of a reference mutation matrix:

Mn = (M1)^n

M1 is a matrix reflecting 99% sequence conservation (no mutation) and one accepted point mutation (1 PAM) per 100 bases. Mn then represents the substitution probabilities after n PAMs. To model the case in which all base substitutions are equally likely, the diagonal elements of M1 are all 0.99, while all off-diagonal elements are 0.00333. For a biased mutation model in which a given transition is threefold more likely than a given transversion, the off-diagonal elements of M1 corresponding to transitions are 0.006 and those for transversions are 0.002.

The n-PAM log-odds score Sij for aligning a given pair of bases is simply the log of the relative chance of that pair occurring as a result of evolution as opposed to occurring in a random alignment of two bases:

Sij = log(pi (Mn)ij / (pi pj)),

where pi is the underlying frequency of base i. The symmetry in the choice of the matrix M1 described above essentially assumes equal frequencies (1/4) for the four nucleotides, so Sij can be written as log(4 (Mn)ij).
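The model above can be checked numerically. This sketch builds the unbiased M1 (diagonal 0.99, off-diagonal 0.01/3), iterates it to Mn, and forms the log-odds scores under equal base frequencies; the choice n = 50 is arbitrary and purely illustrative.

```python
import math

# Numerical sketch of the Markov mutational model: Mn = (M1)^n, and
# Sij = log(4 * (Mn)ij) under equal base frequencies pi = 1/4.

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

# 1-PAM matrix for the unbiased model: 0.99 on the diagonal, 0.01/3 elsewhere.
M1 = [[0.99 if i == j else 0.01 / 3 for j in range(4)] for i in range(4)]
Mn = mat_pow(M1, 50)                    # substitution probabilities after 50 PAMs
S = [[math.log(4 * Mn[i][j]) for j in range(4)] for i in range(4)]

# Each row of Mn remains a probability distribution; identities keep a
# positive log-odds score and substitutions a negative one.
print(round(S[0][0], 3), round(S[0][1], 3))
```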

Table - 1: The PAM250 substitution-scoring matrix (lower triangular)

     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W   B   Z   X   -
C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5
N   -4   1   0  -1   0   0   2
D   -5   0   0  -1   0   1   2   4
E   -5   0   0  -1   0   0   1   3   4
Q   -5  -1  -1   0   0  -1   1   2   2   4
H   -3  -1  -1   0  -1  -2   2   1   1   3   6
R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L   -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
B   -4   0   0  -1   0   0   2   3   3   1   1  -1   1  -2  -2  -3  -2  -4  -3  -5   3
Z   -5   0  -1   0   0   0   2   3   3   1   1  -1   1  -2  -2  -3  -2  -5  -4  -6   2   3
X   -3   0   0  -1   0  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -2  -4  -1  -1  -1
-   -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1

Table 1 shows a substitution-scoring matrix constructed with Dayhoff's method, referred to as the PAM250 matrix. A score Mij in the matrix means that amino acid i is expected to change to j 10^(Mij/10) times more often than expected by chance after an evolutionary interval of 250 PAM units. One PAM (accepted point mutation) is equal to one accepted amino acid mutation per 100 residues of protein. For example, amino acid C will change back to C about 16 times more often than chance after an evolutionary interval of 250 PAM units. The table is presented as a lower triangular matrix because the substitution score is symmetrical; that is, the expected number of times that an amino acid X will change to Y is the same as that of Y changing to X. In addition to the 20 amino acids, the table has three additional residues (B, Z, X) and a dash. Sometimes scientists do not wish to distinguish between amino acids D and N, or E and Q; B and Z replace these pairs, respectively. For completeness, an unknown amino acid X is also included in the matrix, and it is assumed to be different from another unknown X. An unknown residue could result from a transmission or storage media error in which a valid residue was changed into an invalid one. A dash denotes a deletion. That is, a pair (A, -) or (-, A) means that A is deleted rather than substituted.
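The odds interpretation of PAM scores can be checked with a line of arithmetic: an entry M is 10*log10 of the substitution odds, so the odds are 10^(M/10).

```python
# Quick check of the odds interpretation of PAM scores: an entry M gives
# relative odds 10**(M/10).
def relative_odds(score):
    return 10 ** (score / 10)

print(round(relative_odds(12), 1))  # the C-C entry of 12 -> about 16, as quoted above
print(round(relative_odds(-8), 2))  # a heavily penalized pair such as W-C
```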

Table - 2: The nucleic acid substitution-scoring matrix

     A   C   G   T   U   X   -
A    1  -1  -1  -1  -1  -1  -2
C   -1   1  -1  -1  -1  -1  -2
G   -1  -1   1  -1  -1  -1  -2
T   -1  -1  -1   1   1  -1  -2
U   -1  -1  -1   1   1  -1  -2
X   -1  -1  -1  -1  -1  -1  -2
-   -2  -2  -2  -2  -2  -2   0

As pointed out by Dayhoff, single mutations of nucleotides are seldom observed in genes. These mutations may occur, but they are probably rejected by natural selection. As a result, an identity substitution-scoring matrix, as shown in table 2, is used for the nucleic acids (both DNA and RNA). That is, a score of one unit is given for substituting one base by another identical one; otherwise, a score of negative one is given. However, a score of one is also given for substituting T by U, or vice versa, since T of DNA is known to transcribe into U of RNA. For completeness, an unknown base X is also included in the matrix, and it is assumed to be different from another unknown base. The deletion score has been arbitrarily chosen to be -2 units so that a deletion of a base scores worse than its substitution.

Multiply aligned sequences provide implicit protein structural information. Such information is rapidly increasing with the expansion of sequence databanks and is now available for most newly determined sequences. By identifying constraints on residues and regions from multiple alignments, applications such as database searching can be improved. To use multiple-alignment information effectively, each column of the alignment is used to calculate a column of a position-specific scoring matrix (PSSM). To search a database, the PSSM is slid along each sequence entry and, at each offset, the score for every amino acid is obtained from the aligned PSSM column. PSSMs have also been applied to structural prediction tasks.
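The sliding step can be sketched as follows. The tiny two-column PSSM here is invented for illustration (real PSSM values come from alignment columns, typically as log-odds scores).

```python
# Minimal sketch of scanning a sequence with a PSSM: each column gives a
# score for every residue; sliding the matrix along the target and summing
# column scores yields one score per offset. The PSSM values below are
# invented, purely for illustration.
PSSM = [
    {"A": 2, "C": -1, "G": -1, "T": -1},   # column 1 strongly prefers A
    {"A": -1, "C": 2, "G": -1, "T": -1},   # column 2 strongly prefers C
]

def scan(seq, pssm):
    width = len(pssm)
    return [sum(pssm[k][seq[i + k]] for k in range(width))
            for i in range(len(seq) - width + 1)]

scores = scan("TACGAC", PSSM)
print(scores)  # the highest scores appear where the window matches "AC"
```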

An alternative approach to constructing log-odds substitution matrices is to obtain substitution probabilities directly from alignments of distantly related sequences without extrapolation. This has been accomplished in two ways.

One method utilized ungapped multiple-sequence alignments representing groups of related proteins present in the Blocks Database. Substitution probabilities were calculated from counts of amino acid pairs within each column of every block in the database. To reduce the contribution of closely related sequences to pair counts, sequences were clustered within blocks based on a percentage identity, and their contributions were averaged when calculating pair frequencies. For example, BLOSUM 62 (blocks substitution matrix at 62%) is the log-odds matrix derived from pair counts between sequence segments that are less than 62% identical. This procedure resulted in a series of log-odds matrices with average mutual information per residue pair (relative entropy) ranging from 0.2 to 1.5 bits (1 bit is the information in the answer to a yes/no question for which yes and no are equally likely).
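A toy version of this construction is sketched below: pairs are counted within each column of an ungapped block and converted to log-odds against the frequencies expected by chance. The four-sequence block is invented, and the clustering step that damps near-identical sequences is omitted.

```python
import math
from itertools import combinations
from collections import Counter

# Toy BLOSUM-style construction: count residue pairs within each column of
# an ungapped block, then take log-odds against chance expectation.
# The block below is invented for illustration; real blocks come from the
# Blocks Database, and sequence clustering is omitted here.
block = ["ACD", "ACE", "TCD", "ACD"]

pair_counts = Counter()
single_counts = Counter()
for col in zip(*block):                      # iterate over alignment columns
    for a, b in combinations(col, 2):        # all residue pairs in this column
        pair_counts[frozenset((a, b))] += 1
    single_counts.update(col)

total_pairs = sum(pair_counts.values())
total_singles = sum(single_counts.values())

def log_odds(a, b):
    q = pair_counts[frozenset((a, b))] / total_pairs
    pa = single_counts[a] / total_singles
    pb = single_counts[b] / total_singles
    e = pa * pb if a == b else 2 * pa * pb   # expected pair frequency by chance
    # A pair never observed in any column gets -inf; real constructions
    # avoid this with pseudocounts.
    return math.log(q / e) if q > 0 else float("-inf")

print(round(log_odds("C", "C"), 2))  # perfectly conserved column: positive score
print(log_odds("A", "C"))            # never paired in a column: -inf
```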

A second method utilized alignments based on selected superpositions of homologous structures. Structure-based log-odds matrices can be derived directly from pair counts in the same way as for the BLOSUM series, but without the clustering procedure.

6. Sequential Homology Searching Algorithms

6.1 Dynamic Programming Algorithms

Molecular biologists frequently compare biosequences to see if any similarities can be found in the hope that what is true of one sequence either physically or functionally is true of its analogue. Such comparisons need rigorous algorithms. In this section, we discuss some algorithms based on dynamic programming techniques.

To compare two sequences, one needs an algorithm that gives a measure of similarity between them. There are many simple methods for measuring the similarity between two sequences. One is to count the number of common residues between the two sequences. To take the positions of the residues into consideration, another method slides the shorter sequence along the longer one and counts the number of matches; the maximum number of matches can be used as a measure of similarity. These trivial methods do not provide good measures of similarity between two biological (genetic or protein) sequences. Biological sequences are known to mutate as they evolve: some of their residues may be changed, deleted, or inserted. In addition, not every pair of residues has the same degree of similarity. As a result, useful methods take changes (substitutions), deletions, and insertions into account.
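The two trivial measures mentioned above can be written in a few lines each, for contrast with the dynamic programming methods that follow; neither accounts for insertions or deletions.

```python
# Two trivial similarity measures: a plain count of shared residues, and
# the best match count over all offsets of the shorter sequence along the
# longer one. Neither handles insertions or deletions.
def common_residues(a, b):
    return sum(min(a.count(r), b.count(r)) for r in set(a))

def best_sliding_matches(short, long):
    best = 0
    for offset in range(len(long) - len(short) + 1):
        matches = sum(1 for i, r in enumerate(short) if r == long[offset + i])
        best = max(best, matches)
    return best

print(common_residues("GATTACA", "GCATGCT"))
print(best_sliding_matches("ACG", "TTACGTT"))  # exact match at offset 2 -> 3
```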

To define the terms substitution, deletion, and insertion more precisely, consider an example. Let sequence A = CSTPGND and sequence B = CSDTND. One possible comparison consists of the following substitutions, deletions, and insertions.

CS-TPGND
CSDTN--D

When two sequences are compared by writing one above the other, the process is referred to as aligning the two sequences. In the above alignment, B3 = D is inserted into the first sequence, or equivalently deleted from the second one, depending on the viewpoint. A4 = P is substituted by B5 = N (or B5 by A4, again depending on the viewpoint). The dash in the sequences represents a gap. Therefore, there is a gap of length one between A2 and A3 in the first sequence and a gap of length two between B5 and B6 in the second sequence.

Sequence comparison or alignment is a difficult problem when substitutions, deletions, and insertions are allowed. Waterman proved that there are approximately (1 + sqrt(2))^(2n+1) / sqrt(n) possible alignments between two sequences, where n is the length of each sequence. For example, if the length of each sequence is 1,000 residues, the number of possible alignments is approximately 10^767. As a result, it is hopeless to enumerate all possible alignments.

Needleman and Wunsch introduced the first dynamic programming method that allowed substitution, deletion, and insertion when measuring the similarity between two protein sequences. Their method also gave the criterion for obtaining the optimal similarity score (measure). Basically, Needleman and Wunsch defined the similarity score between two sequences as the sum of all individual elementary similarities, where the elementary similarities consist of substitutions, deletions, and insertions. This problem was later defined mathematically by Sellers in terms of the evolutionary distance between two sequences. Basically, the difference between the evolutionary distance and similarity score calculations is that the first uses a minimization function while the other uses a maximization function. Sellers' algorithm for computing the evolutionary distance between two sequences can be rewritten in terms of similarity score as follows. The following notation is used in this algorithm.

A = a1a2a3...aM.
B = b1b2b3...bN.

S (A,B) = maximum similarity score between sequences A and B.

s(ai,bj) = elementary score for substituting bj for ai.

s(ai,-) = elementary score for deleting ai.

s(-,bj) = elementary score for inserting bj.

g = s(ai,-) = s(-,bj) = gap score.

Si,j = the accumulated similarity score between sequences A and B up to the ith and jth positions.

Si,j = max{ Si-1,j + g,  Si-1,j-1 + s(ai,bj),  Si,j-1 + g }

The above equation is a recurrence relation that determines the similarity score between two sequences of length M and N by using the similarity scores of shorter sequences. That is, it starts from i = 0 and j = 0 and increments them until i = M and j = N.

Here, for finding local similarity either sequence can slide along the other one without penalty. The initial conditions are given as follows.

Si,0 = 0 (0 <= i <= M),  S0,j = 0 (0 <= j <= N).

If the gap penalty is not constant, g must be replaced with the actual penalty function.

To see how this works, consider the following example. Let us find the similarity score between sequences A = aatcgatcct and B = atccgcct; that is, let us find the highest-scoring segment common to both sequences. The substitution-scoring matrix given in table 2 is used, with g = -2.

Begin by initializing the first row S0,j and first column Si,0 to zeroes for 0 <= i <= 10 and 0 <= j <= 8. The remaining elements can then be calculated using the recurrence relation, row by row or column by column. For example, S1,1 = max{0 + (-2), 0 + 1, 0 + (-2)} = 1. Using this process, the entire matrix is calculated. The similarity score between the two sequences is the largest number in the matrix; for this example, the similarity score is 4.
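The recurrence can be run directly on the worked example. This sketch uses the match/mismatch scores of table 2 and g = -2, with the zero first row and column letting either sequence slide along the other without penalty.

```python
# Direct implementation of the recurrence above, applied to the worked
# example: A = aatcgatcct, B = atccgcct, match +1, mismatch -1, gap g = -2.
# The similarity score is the largest entry in the matrix.
def similarity(A, B, g=-2):
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]   # zero first row and column
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = 1 if A[i - 1] == B[j - 1] else -1
            S[i][j] = max(S[i - 1][j] + g,       # deletion
                          S[i - 1][j - 1] + s,   # substitution
                          S[i][j - 1] + g)       # insertion
    return max(max(row) for row in S)

print(similarity("aatcgatcct", "atccgcct"))  # -> 4, as computed in the text
```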

The illustrated algorithm is very simple: it has a constant gap function. That is, this algorithm does not distinguish between a sequence with one gap of length N and a sequence with N gaps of length one. In genetics, a mutated sequence with one gap of length N is more realistic than one with N gaps of length one, so it is more useful to have an algorithm that assigns a higher score to the sequence with one gap. Sellers' algorithm was later generalized by Waterman et al. to include this criterion. However, their algorithm requires M^2 * N computation steps, while Sellers' algorithm requires only M * N. Later, Gotoh modified the Waterman algorithm to reduce the number of computational steps to M * N.

The Gotoh algorithm can be presented as follows. Let the two sequences be A = a1a2a3...aM and B = b1b2b3...bN. Let s(ai,bj) be a given similarity score for substituting residue ai by bj. The gap penalty function is defined as wk = -uk - v (u, v >= 0), where k is the gap length.

Si,j = max{ Si-1,j-1 + s(ai,bj),  Pi,j,  Qi,j }

Pi,j = max{ Si-1,j + w1,  Pi-1,j - u }

Qi,j = max{ Si,j-1 + w1,  Qi,j-1 - u }

The initial conditions are Si,0 = Pi,0 = Qi,0 = 0 and S0,j = P0,j = Q0,j = 0, for 0 <= i <= M and 0 <= j <= N.
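The affine-gap recurrence above can be sketched as follows, with P tracking alignments ending in a gap in B and Q a gap in A; opening a gap costs w1 = -(u + v) and each extension costs a further u, so a gap of length k costs u*k + v in total. The match/mismatch scores (+1/-1) and the final matrix-maximum score are illustrative choices consistent with the earlier example, not part of Gotoh's formulation.

```python
# Sketch of the Gotoh recurrence with gap penalty wk = -(u*k + v).
# Boundaries are zero as in the text, and the score is taken as the
# matrix maximum, as in the earlier example.
def gotoh(A, B, u=1, v=1):
    M, N = len(A), len(B)
    w1 = -(u + v)                               # cost of opening a gap of length 1
    S = [[0] * (N + 1) for _ in range(M + 1)]
    P = [[0] * (N + 1) for _ in range(M + 1)]
    Q = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = 1 if A[i - 1] == B[j - 1] else -1
            P[i][j] = max(S[i - 1][j] + w1, P[i - 1][j] - u)  # gap in B: open or extend
            Q[i][j] = max(S[i][j - 1] + w1, Q[i][j - 1] - u)  # gap in A: open or extend
            S[i][j] = max(S[i - 1][j - 1] + s, P[i][j], Q[i][j])
    return max(max(row) for row in S)

# One gap of length 2 (cost u*2 + v = 3) beats two separate gaps (cost 4):
print(gotoh("acgt", "acgt"))            # identical sequences -> 4
print(gotoh("aaaattaaaa", "aaaaaaaa"))  # 8 matches minus one length-2 gap -> 5
```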

In sequence similarity searching, one faces a major difficulty that one does not encounter in business databases. As can be seen from the previous section, the comparison algorithms do not return a simple match or mismatch but rather a degree of similarity between the query and target sequences. For this reason, an index table based on the sequences cannot be constructed for these databases, and the query sequence must be compared against all the sequences in the database for each search. On the other hand, the associated sequence information, such as accession numbers, keyword phrases, authors, journal citations, and gene names, can be indexed. For this reason, the search process is divided into two phases. In the first phase, the gene name (LOCUS) of the most similar sequence is retrieved from the database. In the second phase, the retrieved gene name is used as an index to retrieve all of its associated information using available software tools.

For a given query sequence, one may consider running the local pairwise comparison algorithms discussed above against every entry sequence in the database. However, performing 10,000 such comparisons against the current databases can be very time consuming. For example, comparing a 300-residue protein against the current PIR database with the basic dynamic programming algorithm requires the computation of 1.2 billion matrix entries, a task that takes on the order of an hour on a current workstation. Also, with the advent of the Human Genome Initiative, the DNA sequence data available will reach trillions of bases within this decade. For this reason, designing algorithmically faster techniques for similarity searches is a very important problem.

6.2 Heuristic Algorithms

The problem of searching for protein similarities efficiently has led many investigators to design very fast heuristic procedures: simple, often ad hoc computations that produce answers that are 'nearly' correct with respect to a formally stated optimization criterion.

6.2.1 FASTA

One of the most popular database searching tools of this genre is FASTA, developed by Lipman and Pearson. FASTA is heuristic: it reports most of the alignments that an equivalent dynamic programming calculation would produce, but it misses some matches and also reports some spurious ones relative to the more rigorous computation. On the other hand, it is very fast and reports results of interest to investigators. The details are given below.

For a small integer k, recall that the set of k-substring pairs is P_{k} = {(i,j) : a_{i}a_{i+1}...a_{i+k-1} = b_{j}b_{j+1}...b_{j+k-1}}, and assume its size is R. For each entry B of the database and the query A, FASTA counts the number of pairs in each diagonal of the dynamic programming matrix. Formally, diagonal d is the set {(i,j) : j-i = d}, and the algorithm computes Count(d) = |{(i,j) : j-i = d and (i,j) in P_{k}}| for each diagonal d in [-M, N]. FASTA computes these counts efficiently in O(R) time by encoding each k-substring as an integer. Intuitively, the code for a substring over an alphabet A is obtained by viewing it as a k-digit |A|-ary number, i.e.,

Code(c_{k-1}c_{k-2} ... c_{0}) = sum_{i=0}^{k-1} c_{i} |A|^{i}.

The code for each k-substring of the query and entry can be computed in linear time using the simple sliding relationship Code(a_{i+1}a_{i+2} ... a_{i+k}) = (Code(a_{i}a_{i+1} ... a_{i+k-1}) - a_{i}|A|^{k-1}) |A| + a_{i+k}, and the query is preprocessed by bucketing each of its k-substrings into an |A|^{k}-element array according to its code. As each database entry is processed, its substring codes are computed, used to index the positions of the matching substrings in the query, and the diagonal counts are incremented accordingly. Thus, if k is large enough so that R is small relative to MN, this procedure is quite fast. FASTA examines the diagonal counts and, if there are "high enough" counts in one or more diagonals, reports the entry as being similar to the query. In a postprocess, it produces an alignment between the query and the entry as in the heuristic procedure of Wilbur and Lipman. This procedure essentially builds a source-to-sink path by greedily selecting, in order of length, substring pairs compatible with those already chosen to be part of the path.
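The diagonal-counting stage can be sketched as follows. A dictionary stands in for the integer-coded array described above, purely for brevity; the word length k = 2 is an illustrative choice.

```python
from collections import defaultdict

# Sketch of FASTA's first stage: index every k-word of the query, then for
# each k-word of a database entry look up the matching query positions and
# increment the count of diagonal d = i - j. Diagonals with high counts
# are candidates for a fuller alignment. A dict stands in for the
# integer-coded lookup array used by the real program.
def diagonal_counts(query, entry, k=2):
    positions = defaultdict(list)            # k-word -> positions in the query
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    counts = defaultdict(int)
    for j in range(len(entry) - k + 1):
        for i in positions.get(entry[j:j + k], ()):
            counts[i - j] += 1               # hit on diagonal d = i - j
    return dict(counts)

print(diagonal_counts("gattaca", "gattc", k=2))  # shared prefix piles up on one diagonal
```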

The FASTA package contains a program TFASTA, which translates the sequences of a nucleotide database and compares them against a protein query sequence.

6.2.2 BLAST

There are a number of other heuristic algorithms of the same genre as FASTA. A more recent one is BLAST, which is notable for being faster than FASTA, yet capable of detecting biologically meaningful similarities with comparable accuracy. Given a query and an entry, BLAST looks for segment pairs of high score. A segment pair is a substring of A and a substring of B of equal length, and the score of the pair is that of the no-gap alignment between them.

Insertion and deletion events tend to significantly change the shape of a protein, and hence its function, so the presence of one or more high-scoring segment pairs is evidence of functional similarity between biological sequences; this is what makes the no-gap alignment significant. The segment pairs embody a local similarity concept. BLAST also gives an assessment of the statistical significance of any match it reports, returning to the user all database entries that have a segment pair with the query of score greater than S, ranked according to probability. BLAST is heuristic, for it may miss some such matches, although in practice it misses very few.

Many similarity measures, including the one that BLAST employs, begin with a matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence comparisons, BLAST uses the PAM-120 matrix; for DNA sequence comparisons, identities score +5 and mismatches -4.

The central idea used in BLAST is the notion of a neighborhood. The T-neighborhood of a sequence W is the set of all sequences that align with W with score better than T. Underlying and critical to the precise definition is the particular alignment and scoring model; for BLAST, the T-neighborhood of W is exactly those sequences of equal length that form a segment pair with W of score higher than T. This idea suggests a simple strategy for finding all entries that have segment pairs of length k and score greater than T with the query.

That is, generate the set of all sequences that are in the T-neighborhood of some k-substring of the query and check whether an entry contains one of these strings. Thus, one can design a deterministic finite automaton (DFA) for the sequence set and use it to scan an entry in the database for a match. The scanning speed of such an algorithm is exceedingly fast, on the order of half a million characters a second on a 20 MHz computer. But in the general problem, the length of the segment pair is not known in advance, and the number of sequences in the neighborhoods grows exponentially with k and as T is lowered.

To circumvent this difficulty, BLAST uses a fast scanning strategy to find short segment pairs above a certain score and then checks each of these to see if they are a portion of a segment pair of score S or greater. The BLAST algorithm is outlined below.

The first approach: suppose that w = 4 and map each word to an integer between 1 and 20^4, so a word can be used as an index into an array of size 20^4 = 160 000. Let the ith entry of such an array point to the list of all occurrences in the query sequence of the ith word. Thus, during scanning, each database word leads immediately to the corresponding hits. Typically, only a few thousand of the 20^4 possible words will appear in this table, and the approach can be modified to use far fewer than 20^4 pointers.
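A minimal sketch of this word table, assuming the standard 20-letter amino acid alphabet; the residue ordering, the names, and the use of a dict in place of the full 160 000-entry array are illustrative.

```python
# Hypothetical sketch of the w = 4 word table; the residue order and the
# dict (standing in for the 20**4 = 160 000-entry array) are illustrative.
AMINO = "ARNDCQEGHILKMFPSTWYV"      # the 20 amino acids
W = 4

def word_index(word):
    """Map a 4-residue word to an integer in [0, 20**4)."""
    idx = 0
    for ch in word:
        idx = idx * 20 + AMINO.index(ch)
    return idx

def build_word_table(query):
    """table[i] lists every position in `query` where the i-th word occurs."""
    table = {}
    for pos in range(len(query) - W + 1):
        table.setdefault(word_index(query[pos:pos + W]), []).append(pos)
    return table
```

During scanning, each database word is converted to its index and looked up, yielding all matching query positions at once.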

The second approach: a DFA accepting the T-neighborhood of strings is built in O(kMZ) time, where Z is the average cardinality of a T-neighborhood. The DFA scans the database and, each time a string in the neighborhood set is found, the next step is executed. A total of O(L) time is taken in this step, where L is the sum of the lengths of all the entries in the database.

Extending hits: for a given "hit", the DFA delivers the substring(s) from which the matching sequence was generated. Thus, step 2 delivers a segment pair of length k and score T or greater between the query and an entry. BLAST considers extending this segment pair both by appending and by prepending symbols to the given pair, to see whether it is a subsegment of a pair of score S or greater. If so, it reports the entry and the segment pair. In practice, the extension in a given direction is abandoned once the score drops below the best seen by more than a settable threshold, so that on average a constant amount of time is spent checking a given hit. The probability of a hit occurring at a given position is roughly O(MZ/|A|^{k}), so on average O(LMZ/|A|^{k}) time is spent in this step.
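The extension-with-drop-off idea can be sketched as follows for one direction; the scoring function and drop-off value are parameters, and the +5/-4 scores used in the test come from the DNA scores mentioned earlier. The names are invented for illustration.

```python
# Sketch of one-directional hit extension with a drop-off rule: keep the best
# score seen so far and give up once the running score falls too far below it.
def extend_right(a, b, i, j, score, dropoff):
    """Extend a hit starting at a[i], b[j] to the right; return (best score,
    extension length at which it was attained)."""
    best = run = 0
    best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        run += score(a[i + k], b[j + k])
        k += 1
        if run > best:
            best, best_len = run, k
        elif best - run > dropoff:       # fallen too far below the best: stop
            break
    return best, best_len
```

An analogous leftward extension covers the prepending direction; the two best extents together give the candidate segment pair.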

The procedure above is heuristic in that there is some chance that a segment pair of score S or greater does not contain a subsegment of length k and score T or greater. This possibility increases with increasing T and decreases with increasing k. For protein comparisons under an integrated version of the PAM scores, a reasonable compromise for the default choice was k = 4 and T = 17 (for which Z = 25). In practice, BLAST is about an order of magnitude faster than FASTA and can process a database at a rate of about one-third of a million characters per second on a 20 MHz general-purpose machine.

The BLAST package consists of five programs: blastp (protein query against a protein database), blastn (nucleotide query against a nucleotide database), blastx (translated nucleotide query against a protein database), tblastn (protein query against a translated nucleotide database), and tblastx (translated nucleotide query against a translated nucleotide database).

Here are some variants of BLAST:

Visual BLAST and Visual FASTA

When routinely analyzing protein sequences, detailed analysis of database search results produced with BLAST and FASTA becomes exceedingly time-consuming and tedious work, as the resulting file may contain a list of hundreds of potential homologies. The interpretation of these results is usually carried out with a text editor, which is not a convenient tool for this analysis. In addition, the format of data within BLAST and FASTA output files makes them difficult to read. To facilitate and accelerate this analysis, two easy-to-use programs, Visual BLAST and Visual FASTA, were designed for interactive analysis of full BLAST and FASTA output. These programs are based on the same intuitive graphical user interface (GUI) with extensive viewing, searching, editing, printing and multithreading capabilities. They also improve the browsing of BLAST/FASTA results by offering a more convenient presentation of these results. These tools include a pairwise sequence alignment viewer, a Hydrophobic Cluster Analysis plot alignment viewer and a tool displaying a graphical map of all database sequences aligned with the query sequence.

PowerBLAST

As the rate of DNA sequencing increases, analysis by sequence similarity search will need to become much more efficient in terms of sensitivity, specificity, automation potential, and consistency in annotation. PowerBLAST was developed by Zhang et al., in part, to address these problems. PowerBLAST includes a number of options for masking repetitive elements and low-complexity subsequences. PowerBLAST is capable of processing sequences of any length because it divides long query sequences into overlapping fragments and then merges the results after searching. The results may be viewed graphically, as a textual representation, or as an HTML page with links to GenBank and Entrez.

6.2.3 SENSEI

Now, we discuss another DNA sequence similarity search implementation, called SENSEI (SENsitive SEarch Implementation), which improves upon the performance of BLASTN by almost an order of magnitude for comparable sensitivity.

Nucleic acid sequences have special properties that should make it possible to develop better tools for specifically searching nucleic acid sequence databases. These properties include a small four-nucleotide alphabet and simple evolutionary models for base mutations as discussed at the beginning. Moreover, nucleotide query sequences are often much longer in length than protein sequences.

In BLASTN, the query sequence and the target sequence must share a common word, or k-tuple, to produce a hit. Search sensitivity can be increased for distantly related sequences by storing not only exact words but also inexact words. The first level of inexact words for a k-tuple consists of all the k-tuples that differ from the original k-tuple by one base. At each base position there are 3 other possible bases, leading to a total of 3k neighboring k-tuples. In BLAST, this is implemented by storing in the word table all the k-tuples that match some position in the query sequence or its reverse complement either exactly or nearly exactly. For a 40 kilobase query and a word size of k = 11 nucleotides, storing all the words that match the query sequence in at least 10 of the 11 nucleotide positions requires a word table of 2 (strands) * (3 * 11 + 1) * 40 kbase * 2 bytes/location = 5.44 Megabytes. This table size increases to over 84 Megabytes if higher sensitivity is required by matching at least 9 of the 11 nucleotides.
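Enumerating the 3k one-mismatch neighbours of a k-tuple, as described above, might look like this; the function name is illustrative, and this is not SENSEI's actual code (SENSEI generates inexact matches for the target at run-time rather than storing them).

```python
# Sketch of first-level inexact words: all k-tuples differing from `word`
# in at most one base. For a 4-letter alphabet this is 3k neighbours plus
# the word itself.
def neighbors_1(word, alphabet="ACGT"):
    out = {word}
    for i, ch in enumerate(word):
        for sub in alphabet:
            if sub != ch:
                out.add(word[:i] + sub + word[i + 1:])
    return out
```

For k = 11, each query position thus contributes 3 * 11 + 1 = 34 table entries, which is where the factor in the size estimate above comes from.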

In SENSEI, the size of the word table is reduced by indexing only exact words derived from the positive strand of the query sequence. Target sequences are searched in both the forward and reverse-complement directions and inexact word matches for the target sequence are generated at run-time. Also, the word table is the same size as the length of the query sequence; it does not depend upon the number of neighboring words searched. For a 40 kilobase query sequence, the word table stores 40,000 addresses.

In addition, SENSEI offers a choice of two alternative encodings for the DNA alphabet. The first option encodes each base using 2 bits; BLASTN uses the same encoding. However, SENSEI uses an index word of 8 sequential bases, while the default for BLASTN is 11 bases. The second encoding option uses a single bit to represent each base; thus A and G are indistinguishable, both coded by the same bit (0), and likewise C and T. This is the default option in SENSEI, but it is not available in BLASTN. For nearly identical sequences, better search sensitivity is obtained with the 2-bit/base encoding. For distantly related sequences, search sensitivity is improved by using the 1-bit/base encoding.

Extending word hits on both sides to check whether they yield a significant HSP is the most computationally expensive phase of the search for BLASTN. Traditionally, BLAST extends a word hit by moving a single base at a time. It would be more efficient to move forward k (k > 1) bases at a time. However, extending k bases at a time requires knowing the score between every pair of k-tuples, resulting in a score table with 4^{2k} entries. The size of this score table can be reduced by hashing. A simple and effective hash function is the logical exclusive-or (XOR) of the k-tuples.

In summary, SENSEI uses a logical exclusive-or (XOR) to encode the score table and extends HSP scores 8 base pairs at a time. Theoretically, this approach can increase the rate of HSP extension by a factor of 8. In practice, the speed-up depends on the cache size and memory hierarchy of the CPU.
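A toy sketch of the XOR-hashed score table: with a 2-bit/base encoding, the XOR of two encoded k-tuples has a nonzero 2-bit slot exactly where the bases differ, so a match/mismatch score depends only on the XOR value, and a 4^k-entry table suffices instead of 4^{2k}. The choice of k = 4, the names, and the +5/-4 scores are illustrative.

```python
# Toy sketch of the XOR-hashed score table. With a 2-bit/base encoding, the
# XOR of two encoded k-tuples is nonzero in a 2-bit slot exactly where the
# bases differ, so a match/mismatch score is a function of the XOR alone.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
K = 4

def encode(tup):
    """Pack a k-tuple into 2 bits per base."""
    v = 0
    for ch in tup:
        v = (v << 2) | ENC[ch]
    return v

def xor_score(x, match=5, mismatch=-4):
    """Score of a k-tuple pair, computed from the XOR of their encodings."""
    s = 0
    for _ in range(K):
        s += match if (x & 0b11) == 0 else mismatch
        x >>= 2
    return s

# One 4**K-entry table replaces the naive 4**(2*K)-entry pair table:
SCORES = [xor_score(x) for x in range(4 ** K)]
```

`SCORES[encode(u) ^ encode(v)]` then scores any tuple pair in a single lookup, so an extension can advance K bases per step.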

A new developmental version of BLAST (WU-BLAST2, from Washington University, available from http://blast.wustl.edu/) has been released that incorporates Sum statistics for gapped alignments.

6.2.4 Gapped BLAST, PSI-BLAST and PHI-BLAST

BLAST is a heuristic that attempts to optimize a specific similarity measure. It permits a tradeoff between speed and sensitivity through the setting of a threshold parameter: the higher the threshold value, the greater the speed, but also the greater the probability of missing weak similarities. The BLAST program requires time proportional to the product of the lengths of the query sequence and the database searched. Since the rate of growth in database sizes currently exceeds that of processor speeds, computers running BLAST are subjected to increasing load. Moreover, the hit-extension step takes about 90% of the total time. Thus, gapped alignments are allowed, and a condition was inserted before extending hits in the original BLAST algorithm: an extension is invoked only if there exist two non-overlapping word pairs (hits) on the same diagonal within a distance A of one another.
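The two-hit condition can be sketched as below; the word length W, the window A, and all names are illustrative parameters, not the published defaults.

```python
# Sketch of the two-hit trigger: an extension is invoked only when two
# non-overlapping word hits lie on the same diagonal within distance A.
from collections import defaultdict

W = 3       # word length (illustrative)
A = 10      # maximum distance between the two hits (illustrative)

def two_hit_triggers(hits):
    """hits: iterable of (query_pos, db_pos) word hits. Return the
    (diagonal, first_hit_pos, second_hit_pos) triples that would trigger
    an extension."""
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(d)
    triggers = []
    for diag, positions in by_diag.items():
        positions.sort()
        for i in range(1, len(positions)):
            gap = positions[i] - positions[i - 1]
            if W <= gap <= A:            # non-overlapping yet close enough
                triggers.append((diag, positions[i - 1], positions[i]))
    return triggers
```

Because isolated single hits no longer trigger extensions, far fewer extensions are attempted, which is where the speed-up of the two-hit strategy comes from.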

BLAST searches may be iterated, with a position-specific score matrix generated from significant alignments found in round i being used in round i+1. The BLAST algorithm is easily generalized to use an arbitrary position-specific score matrix in place of a query sequence and associated substitution matrix. The authors automated the procedure of generating such a matrix from the output of a BLAST run and modified the algorithm to take this matrix as input. The speed and ease of operation of the resulting Position-Specific Iterated BLAST, or PSI-BLAST, can bring the power of these methods into common use.

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. One very restrictive way to identify protein motifs is by regular expressions that must match each instance of the motif. The PROSITE database is a compilation of restricted regular expressions that describe protein motifs. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: what other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to simply searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. PHI-BLAST may be preferable to other flavors of BLAST because it is faster and because it allows the user to express a rigid pattern-occurrence requirement.

6.2.5 Some more Heuristic algorithms

Cantalloube and his team have implemented a homology searching algorithm called Automat, which can perform a fast matching of a given protein sequence against a large protein sequence database, and can also produce, for each residue of the sequence, occurrence frequencies for the number and length of matching oligopeptides in the database containing that residue. Knowledge of these frequencies is useful in assessing the probability, and thus the relevance, of the matches detected by the program. Matches longer than a given length can be listed for further analysis. Apart from a computation-time overhead proportional to the square of the sequence length, the scanning of the database is performed in the fastest possible way (one operation per residue), allowing the systematic use of this software on large databases.

Strelets et al. proposed a new algorithm for data bank homology search; the principal advantages of the new algorithm are linear computational complexity, low memory requirements and high sensitivity to the presence of local region homology.

Most algorithms still require scanning the entire collection of data, and are thus increasingly ineffective for larger, unstructured databases such as GenBank. FLASH (Fast Lookup Algorithm for String Homology) was designed using an indexing-based approach. This enables fast similarity searching through a large database of strings: thanks to a redundant table-lookup scheme, recovering database items that match a test sequence requires minimal data access.

6.2.6 FLASH

Let X denote a string (v_0, v_1, v_2, ..., v_{N-1}) of nucleotides or amino acids of length N. Each token, or element in the string, can assume one of rho possible values (for proteins, rho = 20; for nucleic acids, rho = 4). During preprocessing, a hash-table data structure A is constructed. The key idea is to generate a large set of highly descriptive indices

lambda_k(X_ij), k = 1 to d_I,

for each token in a given protein or DNA strand; X_ij represents the token at the jth location of the string X_i. The constant d_I is the index density, that is, the number of indices that are generated for each position j in X_i. During the creation of the hash table, the various indices are used to access bins in A, and an entry for the jth location of X_i is created in each. The process is repeated until all the strings in the database have been exhausted.

To search, the same approach is used. Given the test input X_r to be matched against the database D, indices for each of the N_r positions in X_r are generated using the same algorithm that was used to build A. For each generated index, the contents of the corresponding location in A are retrieved. The retrieved entries correspond to a set of string/position combinations from D that generated the same index value at storage time; for each of the recovered strings, the position is also recovered, so that the tokens that produced identical indices are properly aligned. It is then simple to identify and rank the similarity matches between X_r and each recovered string.
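A toy index in the spirit of FLASH might look like the following; it uses exact k-grams only, and the redundant, high-density index scheme of the real FLASH is not reproduced. All names are illustrative.

```python
# Toy k-gram index: build maps each k-gram to its (string id, position)
# occurrences; lookup tallies shared k-grams per (string, offset) so that
# the tokens which produced identical indices are aligned before counting.
def build_index(db, k=4):
    index = {}
    for sid, seq in enumerate(db):
        for j in range(len(seq) - k + 1):
            index.setdefault(seq[j:j + k], []).append((sid, j))
    return index

def lookup(index, query, k=4):
    votes = {}
    for j in range(len(query) - k + 1):
        for sid, pos in index.get(query[j:j + k], ()):
            key = (sid, pos - j)          # offset-corrected alignment
            votes[key] = votes.get(key, 0) + 1
    return votes
```

The vote counts per (string, offset) pair then rank the candidate matches between the test input and each recovered string.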

Apart from these, there are some more very good algorithms for homology searching. K. M. Chao et al., in their paper "Aligning two sequences within a specified diagonal band", describe an algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space, where N is the length of the shorter of the two sequences and W is the width of the band. X. Huang et al., in their paper "A space-efficient algorithm for local similarities", give a dynamic-programming local-similarity algorithm that needs only space proportional to the sum of the sequence lengths.

Hidden Markov model (HMM)-based search methods improve both the sensitivity and selectivity of sequence database searches. They use position-dependent scores to characterize and build a model for an entire family of sequences. HMMs work best with large sequence families. Bucher and Hofmann have proposed an algorithm combining aspects of HMMs and dynamic programming to build models for single sequences using position-independent scoring and affine gap penalties. The scores are drawn from a traditional scoring matrix, such as BLOSUM62. Instead of computing scores in log-odds space, they compute probabilities for the alignment. In the innermost loop of the dynamic programming, rather than choosing the step corresponding to the most likely alignment, they add the probabilities of all the possible local alignments. Thus, the number at each cell in the matrix is proportional to the sum of the probabilities of all the alignments ending at that point. The sum of these numbers over all the cells of the matrix is an estimate of the relatedness of the two sequences. This number is divided by the null estimate, which is the number calculated in a similar fashion for two equal-length random sequences.
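The max-versus-sum distinction can be sketched with a toy recurrence in which each cell accumulates exponentiated scores over all local alignments ending there; the scores, the scaling, and the treatment of gaps here are illustrative, not the published Bucher-Hofmann formulation.

```python
# Toy recurrence contrasting sum with max: each cell accumulates the
# exponentiated scores of all local alignments ending there, rather than
# keeping only the single best one. Scores and scaling are illustrative.
import math

def sum_local_alignments(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Return the sum, over all cells, of the summed 'probabilities'
    (exponentiated scores) of local alignments ending at each cell."""
    F = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    total = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = (math.exp(sub) * (1.0 + F[i - 1][j - 1])
                       + math.exp(gap) * (F[i - 1][j] + F[i][j - 1]))
            total += F[i][j]
    return total
```

Dividing `total` by the same quantity computed for random sequences of the same lengths gives the null-normalized relatedness estimate described above.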

7. Parallel Techniques for Sequence Similarity Searching

A search of the NBRF protein database, at about 10 million amino acids, with the rapid and popular search program FASTA and an input protein sequence of about 300 amino acids, takes about 20 minutes on a VAX 6510 (at about 7 Mflops; million floating-point operations per second). In this section, we survey some of the research attempts to realize the power of parallel computing.

To expedite the search, prior research efforts exploit the power of parallel computation. Two general parallelization methods can be used to speed up the search.

One method is to parallelize the comparison operation in which all processors cooperate to determine each similarity score.

The other method is to parallelize the entire search process in which each processor performs a number of comparisons independently. In the second case, a processor computes the entire similarity score for the sequences it is comparing.

The performance of any parallel computational approach depends on the balance of the workload among the processors: better performance is achieved when the percentage of load imbalance (PLIB) is small. The PLIB can thus be used to compare different parallel computational approaches.

PLIB = (LargestLoad - SmallestLoad)/(LargestLoad) * 100.

If PLIB is less than one, over 99% of the total amount of processing is done in parallel.

The first method uses fine-grain parallelism techniques. Fine-grain parallelism refers to the parallelism within the comparison algorithm. To search for the most similar sequence, one needs to compare the query sequence against all the sequences in the database. The amount of communication and idle time is prohibitive for the fine-grain approach. To reduce the communication time for longer sequences, where the number of processors is smaller than the length of the sequences, Edmiston decomposed the similarity matrix into computational blocks. Each processor calculates all elements in a block before it communicates with its neighbors.

The second method uses a coarser-grain parallelism approach, which has lower communication overhead. However, the overall performance of the coarser-grain approach depends on its load-balancing technique, which must assign approximately the same amount of work to each processor. There are two load-balancing techniques: the first was presented by Guan et al. and the second by Sittig et al.

Guan balanced the workload by partitioning the database into a number of portions according to the number of processors allocated. Ideally, each portion should be equal to the size of the database divided by the number of allocated processors.

In this approach, the advantage is low communication overhead, achieved by assigning each processor its own portion to search independently, without communicating with the other processors. The disadvantage is the potential for a high percentage of load imbalance due to poor partitioning, as biological sequences are of varying length.

Sittig also adopted a coarse-grain parallelism approach, using a dynamic load-balancing technique where the workload is assigned to each processor dynamically during query processing. This is in contrast to the static technique used by Guan, where the allocation of work is fixed prior to the commencement of the actual execution. This dynamic approach has the opposite advantage and disadvantage from Guan's. In this approach, the sequences in the database are first sorted in decreasing length order. One processor is used as the master to distribute the sequences to the workers (the remaining processors) for comparison and to collect the similarity scores back from them. The main job of the master is to keep the workers busy as long as there is work to be done. That is, when a worker processor completes the processing of a sequence, it requests an additional sequence from the master. This form of dynamic load balancing continues until all the sequences have been compared. The sequences in the database are sorted to ensure that the last sequence compared is the shortest one.

In this master-worker paradigm, the advantage is low PLIB while the disadvantage is higher communication overhead.
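A minimal sketch of such dynamic dispatch using a shared queue; threads stand in for worker processors, `compare` stands in for the similarity scorer, and all names are illustrative.

```python
# Sketch of Sittig-style dynamic load balancing: sequences sorted longest
# first feed a shared queue, and each worker pulls the next sequence as
# soon as it finishes its current one.
import queue
import threading

def dynamic_dispatch(seqs, num_workers, compare):
    work = queue.Queue()
    for s in sorted(seqs, key=len, reverse=True):    # longest first
        work.put(s)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                s = work.get_nowait()
            except queue.Empty:
                return                               # no work left
            r = compare(s)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In a real distributed setting, the queue would live on the master processor and each `get` would be a request message, which is the source of the higher communication overhead noted above.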

To obtain low communication overhead, low PLIB, and to eliminate any processor bottlenecks, there is a combined (bucket) approach. In this, a greedy allocation approach (as found in Sittig's method) is performed statically (as in Guan's). To achieve a near uniform allocation of sequences to P processors (buckets), the following greedy algorithm is used. First, the sequences are sorted by their length in decreasing order. Then, starting from the longest one, each sequence is placed into the bucket that has the smallest sum of sequence lengths. This, in fact, is the static equivalent of the dynamic approach of Sittig. One way to implement this sequence placement algorithm is given below.


  1. INPUT:
  2. inputfile = an input file which contains two fields: a sequence identification and its length.
  3. num_buck = the number of buckets
  4. OUTPUT:
  5. outputfile = an output file which contains two fields: a bucket number and a sequence identification.
  6. PROCEDURE
  7. LET seq_id = a sequence identification
  8. LET seq_size = a sequence length.
  9. LET bsize[i] = the sum of sequence lengths in bucket i.
  10. LET size = a temporary variable
  11. LET MAXINT = the largest integer.
  12. BEGIN
  13. FOR (i=1 to num_buck) bsize[i]=0;
  14. WHILE (NOT END OF FILE)
  15. READ seq_id and seq_size FROM inputfile
  16. size=MAXINT
  17. FOR (i=1 to num_buck)
  18. IF (bsize[i] < size)
  19. size=bsize[i]
  20. buck=i;
  21. END_IF
  22. END_FOR
  23. bsize[buck]=bsize[buck]+seq_size
  24. WRITE buck and seq_id TO outputfile
  25. END_WHILE
  26. END_PROCEDURE

Initially, the size of each bucket is set to zero by line 13. Before searching for the smallest bucket, the temporary variable size is initialized to the largest integer. Then, to find the smallest bucket, a loop (lines 17 to 22) examines all the buckets. On line 23, the size of the chosen bucket is updated. On line 24, the sequence identification and its assigned bucket are written to the output file. When the procedure is completed, the difference between the smallest and the largest buckets can be found easily by examining all the bucket sizes bsize[1..num_buck].
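The same greedy placement can be written compactly as follows; this is a sketch equivalent to the pseudocode above, with invented names and in-memory pairs in place of the input and output files.

```python
# Compact equivalent of the greedy bucket-placement pseudocode above.
def place_into_buckets(seqs, num_buck):
    """seqs: list of (seq_id, seq_size) pairs. Sort by length descending and
    put each sequence into the currently smallest bucket."""
    bsize = [0] * num_buck
    assignment = {}
    for seq_id, seq_size in sorted(seqs, key=lambda s: -s[1]):
        buck = min(range(num_buck), key=lambda i: bsize[i])   # smallest bucket
        bsize[buck] += seq_size
        assignment[seq_id] = buck
    return assignment, bsize
```

The spread of the final `bsize` values directly gives the PLIB figure defined earlier.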


7.1 Implementations of BLAST for Parallel Computers

The third part (extending the hits) of the BLAST algorithm is the most time-consuming. The comparison between the query sequence and a database sequence is completely independent of the comparisons of the query sequence with other database sequences; therefore, this part of BLAST is very well suited for parallelization to speed up the process of homology searching. The BLAST sequence comparison programs have been ported to a variety of parallel computers, both shared memory machines and distributed memory architectures.

7.1.1 Shared memory Version

The original source code for the BLAST programs contained a parallel version for Silicon Graphics PowerIRIS parallel computers, which works as follows. After one processor reads the entire database and the query sequence into the shared memory, all processors can access the data. Each processor compares the query sequence against different parts of the database. A counter variable, which serves as a pointer into the database, manages the distribution of the database sequences among the processors. Since this database pointer is located in shared memory, each processor can access it in a guarded region, get the next uncompared part of the database assigned, and move the pointer to the following uncompared database part. Each processor writes its individual results into a different part of the shared memory, where one processor can access and combine them after all database sequences have been compared against the query sequence.

7.1.2 Distributed memory version

For distributed memory machines, including clusters of workstations, the above method does not work, because no processor can access the data of the other processors directly. Therefore, the database pointer was replaced by an additional process, which handles the distribution of database sequences among the processors. Each processor reads the whole database and the query sequence and sends a request for database sequences to the new server process. The server process replies with the number of the database sequence to be compared next. Then, the process compares a chunk of database sequences of a previously defined size, stores the results in its local memory, and asks the server for the next part of the database. After the query sequence has been compared against the entire database, all processors but one send their individual results to the remaining processor, which combines the results and generates the output.

7.2 BLAZE, a Parallel implementation of Smith & Waterman Algorithm

Most word-based or index-based sequence search algorithms such as FASTA or BLAST use approximate methods initially to find regions of local homology. These methods are not as sensitive as the original full dynamic programming algorithms of Needleman & Wunsch or Smith & Waterman. Even when indexing methods use a word size of one (FASTA) or allow inexact word matching (BLAST), they do not consider all possible alignments and can miss biologically important sequence similarities.

The Needleman-Wunsch and Smith-Waterman methods gain their sensitivity by comparing each base or amino acid (residue) in the query with each base or amino acid in the database. In addition, they can utilize a table of amino acid similarity scores (such as log-likelihood replacement scores or PAM matrices) to increase the significance of similarities between proteins that are related to each other in evolution. One can also vary the penalty for the introduction of an insertion-deletion gap based on its length. Given a table of amino acid similarity scores and specific gap penalties, the Needleman-Wunsch and Smith-Waterman algorithms produce scores that are optimal for the global or local alignment of two sequences, respectively.

The approximate methods were developed to be able to align a query with every sequence in a database in a reasonable amount of computer time. Comparing a sequence with current databases using the Smith-Waterman or Needleman-Wunsch algorithms is extremely computer intensive requiring either a massively parallel computer or a supercomputer. They also require memory proportional to the length of the query times the length of the database.

Brutlag et al. have implemented the full dynamic programming algorithm of Smith and Waterman, as modified by Gotoh, on a massively parallel MasPar MP1104 computer (4096 4-bit processors with a total of 256 megabytes of memory). This implementation, named BLAZE, affords the complete sensitivity of the Smith and Waterman algorithm using PAM matrices and gap penalties, while simultaneously maintaining interactive performance. Searches of the SWISS-PROT (release 21) protein database (7,866,594 residues) with a 100 amino acid query take 14.4 seconds with a single penalty for the insertion or extension of a gap, and 35 seconds when using a full affine gap penalty. This represents about 55 and 22 million residue comparisons per second, respectively. These rapid search rates also permit full database inversion: one can compare all sequences in a database with all the others. The results of such a database inversion can be used to classify sequences into superfamilies, discover evolutionary relationships, or discover conserved coding segments.

The discussions above indicate that general-purpose massively parallel computers can give high sensitivity and accuracy of database search at interactive speeds. The MP1104 with 4096 processors has sufficient memory to hold the entire GenBank and SWISS-PROT databases simultaneously, which overcomes the I/O bottleneck plaguing many other efforts to parallelize sequence comparisons.

8. State-of-the-art Dedicated Machines

The dedicated machines are classified into two categories: the ASIC systems and the FPGA systems.

8.1 ASIC systems

An ASIC (Application-Specific Integrated Circuit) component is a chip devoted to a single function (or a restricted class of functions). Once designed and fabricated, it cannot be modified, and thus cannot evolve without major investment.

In dedicated ASIC systems for sequence comparisons, the computation is usually performed by a linear array of identical ASIC processors. The peak performance of these machines is impressive since all the processors (a few hundred) work concurrently and synchronously. The BioSCAN machine and BISP machine belong to this category.

BioSCAN accelerates the identification of similar segments for DNA or protein sequences without allowing gaps. A chip contains 812 1-bit processors, and a system with 16 chips is currently working, for a total of 12 992 processors. In that configuration, when scanning a database, the query sequence is limited to 12 992 characters. Currently, this is the most powerful system for detecting similar segments of identical length.

BISP (Biological Information Signal Processor) provides a high-speed and linearly extensible system that can locate similar subsequences of two DNA or protein sequences. It implements a modified version of the Smith and Waterman algorithm and allows many parameters to be set. A chip contains 16 processors, and a prototype machine of 16 chips (256 processors) has been validated, making possible the processing of sequences of unlimited length.

The computational power of these machines depends directly on the clock speed and the number of processors.

8.2 FPGA systems

The FPGA (Field-Programmable Gate Array) components are based on more recent technology. Their advantages compared to the ASIC approach are twofold: no specific chips need to be fabricated, and they can be reconfigured indefinitely (correction and evolution of the hardware is possible). However, they are slower and not as optimized as ASIC components. The Bioccelerator (http://www.compugen-us.com/index.html) and DeCypher II (http://www.timelogic.com) systems are designed with this technology.

The Bioccelerator system is a Unix workstation peripheral, attached to the sequence database server through a SCSI connection. Currently implemented algorithms include: framesearch, profilesearch, Smith-Waterman local alignment, and Needleman-Wunsch global and overlap alignment programs.

The DeCypher II system has been designed mainly for speeding up database searching with the Smith and Waterman algorithm. It is based on FPGA accelerator cards plugged into a personal workstation. The performance depends on the number of cards.

The SAMBA system belongs to the ASIC category, since the heart of the system is a dedicated VLSI processor array, but the complete system contains an FPGA-memory interface. The array is connected to the host workstation through an FPGA-memory board, which acts as an array controller and as a high-speed mechanism for correctly feeding the array and filtering results on the fly.

The array of the SAMBA prototype is composed of 32 identical full-custom chips, each housing four processors, leading to a 128-processor array. The chip has been designed at IRISA and provides a computational power of 400 million operations per second, so the array is able to reach a peak performance of 12.8 billion operations per second.

The structure of the array, a systolic linear array, is very well suited to string-processing parallelization. SAMBA includes many features not present in other dedicated systems.
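The parallelism a systolic linear array exploits can be illustrated in software: all cells of the Smith-Waterman matrix lying on the same anti-diagonal depend only on the two previous anti-diagonals, so a linear array with one processor per query residue can compute a whole anti-diagonal in one step. The sketch below simulates this wavefront order sequentially; the unit match/mismatch/gap scores are assumptions for illustration:

```python
def smith_waterman_wavefront(query, subject, match=2, mismatch=-1, gap=-1):
    """Compute the Smith-Waterman local-alignment score anti-diagonal by
    anti-diagonal. All cells H[i][j] with i + j == d are mutually
    independent given anti-diagonals d-1 and d-2, which is the
    parallelism a linear systolic array exploits."""
    n, m = len(query), len(subject)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):            # one wavefront step per anti-diagonal
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match/mismatch
                          H[i - 1][j] + gap,     # gap in subject
                          H[i][j - 1] + gap)     # gap in query
            best = max(best, H[i][j])
    return best
```

On hardware, the inner loop over `i` runs in a single clock tick across the processor array, which is why throughput scales directly with the number of processors.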

9. Future Computational Challenges

At the current time, the size N of biological sequence databases is growing at an exponential rate, and many investigators foresee a time when petabytes of sequence information will be available. On the other hand, the size of a query, P, tends to remain fixed; e.g., a protein sequence averages about 300 residues and the longest are about 1000 residues. So designers of efficient computational methods should be principally concerned with how the time to perform a search grows as a function of N. Despite this, all the currently used methods take an amount of time that grows linearly in N, i.e., they are O(N) algorithms. This includes not only rigorous methods like Smith and Waterman, but also popular heuristics such as FASTA and BLAST, and even the systolic array chips. What this implies is that when a database increases in size by a factor of 1000, all these methods will take 1000 times longer to search that database. The only difference among them is their coefficients of proportionality: e.g., Smith and Waterman takes roughly 1.0N milliseconds, BLAST roughly 1.0N microseconds, and the BISP chips roughly 0.3N microseconds for a search against a typical protein query sequence on a typical workstation. Thus, while a custom chip may take about 3 seconds to search 10 million residues or nucleotides, it will take 3000 seconds, or about 50 minutes, when there are 10 billion symbols to be searched. BLAST will take hours, and Smith and Waterman months, on these future databases.

What would be very desirable, and as argued above probably essential, is to have search methods whose running time is sublinear in N, i.e., O(N^alpha) for some alpha less than 1. For example, suppose there is an algorithm that takes O(N^0.5) time, which is to say that as N grows, the time taken grows as the square root of N. If the algorithm takes about 10 seconds on a 10-million-symbol database, then on 10 billion symbols it will take about sqrt(1000) = 31 times longer, or about 5 minutes. Note that while it is slower than the hypothetical chip on 10 million symbols, it is faster on 10 billion.
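The scaling argument above can be checked with a few lines of arithmetic. The linear coefficient below (0.3 microseconds per symbol for the BISP chips) is the one quoted earlier; the sublinear method is purely hypothetical, calibrated to take 10 seconds on 10 million symbols:

```python
def search_time_linear(n, coeff_seconds_per_symbol):
    """Time for an O(N) scan: coefficient times database size."""
    return coeff_seconds_per_symbol * n

def search_time_sublinear(n, n_ref, t_ref, alpha=0.5):
    """Time for a hypothetical O(N^alpha) method, calibrated so that
    it takes t_ref seconds on a database of n_ref symbols."""
    return t_ref * (n / n_ref) ** alpha

# BISP-style chip at 0.3 microseconds per symbol:
chip_small = search_time_linear(10_000_000, 0.3e-6)        # about 3 s
chip_big = search_time_linear(10_000_000_000, 0.3e-6)      # about 3000 s (~50 min)

# Hypothetical O(N^0.5) method: 10 s on 10 million symbols
sublinear_big = search_time_sublinear(10_000_000_000, 10_000_000, 10.0)
```

The crossover is the point of the argument: the linear chip wins on 10 million symbols (3 s versus 10 s), but on 10 billion symbols the sublinear method needs only about 316 seconds against the chip's 3000.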

10. Some Related Issues

10.1 Compression of nucleotide databases for fast searching

It is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching.

Williams and Zobel have designed a new direct coding compression scheme for use in homology search applications such as FASTA, BLAST and CAFE. The scheme is lossless (both nucleotide bases and wildcards are represented), yields compact storage, and has extremely fast decompression. Compared with other codings, direct coding is faster, requires much less memory, and has significantly lower retrieval times. The good compression and faster retrieval yield significant improvements in search times for high-speed homology search tools.
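The details of Williams and Zobel's direct coding are not reproduced here, but the core opportunity in nucleotide compression can be illustrated: the four bases fit in two bits each, a 4:1 saving over one byte per base. The sketch below handles pure ACGT data only; representing wildcards such as N requires an escape mechanism, which the real scheme provides and this toy version omits:

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack(seq):
    """Pack an ACGT-only string into bytes, four bases per byte."""
    out = bytearray()
    for k in range(0, len(seq), 4):
        group = seq[k:k + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | _CODE[base]
        byte <<= 2 * (4 - len(group))   # left-align a final partial group
        out.append(byte)
    return bytes(out), len(seq)

def unpack(packed, length):
    """Recover the original string from pack()'s output."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):      # four 2-bit fields per byte
            bases.append(_BASE[(byte >> shift) & 3])
    return "".join(bases[:length])
```

Decompression here is a table lookup per 2-bit field, which hints at why such schemes can decompress fast enough to sit inside a search loop.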

10.2 Database Searching and Clustering

As a result of scientific, historical and political factors, the data generated by the human genome project are disseminated in dozens of independent and heterogeneous databases, and the information found in these databases is not sufficiently cross-referenced. F. Achard and P. Dessen have developed genXref, an automated system for generating cross-reference links, because embarking on a manual cross-referencing of genome data would require too much expensive human expertise. genXref uses information retrieval technology to generate links between databases.

To introduce some order into these data, it is often desirable to group the sequences in a database, in a genome or in a set of genomes into protein families. This provides evolutionary, functional and structural information on the sequences. In spite of the high degree of sophistication of existing approaches for searching a sequence database with a query sequence, it is still virtually impossible to identify quickly and clearly the cluster of sequences that a given query sequence belongs to. Antje Krause et al. have introduced a procedure called SYSTEmatic Re-Searching (SYSTERS) which, given a query sequence, delineates the cluster of sequences that the query is a member of. Membership in such a cluster is not given with some probability or other quantitative measure: a database sequence is either in the SYSTERS cluster or it is not.

SYSTERS systematically iterates database searching, re-searching with each sequence identified in the current search. In contrast to PSI-BLAST, which generates a position-specific score matrix as input for the next search, SYSTERS takes a strictly conservative approach in which only sequences of high similarity to the last query are used for re-searching. This approach allows a set of sequences related to the query to be defined in which false positives are extremely unlikely.
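The iterative strategy amounts to a closure computation: start from the query, re-search with every newly admitted sequence, and admit only hits above a strict similarity threshold. The identity-based scoring function below is a stand-in for a real database search (BLAST or Smith-Waterman statistics), and the threshold value is an assumption; this is a sketch of the control flow, not SYSTERS itself:

```python
def identity(a, b):
    """Toy stand-in for a database search score: fraction of identical
    positions over the shorter sequence."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def iterated_search_cluster(query, database, threshold=0.9):
    """Iterated re-searching: every sequence admitted to the cluster is
    itself used as a new query, and only hits of high similarity to that
    query are admitted. The result is a yes/no set, not a ranked list."""
    cluster = {query}
    frontier = [query]
    while frontier:
        q = frontier.pop()
        for seq in database:
            if seq not in cluster and identity(q, seq) >= threshold:
                cluster.add(seq)
                frontier.append(seq)    # re-search with the new member
    return cluster
```

Note how the third sequence below enters the cluster transitively: it is similar enough to an admitted member even though it falls below threshold against the original query, which is what iterated re-searching buys over a single search pass.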

10.3 Intermediate sequences increase the detection of homology between sequences

Park, J., et al. have introduced a method that can relate two homologous sequences which have diverged beyond the point at which their homology can be recognized directly, through a third sequence that is suitably intermediate between the two.

10.4 A workbench for large-scale sequence homology analysis

There is a bottleneck in the manual evaluation of the matches reported by the search programs, which often form a list of many thousands of potential homologies. Restricting the number of results by using a high score cutoff only makes the problem of missed similarities worse. E. L. L. Sonnhammer and R. Durbin have designed a workbench which filters out as many unwanted matches as possible and thus makes large-scale sequence homology analysis practical.

They have also designed a graphics software tool for sequence comparison. Graphical dot matrix plots can provide the most complete and detailed comparison of two sequences. DOTTER is a dot-plot program for the X Window System, which can compare DNA or protein sequences, and also DNA versus protein. The main feature of DOTTER is that the user can vary the stringency cutoffs interactively, so that the dot matrix only needs to be calculated once. Other useful features are dot matrix compression, mouse-controlled zooming, sequence alignment display and saving/loading of dot matrices. Since the matrix only has to be calculated once, and since the algorithm is fast and linear in space, DOTTER is practical to use even for sequences as long as cosmids.
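DOTTER's key design choice, computing the matrix once and thresholding it interactively, can be sketched as follows. The windowed identity count used here is an assumed toy score, not DOTTER's actual scoring scheme:

```python
def dot_matrix(a, b, window=3):
    """Compute a windowed comparison matrix once: entry (i, j) counts
    identities between the length-`window` windows starting at a[i]
    and b[j]. This is the expensive step."""
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    return [[sum(a[i + k] == b[j + k] for k in range(window))
             for j in range(cols)] for i in range(rows)]

def dots_at_cutoff(matrix, cutoff):
    """Re-threshold the precomputed matrix at a new stringency. This is
    cheap, so the user can vary the cutoff interactively without the
    matrix ever being recomputed."""
    return [(i, j) for i, row in enumerate(matrix)
            for j, score in enumerate(row) if score >= cutoff]
```

Separating the O(len(a) * len(b) * window) matrix computation from the thresholding is what makes interactive stringency adjustment feasible even for long sequences.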

Numerous homologous sequences from diverse species can be retrieved from databases by BLAST or FASTA. However, the current database search techniques are not able to discriminate whether the best hit is an orthologue, i.e., the functionally corresponding counterpart of the query sequence in another species, or only a paralogue, i.e., a homologous member of a multigene family that shares, in the best case, only some functional features with the query. Yan P. Yuan et al. have developed a method that predicts which hits are probable orthologues and which are paralogues.

LALNVIEW

LALNVIEW is a graphical program designed by L. Duret et al. for visualizing local alignments between two sequences (protein or nucleic acids). Sequences are represented by colored rectangles to give an overall picture of their similarities. The program can also display sequence features (exon, intron, active site, domain, propeptide, etc.) along with the alignment. When LALNVIEW is used through the authors' Web servers, sequence features are automatically extracted from database annotations (SWISS-PROT, GenBank, or EMBL) and displayed with the alignment. LALNVIEW is a useful tool for relating pairwise sequence homology to what is known about the structure or function of the sequences.

DIALIGN

DIALIGN is a new method for pairwise alignment of nucleic acid and protein sequences. While standard alignment programs rely on comparing single residues and imposing gap penalties, DIALIGN constructs alignments by comparing whole segments of the sequences, and no gap penalty is employed. This program is especially suitable when sequences share only local similarities.

11. Conclusion

The concurrent development of rapid methods for molecular cloning, DNA sequencing, high-performance computer workstations, and rapid protein and DNA sequence comparison algorithms has revolutionized the practice of molecular biology. Newly determined sequences are routinely compared against large sequence databases, and inferences about structure and function are frequently based on sequence similarity. The predicted growth of sequence databases and the advent of large-scale DNA sequencing projects have prompted increased interest in better methods for comparing protein and DNA sequences. Biological sequence comparison algorithms must balance sensitivity (the ability to calculate high-ranking scores for distantly related sequences) with selectivity (the ability to calculate low-ranking scores for unrelated sequences). This report has discussed some of the approaches for homology searching in biological sequence databases.
