Sunday, April 25, 2010

Simple Statistical Algorithm for Biological Sequence Compression

Abstract:

This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.


1. Introduction
Modelling DNA and protein sequences is an important step in understanding biology.
Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),
cytosine (C), guanine (G) and thymine (T). In its double-helix form, two complementary
strands are joined by hydrogen bonds joining A with T and C with G. The
reverse complement of a DNA sequence is also considered when comparing DNA
sequences. Certain regions in a DNA sequence are translated to proteins, which control
the development of organisms. The alphabet of protein sequences consists of 20
amino acids, each of which is determined by a triplet of nucleotides called a codon.
The amount of DNA sequenced from organisms is increasing rapidly. Compression
of biological sequences is useful, not primarily for managing the genome database,
but for modelling and learning about sequences. Work by Stern et al. [21] recognizes
the importance of mutual compressibility for discovering patterns of interest from
genomes. Chen et al. [6] and Powell et al. [19] show that compressibility is a good
measurement of relatedness between sequences and can be effectively used in sequence
alignment and evolutionary tree construction.
Since DNA is the “instruction of life”, it is expected that DNA sequences are not
random and should be compressible. Some DNA sequences are highly repetitive. It is
estimated that 55% of the human genome is repeat DNA. A repeat subsequence is a
copy of a previous subsequence in the genome, either forward or reverse complement.
Most DNA repeats are not exact as nucleotides can be changed, inserted or deleted.
As an example, the Alu family consists of repeats of length about 300 bases, and any one
is only about 87% similar to a consensus sequence.

Interestingly, most general purpose text compression algorithms fail to compress
DNA to below the naive 2 bits per symbol. That is because DNA regularities are
different from those in text and are rarely modelled by those compressors. A number
of special purpose compression algorithms for DNA have been developed recently.
Most of these search for repeat subsequences and encode them by reference to a
previous instance. As a DNA subsequence could be (approximately) repeated many
times, using information from many of those repeat positions is expected to give
better compression ratios.
In this paper, we present the expert model (XM) and an algorithm for biological
sequence compression. Our compressor encodes each symbol by estimating the
probability based on information obtained from previous symbols. If the symbol is
part of a repeat, the information from one or more previous occurrences is used.
Once the symbol’s probability distribution is determined, it is encoded by a primary
compression algorithm such as arithmetic coding.
This paper is organized as follows. Section 2 reviews current research on biological
compression. Our expert model is described in section 3 and experimental results are
presented in section 4. Finally, section 5 concludes our work.

2. Background

Most compression algorithms fall into one of two categories, namely substitutional
compression and statistical compression. Those in the former class replace a long
repeated subsequence by a pointer to an earlier instance of the subsequence or to
an entry in a dictionary. Examples of this category are the popular Lempel-Ziv
compression algorithms [25, 26] and their variants. As DNA sequences are known to
be highly repetitive, a substitutional scheme is a natural approach to take. Indeed,
most DNA compressors to date are in this category.
On the other hand, a statistical compression encoder such as prediction by partial
match (PPM) [8] predicts the probability distribution of each symbol. Statistical
compression algorithms depend on assumptions about how the sequence is generated
to calculate the distribution. These assumptions are said to be the model of the
sequence. If the model gives a high probability to the actual value of the next symbol,
good compression is obtained. A model that produces good compression makes good
predictions and is a good description of the data.
The earliest special purpose DNA compression algorithm found in the literature is
BioCompress developed by Grumbach and Tahi [11]. BioCompress detects an exact
repeat in DNA using an automaton, and uses Fibonacci coding to encode the length
and position of its previous location. If a subsequence is not a repeat, it is encoded
by the naive 2 bits per symbol technique. The improved version, BioCompress-2
[12] uses a Markov model of order 2 to encode non-repeat regions. The Cfact DNA
compressor developed by Rivals et al. [20] also searches for the longest exact repeats
but is a two-pass algorithm. It builds the suffix tree of the sequence in the first
pass, and does the actual encoding in the second pass. Regions not repeated are also
encoded by 2 bits per symbol. The Off-line approach by Apostolico and Lonardi
[3] iteratively selects repeated substrings for which encoding would gain maximum
compression.
A similar substitution approach is used in GenCompress by Chen et al. [6] except
that approximate repeats are exploited. An inexact repeat subsequence is encoded by
a pair of integers, as for BioCompress-2, and a list of edit operations for mutations,
insertions and deletions. Since almost all repeats in DNA are approximate, GenCompress
obtains better compression ratios than BioCompress-2 and Cfact. The same
compression technique is used in the DNACompress algorithm by Chen et al. [7],
which finds significant inexact repeats in one pass and encodes these in another pass.
Most other compression algorithms employ similar techniques to GenCompress to
encode approximate repeats. They differ only in the encoding of non-repeat regions
and in detecting repeats. The CTW+LZ algorithm developed by Matsumoto et al.
[16] encodes significantly long repeats by the substitution method, and encodes short
repeats and non-repeat areas by context tree weighting [23]. At the cost of time
complexity, DNAPack by Behzadi and Le Fessant [4] employs a dynamic programming
approach to find repeats. Non-repeat regions are encoded by the best choice from an
order-2 Markov model, context tree weighting, and naive 2 bits per symbol methods.
Several DNA compression algorithms combine substitution and statistical styles.
An inexact repeat is encoded using (i) a pointer to a previous occurrence and (ii) the
probabilities of symbols being copied, changed, inserted or deleted. In the NML
algorithm by Tabus et al. [22] and its improvement, GeNML by Korodi and Tabus
[14], the DNA sequence is split into fixed size blocks. To encode a block, the algorithm
searches the history for a regressor, which is a subsequence having the minimum
Hamming distance from the current block, and represents it by a pointer to the
match as well as a bit mask for the differences between the block and the regressor.
The bit mask is encoded using a probability distribution estimated by the normalized
maximum likelihood of similarity between the regressor and the block.
Probably the only two pure statistical DNA compressors published so far are CDNA
by Loewenstern and Yianilos [15] and ARM by Allison et al. [2]. In the former algorithm,
the probability distribution of each symbol is obtained by approximate partial
matches from history. Each approximate match is with a previous subsequence having
a small Hamming distance to the context preceding the symbol to be encoded. Predictions
are combined using a set of weights, which are learnt adaptively. The latter
ARM algorithm forms the probability of a subsequence by summing the probabilities
over all explanations of how the subsequence is generated. Both these approaches
yield significantly better compression ratios than those in the substitutional class and
can also produce information content sequences. CDNA has many parameters which
do not have biological interpretations. Both are very computationally intensive.
The expert model presented in this paper is a statistical algorithm. The encoder
maintains a panel of experts and combines them for prediction but a much simpler
and computationally cheaper mechanism is used than in those above. The framework
allows any kind of expert to be used, though we report here only experts obtained from
statistics and repetitiveness of sequences. Weights for expert combination are based
on expert performance. Our compressor is found to be superior to all compression
algorithms published to date, and its speed is practical. The algorithm is capable of biological
knowledge discovery based on per element information content sequences [10]. This
is a purpose of our compressibility research.

3. Algorithm description

As a statistical method, our XM algorithm compresses each symbol by forming
the probability distribution for the symbol and then using a primary compression
scheme to code it. The probability distribution at a position is based on symbols
seen previously. Correspondingly, the decoder, also having seen all previous decoded
symbols, is able to compute the identical probability distribution and can recover the
symbol at the position.
In order to form the probability distribution of a symbol, the algorithm maintains a
set of experts, whose predictions of the symbol are combined into a single probability
distribution. An expert is any entity that can provide a probability distribution at a
position. Expert opinions about a symbol are blended to give a combined prediction
for the symbol.
The statistics of symbols may change over the sequence. One expert may perform
well on some regions, but could give bad advice on others. A symbol is likely to
have statistical properties similar to its surrounding context, particularly the context
preceding the symbol. The reliability of an expert is evaluated from its recent
predictions. A reliable expert has high weight for combination while an unreliable
one has little influence on the final prediction or may be ignored.


3.1. Type of experts
An expert can be anything that provides a reasonably good probability distribution
for a position in the sequence. A simple expert can be a Markov model (Markov
expert). An order-k Markov expert gives the probability of a symbol at a position
given the k preceding symbols. Initially, the Markov expert does not have any prior
knowledge of the sequence and thus gives a uniform distribution to a symbol. The
probability distribution adapts as the encoding proceeds. Essentially, the Markov
expert provides the background statistical distribution of symbols over the sequence.
Here we use an order-2 Markov expert for DNA, and order-1 for protein.
Different areas of a DNA sequence may have differing functions and thus may have
different symbol distributions. Another type of expert is the context Markov expert,
whose probability distribution is not based on the entire history of the sequence but
on a limited preceding context. In other words, the context Markov expert bases its
prediction on the local statistics. The context Markov expert currently used by XM
is order-1 with a context of 512 previous symbols.
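
To make this concrete, the following is a minimal sketch of an adaptive Markov expert (class and method names are hypothetical, not taken from the XM source; symbols are assumed to be coded as integers). The same structure, with counts kept only over the most recent 512 symbols, could serve as the context Markov expert.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an adaptive order-k Markov expert. Counts are updated only after a symbol
// has been encoded, so the decoder can maintain an identical model.
class MarkovExpert {
    private final int order;                 // k preceding symbols form the context
    private final int alphabetSize;          // 4 for DNA, 20 for protein
    private final Map<String, int[]> counts = new HashMap<>();

    MarkovExpert(int order, int alphabetSize) {
        this.order = order;
        this.alphabetSize = alphabetSize;
    }

    // Probability distribution of the next symbol, given the k preceding symbols.
    // With no counts yet, this reduces to the uniform distribution.
    double[] predict(String history) {
        int[] c = counts.getOrDefault(context(history), new int[alphabetSize]);
        int total = 0;
        for (int x : c) total += x;
        double[] p = new double[alphabetSize];
        for (int s = 0; s < alphabetSize; s++)
            p[s] = (c[s] + 1.0) / (total + alphabetSize);
        return p;
    }

    // Called once the actual symbol is known, so the distribution adapts as encoding proceeds.
    void update(String history, int symbol) {
        counts.computeIfAbsent(context(history), k -> new int[alphabetSize])[symbol]++;
    }

    private String context(String history) {
        return history.length() <= order ? history
                                         : history.substring(history.length() - order);
    }
}
```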
The compressibility of biological sequences comes mainly from repeated subsequences.
Therefore, it is important to include experts that make use of this feature.
XM employs a copy expert that considers the next symbol to be part of a copied
region from a particular offset. A copy expert with offset f suggests that the symbol
at position i is likely to be the same as the symbol at position i − f.
A copy expert does not blindly give a high probability to its suggested symbol. It
uses an adaptive code [5], over some recent history, for correct/incorrect predictions.
The copy expert gives a probability to its predicted symbol of:
$$p = \frac{r + 1}{w + 2} \qquad (1)$$
where w is the window size over which the expert reviews its performance and r is
the number of correct predictions the expert has made. The remaining probability,
1 − p, is distributed evenly to the other symbols in the alphabet.
For complementary reverse repeats, a similar reverse expert is used. This works
exactly the same as the copy expert, except that it suggests the complementary
symbol to the one from the earlier instance and it proceeds in the reverse direction.
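
A sketch of such a copy expert follows (hypothetical names, assuming symbols are coded as integers); a reverse expert would differ only in reading the complementary symbol and moving in the opposite direction.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Sketch of a copy expert with offset f: it predicts that the symbol at position i
// equals the symbol at position i - f, with confidence p = (r + 1) / (w + 2) as in
// equation (1), evaluated over the last w predictions.
class CopyExpert {
    private final int offset;                       // f
    private final int window;                       // w in equation (1)
    private final int alphabetSize;
    private final Deque<Boolean> recent = new ArrayDeque<>();  // last w outcomes
    private int correct = 0;                        // r in equation (1)

    CopyExpert(int offset, int window, int alphabetSize) {
        this.offset = offset;
        this.window = window;
        this.alphabetSize = alphabetSize;
    }

    double[] predict(int[] sequence, int i) {
        int predicted = sequence[i - offset];
        // recent.size() equals w once the window has filled up.
        double p = (correct + 1.0) / (recent.size() + 2.0);
        double[] dist = new double[alphabetSize];
        Arrays.fill(dist, (1.0 - p) / (alphabetSize - 1));  // 1 - p spread over the rest
        dist[predicted] = p;
        return dist;
    }

    void update(int[] sequence, int i, int actual) {
        boolean hit = sequence[i - offset] == actual;
        recent.addLast(hit);
        if (hit) correct++;
        if (recent.size() > window && recent.removeFirst()) correct--;
    }
}
```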

3.2. Proposing experts

At position i of the sequence, there are O(i) possible copy and reverse experts.
This is too many to combine efficiently and anyway most would be ignored. To be
efficient, the algorithm must use at most a small number of copy and reverse experts
at any one time. We currently employ a simple hashing technique to propose likely
experts. Every position is stored in a hash table with the hash key composed of h
symbols preceding the position. If there is an opening for a new expert at any point,
the hash table is consulted.
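
A minimal sketch of this hashing step (hypothetical names): positions are indexed by the h symbols that precede them, and when a new copy expert is needed the positions sharing the current context are returned as candidate offsets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the hash-based proposal of copy experts.
class ExpertProposer {
    private final int h;                                   // hash key length
    private final Map<String, List<Integer>> index = new HashMap<>();

    ExpertProposer(int h) { this.h = h; }

    // Record position i under the h symbols that precede it.
    void add(String sequence, int i) {
        if (i < h) return;
        index.computeIfAbsent(sequence.substring(i - h, i), k -> new ArrayList<>()).add(i);
    }

    // Earlier positions whose preceding h symbols match the current context;
    // a candidate position j yields a copy expert with offset i - j.
    List<Integer> propose(String sequence, int i) {
        if (i < h) return List.of();
        return index.getOrDefault(sequence.substring(i - h, i), List.of());
    }
}
```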

3.3. Combining expert predictions

The core part of our XM algorithm is the evaluation and combination of expert
predictions. Suppose a panel of experts E is available to the encoder. Expert θ_k gives
the prediction P(x_{n+1} | θ_k, x_{1..n}) of symbol x_{n+1} based on its observations of the
preceding n symbols. A sensible way to combine experts' predictions is based on Bayesian
averaging:

$$P(x_{n+1} \mid x_{1..n}) = \sum_{k \in E} P(x_{n+1} \mid \theta_k, x_{1..n})\, w_{k,n} = \sum_{k \in E} P(x_{n+1} \mid \theta_k, x_{1..n})\, P(\theta_k \mid x_{1..n}) \qquad (2)$$

In other words, the weight w_{k,n} of expert θ_k for encoding x_{n+1} is the posterior
probability P(θ_k | x_{1..n}) of θ_k after encoding n symbols. w_{k,n} can be estimated by
Bayes's theorem:

$$w_{k,n} = P(\theta_k \mid x_{1..n}) = \frac{\prod_{i=1}^{n} P(x_i \mid \theta_k, x_{1..i-1})\, P(\theta_k)}{\prod_{i=1}^{n} P(x_i \mid x_{1..i-1})} \qquad (3)$$

If we assume that every expert has the same prior probability P(θ_k), then normalizing
equation 3 by a common factor M we have:

$$w_{k,n} = \frac{1}{M} \prod_{i=1}^{n} P(x_i \mid \theta_k, x_{1..i-1}) \qquad (4)$$

The normalization factor M in fact does not matter, as equation 2 can be normalized
again so that Σ P(x_{n+1} | x_{1..n}) = 1. Take the negative log of equation 4 and
ignore the constant term:

$$-\log_2(w_{k,n}) \propto -\sum_{i=1}^{n} \log_2 P(x_i \mid \theta_k, x_{1..i-1}) \qquad (5)$$

Since −log_2 P(x_i | θ_k, x_{1..i−1}) is the cost of encoding symbol x_i by expert θ_k, the right
hand side of equation 5 is the length of the encoding of subsequence x_{1..n} by expert θ_k.
As we want to evaluate experts on a recent history of size w, only the message length
of encoding symbols x_{n−w+1..n} is used to determine the weights of experts. We find that
the algorithm works best when the negative log_2 of the expert weight varies as three
times the average code length over a window of size w = 20:

$$-\log_2(w_{k,n}) \propto -\frac{3}{w} \sum_{i=n-w+1}^{n} \log_2 P(x_i \mid \theta_k, x_{1..i-1}) = 3\,\mathrm{AveMsgLen}(x_{n-w+1..n} \mid \theta_k) \qquad (6)$$

or

$$w_{k,n} \propto 2^{-3\,\mathrm{AveMsgLen}(x_{n-w+1..n} \mid \theta_k)} \qquad (7)$$
Suppose there are three hypotheses about how a symbol is generated: by the
distribution of the species genome; by the distribution of the current subsequence;
or by repeating from an earlier subsequence. We therefore entertain three experts
for these hypotheses: (i) a Markov expert for the species genome distribution, (ii) a
context Markov expert for the local distribution, and (iii) a repeat expert, which is
the combination of any available copy and reverse experts, for the third hypothesis.
The experts’ predictions are blended as in equations 2 and 7.
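
As an illustration, a minimal sketch of this blending step follows (hypothetical method name; each expert's average message length over the recent window is assumed to be tracked as in equation 6).

```java
// Blend expert predictions as in equations (2) and (7): each expert's weight is
// 2^(-3 * AveMsgLen) over the recent window, and the blended distribution is the
// normalized weighted mixture of the experts' distributions.
static double[] blend(double[][] predictions, double[] aveMsgLen, int alphabetSize) {
    double[] weights = new double[predictions.length];
    double sumW = 0;
    for (int k = 0; k < predictions.length; k++) {
        weights[k] = Math.pow(2.0, -3.0 * aveMsgLen[k]);   // equation (7)
        sumW += weights[k];
    }
    double[] combined = new double[alphabetSize];
    for (int k = 0; k < predictions.length; k++)
        for (int s = 0; s < alphabetSize; s++)
            combined[s] += (weights[k] / sumW) * predictions[k][s];  // equation (2)
    return combined;
}
```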
If a symbol is part of a significant repeat, the copy or reverse expert of that repeat
must predict significantly better than a general prediction such as that from the
Markov expert. We therefore define a listen threshold, T, to determine the reliability
of a copy or reverse expert. A copy or reverse expert is considered reliable if its
average code word length is smaller than C_mk − T bits, where C_mk is the average code
word length of the Markov expert. T is a parameter of the algorithm.
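
A sketch of this test, under the same assumptions as the snippets above:

```java
// A copy or reverse expert is consulted only if its recent average code length beats
// the Markov expert's average code length by at least the listen threshold T (in bits).
static boolean isReliable(double expertAveLen, double markovAveLen, double T) {
    return expertAveLen < markovAveLen - T;   // e.g. T = 0.5 bits for DNA, 1.0 for protein
}
```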
The algorithm can be used as an entropy estimator or a compressor for biological
sequences. The information content of every single symbol is estimated by the negative
log of its probability. To compress the sequence, we use arithmetic coding [24]
to code each symbol based on the probability distribution combined from experts.
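
For example, the per-symbol information content follows directly from the blended distribution (a sketch; the arithmetic coder itself is a standard component and is omitted here):

```java
// Information content, in bits, of the symbol actually observed under the blended
// distribution; summing these over the sequence approximates the compressed size an
// arithmetic coder would achieve.
static double informationContent(double[] blended, int symbol) {
    return -Math.log(blended[symbol]) / Math.log(2.0);
}
```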


4. Experimental results

We implemented the encoder and decoder of XM in Java and ran experiments on
a workstation with a Pentium IV 2.4 GHz CPU and 1 GB of RAM, using the Sun Java
run-time environment 1.5. The compression results are calculated from the size of real
encoded files. Note that the figures for actual compression and information content
are similar up to four decimal places. The subtle difference between the information
content computation and the actual compression is due to rounding in arithmetic
coding and padding the last byte of the encoded files.

For comparison, we applied our algorithm on a standard dataset of DNA sequences
that has been used in most other DNA compression publications. The dataset
contains 11 sequences including two chloroplast genomes (CHMPXX and CHNTXX),
five human genes (HUMDYSTROP, HUMGHCSA, HUMHBB, HUMHDABCD
and HUMHPRTB), two mitochondrial genomes (MPOMTCG and MTPACG)
and genomes of two viruses (HEHCMVCG and VACCG). For DNA compression, we
use a hash key of length 11 and a listen threshold of 0.5 bits.



Table 1. Comparison of DNA compression.

Table 1 compares the compression results, in bits per symbol (bps), of XM to those
of other DNA compressors on the dataset. Due to space limitations, we present here
the most efficient algorithms, including BioCompress-2 (BioC) [12], GenCompress
(GenC) [6], DNACompress (DNAC) [7], DNAPack (DNAP) [4], CDNA [15] and
GeNML [14]. Comparison with other DNA compressors can be found on the website:
ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/. The results of
CDNA are reported for only 9 sequences, to a precision of two decimal places. The
GeNML results are also reported without the sequence HUMHBB, and to two decimal
places, but we were able to obtain higher precision by downloading the encoded
files from the authors' website. We include the average compression results of each
algorithm in the last row.
XM outperforms all other algorithms on most sequences in the standard dataset.
The average compression ratio is also significantly better. For CDNA and GeNML,
compression results for several sequences are missing, so we are unable to compute
the same average; instead, we compare averages over only the available results.
The average compression ratio of the nine sequences reported for CDNA is 1.6911 bps,
while XM's average performance on the same set is 1.6815 bps. On the ten sequences
excluding HUMHBB, GeNML averages 1.6980 bps, compared to XM's 1.6883 bps.
Total time for XM to encode these 11 sequences is about 8 seconds. Decoding time
is similar since both encoder and decoder do essentially the same computation.





Figure 1. Information content of the HUMHBB sequence.

As a statistical compressor, the expert model is able to produce the information
content sequence from DNA or protein. This is important when we want to analyze
areas of interest [21, 9, 10]. For example, figure 1 shows a graph of information content
along the HUMHBB sequence. The data in the graph is smoothed with a window
size of 300 for viewing purposes. One can notice spikes in the graph corresponding
to areas of repeats in the sequence.
The alphabet for proteins consists of 20 symbols and thus the base line of protein
entropy is log2(20) ≈ 4.322 bps. Similar to DNA, most general purpose compressors
fail to compress to less than that base line. Nevill-Manning and Witten [18] designed
CP, a protein-oriented compression algorithm based on PPM. However, compression
ratios obtained by CP are only marginally better than the base line entropy. Several
other attempts such as ProtComp [13], LZ-CTW [16] and BW [1] show that protein
sequences are indeed compressible, with better compression ratios.
We experimented with compressing protein using XM on a protein corpus gathered
by [18], which consists of the proteomes of four species: Haemophilus influenzae (HI),
Saccharomyces cerevisiae (SC), Methanococcus jannaschii (MJ) and Homo sapiens
(HS). As an amino acid is coded by three nucleotides, we use a shorter hash key for
protein, of length 6. The listen threshold is raised to 1.0 bit as the upper bound
entropy of protein is 4.322 bps instead of 2.0 bps for DNA.




Table 2. Comparison of protein compression.
Table 2 shows the compression ratios of CP, ProtComp, LZ-CTW and XM on the four protein datasets.

Note that an incorrect protein corpus that was more compressible was made available
at some point, resulting in significantly lower compression ratios being reported in
ProtComp [13] and BW [1]. We obtained the compression results of ProtComp on
the correct protein corpus from the author's website but were unable to do so for
BW as the authors have moved to new projects [17]. We found that our algorithm is
able to compress proteins better than CP and LZ-CTW, and marginally better than
ProtComp, for all sequences in the corpus.


5. Conclusion
We have presented the expert model, XM, which is simple and based on biological
principles. The associated compression algorithm is efficient and effective for both
DNA and protein sequence compression. The algorithm utilizes approximate repeats
and statistical properties of the biological sequence for compression. As a statistical
compression method, XM is able to compute the information content of every symbol
in a sequence which is useful in knowledge discovery [21, 9, 10]. Our algorithm
is shown to outperform all published DNA and protein compressors to date while
maintaining a practical running time.

References
[1] D. Adjeroh and F. Nan. On compressibility of protein sequences. DCC, pages 422–434, 2006.
[2] L. Allison, T. Edgoose, and T. I. Dix. Compression of strings with approximate repeats. ISMB,
pages 8–16, 1998.
[3] A. Apostolico and S. Lonardi. Compression of biological sequences by greedy off-line textual
substitution. DCC, pages 143–152, 2000.
[4] B. Behzadi and F. Le Fessant. DNA compression challenge revisited: A dynamic programming approach. CPM, 2005.

[5] D. M. Boulton and C. S. Wallace. The information content of a multistate distribution. J. Theoretical Biology, 23(2):269–278, 1969.
[6] X. Chen, S. Kwong, and M. Li. A compression algorithm for DNA sequences and its applications
in genome comparison. RECOMB, page 107, 2000.
[7] X. Chen, M. Li, B. Ma, and T. John. DNACompress: Fast and effective DNA sequence
compression. Bioinformatics, 18(2):1696–1698, Dec 2002.
[8] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string
matching. IEEE Trans. Comm., COM-32(4):396–402, April 1984.
[9] T. I. Dix, D. R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern. Exploring long DNA
sequences by information content. Probabilistic Modeling and Machine Learning in Structural
and Systems Biology, Workshop Proc, pages 97–102, 2006.
[10] T. I. Dix, D. R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern. Comparative analysis
of long DNA sequences by per element information content using different contexts. BMC
Bioinformatics, to appear, 2007.
[11] S. Grumbach and F. Tahi. Compression of DNA sequences. DCC, pages 340–350, 1993.

[12] S. Grumbach and F. Tahi. A new challenge for compression algorithms: Genetic sequences.
Inf. Process. Manage., 30(6):875–886, 1994.

[13] A. Hategan and I. Tabus. Protein is compressible. NORSIG, pages 192–195, 2004.

[14] G. Korodi and I. Tabus. An efficient normalized maximum likelihood algorithm for DNA
sequence compression. ACM Trans. Inf. Syst., 23(1):3–34, 2005.
