A genomic signature is a mathematical structure,
typically a vector of numbers, which is used to denote unique characteristics of
a DNA sequence. A DNA sequence can be defined as a long string
over the alphabet
={A,C,G,T}
of bases Adenine (A), Guanine (G), Cytosine (C), and Thymine (T), that
constitute DNA. The genome G of a species is defined as the set of
all its chromosomes. Each chromosome is a genomic sequence, as is any
subsequence of it. Let
be a
sufficiently long (greater than a few kilobases (kb)) genomic sequence taken
from genome G. Then, the genomic signature
of
is a vector of
numerical quantities computed from the manner in which characters from
constitute
such that:
The species from which a DNA sequence is derived is called the
origin of that sequence. In genomic research, scientists often
encounter short DNA sequences of unknown origin. A genomic signature that is
highly conserved within the genome of a species and differs between genomes of
different species is extremely useful in the identification of such short DNA
segments. For a given short sequence
,
is computed and
then matched with an existing database of
signatures of all
known species with sequences genomes using a suitable similarity measure. The
species with the closest signature to
is predicted as the
origin of
. Other species
having signatures with high similarities to
are predicted as
close relatives. In the absence of any close signatures in the database, the
discovery of a new species is hypothesized. The application of a good genomic
signature to the short sequence origin identification problem can be effectively
used for metagenomics, infectious microbial sequence identification, global
cataloging of species, and building of a star-trek-like genome tricorder with
scientific, industrial, and domestic applications. While several genomic
signatures have been defined in the scientific literature, none of them have
demonstrated high accuracy in predicting the origins of short DNA sequences.
| In this work, we propose the de Bruijn chain (DBC)
signature |
Fig. 1. Using de Bruijn chains to predict the origin of short DNA sequences. |
A DNA word of length w is a string over
. Two other
sequence-based genomic signatures have been proposed in the scientific
literature. These are the word count vector
(WCV) [1] and the
dinucleotide odds ratio vector
(DOR) [2],
respectively. Let
denote the
set of all strings in
. For strings
x and y with |x| <= |y|, occ(x, y) is the count
of occurrences of x as a substring of y. The frequency of x
in y is freq(x,y)=occ(x,y)/(|y|-|x|+1). Given a word length
w and a genomic sequence H, the corresponding WCV signature is a
4w-long vector with ith component equal to
occ(xi,H), where xi is the
ith string in lexicographic order in
. A DNA word of
length 2 is called a dinucleotide. Given H, the DOR
signature is a 16-long vector with the entries for dinucleotides in
lexicographic order. The entry corresponding to the dinucleotide XY is
computed by the expression freq(XY,H)/(freq(X,H)*freq(Y,H)). Although the
and
signatures have been
studied with respect to conservation within a species and variation between
species, their effectiveness in identifying short DNA sequences accurately has
not been studied systematically. Also, a formal mathematical framework within
which separation between the above signatures for different species is
characterized, is lacking in the literature.
The de Bruijn chain (DBC) signature was proposed by us in [3].
The order-w de Bruijn graph
over alphabet
is a
directed graph, where
when
, for some
; such an edge is
labeled
. Fig. 2 depicts
a binary de Bruijn graph of order 3. Let H be a genomic sequence
of length n. We think of H as tracing a walk in
. The edge
count of edge
in H, where
, is
. Now consider the
Markov chain underlying the above de Bruijn graph
. The said Markov
chain has state space
and a sparse
transition probability matrix with nonzero transition probabilities only for
edges in
; such a Markov
chain is called an order-w de Bruijn chain (DBC). Assuming that
all DBCs computed from genomic sequences are ergodic and hence that there is a
unique stationary distribution
on
satisfying
, for word
length w>= 1, we obtain
with associated edge
counts. Let
be a positive
integer threshold. Let
be the set of
edges with counts at most
. Then
edge deletion is the process of deleting edges
from
while varying
from
0 to
and
deleting edges with tied counts in arbitrary order. As
increases
from 0 to
, the
number of isolated vertices increases from 0 to 4w. The
ordered vertex isolation frequency vector (OVIF)
is the
4w-vector whose ith component is the
frequency of the last edge whose deletion isolates the vertex labeled with
xi, the ith string in lexicographic order.
The de Bruijn chain signature
is the 2·
4w-vector
, where
is the estimated
stationary distribution for the order-w DBC and '·' represents vector
concatenation. Fig. 3 depicts the construction of the order-2 DBC
signature.
![]() Fig. 2. de Bruijn graph of order 3 over the binary alphabet {0,1} The red line depicts the walk traced by the sequence 0001110111000101 on the graph. |
Fig. 3. Construction of the DBC signature of order 2. |
Including both theoretical and empirical results that we obtain, we make the following contributions:
|
and | .
|
We propose two datasets, L1: a list of
50 diverse species including archaea, bacteria, and eukaryotes, and
L2: a list of 74 closely related
-proteobacterial
species, to serve as standard datasets in comparing accuracies of all
available sequence-based genomic signatures in the literature. The intention
of the first list is to test the ability of a signature to distinguish between
diverse species while the intention of the second list is to test the ability
of a signature to distinguish between closely-related species. First, the
order-2 DBC signatures were computed for all species in both lists
L1 and L2, and stored in a database
D. We test the accuracy of the DBC signatures in identifying the origin
of short DNA sequences using DNA sequences of lengths 5 kb, 10 kb, 25 kb, 50
kb, and 100 kb. For each of the five lengths listed above, from each species
in both lists L1 and L2, 100
subsequences of that length are randomly sampled. The order-2 DBC
signature of each subsequence is computed and compared with all signatures in
the database D using the Pearson correlation coefficient. The species
whose signature matches with the highest correlation is the predicted target.
The number of predicted targets among 100 sequences, which correspond
to the true origin is the accuracy. Fig. 5 illustrates the
accuracy of the order-2 DBC signature using short DNA sequences sampled
from the species on the x-axis, for different sample lengths. Observe
that the DBC signature is extremely well-conserved (high accuracy) among
archaea (the first 10 species on the x-axis), bacteria (the next 20
species on the x-axis), and most eukaryotes (the last 20 species on the
x-axis). As the sample length goes down, so does the accuracy, as
expected. Even at sequences as small as 10 kb, a median accuracy greater than
80% is observed.
Fig. 5. Median accuracies of DBC, DOR, WCV, and combo signatures using diverse species. |
Fig. 6. Median accuracies of DBC, DOR, WCV, and combo signatures using closely-related species. |
We have examined genomic signatures from the point of view of accurate identification of the origin of short unknown DNA sequences. The genomic signatures introduced in this paper are derived from the structure and properties of de Bruijn chains. When a sample sequence is sufficiently long, the target organism for the sample can be retrieved by querying a database of signatures. Given an unknown DNA sequence, its possible high-level location in the phylogenetic tree can be predicted using the combination of the DBC and DOR signatures, after which its origin and closest relatives can be predicted using the DBC signature alone. We have demonstrated both theoretically and empirically that the DBC signature is a powerful signature, able to efficiently identify the origin of an unknown genomic sequence as short as a few kilobases. This implies that the origin and the closest relatives of an unknown sequence can be identified with very little actual sequencing. We also observed the effect of order on efficiency of the DBC signature. In continuing work, we are exploring the effect of size of the signature database on short sequence target prediction efficiency. We are also studying the phylogeny implied by distances between DBC signatures and the extent to which this phylogenetic structure is conserved on random sampling of short sequences for phylogenetic reconstruction. A software system that predicts the origin using signatures computed from available genomic sequences of species in NCBI's taxonomy database is under construction.
References