2 2 Mapping Description For two input strings x,y of length n, we independently and equiprobably select a window of width w in both strings: x[i,i+w-1] and y[i,i+w-1]. Using fixed parameters of q-gram length q and q, a q-gram vector v (vector of 1 2 w,q quantities of each q-gram) is composed for each window for each q=q,…,q. The obtained vectors are 1 concatenated for each string. The Manhattan distance d between them is the sum of Manhattan distances between q-gram vectors of the windows:

q d ( x[i, i + w - 1], y[i, i + w - 1]) = vw,q ( x[i, i + w - 1]) - vw,q ( y[i, i + w - 1])l1 (2) q = qUsing the defined distance it is possible to show that necessary embedding properties (1a) and (1b) hold.

Upper Bound For x,y n let a q-gram x[i,i+q-1] be “good”, if there is a q-gram y[i,i+q-1] such that |j-i|s, where s is the relative shift of the q-grams. If all q-grams in a window x[i,i+w-1] are «good» and are located successively in x[i,i+w-1] as well as in y[i,i+w-1], then ||v (x[i,i+w-1])-v (y[i,i+w-1])||2s. As s does not exceed ed(x,y), so q q ||v (x[i,i+w-1])-v (y[i,i+w-1])||2ed(x,y).

q q For the independent and equiprobable chose of the window we get for the property (1a):

Lemma 1 For x,y n, if ed(x,y)1-k w/(n-w+1).

1 1 2 1, 1 2-q 1 Lower Bound De Bruijn graphs. For a string x and a parameter q de Bruijn graph [Bruijn, 1946] G[x;q] is a graph, whose vertices are all q-1-grams (q-1-spectrum) of the string x. An edge a1a2…aq connects vertices labeled a1a2…aq-and a2a3…aq. Such graphs are widely used in genetics [Pevzner, 1989] and in cryptographic stream ciphers’ analysis. Let a de Bruijn graph built from the union of q-1-spectra of two strings x,y be G[x,y;q], and a path corresponding to a string x be.

x XII-th International Conference "Knowledge - Dialogue - Solution" Let us consider possible local mutual configurations of paths and on G[x,y;q] that correspond to two x y symbolic string being compared. Let us call the right and left branching points the vertices where the ways, correspondingly, diverge or converge. To disambiguate their definition it is necessary that there is only one way of continuing a path from any graph vertex (Euler paths on G[x;q] and G[y;q] should be unique). Let a "loop" be a right branching point together with the nearest following left branching point in the direction of passing paths.

Because of its definition, a loop does not contain identical arcs, otherwise it could be possible to divide it into smaller loops. The situation when there are no loops, when in G[x,y;q] is only one left or right branch point, or a left one and a right point following it, is called a "fork". With this configuration, a left folk compulsorily contains as least one of the starting arcs of at least one of the paths, and a right folk contains terminating arcs of at least one of the ways. For some,, let a "shift" be a special case of a fork, when there is a non-empty subpath x y,w| |0, that =’ and = ’, where ’, ’ are some, possibly empty, subpaths.

с с x x с y с y x y Concepts of "transposition" and "rotation" are used to discriminate between ways of obtaining identical spectra from different strings ([12], [11]). A transposition is the following situation: for t,s q-1: x=… t … s…t … s … y=… t … s…t … s … or x=… t … t … t … y=… t … t … t … A rotation is a situation when q-1-grams on the edges of a string are identical. It this case a corresponding path on the de Bruijn graph is a cycle, and, from any of its vertices one can get different strings with the same spectrum.

The following theorem was conjectured in [Ukkonen, 1992] and proven in [Pevzner, 1995] Theorem 1 Any two strings with the same q-spectrum can be transformed into each other by transpositions and rotations.

Corollary 1 For a string x w and graph G[x;q], if q>(w+1)/3 then the corresponding Euler path on G[x;q] is unique or is an Euler cycle.

Corollary 2 For two strings x,y w such that they can transformed into each other by rotations only, holds ed(x,y) 2 (w-q+1)/2.

Our aim is to determine such a distance measure between two w-wide windows and such a threshold that strings with a distances less than this threshold would represent a shift and can be aligned in a fixed number of operations, by simply editing them at the beginning and the end of the window, namely, by eliminating forks at the edges of the windows. The usual q-gram distance (with a fixed q) cannot provide the desired result since for any q there can be found two strings x,y where on the graph G[x,y;q] there will be a loop.

Therefore we propose distance (2) and in the following lemma show that with its aid it is possible to determine whether it is possible to align windows with a small number of edit operations.

We checked lemma 3 experimentally for those values of parameter w that still allowed for brute force string comparison. For a binary alphabet, all pairs of 2w strings were compared for w = 8,…,17 (experiment ran for days). For a ternary alphabet all pairs of 3w strings were compared for w = 8,…,10 (2 days). None of the experiments has found a pair of strings violating lemma.

Let there be two types of pairs of windows x[i,i+w-1],y[i,i+w-1]: «good» and «bad» – correspondingly those for which condition (3) holds or not. The next lemma says that it is possible to simultaneously align simultaneously successive «good» windows. Conditions of lemma 4 (with corollaries 1 and 2) exclude non-unique determination of branching points.

Lemma 4 Let conditions of lemma 3 hold, w=8,…,n, if for all i=1,…,n-w+1 holds d(x[i,i+w-1],y[i,i+w1])<(q+1)(q+2), than ed(x,y)<2(q+1)(q+2).

Let N be the number of bad windows. Let us find the upper limit (lemma 5) on the edit cost for all possible arrangements of N windows, using the following string edit algorithm. Assume we have aligned strings up to position j-1. If all consecutive pairs of windows, beginning from position j and to j+r, are "good", then we align them with not more than 2Q operations, using the result of lemma 4, and continue from position j+r+w-1. If the next pair of windows in position j is bad, we use one edit operation to replace symbol x[j] with symbol y[j], thus aligning one symbol and continuing to the next pair of windows in position j+1.

Neural and Growing Networks Lemma 5 Let conditions of lemma 4 hold. Let T= (n-1)/w. The cost of aligning strings y and x with the help of the above algorithm is upper bounded with (2Q-1)min(T,N)+min(N,n-1-(w-1)min(T,N))+2Q 2Q(N+1) So for the independent and equiprobable window choice we get the following lemma that specifies property (1b):

Lemma 6 For x,y n and holding conditions of lemmas 3-5, if ed(x,y)>k, then 2, P[d (x,y)(q+1)(q+2)]>(k /2Q-1)/(n-w+1).

Nearest Neighbor Search In this part we consider one of the possible procedures for the nearest string search (NNS) with the help of edit distance approximation described above. Let P={p,…,p |pP} be a collection of strings and p be the input 1 P i probe string, to which it is necessary to find the nearest one (by the edit distance) from P. Searching for the exact nearest neighbor is often a laborious task. Namely, for large dimensionalities dimensions of input space (in our case it is d=||n) the existing NNS algorithms are reduced to the linear search on P. On the other side, the "approximately" nearest neighbor is often sufficient in applications and it is often much easier to find it. This caused a large number of works concerned with NNS approximations.

First we transform vectors v into hash-values distributed around ||v || with the help of a p-stable distribution, w,q w,q and a modified scheme from [Datar, 2004] (see subsection "Modification"). Then we apply a known scheme of locality-sensitive hashing (LSH) [Indyk, 1998].

LSH. Let us describe the original [Indyk, 1998] LSH scheme applied to strings with the classic edit metrics. Define a ball with radius r containing points distanced from its center not farther then r: B(t,r)={q:ed(q,s)r}.

Definition of locality-sensitive functions. A family of hash-functions H={h: X} is called (r,r,p,p 1 2 1 2)sensitive, if for any x,y n and any independently an equiprobably chosen hH holds the following:

t B(s,r ) P[h(t)=h(s)] p and t B(s,r ) P[h(t)=h(s)] p, 1 1 2 (4) r 1

a string pP based on the value of the hash-vector g(p): a string p is put into a cell with an identifier equal to the i i i hash-vector value. The aim is to get high collision probability between nearby strings, and low probability between distant ones. Then, applying the same hash to the probe we check whether it equals one of the previously stored hashes of vectors from P: for probe p we calculate all hash-vectors g p and examine corresponding 0 i( 0), i=1,…,L cells. If some cell contains a string p* B(p,r ), the algorithm returns YES and p* and NO otherwise (thus i 2 i representing a solution to the so called (r,r 1 2)-PLEB task [Indyk, 1998]). Algorithm terminates after checking 2l cells. For such procedure holds the following theorem. For such a procedure, the following theorem holds:

Theorem 2 [Indyk, 1998] Let H be a (r,r,p,p 1 2 1 2)-sensitive family of functions, K=-log|P|/log(p, L=|P|, where 2). Then the above algorithm solves (r,r =ln(p /p ) 1 2)-PLEB task and takes O(||q|P|+|P|1+) space, O(|P|) distance 1 calculations, and O(|P|K) calculations of hash functions.

LSH with a 1-stable distribution. In [Datar, 2004], it is proposed to use the property of stable distributions that linear combinations of theirs random values i are distributed as one random variable multiplied by the norm of linear combination’s coefficients. Due to linearity of scalar product (v,)-(v,) ||v ||. Hash-functions are 1 2 1-v 2 lp defined as:

h(v)=((v,)+b)/r, (5) where b is an equiprobably distributed random variable on [0,r], is a vector with elements taken from Cauchy distribution.

XII-th International Conference "Knowledge - Dialogue - Solution" If one divides a real axis into equal intervals, then, intuitively, vectors with the similar norm will likely fall into the same interval. It is possible to show [Datar, 2004] that for two fixed vectors the hash-function (5) is r 1 t 1 r c r p(c) = f (c)(1 - )dt = 2 arctan - ln1 + (6) c c c r c where f(.) is probability density function of the absolute values of, and с=||v || is the distance between the 1-v hashed vectors. As p(c) is a monotonically decreasing function, the family of such functions is locality-sensitive (see definition 1). So, hash-functions (5) can be used in the LSH scheme.

Our modification. Let us propose a method of using hash-functions (4) in a different way from that described above, providing analogy to the distributed approaches [Sokolov, 2005]. Instead of multiplying a fixed vector v by a number of random vectors to form hash-vectors g, we take K random vectors v i (s) obtained by random j w,q and independent sampling with a window of width w from string s (see "Mapping description"). For each of them, we generate a separate random vector whose elements are taken from the Cauchy distribution. Let us i designate the scalar product of these two vectors h’=(v i (s),) and let us fix the hash-functions of form i w,q h’(v)=(h’+b)/r. Taking into account lemmas 1 and 6, holds the following lemma the following lemma holds i indicating that the family of such functions is also locality-sensitive.

Lemma 7 P[h'(x)=h'(y)|ed(x,y)k ]p(2k (q+1))(1-k w/(n-w+1)), P[h'(x)=h'(y)|ed(x,y)k ]1-(k /(2(q+1)(q+2))1 1 1 2 1)(1-p((q+1)(q+2))), where function p(.) is defined in (6).

With this method of hash-function formation, it is possible to get output vectors without matrix multiplying of intermediate q-gram representations by random vectors, thus obtaining a scheme consistent with a neural network approach [Sokolov, 2005].

Binarization and ternarization. Parameter r in (5) can be chosen to minimize /p (see [Datar, 2004]) and to p speed up the NNS procedure (theorem 2). It is also attractive to have either binary or ternary vectors at the output that are more beneficial than integer-valued ones because of a more efficient implementation. On the other side (sparse) binary or ternary vectors are widely used in distributed processing models [Rachkovskij, 2001].

Hash-function (5) will take binary values {0,1} if 0

( ) ( ) Density of zero elements is an important parameter in many schemes of distributed coding, for ternary vectors it is defined by 1/ arctan r / ||v||-||v||/2r ln(1+(r / ||v||)2) and is increasing with the increasing of r.

( ) Conclusion We analyzed the concept of the distributed representations of sequences [Sokolov, 2005] from the point of view of metric embeddings, presented a new q-gram approximation method of the edit distance, and proved the possibility of constructing locality-sensitive functions. Thus we showed that the distributed representations used for the comparison of sequential data in the neural network paradigm could be justified with the aid of the methods from the embedding theory. This approach can also be considered as the substantiation of the Broder approach [Broder, 1995] who takes for the document similarity measure the degree of the coincidence of the sets of their q- grams and also other bag-of-grams methods. We also gave conditions for obtaining binary and ternary vectors at the output that can be useful for a unified approach to representation and processing of various data types and modalities [Rachkovskij, 2001].

Neural and Growing Networks Bibliography [Bar-Yossef, 2004] Z. Bar-Yossef, T.S. Jayram, R. Krauthgamer, R. Kumar: Approximating Edit Distance Efficiently. FOCS, pp. 550-559, [Batu, 2004] T. Batu, F. Ergun, J. Kilian, A. Magen, S. Raskhodnikova, R. Rubinfeld, R. Sami. A sublinear algorithm for weakly approximating edit distance, In Proc. 36th STOC, [Broder, 1998] A. Broder. On the resemblance and containment of documents. In SEQS: Sequences ’97, 1998.

Материалы этого сайта размещены для ознакомления, все права принадлежат их авторам.
Если Вы не согласны с тем, что Ваш материал размещён на этом сайте, пожалуйста, напишите нам, мы в течении 1-2 рабочих дней удалим его.