US20080010038A1 - Shift-invariant probabilistic latent component analysis


Info

Publication number
US20080010038A1
Authority
US
United States
Prior art keywords
input data
distribution
distributions
components
kernel
Prior art date
Legal status
Granted
Application number
US11/482,492
Other versions
US7318005B1
Inventor
Paris Smaragdis
Bhiksha Ramakrishnan
Current Assignee
Mitsubishi Electric Research Laboratories Inc
Original Assignee
Mitsubishi Electric Research Laboratories Inc
Priority date
Filing date
Publication date
Application filed by Mitsubishi Electric Research Laboratories Inc filed Critical Mitsubishi Electric Research Laboratories Inc
Priority to US11/482,492
Assigned to MITSUBISHI ELECTRIC RESEARCH LABORATORIES, INC. Assignors: RAMAKRISHNAN, BHIKSHA; SMARAGDIS, PARIS
Application granted
Publication of US7318005B1
Publication of US20080010038A1
Expired - Fee Related
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis

Abstract

A method decomposes input data acquired of a signal. An input signal is sampled to acquire input data. The input data is represented as a probability distribution. An expectation-maximization procedure is applied iteratively to the probability distribution to determine components of the probability distributions.

Description

    FIELD OF THE INVENTION
  • This invention relates generally to analyzing data acquired of real world systems and phenomena, and more particularly to methods for determining low-dimensional components of complex, high-dimensional data structures, such as acoustic signals and images, or any arbitrary data set.
  • BACKGROUND OF THE INVENTION
  • Many practical applications in the fields of signal processing often preprocess input data using component analysis. Component analysis can reduce dimensionality, extract features, or discover underlying structures of the data.
  • Principal component analysis (PCA) and independent component analysis (ICA) are frequently employed for various tasks, such as feature discovery or object extraction. The statistical properties of PCA and ICA make them indispensable tools for machine learning applications.
  • Non-negative matrix factorization (NMF) can also be used for component analysis, Lee D. D., and Seung H. S., “Learning the parts of objects by non-negative matrix factorization,” Nature, Vol. 401, No. 6755, 21 Oct. 1999, pp. 788-791. Non-negativity is a valuable property for working only with positive data. Because a large majority of acoustic, image and text data operations deal with positive only data, NMF presents an appropriate alternative to PCA and ICA. A particular reason for the success of NMF is that using non-negative components, in combination with non-negative weights, often translates to a meaningful solution.
  • In contrast, methods that do not use non-negativity yield a set of bases that contain negative elements. Then, cross-cancellation between the positive and negative elements must be employed to approximate the input. Components with negative elements are hard to interpret in a positive-only data framework and are often used for their statistical properties and not for the insight they provide into the underlying data structure. In contrast, NMF provides meaningful components for a variety of data types such as images and acoustic magnitude spectra.
  • However, the downside of NMF is that it is defined in a purely non-statistical framework, which prevents NMF from being applied to probabilistic applications.
  • SUMMARY OF THE INVENTION
  • The embodiments of the invention provide a model and method for decomposing input data represented by probability distributions into a set of shift invariant components. The model is a latent variable model, which can be extended to deal with shift invariance in a probability space in order to model more complex input data. The embodiments of the invention also provide an expectation maximization procedure for estimating the set of shift invariant components.
  • Shift-invariant probabilistic latent component analysis (PLCA) is applied to input data represented according to the model. Shift invariance enables the method to discover components that appear in the data at arbitrarily shifted positions, either temporally, spatially, or both. This property is desirable when dealing with input data such as images and acoustic signals, where translation of the same feature, such as objects or sounds, is often observed.
  • In contrast with a conventional linear decomposition, the invention decomposes in the probability density space. This provides an interesting statistical interpretation, and also helps to implicitly enforce non-negativity while using probabilistic optimization techniques.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a flow diagram of a method for decomposing input data into a set of shift invariant components according to an embodiment of the invention;
  • FIG. 2A is a diagram of input data represented by three Gaussian functions according to an embodiment of the invention;
  • FIGS. 2B-2D are diagrams of components of the distributions of FIG. 2A obtained by the method according to an embodiment of the invention;
  • FIG. 3 is a diagram of output data approximating the input data by combining the components of FIGS. 2B-2C according to an embodiment of the invention;
  • FIG. 4 is a diagram of input data represented by multiple Gaussian functions according to an embodiment of the invention;
  • FIGS. 5A-5D are diagrams of components of the distributions of FIG. 4 according to an embodiment of the invention;
  • FIG. 6 is a diagram of output data approximating the input data by combining the components of FIGS. 5A-5D according to an embodiment of the invention;
  • FIG. 7 is a diagram of input data represented by three Gaussian functions according to an embodiment of the invention;
  • FIGS. 8A-8B are diagrams of components of the distributions of FIG. 7 according to an embodiment of the invention;
  • FIG. 9 is a diagram of output data approximating the input data by combining the components of FIGS. 8A-8B according to an embodiment of the invention;
  • FIG. 10 is a diagram of musical notes represented by distributions according to an embodiment of the invention;
  • FIGS. 11A-11B are diagrams of components of the distributions of FIG. 10 according to an embodiment of the invention;
  • FIG. 12 is a diagram of output data approximating the input data by combining the components of FIGS. 11A-11B according to an embodiment of the invention;
  • FIG. 13 is a diagram of kernel distributions of magnitude spectrogram of speech according to an embodiment of the invention;
  • FIG. 14 is an image of a choir with a multitude of heads at various locations in the image;
  • FIG. 15 is a diagram of a kernel distribution of an average head according to an embodiment of the invention;
  • FIG. 16 is a diagram of an impulse distribution of the locations of the heads in FIG. 14 according to an embodiment of the invention;
  • FIG. 17 is an image of handwriting characters;
  • FIG. 18A is a diagram of kernel distributions of the characters in FIG. 17, according to an embodiment of the invention;
  • FIG. 18B is a diagram of corresponding prior distributions for the kernel distributions of FIG. 18A, according to an embodiment of the invention;
  • FIG. 18C is a diagram of the impulse distribution corresponding to the characters in FIG. 17 according to an embodiment of the invention.
  • FIG. 19 is an image of results of approximation of the characters in FIG. 17 according to an embodiment of the invention.
  • DETAILED DESCRIPTION OF THE EMBODIMENT
  • As shown in FIG. 1, one embodiment of our invention provides a method 100 for decomposing input signal 101 into a set of shift invariant components 131.
  • We sample an input signal 101 to acquire 110 input data 111. The signal can be an acoustic signal with temporally shifted magnitude spectra, or an image where pixel intensities are shifted spatially. Other forms of input data are described below. Typically, the input data has a high dimensionality. It is desired to determine low-dimensional components that can be used to approximate high dimensional input data.
  • We represent 120 the input data 111 by a model 121. The model 121 is a mixture of probability distributions. The probability distributions can be continuous Gaussian functions or discrete histograms.
  • We iteratively apply an expectation-maximization procedure 130 to the model 121 to determine the components 131 of the probability distributions.
  • The components can be combined 140 to produce output data 141 that approximate the input data 111.
  • Probabilistic Latent Component Analysis
  • The embodiments of our invention provide the model 121 and method 100 for decomposing the input signal 101 represented 110 by probability distributions into a set of shift invariant components 131. Our model is defined as:
    P(\mathbf{x}) = \sum_{z} P(z) \prod_{j=1}^{N} P(x_j \mid z), \qquad (1)
    where P(x) is an N-dimensional distribution of the input signal 101, a random variable x=x1, x2, . . . , xN, z is a latent variable, and the P(xj|z) are one dimensional marginal distributions. The latent variable is an index for the set of components that are extracted. The latent variable can take values of, e.g., {1,2,3}, in which case we are decomposing to three components. Alternatively, the latent variable can be a continuous variable and take any number of values, including fractional values.
  • Effectively, our model represents a product of a mixture of the marginal probability distributions to approximate the N-dimensional probability distribution representing the input signal 101. The marginal distributions themselves are dependent on the latent variable z. The model can be used by our method to determine an underlying structure of the input signal 101. This is done by estimating 130 both P(xj|z) and P(z) from the input data P(x).
  • The estimation of the marginal probabilities P(xj|z) is performed using a variant of an expectation-maximization (EM) procedure, which is described in more detail in Appendix A and Appendix B. The EM procedure includes an expectation step and a maximization step. We alternate between these two steps in an iterative manner until a termination condition is reached, for example, a predetermined number of iterations, or a predetermined accuracy.
  • In the expectation step, we estimate a weight R of the latent variable z:
    R(\mathbf{x}, z) = \frac{P(z) \prod_{j=1}^{N} P(x_j \mid z)}{\sum_{z'} P(z') \prod_{j=1}^{N} P(x_j \mid z')}. \qquad (2)
  • In the maximization step, we re-estimate the marginal probabilities using the above weight to obtain a new and more accurate estimate:
    P(z) = \int P(\mathbf{x})\, R(\mathbf{x}, z)\, d\mathbf{x}, \qquad (3)
    P(x_j \mid z) = \frac{\int P(\mathbf{x})\, R(\mathbf{x}, z) \prod_{k \ne j} d x_k}{P(z)}. \qquad (4)
  • P(xj|z) contains a latent marginal distribution, along the dimension of the variable xj relating to the latent variable z, and P(z) contains the prior probability of that latent variable. Repeating the above EM steps in an alternating manner multiple times produces a converging solution for the marginal probabilities and the latent variable prior probabilities.
  • The above method can also be adapted to work for discrete variables x and z, or all possible combinations, as described in Appendix B. The method also works when the input data are represented by un-normalized histograms as opposed to probability distributions. The only added measure we need to take in the discrete case is to normalize each P(xj|z) to integrate or sum to one, in every iteration to ensure that it corresponds to a true marginal distribution.
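  • The discrete EM updates above map directly to a few array operations. Below is a minimal NumPy sketch of the two-dimensional, non-convolutive case; the function name plca_2d, its arguments, and the initialization scheme are illustrative choices, not part of the patent disclosure.

```python
import numpy as np

def plca_2d(V, n_components, n_iter=100, rng=None):
    """Sketch of the PLCA EM updates of Equations (2)-(4) for a discretized,
    non-negative two-dimensional input V[x1, x2]."""
    rng = np.random.default_rng(rng)
    V = V / V.sum()                                    # treat V as a distribution
    n1, n2 = V.shape
    Z = n_components

    pz = np.full(Z, 1.0 / Z)                           # P(z)
    px1 = rng.random((n1, Z)); px1 /= px1.sum(axis=0)  # P(x1|z)
    px2 = rng.random((n2, Z)); px2 /= px2.sum(axis=0)  # P(x2|z)

    for _ in range(n_iter):
        # Expectation step: weight R(x1, x2, z), Equation (2)
        joint = np.einsum('z,iz,jz->ijz', pz, px1, px2)
        R = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-300)

        # Maximization step: Equations (3) and (4)
        VR = V[:, :, None] * R
        pz = VR.sum(axis=(0, 1))
        px1 = VR.sum(axis=1) / np.maximum(pz, 1e-300)
        px2 = VR.sum(axis=0) / np.maximum(pz, 1e-300)

    approx = np.einsum('z,iz,jz->ij', pz, px1, px2)    # combined components
    return pz, px1, px2, approx
```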
  • We describe the use of our decomposition model and method using an example problem. As shown in FIG. 2A, we represent a two dimensional random variable P(x) by three 2D-Gaussian functions with diagonal covariances:
    \mathbf{x} \sim \mathcal{N}\!\left( \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \begin{bmatrix} 0.4 & 0 \\ 0 & 0.4 \end{bmatrix} \right) + \frac{1}{2}\, \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.7 & 0 \\ 0 & 0.1 \end{bmatrix} \right) + \frac{1}{2}\, \mathcal{N}\!\left( \begin{bmatrix} -2 \\ 1 \end{bmatrix}, \begin{bmatrix} 0.1 & 0 \\ 0 & 0.4 \end{bmatrix} \right). \qquad (5)
  • We sample P(x) and operate in the discrete domain using the discrete forms of the above equations as shown in equations (37, 38, 38) in Appendix B. The latent variable z is also discretized so as to assume only three values, one for each component we desire to extract.
  • FIGS. 2B-2D show the expected result after forty iterations of our EM-like training. FIG. 2B shows the prior probabilities P(zi). FIG. 2C shows the marginal probabilities from an up-down dimension P(x2|zi). FIG. 2D shows the marginal probabilities from a left-right dimension P(x1|zi).
  • FIG. 3 shows an approximation to P(x) using our PLCA model. By combining, e.g., multiplying, pairs of marginal probabilities that are associated with the same latent variable, we can describe all of the Gaussian functions that originally were used to represent P(x). The latent prior probabilities reflect the proper mixing weights, and the relative presence of each Gaussian function. The prior probabilities properly describe a ratio in the mixture, albeit normalized to sum to one.
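  • As a hedged usage sketch, the example of Equation (5) can be reproduced by evaluating the three Gaussian functions on a grid and decomposing the resulting histogram with the plca_2d sketch above; the grid extent and resolution are illustrative choices.

```python
import numpy as np

# Evaluate the mixture of Equation (5) on a grid (illustrative extent/resolution).
x1 = np.linspace(-4.0, 4.0, 80)
x2 = np.linspace(-4.0, 4.0, 80)
X1, X2 = np.meshgrid(x1, x2, indexing='ij')

def diag_gauss(mu, var):
    """Diagonal-covariance 2D Gaussian density evaluated on the grid."""
    norm = 1.0 / (2.0 * np.pi * np.sqrt(var[0] * var[1]))
    return norm * np.exp(-0.5 * ((X1 - mu[0])**2 / var[0] + (X2 - mu[1])**2 / var[1]))

V = (diag_gauss((1, -1), (0.4, 0.4))
     + 0.5 * diag_gauss((0, 2), (0.7, 0.1))
     + 0.5 * diag_gauss((-2, 1), (0.1, 0.4)))

pz, px1, px2, approx = plca_2d(V, n_components=3, n_iter=40, rng=0)
print(pz)  # latent prior probabilities, to be compared with FIG. 2B
```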
  • Shift Invariant Probabilistic Latent Component Analysis
  • The above model is extended to deal with shift-invariance in real-world data. We discover surprisingly a fundamental relationship with a seemingly unrelated family of processes as described below.
  • Shift Invariance Along One Dimension
  • First, we consider the problem where we have a somewhat more complex description for each component than just a product of marginal distributions. We approximate a two-dimensional distribution using a set of left-right ‘shifting’ two-dimensional kernel distributions:
    P(x, y) = \sum_{z} P(z) \int P(x, \tau \mid z)\, P(y - \tau \mid z)\, d\tau. \qquad (6)
  • Instead of multiplying marginal probabilities, we now convolve a “kernel” distribution P(x, τ|z) with an ‘impulse’ distribution P(y−τ|z) along a left-right dimension. The variables y and τ are cardinal numbers, so that the shifting imposed by the convolution operation is meaningful.
  • The kernel distributions extend over both dimensions, whereas the impulse distributions exist only in the dimension on which the convolution takes place. An optimal estimate for the kernel and impulse distributions is
    P(z)=δ(z), P(y|z=0)=δ(y), P(x, τ|z=0)=P(x, y).
  • This solution returns the original distribution P(x, y), without providing any further insight into its structure. To obtain more useful decompositions, we constrain the kernel to be zero outside [τ1, τ2], i.e., we impose the constraint that P(x, τ|z)=0, ∀τ∉[τ1, τ2].
  • If P(τ|z)=δ(τ), then Equation (6) reduces to the PLCA form in equation (1).
  • We still estimate all three distributions, i.e., prior probabilities, kernels and impulse, in Equation (6) given P(x, y). In order to perform the estimation using this model, we extend the PLCA method to deal with a convolution operation. The extensions are described in detail in Appendix C.
  • We apply 130 the EM procedure as described above. A weight of each latent variable in the expectation step is now defined over the parameter τ, as well as the latent variable z, and the weight is:
    R(x, y, \tau, z) = \frac{P(z)\, P(x, \tau \mid z)\, P(y - \tau \mid z)}{\sum_{z'} P(z') \int P(x, \tau' \mid z')\, P(y - \tau' \mid z')\, d\tau'}, \qquad (7)
    and the new estimates for P(z), P(x, τ|z) and P(y|z) are defined by the proper integrations over the input P(x, y), weighted by the contribution R(x, y, τ, z):
    P(z) = \iiint P(x, y)\, R(x, y, \tau, z)\, dx\, dy\, d\tau, \qquad (8)
    P(x, \tau \mid z) = \frac{\int P(x, y)\, R(x, y, \tau, z)\, dy}{P(z)}, \qquad (9)
    P(y \mid z) = \frac{\iint P(x, y+\tau)\, R(x, y+\tau, \tau, z)\, dx\, d\tau}{\iiint P(x, y+\tau)\, R(x, y+\tau, \tau, z)\, dx\, dy\, d\tau}. \qquad (10)
  • Just as before, the above equations are iteratively applied until the estimates for P(z), P(x, τ|z) and P(y|z) converge to a desired solution. In the above equations, the variables x, y and τ are continuous. However, the same method can be applied to discrete variables as described in Appendix C.
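  • The discrete form of these convolutive updates can be sketched with small NumPy loops over the allowed shifts; the function name siplca_1d, the kernel_width argument, and the initialization are illustrative assumptions, not the patent's notation.

```python
import numpy as np

def siplca_1d(V, n_components, kernel_width, n_iter=100, rng=None):
    """Sketch of shift-invariant PLCA with shifting along the second axis
    (discrete form of Equations (7)-(10)). V[x, y] is a non-negative array;
    W[z, x, tau] are kernel distributions, H[z, y] impulse distributions."""
    rng = np.random.default_rng(rng)
    V = V / V.sum()
    nx, ny = V.shape
    Z, T = n_components, kernel_width
    eps = 1e-300

    pz = np.full(Z, 1.0 / Z)
    W = rng.random((Z, nx, T)); W /= W.sum(axis=(1, 2), keepdims=True)
    H = rng.random((Z, ny));    H /= H.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Reconstruction: Lam[x, y] = sum_z P(z) sum_tau W[z, x, tau] H[z, y - tau]
        Lam = np.zeros_like(V)
        for tau in range(T):
            Lam[:, tau:] += np.einsum('z,zx,zy->xy', pz, W[:, :, tau], H[:, :ny - tau])
        Q = V / np.maximum(Lam, eps)  # input over reconstruction, used by all updates

        newW = np.zeros_like(W)
        newH = np.zeros_like(H)
        for tau in range(T):
            # Posterior-weighted statistics R(x, y, tau, z) summed over the data
            newW[:, :, tau] = pz[:, None] * W[:, :, tau] * \
                np.einsum('xy,zy->zx', Q[:, tau:], H[:, :ny - tau])
            newH[:, :ny - tau] += pz[:, None] * H[:, :ny - tau] * \
                np.einsum('xy,zx->zy', Q[:, tau:], W[:, :, tau])

        pz = newW.sum(axis=(1, 2))
        W = newW / np.maximum(pz[:, None, None], eps)
        H = newH / np.maximum(newH.sum(axis=1, keepdims=True), eps)
        pz /= pz.sum()

    return pz, W, H
```

    A call such as siplca_1d(V, n_components=2, kernel_width=3, n_iter=100) mirrors the FIG. 4 example below, where two kernels limited to a small τ extent are shifted along the left-right axis.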
  • As shown in FIG. 4, this enables us to deal with more complex input data. In this example, we have two repeating and shifted patterns that compose the input distribution P(x, y). One pattern 401 is a Gaussian function oriented from bottom right to top left, and the other pattern 402 is a set of two Gaussian functions that form a ‘wedge’ oriented towards the top right. Both of these patterns are not easily approximated by products of marginal probabilities as in the above described model. However, the patterns can be modeled by the convolutive model because the patterns exhibit repetition along the left-right axis.
  • We analyze this distribution using the discrete form as described in Equations (58, 59, 60 and 60) in Appendix C. In this particular example, we limited the kernel distributions so that P(x, τ|z)=0, ∀τ<0, and τ>2. The latent variable z is also discretized and assumes only two values (z1 and z2). The results of the decomposition of the converged convolutive model are shown in FIGS. 5A-D.
  • FIG. 5A shows the latent variable prior probabilities P(z), FIGS. 5B-C show the two kernel distributions P(x, τ|z1) and P(x, τ|z2), and FIG. 5D shows the impulse distributions P(y|z).
  • FIG. 6 shows a diagram of output data approximating the input data by combining the components of FIGS. 5A-5D.
  • We see that by convolving the pairs P(x, τ|z) with P(y|z), we can model the input very well and also discover useful information about its structure. We set the kernel distribution P(x, τ|z) to be non-zero for only a limited interval of τ. If P(x, τ|z) is unconstrained, then a variety of other solutions, e.g., P(x, τ|z)=P(x, y), P(y|z)=δ(y), can be obtained that may model P(x) better than the solutions obtained in the example. Other forms of partitioning P(x, y) into P(x, τ|z) and setting P(y|z) to be an appropriate assortment of delta functions also provide an adequate solution. Therefore, like many dimensionality reduction schemes, limiting the extent of P(x, τ|z) forces the kernels to be informative.
  • Shift Invariance Along Multiple Dimensions
  • Having dealt with the case of shift invariance on one dimension, we now turn to shift invariance on multiple dimensions. Specifically, because we apply our model to two-dimensional real-world data, such as images and spectrograms, we present the case of shift invariance on both dimensions of a two-dimensional distribution, whose two dimensions x and y we designate the “left-right” and “up-down” dimensions, respectively. Generalizations to an arbitrary number of dimensions are described in Appendix C.
  • The kernel distributions we derive can be shifted in the left-right dimension, but also in the up-down dimension. The model for this case is defined using a two-dimensional convolution as:
    P(x, y) = \sum_{z} P(z) \iint P(\tau_x, \tau_y \mid z)\, P(x - \tau_x, y - \tau_y \mid z)\, d\tau_x\, d\tau_y. \qquad (11)
  • We restrict the kernel distributions P(τx, τy|z) such that P(τx, τy|z)=0, ∀(τx, τy)∉ℛ, where ℛ defines a convex region of the (τx, τy) plane. ℛ is selected such that its extent is smaller than that of the input distribution P(x, y), while the domain of the impulse distributions P(x, y|z) is set to be as large as the input distribution, so that there is space to shift the kernel with respect to the impulse distribution in both dimensions.
  • A detailed derivation is described in Appendix C. The ‘contribution’ of each latent variable is now defined over the latent variables and both convolution parameters τx and τy:
    R(x, y, \tau_x, \tau_y, z) = \frac{P(z)\, P(\tau_x, \tau_y \mid z)\, P(x - \tau_x, y - \tau_y \mid z)}{\sum_{z'} P(z') \iint P(\tau_x, \tau_y \mid z')\, P(x - \tau_x, y - \tau_y \mid z')\, d\tau_x\, d\tau_y}. \qquad (12)
  • As described above, the estimation of the updated prior probabilities, kernel and impulse distributions can be done by the proper integrations:
    P(z) = \iiiint P(x, y)\, R(x, y, \tau_x, \tau_y, z)\, dx\, dy\, d\tau_x\, d\tau_y, \qquad (13)
    P(\tau_x, \tau_y \mid z) = \frac{\iint P(x, y)\, R(x, y, \tau_x, \tau_y, z)\, dx\, dy}{P(z)}, \qquad (14)
    P_I(x, y \mid z) = \frac{\iint P(x+\tau_x, y+\tau_y)\, R(x+\tau_x, y+\tau_y, \tau_x, \tau_y, z)\, d\tau_x\, d\tau_y}{\iiiint P(x+\tau_x, y+\tau_y)\, R(x+\tau_x, y+\tau_y, \tau_x, \tau_y, z)\, dx\, dy\, d\tau_x\, d\tau_y}. \qquad (15)
  • The iterative EM procedure is guaranteed to obtain a locally optimal estimate for the kernel and impulse distributions. However, it is advantageous to utilize a modification whereby, at each iteration, we ‘anneal’ the kernel by raising the kernel to an exponent:
    P(\tau_x, \tau_y) \leftarrow c \cdot P(\tau_x, \tau_y)^{\alpha}, \qquad (16)
    where 0<α≤1 is initially set to a value less than 1, and c is a normalizing constant. As the iterations proceed, α is gradually raised to 1.
  • This procedure can yield a locally optimal estimate of the kernel and impulse distributions, and the procedure is much more likely to result in ‘meaningful’ decompositions, wherein the kernel captures most of the repetitive structure in the distribution, while the impulse distribution includes a mixture of impulses, such as peaks, identifying the locations of the shifted structures, thereby producing a sparse code.
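  • A minimal NumPy sketch of the two-dimensional shift-invariant updates of Equations (11)-(15), including the annealed-kernel step of Equation (16), is shown below; the function name siplca_2d, the annealing schedule and the defaults are illustrative assumptions rather than the patent's prescription.

```python
import numpy as np

def siplca_2d(V, n_components, kernel_shape, n_iter=60, alpha0=0.5, rng=None):
    """Sketch of 2-D shift-invariant PLCA with an annealed kernel.
    V[x, y] is non-negative; K[z, tx, ty] are kernel distributions and
    H[z, x, y] impulse distributions."""
    rng = np.random.default_rng(rng)
    V = V / V.sum()
    nx, ny = V.shape
    Z = n_components
    Tx, Ty = kernel_shape
    eps = 1e-300

    pz = np.full(Z, 1.0 / Z)
    K = rng.random((Z, Tx, Ty)); K /= K.sum(axis=(1, 2), keepdims=True)
    H = rng.random((Z, nx, ny)); H /= H.sum(axis=(1, 2), keepdims=True)

    for it in range(n_iter):
        # Reconstruction: sum_z P(z) * 2-D convolution of kernel and impulse
        Lam = np.zeros_like(V)
        for z in range(Z):
            for tx in range(Tx):
                for ty in range(Ty):
                    Lam[tx:, ty:] += pz[z] * K[z, tx, ty] * H[z, :nx - tx, :ny - ty]
        Q = V / np.maximum(Lam, eps)

        newK = np.zeros_like(K)
        newH = np.zeros_like(H)
        for z in range(Z):
            for tx in range(Tx):
                for ty in range(Ty):
                    # Posterior-weighted statistics for this component and shift
                    newK[z, tx, ty] = pz[z] * K[z, tx, ty] * \
                        np.sum(Q[tx:, ty:] * H[z, :nx - tx, :ny - ty])
                    newH[z, :nx - tx, :ny - ty] += pz[z] * K[z, tx, ty] * Q[tx:, ty:]
        newH *= H                                   # complete the H factor of R

        pz = newK.sum(axis=(1, 2))
        K = newK / np.maximum(pz[:, None, None], eps)
        H = newH / np.maximum(newH.sum(axis=(1, 2), keepdims=True), eps)
        pz /= pz.sum()

        # Equation (16): flatten the kernel early on, relax toward alpha = 1
        alpha = alpha0 + (1.0 - alpha0) * (it + 1) / n_iter
        K = K ** alpha
        K /= K.sum(axis=(1, 2), keepdims=True)

    return pz, K, H
```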
  • FIG. 7 shows an input distribution P(x, y). The latent variable prior probabilities, kernel and impulse distributions that are extracted are also shown in FIGS. 8A-8B. The kernel distributions have correctly converged to the two shifted forms, whereas the impulse distributions are placed to perform the proper decomposition. Convolving each pair of kernel and impulse distributions, and summing the distributions results in a good approximation of the input distribution as shown in FIG. 9.
  • Analyzing Real-World Data
  • We now describe our ability to analyze and extract interesting features from complex real world input data. We apply our method to applications in two signaling domains, acoustic and visual. Instead of performing training directly on training data as in the prior art, we perform ‘EM-like’ training on the distribution or the observed histogram. The reason this is done is to provide a way to analyze certain classes of highly complex distributions and obtain easily interpretable results. The classes of distributions that we are most interested in are time-frequency distributions of acoustic data, and spatial-intensity distributions of image data.
  • Acoustic Data
  • We start with acoustic signals, where we operate on a time-frequency distribution representation. We represent an acoustic signal by a distribution of its energy in time and frequency axes. It is effectively a scaled histogram of acoustic components that fall in each time and frequency grid point.
  • We start with an example where only one kernel distribution is sought, albeit in multiple shifted positions. Our example input data are recorded piano notes, e.g., C, D, E, F, D, E, C, G.
  • In a note, most of the signal energy is at a fundamental frequency, which also defines a pitch of the note. Then, decreasing amounts of energy are at higher frequencies, which are integer multiples of the fundamental frequency, the so-called harmonics of the note. On a musical instrument such as the piano, neighboring notes have essentially the same energy distribution across frequencies, albeit shifted along the frequency axis, assuming a logarithmic frequency spacing representation.
  • In a time-frequency representation, by playing different notes at different times, we effectively have shifting in both the time axis denoting when the note is played, and the frequency axis denoting which note is played.
  • FIG. 10 shows a constant-Q time-frequency distribution of the piano note sequence, see Brown, J. C., “Calculation of a Constant Q Spectral Transform,” Journal of the Acoustical Society of America vol. 89, pp. 425-434, 1991, incorporated herein by reference.
  • The harmonic series repeat in various shifted positions to represent each note. We analyzed this time-frequency distribution seeking a single latent variable. Analysis of this distribution results in a kernel function that is a harmonic series as shown in FIG. 11A, and an impulse function that places that series in the time-frequency plane as shown in FIG. 11B. The approximation to the reconstructed input is shown in FIG. 12.
  • The kernel distribution looks very much like a harmonic series, whereas the impulse distribution has energy only at the fundamental frequency. The impulse distribution indicates where the kernels need to be placed to reconstruct the input data. Thus, we have, in an unsupervised manner, discovered that the piano recording was constructed by a single harmonic template shifted appropriately in time and frequency. From this analysis, we can define the timbre of a piano note, i.e., the kernel distribution, and also perform a transcription of the ‘performance’ by noting the maxima of the impulse distribution.
  • The same results can be obtained when the piano notes overlap in time and are not recorded in relative isolation as in the above example.
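  • As a hedged usage sketch, such an analysis could be driven by any constant-Q front end; the file name, the CQT parameters and the kernel size below are illustrative, and siplca_2d refers to the sketch given earlier.

```python
import numpy as np
import librosa  # assumed available; any constant-Q transform would serve

# Constant-Q magnitude "histogram" of a hypothetical piano recording
y, sr = librosa.load("piano_notes.wav", sr=None)   # file name is illustrative
V = np.abs(librosa.cqt(y, sr=sr))

# One latent variable; kernel spans a band of CQT bins and a few frames
pz, K, H = siplca_2d(V, n_components=1, kernel_shape=(40, 10), n_iter=60, rng=0)
# K[0] should resemble the harmonic series of FIG. 11A; the peaks of H[0]
# indicate when, and at which shifted pitch, each note occurs (FIG. 11B).
```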
  • In another example application, we extract multiple kernel distributions from a speech signal. We analyze a magnitude spectrogram representation of male speech and determine its kernel distributions. For this application, we use about thirty seconds of male speech obtained from the TIMIT speech database, Zue, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, 9, 351-356, 1990, incorporated herein by reference. TIMIT is a standard database used for the development of signal processing and classification processes. We extract 513 frequencies, which results in a discretized input distribution of 513 frequencies over 938 time intervals. We discover 20 latent variables, and we define the kernel distribution size to extend throughout all frequencies but only for 8 time intervals. This kernel size yields bases that are shifted only in time, but not in frequency. Because both the kernel and the input have the same frequency width, there is no space to shift along that dimension.
  • FIG. 13 shows the resulting kernel distributions from an analysis of the magnitude spectrogram of speech. The distributions are stacked from left to right, and separated by dotted lines.
  • Interestingly, the shapes of the distributions correspond to magnitude spectrograms of various speech phonemes, and the time-frequency form of these kernels resembles the structure of various phonemes. One can see a harmonic structure in each kernel distribution, as well as a formant structure characteristic of a phoneme.
  • Due to the additivity in this model, qualitatively similar results can be obtained when using mixtures of speakers as an input. In effect, we find that the building blocks of speech are indeed phonemes shifted in various parts in time. Analyzing different speaker types results in a different set of kernel distributions, i.e., phonemes, which reflects the unique nature of each speaker.
  • Image data
  • Image data (pixels) can be thought of as distributions, e.g., a probability or count of photons landing on a particular point on an optical sensor, i.e., pixel intensities. Image data can be decomposed by our method to yield interesting results. We start with an example application where we wish to extract a single kernel distribution from a complex image.
  • As shown in FIG. 14, the input data is a 136 by 511 pixel image of a choir with a multitude of heads shifted at various locations in the image. We analyze the input data looking for a single latent variable and kernels of pixel height 32 and width 24.
  • After analysis of the input image using our method, we obtained a kernel distribution as shown in FIG. 15, with color inversion. The kernel distribution corresponds to the shape and appearance of an ‘average’ head. The impulse distribution is shown in FIG. 16. The impulse distribution indicates the appropriate shifted locations for each choir member in the input image.
  • FIG. 17 shows a more complex example, which is an 81 by 113 pixel image of handwriting. Three types of characters, e.g., lambda, alpha and gamma, compose the image. The characters are located arbitrarily on an x, y grid. Analyzing the data according to our model and method, we can extract three 15 by 15 kernel distributions shown in FIG. 18A. The three kernel distributions are actually the three characters in the input image. FIG. 18B shows the corresponding prior distributions. The impulse distribution, as shown in FIG. 18C, includes spikes at the locations that correspond to each character in the input image.
  • FIG. 19 shows that an approximation that results using this decomposition has the surprising property of ‘streamlining’ the characters and making all instances of the same character look more alike than in the original image. The latent variable prior probabilities, shown in FIG. 18B, essentially indicate an amount of energy contained in each character. The ‘alpha’, due to a more elaborate stroke, contains more energy than the ‘gamma’, which in turn contains somewhat more energy than the less elaborate ‘lambda’. As in the case of the acoustic data, we can obtain qualitatively the same results, even when the characters overlap.
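  • A hedged usage sketch for this image example follows; the file name, the intensity inversion and the iteration count are illustrative assumptions, and siplca_2d is the earlier sketch.

```python
import numpy as np
from PIL import Image  # assumed available for reading the image

# Treat pixel intensities as a two-dimensional histogram; invert so that the
# dark ink strokes carry the probability mass.
img = np.asarray(Image.open("handwriting.png").convert("L"), dtype=float)
V = img.max() - img

# Three latent variables, 15-by-15 kernels, as in the FIG. 17 example
pz, K, H = siplca_2d(V, n_components=3, kernel_shape=(15, 15), n_iter=80, rng=0)
# K[z] should recover the three character shapes (FIG. 18A), pz their relative
# "ink" energy (FIG. 18B), and spikes in H[z] the character locations (FIG. 18C).
```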
  • Other Applications
  • Our non-convolutive form of PLCA can essentially be characterized as a generalization of Hofmann's probabilistic latent semantic analysis (PLSA), Hofmann, T., “Probabilistic Latent Semantic Analysis,” Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, 1999, incorporated herein by reference.
  • In association with PLSA, our method relates and enriches well known data mining methods employing latent semantic analysis (LSA), singular value decomposition (SVD), and principal component analysis (PCA).
  • Our method also is related to PARAFAC, Bro, R., “PARAFAC. Tutorial and applications,” Chemometrics and Intelligent Laboratory Systems, Volume 38, Issue 2, pp. 149-171, October 1997, incorporated herein by reference. A PARAFAC decomposition attempts to factor multi-linear data structures into vector components. That corresponds to PLCA for arbitrary dimensional distributions. A key difference is that PARAFAC processes are predominantly least squares approximations to arbitrary input data, whereas our method explicitly approximates probability densities by marginal probabilities and has a probabilistic foundation. Versions of the PARAFAC process are known for non-negative inputs. However, to our knowledge no probabilistic or convolutive framework has been described in the prior art.
  • Also related is non-negative matrix factorization (NMF), Lee D. D., and Seung H. S., “Learning the parts of objects by non-negative matrix factorization,” Nature, Vol. 401, No. 6755, pp. 788-791, 21 October, 1999, incorporated herein by reference. In NMF, the objective is to factor a non-negative matrix using two lower rank non-negative matrices. An interesting connection between our method and NMF comes through with the cost function, which is minimized when performing NMF. That function is most often an adaptation of the Kullback-Leibler divergence for arbitrary non-negative functions. That divergence is minimized between the input and the product of the estimated factors.
  • Interestingly enough, the EM procedure we use for PLCA essentially minimizes the KL divergence between the input probability density and the density specified by the product of marginal probabilities. In a way, the left and right factors in NMF correspond to P(x1|z) and P(x2|z), with P(z) already factored in. Even though the estimation procedures for NMF and our PLCA are radically different, we achieve qualitatively the same results for a wide variety of applications in signal processing.
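  • A small sketch of this shared cost follows: the KL divergence between the normalized input and the reconstruction combined from the estimated components can be tracked across iterations as a convergence diagnostic; the function name is illustrative.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-300):
    """KL divergence D(P || Q) between two non-negative arrays of the same
    shape, each normalized to sum to one."""
    P = P / P.sum()
    Q = Q / Q.sum()
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# e.g. kl_divergence(V, approx) with the reconstruction returned by plca_2d
```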
  • Subsequent convolutive extensions to NMF directly correspond to convolutive PLCA. In fact, all results which use various forms of NMF, see Smaragdis, P., Brown, J. C., “Non-negative Matrix Factorization for Polyphonic Music Transcription,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 177-180, October 2003, Smaragdis, P., “Discovering Auditory Objects Through Non-Negativity Constraints,” Statistical and Perceptual Audio Processing (SAPA), SAPA 2004, October 2004, and Smaragdis, P., “Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs,” International Congress on Independent Component Analysis and Blind Signal Separation, ISBN: 3-540-23056-4, Vol. 3195/2004, pp. 494, September 2004 (Springer Lecture Notes in Computer Science), all incorporated herein by reference, can be replicated using the embodiments described herein.
  • In some way, our method can be related to positive deconvolution, Li, L., Speed, T., “Deconvolution of sparse positive spikes: is it ill-posed?,” University of California at Berkeley, Department of Statistics Technical Report 586, 2000, incorporated herein by reference. This is a particularly desirable operation in the fields of astronomical imaging and bio-informatics. The objective is to obtain a non-negative deconvolution of both a convolved sequence and a filter applied to the sequence. Most prior art methods rely on least-squares formulation.
  • Our method can be adapted to that problem. If we define the filter to be a kernel distribution, then we can proceed by performing shift invariant PLCA. However, we only update the impulse distribution and keep the kernel distribution fixed to the filter we use for the deconvolution. Due to the lower number of variables to be estimated, convergence is much more rapid than when performing a complete shift invariant PLCA.
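  • A sketch of this adaptation under those assumptions is shown below: the kernel is fixed to the known filter, there is a single latent variable, and only the impulse distribution is updated; the function name and defaults are illustrative.

```python
import numpy as np

def siplca_deconvolve(V, kernel, n_iter=50, rng=None):
    """Positive deconvolution as shift-invariant PLCA with a fixed kernel:
    only the impulse distribution H[x, y] is re-estimated."""
    rng = np.random.default_rng(rng)
    V = V / V.sum()
    K = kernel / kernel.sum()          # fixed kernel distribution (the filter)
    nx, ny = V.shape
    Tx, Ty = K.shape
    eps = 1e-300

    H = rng.random((nx, ny)); H /= H.sum()
    for _ in range(n_iter):
        # Reconstruction and ratio, exactly as in shift-invariant PLCA ...
        Lam = np.zeros_like(V)
        for tx in range(Tx):
            for ty in range(Ty):
                Lam[tx:, ty:] += K[tx, ty] * H[:nx - tx, :ny - ty]
        Q = V / np.maximum(Lam, eps)

        # ... but only the impulse distribution is updated; the kernel stays fixed
        newH = np.zeros_like(H)
        for tx in range(Tx):
            for ty in range(Ty):
                newH[:nx - tx, :ny - ty] += K[tx, ty] * Q[tx:, ty:]
        H = H * newH
        H /= H.sum()
    return H
```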
  • A point worthy of some discussion is the exponentiation operation in Equation (16), which we use as a mechanism to ensure sparsity of the impulse distributions. Although we stopped short of a probabilistic explanation, we note that this operation corresponds to an information-theoretic manipulation.
  • The ‘flattening’ that the exponentiation produces causes the entropy of the kernel distributions to increase. Because the data we model have a fixed entropy, the increased kernel entropy is ‘borrowed’ from the impulse distributions. This forces the entropy of the impulse distributions to decrease, which, in turn, causes this form of sparse learning. Alternatively, we can raise the impulse distributions to a power greater than one to achieve similar results. However, because the kernel distributions are, in general, smaller, it is more efficient to manipulate them instead. This way of forcing sparsity in such a decomposition is related to sparse NMF, Hoyer, P. O., “Non-negative Matrix Factorization with sparseness constraints,” Journal of Machine Learning Research 5:1457-1469, 2004, incorporated herein by reference.
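  • As a small sketch of the flattening step (the exponent value below is an arbitrary illustrative choice; the operation itself is the one denoted by Equation (16) in the text):

```python
import numpy as np

def flatten_kernel(kernel, beta=0.7):
    """Raise the kernel distribution to a power 0 < beta < 1 and renormalize.

    Flattening raises the kernel's entropy; applied after each M-step, it
    pushes the corresponding impulse distribution toward a sparser form.
    """
    flattened = kernel ** beta
    return flattened / flattened.sum()
```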
  • One consideration is the number and size of the desired components. In most of the examples described above, the number and size are known a priori. A larger number of components usually has the effect of either distributing the desired result over more components, thereby providing a more detailed description, or allocating the extra components to irrelevant information. Fewer components result in either non-detection of some desired components or a consolidation of multiple desired components into one.
  • Large components can result in overfitting because there is little room to shift, whereas small components end up being insufficient to model the input well. In general, as in many dimensionality-reduction processes, it is hard to reliably estimate how many components, and of what size, are needed for an optimal result. However, our probabilistic framework enables the use of conventional techniques, such as the Schwarz-Bayesian information criterion and other similar measures.
  • It should also be noted that if some of the components or their weights are known, then these can be used to find missing components or weights. That is, the method can be used to estimate missing items. This is useful when it is known that a specific component is present but shifted, e.g., a spatially shifted image of a face, where the amount of shift is unknown.
  • Effect of the Invention
  • The embodiment of the invention is a method for decomposing probability distributions into shift-invariant components. We presented our approach in cases of gradually increasing complexity, starting from a simple static model and progressing to a model with arbitrary dimensionality and shift invariance. We also provide an EM procedure to perform the decomposition. The method can be applied to any arbitrary data set.
  • Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
  • Appendix A The Basic Update Rule
  • The update rules for PLCA are obtained through a variant of the expectation-maximization algorithm. We attempt to estimate the parameters Λ of a model P(x; Λ) for the distribution (or density) of a random variable x, such that the KL divergence between P(x; Λ) and the true distribution P(x) of x is minimized. The KL divergence between the two is defined by:
    D(P(x) \| P(x; \Lambda)) = -E_x\{\log P(x; \Lambda)\} - H(x)   (17)
    Here E_x refers to the expectation operator with respect to P(x), the true distribution of x, and H(x) is the entropy of x.
  • Introducing a second random variable z, by Bayes' rule
    \log P(x; \Lambda) = \log P(x, z; \Lambda) - \log P(z \mid x; \Lambda)
    Taking expectations on both sides with respect to P(z | x; Λ′), i.e., the conditional probability of z obtained with any parameter Λ′, and noting that log P(x; Λ) does not depend on z,
    \log P(x; \Lambda) = E_{z|x;\Lambda'}\{\log P(x, z; \Lambda)\} - E_{z|x;\Lambda'}\{\log P(z \mid x; \Lambda)\}   (18)
    Combining the above with Equation 17,
    D(P(x) \| P(x; \Lambda)) = -E_x\{E_{z|x;\Lambda'}\{\log P(x, z; \Lambda)\}\} + E_x\{E_{z|x;\Lambda'}\{\log P(z \mid x; \Lambda)\}\} - H(x)   (19)
    D(P(x) \| P(x; \Lambda)) - D(P(x) \| P(x; \Lambda')) = E_x\{E_{z|x;\Lambda'}\{\log P(x, z; \Lambda')\} - E_{z|x;\Lambda'}\{\log P(x, z; \Lambda)\}\} - D(P(z \mid x; \Lambda') \| P(z \mid x; \Lambda))   (20)
    The KL divergence between two distributions is always non-negative (Theorem 2.6.3 in (11)), i.e., D(P(z | x; Λ′) ∥ P(z | x; Λ)) ≥ 0 for all Λ. Hence, it is assured that
    E_x\{E_{z|x;\Lambda'}\{\log P(x, z; \Lambda)\}\} \geq E_x\{E_{z|x;\Lambda'}\{\log P(x, z; \Lambda')\}\}
    \Rightarrow
    D(P(x) \| P(x; \Lambda)) \leq D(P(x) \| P(x; \Lambda'))   (21)
  • I.e., the distance between P(x; Λ) and P(x) is assured to be less than or equal to the distance between P(x) and P(x; Λ′) if Λ maximizes E_x{E_{z|x;Λ′}{log P(x, z; Λ)}}. This leads to the following iterative update rule for the parameters of P(x; Λ):
    \Lambda^{(n+1)} = \arg\max_{\Lambda} Q(\Lambda, \Lambda^{(n)})
    Q(\Lambda, \Lambda^{(n)}) = E_x\{E_{z|x;\Lambda^{(n)}}\{\log P(x, z; \Lambda)\}\}   (22)
    where Λ^(n) is the estimate obtained for Λ in the nth update. Iterations of Equation 22 result in estimates of Λ that monotonically decrease D(P(x) ∥ P(x; Λ)).
  • Appendix B Update Rules for Non-Convolutive Mixture Models
  • We define an “integral” operator I_x{f(x)} such that for a continuous variable x, I_x\{f(x)\} = \int_{-\infty}^{\infty} f(x)\, dx, while for a discrete random variable x, I_x\{f(x)\} = \sum_x f(x). By the definition of the operator, I_x\{I_y\{f(x,y)\}\} = I_y\{I_x\{f(x,y)\}\}, I_x\{P(x)\} = 1, and I_x\{P(x) g(x)\} = E_x\{g(x)\}.
  • A non-convolutive mixture model for the distribution of the data is
    P(x; \Lambda) = I_z\{P(z) \prod_j P(x_j \mid z)\}   (23)
    where x = {x_j}. Note that the above formulation places no restriction on z, which may be either continuous or discrete. Similarly, each x_j can be continuous or discrete. The parameters of this distribution are P(z) and P(x_j|z), i.e., Λ = {P(z), P(x_j|z) : ∀(z, j)}. We denote the estimates obtained in the nth update by the superscript (n).
    Let us define
    R(x, z) \equiv P^{(n)}(z \mid x) = \frac{P^{(n)}(z) \prod_j P^{(n)}(x_j \mid z)}{I_z\{P^{(n)}(z) \prod_j P^{(n)}(x_j \mid z)\}}   (24)
    We can now write
    E_{z|x;\Lambda^{(n)}}\{\log P(x, z; \Lambda)\} = I_z\{R(x, z) \log(P(z) \prod_j P(x_j \mid z))\} = I_z\{R(x, z)(\log P(z) + \sum_j \log P(x_j \mid z))\}   (25)
  • The update equations are easily derived from Equations 22 and 25, with the additional incorporation of Lagrangian terms to enforce the constraints that the total probability masses under P(z) and P(x_j|z) are unity.
  • We can express the constrained form of the equation to be optimized as:
    \Lambda^{(n+1)} = \arg\max_{\Lambda} Q(\Lambda, \Lambda^{(n)})   (26)
    Q(\Lambda, \Lambda^{(n)}) = E_x\{I_z\{R(x, z) \log P(z)\}\} + E_x\{I_z\{\sum_j R(x, z) \log P(x_j \mid z)\}\} - \lambda I_z\{P(z)\} - I_z\{\sum_j \lambda_{z,j} I_{x_j}\{P(x_j \mid z)\}\}
    = I_z\{E_x\{R(x, z)\} \log P(z) - \lambda P(z)\} + I_z\{\sum_j (E_x\{R(x, z) \log P(x_j \mid z)\} - \lambda_{z,j} I_{x_j}\{P(x_j \mid z)\})\}   (27)
  • We note that in general the optimization of I_x{h(g(x))} with respect to g(x) leads to
    \frac{\partial h(g(x))}{\partial g(x)} = 0
    both for discrete and continuous x, by direct differentiation in the former case and by the calculus of variations in the latter.
  • The (n+1)th estimate of P(z) is obtained by optimizing Q(Λ, Λ^(n)) with respect to P(z), which gives us
    \frac{E_x\{R(x, z)\}}{P^{(n+1)}(z)} - \lambda = 0 \;\Rightarrow\; \lambda P^{(n+1)}(z) = E_x\{R(x, z)\}   (28)
    Since I_z{P^(n+1)(z)} = 1 and I_z{R(x, z)} = 1, applying the I_z{·} operator to both sides of Equation 28 we get λ = 1, leading to the update equation
    P^{(n+1)}(z) = E_x\{R(x, z)\}   (29)
  • To derive the (n+1)th estimate of P(x_j|z), we first note from reference (10) that
    E_x\{R(x, z) \log P(x_j \mid z)\} = E_{x_j}\{\log P(x_j \mid z)\, E_{x|x_j}\{R(x, z)\}\}   (30)
    We can therefore rewrite Q(Λ, Λ^(n)) as
    Q(\Lambda, \Lambda^{(n)}) = I_z\{\sum_j (E_{x_j}\{\log P(x_j \mid z)\, E_{x|x_j}\{R(x, z)\}\} - \lambda_{z,j} I_{x_j}\{P(x_j \mid z)\})\} + C
    = I_z\{\sum_j I_{x_j}\{P(x_j) \log P(x_j \mid z)\, E_{x|x_j}\{R(x, z)\} - \lambda_{z,j} P(x_j \mid z)\}\} + C   (31)
    where C represents all terms that are not a function of P(x_j|z), and P(x_j) is the true marginal density of x_j. Optimizing Q(Λ, Λ^(n)) with respect to P^(n+1)(x_j|z), we obtain
    \frac{P(x_j)\, E_{x|x_j}\{R(x, z)\}}{P^{(n+1)}(x_j \mid z)} - \lambda_{z,j} = 0 \;\Rightarrow\; P(x_j)\, E_{x|x_j}\{R(x, z)\} = \lambda_{z,j} P^{(n+1)}(x_j \mid z)   (32)
    Since I_{x_j}{P^(n+1)(x_j|z)} = 1, we can apply the I_{x_j}{·} operator to both sides of the above equation to obtain
    \lambda_{z,j} = I_{x_j}\{P(x_j)\, E_{x|x_j}\{R(x, z)\}\} = E_{x_j}\{E_{x|x_j}\{R(x, z)\}\} = E_x\{R(x, z)\} = P^{(n+1)}(z)   (33)
    and
    P^{(n+1)}(x_j \mid z) = \frac{P(x_j)\, E_{x|x_j}\{R(x, z)\}}{P^{(n+1)}(z)} = \frac{I_{x/x_j}\{P(x) R(x, z)\}}{P^{(n+1)}(z)}   (34)
    where x/x_j = {x_i : i ≠ j} represents the set of all components of x excluding x_j. Equations 29 and 34 form the final update equations.
  • If z is a discrete random variable, the non-convolutive mixture model is given by
    P(x; \Lambda) = \sum_z P(z) \prod_j P(x_j \mid z)   (35)
    The update equations are given by
    R(x, z) = \frac{P(z) \prod_j P(x_j \mid z)}{\sum_{z'} P(z') \prod_j P(x_j \mid z')}   (36)
    P^{(n+1)}(z) = I_x\{P(x) R(x, z)\}   (37)
    P^{(n+1)}(x_j \mid z) = \frac{I_{x/x_j}\{P(x) R(x, z)\}}{P^{(n+1)}(z)}   (38)
    If x is a discrete random variable (i.e., every x_j is discrete), the specific forms of the update rules (Equations 29 and 34) are:
    P^{(n+1)}(z) = \sum_{x_1} \cdots \sum_{x_N} P(x) R(x, z)   (39)
    P^{(n+1)}(x_j \mid z) = \frac{\sum_{\{x_i : i \neq j\}} P(x) R(x, z)}{P^{(n+1)}(z)}   (40)
    If x is a continuous random variable (i.e., every x_j is continuous), the update equations become:
    P^{(n+1)}(z) = \int_{-\infty}^{\infty} P(x) R(x, z)\, dx   (41)
    P^{(n+1)}(x_j \mid z) = \frac{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} P(x) R(x, z) \prod_{i \neq j} dx_i}{P^{(n+1)}(z)}   (42)
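    For reference, the following NumPy sketch (ours; names such as plca_nd are illustrative) implements Equations 36-38 for an N-dimensional discrete input, generalizing the two-dimensional sketch shown earlier.

```python
import numpy as np

def plca_nd(P, n_components, n_iter=50, rng=None):
    """Sketch of Equations 36-38 for an N-dimensional discrete P(x)."""
    rng = np.random.default_rng(rng)
    dims, N = P.shape, P.ndim
    pz = np.full(n_components, 1.0 / n_components)
    marginals = [rng.random((d, n_components)) for d in dims]
    for m in marginals:
        m /= m.sum(axis=0)

    for _ in range(n_iter):
        # Build P(z) * prod_j P(x_j | z) with shape dims + (n_components,).
        joint = pz.copy()
        for j, m in enumerate(marginals):
            shape = [1] * N + [n_components]
            shape[j] = dims[j]
            joint = joint * m.reshape(shape)
        R = joint / (joint.sum(axis=-1, keepdims=True) + 1e-12)   # Eq. 36
        PR = P[..., None] * R                                     # P(x) R(x, z)
        pz = PR.sum(axis=tuple(range(N)))                         # Eq. 37
        for j in range(N):
            axes = tuple(k for k in range(N) if k != j)
            marginals[j] = PR.sum(axis=axes) / (pz + 1e-12)       # Eq. 38
    return pz, marginals
```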
  • Appendix C Update Rules for Shift-Invariant Mixture Models
  • The shift-invariant mixture model represents the distribution of some dimensions of a multi-variate random variable as a convolution of a density kernel and a shift-invariant “impulse” density. As before, let x be the multi-variate random variable, let y represent the set of components of x that are modelled in a shift-invariant manner, and let w be the rest of the components, i.e., x = w ∪ y and w ∩ y = ∅ (where ∅ represents the null set).
  • The shift-invariant model for the distribution of x is:
    P(x; \Lambda) = I_z\{P(z)\, I_\tau\{P(w, \tau \mid z) P(y - \tau \mid z)\}\}   (43)
    where τ is a random variable defined over the same domain as y. The terms to be estimated are P(z), P(w, τ|z) and P(y|z), i.e., Λ = {P(z), P(w, τ|z), P(y|z)}. Note that Equation 43 assumes that the random variable y is cardinal, irrespective of whether it is discrete or continuous. Also, as before, z may be either continuous or discrete.
  • Let us define
    R(x, \tau, z) = R(w, y, \tau, z) \equiv P^{(n)}(z, \tau \mid x) = \frac{P^{(n)}(z) P^{(n)}(w, \tau \mid z) P^{(n)}(y - \tau \mid z)}{I_z\{P^{(n)}(z)\, I_\tau\{P^{(n)}(w, \tau \mid z) P^{(n)}(y - \tau \mid z)\}\}}   (44)
    R(x, z) \equiv I_\tau\{R(w, y, \tau, z)\} = P^{(n)}(z \mid x)   (45)
  • The (n+1)th estimate of P(z) is derived identically as in Appendix B and is given by
    P^{(n+1)}(z) = E_x\{R(x, z)\}   (46)
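  • To illustrate Equations 44-46, the following is a minimal sketch under our own simplifying assumptions: a single shifted dimension y, no unshifted components w, and discrete variables. The kernel and impulse updates that complete the iteration follow the same pattern.

```python
import numpy as np

def shift_invariant_posteriors(P, pz, imp, ker):
    """Compute R(y, tau, z), R(y, z), and the updated P(z) (Eqs. 44-46).

    P   : 1-D input distribution over y (sums to 1), length T
    pz  : current P(z), shape (K,)
    imp : current impulse distributions P(tau|z), shape (T, K)
    ker : current kernel distributions P(y|z), shape (L, K)
    """
    T, K = imp.shape
    L = ker.shape[0]
    # joint[y, tau, z] = P(z) * P(tau|z) * P(y - tau|z)
    joint = np.zeros((T, T, K))
    for tau in range(T):
        hi = min(T, tau + L)
        joint[tau:hi, tau, :] = pz * imp[tau, :] * ker[:hi - tau, :]
    model = joint.sum(axis=(1, 2), keepdims=True) + 1e-12   # P(y)
    R_ytz = joint / model                                   # Eq. 44
    R_yz = R_ytz.sum(axis=1)                                # Eq. 45
    pz_new = (P[:, None] * R_yz).sum(axis=0)                # Eq. 46
    return R_ytz, R_yz, pz_new
```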

Claims (18)

1. A computer-implemented method for decomposing input data acquired of a signal, comprising the steps of:
sampling an input signal to acquire input data;
representing the input data as a probability distribution, in which the probability distribution is defined as
P(x) = \sum_z P(z) \prod_{j=1}^{N} P(x_j \mid z),
where P(x) is an N-dimensional distribution of a random variable x=x1, x2, . . . , xN, z is a latent variable, and P(xj|z) are one-dimensional marginal distributions; and
applying iteratively an expectation-maximization procedure to the probability distribution to determine components of the probability distribution;
combining the components;
producing output data from the combined components, the output data approximating the input data; and
storing the output data in a computer-readable medium.
2. (canceled)
3. The method of claim 1, in which the signal is an acoustic signal.
4. The method of claim 1, in which the signal is an image.
5. The method of claim 1, in which the components have a lower dimensionality than the input data.
6. The method of claim 1, in which the probability distribution is represented by arbitrary functions.
7. The method of claim 6, in which the arbitrary functions are continuous Gaussian functions.
8. The method of claim 1, in which the probability distribution is represented by discrete histograms.
9. The method of claim 1, in which the components are shift invariant.
10. (canceled)
11. The method of claim 10, in which the latent variable is discrete.
12. The method of claim 10, in which the latent variable is continuous.
13. The method of claim 1, in which the expectation maximization process includes an expectation step and a maximization step, and in which the steps are iterated until a termination condition is reached.
14. The method of claim 13, in which the expectation maximization process produces a sparse code.
15. The method of claim 10, further comprising:
estimating a weight R of the latent variable z according to
R(x, z) = \frac{P(z) \prod_{j=1}^{N} P(x_j \mid z)}{\sum_{z'} P(z') \prod_{j=1}^{N} P(x_j \mid z')}.
16. The method of claim 15, further comprising:
re-estimating the marginal probabilities using the weight to obtain more accurate estimates
P(z) = \int P(x) R(x, z)\, dx, \quad \text{and} \quad P(x_j \mid z) = \frac{\int P(x) R(x, z) \prod_{k \neq j} dx_k}{P(z)},
in which P(xj|z) contains the latent marginal distribution, along a dimension of the variable xj relating to the latent variable z, and P(z) contains a prior probability of that latent variable.
17. The method of claim 1, in which the input data are continuous.
18. The method of claim 1, in which the input data are discrete.