US20040204939A1 - Systems and methods for speaker change detection - Google Patents

Systems and methods for speaker change detection

Info

Publication number
US20040204939A1
Authority
US
United States
Prior art keywords
phone
intervals
classes
speaker
boundary
Legal status
Abandoned
Application number
US10/685,586
Inventor
Daben Liu
Francis Kubala
Current Assignee
Raytheon BBN Technologies Corp
Original Assignee
Individual
Application filed by Individual
Priority to US10/685,586
Assigned to BBNT SOLUTIONS LLC. Assignment of assignors interest (see document for details). Assignors: KUBALA, FRANCIS; LIU, DABEN
Assigned to FLEET NATIONAL BANK, AS AGENT. Patent and trademark security agreement. Assignor: BBNT SOLUTIONS LLC
Publication of US20040204939A1
Assigned to BBN TECHNOLOGIES CORP. Merger (see document for details). Assignor: BBNT SOLUTIONS LLC
Assigned to BBN TECHNOLOGIES CORP. (as successor by merger to BBNT SOLUTIONS LLC). Release of security interest. Assignor: BANK OF AMERICA, N.A. (successor by merger to FLEET NATIONAL BANK)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/32: Multiple recognisers used in sequence or in parallel; score combination systems therefor, e.g. voting systems


Abstract

A speaker change detection system performs speaker change detection on an input audio stream. The speaker change detection system includes a segmentation component [401], a phone classification decode component [402], and a speaker change detection component [403]. The segmentation component [401] segments the audio stream into intervals [501-504] of a predetermined length. The intervals may overlap one another. The phone classification decode component decodes the intervals to produce a set of phone classes corresponding to each of the intervals. The speaker change detection component detects locations of speaker changes in the audio stream based on a similarity value calculated at phone class boundaries.

Description

    RELATED APPLICATIONS
  • This application claims priority under 35 U.S.C. § 119 based on U.S. Provisional Application No. 60/419,214 filed Oct. 17, 2002, the disclosure of which is incorporated herein by reference.[0001]
  • GOVERNMENT CONTRACT
  • [0002] The U.S. Government may have a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of Contract No. DARPA F30602-97-C-0253.
  • BACKGROUND OF THE INVENTION
  • A. Field of the Invention [0003]
  • The present invention relates generally to speech processing and, more particularly, to speaker change detection in an audio signal. [0004]
  • B. Description of Related Art [0005]
  • Speech has not traditionally been valued as an archival information source. As effective as the spoken word is for communicating, archiving spoken segments in a useful and easily retrievable manner has long been a difficult proposition. Although the act of recording audio is not difficult, automatically transcribing and indexing speech in an intelligent and useful manner can be difficult. [0006]
  • Speech is typically received by a speech recognition system as a continuous stream of words without breaks. In order to effectively use the speech in information management systems (e.g., information retrieval, natural language processing and real-time alerting systems), the speech recognition system initially processes the speech to generate a formatted version of the speech. The speech may be transcribed and linguistic information, such as sentence structures, may be associated with the transcription. Additionally, information relating to the speakers may be associated with the transcription. Many speech recognition applications assume that all of the input speech originated from a single speaker. However, speech signals, such as news broadcasts, may include speech from a number of different speakers. It is often desirable to identify different speakers in the transcription of the broadcast. Before identifying an individual speaker, however, it is first necessary to determine when the different speakers begin and end speaking (called speaker change detection). [0007]
  • [0008] FIG. 1 is a diagram illustrating speaker change detection using a conventional speaker change detection technique. An input audio stream is divided into a continuous stream of audio intervals 101. The intervals each cover a fixed time duration such as 10 ms. Thus, in the example shown in FIG. 1, in which five intervals 101 are shown, a total of 50 ms of audio is illustrated. Boundaries between intervals 101, such as boundaries 110-113, are considered to be potential candidates for a speaker change determination. The conventional speaker change detection system would examine all, or a predetermined number of intervals 101, to the left of a particular boundary, such as boundary 112, and to the right of boundary 112. If the audio to the left and right of boundary 112 was determined to be dissimilar enough, boundary 112 was assumed to be a speaker change. Whether the audio to the left and right of boundary 112 is similar or dissimilar may be based on comparing cepstral vectors for the audio.
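To make the conventional scan concrete, the following is a minimal sketch, not from the patent: it assumes 10 ms frames already reduced to cepstral vectors, a fixed comparison window on each side of a boundary, and a simple Euclidean distance standing in for the cepstral comparison; the window size, threshold, and function names are all illustrative assumptions.

```python
import numpy as np

def fixed_interval_scan(cepstra, window=100, threshold=2.0):
    """Flag candidate speaker changes at every 10 ms frame boundary.

    cepstra:   (n_frames, dim) array, one cepstral vector per 10 ms interval.
    window:    number of frames compared on each side of a boundary (assumed).
    threshold: distance above which the two sides count as dissimilar (assumed).
    """
    changes = []
    for b in range(window, len(cepstra) - window):
        left = cepstra[b - window:b].mean(axis=0)    # audio left of boundary b
        right = cepstra[b:b + window].mean(axis=0)   # audio right of boundary b
        if np.linalg.norm(left - right) > threshold:
            changes.append(b)   # boundary b is a candidate speaker change
    return changes
```

Because every 10 ms boundary is tested, the comparison runs roughly one hundred times per second of audio, which is the computational burden the invention avoids.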
  • One drawback to the conventional speaker detection technique described above is that the interval boundaries occur at arbitrary locations. Accordingly, to ensure that a speaker change boundary is not missed, the intervals are assigned relatively short interval durations (e.g., 10 ms). Calculating whether audio segments are similar or dissimilar at each of the interval boundaries can, thus, be computationally burdensome. [0009]
  • There is, therefore, a need in the art for improved speaker change detection techniques. [0010]
  • SUMMARY OF THE INVENTION
  • Systems and methods consistent with the principles of this invention provide for fast speaker boundary change detection. [0011]
  • A method consistent with aspects of the invention detects speaker changes in an input audio stream. The method includes segmenting the input audio stream into predetermined length intervals and decoding the intervals to produce a set of phones corresponding to each of the intervals. The method further includes generating a similarity measurement based on a first portion of the audio stream within an interval prior to a boundary between adjacent phones and a second portion of the audio stream within the interval and after the boundary. Further, the method includes detecting speaker changes based on the similarity measurement. [0012]
  • A device consistent with another aspect of the invention detects speaker changes in an audio signal. The device comprises a processor and a memory containing instructions. The instructions, when executed by the processor, cause the processor to: segment the audio signal into predetermined length intervals and decode the intervals to produce a set of phones corresponding to the intervals. The instructions additionally cause the processor to generate a similarity measurement based on a first portion of the audio signal prior to a boundary between phones in one of the sets of phones and a second portion of the audio signal after the boundary, and detect speaker changes based on the similarity measurement. [0013]
  • Yet another aspect of the invention is directed to a device for detecting speaker changes in an audio signal. The device comprises a segmentation component that segments the audio signal into predetermined length intervals and a phone classification decode component that decodes the intervals to produce a set of phone classes corresponding to each of the intervals. Further, a speaker change detection component detects locations of speaker changes in the audio signal based on a similarity value calculated over a first portion of the audio signal prior to a boundary between phone classes in one of the sets of phone classes and a second portion of the audio signal after the boundary in the one set of phone classes. [0014]
  • Another aspect of the invention is directed to a system comprising an indexer, a memory system, and a server. The indexer receives input audio data and generates a rich transcription from the audio data. The rich transcription includes metadata that defines speaker changes in the audio data. The indexer further includes a segmentation component configured to divide the audio data into overlapping segments and a speaker change detection component. The speaker change detection component detects locations of speaker changes in the audio data based on a similarity value calculated at locations in the segments that correspond to phone class boundaries. The memory system stores the rich transcription and the server receives requests for documents and responds to the requests by transmitting ones of the rich transcriptions that match the requests.[0015]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings, [0016]
  • FIG. 1 is a diagram illustrating speaker change detection using a conventional speaker change detection technique; [0017]
  • FIG. 2 is a diagram of a system in which systems and methods consistent with the present invention may be implemented; [0018]
  • FIG. 3 is a diagram illustrating an exemplary computing device; [0019]
  • FIG. 4 is a diagram illustrating functional components of the speaker segmentation logic shown in FIG. 2; [0020]
  • FIG. 5 is a diagram illustrating an audio stream broken into intervals consistent with an aspect of the invention; [0021]
  • FIG. 6 is a diagram illustrating exemplary phone classes decoded for an audio interval; and [0022]
  • FIG. 7 is a flow chart illustrating the operation of the speaker change detection component shown in FIG. 4.[0023]
  • DETAILED DESCRIPTION
  • The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents. [0024]
  • A speaker change detection (SCD) system consistent with the present invention performs speaker change detection on an input audio stream. The speaker change detection can be performed in real time or faster than real time. Speaker changes are detected only at phone boundaries detected within audio segments of a predetermined length. The predetermined length of the audio segments may be much longer, such as 30 seconds, than the conventional predetermined intervals (e.g., 10 ms) used in detecting speaker changes. [0025]
  • Exemplary System
  • [0026] FIG. 2 is a diagram of an exemplary system 200 in which systems and methods consistent with the present invention may be implemented. In general, system 200 provides indexing and retrieval of input audio for clients 250. For example, system 200 may index input speech data, create a structural summarization of the data, and provide tools for searching and browsing the stored data.
  • [0027] System 200 may include multimedia sources 210, an indexer 220, memory system 230, and server 240 connected to clients 250 via network 260. Network 260 may include any type of network, such as a local area network (LAN), a wide area network (WAN) (e.g., the Internet), a public telephone network (e.g., the Public Switched Telephone Network (PSTN)), a virtual private network (VPN), or a combination of networks. The various connections shown in FIG. 2 may be made via wired, wireless, and/or optical connections.
  • [0028] Multimedia sources 210 may include one or more audio sources, such as radio broadcasts, or video sources, such as television broadcasts. Indexer 220 may receive audio data from one of these sources as an audio stream or file.
  • [0029] Indexer 220 may receive the input audio data from multimedia sources 210 and generate a rich transcription therefrom. For example, indexer 220 may segment the input data by speaker, cluster audio segments from the same speaker, identify speakers by name or gender, and transcribe the spoken words. Indexer 220 may also segment the input data based on topic and locate the names of people, places, and organizations. Indexer 220 may further analyze the input data to identify when each word was spoken (possibly based on a time value). Indexer 220 may include any or all of this information as metadata associated with the transcription of the input audio data. To this end, indexer 220 may include speaker segmentation logic 221, speech recognition logic 222, speaker clustering logic 223, speaker identification logic 224, name spotting logic 225, topic classification logic 226, and story segmentation logic 227.
  • [0030] Speaker segmentation logic 221 detects changes in speakers in a manner consistent with aspects of the present invention. Speaker segmentation logic 221 is described in more detail below.
  • [0031] Speech recognition logic 222 may use statistical models, such as acoustic models and language models, to process input audio data. The language models may include n-gram language models, where the probability of each word is a function of the previous word (for a bi-gram language model) or of the previous two words (for a tri-gram language model). Typically, the higher the order of the language model, the higher the recognition accuracy, at the cost of slower recognition speeds. The language models may be trained on data that is manually and accurately transcribed by a human.
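As a worked illustration of the n-gram factorization described above, a sketch follows; the trigram probabilities are invented values for the example, not trained estimates.

```python
# Trigram factorization: P(w1..wn) is approximated by the product of
# P(wi | wi-2, wi-1). The values below are made up for illustration.
trigram = {
    ("<s>", "<s>", "good"): 0.1,
    ("<s>", "good", "evening"): 0.4,
    ("good", "evening", "everyone"): 0.2,
}

def sentence_prob(words, model):
    """Chain-rule probability of a sentence under a trigram model."""
    history = ["<s>", "<s>"]        # two sentence-start symbols
    p = 1.0
    for w in words:
        p *= model.get((history[0], history[1], w), 1e-6)  # tiny floor for unseen trigrams
        history = [history[1], w]
    return p

print(sentence_prob(["good", "evening", "everyone"], trigram))  # 0.1 * 0.4 * 0.2
```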
  • [0032] Speaker clustering logic 223 may identify all of the segments from the same speaker in a single document (i.e., a body of media that is contiguous in time (from beginning to end or from time A to time B)) and group them into speaker clusters. Speaker clustering logic 223 may then assign each of the speaker clusters a unique label. Speaker identification logic 224 may identify the speaker in each speaker cluster by name or gender.
  • [0033] Name spotting logic 225 may locate the names of people, places, and organizations in the transcription. Name spotting logic 225 may extract the names and store them in a database. Topic classification logic 226 may assign topics to the transcription. Each of the words in the transcription may contribute differently to each of the topics assigned to the transcription. Topic classification logic 226 may generate a rank-ordered list of all possible topics and corresponding scores for the transcription.
  • [0034] Story segmentation logic 227 may change the continuous stream of words in the transcription into document-like units with coherent sets of topic labels and other document features. This information may constitute metadata corresponding to the input audio data. Story segmentation logic 227 may output the metadata in the form of documents to memory system 230, where a document corresponds to a body of media that is contiguous in time (from beginning to end or from time A to time B).
  • [0035] In one implementation, logic 222-227 may be implemented in a manner similar to that described by John Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, Vol. 88, No. 8, August 2000, pp. 1338-1353, which is incorporated herein by reference.
  • [0036] Memory system 230 may store documents from indexer 220. Memory system 230 may include one or more databases 231. Database 231 may include a conventional database, such as a relational database, that stores documents from indexer 220. Database 231 may also store documents received from clients 250 via server 240. Server 240 may include logic that interacts with memory system 230 to store documents in database 231, query or search database 231, and retrieve documents from database 231.
  • [0037] Server 240 may include a computer or another device that is capable of interacting with memory system 230 and clients 250 via network 260. Server 240 may receive queries from clients 250 and use the queries to retrieve relevant documents from memory system 230. Clients 250 may include personal computers, laptops, personal digital assistants, or other types of devices that are capable of interacting with server 240 to retrieve documents from memory system 230. Clients 250 may present information to users via a graphical user interface, such as a web browser window.
  • [0038] Typically, in the operation of system 200, audio streams are transcribed as rich transcriptions that include metadata that defines information, such as speaker identification and story segments, related to the audio streams. Indexer 220 generates the rich transcriptions. Clients 250, via server 240, may then search and browse the rich transcriptions.
  • [0039] FIG. 3 is a diagram illustrating an exemplary computing device 300 that may correspond to server 240 or clients 250. Computing device 300 may include bus 310, processor 320, main memory 330, read only memory (ROM) 340, storage device 350, input device 360, output device 370, and communication interface 380. Bus 310 permits communication among the components of computing device 300.
  • [0040] Processor 320 may include any type of conventional processor or microprocessor that interprets and executes instructions. Main memory 330 may include a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 320. ROM 340 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 320. Storage device 350 may include a magnetic and/or optical recording medium and its corresponding drive.
  • [0041] Input device 360 may include one or more conventional mechanisms that permit an operator to input information to computing device 300, such as a keyboard, a mouse, a pen, a number pad, a microphone and/or biometric mechanisms, etc. Output device 370 may include one or more conventional mechanisms that output information to the operator, including a display, a printer, speakers, etc. Communication interface 380 may include any transceiver-like mechanism that enables computing device 300 to communicate with other devices and/or systems. For example, communication interface 380 may include mechanisms for communicating with another device or system via a network, such as network 260.
  • Exemplary Processing
  • [0042] As previously mentioned, speaker segmentation logic 221 locates positions in the input audio stream at which there are speaker changes. FIG. 4 is a diagram illustrating the functional components of speaker segmentation logic 221. As shown, speaker segmentation logic 221 includes input stream segmentation component 401, phone classification decode component 402, and speaker change detection (SCD) component 403.
  • [0043] Input stream segmentation component 401 breaks the input stream into a continuous stream of audio intervals. The audio intervals may be of a predetermined length, such as 30 seconds. FIG. 5 illustrates an audio stream broken into 30 second intervals 501-504. Two successive intervals may include overlapping portions. Intervals 502 and 503, for example, may include overlapping portion 505. Overlapping portion 505 ensures that speaker changes at the boundaries of two intervals can still be detected. Portion 505 is the end of audio interval 502 and the beginning of audio interval 503. Input stream segmentation component 401 passes successive intervals, such as intervals 501-504, to phone classification decode component 402.
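A minimal sketch of this segmentation step follows; the 30 second interval length comes from the text, while the 5 second overlap and the function shape are assumptions for illustration.

```python
def segment_stream(total_sec, interval_sec=30.0, overlap_sec=5.0):
    """Yield (start, end) times of fixed-length audio intervals.

    Consecutive intervals share overlap_sec seconds of audio, mirroring
    overlapping portion 505 in FIG. 5; the patent does not specify the
    overlap length, so 5 seconds is an assumed value.
    """
    step = interval_sec - overlap_sec
    start = 0.0
    while start < total_sec:
        yield (start, min(start + interval_sec, total_sec))
        start += step

# Example: a 90 second stream.
print(list(segment_stream(90.0)))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 80.0), (75.0, 90.0)]
```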
  • [0044] Returning to FIG. 4, phone classification decode component 402 locates phones in each of its input audio intervals, such as intervals 501-504. A phone is the smallest acoustic event that distinguishes one word from another. The number of phones used to represent a particular language may vary depending on the particular phone model that is employed. In the phone system described in Kubala et al., “The 1997 Byblos System Applied to Broadcast News Transcription,” Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Va., February 1998, 45 context-independent phone models were trained for each gender (90 gender-dependent models). One additional model was trained for silence, giving a total of 91 phone models. The 91 phone models were then used to decode speech, resulting in an output sequence of phones with gender and silence labels.
  • [0045] Phone classification decode component 402, instead of attempting to decode audio intervals 501-504 using a complete phone set, classifies audio intervals 501-504 using a simplified phone decode set. More particularly, phone classification decode component 402 may classify phones into three phone classes: vowels and nasals, fricatives or sibilants, and obstruents. The phones within each class have similar acoustic characteristics. Vowels and nasals are similar in that they both have pitch and high energy.
  • [0046] Phone classification decode component 402 may use four additional phone classes directed to non-speech events: music, laughter, breath and lip-smack, and silence. This gives a total of seven phone classes, which are illustrated in Table I, below.
    TABLE I
    CLASS                   CONVENTIONAL PHONES INCLUDED IN CLASS
    Vowels and Nasals       AX, IX, AH, EH, IH, OH, UH, EY, IY, AY, OY, AW, OW, UW, AO, AA, AE, EI, ER, AXR, M, N, NX, L, R, W, Y
    Fricatives              V, F, HH, TH, DH, Z, ZH, S, SH
    Obstruents              B, D, G, P, T, K, DX, JH, CH
    Music                   (non-speech; no conventional phones)
    Laughter                (non-speech; no conventional phones)
    Breath and lip-smack    (non-speech; no conventional phones)
    Silence                 (non-speech; no conventional phones)
  • [0047] As shown in Table I, the phone class “vowels and nasals” includes a number of conventional phones. Similarly, the phone classes “fricatives” and “obstruents” may also include a number of conventional phones. Phone classification decode component 402 does not, however, need to distinguish between the individual phones within any particular phone class. Instead, phone classification decode component 402 simply classifies incoming audio as a sequence of the phone classes shown in Table I, as in the sketch below. This allows phone classification decode component 402 to be trained on a reduced number of events (seven), thus requiring a simpler and more computationally efficient phone decode model.
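The forward-referenced sketch: Table I expressed as a lookup that collapses a conventional phone sequence into the reduced class inventory. The class identifier strings are invented labels for illustration.

```python
# Collapse the conventional phone set into the phone classes of Table I.
PHONE_CLASS = {}
for p in ("AX IX AH EH IH OH UH EY IY AY OY AW OW UW AO AA AE EI ER AXR "
          "M N NX L R W Y").split():
    PHONE_CLASS[p] = "vowels_nasals"
for p in "V F HH TH DH Z ZH S SH".split():
    PHONE_CLASS[p] = "fricatives"
for p in "B D G P T K DX JH CH".split():
    PHONE_CLASS[p] = "obstruents"
# The four non-speech classes have no conventional-phone members:
NON_SPEECH = ("music", "laughter", "breath_lipsmack", "silence")

def to_class_sequence(phones):
    """Map a decoded phone sequence onto the reduced class inventory."""
    return [PHONE_CLASS.get(p, p) for p in phones]

print(to_class_sequence(["DH", "AH", "K", "AE", "T"]))
# ['fricatives', 'vowels_nasals', 'obstruents', 'vowels_nasals', 'obstruents']
```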
  • [0048] Phone classification decode component 402 may use a 5-state Hidden Markov Model (HMM) to model each of the seven phone classes. One codebook of 64 diagonal-covariance Gaussian mixture components is shared by the five states of the same phone class, and a Gaussian mixture weight is trained for each state. One suitable implementation of phone classification decode component 402 is discussed in Daben Liu et al., “Fast Speaker Change Detection for Broadcast News Transcription and Indexing,” Proceedings of Eurospeech 99, Budapest, Hungary, September 1999, pp. 1031-1034, which is incorporated herein by reference.
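One way to write this model shape down as data, purely as a sketch: the five states, the shared 64-Gaussian codebook, and the per-state mixture weights follow the text, while the feature dimension and all field names are assumptions.

```python
from dataclasses import dataclass, field
import numpy as np

CEPSTRAL_DIM = 13  # assumed feature dimension; the patent does not specify one

@dataclass
class PhoneClassHMM:
    """5-state HMM for one phone class: a codebook of 64 diagonal-covariance
    Gaussians shared by all five states, plus per-state mixture weights."""
    name: str
    n_states: int = 5
    codebook_means: np.ndarray = field(
        default_factory=lambda: np.zeros((64, CEPSTRAL_DIM)))
    codebook_vars: np.ndarray = field(
        default_factory=lambda: np.ones((64, CEPSTRAL_DIM)))
    # One weight vector per state over the shared 64-Gaussian codebook.
    state_weights: np.ndarray = field(
        default_factory=lambda: np.full((5, 64), 1.0 / 64))

# One model per phone class of Table I.
models = [PhoneClassHMM(c) for c in
          ("vowels_nasals", "fricatives", "obstruents",
           "music", "laughter", "breath_lipsmack", "silence")]
```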
  • [0049] FIG. 6 is a diagram illustrating exemplary phone classes 601-605 decoded for an audio interval, such as interval 501, by phone classification decode component 402. A typical 30 second speech interval may include approximately 300 decoded phone classes. Between each pair of adjacent phone classes is a phone boundary, such as boundaries 610-612.
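Assuming the decoder emits time-stamped class tokens (an assumed output format), the candidate boundaries for the SCD stage are simply the token end times:

```python
def phone_class_boundaries(decoded):
    """decoded: list of (phone_class, start_sec, end_sec) tuples in time order.
    Returns the times separating adjacent tokens, i.e. the candidate
    boundaries (610-612 in FIG. 6) examined by SCD component 403."""
    return [end for (_, _, end) in decoded[:-1]]

tokens = [("vowels_nasals", 0.00, 0.12), ("obstruents", 0.12, 0.18),
          ("silence", 0.18, 0.40)]
print(phone_class_boundaries(tokens))  # [0.12, 0.18]
```

With roughly 300 class tokens per 30 second interval, this yields about ten candidate boundaries per second of audio, versus one hundred per second under the conventional 10 ms scheme.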
  • [0050] Returning to FIG. 4, SCD component 403 receives the phone-decoded audio intervals from phone classification decode component 402. SCD component 403 may then analyze each boundary 610-612 to determine if the boundary corresponds to a speaker change. More specifically, SCD component 403 compares the audio signal before the boundary and after the boundary within the audio interval. For example, when analyzing boundary 611 in interval 501, SCD component 403 may compare the audio signal corresponding to phone classes 601 and 602 with the audio signal corresponding to classes 603-605.
  • [0051] FIG. 7 is a flow chart illustrating the operation of SCD component 403 in additional detail when detecting speaker change boundaries in intervals 501-504 (FIG. 5).
  • [0052] To begin, SCD component 403 may set the starting position at one of the phone class boundaries of an interval to be analyzed (Act 701). The selected boundary may not be the first boundary in the interval, as there may not be sufficient audio data between the start of the interval and the first boundary for a valid comparison of acoustic features. Accordingly, the first selected boundary may be, for example, half-way into overlapping portion 505. SCD component 403 may similarly process the previous interval up to the point half-way into the same overlapping portion. In this manner, the complete audio stream is processed.
  • [0053] SCD component 403 may next calculate cepstral vectors for the regions before and after the selected boundary of the active interval (Act 702). The calculation of cepstral vectors for samples of audio data is well known in the art and will not be described in detail herein. The two cepstral vectors are compared (Act 703). In one implementation consistent with the present invention, the two cepstral vectors are compared using the generalized likelihood ratio test. Assume that the cepstral vector to the left of the selected boundary is vector x and the cepstral vector to the right of the selected boundary is vector y. The generalized likelihood ratio test may be written as:

    λ = L(z; μ_z, Σ_z) / ( L(x; μ_x, Σ_x) · L(y; μ_y, Σ_y) ),

  • [0054] where L(v; μ_v, Σ_v) is the maximum likelihood of v, and z is the union of x and y.
  • [0055] If λ is above a predetermined threshold, the two vectors are considered to be similar to one another, and are assumed to be from the same speaker (Acts 704 and 705). Otherwise, the two vectors are dissimilar to one another, and the boundary point corresponding to the two vectors is defined as a speaker change boundary (Acts 704 and 706). The predetermined threshold may be determined empirically.
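A numerical sketch of the test in Acts 702-706 follows, under the assumption that each region is modeled by a single full-covariance Gaussian fitted by maximum likelihood (the patent does not name the density family; single Gaussians are a common choice for this ratio). Working with log λ avoids numerical underflow; all names and the synthetic data are illustrative.

```python
import numpy as np

def gaussian_max_loglik(frames):
    """Maximum log-likelihood of frames (an (n, d) array of cepstral
    vectors) under a single Gaussian with ML mean and covariance."""
    n, d = frames.shape
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # ridge for stability
    _, logdet = np.linalg.slogdet(cov)
    diff = frames - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)  # Mahalanobis terms
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad.sum())

def log_glr(x, y):
    """log lambda = log L(z) - log L(x) - log L(y), z the union of x and y.
    Values near zero mean one model fits both sides nearly as well as
    two separate models, i.e. the regions are similar."""
    z = np.vstack([x, y])
    return gaussian_max_loglik(z) - gaussian_max_loglik(x) - gaussian_max_loglik(y)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(150, 13))   # region left of the boundary
y = rng.normal(2.0, 1.0, size=(150, 13))   # right region, shifted "speaker"
print(log_glr(x, y))   # strongly negative: dissimilar, likely a speaker change
```

A threshold on log λ then implements Acts 704-706: values above the threshold keep the same-speaker assumption, values below it mark a speaker change boundary.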
  • [0056] If there are any further boundaries in the interval, SCD component 403 repeats Acts 702-706 for the next boundary (Acts 707 and 708). If there are no further boundaries, the interval has been completely processed.
  • [0057] In alternate implementations, SCD component 403 may apply different threshold levels for boundaries that adjoin non-speech phones. Thus, for a boundary between speech phones, SCD component 403 may use a threshold that is less likely to give a speaker change indication than the threshold used at a boundary between non-speech phones. Further, λ tends to be larger when calculated over larger data sets. Accordingly, a bias factor based on the size of the data set may be added to λ to compensate for this property.
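This alternate implementation could be layered on top of the ratio as follows, reusing log_glr from the previous sketch. The specifics are speculative: the patent gives neither threshold values nor the functional form of the bias, only that speech-phone boundaries use a stricter threshold and that a size-dependent bias offsets the growth of λ with the amount of data.

```python
def is_speaker_change(x, y, speech_thr=-80.0, nonspeech_thr=-40.0,
                      bias_per_frame=0.05, nonspeech_boundary=False):
    """Threshold the (bias-corrected) log GLR at a phone class boundary.

    All numeric values are illustrative assumptions. The lower (stricter)
    threshold between speech phones makes a change indication less likely
    there, and the size-dependent correction compensates for lambda growing
    with the amount of data; its linear form and sign here are assumptions.
    """
    score = log_glr(x, y) - bias_per_frame * (len(x) + len(y))
    threshold = nonspeech_thr if nonspeech_boundary else speech_thr
    return score < threshold   # dissimilar enough: speaker change boundary
```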
  • CONCLUSION
  • Speaker segmentation logic, as described herein, detects changes in speakers at boundary points determined by the location of phones in speech. The phones are decoded for the speech using a reduced set of possible phones. Further, the speaker segmentation logic processes the audio as discrete intervals of audio in which boundary portions of the audio are set to overlap to ensure accurate boundary detection over the whole audio signal. [0058]
  • The foregoing description of preferred embodiments of the invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. Moreover, while a series of acts has been presented with respect to FIG. 7, the order of the acts may be different in other implementations consistent with the present invention. [0059]
  • Certain portions of the invention have been described as software that performs one or more functions. The software may more generally be implemented as any type of logic. This logic may include hardware, such as an application-specific integrated circuit or a field-programmable gate array, software, or a combination of hardware and software. [0060]
  • No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. [0061]
  • The scope of the invention is defined by the claims and their equivalents. [0062]

Claims (31)

What is claimed:
1. A method for detecting speaker changes in an input audio stream comprising:
segmenting the input audio stream into predetermined length intervals;
decoding the intervals to produce a set of phones corresponding to each of the intervals;
generating a similarity measurement based on a first portion of the audio stream within one of the intervals and prior to a boundary between adjacent phones and a second portion of the audio stream within the one of the intervals after the boundary; and
detecting speaker changes based on the similarity measurement.
2. The method of claim 1, wherein the predetermined length intervals are approximately thirty seconds in length.
3. The method of claim 1, wherein segmenting the input audio stream includes:
creating the predetermined length intervals such that portions of the intervals overlap one another.
4. The method of claim 1, wherein generating a similarity measurement includes:
calculating cepstral vectors for the audio stream prior to the boundary and the audio stream after the boundary, and
comparing the cepstral vectors.
5. The method of claim 4, wherein the cepstral vectors are compared using a generalized likelihood ratio test.
6. The method of claim 5, wherein a speaker change is detected when the generalized likelihood ratio test produces a value less than a preset threshold.
7. The method of claim 1, wherein the decoded set of phones is selected from a simplified corpus of phone classes.
8. The method of claim 7, wherein the simplified corpus of phone classes includes a phone class for vowels and nasals, a phone class for fricatives, and a phone class for obstruents.
9. The method of claim 8, wherein the simplified corpus of phone classes further includes a phone class for music, laughter, breath and lip-smack, and silence.
10. The method of claim 7, wherein the simplified corpus of phone classes includes approximately seven phone classes.
11. A device for detecting speaker changes in an audio signal, the device comprising:
a processor; and
a memory containing instructions that when executed by the processor cause the processor to:
segment the audio signal into predetermined length intervals,
decode the intervals to produce a set of phones corresponding to each of the intervals,
generate a similarity measurement based on a first portion of the audio signal prior to a boundary between phones in one of the sets of phones and a second portion of the audio signal after the boundary, and
detect speaker changes based on the similarity measurement.
12. The device of claim 11, wherein the predetermined length intervals are approximately thirty seconds in length.
13. The device of claim 11, wherein segmenting the audio signal includes:
creating the predetermined length intervals such that portions of the intervals overlap one another.
14. The device of claim 11, wherein the set of phones is selected from a simplified corpus of phone classes.
15. The device of claim 14, wherein the simplified corpus of phone classes includes a phone class for vowels and nasals, a phone class for fricatives, and a phone class for obstruents.
16. The device of claim 15, wherein the simplified corpus of phone classes further includes a phone class for music, laughter, breath and lip-smack, and silence.
17. The device of claim 14, wherein the simplified corpus of phone classes includes approximately seven phone classes.
18. A device for detecting speaker changes in an audio signal, the device comprising:
a segmentation component configured to segment the audio signal into predetermined length intervals;
a phone classification decode component configured to decode the intervals to produce a set of phone classes corresponding to each of the intervals, a number of possible phone classes being approximately seven; and
a speaker change detection component configured to detect locations of speaker changes in the audio signal based on a similarity value calculated over a first portion of the audio signal prior to a boundary between phone classes in one of the sets of phone classes and a second portion of the audio signal after the boundary in the one of the sets of phone classes.
19. The device of claim 18, wherein the predetermined length intervals are approximately thirty seconds in length.
20. The device of claim 18, wherein the segmentation component segments the predetermined length intervals such that portions of the intervals overlap one another.
21. The device of claim 18, wherein the phone classes include a phone class for vowels and nasals, a phone class for fricatives, and a phone class for obstruents.
22. The device of claim 21, wherein the phone classes further include a phone class for music, laughter, breath and lip-smack, and silence.
23. A system comprising:
an indexer configured to receive input audio data and generate a rich transcription from the audio data, the rich transcription including metadata that defines speaker changes in the audio data, the indexer including:
a segmentation component configured to divide the audio data into overlapping segments,
a speaker change detection component configured to detect locations of speaker changes in the audio data based on a similarity value calculated at locations in the segments that correspond to phone class boundaries;
a memory system for storing the rich transcription; and
a server configured to receive requests for documents and to respond to the requests by transmitting ones of the rich transcriptions that match the requests.
24. The system of claim 23, wherein the indexer further includes at least one of: a speaker clustering component, a speaker identification component, a name spotting component, and a topic classification component.
25. The system of claim 23, wherein the overlapping segments are segments of a predetermined length.
26. The system of claim 25, wherein the predetermined length is approximately thirty seconds.
27. The system of claim 23, wherein the phone classes include a phone class for vowels and nasals, a phone class for fricatives, and a phone class for obstruents.
28. The system of claim 27, wherein the phone classes additionally include a phone class for music, laughter, breath and lip-smack, and silence.
29. The system of claim 23, wherein the phone classes include approximately seven phone classes.
30. A device comprising:
means for segmenting an input audio stream into predetermined length intervals;
means for decoding the intervals to produce a set of phones corresponding to each of the intervals;
means for generating a similarity measurement based on audio within one of the intervals that is prior to a boundary between adjacent phones and based on audio within the one of the intervals that is after the boundary; and
means for detecting speaker changes based on the similarity measurement.
31. The device of claim 30, wherein the predetermined length intervals overlap one another.
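By way of illustration only, and not as part of the claimed subject matter, the sketches below restate several of the claimed operations in Python. First, the segmentation of claims 1-3: predetermined-length intervals whose portions overlap one another. The thirty-second length follows claim 2; the five-second overlap is an assumption chosen for illustration, since the claims do not fix an overlap amount.

def segment(samples, rate, interval_s=30.0, overlap_s=5.0):
    """Split an audio stream into predetermined-length, overlapping intervals.

    Returns (start_sample, interval) pairs so that boundaries found inside an
    interval can be mapped back to positions in the original stream.
    """
    size = int(interval_s * rate)                 # samples per interval
    step = int((interval_s - overlap_s) * rate)   # hop between interval starts
    intervals = []
    for start in range(0, max(len(samples) - size, 0) + 1, step):
        intervals.append((start, samples[start:start + size]))
    return intervals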
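Claims 4-6 compare cepstral vectors computed on either side of a phone boundary using a generalized likelihood ratio test, declaring a speaker change when the ratio falls below a preset threshold. A minimal sketch follows, assuming a single full-covariance Gaussian fit by maximum likelihood to each side; the cepstral front end and the threshold value are placeholders, not values taken from the specification.

import numpy as np

def gaussian_loglik(frames):
    """Maximum-likelihood Gaussian log-likelihood of (num_frames, num_coeffs) data."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    diff = frames - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    d = frames.shape[1]
    return -0.5 * np.sum(quad + logdet + d * np.log(2.0 * np.pi))

def log_glr(before, after):
    """Log generalized likelihood ratio; near zero when one model fits both sides."""
    pooled = np.vstack([before, after])
    return gaussian_loglik(pooled) - gaussian_loglik(before) - gaussian_loglik(after)

def is_speaker_change(before, after, threshold=-50.0):
    # Per claim 6: a change is detected when the ratio is less than a preset
    # threshold. The value here is an arbitrary placeholder.
    return log_glr(before, after) < threshold

Because separate models fit each side at least as well as one pooled model, the log ratio is at most zero; the more negative it is, the less similar the two sides, which is why the test of claim 6 fires on values below a threshold.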
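Claims 7-10 restrict the decoder to a simplified corpus of approximately seven broad phone classes rather than a full phone set. One concrete reading of claims 8 and 9 as a seven-way partition (an assumption; the claims leave the exact enumeration open):

from enum import Enum

class PhoneClass(Enum):
    """A possible seven-way simplified corpus of broad phone classes."""
    VOWEL_OR_NASAL = 1       # vowels and nasals (claim 8)
    FRICATIVE = 2            # fricatives (claim 8)
    OBSTRUENT = 3            # remaining obstruents (claim 8)
    MUSIC = 4                # non-speech classes of claim 9
    LAUGHTER = 5
    BREATH_OR_LIP_SMACK = 6
    SILENCE = 7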
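Finally, the components recited in claims 18-23 (a segmentation component, a phone classification decode component, and a speaker change detection component) compose as sketched below. Here decode_phone_boundaries and cepstra are hypothetical stand-ins for a phone-class decoder and a cepstral front end; neither is implemented above, and the claims do not prescribe their interfaces.

def detect_speaker_changes(samples, rate, decode_phone_boundaries, cepstra):
    """Report positions in the stream where a speaker change is detected.

    decode_phone_boundaries(interval, rate): sample offsets of boundaries
        between phone classes within the interval (hypothetical helper).
    cepstra(audio, rate): (num_frames, num_coeffs) array of cepstral
        vectors (hypothetical helper).
    """
    changes = []
    for start, interval in segment(samples, rate):
        for b in decode_phone_boundaries(interval, rate):
            before = cepstra(interval[:b], rate)   # first portion, prior to the boundary
            after = cepstra(interval[b:], rate)    # second portion, after the boundary
            if is_speaker_change(before, after):
                changes.append(start + b)          # offset in the original stream
    return changes

Because the intervals overlap, the same change point may be detected in adjacent intervals; a deployment would merge nearby detections, which this sketch omits.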
US10/685,586 2002-10-17 2003-10-16 Systems and methods for speaker change detection Abandoned US20040204939A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/685,586 US20040204939A1 (en) 2002-10-17 2003-10-16 Systems and methods for speaker change detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US41921402P 2002-10-17 2002-10-17
US10/685,586 US20040204939A1 (en) 2002-10-17 2003-10-16 Systems and methods for speaker change detection

Publications (1)

Publication Number Publication Date
US20040204939A1 true US20040204939A1 (en) 2004-10-14

Family

ID=32110223

Family Applications (9)

Application Number Title Priority Date Filing Date
US10/685,585 Active 2026-01-10 US7424427B2 (en) 2002-10-17 2003-10-16 Systems and methods for classifying audio into broad phoneme classes
US10/685,478 Abandoned US20040083104A1 (en) 2002-10-17 2003-10-16 Systems and methods for providing interactive speaker identification training
US10/685,403 Abandoned US20040083090A1 (en) 2002-10-17 2003-10-16 Manager for integrating language technology components
US10/685,586 Abandoned US20040204939A1 (en) 2002-10-17 2003-10-16 Systems and methods for speaker change detection
US10/685,445 Abandoned US20040138894A1 (en) 2002-10-17 2003-10-16 Speech transcription tool for efficient speech transcription
US10/685,479 Abandoned US20040163034A1 (en) 2002-10-17 2003-10-16 Systems and methods for labeling clusters of documents
US10/685,566 Abandoned US20040176946A1 (en) 2002-10-17 2003-10-16 Pronunciation symbols based on the orthographic lexicon of a language
US10/685,565 Active - Reinstated 2026-04-05 US7292977B2 (en) 2002-10-17 2003-10-16 Systems and methods for providing online fast speaker adaptation in speech recognition
US10/685,410 Expired - Fee Related US7389229B2 (en) 2002-10-17 2003-10-16 Unified clustering tree

Family Applications Before (3)

Application Number Title Priority Date Filing Date
US10/685,585 Active 2026-01-10 US7424427B2 (en) 2002-10-17 2003-10-16 Systems and methods for classifying audio into broad phoneme classes
US10/685,478 Abandoned US20040083104A1 (en) 2002-10-17 2003-10-16 Systems and methods for providing interactive speaker identification training
US10/685,403 Abandoned US20040083090A1 (en) 2002-10-17 2003-10-16 Manager for integrating language technology components

Family Applications After (5)

Application Number Title Priority Date Filing Date
US10/685,445 Abandoned US20040138894A1 (en) 2002-10-17 2003-10-16 Speech transcription tool for efficient speech transcription
US10/685,479 Abandoned US20040163034A1 (en) 2002-10-17 2003-10-16 Systems and methods for labeling clusters of documents
US10/685,566 Abandoned US20040176946A1 (en) 2002-10-17 2003-10-16 Pronunciation symbols based on the orthographic lexicon of a language
US10/685,565 Active - Reinstated 2026-04-05 US7292977B2 (en) 2002-10-17 2003-10-16 Systems and methods for providing online fast speaker adaptation in speech recognition
US10/685,410 Expired - Fee Related US7389229B2 (en) 2002-10-17 2003-10-16 Unified clustering tree

Country Status (1)

Country Link
US (9) US7424427B2 (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182627A1 (en) * 2004-01-14 2005-08-18 Izuru Tanaka Audio signal processing apparatus and audio signal processing method
US20060058998A1 (en) * 2004-09-16 2006-03-16 Kabushiki Kaisha Toshiba Indexing apparatus and indexing method
WO2007061947A2 (en) * 2005-11-18 2007-05-31 Blacklidge Emulsions, Inc. Method for bonding prepared substrates for roadways using a low-tracking asphalt emulsion coating
US20080046241A1 (en) * 2006-02-20 2008-02-21 Andrew Osburn Method and system for detecting speaker change in a voice transaction
US20080172227A1 (en) * 2004-01-13 2008-07-17 International Business Machines Corporation Differential Dynamic Content Delivery With Text Display In Dependence Upon Simultaneous Speech
US20120035927A1 (en) * 2010-08-09 2012-02-09 Keiichi Yamada Information Processing Apparatus, Information Processing Method, and Program
US20120197643A1 (en) * 2011-01-27 2012-08-02 General Motors Llc Mapping obstruent speech energy to lower frequencies
US20150127348A1 (en) * 2013-11-01 2015-05-07 Adobe Systems Incorporated Document distribution and interaction
US9313336B2 (en) 2011-07-21 2016-04-12 Nuance Communications, Inc. Systems and methods for processing audio signals captured using microphones of multiple devices
CN105765654A (en) * 2013-11-28 2016-07-13 弗劳恩霍夫应用研究促进协会 Hearing assistance device with fundamental frequency modification
US20160247520A1 (en) * 2015-02-25 2016-08-25 Kabushiki Kaisha Toshiba Electronic apparatus, method, and program
US9432368B1 (en) 2015-02-19 2016-08-30 Adobe Systems Incorporated Document distribution and interaction
US9531545B2 (en) 2014-11-24 2016-12-27 Adobe Systems Incorporated Tracking and notification of fulfillment events
US9544149B2 (en) 2013-12-16 2017-01-10 Adobe Systems Incorporated Automatic E-signatures in response to conditions and/or events
US20170061987A1 (en) * 2015-08-28 2017-03-02 Kabushiki Kaisha Toshiba Electronic device and method
US9626653B2 (en) 2015-09-21 2017-04-18 Adobe Systems Incorporated Document distribution and interaction with delegation of signature authority
US9703982B2 (en) 2014-11-06 2017-07-11 Adobe Systems Incorporated Document distribution and interaction
US20170358293A1 (en) * 2016-06-10 2017-12-14 Google Inc. Predicting pronunciations with word stress
US9935777B2 (en) 2015-08-31 2018-04-03 Adobe Systems Incorporated Electronic signature framework with enhanced security
US20180108343A1 (en) * 2016-10-14 2018-04-19 Soundhound, Inc. Virtual assistant configured by selection of wake-up phrase
US10134060B2 (en) 2007-02-06 2018-11-20 Vb Assets, Llc System and method for delivering targeted advertisements and/or providing natural language processing based on advertisements
WO2018212953A1 (en) * 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
US10273637B2 (en) 2010-02-24 2019-04-30 Blacklidge Emulsions, Inc. Hot applied tack coat
US10297249B2 (en) * 2006-10-16 2019-05-21 Vb Assets, Llc System and method for a cooperative conversational voice user interface
US10347215B2 (en) 2016-05-27 2019-07-09 Adobe Inc. Multi-device electronic signature framework
US10430863B2 (en) 2014-09-16 2019-10-01 Vb Assets, Llc Voice commerce
US10503919B2 (en) 2017-04-10 2019-12-10 Adobe Inc. Electronic signature framework with keystroke biometric authentication
US10553216B2 (en) 2008-05-27 2020-02-04 Oracle International Corporation System and method for an integrated, multi-modal, multi-device natural language voice services environment
US10553213B2 (en) 2009-02-20 2020-02-04 Oracle International Corporation System and method for processing multi-modal device interactions in a natural language voice services environment
US10770077B2 (en) 2015-09-14 2020-09-08 Toshiba Client Solutions CO., LTD. Electronic device and method
US20210312944A1 (en) * 2018-08-15 2021-10-07 Nippon Telegraph And Telephone Corporation End-of-talk prediction device, end-of-talk prediction method, and non-transitory computer readable recording medium
US20210366479A1 (en) * 2020-05-21 2021-11-25 Orcam Technologies Ltd. Systems and methods for emphasizing a user's name

Families Citing this family (136)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002092533A1 (en) * 2001-05-16 2002-11-21 E.I. Du Pont De Nemours And Company Dielectric composition with reduced resistance
WO2004029773A2 (en) * 2002-09-27 2004-04-08 Callminer, Inc. Software for statistical analysis of speech
WO2004090870A1 (en) * 2003-04-04 2004-10-21 Kabushiki Kaisha Toshiba Method and apparatus for encoding or decoding wide-band audio
US8923838B1 (en) 2004-08-19 2014-12-30 Nuance Communications, Inc. System, method and computer program product for activating a cellular phone account
US7956905B2 (en) * 2005-02-28 2011-06-07 Fujifilm Corporation Titling apparatus, a titling method, and a machine readable medium storing thereon a computer program for titling
GB0511307D0 (en) * 2005-06-03 2005-07-13 South Manchester University Ho A method for generating output data
US7382933B2 (en) * 2005-08-24 2008-06-03 International Business Machines Corporation System and method for semantic video segmentation based on joint audiovisual and text analysis
EP1922720B1 (en) 2005-08-26 2017-06-21 Nuance Communications Austria GmbH System and method for synchronizing sound and manually transcribed text
US7801893B2 (en) * 2005-09-30 2010-09-21 Iac Search & Media, Inc. Similarity detection and clustering of images
US20070094270A1 (en) * 2005-10-21 2007-04-26 Callminer, Inc. Method and apparatus for the processing of heterogeneous units of work
US20070094023A1 (en) * 2005-10-21 2007-04-26 Callminer, Inc. Method and apparatus for processing heterogeneous units of work
KR100755677B1 (en) * 2005-11-02 2007-09-05 삼성전자주식회사 Apparatus and method for dialogue speech recognition using topic detection
US8756057B2 (en) * 2005-11-02 2014-06-17 Nuance Communications, Inc. System and method using feedback speech analysis for improving speaking ability
US20070129943A1 (en) * 2005-12-06 2007-06-07 Microsoft Corporation Speech recognition using adaptation and prior knowledge
US8996592B2 (en) * 2006-06-26 2015-03-31 Scenera Technologies, Llc Methods, systems, and computer program products for identifying a container associated with a plurality of files
US20080004876A1 (en) * 2006-06-30 2008-01-03 Chuang He Non-enrolled continuous dictation
US20080051916A1 (en) * 2006-08-28 2008-02-28 Arcadyan Technology Corporation Method and apparatus for recording streamed audio
KR100826875B1 (en) * 2006-09-08 2008-05-06 한국전자통신연구원 On-line speaker recognition method and apparatus for thereof
US20080104066A1 (en) * 2006-10-27 2008-05-01 Yahoo! Inc. Validating segmentation criteria
US7272558B1 (en) 2006-12-01 2007-09-18 Coveo Solutions Inc. Speech recognition training method for audio and video file indexing on a search engine
US20080154579A1 (en) * 2006-12-21 2008-06-26 Krishna Kummamuru Method of analyzing conversational transcripts
US8386254B2 (en) * 2007-05-04 2013-02-26 Nuance Communications, Inc. Multi-class constrained maximum likelihood linear regression
US20090043797A1 (en) * 2007-07-27 2009-02-12 Sparkip, Inc. System And Methods For Clustering Large Database of Documents
ATE457511T1 (en) * 2007-10-10 2010-02-15 Harman Becker Automotive Sys SPEAKER RECOGNITION
JP4405542B2 (en) * 2007-10-24 2010-01-27 株式会社東芝 Apparatus, method and program for clustering phoneme models
US9386154B2 (en) 2007-12-21 2016-07-05 Nuance Communications, Inc. System, method and software program for enabling communications between customer service agents and users of communication devices
JPWO2009122779A1 (en) * 2008-04-03 2011-07-28 日本電気株式会社 Text data processing apparatus, method and program
US9020816B2 (en) 2008-08-14 2015-04-28 21Ct, Inc. Hidden markov model for speech processing with training method
CA2680304C (en) * 2008-09-25 2017-08-22 Multimodal Technologies, Inc. Decoding-time prediction of non-verbalized tokens
US8458105B2 (en) 2009-02-12 2013-06-04 Decisive Analytics Corporation Method and apparatus for analyzing and interrelating data
US8301446B2 (en) * 2009-03-30 2012-10-30 Adacel Systems, Inc. System and method for training an acoustic model with reduced feature space variation
US8412525B2 (en) * 2009-04-30 2013-04-02 Microsoft Corporation Noise robust speech classifier ensemble
US8635223B2 (en) 2009-07-28 2014-01-21 Fti Consulting, Inc. System and method for providing a classification suggestion for electronically stored information
EP2471009A1 (en) 2009-08-24 2012-07-04 FTI Technology LLC Generating a reference set for use during document review
US8554562B2 (en) * 2009-11-15 2013-10-08 Nuance Communications, Inc. Method and system for speaker diarization
US8983958B2 (en) * 2009-12-21 2015-03-17 Business Objects Software Limited Document indexing based on categorization and prioritization
JP5477635B2 (en) * 2010-02-15 2014-04-23 ソニー株式会社 Information processing apparatus and method, and program
US9305553B2 (en) * 2010-04-28 2016-04-05 William S. Meisel Speech recognition accuracy improvement through speaker categories
US9009040B2 (en) * 2010-05-05 2015-04-14 Cisco Technology, Inc. Training a transcription system
US8391464B1 (en) 2010-06-24 2013-03-05 Nuance Communications, Inc. Customer service system, method, and software program product for responding to queries using natural language understanding
US8630854B2 (en) * 2010-08-31 2014-01-14 Fujitsu Limited System and method for generating videoconference transcriptions
US20120084149A1 (en) * 2010-09-10 2012-04-05 Paolo Gaudiano Methods and systems for online advertising with interactive text clouds
US8791977B2 (en) 2010-10-05 2014-07-29 Fujitsu Limited Method and system for presenting metadata during a videoconference
CN102455997A (en) * 2010-10-27 2012-05-16 鸿富锦精密工业(深圳)有限公司 Component name extraction system and method
KR101172663B1 (en) * 2010-12-31 2012-08-08 엘지전자 주식회사 Mobile terminal and method for grouping application thereof
GB2489489B (en) * 2011-03-30 2013-08-21 Toshiba Res Europ Ltd A speech processing system and method
US9774747B2 (en) * 2011-04-29 2017-09-26 Nexidia Inc. Transcription system
CA2832918C (en) * 2011-06-22 2016-05-10 Rogers Communications Inc. Systems and methods for ranking document clusters
JP5638479B2 (en) * 2011-07-26 2014-12-10 株式会社東芝 Transcription support system and transcription support method
JP2013025299A (en) * 2011-07-26 2013-02-04 Toshiba Corp Transcription support system and transcription support method
JP5404726B2 (en) * 2011-09-26 2014-02-05 株式会社東芝 Information processing apparatus, information processing method, and program
US8433577B2 (en) * 2011-09-27 2013-04-30 Google Inc. Detection of creative works on broadcast media
US20130144414A1 (en) * 2011-12-06 2013-06-06 Cisco Technology, Inc. Method and apparatus for discovering and labeling speakers in a large and growing collection of videos with minimal user effort
US9002848B1 (en) * 2011-12-27 2015-04-07 Google Inc. Automatic incremental labeling of document clusters
JP2013161205A (en) * 2012-02-03 2013-08-19 Sony Corp Information processing device, information processing method and program
US20130266127A1 (en) 2012-04-10 2013-10-10 Raytheon Bbn Technologies Corp System and method for removing sensitive data from a recording
US20140365221A1 (en) * 2012-07-31 2014-12-11 Novospeech Ltd. Method and apparatus for speech recognition
US8676590B1 (en) 2012-09-26 2014-03-18 Google Inc. Web-based audio transcription tool
US20140136204A1 (en) * 2012-11-13 2014-05-15 GM Global Technology Operations LLC Methods and systems for speech systems
US20140207786A1 (en) * 2013-01-22 2014-07-24 Equivio Ltd. System and methods for computerized information governance of electronic documents
US9865266B2 (en) * 2013-02-25 2018-01-09 Nuance Communications, Inc. Method and apparatus for automated speaker parameters adaptation in a deployed speaker verification system
US9330167B1 (en) * 2013-05-13 2016-05-03 Groupon, Inc. Method, apparatus, and computer program product for classification and tagging of textual data
US10463269B2 (en) 2013-09-25 2019-11-05 Bardy Diagnostics, Inc. System and method for machine-learning-based atrial fibrillation detection
US10888239B2 (en) 2013-09-25 2021-01-12 Bardy Diagnostics, Inc. Remote interfacing electrocardiography patch
US9433367B2 (en) 2013-09-25 2016-09-06 Bardy Diagnostics, Inc. Remote interfacing of extended wear electrocardiography and physiological sensor monitor
US9345414B1 (en) 2013-09-25 2016-05-24 Bardy Diagnostics, Inc. Method for providing dynamic gain over electrocardiographic data with the aid of a digital computer
US9408545B2 (en) 2013-09-25 2016-08-09 Bardy Diagnostics, Inc. Method for efficiently encoding and compressing ECG data optimized for use in an ambulatory ECG monitor
US9737224B2 (en) 2013-09-25 2017-08-22 Bardy Diagnostics, Inc. Event alerting through actigraphy embedded within electrocardiographic data
US10433751B2 (en) 2013-09-25 2019-10-08 Bardy Diagnostics, Inc. System and method for facilitating a cardiac rhythm disorder diagnosis based on subcutaneous cardiac monitoring data
US9717433B2 (en) 2013-09-25 2017-08-01 Bardy Diagnostics, Inc. Ambulatory electrocardiography monitoring patch optimized for capturing low amplitude cardiac action potential propagation
US9364155B2 (en) 2013-09-25 2016-06-14 Bardy Diagnostics, Inc. Self-contained personal air flow sensing monitor
US9775536B2 (en) 2013-09-25 2017-10-03 Bardy Diagnostics, Inc. Method for constructing a stress-pliant physiological electrode assembly
US9700227B2 (en) 2013-09-25 2017-07-11 Bardy Diagnostics, Inc. Ambulatory electrocardiography monitoring patch optimized for capturing low amplitude cardiac action potential propagation
US9619660B1 (en) 2013-09-25 2017-04-11 Bardy Diagnostics, Inc. Computer-implemented system for secure physiological data collection and processing
US10433748B2 (en) 2013-09-25 2019-10-08 Bardy Diagnostics, Inc. Extended wear electrocardiography and physiological sensor monitor
US10799137B2 (en) 2013-09-25 2020-10-13 Bardy Diagnostics, Inc. System and method for facilitating a cardiac rhythm disorder diagnosis with the aid of a digital computer
US10736529B2 (en) 2013-09-25 2020-08-11 Bardy Diagnostics, Inc. Subcutaneous insertable electrocardiography monitor
US10806360B2 (en) 2013-09-25 2020-10-20 Bardy Diagnostics, Inc. Extended wear ambulatory electrocardiography and physiological sensor monitor
US20190167139A1 (en) 2017-12-05 2019-06-06 Gust H. Bardy Subcutaneous P-Wave Centric Insertable Cardiac Monitor For Long Term Electrocardiographic Monitoring
WO2015048194A1 (en) 2013-09-25 2015-04-02 Bardy Diagnostics, Inc. Self-contained personal air flow sensing monitor
US9504423B1 (en) 2015-10-05 2016-11-29 Bardy Diagnostics, Inc. Method for addressing medical conditions through a wearable health monitor with the aid of a digital computer
US10820801B2 (en) 2013-09-25 2020-11-03 Bardy Diagnostics, Inc. Electrocardiography monitor configured for self-optimizing ECG data compression
US9730593B2 (en) 2013-09-25 2017-08-15 Bardy Diagnostics, Inc. Extended wear ambulatory electrocardiography and physiological sensor monitor
US10251576B2 (en) 2013-09-25 2019-04-09 Bardy Diagnostics, Inc. System and method for ECG data classification for use in facilitating diagnosis of cardiac rhythm disorders with the aid of a digital computer
US11723575B2 (en) 2013-09-25 2023-08-15 Bardy Diagnostics, Inc. Electrocardiography patch
US11213237B2 (en) 2013-09-25 2022-01-04 Bardy Diagnostics, Inc. System and method for secure cloud-based physiological data processing and delivery
US10667711B1 (en) 2013-09-25 2020-06-02 Bardy Diagnostics, Inc. Contact-activated extended wear electrocardiography and physiological sensor monitor recorder
US9717432B2 (en) 2013-09-25 2017-08-01 Bardy Diagnostics, Inc. Extended wear electrocardiography patch using interlaced wire electrodes
US9655537B2 (en) 2013-09-25 2017-05-23 Bardy Diagnostics, Inc. Wearable electrocardiography and physiology monitoring ensemble
US10736531B2 (en) 2013-09-25 2020-08-11 Bardy Diagnostics, Inc. Subcutaneous insertable cardiac monitor optimized for long term, low amplitude electrocardiographic data collection
US9655538B2 (en) 2013-09-25 2017-05-23 Bardy Diagnostics, Inc. Self-authenticating electrocardiography monitoring circuit
US10624551B2 (en) 2013-09-25 2020-04-21 Bardy Diagnostics, Inc. Insertable cardiac monitor for use in performing long term electrocardiographic monitoring
US9615763B2 (en) 2013-09-25 2017-04-11 Bardy Diagnostics, Inc. Ambulatory electrocardiography monitor recorder optimized for capturing low amplitude cardiac action potential propagation
US9408551B2 (en) 2013-11-14 2016-08-09 Bardy Diagnostics, Inc. System and method for facilitating diagnosis of cardiac rhythm disorders with the aid of a digital computer
US9495439B2 (en) * 2013-10-08 2016-11-15 Cisco Technology, Inc. Organizing multimedia content
US20150100582A1 (en) * 2013-10-08 2015-04-09 Cisco Technology, Inc. Association of topic labels with digital content
CN104143326B (en) * 2013-12-03 2016-11-02 腾讯科技(深圳)有限公司 A kind of voice command identification method and device
WO2015105994A1 (en) 2014-01-08 2015-07-16 Callminer, Inc. Real-time conversational analytics facility
JP6392012B2 (en) * 2014-07-14 2018-09-19 株式会社東芝 Speech synthesis dictionary creation device, speech synthesis device, speech synthesis dictionary creation method, and speech synthesis dictionary creation program
US9728190B2 (en) * 2014-07-25 2017-08-08 International Business Machines Corporation Summarization of audio data
US10447646B2 (en) * 2015-06-15 2019-10-15 International Business Machines Corporation Online communication modeling and analysis
US10068445B2 (en) 2015-06-24 2018-09-04 Google Llc Systems and methods of home-specific sound event detection
US9754593B2 (en) * 2015-11-04 2017-09-05 International Business Machines Corporation Sound envelope deconstruction to identify words and speakers in continuous speech
WO2017106454A1 (en) 2015-12-16 2017-06-22 Dolby Laboratories Licensing Corporation Suppression of breath in audio signals
AU2017274558B2 (en) 2016-06-02 2021-11-11 Nuix North America Inc. Analyzing clusters of coded documents
US20180232623A1 (en) * 2017-02-10 2018-08-16 International Business Machines Corporation Techniques for answering questions based on semantic distances between subjects
GB2578386B (en) 2017-06-27 2021-12-01 Cirrus Logic Int Semiconductor Ltd Detection of replay attack
GB2563953A (en) 2017-06-28 2019-01-02 Cirrus Logic Int Semiconductor Ltd Detection of replay attack
GB201713697D0 (en) 2017-06-28 2017-10-11 Cirrus Logic Int Semiconductor Ltd Magnetic detection of replay attack
GB201801526D0 (en) 2017-07-07 2018-03-14 Cirrus Logic Int Semiconductor Ltd Methods, apparatus and systems for authentication
GB201801528D0 (en) 2017-07-07 2018-03-14 Cirrus Logic Int Semiconductor Ltd Method, apparatus and systems for biometric processes
GB201801532D0 (en) 2017-07-07 2018-03-14 Cirrus Logic Int Semiconductor Ltd Methods, apparatus and systems for audio playback
GB201801527D0 (en) 2017-07-07 2018-03-14 Cirrus Logic Int Semiconductor Ltd Method, apparatus and systems for biometric processes
GB2567503A (en) * 2017-10-13 2019-04-17 Cirrus Logic Int Semiconductor Ltd Analysing speech signals
GB201801664D0 (en) 2017-10-13 2018-03-21 Cirrus Logic Int Semiconductor Ltd Detection of liveness
GB201804843D0 (en) 2017-11-14 2018-05-09 Cirrus Logic Int Semiconductor Ltd Detection of replay attack
GB201801661D0 (en) 2017-10-13 2018-03-21 Cirrus Logic International Uk Ltd Detection of liveness
GB201801659D0 (en) 2017-11-14 2018-03-21 Cirrus Logic Int Semiconductor Ltd Detection of loudspeaker playback
TWI625680B (en) 2017-12-15 2018-06-01 財團法人工業技術研究院 Method and device for recognizing facial expressions
US11735189B2 (en) 2018-01-23 2023-08-22 Cirrus Logic, Inc. Speaker identification
US11264037B2 (en) 2018-01-23 2022-03-01 Cirrus Logic, Inc. Speaker identification
US11475899B2 (en) 2018-01-23 2022-10-18 Cirrus Logic, Inc. Speaker identification
CN109300486B (en) * 2018-07-30 2021-06-25 四川大学 PICGTFs and SSMC enhanced cleft palate speech pharynx fricative automatic identification method
US10692490B2 (en) 2018-07-31 2020-06-23 Cirrus Logic, Inc. Detection of replay attack
US10915614B2 (en) 2018-08-31 2021-02-09 Cirrus Logic, Inc. Biometric authentication
US11037574B2 (en) 2018-09-05 2021-06-15 Cirrus Logic, Inc. Speaker recognition and speaker change detection
US11096579B2 (en) 2019-07-03 2021-08-24 Bardy Diagnostics, Inc. System and method for remote ECG data streaming in real-time
US11116451B2 (en) 2019-07-03 2021-09-14 Bardy Diagnostics, Inc. Subcutaneous P-wave centric insertable cardiac monitor with energy harvesting capabilities
US11696681B2 (en) 2019-07-03 2023-07-11 Bardy Diagnostics Inc. Configurable hardware platform for physiological monitoring of a living body
US11410642B2 (en) * 2019-08-16 2022-08-09 Soundhound, Inc. Method and system using phoneme embedding
US11354920B2 (en) 2019-10-12 2022-06-07 International Business Machines Corporation Updating and implementing a document from an audio proceeding
US11404049B2 (en) * 2019-12-09 2022-08-02 Microsoft Technology Licensing, Llc Interactive augmentation and integration of real-time speech-to-text
US11862168B1 (en) * 2020-03-30 2024-01-02 Amazon Technologies, Inc. Speaker disambiguation and transcription from multiple audio feeds
US11373657B2 (en) * 2020-05-01 2022-06-28 Raytheon Applied Signal Technology, Inc. System and method for speaker identification in audio data
US11315545B2 (en) 2020-07-09 2022-04-26 Raytheon Applied Signal Technology, Inc. System and method for language identification in audio data
CN113284508B (en) * 2021-07-21 2021-11-09 中国科学院自动化研究所 Hierarchical differentiation based generated audio detection system

Family Cites Families (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0693221B2 (en) 1985-06-12 1994-11-16 株式会社日立製作所 Voice input device
US4908868A (en) * 1989-02-21 1990-03-13 Mctaggart James E Phase polarity test instrument and method
JP2524472B2 (en) * 1992-09-21 1996-08-14 インターナショナル・ビジネス・マシーンズ・コーポレイション How to train a telephone line based speech recognition system
US5689641A (en) * 1993-10-01 1997-11-18 Vicor, Inc. Multimedia collaboration system arrangement for routing compressed AV signal through a participant site without decompressing the AV signal
GB2285895A (en) 1994-01-19 1995-07-26 Ibm Audio conferencing system which generates a set of minutes
US5614940A (en) * 1994-10-21 1997-03-25 Intel Corporation Method and apparatus for providing broadcast information with indexing
US5729656A (en) 1994-11-30 1998-03-17 International Business Machines Corporation Reduction of search space in speech recognition using phone boundaries and phone ranking
US5638487A (en) * 1994-12-30 1997-06-10 Purespeech, Inc. Automatic speech recognition
US6026388A (en) * 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
EP0823112B1 (en) * 1996-02-27 2002-05-02 Koninklijke Philips Electronics N.V. Method and apparatus for automatic speech segmentation into phoneme-like units
US5778187A (en) 1996-05-09 1998-07-07 Netcast Communications Corp. Multicasting method and apparatus
US5996022A (en) 1996-06-03 1999-11-30 Webtv Networks, Inc. Transcoding data in a proxy computer prior to transmitting the audio data to a client
US5806032A (en) * 1996-06-14 1998-09-08 Lucent Technologies Inc. Compilation of weighted finite-state transducers from decision trees
US5897614A (en) * 1996-12-20 1999-04-27 International Business Machines Corporation Method and apparatus for sibilant classification in a speech recognition system
US6732183B1 (en) * 1996-12-31 2004-05-04 Broadware Technologies, Inc. Video and audio streaming for multiple users
JP2991287B2 (en) * 1997-01-28 1999-12-20 日本電気株式会社 Suppression standard pattern selection type speaker recognition device
CA2271745A1 (en) 1997-10-01 1999-04-08 Pierre David Wellner Method and apparatus for storing and retrieving labeled interval data for multimedia recordings
SE511584C2 (en) 1998-01-15 1999-10-25 Ericsson Telefon Ab L M information Routing
US6327343B1 (en) 1998-01-16 2001-12-04 International Business Machines Corporation System and methods for automatic call and data transfer processing
US6073096A (en) * 1998-02-04 2000-06-06 International Business Machines Corporation Speaker adaptation system and method based on class-specific pre-clustering training speakers
US7257528B1 (en) 1998-02-13 2007-08-14 Zi Corporation Of Canada, Inc. Method and apparatus for Chinese character text input
US6112172A (en) 1998-03-31 2000-08-29 Dragon Systems, Inc. Interactive searching
US6076053A (en) * 1998-05-21 2000-06-13 Lucent Technologies Inc. Methods and apparatus for discriminative training and adaptation of pronunciation networks
US6243680B1 (en) * 1998-06-15 2001-06-05 Nortel Networks Limited Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6246983B1 (en) * 1998-08-05 2001-06-12 Matsushita Electric Corporation Of America Text-to-speech e-mail reader with multi-modal reply processor
US6347295B1 (en) 1998-10-26 2002-02-12 Compaq Computer Corporation Computer method and apparatus for grapheme-to-phoneme rule-set-generation
DE19912405A1 (en) * 1999-03-19 2000-09-21 Philips Corp Intellectual Pty Determination of a regression class tree structure for speech recognizers
DE60038674T2 (en) 1999-03-30 2009-06-10 TiVo, Inc., Alviso DATA STORAGE MANAGEMENT AND PROGRAM FLOW SYSTEM
US6711585B1 (en) * 1999-06-15 2004-03-23 Kanisa Inc. System and method for implementing a knowledge management system
IE990799A1 (en) 1999-08-20 2001-03-07 Digitake Software Systems Ltd "An audio processing system"
US6711541B1 (en) * 1999-09-07 2004-03-23 Matsushita Electric Industrial Co., Ltd. Technique for developing discriminative sound units for speech recognition and allophone modeling
US6571208B1 (en) * 1999-11-29 2003-05-27 Matsushita Electric Industrial Co., Ltd. Context-dependent acoustic models for medium and large vocabulary speech recognition with eigenvoice training
EP1148505A3 (en) * 2000-04-21 2002-03-27 Matsushita Electric Industrial Co., Ltd. Data playback apparatus
US6748356B1 (en) * 2000-06-07 2004-06-08 International Business Machines Corporation Methods and apparatus for identifying unknown speakers using a hierarchical tree structure
US6931376B2 (en) * 2000-07-20 2005-08-16 Microsoft Corporation Speech-related event notification system
WO2002010887A2 (en) 2000-07-28 2002-02-07 Jan Pathuel Method and system of securing data and systems
WO2002029614A1 (en) 2000-09-30 2002-04-11 Intel Corporation Method and system to scale down a decision tree-based hidden markov model (hmm) for speech recognition
WO2002029612A1 (en) 2000-09-30 2002-04-11 Intel Corporation Method and system for generating and searching an optimal maximum likelihood decision tree for hidden markov model (hmm) based speech recognition
US7221663B2 (en) 2001-12-31 2007-05-22 Polycom, Inc. Method and apparatus for wideband conferencing
US6778979B2 (en) 2001-08-13 2004-08-17 Xerox Corporation System for automatically generating queries
US7668816B2 (en) * 2002-06-11 2010-02-23 Microsoft Corporation Dynamically updated quick searches and strategies
EP1422692A3 (en) 2002-11-22 2004-07-14 ScanSoft, Inc. Automatic insertion of non-verbalized punctuation in speech recognition

Patent Citations (86)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4908866A (en) * 1985-02-04 1990-03-13 Eric Goldwasser Speech transcribing system
US4879648A (en) * 1986-09-19 1989-11-07 Nancy P. Cochran Search system which continuously displays search terms during scrolling and selections of individually displayed data sets
US6978277B2 (en) * 1989-10-26 2005-12-20 Encyclopaedia Britannica, Inc. Multimedia search system
US5418716A (en) * 1990-07-26 1995-05-23 Nec Corporation System for recognizing sentence patterns and a system for recognizing sentence patterns and grammatical cases
US5404295A (en) * 1990-08-16 1995-04-04 Katz; Boris Method and apparatus for utilizing annotations to facilitate computer retrieval of database material
US5317732A (en) * 1991-04-26 1994-05-31 Commodore Electronics Limited System for relocating a multimedia presentation on a different platform by extracting a resource map in order to remap and relocate resources
US5875108A (en) * 1991-12-23 1999-02-23 Hoffberg; Steven M. Ergonomic man-machine interface incorporating adaptive pattern recognition based control system
US5544257A (en) * 1992-01-08 1996-08-06 International Business Machines Corporation Continuous parameter hidden Markov model approach to automatic handwriting recognition
US5787198A (en) * 1992-11-24 1998-07-28 Lucent Technologies Inc. Text recognition using two-dimensional stochastic models
US5572728A (en) * 1993-12-24 1996-11-05 Hitachi, Ltd. Conference multimedia summary support system and method
US5752021A (en) * 1994-05-24 1998-05-12 Fuji Xerox Co., Ltd. Document database management apparatus capable of conversion between retrieval formulae for different schemata
US5613032A (en) * 1994-09-02 1997-03-18 Bell Communications Research, Inc. System and method for recording, playing back and searching multimedia events wherein video, audio and text can be searched and retrieved
US5757960A (en) * 1994-09-30 1998-05-26 Murdock; Michael Chase Method and system for extracting features from handwritten text
US5768607A (en) * 1994-09-30 1998-06-16 Intel Corporation Method and apparatus for freehand annotation and drawings incorporating sound and for compressing and synchronizing sound
US5777614A (en) * 1994-10-14 1998-07-07 Hitachi, Ltd. Editing support system including an interactive interface
US5835667A (en) * 1994-10-14 1998-11-10 Carnegie Mellon University Method and apparatus for creating a searchable digital video library and a system and method of using such a library
US6029195A (en) * 1994-11-29 2000-02-22 Herz; Frederick S. M. System for customized electronic identification of desirable objects
US5715367A (en) * 1995-01-23 1998-02-03 Dragon Systems, Inc. Apparatuses and methods for developing and using models for speech recognition
US5684924A (en) * 1995-05-19 1997-11-04 Kurzweil Applied Intelligence, Inc. User adaptable speech recognition system
US5559875A (en) * 1995-07-31 1996-09-24 Latitude Communications Method and apparatus for recording and retrieval of audio conferences
US6151598A (en) * 1995-08-14 2000-11-21 Shaw; Venson M. Digital dictionary with a communication system for the creating, updating, editing, storing, maintaining, referencing, and managing the digital dictionary
US6006221A (en) * 1995-08-16 1999-12-21 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US5963940A (en) * 1995-08-16 1999-10-05 Syracuse University Natural language information retrieval system and method
US6332147B1 (en) * 1995-11-03 2001-12-18 Xerox Corporation Computer controlled display system using a graphical replay device to control playback of temporal data representing collaborative activities
US5960447A (en) * 1995-11-13 1999-09-28 Holt; Douglas Word tagging and editing system for speech recognition
US20010051984A1 (en) * 1996-01-30 2001-12-13 Toshihiko Fukasawa Coordinative work environment construction system, method and medium therefor
US6067517A (en) * 1996-02-02 2000-05-23 International Business Machines Corporation Transcription of speech data with segments from acoustically dissimilar environments
US5862259A (en) * 1996-03-27 1999-01-19 Caere Corporation Pattern recognition employing arbitrary segmentation and compound probabilistic evaluation
US6024571A (en) * 1996-04-25 2000-02-15 Renegar; Janet Elaine Foreign language communication system/device and learning aid
US6169789B1 (en) * 1996-12-16 2001-01-02 Sanjay K. Rao Intelligent keyboard system
US6185531B1 (en) * 1997-01-09 2001-02-06 Gte Internetworking Incorporated Topic indexing method
US6088669A (en) * 1997-01-28 2000-07-11 International Business Machines, Corporation Speech recognition with attempted speaker recognition for speaker model prefetching or alternative speech modeling
US6029124A (en) * 1997-02-21 2000-02-22 Dragon Systems, Inc. Sequential, nonparametric speech recognition and speaker identification
US6877134B1 (en) * 1997-08-14 2005-04-05 Virage, Inc. Integrated data and real-time metadata capture system and method
US6463444B1 (en) * 1997-08-14 2002-10-08 Virage, Inc. Video cataloger system with extensibility
US6567980B1 (en) * 1997-08-14 2003-05-20 Virage, Inc. Video cataloger system with hyperlinked output
US6052657A (en) * 1997-09-09 2000-04-18 Dragon Systems, Inc. Text segmentation and identification of topic using language models
US6317716B1 (en) * 1997-09-19 2001-11-13 Massachusetts Institute Of Technology Automatic cueing of speech
US6961954B1 (en) * 1997-10-27 2005-11-01 The Mitre Corporation Automated segmentation, information extraction, summarization, and presentation of broadcast news
US6064963A (en) * 1997-12-17 2000-05-16 Opus Telecom, L.L.C. Automatic key word or phrase speech recognition for the corrections industry
US20030051214A1 (en) * 1997-12-22 2003-03-13 Ricoh Company, Ltd. Techniques for annotating portions of a document relevant to concepts of interest
US5970473A (en) * 1997-12-31 1999-10-19 At&T Corp. Video communication device providing in-home catalog services
US6602300B2 (en) * 1998-02-03 2003-08-05 Fujitsu Limited Apparatus and method for retrieving data from a document database
US6718303B2 (en) * 1998-05-13 2004-04-06 International Business Machines Corporation Apparatus and method for automatically generating punctuation marks in continuous speech recognition
US6067514A (en) * 1998-06-23 2000-05-23 International Business Machines Corporation Method for automatically punctuating a speech utterance in a continuous speech recognition system
US6373985B1 (en) * 1998-08-12 2002-04-16 Lucent Technologies, Inc. E-mail signature block analysis
US6381640B1 (en) * 1998-09-11 2002-04-30 Genesys Telecommunications Laboratories, Inc. Method and apparatus for automated personalization and presentation of workload assignments to agents within a multimedia communication center
US6360237B1 (en) * 1998-10-05 2002-03-19 Lernout & Hauspie Speech Products N.V. Method and system for performing text edits during audio recording playback
US6161087A (en) * 1998-10-05 2000-12-12 Lernout & Hauspie Speech Products N.V. Speech-recognition-assisted selective suppression of silent and filled speech pauses during playback of an audio recording
US6332139B1 (en) * 1998-11-09 2001-12-18 Mega Chips Corporation Information communication system
US6728673B2 (en) * 1998-12-17 2004-04-27 Matsushita Electric Industrial Co., Ltd Method and apparatus for retrieving a video and audio scene using an index generated by speech recognition
US6611803B1 (en) * 1998-12-17 2003-08-26 Matsushita Electric Industrial Co., Ltd. Method and apparatus for retrieving a video and audio scene using an index generated by speech recognition
US6654735B1 (en) * 1999-01-08 2003-11-25 International Business Machines Corporation Outbound information analysis for generating user interest profiles and improving user productivity
US6253179B1 (en) * 1999-01-29 2001-06-26 International Business Machines Corporation Method and apparatus for multi-environment speaker verification
US6345252B1 (en) * 1999-04-09 2002-02-05 International Business Machines Corporation Methods and apparatus for retrieving audio information using content and speaker information
US6434520B1 (en) * 1999-04-16 2002-08-13 International Business Machines Corporation System and method for indexing and querying audio archives
US6847961B2 (en) * 1999-06-30 2005-01-25 Silverbrook Research Pty Ltd Method and system for searching information using sensor with identifier
US6219640B1 (en) * 1999-08-06 2001-04-17 International Business Machines Corporation Methods and apparatus for audio-visual speaker recognition and utterance verification
US6778958B1 (en) * 1999-08-30 2004-08-17 International Business Machines Corporation Symbol insertion apparatus and method
US6480826B2 (en) * 1999-08-31 2002-11-12 Accenture Llp System and method for a telephonic emotion detection that provides operator feedback
US6624826B1 (en) * 1999-09-28 2003-09-23 Ricoh Co., Ltd. Method and apparatus for generating visual representations for audio documents
US6792409B2 (en) * 1999-12-20 2004-09-14 Koninklijke Philips Electronics N.V. Synchronous reproduction in a speech recognition system
US7146317B2 (en) * 2000-02-25 2006-12-05 Koninklijke Philips Electronics N.V. Speech recognition device with reference transformation means
US20010026377A1 (en) * 2000-03-21 2001-10-04 Katsumi Ikegami Image display system, image registration terminal device and image reading terminal device used in the image display system
US20020010575A1 (en) * 2000-04-08 2002-01-24 International Business Machines Corporation Method and system for the automatic segmentation of an audio stream into semantic or syntactic units
US20020010916A1 (en) * 2000-05-22 2002-01-24 Compaq Computer Corporation Apparatus and method for controlling rate of playback of audio data
US6337818B1 (en) * 2000-06-20 2002-01-08 Mitsubishi Denki Kabushiki Kaisha Semiconductor memory device having a redundancy construction
US20020049589A1 (en) * 2000-06-28 2002-04-25 Poirier Darrell A. Simultaneous multi-user real-time voice recognition system
US20020059204A1 (en) * 2000-07-28 2002-05-16 Harris Larry R. Distributed search system and method
US6922691B2 (en) * 2000-08-28 2005-07-26 Emotion, Inc. Method and apparatus for digital media management, retrieval, and collaboration
US6604110B1 (en) * 2000-08-31 2003-08-05 Ascential Software, Inc. Automated software code generation from a metadata-based repository
US6647383B1 (en) * 2000-09-01 2003-11-11 Lucent Technologies Inc. System and method for providing interactive dialogue and iterative search functions to find information
US20020184373A1 (en) * 2000-11-01 2002-12-05 International Business Machines Corporation Conversational networking via transport, coding and control conversational protocols
US20050060162A1 (en) * 2000-11-10 2005-03-17 Farhad Mohit Systems and methods for automatic identification and hyperlinking of words or other data items and for information retrieval using hyperlinked words or data items
US20040073444A1 (en) * 2001-01-16 2004-04-15 Li Li Peh Method and apparatus for a financial database structure
US6714911B2 (en) * 2001-01-25 2004-03-30 Harcourt Assessment, Inc. Speech transcription and analysis system and method
US20020133477A1 (en) * 2001-03-05 2002-09-19 Glenn Abel Method for profile-based notice and broadcast of multimedia content
US20030088414A1 (en) * 2001-05-10 2003-05-08 Chao-Shih Huang Background learning of speaker voices
US7171360B2 (en) * 2001-05-10 2007-01-30 Koninklijke Philips Electronics N.V. Background learning of speaker voices
US6973428B2 (en) * 2001-05-24 2005-12-06 International Business Machines Corporation System and method for searching, analyzing and displaying text transcripts of speech after imperfect speech recognition
US6748350B2 (en) * 2001-09-27 2004-06-08 Intel Corporation Method to compensate for stress between heat spreader and thermal interface material
US6708148B2 (en) * 2001-10-12 2004-03-16 Koninklijke Philips Electronics N.V. Correction device to mark parts of a recognized text
US20030093580A1 (en) * 2001-11-09 2003-05-15 Koninklijke Philips Electronics N.V. Method and system for information alerts
US20030167163A1 (en) * 2002-02-22 2003-09-04 Nec Research Institute, Inc. Inferring hierarchical descriptions of a set of documents
US7131117B2 (en) * 2002-09-04 2006-10-31 Sbc Properties, L.P. Method and system for automating the analysis of word frequencies
US6999918B2 (en) * 2002-09-20 2006-02-14 Motorola, Inc. Method and apparatus to facilitate correlating symbols to sounds

Cited By (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8332220B2 (en) * 2004-01-13 2012-12-11 Nuance Communications, Inc. Differential dynamic content delivery with text display in dependence upon simultaneous speech
US8965761B2 (en) * 2004-01-13 2015-02-24 Nuance Communications, Inc. Differential dynamic content delivery with text display in dependence upon simultaneous speech
US8781830B2 (en) * 2004-01-13 2014-07-15 Nuance Communications, Inc. Differential dynamic content delivery with text display in dependence upon simultaneous speech
US20140188469A1 (en) * 2004-01-13 2014-07-03 Nuance Communications, Inc. Differential dynamic content delivery with text display in dependence upon simultaneous speech
US20080172227A1 (en) * 2004-01-13 2008-07-17 International Business Machines Corporation Differential Dynamic Content Delivery With Text Display In Dependence Upon Simultaneous Speech
US20140019129A1 (en) * 2004-01-13 2014-01-16 Nuance Communications, Inc. Differential dynamic content delivery with text display in dependence upon simultaneous speech
US8504364B2 (en) * 2004-01-13 2013-08-06 Nuance Communications, Inc. Differential dynamic content delivery with text display in dependence upon simultaneous speech
US20150206536A1 (en) * 2004-01-13 2015-07-23 Nuance Communications, Inc. Differential dynamic content delivery with text display
US20130013307A1 (en) * 2004-01-13 2013-01-10 Nuance Communications, Inc. Differential dynamic content delivery with text display in dependence upon simultaneous speech
US9691388B2 (en) * 2004-01-13 2017-06-27 Nuance Communications, Inc. Differential dynamic content delivery with text display
US20050182627A1 (en) * 2004-01-14 2005-08-18 Izuru Tanaka Audio signal processing apparatus and audio signal processing method
US20060058998A1 (en) * 2004-09-16 2006-03-16 Kabushiki Kaisha Toshiba Indexing apparatus and indexing method
US20090169901A1 (en) * 2005-11-18 2009-07-02 Blacklidge Emulsions, Inc. Method For Bonding Prepared Substrates For Roadways Using A Low-Tracking Asphalt Emulsion Coating
US7918624B2 (en) 2005-11-18 2011-04-05 Blacklidge Emulsions, Inc. Method for bonding prepared substrates for roadways using a low-tracking asphalt emulsion coating
US7503724B2 (en) * 2005-11-18 2009-03-17 Blacklidge Emulsions, Inc. Method for bonding prepared substrates for roadways using a low-tracking asphalt emulsion coating
WO2007061947A3 (en) * 2005-11-18 2008-09-18 Blacklidge Emulsions Inc Method for bonding prepared substrates for roadways using a low-tracking asphalt emulsion coating
US20070141241A1 (en) * 2005-11-18 2007-06-21 Blacklidge Roy B Method for bonding prepared substrates for roadways using a low-tracking asphalt emulsion coating
WO2007061947A2 (en) * 2005-11-18 2007-05-31 Blacklidge Emulsions, Inc. Method for bonding prepared substrates for roadways using a low-tracking asphalt emulsion coating
US20080046241A1 (en) * 2006-02-20 2008-02-21 Andrew Osburn Method and system for detecting speaker change in a voice transaction
US10515628B2 (en) 2006-10-16 2019-12-24 Vb Assets, Llc System and method for a cooperative conversational voice user interface
US10755699B2 (en) 2006-10-16 2020-08-25 Vb Assets, Llc System and method for a cooperative conversational voice user interface
US10297249B2 (en) * 2006-10-16 2019-05-21 Vb Assets, Llc System and method for a cooperative conversational voice user interface
US10510341B1 (en) 2006-10-16 2019-12-17 Vb Assets, Llc System and method for a cooperative conversational voice user interface
US11222626B2 (en) 2006-10-16 2022-01-11 Vb Assets, Llc System and method for a cooperative conversational voice user interface
US11080758B2 (en) 2007-02-06 2021-08-03 Vb Assets, Llc System and method for delivering targeted advertisements and/or providing natural language processing based on advertisements
US10134060B2 (en) 2007-02-06 2018-11-20 Vb Assets, Llc System and method for delivering targeted advertisements and/or providing natural language processing based on advertisements
US10553216B2 (en) 2008-05-27 2020-02-04 Oracle International Corporation System and method for an integrated, multi-modal, multi-device natural language voice services environment
US10553213B2 (en) 2009-02-20 2020-02-04 Oracle International Corporation System and method for processing multi-modal device interactions in a natural language voice services environment
US10273637B2 (en) 2010-02-24 2019-04-30 Blacklidge Emulsions, Inc. Hot applied tack coat
US20120035927A1 (en) * 2010-08-09 2012-02-09 Keiichi Yamada Information Processing Apparatus, Information Processing Method, and Program
US20120197643A1 (en) * 2011-01-27 2012-08-02 General Motors Llc Mapping obstruent speech energy to lower frequencies
US9313336B2 (en) 2011-07-21 2016-04-12 Nuance Communications, Inc. Systems and methods for processing audio signals captured using microphones of multiple devices
US20150127348A1 (en) * 2013-11-01 2015-05-07 Adobe Systems Incorporated Document distribution and interaction
US9942396B2 (en) * 2013-11-01 2018-04-10 Adobe Systems Incorporated Document distribution and interaction
CN105765654A (en) * 2013-11-28 2016-07-13 弗劳恩霍夫应用研究促进协会 Hearing assistance device with fundamental frequency modification
US9544149B2 (en) 2013-12-16 2017-01-10 Adobe Systems Incorporated Automatic E-signatures in response to conditions and/or events
US10250393B2 (en) 2013-12-16 2019-04-02 Adobe Inc. Automatic E-signatures in response to conditions and/or events
US11087385B2 (en) 2014-09-16 2021-08-10 Vb Assets, Llc Voice commerce
US10430863B2 (en) 2014-09-16 2019-10-01 Vb Assets, Llc Voice commerce
US9703982B2 (en) 2014-11-06 2017-07-11 Adobe Systems Incorporated Document distribution and interaction
US9531545B2 (en) 2014-11-24 2016-12-27 Adobe Systems Incorporated Tracking and notification of fulfillment events
US9432368B1 (en) 2015-02-19 2016-08-30 Adobe Systems Incorporated Document distribution and interaction
US20160247520A1 (en) * 2015-02-25 2016-08-25 Kabushiki Kaisha Toshiba Electronic apparatus, method, and program
US10089061B2 (en) * 2015-08-28 2018-10-02 Kabushiki Kaisha Toshiba Electronic device and method
US20170061987A1 (en) * 2015-08-28 2017-03-02 Kabushiki Kaisha Toshiba Electronic device and method
US9935777B2 (en) 2015-08-31 2018-04-03 Adobe Systems Incorporated Electronic signature framework with enhanced security
US10361871B2 (en) 2015-08-31 2019-07-23 Adobe Inc. Electronic signature framework with enhanced security
US10770077B2 (en) 2015-09-14 2020-09-08 Toshiba Client Solutions CO., LTD. Electronic device and method
US9626653B2 (en) 2015-09-21 2017-04-18 Adobe Systems Incorporated Document distribution and interaction with delegation of signature authority
US10347215B2 (en) 2016-05-27 2019-07-09 Adobe Inc. Multi-device electronic signature framework
US10255905B2 (en) * 2016-06-10 2019-04-09 Google Llc Predicting pronunciations with word stress
US20170358293A1 (en) * 2016-06-10 2017-12-14 Google Inc. Predicting pronunciations with word stress
US10217453B2 (en) * 2016-10-14 2019-02-26 Soundhound, Inc. Virtual assistant configured by selection of wake-up phrase
US10783872B2 (en) 2016-10-14 2020-09-22 Soundhound, Inc. Integration of third party virtual assistants
US20180108343A1 (en) * 2016-10-14 2018-04-19 Soundhound, Inc. Virtual assistant configured by selection of wake-up phrase
US10503919B2 (en) 2017-04-10 2019-12-10 Adobe Inc. Electronic signature framework with keystroke biometric authentication
EP3806091A1 (en) * 2017-05-16 2021-04-14 Apple Inc. Detecting a trigger of a digital assistant
US20210097998A1 (en) * 2017-05-16 2021-04-01 Apple Inc. Detecting a trigger of a digital assistant
WO2018212953A1 (en) * 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
CN110473538A (en) * 2017-05-16 2019-11-19 苹果公司 Detect the triggering of digital assistants
US11532306B2 (en) * 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US20210312944A1 (en) * 2018-08-15 2021-10-07 Nippon Telegraph And Telephone Corporation End-of-talk prediction device, end-of-talk prediction method, and non-transitory computer readable recording medium
US20210366479A1 (en) * 2020-05-21 2021-11-25 Orcam Technologies Ltd. Systems and methods for emphasizing a user's name
US11875791B2 (en) * 2020-05-21 2024-01-16 Orcam Technologies Ltd. Systems and methods for emphasizing a user's name

Also Published As

Publication number Publication date
US20050038649A1 (en) 2005-02-17
US20040230432A1 (en) 2004-11-18
US20040083090A1 (en) 2004-04-29
US20040083104A1 (en) 2004-04-29
US20040172250A1 (en) 2004-09-02
US20040138894A1 (en) 2004-07-15
US20040176946A1 (en) 2004-09-09
US7292977B2 (en) 2007-11-06
US7389229B2 (en) 2008-06-17
US20040163034A1 (en) 2004-08-19
US7424427B2 (en) 2008-09-09

Similar Documents

Publication Publication Date Title
US20040204939A1 (en) Systems and methods for speaker change detection
JP3488174B2 (en) Method and apparatus for retrieving speech information using content information and speaker information
Makhoul et al. Speech and language technologies for audio indexing and retrieval
US7617188B2 (en) System and method for audio hot spotting
US7801838B2 (en) Multimedia recognition system comprising a plurality of indexers configured to receive and analyze multimedia data based on training data and user augmentation relating to one or more of a plurality of generated documents
US7487094B1 (en) System and method of call classification with context modeling based on composite words
US9514126B2 (en) Method and system for automatically detecting morphemes in a task classification system using lattices
US6681206B1 (en) Method for generating morphemes
US6434520B1 (en) System and method for indexing and querying audio archives
Mandal et al. Recent developments in spoken term detection: a survey
US20080177544A1 (en) Method and system for automatic detecting morphemes in a task classification system using lattices
WO2007056344A2 (en) Techniques for model optimization for statistical pattern recognition
JP2004005600A (en) Method and system for indexing and retrieving document stored in database
Kubala et al. Integrated technologies for indexing spoken language
JP2004133880A (en) Method for constructing dynamic vocabulary for speech recognizer used in database for indexed document
Koumpis et al. Content-based access to spoken audio
US7085720B1 (en) Method for task classification using morphemes
US8639510B1 (en) Acoustic scoring unit implemented on a single FPGA or ASIC
Rose et al. Integration of utterance verification with statistical language modeling and spoken language understanding
Wechsler et al. Speech retrieval based on automatic indexing
US20050125224A1 (en) Method and apparatus for fusion of recognition results from multiple types of data sources
Wang Mandarin spoken document retrieval based on syllable lattice matching
Ariki et al. Live speech recognition in sports games by adaptation of acoustic model and language model.
Nouza et al. A system for information retrieval from large records of Czech spoken data
Furui Steps toward natural human-machine communication in the 21st century

Legal Events

Date Code Title Description
AS Assignment

Owner name: BBNT SOLUTIONS LLC, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIU, DABEN;KUBALA, FRANCIS;REEL/FRAME:014610/0015;SIGNING DATES FROM 20031001 TO 20031003

AS Assignment

Owner name: FLEET NATIONAL BANK, AS AGENT, MASSACHUSETTS

Free format text: PATENT & TRADEMARK SECURITY AGREEMENT;ASSIGNOR:BBNT SOLUTIONS LLC;REEL/FRAME:014624/0196

Effective date: 20040326

AS Assignment

Owner name: BBN TECHNOLOGIES CORP., MASSACHUSETTS

Free format text: MERGER;ASSIGNOR:BBNT SOLUTIONS LLC;REEL/FRAME:017274/0318

Effective date: 20060103

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: BBN TECHNOLOGIES CORP. (AS SUCCESSOR BY MERGER TO BBNT SOLUTIONS LLC), MASSACHUSETTS

Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:BANK OF AMERICA, N.A. (SUCCESSOR BY MERGER TO FLEET NATIONAL BANK);REEL/FRAME:023427/0436

Effective date: 20091026

AS Assignment

Owner name: APPLIED MEDICAL RESOURCES CORPORATION, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CITIBANK N.A., AS ADMINISTRATIVE AGENT;REEL/FRAME:066796/0262

Effective date: 20240129