US20050108630A1 - Extraction of facts from text - Google Patents

Extraction of facts from text

Info

Publication number
US20050108630A1
Authority
US
United States
Prior art keywords
text
attributes
pattern
tokens
annotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/716,202
Inventor
Mark Wasson
James Wiltshire
Donald Loritz
Steve Xu
Shian-Jung Chen
Valentina Templar
Eleni Koutsomitopoulou
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LexisNexis Inc
Original Assignee
LexisNexis Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LexisNexis Inc
Priority to US10/716,202 (US20050108630A1)
Assigned to LEXISNEXIS. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: WASSON, MARK D., KOUTSOMITOPOULOU, ELENI, WILTSHIRE, JR., JAMES S., CHEN, SHIAN-JUNG, LORITZ, DONALD, TEMPLAR, VALENTINA, XU, STEVE
Priority to EP04796351A (EP1695170A4)
Priority to AU2004294094A (AU2004294094B2)
Priority to PCT/US2004/035359 (WO2005052727A2)
Priority to CA2546896A (CA2546896C)
Priority to NZ547871A
Publication of US20050108630A1
Priority to US12/689,629 (US7912705B2)
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • The invention relates to the extraction of targeted pieces of information from text using linguistic pattern matching technologies, and more particularly, the extraction of targeted pieces of information using text annotation and fact extraction.
  • Action: an instruction concerning what to do with some matched text.
  • Annotation Configuration: a file that identifies and orders the set of annotators that should be applied to some text for a specific application.
  • Annotations: attributes, or values, assigned to words or word groups that provide interesting information about the word or words.
  • Example annotations include part-of-speech, noun phrases, morphological root, named entities (such as Corporation, Person, Organization, Place, Citation), and embedded numerics (such as Time, Date, Monetary Amount).
  • Annotator: a software process that assigns attributes to base tokens or to constituents or that creates constituents from patterns of one or more base tokens.
  • Attributes: features, values, properties or links that are assigned to individual base tokens, sequences of base tokens or related but not necessarily adjacent base tokens (i.e., patterns of base tokens). Attributes may be assigned to the tokenized text through one or more processes that apply to the tokenized text or to the raw text.
  • Auxiliary definition: a statement or shorthand notation used to name and define a sub-pattern for use elsewhere.
  • Base tokens: minimum meaningful units, such as alphabetic strings (words), punctuation symbols, numbers, and so on, into which a text is divided by tokenization.
  • Base tokens are the minimum building blocks for a text processing system.
  • Constituent: a base token or pattern of base tokens to which an attribute has been assigned. Although constituents often consist of a single base token or a pattern of base tokens, a constituent is not necessarily composed of contiguous base tokens. An example of a non-contiguous constituent is the two-word verb “looked up” in the sentence “He looked the address up.”
  • Constituent attributes: those attributes that are assigned to a pattern of one or more base tokens that represent a single constituent.
  • Label: an alphanumeric string that uniquely identifies a pattern recognition rule or auxiliary definition.
  • Machine learning-based pattern recognition: pattern recognition in which a statistic-based process might be given a mix of example texts that do and do not represent the targeted extraction result, and the process will attempt to identify the valid patterns that correspond to the targeted results.
  • Pattern: a description of a number of base tokens that should be recognized in some way, where the recognition of the tokens is primarily driven by targeted attributes that have been assigned to the text through annotation processes.
  • One or more annotation value tests, zero or more recognition shifts, zero or more regular expression operators, and zero or more XPath-based (tree-based) operators may all be included in a pattern.
  • Pattern recognition language: a language used to guide a text processing system to find defined patterns of annotations.
  • A pattern recognition rule will test each constituent in some pattern for the presence or absence of one or more desired annotations (attributes). If the right combinations of annotations are found in the right order, the statement can then copy that text, add further annotations, or both, and return it to an application (that is, extract it) for further processing. Because linguistic relationships can involve constituents that are tree-structured or otherwise not necessarily sequentially ordered, a pattern recognition rule can also follow these types of relationships and not just sequentially arranged constituents.
  • Pattern recognition rule: a statement used to describe what text should be located by its pattern, and what should be done when such a pattern is found.
  • RuBIE: Rule-Based Information Extraction language, the language in which the pattern recognition rules of the present invention are expressed.
  • RuBIE application file: a flat text file that contains one or more text pattern recognition rules and possibly other components of the RuBIE pattern recognition language. Typically it will contain all of the extraction rules associated with a single fact extraction application.
  • Rule-based pattern recognition: pattern recognition in which the pattern recognition rules are developed by a computational linguist or other pattern recognition specialist, usually through an iterative trial-and-error develop-evaluate process.
  • Shift: pattern recognition functionality that changes the location within a text where a pattern recognition rule is applying.
  • Many pattern recognition languages have rules that process a text in left-to-right order.
  • Shift functionality allows a rule to process a text in some other order, such as repositioning pattern recognition from mid-sentence to the start of a sentence, from a verb to its corresponding subject in mid-rule, or from any point to some other defined non-contiguous point.
  • Scope: the portion or sub-pattern of a pattern recognition rule that corresponds to an action. An action may act upon the text matched by the sub-pattern only if the entire pattern successfully matches some text.
  • Sub-pattern: any pattern fragment that is less than or equal to a full pattern. Sub-patterns are relevant from the perspective of auxiliary definition statements and from the perspective of scopes of actions.
  • Tests: operations applied to constituents to verify either the value of a constituent or whether a particular attribute has been assigned to that constituent.
  • Text: in the context of a document search and retrieval application such as LexisNexis®, any string of printable characters, although in general a text is usually expected to be a document or document fragment that can be searched, retrieved and presented to customers using the online system. Web pages, customer documents, and natural language queries are other examples of possible texts.
  • Token: a minimal meaningful unit, such as an alphabetic string (word), space, punctuation symbol, number, and so on.
  • Token attributes: those attributes that are assigned to individual base tokens. Examples of token attributes may include the following: (1) part of speech tags, (2) literal values, (3) morphological roots, and (4) orthographic properties (e.g., capitalized, upper case, lower case strings).
  • Tokenize: to divide a text into a sequence of tokens.
  • Prior art pattern recognition languages and tools include lex, SRA's NetOwl® technology, and Perl™. These prior art pattern recognition languages and tools primarily exploit physical or orthographic characteristics of the text, such as alphabetic versus digit, capitalized versus lower case, or specific literal values. Some of these also allow users to annotate pieces of text with attributes based on a lexical lookup process.
  • the leveled parser was an example of a regular expression-based pattern recognition language that used a lexical scanner to tokenize a text—that is, break the text up into its basic components (“base tokens”), such as words, spaces, punctuation symbols, numbers, document markup, etc.—and then use a combination of dictionary lookups and parser grammars to identify and annotate individual tokens and patterns of tokens of interest, based on attributes (“annotations” or “labels”) assigned to those tokens through the scanner, parser or dictionary lookup (a base token and patterns of base tokens that share some common attribute are called “constituents”).
  • the lexical scanner might break the character string
  • A dictionary lookup may include a rule to assign the annotation TITLE to any of the following words and phrases: Mr, Mrs, Ms, Miss, Dr, Rev, President, etc. For the above example, this would result in the following annotated token sequence:
      UC  LCS  TITLE  PER  CPS   UC  PER  CPS     LCS  LCS   PER
      I   saw  Mr     .    Mark  D   .    Benson  go   away  .
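  • The scanner and dictionary lookup described above can be sketched roughly as follows. This is only an illustration, not the leveled parser itself: the input string is reconstructed from the token sequence shown above, the tag names (UC, UCS, CPS, LCS, PER, TITLE) are taken from the text, and the function names, the `OTHER` fallback tag, and the decision to skip spaces are assumptions made for the example.

```python
import re

# Title words taken from the list in the text; a real dictionary would be larger.
TITLES = {"Mr", "Mrs", "Ms", "Miss", "Dr", "Rev", "President"}

# A base token here is a run of letters, a run of digits, or one punctuation symbol.
# Spaces are skipped so the output lines up with the example above (an assumption).
BASE_TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\w\s]")

def orthographic_tag(token):
    """Assign one orthographic annotation to a base token."""
    if token == ".":
        return "PER"                      # period
    if token.isalpha():
        if len(token) == 1 and token.isupper():
            return "UC"                   # single upper case letter
        if token.isupper():
            return "UCS"                  # upper case string
        if token[0].isupper() and token[1:].islower():
            return "CPS"                  # capitalized string
        if token.islower():
            return "LCS"                  # lower case string
    return "OTHER"                        # hypothetical catch-all tag

def annotate(text):
    """Tokenize the text, then let the dictionary lookup override the scanner tag."""
    tokens = BASE_TOKEN.findall(text)
    tags = ["TITLE" if token in TITLES else orthographic_tag(token) for token in tokens]
    return tokens, tags

tokens, tags = annotate("I saw Mr. Mark D. Benson go away.")
print(" ".join(tags))
print(" ".join(tokens))
```

  • In this sketch the dictionary lookup takes precedence over the orthographic tag, which is why Mr receives TITLE rather than CPS.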
  • parser grammar was then used to find interesting tokens and token patterns and annotate them with an indication of their function in the text.
  • the parser grammar rules were based on regular expression notation, a widely used approach to create rules that generally work from left to right through some text or sequence of annotated tokens, testing for the specified attributes.
  • a regular expression rule to recognize people names in annotated text might look like the following: (TITLE (PER)?)? (CPS
  • This rule first looks for TITLE attribute optionally (“?”) followed by a period (PER), although the TITLE or TITLE-PERIOD is also optional. Then it looks for either a capitalized (CPS) OR upper case (UCS) string. It then looks for an upper case letter (UC) optionally followed by a period (PER), OR it looks for a capitalized string (CPS), OR it looks for an upper case string (UCS), although like the title, this portion of the rule is optional. Finally it looks for a capitalized (CPS) OR upper case (UCS) string.
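  • The rule described in the preceding paragraph can be approximated with an ordinary regular expression applied to the sequence of annotations rather than to the characters. The sketch below is built only from the prose description; the exact notation of the original rule is not reproduced here, and the helper name `find_names` and the trick of joining tags with spaces are assumptions.

```python
import re

# One alternative per clause of the prose description. Each tag is followed by
# a single space so the rule can run over a space-joined string of tags.
NAME_RULE = re.compile(
    r"(?<!\S)"                      # only start matching at a tag boundary
    r"(?:TITLE (?:PER )?)?"         # optional title, optionally followed by a period
    r"(?:CPS|UCS) "                 # capitalized or upper case string
    r"(?:UC (?:PER )?|CPS |UCS )?"  # optional initial (with optional period) or middle name
    r"(?:CPS|UCS) "                 # capitalized or upper case string
)

def find_names(tokens, tags):
    """Return the token sequences whose annotations match NAME_RULE."""
    tag_string = "".join(tag + " " for tag in tags)
    names = []
    for match in NAME_RULE.finditer(tag_string):
        first = tag_string[:match.start()].count(" ")   # index of first matched token
        length = match.group().count(" ")               # number of matched tokens
        names.append(tokens[first:first + length])
    return names

tokens = ["I", "saw", "Mr", ".", "Mark", "D", ".", "Benson", "go", "away", "."]
tags = ["UC", "LCS", "TITLE", "PER", "CPS", "UC", "PER", "CPS", "LCS", "LCS", "PER"]
print(find_names(tokens, tags))
```

  • Because the rule tests annotations and not literal characters, the same expression would match any other name with the same annotation pattern.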
  • a grammar whether a lexical scanner, leveled parser or any of the other conventional, expression-based pattern recognition languages and tools, may contain dozens, hundreds or even thousands of rules that are designed to work together for overall accuracy. Any one rule in the grammar may handle only a small fraction of the targeted patterns. Many rules typically are written to find what the user wants, although some rules in a grammar may primarily function to exclude some text patterns from other rules.
  • Regular expression-based pattern recognition works well for a number of pattern recognition problems in text. It is possible to achieve accuracy rates of 90%, 95% or higher for a number of interesting categories, such as company, people, organization and place names; addresses and address components; embedded numerics, such as times, dates, telephone numbers, weights, measures, and monetary amounts; and other tokens of interest such as case and statute citations, case names, social security numbers and other types of identification numbers, document markup, websites, e-mail addresses, and table components.
  • Regular expressions do have a problem recognizing some categories of tokens because there is little if any consistency in the structure of names in those categories, regardless of how many rules one might use. These include product names and names of books or other media, names that can be almost anything. There are also some language-specific issues that one runs into, for example: rules that recognize European language-based names in American English text often will stumble on names of Middle Eastern and Asian language origin; and rules developed to exploit capitalization patterns common in English language text may fail on languages with different capitalization patterns.
  • agent-action-patient has the ability to identify and exploit agent-action-patient relationships in sentences or clauses (the reader may think of these in terms of subject-verb-object relationships, but agent-action-patient is more descriptive and useful given the existence of both active and passive sentences).
  • Orthographic attributes that are assigned to texts or text fragments are attributes whose assignment is based on attributes of the characters in the text, such as capitalization characteristics, letters versus digits, or the literal value of those characters.
  • Regular expression-based pattern recognition rules applied to the characters in a text are quite useful for tokenizing a text into its base tokens and assigning orthographic annotations to those tokens, such as capitalized string, upper case letter, punctuation symbol or space.
  • Regular expression-based pattern recognition rules applied to base tokens are quite useful for combining base tokens together into special tokens such as named entities, citations, and embedded numerics. These types of rules also assign orthographic annotations.
  • a dictionary lookup may be used to assign orthographic, semantic, and other annotations to a token or pattern of tokens.
  • a dictionary was used to assign the attribute TITLE to Mr.
  • Semantic annotations can tell us that something is a person name or a potential title, but these types of annotations do not indicate the function of that person in a document. John may be a person name, but that does not tell us if John did the kissing or if he himself was kissed.
  • Linguists create parsers to help determine the natural language syntax of sentences, sentence fragments, and other texts. This syntax is not only interesting in its own right for the linguistic annotations it provides, but also because it provides a basis for addressing ever more linguistically sophisticated problems. Identifying clauses, their syntactic subjects, verbs, and objects, and the various types of pronouns provides a basis for determining agents, actions, and patients in those clauses and for addressing some types of coreference resolution problems, particularly those involving linking pronouns to names and other nouns.
  • parser-based text annotations are usually represented by a tree or some other hierarchical representation.
  • a tree is useful for representing both simple and rather complex syntactic relationships between tokens.
  • One such tree representation for John kissed Mary is shown in FIG. 1.
  • Parse trees not only annotate a text with syntactic attributes like Noun Phrase or Verb, but through the relationships they represent, it is possible to derive additional grammatical roles as well as semantic functions. For example,
  • the token Mary may be annotated with several attributes, such as the following:
  • the tree representation of FIG. 1 can capture all of these attributes, as shown in FIG. 2 .
  • the hierarchical relationships represented by a tree can be represented through other means.
  • One common way is to represent the hierarchy through the use of nested parentheses.
  • a notation like X (Y) could be used to annotate whatever Y is with the structural attribute X.
  • ProperNoun (John) indicates that John is a constituent under Proper Noun in the tree.
  • The whole sentence would look like the following: Sentence( NounPhrase( ProperNoun( John ) ), VerbPhrase( Verb( kissed ), NounPhrase( ProperNoun( Mary ) ) ) )
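  • To show that the nested-parenthesis notation carries the same information as a tree, the following sketch parses a balanced form of the sentence into nested (label, children) tuples. The parser, its function names, and the tuple representation are assumptions chosen for illustration.

```python
import re

def parse_bracketed(text):
    """Parse X( Y, Z ) notation into nested (label, children) tuples."""
    symbols = re.findall(r"[^\s(),]+|[(),]", text)
    position = 0

    def parse_node():
        nonlocal position
        if position >= len(symbols) or symbols[position] in {"(", ")", ","}:
            raise ValueError("expected a label")
        label = symbols[position]
        position += 1
        children = []
        if position < len(symbols) and symbols[position] == "(":
            position += 1                           # consume "("
            while position < len(symbols) and symbols[position] != ")":
                if symbols[position] == ",":
                    position += 1                   # skip separators between siblings
                    continue
                children.append(parse_node())
            if position >= len(symbols):
                raise ValueError("missing closing parenthesis")
            position += 1                           # consume ")"
        return (label, children)

    tree = parse_node()
    if position != len(symbols):
        raise ValueError("unbalanced parentheses")
    return tree

def leaves(node):
    """Collect the leaf labels (the words of the sentence) from left to right."""
    label, children = node
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]

tree = parse_bracketed(
    "Sentence( NounPhrase( ProperNoun( John ) ), "
    "VerbPhrase( Verb( kissed ), NounPhrase( ProperNoun( Mary ) ) ) )"
)
print(tree[0], leaves(tree))
```

  • A string with surplus or missing closing parentheses is rejected with a `ValueError`, which mirrors the requirement that the notation describe a proper hierarchy.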
  • HTML HyperText Markup Language
  • XML the Extensible Markup Language
  • the structure of a news article may include the headline, byline, dateline, publisher, date, lead, and body, all of which fall under a document node.
  • a tree representation of this structure might look as shown in FIG. 3 .
  • XML can be used to define a news document markup
  • it can be used to define the type of linguistic markup shown in the John kissed Mary example above.
  • XML markup uses a label to mark the beginning and end of the annotated text.
  • Where X (Y) is used above to represent annotating the text Y with the attribute X, XML uses <X>Y</X>, where <X> and </X> are XML tags that annotate text Y with X.
  • The elements of the XML representation correspond to the nodes in the tree representation here. And just as attributes can be added to the nodes in the tree, such as +Object, +Patient and Literal “Mary” were added to the tree in FIG. 2, attributes can be associated with XML elements. Attributes in XML provide additional information about the element or the contents of that element.
  • trees are routinely used to represent both syntactic structure and attributes assigned to nodes in the tree.
  • XML can be used to represent this same information.
  • Finding related entities/nodes in trees and identifying the relationships between them primarily rely on navigating the paths between these entities and using the information associated with the entities/nodes. For example, as discussed above, this information could be used to identify grammatical subjects, objects and the relationship (in that case the verb) between them.
  • XPath is a language created to similarly navigate XML representations of texts.
  • XSL is a language for expressing stylesheets.
  • An XML style sheet is a file that describes how to display an XML document of a given type.
  • XSL Transformations is a language for transforming XML documents, such as for generating an HTML web page from XML data.
  • XPath is a language used to identify particular parts of XML documents. XPath lets users write expressions that refer to elements and attributes. XPath indicates nodes in the tree by their position, relative position, type, content, and other criteria. XSLT uses XPath expressions to match and select specific elements in an XML document for output purposes or for further processing.
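  • As a rough illustration of navigating such a representation, the sketch below encodes John kissed Mary as XML and uses path expressions to pull out the agent, action, and patient. The element names follow the tree discussed earlier; the attribute names (`agent`, `patient`, `object`, `subject`, `literal`) and their placement on particular elements are assumptions. Python's `xml.etree.ElementTree` supports only a subset of XPath, which is sufficient here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of the parse tree with role attributes added.
XML_TEXT = """
<Sentence>
  <NounPhrase subject="true" agent="true">
    <ProperNoun literal="John">John</ProperNoun>
  </NounPhrase>
  <VerbPhrase>
    <Verb literal="kissed">kissed</Verb>
    <NounPhrase object="true" patient="true">
      <ProperNoun literal="Mary">Mary</ProperNoun>
    </NounPhrase>
  </VerbPhrase>
</Sentence>
"""

root = ET.fromstring(XML_TEXT.strip())

# Each path walks from the Sentence node down to the element of interest,
# using an attribute predicate where the position alone is not enough.
agent = root.find("./NounPhrase[@agent='true']/ProperNoun").text
action = root.find("./VerbPhrase/Verb").text
patient = root.find("./VerbPhrase/NounPhrase[@patient='true']/ProperNoun").text

print(agent, action, patient)
```

  • The same three paths would apply unchanged to any other sentence annotated with this structure, which is the sense in which path navigation can recover relationships that a left-to-right regular expression cannot.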
  • XPath and XPath-based functionality can serve as a basis for processing that representation much like linguists have historically used Lisp and Lisp-based functionality.
  • Base token identification feeds into named entity recognition.
  • Named entity recognition results feed into a part of speech tagger.
  • Part of speech tagging results feed into a parser. All of these processes can make mistakes, but because each tool feeds its results into the next one and each tool generally assumes correct input, errors are often built on errors.
  • a named entity recognizer that uses capitalization might incorrectly include the capitalized first word of a sentence as part of a name, whereas a part of speech tagger that relies heavily on term dictionaries may keep that first word separate.
  • Original text: Did Bill go to the store?
  • Well-formed XML has a strict hierarchical syntax.
  • Marked sub-pieces of text are permitted to be nested within one another, but their boundaries may not cross. That is, they may not have overlapping tags. (HTML, the markup commonly used for web pages, does permit overlapping tags.)
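  • The difference between nesting and crossing can be made concrete with token spans. In the sketch below each annotation covers a half-open range of token indexes over the text string A B C D E; the span names X, Y, Z and their boundaries are invented for the example and are not taken from the figures.

```python
from itertools import combinations

def crosses(first, second):
    """True when two half-open token spans overlap without one containing the other."""
    (start_1, end_1), (start_2, end_2) = first, second
    return start_1 < start_2 < end_1 < end_2 or start_2 < start_1 < end_2 < end_1

def crossing_pairs(spans):
    """List every pair of annotation names whose boundaries cross."""
    names = sorted(spans)
    return [(a, b) for a, b in combinations(names, 2) if crosses(spans[a], spans[b])]

tokens = ["A", "B", "C", "D", "E"]
spans = {
    "X": (0, 3),   # covers A B C
    "Y": (2, 5),   # covers C D E, overlapping X at C without nesting
    "Z": (0, 2),   # covers A B, nested inside X
}
print(crossing_pairs(spans))
```

  • Only the X and Y spans are reported, since Z is fully contained in X and does not touch Y. Annotations like X and Y could not both be written as well-formed XML elements without splitting one of them.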
  • The Penn Tools, an information extraction research prototype developed by the University of Pennsylvania, combine strong regular expression-based pattern recognition functionality with what on the surface appeared to be some tree navigation functionality. However, in that tool, only a few interesting types of tree-based relationships were retained. These were translated into a positional, linear, non-tree representation so that their regular expression-based extraction language, Mother of Perl (“MOP”), could also apply to those relationships in its rules.
  • the Penn Tools information extraction research prototype does not have the ability to exploit all of the available, tree-based relationships in combination with full regular expression-based pattern recognition.
  • FEX fact extraction tool set
  • the tag uncrossing tool in accordance with the present invention resolves conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the individual FEX Annotators.
  • the text annotation tool in accordance with the present invention includes assigning attributes to the parts of the text. These attributes may include tokenization, orthographic, text normalization, part of speech tags, sentence boundaries, parse trees, and syntactic, semantic, and pragmatic attribute tagging and other interesting attributes of the text.
  • the fact extraction tool set in accordance with the present invention takes a text passage such as a document, sentence, query, or any other text string, breaks it into its base tokens, and annotates those tokens and patterns of tokens with a number of orthographic, syntactic, semantic, pragmatic and dictionary-based attributes.
  • XML is used as a basis for representing the annotated text.
  • FEX annotations are of three basic types. Expressed in terms of regular expressions, these are as follows: (1) token attributes, which have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute value to each base token; (2) constituent attributes assigned yes-no values to patterns of base tokens, where the entire pattern is considered to be a single constituent with respect to some annotation value; and (3) links, which connect coreferring constituents such as names, their variants, and pronouns.
  • token attributes tend to be represented as XML attributes on base tokens
  • constituent attributes and links tend to be represented as XML elements
  • Shifts tend to be represented as XPath expressions that utilize token attributes, constituent attributes, and links
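  • One possible way to picture the three annotation types in XML is sketched below. The element and attribute names (`token`, `Person`, `coref`, `pos`, `literal`, `id`, `ref`) are hypothetical and are not the names used by the FEX tool set.

```python
import xml.etree.ElementTree as ET

sentence = ET.Element("sentence")

# (2) Constituent attribute: a Person element wraps the tokens that form the name.
person = ET.SubElement(sentence, "Person", id="p1")
for literal, pos in [("Mark", "NNP"), ("Benson", "NNP")]:
    # (1) Token attributes: one value per base token, stored as XML attributes.
    token = ET.SubElement(person, "token", literal=literal, pos=pos)
    token.text = literal

verb = ET.SubElement(sentence, "token", literal="said", pos="VBD")
verb.text = "said"

# (3) Link: a coreference element ties the pronoun back to the name it refers to.
link = ET.SubElement(sentence, "coref", ref="p1")
pronoun = ET.SubElement(link, "token", literal="he", pos="PRP")
pronoun.text = "he"

print(ET.tostring(sentence, encoding="unicode"))

# Following the link with a path expression recovers the antecedent constituent.
antecedent = sentence.find(".//Person[@id='%s']" % link.get("ref"))
print([token.text for token in antecedent])
```

  • Following the `ref` attribute with a path expression, as in the last two lines, is a small example of the kind of non-sequential move that a shift performs.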
  • FEX Annotators are identified as well as any necessary parameters, input/output, dictionaries, or other relevant information.
  • the annotation results of these FEX Annotators are stored individually.
  • The fact extraction tool set in accordance with the present invention focuses on identifying and extracting potentially interesting pieces of information in an annotated text by finding patterns in the attributes stored by the annotators. To find these patterns and extract the interesting facts, the user creates a RuBIE application file using a Rule-Based Information Extraction language (“the RuBIE pattern recognition language”) to write pattern recognition and extraction rules. This file queries for literal text, attributes, or relationships found in the annotations. It is these queries that actually define the facts to be extracted. The RuBIE application file is compiled and applied to the aligned annotations generated in the previous steps.
  • FIG. 1 is a tree representation for the phrase John kissed Mary.
  • FIG. 2 is the tree representation of FIG. 1 with further annotations of the token Mary.
  • FIG. 3 shows a tree representation of the basic structure of a news article.
  • FIG. 4 shows the assignment of tags with conflicting boundaries (or nesting) to the text string A B C D E.
  • FIG. 5 is a diagrammatic illustration of a first scenario in which a FEX product creates a database with the facts extracted by the FEX tool set and provides a customer interface to present these facts from the database.
  • FIG. 6 is a diagrammatic illustration of a second scenario in which an FEX product updates an original document with extracted facts metadata and leverages an existing customer interface to present the facts.
  • FIG. 7 is a diagrammatic illustration of the FEX tool set architecture.
  • FIG. 8 is a high level flow diagram for the processing flow of the FEX tool set using the architecture shown in FIG. 7 .
  • the fact extraction (“FEX”) tool set in accordance with the present invention extracts targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction.
  • the text annotation process assigns attributes to a text passage such as a document, sentence, query, or any other text string, by parsing the text passage—breaking it into its base tokens and annotating those tokens and patterns of tokens with a number of orthographic, syntactic, semantic, pragmatic, and dictionary-based attributes.
  • attributes may include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, semantic attribute tagging and other interesting attributes of the text.
  • Text structure is usually defined or controlled by some type of markup language.
  • an annotated text is represented using XML, the Extensible Markup Language.
  • the FEX tool set includes a tag uncrossing process to resolve conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the text annotation process prior to fact extraction.
  • the FEX Annotation Process includes the management of annotation configuration information, the actual running of the annotators, and the alignment of the resulting annotations.
  • A text is annotated by first segmenting, or tokenizing, it into a sequence of minimal, meaningful text units called base tokens, which include words, numbers, punctuation symbols, and other basic text units.
  • Token attributes with which the FEX tool set can annotate the base tokens of a text include, but are not limited to, part of speech tags, literal values, morphological roots, and orthographic properties (e.g., capitalized, upper case, lower case strings). More specifically, examples of these attributes include, but are not limited to:
  • Attributes are assigned to it through one or more processes that apply to the tokenized text or to the raw text. Every base token has at least one attribute: its literal value. Most tokens will have numerous additional attributes. They may also be part of a pattern of tokens that has one or more attributes. Depending on the linguistic sophistication of a particular extraction application, a token may have a few or a few dozen attributes assigned to it, directly or through its parents in the tree structure representation (“representation” referring here to the fact that what is stored on the computer is a representation of a tree structure). A constituent is a base token or pattern of base tokens to which an attribute has been assigned.
  • tests are applied to the constituents to verify either the value of a constituent or whether a particular attribute has been assigned to that constituent. If a test is successful, the pattern recognition process consumes, or moves to a point just past, the corresponding underlying base-tokens. Because the pattern recognition process can shift, or move, to different locations in a text, it is possible for a single pattern recognition rule to consume the same base tokens more than once.
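  • The interplay of tests, consumption, and shifts can be sketched with a toy rule interpreter. This is not RuBIE syntax: the step tuples, the `has` helper, and the `run_rule` function are assumptions made for the example. It shows how a shift lets a single rule consume the same base token twice.

```python
def has(attribute, value):
    """Build a test that checks one attribute of a constituent."""
    return lambda constituent: constituent.get(attribute) == value

def run_rule(constituents, steps, start=0):
    """Apply tests and shifts in order; return consumed positions or None on failure."""
    position = start
    consumed = []
    for kind, argument in steps:
        if kind == "shift":
            position = argument            # reposition without consuming anything
            continue
        if position >= len(constituents) or not argument(constituents[position]):
            return None                    # one failed test fails the whole rule
        consumed.append(position)
        position += 1                      # consume: move just past the matched token
    return consumed

sentence = [
    {"literal": "John", "pos": "NNP"},
    {"literal": "kissed", "pos": "VBD"},
    {"literal": "Mary", "pos": "NNP"},
]

rule = [
    ("test", has("pos", "NNP")),           # a proper noun
    ("test", has("pos", "VBD")),           # followed by a past-tense verb
    ("shift", 0),                          # jump back to the start of the sentence
    ("test", has("literal", "John")),      # test the first token again by literal value
]

print(run_rule(sentence, rule))
```

  • The returned positions show token 0 being consumed twice, once before and once after the shift.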
  • FEX Annotators used by the FEX tool set
  • The FEX Annotators can include proprietary annotators (including base tokenizers or end-of-sentence recognition) as well as commercially available products (e.g., Inxight's ThingFinder™ and LinguistX® are commercially available tools that support named entity and event recognition and classification, part of speech tagging, and other language processing functions).
  • the user will determine which FEX Annotators should apply to the text.
  • the FEX tool set allows the user to control the execution of the FEX Annotators through annotation configuration files, which are created and maintained by the user in a graphical user interface development environment (GUI-DE) provided by the FEX tool set.
  • the user lists the FEX Annotators that the user wishes to run on his or her files, any relevant parameters for each (like dictionary names or other customizable switches, input/output, or other relevant information).
  • the user also determines the order in which the selected FEX Annotators run, since some FEX Annotators might depend on the output of others.
  • the annotation results of these FEX Annotators are stored individually.
  • the FEX tool set runs the chosen annotators against the input documents.
  • the first FEX annotator that runs against the document text is the Base Tokenizer, which generates “base tokens.”
  • Other FEX Annotators may operate on these base tokens or may use the original document text when generating annotations.
  • Annotation alignment is the process of associating all annotations assigned to a particular piece of text with the base token(s) for that text.
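  • The flow described above (an ordered configuration of annotators, individually stored results, then alignment to base tokens) might be sketched as follows. The two toy annotators, the configuration format, and the character-offset alignment strategy are all assumptions; the actual FEX Annotators and configuration files are not shown in this text.

```python
import re

def base_tokenizer(text):
    """Return base tokens as (literal, start offset, end offset) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def toy_name_annotator(text):
    # Hypothetical stand-in for a named entity recognizer.
    return [("Person", m.start(), m.end()) for m in re.finditer(r"\bBill\b", text)]

def toy_phrase_annotator(text):
    # Hypothetical stand-in for a parser that marks one noun phrase.
    return [("NounPhrase", m.start(), m.end()) for m in re.finditer(r"\bthe store\b", text)]

# The configuration names the annotators and fixes the order in which they run.
CONFIGURATION = [("names", toy_name_annotator), ("phrases", toy_phrase_annotator)]

def annotate(text, configuration):
    """Run the base tokenizer first, then each annotator; keep results separately."""
    tokens = base_tokenizer(text)
    results = {name: annotator(text) for name, annotator in configuration}
    return tokens, results

def align(tokens, results):
    """Associate each annotation with the indexes of the base tokens it covers."""
    aligned = []
    for source, annotations in results.items():
        for label, start, end in annotations:
            covered = [index for index, (_, token_start, token_end) in enumerate(tokens)
                       if start <= token_start and token_end <= end]
            aligned.append((source, label, covered))
    return aligned

text = "Did Bill go to the store?"
tokens, results = annotate(text, CONFIGURATION)
aligned = align(tokens, results)
print(aligned)
```

  • Keeping each annotator's output separate until the alignment step means an annotator can be added, removed, or reordered in the configuration without touching the others.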
  • In Example (a), the <ADJP> node “crosses” the <CLAUSE> and <NP> nodes, both of which begin inside of the <ADJP> node but terminate outside of it (i.e., beyond </ADJP>).
  • Such an improperly nested document cannot be processed by standard XML processors.
  • the method by which the FEX tool set uncrosses such documents to a properly-nested structure, as shown in the following Example (b), will now be described.
  • Step 1 Given a crossed XML document as in Example (a), convert contiguous character-sequences of the document to a Document Object Model (DOM) array of three object-types of contiguous document markup and content: START-TAGs, END-TAGs, and OTHER.
  • START-TAGs and END-TAGs are markup defined by the XML standard, for example, ⁇ doc> is a START-TAG and ⁇ /doc> is its corresponding END-TAG.
  • START-TAGs and their matching END-TAGs are also assigned a NESTING-LEVEL such that a parent-node's NESTING-LEVEL is less than (or, alternatively, greater than) its desired children's NESTING-LEVEL. All other blocks of contiguous text, whether markup, white space, textual content, or CDATA are designated OTHER.
  • Step 2 Set INDEX at the first element of the array and scan the array object-by-object by incrementing INDEX by 1 at each step.
  • Step 3 If the object at INDEX is a START-TAG, push a pointer to it onto an UNMATCHED-START-STACK (or, simply “the STACK”). Continue scanning.
  • Step 4 If the current object is an END-TAG, compare it to the START-TAG (referenced) at Top of the STACK (“TOS”).
  • Step 5 If the current END-TAG matches the START-TAG at TOS, pop the STACK. For example, the END-TAG “</doc>” matches the START-TAG “<doc>.” Continue scanning with the DOM element that follows the current END-TAG.
  • Step 6 If the current END-TAG does not match the START-TAG at TOS, then
  • Step 8 is repeated until the START-TAG at TOS matches the current END-TAG, at which point the method continues from Step 3.
  • Step 9 If the PRIORITY of the START-TAG at TOS is greater than the PRIORITY of the current END-TAG, set the variable INCREMENT to 1. Recursively descend the START-STACK until a START-TAG is found which matches the current END-TAG. Create a SPLART-TAG from this START-TAG, as in Step 7, and replace the START-TAG in the DOM at the index of the START-TAG at TOS with this (current) SPLART-TAG.
  • Step 10 Unwind the STACK, and at each successive TOS, insert a copy of the current END-TAG into the array before the array index of the START-TAG at TOS. Add INCREMENT to the array index of the START-TAG at TOS. If INCREMENT is equal to 1, set it to 2. Insert a copy of SPLART-TAG into the DOM after the index of the START-TAG at TOS and continue unwinding the STACK at Step 10.
  • Step 11 Resume scanning after the current END-TAG at Step 2.
  • the DOM, which in the above description is implemented as an array, may also be implemented as a string (with arrays or stacks of index pointers), as a linked-list, or other data structure without diminishing the generality of the method in accordance with the present invention.
  • the number, names, and types of elements represented in the DOM may also be changed without departing from the principles of the present invention.
  • the recursive techniques and SPID numbering conventions used in the preceding example were chosen for clarity of exposition. Those skilled in the art will understand that they can be replaced with non-recursive techniques and non-sequential reference identification without departing from the principles of the present invention.
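The general idea of uncrossing with a stack of unmatched START-TAGs can be sketched in Python as follows. This is a deliberately simplified variant, not the method of the steps above: it has no PRIORITY or NESTING-LEVEL handling and no SPLART-TAG identifiers. When an END-TAG does not match the START-TAG at the top of the stack, it simply closes the intervening elements and reopens them after the END-TAG. The function name and regular expression are assumptions, and the sketch assumes input without self-closing tags, comments, or CDATA sections.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^>]*>")


def uncross(markup):
    """Return markup in which crossed elements are split so that tags nest properly."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(markup):
        out.append(markup[pos:m.start()])  # OTHER content passes through unchanged
        pos = m.end()
        is_end, name = m.group(1) == "/", m.group(2)
        if not is_end:
            stack.append((name, m.group(0)))  # remember the unmatched START-TAG
            out.append(m.group(0))
        elif any(open_name == name for open_name, _ in stack):
            reopened = []
            # Close every element opened after the one this END-TAG belongs to.
            while stack[-1][0] != name:
                open_name, start_tag = stack.pop()
                out.append(f"</{open_name}>")
                reopened.append((open_name, start_tag))
            stack.pop()
            out.append(m.group(0))
            # Reopen the split elements in their original order.
            for open_name, start_tag in reversed(reopened):
                out.append(start_tag)
                stack.append((open_name, start_tag))
        # An END-TAG with no open START-TAG is dropped.
    out.append(markup[pos:])
    return "".join(out)


fixed = uncross("<a>x<b>y</a>z</b>")
```

Here the crossed `<b>` element is split into two properly nested pieces. Unlike the method described above, this sketch does not record that the two pieces originated from one element.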
  • Fact extraction focuses on identifying and extracting potentially interesting pieces of information in an annotated text by finding patterns in the attributes stored by the FEX annotators. To find these patterns and extract the interesting facts from the aligned annotations, the user creates a file in the GUI-DE using a Rule-Based Information Extraction language (“the FEX RuBIE pattern recognition language”).
  • This file (“the RuBIE application file”) comprises a set of instructions for extracting pieces of text from some text file.
  • the RuBIE application file can also comprise comments and blanks. The instructions are at the heart of a RuBIE-based extraction application, while the comments and blanks are useful for helping organize and present these instructions in a readable way.
  • The instructions in the RuBIE application file are represented in RuBIE in two different types of rules or statements: (1) a pattern recognition rule or statement, and (2) an auxiliary definition statement.
  • a RuBIE pattern recognition rule is used to describe what text should be located by its pattern, and what should be done when such a pattern is found.
  • RuBIE application files are flat text files that can be created and edited using a text editor.
  • the RuBIE pattern recognition language is not limited to the basic 26-letter Roman alphabet, but at least minimally also supports characters found in major European languages, thus enabling it to be used in a multilingual context.
  • RuBIE application files can contain any number of rules and other components of the RuBIE pattern recognition language. They can support any number of comments and any amount of white space, within size limits of the text editor. Any limits on scale are due to text editor size restrictions or operational performance considerations.
  • a RuBIE pattern recognition rule comprises three components: (1) a pattern that describes the text of interest, perhaps in context, (2) a label that names the pattern for testing and debugging purposes; and (3) an action that indicates what should be done in response to a successful match.
  • a pattern is a regular expression-like description of a number of base tokens or other constituents that should be recognized in some way, where the recognition of the tokens is primarily driven by targeted attributes that have been assigned to the text through annotation processes.
  • One or more annotation value tests, zero or more recognition shifts, and zero or more regular expression operators may all be included in a pattern.
  • T(v) literal: case-sensitive form of the base token string
  • T(v) word: case-insensitive form of the base token string
  • T(v) token: base level token orthographic attribute
  • T(v) POS: part of speech tag
  • T(v) root: inflectional morphological root
  • T(v) morphfeat: inflectional morphology attributes
  • T(v) morphdfeat: derivational morphology attributes
  • T(v) space: representation of white space following the token in the original text
  • T(v) region: segment name or region in the document
  • T(v) tokennum: sequential number of token in text
  • T(v) startchar: position of token's first character
  • T(v) endchar: position of token's last character
  • T(v) length: character length of token
  • T(v) DocID: document identification number that includes the current base token
  • T(v) attribute: attributes assigned using a system or user-defined word dictionary
  • One or more actions may be associated with a pattern or with one or more specified sub-patterns in the pattern.
  • a sub-pattern is any pattern fragment that is less than or equal to a full pattern.
  • An auxiliary definition statement is used to name and define a sub-pattern for use elsewhere. This named sub-pattern may then be used in any number of recognition statements located in the same RuBIE application file.
  • Auxiliary definitions provide convenient shorthand for sub-patterns that may be used in several different patterns.
  • Although a single pattern may match several base tokens, whether sequential or otherwise related, the user may only be interested in one or more subsets of the matched tokens.
  • the larger pattern provides context for the smaller pattern of interest. For example, in a text aaa bbb ccc the user may want to match bbb every time that it appears, only when it follows aaa, only when it precedes ccc, or only when it follows aaa and precedes ccc.
  • the full pattern is used to match a specific piece of text of interest.
  • If the rest is only provided for context, then it must be possible to mark off the interesting sub-pattern.
  • Recognition shifts can significantly impact the text that actually corresponds to a bracketed sub-pattern. Because shifts do not actually recognize tokens, shifts at the start or end of a bracketed sub-pattern do not alter the tokens that are included in the bracket. In other words, aaa govtos [ bbb ] and aaa [ govtos bbb ] would perform the same way, identifying bbb.
  • a pattern must adhere to the following requirements:
  • a label is an alphanumeric string that uniquely identifies a pattern recognition rule or auxiliary definition.
  • a label supports debugging, because the name can be passed to the calling program when the corresponding pattern matches some piece of text.
  • With an auxiliary definition, a label can be used in a pattern to represent a sub-pattern that has been defined. Auxiliary definitions are a convenience for when the same sub-pattern is used repeatedly in one or more patterns.
  • a label that is associated with some pattern may look something like this:
      <person>: ( word("mr", "ms", "mrs") literal(".")? )? token(capstring, capinitial){1,4}
      #officer: <person> literal(",")? jobtitle
  • the auxiliary definition <person> may consist of a title word optionally followed by a period, although this sequence is optional. It is then followed by one to four capitalized words or strings.
  • the #officer pattern recognition rule uses the <person> label to represent the definition of a person, followed by an optional comma and then followed by a job title to identify and extract a reference to a corporate officer.
  • this sample auxiliary definition “<person>” and the “Person” constituent attribute test as found in Table 5.
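To make the behavior of the sample person definition and #officer rule concrete, the following Python sketch hand-codes an equivalent matcher over a list of annotated tokens. It is an illustration only, not the RuBIE engine. The token dictionary layout, the helper names, and the interpretation of `capstring` (a capitalized alphabetic string) and `capinitial` (a single capital letter) are assumptions; `jobtitle` is assumed to be an attribute assigned by some earlier annotator.

```python
TITLES = {"mr", "ms", "mrs"}
CAPS = {"capstring", "capinitial"}


def tok(literal, *attrs):
    """Build a hypothetical annotated base token."""
    if literal.isalpha() and literal[0].isupper():
        kind = "capinitial" if len(literal) == 1 else "capstring"
    elif literal.isalpha():
        kind = "lower"
    else:
        kind = "other"
    return {"literal": literal, "word": literal.lower(), "token": kind, "attrs": set(attrs)}


def person_ends(tokens, i):
    """Yield each index at which a person match starting at i could end."""
    starts = []
    if i < len(tokens) and tokens[i]["word"] in TITLES:
        if i + 1 < len(tokens) and tokens[i + 1]["literal"] == ".":
            starts.append(i + 2)  # title plus period
        starts.append(i + 1)      # title alone
    starts.append(i)              # no title
    for s in starts:
        run = 0
        while run < 4 and s + run < len(tokens) and tokens[s + run]["token"] in CAPS:
            run += 1
        for k in range(run, 0, -1):  # longest first, like a greedy {1,4}
            yield s + k


def match_officer(tokens, i):
    """Return (start, end) of an #officer match beginning at i, or None."""
    for j in person_ends(tokens, i):
        has_comma = j < len(tokens) and tokens[j]["literal"] == ","
        for k in ([j + 1, j] if has_comma else [j]):
            end = k
            while end < len(tokens) and "jobtitle" in tokens[end]["attrs"]:
                end += 1
            if end > k:
                return (i, end)
    return None


sentence = [tok("Ms"), tok("."), tok("Jane"), tok("Q"), tok("Doe"), tok(","),
            tok("Chief", "jobtitle"), tok("Executive", "jobtitle"), tok("resigned")]
match = match_officer(sentence, 0)
```

Trying the longer alternatives first and falling back to shorter ones imitates the backtracking that a regular expression engine would perform for the optional and repeated parts of the pattern.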
  • An action is an instruction to the RuBIE pattern recognition language concerning what to do with some matched text.
  • the user will want RuBIE to return the matched piece of text and some attributes of that text so that the calling application can process it further.
  • the user may want to return other information or context in some cases.
  • a selection of actions gives the user increased flexibility in what the user does when text is matched.
  • Each action has a scope, where the scope is the pattern or clearly delineated sub-pattern that when matched correctly to some piece of text, the action will apply to that piece of text.
  • Each pattern recognition rule must have at least one action (otherwise, there would be no reason for having the statement in the first place).
  • a statement may in fact have more than one action associated with it, each with a sub-pattern that defines its scope. More than one action may share the same scope, that is, the successful recognition of some piece of text may result in executing more than one action.
  • individual parts of the rule may successfully match attributes assigned to some text.
  • actions will only be triggered when the entire rule is successful, even if the scope of the action is limited only to a subset of the rule. For this reason, if the entire pattern recognition rule successfully matches some pattern of text attributes, all associated actions will be triggered; if any part of the rule fails, none of its associated actions will be triggered.
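The all-or-nothing triggering of actions can be sketched in Python using the aaa bbb ccc example from this description. The rule representation (a list of named one-token tests plus a list of scoped callbacks) is hypothetical and far simpler than a real RuBIE rule.

```python
def apply_rule(parts, actions, tokens, start):
    """Match one token per part, in sequence, beginning at `start`.

    parts:   list of (name, test) pairs, where test is a predicate on a token.
    actions: list of (scope_name, callback) pairs.
    Callbacks run only if every part matches; otherwise nothing is triggered.
    """
    matched = {}
    for offset, (name, test) in enumerate(parts):
        idx = start + offset
        if idx >= len(tokens) or not test(tokens[idx]):
            return []  # any failure means no action fires
        matched[name] = tokens[idx]
    return [callback(matched[scope]) for scope, callback in actions]


# Match bbb only when it follows aaa and precedes ccc; extract bbb alone.
parts = [
    ("before", lambda t: t == "aaa"),
    ("target", lambda t: t == "bbb"),
    ("after", lambda t: t == "ccc"),
]
actions = [("target", lambda t: ("extract", t))]

hit = apply_rule(parts, actions, ["aaa", "bbb", "ccc"], 0)
miss = apply_rule(parts, actions, ["aaa", "bbb", "ddd"], 0)
```

In the second call the first two parts match, but because the third fails, the action scoped to `target` is never executed.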
  • An auxiliary definition provides a shorthand notation for writing and maintaining a sub-pattern that will be used multiple times in the pattern recognition rules. It is somewhat analogous to macros in some programming languages.
  • auxiliary definitions are a convenience for when the same sub-pattern is used repeatedly in one or more pattern recognition rules.
  • <person>: ( word("mr", "ms", "mrs") literal(".")? )? token(capstring, capinitial){1,4}
      #officer: <person> literal(",")? jobtitle
  • the auxiliary definition label may be used repeatedly in one or more pattern recognition rules.
  • application-specific dictionaries in the RuBIE pattern recognition language can be separate annotators.
  • lexical entries can be provided in the same file in which pattern recognition rules are defined.
  • the RuBIE application file has syntax for defining lexical entries within the file.
  • RuBIE application files (RAFs) may vary from a few pattern recognition rules to hundreds or even thousands of rules. Individual rules may be rather simple, or they may be quite complex. Clear, well-organized and well-presented RAFs make applications easier to develop and maintain.
  • the RuBIE pattern recognition language provides users with the flexibility to organize their RAFs their own way in support of producing RAFs in a style that is most appropriate for the application and its maintenance.
  • the fact extraction application that applies a RuBIE application file against some annotated text routinely has access to some standard results. Also, it optionally has access to all the annotations that supported the extraction process.
  • the input and output requirements for the RuBIE pattern recognition language are as follows:
  • the FEX server (described in greater detail hereinafter) compiles the RuBIE application file and runs it against the aligned annotations to extract facts.
  • the RuBIE pattern recognition language is a pattern recognition language that applies to text that has been tokenized into its base tokens—words, numbers, punctuation symbols, formatting information, etc.—and annotated with a number of attributes that indicate the form, function, and semantic role of individual tokens, patterns of tokens, and related tokens.
  • Text structure is usually defined or controlled by some type of markup language; that is, the RuBIE pattern recognition language applies to one or more sets of annotations that have been aligned with a piece of tokenized text.
  • the RuBIE pattern recognition language itself places no restrictions on the markup language used in the source text because the RuBIE pattern recognition language actually applies to sets of annotations that have been aligned with the base tokens of the text rather than directly to the source text itself.
  • the RuBIE pattern recognition language is rule-based, as opposed to machine learning-based.
  • the RuBIE pattern recognition language can exploit any attributes with which a text representation has been annotated.
  • Through a dictionary lookup process, a user can create new attributes specific to some application. For example, in an executive changes extraction application that targets corporate executive change information in business news stories, a dictionary may be used to assign the attribute ExecutivePosition to any of a number of job titles, such as President, CEO, Vice President of Marketing, Senior Director and Partner.
  • a RuBIE pattern recognition rule can then simply use the attribute name rather than list all of the possible job titles.
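A dictionary lookup of this kind can be sketched in Python as a longest-match scan over the words of a text. The function name, the span format, and the sample sentence are assumptions; the job titles and the ExecutivePosition attribute name come from the example above.

```python
def annotate_with_dictionary(words, dictionary, attribute):
    """Return (start, end, attribute) spans for the longest dictionary phrase at each position."""
    phrases = {tuple(phrase.lower().split()) for phrase in dictionary}
    longest = max((len(p) for p in phrases), default=0)
    spans, i = [], 0
    while i < len(words):
        # Try the longest candidate first so multi-word titles win over their parts.
        for n in range(min(longest, len(words) - i), 0, -1):
            if tuple(w.lower() for w in words[i:i + n]) in phrases:
                spans.append((i, i + n, attribute))
                i += n
                break
        else:
            i += 1
    return spans


positions = ["President", "CEO", "Vice President of Marketing", "Senior Director", "Partner"]
words = "Acme named Pat Lee Vice President of Marketing".split()
spans = annotate_with_dictionary(words, positions, "ExecutivePosition")
```

A rule can then test for the single attribute ExecutivePosition on tokens 4 through 7 instead of enumerating every job title.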
  • An application that targets corporate executive change information in business news stories may have rules that attempt to identify each of the following pieces of information in news stories that have been categorized as being relevant to the topic of executive changes:
  • the semantic agent of a “retired” action (the person performs the action of retiring) or the semantic patient of a “hired” or “fired” action (the person's executive status changes because someone else performs the action of hiring or firing them) is likely the person affected by the change. It may take multiple rules to capture all of the appropriate executives based on all the possible action-semantic role combinations possible. That is why a RuBIE application file may include many rules for a single application.
  • Information extraction applications can be developed for any topic area where information about the topic is explicitly stated in the text.
  • the attributes are limited to little more than orthographic attributes of the text, e.g., What is the literal value of a token? Is it an alphabetic string, a digit string or a punctuation symbol? Is the string capitalized, upper case or lower case? And so on.
  • pattern recognition languages rely on a regular expression-based description of the attribute patterns that should be matched.
  • a regular expression in annotated text processing is a rule that tests for the presence of a single attribute or the complement of that attribute assigned to some part of the text, such as a base token. More complex regular expressions look for some combination of tests, such as sequences of different tests, choices between multiple tests, or optional tests among required tests.
  • Regular expression-based pattern recognition processes often progress left-to-right through the text.
  • Some regular expression-based pattern recognition languages will have additional criteria for selecting between two pattern recognition rules that each could match the same text, such as the rule listed first in the rule set has priority, or the rule that matches the longest amount of text has priority.
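The two tie-breaking criteria mentioned above can be sketched in a few lines of Python. The candidate format (rule index, start token, end token) and the policy names are hypothetical.

```python
def choose_match(candidates, policy="longest"):
    """Pick one of several competing matches for the same text.

    candidates: list of (rule_index, start, end) tuples, end exclusive.
    policy "first":   the rule listed first in the rule set wins.
    policy "longest": the rule matching the most text wins; ties go to the earlier rule.
    """
    if policy == "first":
        return min(candidates, key=lambda c: c[0])
    return max(candidates, key=lambda c: (c[2] - c[1], -c[0]))


candidates = [(0, 5, 7), (1, 5, 9)]
by_length = choose_match(candidates, "longest")
by_order = choose_match(candidates, "first")
```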
  • Regular expression-based pattern recognition languages are often implemented using finite state machines, which are highly efficient for text processing.
  • the RuBIE pattern recognition language supports common, regular expression-based functionality.
  • the results of more sophisticated linguistics processes that annotate a text with syntactic attributes are best represented using a tree-based representation.
  • XML has emerged as a popular standard for creating a representation of a text that captures its structure.
  • the FEX tool set uses XML as a basis for annotating text with numerous attributes, including linguistic structure and other linguistic attributes.
  • the relationship between two elements in the tree-based representation can be determined by following the path through the tree between the two elements. Some important relationships can easily be anticipated—finding the subject and object (or agent and patient) of some verb, for example. Because sentences can come in an infinite variety, there can be an infinite number of possible ways to specify the relationships between all possible entity pairs.
  • the RuBIE pattern recognition language exploits some of the more popular syntactic relationships common to texts.
  • XPath provides a means for traversing the tree-like hierarchy represented by XML document markup. It is possible to create predefined functions and operators for popular relationships based on XPath as part of the RuBIE pattern recognition language, both as part of the RuBIE language and through application-specific auxiliary definitions, but it is also possible to give RuBIE pattern recognition rule writers direct access to XPath so that they can create information extraction rules based on any syntactic relationship that could be represented in XML. Thus a RuBIE pattern recognition rule can combine traditional regular expression pattern recognition functionality with the ability to exploit any syntactic relationship that can be expressed using XPath.
  • the RuBIE pattern recognition language is unique in its combination of traditional regular expression pattern recognition capabilities and XPath-based tree traversal capabilities, in addition to providing matching patterns in an annotated text to support information extraction.
  • the RuBIE pattern recognition language allows users to combine attribute tests together using traditional regular expression functionality and XPath's ability to traverse XML-based tree representations. Through the addition of macro-like auxiliary definitions, the RuBIE pattern recognition language also allows users to create application-specific matching functions based on regular expressions or XPath.
  • a single RuBIE pattern recognition rule can use traditional regular expression functionality, XPath-based functionality, and auxiliary definitions in any combination.
  • the pattern recognition functionality that is deployed as part of the FEX tool set for tests, regular expression-based operators, and shift operators will now be described.
  • a test verifies that a token or constituent:
  • a RuBIE pattern recognition rule contains a single test or a combination of tests connected by RuBIE operators (a combination of regular expression and tree traversal functionality). If the test or combination of tests is successful within the logic of the operators used, then the rule has matched the text that corresponds to the tokens or constituents, and that text can be extracted or processed further in other ways.
  • Regular expression-based operators in the RuBIE pattern recognition language include the following:
  • Shift operators rely on syntactic and other hierarchical information such as that which can be gained from traversing the results of a parse tree.
  • XML is used to capture this hierarchical information, and XPath is used as a basis for the following tree traversal operators:
  • the RuBIE pattern recognition language allows users to create additional and new shift operations based on XPath in order to exploit any of a number of relationships between constituents as captured in the XML-based representation of the annotated text.
  • the RuBIE pattern recognition language also has shift operators based on relative position, including
  • Because the same attribute values may be used with different annotations (e.g., the word dog may have dog as its literal form, its capitalization normalized form and its morphological root form), and because the user may introduce new annotation types to an application, it is necessary to specify both the annotation type and annotation value in RuBIE pattern recognition rules.
  • the RuBIE pattern recognition language allows a user to test a base token for the following attributes:
  • When specifying literal values, users are able to indicate wildcard characters (.), superuniversal truncation (!), and optional characters (?).
  • a wildcard character can match any character.
  • Superuniversal truncation means that the term must match exactly anything up to the superuniversal operator, and then anything after that operator is assumed to match by default.
  • An optional character is simply a character that is not relevant to a particular test, e.g., word-final -s for some nouns.
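One way to picture these three operators is to translate a literal value into an ordinary regular expression, as in the Python sketch below. The translation rests on assumptions that the source does not confirm: that `.` stands for exactly one character, that `?` follows the character it makes optional, and that everything written after `!` is ignored. The function name is hypothetical.

```python
import re


def literal_to_regex(pattern):
    """Translate a literal value with ., ! and ? operators into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == ".":
            parts.append(".")            # wildcard: any single character
        elif ch == "!":
            parts.append(".*")           # truncation: anything may follow
            break                        # assumption: the rest of the pattern is ignored
        elif ch == "?":
            if not parts:
                raise ValueError("'?' must follow the character it makes optional")
            parts.append("?")            # assumption: previous character is optional
        else:
            parts.append(re.escape(ch))  # ordinary characters match themselves
    return re.compile("".join(parts))


plural = literal_to_regex("dogs?")
vowel = literal_to_regex("wom.n")
stem = literal_to_regex("acqui!")
```

Under these assumptions, `plural.fullmatch` accepts both "dog" and "dogs", `vowel.fullmatch` accepts "woman" and "women", and `stem.fullmatch` accepts any string that begins with "acqui".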
  • Constituent attributes are those attributes that are assigned to a pattern of one or more base tokens that represent a single constituent.
  • a proper name, a basal noun phrase, a direct object and other common linguistic attributes can consist of one or more base tokens, but RuBIE pattern recognition rules treat such a pattern as a single constituent. If for example the name
  • constituent attributes include, but are not limited to, the following: Company; Person; Organization; Place; Job Title; Citation; Monetary Amount; Basal Noun Phrase; Maximal Noun Phrase; Verb Group; Verb Phrase; Subject; Verb; Object; Employment Change Action Description Term; and Election Activity Descriptive Term.
  • the “pattern” may consist of a single base token.
  • the RuBIE pattern recognition language has the ability to recognize non-contiguous (i.e., tree-structured) constituents via XPath in addition to the true left-right sequences on which the regular expression component of the RuBIE pattern recognition language focuses.
  • the RuBIE pattern recognition language includes the following constituent attributes:
  • Regular expressions are powerful tools for identifying patterns in text when all of the necessary information is located sequentially in the text. Natural language, however, does not always cooperate. A subject and its corresponding object may be separated by a verb. A pronoun and the person it refers to may be separated by paragraphs of text. And yet it is these relationships that are often the more interesting ones from a fact extraction perspective.
  • the RuBIE pattern recognition language supports the following relationship shifts:
  • the adopted shift command takes arguments, specifically references to constituent objects. Due to the nature of language, there can often be more than one possible constituent that may fit the prose description of the shift. For example, consider the sentence John kissed Mary and dated Sue. There are two verbs here, each with one subject (John in both cases) and one object (Mary and Sue respectively). This type of complexity adds some ambiguity, e.g. deciding which verb to shift to. The ability to use indirection and compound constituent objects addresses this class of problems.
  • the RuBIE pattern recognition language therefore also includes the following capabilities:
  • the RuBIE shifts allow a RuBIE pattern recognition rule writer to shift the path of pattern recognition from one part of a text to another. For many of the shifts, however, there is a corresponding shift to return the path of pattern recognition back to where it was before the first shift occurred. A variation of the constituent attribute test could account for a number of cases where such shift-backs are likely to occur.
  • the RuBIE pattern recognition language therefore also includes the following capabilities:
  • the new hire is a person who is the patient of a hire verb.
  • the sentences targeted by the example rule are:
  • the rule first looks for a “hire” verb.
  • an auxiliary definition was created so that @hireverb will match a verb whose stem is “hire”, “name”, “appoint”, “promote”, or some similar word.
  • the rule goes up the tree to the nearest clause node and verifies that the clause contains a passive verb. If it is true that the clause contains a passive verb, the rule goes back down into the clause to find the patient of the clause verb.
  • the patient of a verb is the object affected by the action of the verb, in this case the person being hired. In a passive sentence, the patient is typically the grammatical subject of the clause.
  • the rule looks for a specific Person as opposed to some descriptive phrase. If an actual person name is found as the patient of a hire verb, the rule can then mark it up with the XML tags <NewHire> and </NewHire>.
  • the rule tests other items in the clause until it finds a constituent that has the attribute position assigned to it.
  • a dictionary of candidate job positions of interest is used to assign this attribute to the text. If a valid position is found, it can be marked up with the XML tags <Position> and </Position>.
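The traversal performed by this example rule (find a hire verb, go up to the nearest clause, check for a passive verb, come back down to the patient, then look for a position) can be approximated in Python with the standard library's ElementTree module. The XML element and attribute names (`CLAUSE`, `NP`, `V`, `root`, `voice`, `role`) and the sample sentence are hypothetical; the source does not show the markup produced by its annotators. ElementTree supports only a subset of XPath and has no parent axis, so the sketch builds its own child-to-parent map.

```python
import xml.etree.ElementTree as ET

HIRE_ROOTS = {"hire", "name", "appoint", "promote"}

SAMPLE = """
<S>
  <CLAUSE>
    <NP role="patient"><Person>Jane Doe</Person></NP>
    <VG><V root="be">was</V><V root="name" voice="passive">named</V></VG>
    <NP><Position>chief financial officer</Position></NP>
  </CLAUSE>
</S>
"""


def extract_new_hires(xml_text):
    """Return one {'NewHire', 'Position'} record per passive hire clause."""
    root = ET.fromstring(xml_text)
    parent = {child: node for node in root.iter() for child in node}
    facts = []
    for verb in root.iter("V"):
        if verb.get("root") not in HIRE_ROOTS:
            continue
        # Go up the tree to the nearest enclosing clause node.
        node = verb
        while node is not None and node.tag != "CLAUSE":
            node = parent.get(node)
        if node is None or node.find(".//V[@voice='passive']") is None:
            continue
        # Go back down to the patient and require an actual person name.
        person = node.find("./NP[@role='patient']/Person")
        if person is None:
            continue
        position = node.find(".//Position")
        facts.append({
            "NewHire": person.text,
            "Position": position.text if position is not None else None,
        })
    return facts


facts = extract_new_hires(SAMPLE)
```

A clause with no passive verb, or whose patient is a descriptive phrase instead of a Person, produces no record, which mirrors the checks the rule makes before marking up a new hire.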
  • FIG. 5 is a diagrammatic illustration of a first scenario in which a FEX product “A” creates a database with the facts extracted from a document “A” by the FEX tool set and provides an entirely new customer interface (UI) to present these facts from the database.
  • FIG. 6 is a diagrammatic illustration of a second scenario in which an FEX product “B” actually updates an original document “B” with the extracted facts metadata and leverages the existing customer interface—possibly updated—to present the facts.
  • the second scenario allows for the existing search technology to access the facts, requiring no new retrieval mechanism.
  • FIG. 7 is a diagrammatic illustration of the FEX tool set architecture.
  • the FEX tool set has a Windows NT client-server architecture, using Java® and ActiveState® Perl™.
  • Windows NT was chosen because it is a standard operating environment and because the primary annotator (a linguistic parser called EngPars) currently runs exclusively on the Windows architecture.
  • Java®, implemented with IBM Visual Age for Java®, is used primarily for the graphical user interface development environment (GUI-DE), since it is a LexisNexis®-internal standard and provides strong portability and scalability.
  • ActiveState® Perl™ is used to implement some of the text-processing tasks, since Perl™ is also portable, and since it has strong regular expression handling and general text-processing capability. It will be appreciated by those of skill in the art that other architectures that provide equivalent functionality can be used.
  • the major hardware components in the FEX tool set are the FEX client and the FEX server.
  • the client for the FEX tool set is a “thin” Java®-based Windows NT® Workstation or Windows 98/2000® system.
  • the FEX server is a Windows NT Server system, a web server that provides the main functionality of the FEX tool set. Functionality is made available to the client via a standard HTTP interface, which can include SOAP (“Simple Object Access Protocol”, an HTTP protocol that uses XML for packaging).
  • FIG. 8 is a high level diagram of the processing flow of the FEX tool set using the architecture shown in FIG. 7 .
  • the user's interface to the FEX tool set is the FEX GUI-DE on the FEX Workstation.
  • In the GUI-DE, the user opens or creates a FEX Workspace to store his or her product application work.
  • the user selects the appropriate annotators and may use available client Annotator Development Tools (not part of the FEX tool set) to troubleshoot and tune the FEX Annotators for the application.
  • the user saves the annotation settings to the Annotation Configuration in his or her workspace.
  • the user may then request Annotation Processing to run the relevant FEX Annotators (such as some lexical lookup tool or natural language parser) and Align Annotation results on the FEX Server. From these results, the user can further tune the Annotation Configuration, if necessary.
  • the FEX GUI-DE provides the user interface to the FEX tool set.
  • the user uses editing tools in the FEX GUI-DE to create and maintain Annotation Configuration information, RuBIE application files (scripts), and possibly other annotation files like dictionaries or annotator parameter information.
  • the FEX GUI-DE also allows the user to create and maintain Workspaces, in which the user stores annotation configurations, RuBIE application files, and other files for each logical work grouping.
  • the user also uses the FEX GUI-DE to start annotation and RuBIE processing on the FEX Server and to “move up” files into production space on the network.
  • the user writes a RuBIE application file in GUI-DE to define the patterns and relationships to extract from these annotations, and saves the file to the FEX Workspace.
  • the user can then compile the RuBIE application file on the FEX Server and apply it against the annotations to extract the targeted facts.
  • the user can then inspect the facts to troubleshoot and further tune the script or re-visit the annotations.
  • the primary FEX annotators preferably run on the FEX server, since annotators can be very processor- and memory-intensive. It is these annotators that are actually run by FEX when documents are processed for facts, based on parameters provided by the user. Some FEX annotators may also reside in some form independently on the FEX client.

Abstract

A fact extraction tool set (“FEX”) finds and extracts targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction. Text annotation tools break a text, such as a document, into its base tokens and annotate those tokens or patterns of tokens with orthographic, syntactic, semantic, pragmatic and other attributes. A user-defined “Annotation Configuration” controls which annotation tools are used in a given application. XML is used as the basis for representing the annotated text. A tag uncrossing tool resolves conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the individual annotators. The fact extraction tool is a pattern matching language which is used to write scripts that find and match patterns of attributes that correspond to targeted pieces of information in the text, and extract that information.

Description

  • A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
  • FIELD OF THE INVENTION
  • The invention relates to the extraction of targeted pieces of information from text using linguistic pattern matching technologies, and more particularly, the extraction of targeted pieces of information using text annotation and fact extraction.
  • BACKGROUND OF THE INVENTION
  • Definitions and abbreviations used herein are as follows:
  • Action—an instruction concerning what to do with some matched text.
  • Annotation Configuration—a file that identifies and orders the set of annotators that should be applied to some text for a specific application.
  • Annotations—attributes, or values, assigned to words or word groups that provide interesting information about the word or words. Example annotations include part-of-speech, noun phrases, morphological root, named entities (such as Corporation, Person, Organization, Place, Citation), and embedded numerics (such as Time, Date, Monetary Amount).
  • Annotator—a software process that assigns attributes to base tokens or to constituents or that creates constituents from patterns of one or more base tokens.
  • Attributes—features, values, properties or links that are assigned to individual base tokens, sequences of base tokens or related but not necessarily adjacent base tokens (i.e., patterns of base tokens). Attributes may be assigned to the tokenized text through one or more processes that apply to the tokenized text or to the raw text.
  • Auxiliary definition—in the RuBIE pattern recognition language, a statement or shorthand notation used to name and define a sub-pattern for use elsewhere.
  • Base tokens—minimal meaningful units, such as alphabetic strings (words), punctuation symbols, numbers, and so on, into which a text is divided by tokenization. Base tokens are the minimum building blocks for a text processing system.
  • Case-corrected—text in which everything is lower case except for named entities.
  • Constituent—a base token or pattern of base tokens to which an attribute has been assigned. Although constituents often consist of a single base token or a pattern of base tokens, a constituent is not necessarily comprised of contiguous base tokens. An example of a non-contiguous constituent is the two-word verb looked up in the sentence He looked the address up.
  • Constituent attributes—those attributes that are assigned to a pattern of one or more base tokens that represent a single constituent.
  • Label—an alphanumeric string that uniquely identifies a pattern recognition rule or auxiliary definition.
  • Machine learning-based pattern recognition—pattern recognition in which a statistic-based process might be given a mix of example texts that do and do not represent the targeted extraction result, and the process will attempt to identify the valid patterns that correspond to the targeted results.
  • Pattern—a description of a number of base tokens that should be recognized in some way, where the recognition of the tokens is primarily driven by targeted attributes that have been assigned to the text through annotation processes. One or more annotation value tests, zero or more recognition shifts, zero or more regular expression operators, and zero or more XPath-based (tree-based) operators may all be included in a pattern.
  • Pattern recognition language—a language used to guide a text processing system to find defined patterns of annotations. In its most common usage, a pattern recognition rule will test each constituent in some pattern for the presence or absence of one or more desired annotations (attributes). If the right combinations of annotations are found in the right order, the statement can then copy that text, add further annotations, or both, and return it to an application (that is, extract it) for further processing. Because linguistic relationships can involve constituents that are tree-structured or otherwise not necessarily sequentially ordered, a pattern recognition rule can also follow these types of relationships and not just sequentially arranged constituents.
  • Pattern recognition rule—a statement used to describe what text should be located by its pattern, and what should be done when such a pattern is found.
  • RAF—RuBIE application file.
  • RuBIE—Rule-Based Information Extraction language. The language in which the pattern recognition rules of the present invention are expressed.
  • RuBIE application file—a flat text file that contains one or more text pattern recognition rules and possibly other components of the RuBIE pattern recognition language. Typically it will contain all of the extraction rules associated with a single fact extraction application.
  • Rule-based pattern recognition—pattern recognition in which the pattern recognition rules are developed by a computational linguist or other pattern recognition specialist, usually through an iterative trial-and-error develop-evaluate process.
  • Shift—pattern recognition functionality that changes the location within a text where a pattern recognition rule is applying. Many pattern recognition languages have rules that process a text in left-to-right order. Shift functionality allows a rule to process a text in some other order, such as repositioning pattern recognition from mid-sentence to the start of a sentence, from a verb to its corresponding subject in mid-rule, or from any point to some other defined non-contiguous point.
  • Scope—the portion or sub-pattern of a pattern recognition rule that corresponds to an action. An action may act upon the text matched by the sub-pattern only if the entire pattern successfully matches some text.
  • Sub-pattern—any pattern fragment that is less than or equal to a full pattern. Sub-patterns are relevant from the perspective of auxiliary definition statements and from the perspective of scopes of actions.
  • Tests—tests apply to constituents to verify either the value of a constituent or whether a particular attribute has been assigned to that constituent.
  • Text—in the context of a document search and retrieval application such as LexisNexis®, any string of printable characters, although in general a text is usually expected to be a document or document fragment that can be searched, retrieved and presented to customers using the online system. Web pages, customer documents, and natural language queries are other examples of possible texts.
  • Token—a minimal meaningful unit, such as an alphabetic string (word), space, punctuation symbol, number, and so on.
  • Token attributes—those attributes that are assigned to individual base tokens. Examples of token attributes may include the following: (1) part of speech tags, (2) literal values, (3) morphological roots, and (4) orthographic properties (e.g., capitalized, upper case, lower case strings).
  • Tokenize—to divide a text into a sequence of tokens.
  • Prior art pattern recognition languages and tools include lex, SRA's NetOwl® technology, and Perl™. These prior art pattern recognition languages and tools primarily exploit physical or orthographic characteristics of the text, such as alphabetic versus digit, capitalized vs. lower case, or specific literal values. Some of these also allow users to annotate pieces of text with attributes based on a lexical lookup process.
  • In the mid-1980s, the Mead Data Central (now LexisNexis) Advanced Technology & Research Group created a tool called the leveled parser. The leveled parser was an example of a regular expression-based pattern recognition language that used a lexical scanner to tokenize a text—that is, break the text up into its basic components (“base tokens”), such as words, spaces, punctuation symbols, numbers, document markup, etc.—and then use a combination of dictionary lookups and parser grammars to identify and annotate individual tokens and patterns of tokens of interest, based on attributes (“annotations” or “labels”) assigned to those tokens through the scanner, parser or dictionary lookup (a base token and patterns of base tokens that share some common attribute are called “constituents”).
  • For example, the lexical scanner might break the character string
      • I saw Mr. Mark D. Benson go away.
  • into the annotated base token pattern:
    UC LCS CPS PER CPS UC PER CPS LCS LCS PER
    I saw Mr . Mark D . Benson go away .
  • (where UC=upper case letter, LCS=lower case string, CPS=capitalized string, PER=period).
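  • The scanning step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the leveled parser itself: the function names `ortho_class` and `scan`, the `OTHER` label, and the choice to drop spaces rather than emit them as tokens are all hypothetical; only the labels UC, LCS, CPS, UCS and PER come from the text.

```python
import re

# Words, digit runs, or single punctuation symbols; spaces are skipped here
# (an assumption made to mirror the token sequence printed in the example).
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def ortho_class(token):
    """Return an orthographic label for one base token."""
    if token == ".":
        return "PER"          # period
    if token.isalpha():
        if len(token) == 1 and token.isupper():
            return "UC"       # single upper case letter
        if token.islower():
            return "LCS"      # lower case string
        if token.isupper():
            return "UCS"      # upper case string
        if token[0].isupper() and token[1:].islower():
            return "CPS"      # capitalized string
    return "OTHER"            # digits, other punctuation, mixed case


def scan(text):
    """Split a text into base tokens and pair each token with its label."""
    return [(tok, ortho_class(tok)) for tok in TOKEN_RE.findall(text)]


annotated = scan("I saw Mr. Mark D. Benson go away.")
for token, label in annotated:
    print(f"{label:5} {token}")
```

  • Running the sketch on the example sentence yields the same label sequence as the annotated base token pattern shown above.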
  • A dictionary lookup may include a rule to assign the annotation TITLE to any of the following words and phrases: Mr, Mrs, Ms, Miss, Dr, Rev, President, etc. For the above example, this would result in the following annotated token sequence:
    UC LCS TITLE PER CPS UC PER CPS LCS LCS PER
    I saw Mr . Mark D . Benson go away .
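  • A dictionary lookup of this kind might be sketched as follows. The set `TITLES` and the function `apply_title_lookup` are hypothetical names, and replacing the orthographic label outright (rather than adding a second attribute) is an assumption made to match the sequence printed above.

```python
# Hypothetical title dictionary built from the words listed in the text.
TITLES = {"Mr", "Mrs", "Ms", "Miss", "Dr", "Rev", "President"}


def apply_title_lookup(annotated):
    """Replace the label with TITLE for every token found in the dictionary."""
    return [(tok, "TITLE" if tok in TITLES else label)
            for tok, label in annotated]


# Output of the lexical scanner for the example sentence.
tokens = [("I", "UC"), ("saw", "LCS"), ("Mr", "CPS"), (".", "PER"),
          ("Mark", "CPS"), ("D", "UC"), (".", "PER"), ("Benson", "CPS"),
          ("go", "LCS"), ("away", "LCS"), (".", "PER")]

with_titles = apply_title_lookup(tokens)
print(" ".join(label for _, label in with_titles))
```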
  • A parser grammar was then used to find interesting tokens and token patterns and annotate them with an indication of their function in the text. The parser grammar rules were based on regular expression notation, a widely used approach to create rules that generally work from left to right through some text or sequence of annotated tokens, testing for the specified attributes.
  • For example, a regular expression rule to recognize people names in annotated text might look like the following:
    (TITLE (PER)?)? (CPS | UCS) (UC (PER)? | CPS | UCS)?
    (CPS | UCS)
  • This rule first looks for a TITLE attribute optionally (“?”) followed by a period (PER), although the TITLE or TITLE-PERIOD is also optional. Then it looks for either a capitalized (CPS) OR upper case (UCS) string. It then looks for an upper case letter (UC) optionally followed by a period (PER), OR it looks for a capitalized string (CPS), OR it looks for an upper case string (UCS), although like the title, this portion of the rule is optional. Finally it looks for a capitalized (CPS) OR upper case (UCS) string.
  • This rule will find Mr. Mark D. Benson in the above example sentence. It will also find names like the following:
      • Mark Benson
      • Mark D Benson
      • Mark David Benson
      • Mr. Mark Benson
  • However, it will not find names like the following:
      • Mark
      • Benson
      • George H. W. Bush
      • e. e. cummings
      • Bill O'Reilly
  • Furthermore it will also incorrectly recognize a lot of other things as person names, such as Star Wars in the following sentence:
      • Mark saw Star Wars yesterday.
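  • The behavior of the rule, including its misses and its false positive, can be reproduced with a small Python sketch. The approach here is an assumption chosen for brevity: each token's annotation is encoded as one character so that Python's `re` module can apply the rule to the annotation sequence. The names `label_code`, `NAME_RULE` and `find_names` are hypothetical.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")
TITLES = {"Mr", "Mrs", "Ms", "Miss", "Dr", "Rev", "President"}


def label_code(tok):
    """Map a base token to a one-character code for its annotation."""
    if tok in TITLES:
        return "T"                      # TITLE
    if tok == ".":
        return "P"                      # PER
    if tok.isalpha():
        if len(tok) == 1 and tok.isupper():
            return "L"                  # UC, single upper case letter
        if tok.isupper():
            return "U"                  # UCS
        if tok[0].isupper() and tok[1:].islower():
            return "C"                  # CPS
    return "x"                          # anything else, including LCS


# (TITLE (PER)?)? (CPS | UCS) (UC (PER)? | CPS | UCS)? (CPS | UCS)
NAME_RULE = re.compile(r"(?:TP?)?[CU](?:LP?|[CU])?[CU]")


def find_names(text):
    """Return the text spans whose annotation sequence matches NAME_RULE."""
    tokens = list(TOKEN_RE.finditer(text))
    codes = "".join(label_code(m.group()) for m in tokens)
    names = []
    for hit in NAME_RULE.finditer(codes):
        first, last = tokens[hit.start()], tokens[hit.end() - 1]
        names.append(text[first.start():last.end()])
    return names


print(find_names("I saw Mr. Mark D. Benson go away."))
print(find_names("George H. W. Bush"))
print(find_names("Mark saw Star Wars yesterday."))
```

  • Because the rule tests only orthographic and dictionary attributes, nothing in it can distinguish Star Wars from a two-word person name.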
  • A grammar, whether a lexical scanner, leveled parser or any of the other conventional, expression-based pattern recognition languages and tools, may contain dozens, hundreds or even thousands of rules that are designed to work together for overall accuracy. Any one rule in the grammar may handle only a small fraction of the targeted patterns. Many rules typically are written to find what the user wants, although some rules in a grammar may primarily function to exclude some text patterns from other rules.
  • Regular expression-based pattern recognition works well for a number of pattern recognition problems in text. It is possible to achieve accuracy rates of 90%, 95% or higher for a number of interesting categories, such as company, people, organization and place names; addresses and address components; embedded numerics, such as times, dates, telephone numbers, weights, measures, and monetary amounts; and other tokens of interest such as case and statute citations, case names, social security numbers and other types of identification numbers, document markup, websites, e-mail addresses, and table components.
  • Regular expressions do have a problem recognizing some categories of tokens because there is little if any consistency in the structure of names in those categories, regardless of how many rules one might use. These include product names and names of books or other media, names that can be almost anything. There are also some language-specific issues that one runs into, for example: rules that recognize European language-based names in American English text often will stumble on names of Middle Eastern and Asian language origin; and rules developed to exploit capitalization patterns common in English language text may fail on languages with different capitalization patterns.
  • However, in spite of such problems, regular expression-based pattern recognition languages are widely used in a number of text processing applications across a number of languages.
  • What makes a text interesting is not that it contains just names, citations or other such special tokens, but that it also identifies the roles, functions, and attributes of those entities and their relationships with one another. These relationships are represented in text in any of a number of ways.
  • Consider the following sentences:
      • John kissed Mary.
      • Mary was kissed by John.
      • John only kissed Mary.
      • John kissed only Mary.
      • John, that devil, kissed Mary.
      • John kissed an unsuspecting Mary.
      • John snuck up behind Mary and kissed her.
      • Mary was minding her own business when John kissed her.
  • And yet for all of these sentences, the fundamental “who did what to whom” relationship is John (who) kissed (did what) Mary (to whom).
  • When trying to exploit sophisticated linguistic patterns, regular expression-based pattern recognition languages that progress from left to right through a sentence can enjoy some success even without any sophisticated linguistic annotations like agent or patient, but only for those cases where the attributes of interest are generally adjacent to one another, as in the first two example sentences above that use simple active voice or simple passive voice—and little else—to express the relationship between John and Mary.
  • But this approach to pattern recognition soon falls apart with the addition of any linguistic complexity to the sentence, such as adding a word like only or pronoun references like her.
  • A system that would attempt to find and annotate or extract who did what to whom in the above sentences would need at least two rather sophisticated linguistics processes:
  • (1) The ability to identify and exploit agent-action-patient relationships in sentences or clauses (the reader may think of these in terms of subject-verb-object relationships, but agent-action-patient is more descriptive and useful given the existence of both active and passive sentences).
  • (2) The ability to link coreferring expressions, such as her to Mary in the above sentences, and exploit those links.
  • This type of functionality is fundamentally beyond the scope of regular expression-based pattern recognition languages.
  • Orthographic attributes that are assigned to texts or text fragments are attributes whose assignment is based on attributes of the characters in the text, such as capitalization characteristics, letters versus digits, or the literal value of those characters.
  • Regular expression-based pattern recognition rules applied to the characters in a text are quite useful for tokenizing a text into its base tokens and assigning orthographic annotations to those tokens, such as capitalized string, upper case letter, punctuation symbol or space.
  • Regular expression-based pattern recognition rules applied to base tokens are quite useful for combining base tokens together into special tokens such as named entities, citations, and embedded numerics. These types of rules also assign orthographic annotations.
  • A dictionary lookup may be used to assign orthographic, semantic, and other annotations to a token or pattern of tokens. In an earlier example, a dictionary was used to assign the attribute TITLE to Mr. Some dictionary lookup processes at their heart rely on regular expression-based rules that apply to character strings, although there are other approaches to do this.
  • Semantic annotations can tell us that something is a person name or a potential title, but these types of annotations do not indicate the function of that person in a document. John may be a person name, but that does not tell us if John did the kissing or if he himself was kissed.
  • Linguists create parsers to help determine the natural language syntax of sentences, sentence fragments, and other texts. This syntax is interesting not only in its own right for the linguistic annotations it provides, but also because it provides a basis for addressing ever more linguistically sophisticated problems. Identifying clauses, their syntactic subjects, verbs, and objects, and the various types of pronouns provides a basis for determining agents, actions, and patients in those clauses and for addressing some types of coreference resolution problems, particularly those involving linking pronouns to names and other nouns.
  • One typical characteristic of parser-based text annotations is that the annotations are usually represented by a tree or some other hierarchical representation. A tree is useful for representing both simple and rather complex syntactic relationships between tokens.
  • One such tree representation for John kissed Mary is shown in FIG. 1.
  • Parse trees not only annotate a text with syntactic attributes like Noun Phrase or Verb, but through the relationships they represent, it is possible to derive additional grammatical roles as well as semantic functions. For example,
      • A Noun Phrase found immediately under a Sentence node in such a tree may be annotated as the Grammatical Subject.
      • Depending on its content and location relative to the verb, a Noun Phrase found immediately under a Verb Phrase may be annotated as the Grammatical Object.
      • If the Verb in this Sentence is an active verb, then the Grammatical Object may be annotated with Patient as its semantic function. If the Verb is passive, then the Grammatical Subject may instead be annotated as the patient.
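  • The three derivations above can be sketched over a parse tree held as nested (label, children) pairs. This is a simplified illustration: the tree encoding, the function names, and the `passive` flag supplied by the caller are assumptions, and the sketch ignores the content and position checks that the text says may apply to objects.

```python
# A parse tree as nested (label, children) pairs; a leaf holds its word.
ACTIVE = ("Sentence", [
    ("NounPhrase", [("ProperNoun", "John")]),
    ("VerbPhrase", [
        ("Verb", "kissed"),
        ("NounPhrase", [("ProperNoun", "Mary")]),
    ]),
])

PASSIVE = ("Sentence", [
    ("NounPhrase", [("ProperNoun", "Mary")]),
    ("VerbPhrase", [
        ("Aux", "was"),
        ("Verb", "kissed"),
        ("PrepPhrase", [("Prep", "by"),
                        ("NounPhrase", [("ProperNoun", "John")])]),
    ]),
])


def words(node):
    """Collect the words under a node, left to right."""
    _, children = node
    if isinstance(children, str):
        return [children]
    return [w for child in children for w in words(child)]


def direct_child(node, wanted):
    """Return the first child immediately under node with the wanted label."""
    _, children = node
    if isinstance(children, str):
        return None
    return next((c for c in children if c[0] == wanted), None)


def grammatical_roles(sentence, passive=False):
    """Derive subject, object and patient from relative tree positions."""
    subject = direct_child(sentence, "NounPhrase")
    verb_phrase = direct_child(sentence, "VerbPhrase")
    obj = direct_child(verb_phrase, "NounPhrase") if verb_phrase else None
    patient = subject if passive else obj
    found = {"GrammaticalSubject": subject,
             "GrammaticalObject": obj,
             "Patient": patient}
    return {role: " ".join(words(n)) for role, n in found.items() if n}


print(grammatical_roles(ACTIVE))
print(grammatical_roles(PASSIVE, passive=True))
```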
  • As sentences grow more complex, the process for annotating the text with these attributes also grows more complex—just as is seen with regular expression-based rule sets that target people names or other categories. But in general, many relationships between constituents of the tree can be defined by descriptions of their relative locations in the structure.
  • Through tokenization, dictionary lookups and parsing, it is possible for a part of the text to have many annotations assigned to it.
  • In the above sentence, the token Mary may be annotated with several attributes, such as the following:
      • Literal value “Mary”
      • Morphological root “Mary”
      • Quantity Singular
      • Capitalized String
      • Alphabetic String
      • Proper Noun
      • Person Name
      • Gender Female
      • Noun Phrase
      • Grammatical Object of Verb “Kiss”
      • Patient of Verb “Kiss”
      • Part of Verb Phrase “Kissed Mary”
      • Part of Sentence “John Kissed Mary”
  • The tree representation of FIG. 1 can capture all of these attributes, as shown in FIG. 2.
  • The hierarchical relationships represented by a tree can be represented through other means. One common way is to represent the hierarchy through the use of nested parentheses. A notation like X (Y) , for example, could be used to annotate whatever Y is with the structural attribute X. Using the above example, ProperNoun (John) indicates that John is a constituent under Proper Noun in the tree. Using this notation, the whole sentence would look like the following:
    Sentence ( NounPhrase( ProperNoun( John ) ),
    VerbPhrase( Verb( kissed ), NounPhrase( ProperNoun(
    Mary ) ) ) )
  • Often with this type of representation, the hierarchy can be made more apparent through the use of new lines and indentation, as the following shows:
    Sentence(
    NounPhrase(
    ProperNoun(
    John ) ),
    VerbPhrase(
    Verb(
    kissed ),
    NounPhrase(
    ProperNoun(
    Mary ) ) ) )
  • The difference is purely cosmetic; the use of labels and parentheses is identical.
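  • A short Python sketch makes the same point: both layouts are generated from one tree and differ only in white space. The tree encoding and the function names `flat` and `indented` are hypothetical.

```python
TREE = ("Sentence", [
    ("NounPhrase", [("ProperNoun", "John")]),
    ("VerbPhrase", [
        ("Verb", "kissed"),
        ("NounPhrase", [("ProperNoun", "Mary")]),
    ]),
])


def flat(node):
    """Render a (label, children) tree as nested parentheses on one line."""
    label, children = node
    if isinstance(children, str):
        return f"{label}({children})"
    return f"{label}({', '.join(flat(c) for c in children)})"


def indented(node, depth=0):
    """Render the same tree with one label per line and indentation."""
    pad = "  " * depth
    label, children = node
    if isinstance(children, str):
        return f"{pad}{label}(\n{pad}  {children})"
    body = ",\n".join(indented(c, depth + 1) for c in children)
    return f"{pad}{label}(\n{body})"


def strip_space(text):
    """Remove all white space so that only labels and parentheses remain."""
    return "".join(text.split())


print(flat(TREE))
print(indented(TREE))
print(strip_space(flat(TREE)) == strip_space(indented(TREE)))
```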
  • In computing, there are now a number of widely used approaches for annotating a text with hierarchy-based attributes. SGML, the Standard Generalized Markup Language, gained widespread usage in the early 1990s. HTML, the HyperText Markup Language, is based on SGML and is used to publish hypertext documents on the World Wide Web.
  • XML, the Extensible Markup Language, was introduced in 1998. Since then, it has gained growing acceptance in a number of text representation problems, many of which are geared towards representing the content of some text (a document) in a way that makes it easy to format, package, and present that text in any of a number of ways. XML is increasingly being used as a basis for representing text that has been annotated for linguistic processing. It has also emerged as a widely used standard for defining specific markup languages for capturing and representing document structure, although it can be used for any structured content.
  • The structure of a news article may include the headline, byline, dateline, publisher, date, lead, and body, all of which fall under a document node. A tree representation of this structure might look as shown in FIG. 3.
  • Just as XML can be used to define a news document markup, it can be used to define the type of linguistic markup shown in the John kissed Mary example above.
  • The notation for XML markup uses a label to mark the beginning and end of the annotated text. Where X (Y) is used above to represent annotating the text Y with the attribute X, XML uses the following, where <X> and </X> are XML tags that annotate text Y with X:
      • <X>Y</X>
  • The John kissed Mary example would look like:
    <Sentence><NounPhrase><ProperNoun>John</ProperNoun>
    </NounPhrase><VerbPhrase><Verb>kissed</Verb>
    <NounPhrase><ProperNoun>Mary</ProperNoun>
    </NounPhrase></VerbPhrase></Sentence>
  • or cosmetically printed as:
    <Sentence>
    <NounPhrase>
    <ProperNoun>
    John
    </ProperNoun>
    </NounPhrase>
    <VerbPhrase>
    <Verb>
    kissed
    </Verb>
    <NounPhrase>
    <ProperNoun>
    Mary
    </ProperNoun>
    </NounPhrase>
    </VerbPhrase>
    </Sentence>
  • The elements of the XML representation correspond to the nodes in the tree representation here. And just as attributes can be added to the nodes in the tree, such as the +Object, +Patient and Literal “Mary” attributes added to the tree in FIG. 2, attributes can be associated with XML elements. Attributes in XML provide additional information about the element or the contents of that element.
  • For example, it is possible to associate attributes with the Proper Noun element “Mary” found in the tree above in the following way in an XML element:
    <ProperNoun LITERAL="Mary" NUM="SINGULAR"
    ORTHO="CPS"
    TYPE="ALPHA" PERSON="True"
    GENDER="Female">Mary</ProperNoun>
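  • As an illustration, the same element can be built and read back with Python's standard `xml.etree.ElementTree` module. The variable names are hypothetical; the attribute names and values are the ones shown above.

```python
import xml.etree.ElementTree as ET

# Attribute names and values are taken from the ProperNoun example above.
MARY_ATTRIBUTES = {
    "LITERAL": "Mary",
    "NUM": "SINGULAR",
    "ORTHO": "CPS",
    "TYPE": "ALPHA",
    "PERSON": "True",
    "GENDER": "Female",
}

mary = ET.Element("ProperNoun", MARY_ATTRIBUTES)
mary.text = "Mary"

# Serialize to a string, then parse it again to show the round trip.
serialized = ET.tostring(mary, encoding="unicode")
roundtrip = ET.fromstring(serialized)

print(serialized)
print(roundtrip.get("GENDER"), roundtrip.text)
```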
  • In computational linguistics, trees are routinely used to represent both syntactic structure and attributes assigned to nodes in the tree. XML can be used to represent this same information.
  • Finding related entities/nodes in trees and identifying the relationships between them primarily rely on navigating the paths between these entities and using the information associated with the entities/nodes. For example, as discussed above, this information could be used to identify grammatical subjects, objects and the relationship (in that case the verb) between them.
  • Linguists historically have used programming languages like Lisp to create, annotate, analyze, and navigate tree representations of text. XPath is a language created to similarly navigate XML representations of texts.
  • XSL is a language for expressing stylesheets. An XML style sheet is a file that describes how to display an XML document of a given type.
  • XSL Transformations (XSLT) is a language for transforming XML documents, such as for generating an HTML web page from XML data.
  • XPath is a language used to identify particular parts of XML documents. XPath lets users write expressions that refer to elements and attributes. XPath indicates nodes in the tree by their position, relative position, type, content, and other criteria. XSLT uses XPath expressions to match and select specific elements in an XML document for output purposes or for further processing.
  • When linguistic trees are represented using XML-based markup, XPath and XPath-based functionality can serve as a basis for processing that representation much like linguists have historically used Lisp and Lisp-based functionality.
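  • For example, the John kissed Mary markup can be navigated with path expressions. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; the variable names and the single GENDER attribute added to the markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML = (
    "<Sentence>"
    "<NounPhrase><ProperNoun>John</ProperNoun></NounPhrase>"
    "<VerbPhrase><Verb>kissed</Verb>"
    '<NounPhrase><ProperNoun GENDER="Female">Mary</ProperNoun></NounPhrase>'
    "</VerbPhrase>"
    "</Sentence>"
)
root = ET.fromstring(XML)

# A noun phrase immediately under the Sentence node: the grammatical subject.
subject = root.find("./NounPhrase/ProperNoun")

# The verb, and a noun phrase immediately under the verb phrase: the object.
verb = root.find("./VerbPhrase/Verb")
obj = root.find("./VerbPhrase/NounPhrase/ProperNoun")

# Selection by attribute value anywhere in the tree.
female_names = root.findall(".//ProperNoun[@GENDER='Female']")

print(subject.text, verb.text, obj.text)
print([e.text for e in female_names])
```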
  • Most work in information extraction research with which the inventors are familiar has focused on systems where all of the component technologies were created or adapted to work together. Base token identification feeds into named entity recognition. Named entity recognition results feed into a part of speech tagger. Part of speech tagging results feed into a parser. All of these processes can make mistakes, but because each tool feeds its results into the next one and each tool generally assumes correct input, errors are often built on errors.
  • In contrast, where annotation processes come from multiple sources and are not originally designed to work together, they do not necessarily build off each other's mistakes. Instead, their mistakes can be in conflict with one another.
  • For example, a named entity recognizer that uses capitalization might incorrectly include the capitalized first word of a sentence as part of a name, whereas a part of speech tagger that relies heavily on term dictionaries may keep that first word separate. E.g.,
    Original text: Did Bill go to the store?
    Named entity: [ Person]
    Part of Speech: [AUX][ProperNoun]
  • This can be an even bigger problem if two annotators conflict in their results at both the beginning and the end of the annotated text string. For example, for the text string A B C D E, assign tag X to A B C and Y to C D E as shown in FIG. 4. In an XML representation, one possible end result is:
      • <X> A B <Y> C </X>D E </Y>
  • For example, if a sentence mentions a college and its home state, “ . . . University of Chicago, Ill. . . . ”, then overlapping annotations for Organization and City may result:
    <Organization> University of <City> Chicago,
    </Organization> Illinois </City>
  • Well-formed XML has a strict hierarchical syntax. In XML, marked sub-pieces of text are permitted to be nested within one another, but their boundaries may not cross. That is, they may not have overlapping tags. (HTML, the markup commonly used for web pages, does permit overlapping tags.) This typically is not a problem for most XML-based applications, because the text and its attributes are created through guidance from valid document type definitions (DTDs). Because it is possible to incorporate annotators that were not designed to some common DTD, annotators can produce conflicting attributes. For that reason the RuBIE annotation process needs a component that can combine independently-generated annotations into valid XML.
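  • The crossing problem, and one possible way around it, can be sketched as follows. The splitting strategy shown (cutting the later annotation at the boundary of the earlier one) is only an assumption for illustration and is not presented as the tag uncrossing method of the invention; the span format `(label, start, end)` over token positions and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TOKENS = ["A", "B", "C", "D", "E"]
CROSSED = [("X", 0, 3), ("Y", 2, 5)]    # X covers A B C, Y covers C D E


def sort_key(span):
    """Order spans by start position, longer spans first."""
    return (span[1], -span[2])


def uncross(spans):
    """Split any span that crosses an earlier span at the crossing boundary."""
    pending = sorted(spans, key=sort_key)
    result = []
    while pending:
        label, start, end = pending.pop(0)
        # Ends of earlier spans that this span crosses (overlap, no nesting).
        cuts = [o[2] for o in result if o[1] < start < o[2] < end]
        if cuts:
            cut = min(cuts)
            pending.append((label, cut, end))   # remainder, handled later
            pending.sort(key=sort_key)
            end = cut
        result.append((label, start, end))
    return result


def to_xml(tokens, spans):
    """Serialize tokens with properly nested spans as tagged text."""
    out = []
    for i in range(len(tokens) + 1):
        closing = sorted((s for s in spans if s[2] == i), key=lambda s: -s[1])
        out.extend(f"</{label}>" for label, _, _ in closing)   # inner first
        if i < len(tokens):
            opening = sorted((s for s in spans if s[1] == i),
                             key=lambda s: -s[2])
            out.extend(f"<{label}>" for label, _, _ in opening)  # outer first
            out.append(tokens[i])
    return " ".join(out)


fixed = uncross(CROSSED)
xml_text = to_xml(TOKENS, fixed)
print(xml_text)

# The uncrossed markup parses once wrapped in a single root element.
ET.fromstring(f"<Text>{xml_text}</Text>")
```

  • In this sketch the Y annotation is split into two Y elements, one nested inside X and one following it. Whether splitting, widening, or discarding an annotation is appropriate depends on the application.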
  • Further, our past experiences with prior pattern recognition tools showed a great deal of value for both the use of regular expressions and tree-traversal tools, depending on the application. Tools such as SRA NetOwl® Extractor, Inxight Thingfinder™, Perl™, and Mead Data Central's Leveled Parser all provide “linear” pattern recognition, and tools such as XSLT and XPath provide hierarchical tree-traversal. However, we did not find any pattern recognition tool that combined these, particularly in a way appropriate for XML-based document representations. The typical representations to which regular expressions usually apply do not have a tree structure, and thus are not generally conducive to tree traversal-based functionality. Whereas tree representations are natural candidates for tree traversal functionality, their structure is not generally supportive of regular expressions.
  • The Penn Tools, an information extraction research prototype developed by the University of Pennsylvania, combine strong regular expression-based pattern recognition functionality with what on the surface appeared to be some tree navigation functionality. However, in that tool, only a few interesting types of tree-based relationships were retained. These were translated into a positional, linear, non-tree representation so that their regular expression-based extraction language, Mother of Perl (“MOP”), could also apply to those relationships in its rules. The Penn Tools information extraction research prototype does not have the ability to exploit all of the available, tree-based relationships in combination with full regular expression-based pattern recognition.
  • It is to the solution of these and other problems that the present invention is directed.
  • BRIEF SUMMARY OF THE INVENTION
  • It is therefore a primary object of the present invention to provide a fact extraction tool set that can extract targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction.
  • It is another object of the present invention to provide a method for recognizing patterns in annotated text that exploits all tree-based relationships and provides full regular expression-based pattern recognition.
  • It is still another object of the present invention to provide a method that resolves conflicting, or crossed, annotation boundaries in annotations generated by independent, individual annotators to produce well-formed XML.
  • These and other objects are achieved by the provision of a fact extraction tool set (“FEX”) that extracts targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction. The tag uncrossing tool in accordance with the present invention resolves conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the individual FEX Annotators.
  • The text annotation tool in accordance with the present invention includes assigning attributes to the parts of the text. These attributes may include tokenization, orthographic, text normalization, part of speech tags, sentence boundaries, parse trees, and syntactic, semantic, and pragmatic attribute tagging and other interesting attributes of the text.
  • The fact extraction tool set in accordance with the present invention takes a text passage such as a document, sentence, query, or any other text string, breaks it into its base tokens, and annotates those tokens and patterns of tokens with a number of orthographic, syntactic, semantic, pragmatic and dictionary-based attributes. XML is used as a basis for representing the annotated text.
  • Text annotation is accomplished by individual processes called “Annotators” that are controlled by FEX according to a user-defined “Annotation Configuration.” FEX annotations are of three basic types. Expressed in terms of regular expressions, these are as follows: (1) token attributes, which have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute value to each base token; (2) constituent attributes assigned yes-no values to patterns of base tokens, where the entire pattern is considered to be a single constituent with respect to some annotation value; and (3) links, which connect coreferring constituents such as names, their variants, and pronouns. In an XML representation, token attributes tend to be represented as XML attributes on base tokens, and constituent attributes and links tend to be represented as XML elements. Shifts tend to be represented as XPath expressions that utilize token attributes, constituent attributes, and links.
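  • The three annotation types might be modeled with simple data structures such as the following. This is a hypothetical in-memory sketch, not the FEX representation (which the text says is XML-based); every class and field name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class BaseToken:
    text: str
    # Token attributes: one value per attribute type for each base token.
    attributes: dict = field(default_factory=dict)


@dataclass
class Constituent:
    label: str             # the annotation value, e.g. "Person"
    token_indexes: list    # positions of the base tokens in the pattern


@dataclass
class Link:
    members: list          # indexes of coreferring constituents


sentence = "John snuck up behind Mary and kissed her .".split()
tokens = [BaseToken(word, {"ORTHO": "CPS" if word.istitle() else "OTHER"})
          for word in sentence]

constituents = [
    Constituent("Person", [0]),     # John
    Constituent("Person", [4]),     # Mary
    Constituent("Pronoun", [7]),    # her
]

# One link connecting "Mary" with the pronoun "her".
links = [Link([1, 2])]

for link in links:
    linked = [" ".join(tokens[i].text for i in constituents[c].token_indexes)
              for c in link.members]
    print(linked)
```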
  • Within the Annotation Configuration, appropriate FEX Annotators are identified as well as any necessary parameters, input/output, dictionaries, or other relevant information. The annotation results of these FEX Annotators are stored individually.
  • The fact extraction tool set in accordance with the present invention focuses on identifying and extracting potentially interesting pieces of information in an annotated text by finding patterns in the attributes stored by the annotators. To find these patterns and extract the interesting facts, the user creates a RuBIE application file using a Rule-Based Information Extraction language (“the RuBIE pattern recognition language”) to write pattern recognition and extraction rules. This file queries for literal text, attributes, or relationships found in the annotations. It is these queries that actually define the facts to be extracted. The RuBIE application file is compiled and applied to the aligned annotations generated in the previous steps.
  • Other objects, features, and advantages of the present invention will be apparent to those skilled in the art upon a reading of this specification including the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention is better understood by reading the following Detailed Description of the Preferred Embodiments with reference to the accompanying drawing figures, in which like reference numerals refer to like elements throughout, and in which:
  • FIG. 1 is a tree representation for the phrase John kissed Mary.
  • FIG. 2 is the tree representation of FIG. 1 with further annotations of the token Mary.
  • FIG. 3 shows a tree representation of the basic structure of a news article.
  • FIG. 4 shows the assignment of tags with conflicting boundaries (or nesting) to the text string A B C D E.
  • FIG. 5 is a diagrammatic illustration of a first scenario in which a FEX product creates a database with the facts extracted by the FEX tool set and provides a customer interface to present these facts from the database.
  • FIG. 6 is a diagrammatic illustration of a second scenario in which an FEX product updates an original document with extracted facts metadata and leverages an existing customer interface to present the facts.
  • FIG. 7 is a diagrammatic illustration of the FEX tool set architecture.
  • FIG. 8 is a high level flow diagram for the processing flow of the FEX tool set using the architecture shown in FIG. 7.
  • DETAILED DESCRIPTION OF THE INVENTION
  • In describing preferred embodiments of the present invention illustrated in the drawings, specific terminology is employed for the sake of clarity. However, the invention is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents that operate in a similar manner to accomplish a similar purpose.
  • The fact extraction (“FEX”) tool set in accordance with the present invention extracts targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction.
  • The text annotation process assigns attributes to a text passage such as a document, sentence, query, or any other text string, by parsing the text passage—breaking it into its base tokens and annotating those tokens and patterns of tokens with a number of orthographic, syntactic, semantic, pragmatic, and dictionary-based attributes. These attributes may include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, semantic attribute tagging and other interesting attributes of the text.
  • Text structure is usually defined or controlled by some type of markup language. In the FEX tool set, an annotated text is represented using XML, the Extensible Markup Language. The FEX tool set includes a tag uncrossing process to resolve conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the text annotation process prior to fact extraction.
  • XML was chosen to annotate text in the FEX tool set for two key properties:
      • XML's approach to marking up a text with tags (actually tag pairs, marking the beginning and the end of some part of the text—an element of the text) allows it to represent hierarchical information because elements can be nested inside other elements, including a document element (an element that contains the entire text). Some types of linguistic text processing—natural language parsing, in particular—create a tree-hierarchy-based representation of a text and its linguistic constituents, such as Noun Phrase, Verb Phrase, Prepositional Phrase or Subordinate Clause (somewhat akin to sentence diagramming of the type taught in high school English classes).
      • It is possible to annotate any element in an XML-based representation of a document with attributes. Many types of linguistic text processing assign attributes to words and tokens, patterns of tokens, linguistic constituents, sentences, and even entire documents, each of which can be represented as an element in an XML-based representation of a document.
  • The FEX Annotation Process includes the management of annotation configuration information, the actual running of the annotators, and the alignment of the resulting annotations.
  • In general, a text is annotated by first segmenting, or tokenizing, it into a sequence of minimal, meaningful text units called base tokens, which include words, numbers, punctuation symbols, and other basic text units. General examples of token attributes with which the FEX tool set can annotate the base tokens of a text include, but are not limited to, part of speech tags, literal values, morphological roots, and orthographic properties (e.g., capitalized, upper case, lower case strings). More specifically, examples of these attributes include, but are not limited to:
      • (1) The literal value of the character string, e.g., string=“mark”;
      • (2) Capitalization-corrected literal value;
      • (3) Literal value regardless of capitalization;
      • (4) Specific part of speech, such as Noun, Verb, Article, Adjective, Adverb, Pronoun, etc.;
      • (5) Inflectional morphological root, e.g., match a noun regardless of whether it is singular or plural (root (dog) matches dog or dogs), or match a verb regardless of its form (e.g., root (go) matches go, goes, going, went, gone);
      • (6) Inflectional morphological attributes, e.g., test whether a noun is singular or plural, or test whether a verb agrees with the grammatical subject on first person, third person, etc.;
      • (7) Orthographic attributes, such as capitalized string, upper case letter, upper case string, lower case letter, lower case string, digit, digit string, punctuation symbol, white space, and other;
      • (8) Pragmatic features, such as those that can indicate whether a constituent is a direct quotation, conditional, subjunctive or hypothetical;
      • (9) Document region where the token is located, such as headline, byline, dateline, date, lead, body, table, and other;
      • (10) Positional information, including the token's relative position in a document by token count and by character count;
      • (11) The length of the token;
      • (12) Special token types such as company name, person name, organization name, city name, county name, state name, province name, country name, geographic region name, phone number, street address, zip code, monetary amount, and time;
      • (13) Syntactic constituent types such as sentence, clause, basal noun phrase, and maximal noun phrase;
      • (14) Syntactic roles such as subject, verb, verb group, direct object, negated constituent;
      • (15) Semantic roles such as agent, action, and patient;
      • (16) Noun phrase contains a number;
      • (17) Noun phrase is a person;
      • (18) Noun phrase is animate or inanimate; and
      • (19) The token simply is present.
  • After the text has been tokenized, attributes are assigned to it through one or more processes that apply to the tokenized text or to the raw text. Every base token has at least one attribute: its literal value. Most tokens will have numerous additional attributes. They may also be part of a pattern of tokens that has one or more attributes. Depending on the linguistic sophistication of a particular extraction application, a token may have a few or a few dozen attributes assigned to it, directly or through its parents in the tree structure representation (“representation” referring here to the fact that what is stored on the computer is a representation of a tree structure). A constituent is a base token or pattern of base tokens to which an attribute has been assigned.
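  • By way of illustration only, the assignment of token attributes to base tokens can be sketched as follows. The sketch is not the FEX Base Tokenizer; the function name base_tokens, the regular expression used for segmentation, and the orthographic category strings are assumptions made for this example. The attribute names (literal, word, token, tokennum, startchar, endchar, length) follow the exemplary token attribute tests of Table 7.

```python
import re

def base_tokens(text):
    """Split text into base tokens and attach a few token attributes.

    Hypothetical sketch: words and single punctuation symbols become base
    tokens; white space is skipped rather than stored.
    """
    tokens = []
    for num, m in enumerate(re.finditer(r"\w+|[^\w\s]", text), start=1):
        s = m.group(0)
        # Orthographic attribute (category names are assumptions).
        if s.isdigit():
            orth = "digit_string"
        elif s.isupper():
            orth = "upper_case"
        elif s[0].isupper():
            orth = "capitalized"
        elif s.islower():
            orth = "lower_case"
        elif not s.isalnum():
            orth = "punctuation"
        else:
            orth = "other"
        tokens.append({
            "literal": s,            # case-sensitive form
            "word": s.lower(),       # case-insensitive form
            "token": orth,           # orthographic attribute
            "tokennum": num,         # sequential number of token in text
            "startchar": m.start(),  # position of first character
            "endchar": m.end() - 1,  # position of last character
            "length": len(s),        # character length
        })
    return tokens

toks = base_tokens("John kissed Mary.")
```

  • In this sketch every base token carries its literal value plus several further attributes, which mirrors the statement above that a token always has at least one attribute and usually has more.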
  • Once one or more attributes have been assigned to the base tokens or patterns of base tokens, tests are applied to the constituents to verify either the value of a constituent or whether a particular attribute has been assigned to that constituent. If a test is successful, the pattern recognition process consumes, or moves to a point just past, the corresponding underlying base-tokens. Because the pattern recognition process can shift, or move, to different locations in a text, it is possible for a single pattern recognition rule to consume the same base tokens more than once.
  • Text annotation in accordance with the present invention is accomplished by individual processes called “Annotators.” Annotators used by the FEX tool set (“the FEX Annotators”) can include proprietary annotators (including base tokenizers or end-of-sentence recognition) as well as commercially-available products (e.g., Inxight's ThingFinder™ and LinguistX® are commercially available tools that support named entity and event recognition and classification, part of speech tagging, and other language processing functions). When developing an information extraction application, the user will determine which FEX Annotators should apply to the text. The FEX tool set allows the user to control the execution of the FEX Annotators through annotation configuration files, which are created and maintained by the user in a graphical user interface development environment (GUI-DE) provided by the FEX tool set. Within the annotation configuration files, the user lists the FEX Annotators that the user wishes to run on his or her files and any relevant parameters for each (such as dictionary names or other customizable switches, input/output, or other relevant information). The user also determines the order in which the selected FEX Annotators run, since some FEX Annotators might depend on the output of others. The annotation results of these FEX Annotators are stored individually.
  • Based on the annotation configuration, the FEX tool set runs the chosen annotators against the input documents. Generally, the first FEX annotator that runs against the document text is the Base Tokenizer, which generates “base tokens.” Other FEX Annotators may operate on these base tokens or may use the original document text when generating annotations.
  • Because the resulting annotations can take many forms and represent many different types of attributes, they must be “aligned” to the base tokens. Annotation alignment is the process of associating all annotations assigned to a particular piece of text with the base token(s) for that text.
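  • A minimal sketch of annotation alignment follows, assuming that each annotator reports its results as character-offset spans over the original text. The span format (start, end, name, value), the overlap criterion, and the function names tokenize and align are assumptions for this example and are not taken from the FEX tool set.

```python
import re

def tokenize(text):
    """Hypothetical base tokenizer: words and punctuation with offsets."""
    return [
        {"literal": m.group(0), "startchar": m.start(), "endchar": m.end() - 1}
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]

def align(tokens, annotations):
    """Associate each annotation with every base token it overlaps.

    annotations: iterable of (start, end, name, value) with inclusive
    character offsets into the original text (an assumed format).
    """
    for start, end, name, value in annotations:
        for tok in tokens:
            if tok["startchar"] <= end and tok["endchar"] >= start:
                tok.setdefault("annotations", []).append((name, value))
    return tokens

text = "John kissed Mary."
aligned = align(tokenize(text), [
    (0, 3, "Person", "yes"),     # e.g., from a named entity annotator
    (5, 10, "POS", "Verb"),      # e.g., from a part of speech tagger
    (0, 16, "Sentence", "yes"),  # e.g., from end-of-sentence recognition
])
```

  • After alignment, each base token lists every annotation whose span covers it, regardless of which annotator produced the annotation or whether that annotator worked from base tokens or from the raw text.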
  • XML requires well-formed documents to be “properly nested.” When several annotation programs apply markup to a document independently they may cross each other's nodes, resulting in improperly nested markup. Consider the following Example (a):
  • EXAMPLE (a)
  • <!DOCTYPE doc ><doc>
    <SENTENCE> ssss
    <ADJP>
    jjjj
    <CLAUSE> cccc
    <NP> nnnn
    </ADJP>
    </NP>
    </CLAUSE>
    </SENTENCE>
    </doc>
  • In Example (a), the <ADJP> node “crosses” the <CLAUSE> and <NP> nodes, both of which begin inside of the <ADJP> node, but terminate outside of it (i.e., beyond </ADJP>). Such an improperly nested document cannot be processed by standard XML processors. The method by which the FEX tool set uncrosses such documents to a properly-nested structure, as shown in the following Example (b), will now be described.
  • EXAMPLE (b)
  • <!DOCTYPE doc ><doc>
    <SENTENCE> ssss
    <ADJP spid="adjp1">
    jjjj
    </ADJP>
    <CLAUSE> <ADJP spid="adjp1"> cccc
     </ADJP>
     <NP> <ADJP spid="adjp1"> nnnn
    </ADJP>
    </NP>
    </CLAUSE>
    </SENTENCE>
    </doc>
  • Step 1: Given a crossed XML document as in Example (a), convert contiguous character-sequences of the document to a Document Object Model (DOM) array of three object-types of contiguous document markup and content: START-TAGs, END-TAGs, and OTHER. Here START-TAGs and END-TAGs are markup defined by the XML standard, for example, <doc> is a START-TAG and </doc> is its corresponding END-TAG. START-TAGs and their matching END-TAGs are also assigned a NESTING-LEVEL such that a parent-node's NESTING-LEVEL is less than (or, alternatively, greater than) its desired children's NESTING-LEVEL. All other blocks of contiguous text, whether markup, white space, textual content, or CDATA are designated OTHER. For example, in one instantiation of this invention, Example (a) would be represented as follows:
    (OTHER <!DOCTYPE doc >)
    (START <doc> nesting-level=‘1’)
    (START <SENTENCE> nesting-level=‘2’)
    (OTHER ssss )
    (START <ADJP> nesting-level=‘5’)
    (OTHER jjjj)
    (START <CLAUSE> nesting-level=‘3’)
    (OTHER cccc)
    (START <NP> nesting-level=‘4’)
    (OTHER nnnn)
    (END </ADJP> nesting-level=‘5’)
    (END </NP>)
    (END </CLAUSE> nesting-level=‘3’)
    (END </SENTENCE> nesting-level=‘2’)
    (END </doc> nesting-level=‘1’)
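  • The conversion of Step 1 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the invention: the function name to_array, the tuple layout (kind, text, nesting level), and the regular expression are hypothetical, tags are assumed to carry no attributes, and CDATA sections containing the character > are not handled.

```python
import re

# One alternative per object-type: declarations, end tags, start tags, text.
PIECE_RE = re.compile(r"<[!?][^>]*>|</[^>]+>|<[^>]+>|[^<]+")

def to_array(text, levels):
    """Convert a document string into (kind, text, nesting_level) tuples.

    levels maps an element name to its NESTING-LEVEL; in this sketch a
    parent's level is smaller than the level of its desired children.
    """
    array = []
    for piece in PIECE_RE.findall(text):
        if piece.startswith("</"):
            name = piece[2:-1].strip()
            array.append(("END", piece, levels.get(name)))
        elif piece.startswith("<") and piece[1] not in "!?":
            name = piece[1:-1].split()[0]
            array.append(("START", piece, levels.get(name)))
        else:
            # Declarations, white space, and textual content.
            array.append(("OTHER", piece, None))
    return array

levels = {"doc": 1, "SENTENCE": 2, "CLAUSE": 3, "NP": 4, "ADJP": 5}
crossed = ("<!DOCTYPE doc ><doc><SENTENCE>ssss<ADJP>jjjj<CLAUSE>cccc"
           "<NP>nnnn</ADJP></NP></CLAUSE></SENTENCE></doc>")
array = to_array(crossed, levels)
```

  • Applied to a white-space-free form of Example (a), the sketch yields fifteen objects in the same order as the listing above.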
  • Step 2: Set INDEX at the first element of the array and scan the array object-by-object by incrementing INDEX by 1 at each step.
  • Step 3: If the object at INDEX is a START-TAG, push a pointer to it onto an UNMATCHED-START-STACK (or, simply “the STACK”). Continue scanning.
  • Step 4: If the current object is an END-TAG, compare it to the START-TAG (referenced) at Top of the STACK (“TOS”).
  • Step 5: If the current END-TAG matches the START-TAG at TOS, pop the STACK. For example, the END-TAG “</doc>” matches the START-TAG “<doc>.” Continue scanning with the DOM element that follows the current END-TAG.
  • Step 6: If the current END-TAG does not match the START-TAG at TOS, then
  • Step 7: If the NESTING-LEVEL of the START-TAG at TOS is less than the NESTING-LEVEL of the END-TAG at INDEX, we are at a position like the following:
    (OTHER <!DOCTYPE doc >)
    (START <doc> nesting-level=‘1’)
    (START <SENTENCE> nesting-level=‘2’)
    (OTHER ssss )
    (START <ADJP> nesting-level=‘5’)
    (OTHER jjjj)
    (START <CLAUSE> nesting-level=‘3’)
    (OTHER cccc)
    TOS--> (START <NP> nesting-level=‘4’)
    (OTHER nnnn)
    INDEX --> (END </ADJP> nesting-level=‘5’)
    (END </NP>)
    (END </CLAUSE> nesting-level=‘3’)
    (END </SENTENCE> nesting-level=‘2’)
    (END </doc> nesting-level=‘1’)
  • This is the PRE-RECURSION position. If the END-TAG at INDEX does not have a SPID, assign it SPID=‘1’. Create a split-element start tag (SPLART-TAG) matching the END-TAG at INDEX. Insert the new SPLART-TAG above TOS. Copy the END-TAG at INDEX immediately below TOS, incrementing its SPID. Now recursively apply Step 6, with INDEX set below the old TOS and TOS popped, as in the continuing example:
    (OTHER <!DOCTYPE doc >)
    (START <doc> nesting-level=‘1’)
    (START <SENTENCE> nesting-level=‘2’)
    (OTHER ssss )
    (START <ADJP> nesting-level=‘5’)
    (OTHER jjjj)
    TOS--> (START <CLAUSE> nesting-level=‘3’)
    (OTHER cccc)
    INDEX--> (END </ADJP> nesting-level=‘5’ SPID=‘2’)
  • This process will recur until TOS and INDEX match, as in the continuing example:
    (OTHER <!DOCTYPE doc >)
    (START <doc> nesting-level=‘1’)
    (START <SENTENCE> nesting-level=‘2’)
    (OTHER ssss )
    TOS--> (START <ADJP> nesting-level=‘5’)
    (OTHER jjjj)
    INDEX--> (END </ADJP> nesting-level=‘5’ SPID=‘3’)
  • At this point the START-TAG at TOS is assigned the SPID of the (matching) END-TAG at INDEX, and the recursion unwinds to the PRE-RECURSION POSITION, as in the continuing example:
    (OTHER <!DOCTYPE doc>)
    (START <doc> nesting-level=‘1’)
    (START <SENTENCE> nesting-level=‘2’)
    (OTHER ssss )
    (START <ADJP> nesting-level=‘5’ SPID=‘3’)
    (OTHER jjjj)
    (END </ADJP> nesting-level=‘5’ SPID=‘3’)
    (START <CLAUSE> nesting-level=‘3’)
    (START <ADJP> nesting-level=‘5’ SPID=‘2’)
    (OTHER cccc)
    (END </ADJP> nesting-level=‘5’ SPID=‘2’)
    TOS--> (START <NP> nesting-level=‘4’)
    (START <ADJP> nesting-level=‘5’ SPID=‘1’)
    (OTHER nnnn)
    INDEX --> (END </ADJP> nesting-level=‘5’ SPID=‘1’)
    (END </NP>)
    (END </CLAUSE> nesting-level=‘3’)
    (END </SENTENCE> nesting-level=‘2’)
    (END </doc> nesting-level=‘1’)
  • Now INDEX is incremented and scanning of the array resumes at Step 3. Note that the SPLART-ELEMENTS added during recursion are not on the STACK.
  • Step 8: If the nesting level of the START-TAG at TOS is greater than or equal to the NESTING-LEVEL of the END-TAG at INDEX, we are at a position like the following:
    (OTHER <!DOCTYPE doc >)
    (START <doc> nesting-level=‘5’)
    (START <SENTENCE> nesting-level=‘4’)
    (OTHER ssss )
    (START <ADJP> nesting-level=‘1’)
    (OTHER jjjj)
    (START <CLAUSE> nesting-level=‘3’)
    (OTHER cccc)
    TOS--> (START <NP> nesting-level=‘2’)
    (OTHER nnnn)
    INDEX--> (END </ADJP> nesting-level=‘1’)
    (END </NP>)
    (END </CLAUSE> nesting-level=‘3’)
    (END </SENTENCE> nesting-level=‘4’)
    (END </doc> nesting-level=‘5’)
  • In this case we create a SPLART-TAG at TOS, and insert a copy after INDEX with SPID incremented, and a matching END-TAG before INDEX. We then pop the STACK, arriving at the following exemplary position.
    (OTHER <!DOCTYPE doc >)
    (START <doc> nesting-level=‘5’)
    (START <SENTENCE> nesting-level=‘4’)
    (OTHER ssss )
    (START <ADJP> nesting-level=‘1’)
    (OTHER jjjj)
    TOS--> (START <CLAUSE> nesting-level=‘3’)
    (OTHER cccc)
    (START <NP> nesting-level=‘2’ SPID=‘1’)
    (OTHER nnnn)
    (END </NP>)
    INDEX--> (END </ADJP> nesting-level=‘1’)
    (START <NP> nesting-level=‘2’ SPID=‘2’)
    (END </NP>)
    (END </CLAUSE> nesting-level=‘3’)
    (END </SENTENCE> nesting-level=‘4’)
    (END </doc> nesting-level=‘5’)
  • Once again, the NESTING-LEVEL of the START-TAG at TOS is greater than the NESTING-LEVEL of the END-TAG at INDEX, so Step 8 is repeated until the START-TAG at TOS matches the END-TAG at INDEX, at which point the method continues from Step 3.
  • Step 9: If the PRIORITY of the START-TAG at TOS is greater than the PRIORITY of the current END-TAG, set the variable INCREMENT to 1. Recursively descend the START-STACK until a START-TAG is found that matches the current END-TAG. Create a SPLART-TAG from this START-TAG, as in Step 7, and replace the START-TAG in the DOM at the index of the START-TAG at TOS with this (current) SPLART-TAG.
  • Step 10: Unwind the STACK, and at each successive TOS, insert a copy of the current END-TAG into the array before the array index of the START-TAG at TOS. Add INCREMENT to the array index of the START-TAG at TOS. If INCREMENT is equal to 1, set it to 2. Insert a copy of SPLART-TAG into the DOM after the index of the START-TAG at TOS and continue unwinding the STACK at Step 10.
  • Step 11: Resume scanning after the current END-TAG at Step 2.
  • Those skilled in the art will understand that the DOM, which in the above description is implemented as an array, may also be implemented as a string (with arrays or stacks of index pointers), as a linked-list, or other data structure without diminishing the generality of the method in accordance with the present invention. Likewise, the number, names, and types of elements represented in the DOM may also be changed without departing from the principles of the present invention. Similarly, the recursive techniques and SPID numbering conventions used in the preceding example were chosen for clarity of exposition. Those skilled in the art will understand that they can be replaced with non-recursive techniques and non-sequential reference identification without departing from the principles of the present invention. Finally it will be noted that this algorithm generates a number of “empty nodes”, for example, nodes of the general form <np spid=“xx”></np>, which contain no content. These may be left in the document, removed from the document by a post-process, or removed during the operation of the above method without departing from the principles of the method in accordance with the present invention. Those skilled in the art will understand further that the method described here in terms of XML can also be applied to any other markup language, data structure, or method in which marked segments of data must be properly-nested in order to be processed by any of the large class of processes which presume and require proper nesting.
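  • As one illustration of such a non-recursive variant, the following sketch uncrosses a document in a single pass. It is a simplification under stated assumptions and not the method as claimed: the function name uncross is hypothetical, tags are assumed to carry no attributes, every END-TAG is assumed to have an open START-TAG, all fragments of a split element share one SPID (as in Example (b)), and the choice between splitting the closing element and splitting the elements opened inside it is made once per END-TAG rather than once per stack entry.

```python
import re
import xml.etree.ElementTree as ET

PIECE_RE = re.compile(r"</[^>]+>|<[^>]+>|[^<]+")

def uncross(text, level):
    out = []     # output pieces (tags and text) in document order
    stack = []   # open elements as [name, index of start tag in out, spid]
    counts = {}  # number of split elements seen so far, per element name

    def split_tag(entry):
        """Give an open element a SPID (once) and return its split start tag."""
        name, pos, spid = entry
        if spid is None:
            counts[name] = counts.get(name, 0) + 1
            spid = entry[2] = f"{name.lower()}{counts[name]}"
            out[pos] = f'<{name} spid="{spid}">'
        return f'<{name} spid="{spid}">'

    for piece in PIECE_RE.findall(text):
        if piece.startswith("</"):
            name = piece[2:-1].strip()
            k = max(i for i, e in enumerate(stack) if e[0] == name)
            inner = stack[k + 1:]  # elements opened inside and still open
            if not inner:          # properly nested: plain pop
                out.append(piece)
                stack.pop()
            elif all(level[name] > level[e[0]] for e in inner):
                # Closing element belongs deeper (compare Step 7): close it
                # before each inner start tag and reopen it just after.
                frag = split_tag(stack[k])
                shift = 0
                for e in inner:
                    p = e[1] + shift
                    out.insert(p, f"</{name}>")
                    out.insert(p + 2, frag)
                    e[1] = p + 1
                    shift += 2
                out.append(f"</{name}>")
                del stack[k]
            else:
                # Otherwise (compare Step 8): close the inner elements,
                # close this element, then reopen the inner elements.
                reopen = [split_tag(e) for e in inner]
                out.extend(f"</{e[0]}>" for e in reversed(inner))
                out.append(f"</{name}>")
                del stack[k:]
                for e, tag in zip(inner, reopen):
                    stack.append([e[0], len(out), e[2]])
                    out.append(tag)
        elif piece.startswith("<") and piece[1] not in "!?":
            stack.append([piece[1:-1].strip(), len(out), None])
            out.append(piece)
        else:
            out.append(piece)
    return "".join(out)

crossed = ("<doc><SENTENCE>ssss<ADJP>jjjj<CLAUSE>cccc<NP>nnnn"
           "</ADJP></NP></CLAUSE></SENTENCE></doc>")
fixed = uncross(crossed, {"doc": 1, "SENTENCE": 2, "CLAUSE": 3, "NP": 4, "ADJP": 5})
ET.fromstring(fixed)  # raises ParseError if the result is not well-formed
```

  • With the nesting levels of Step 7, the sketch reproduces the structure of Example (b) for a white-space-free form of Example (a). With the nesting levels of Step 8, it instead splits the CLAUSE and NP elements and leaves empty nodes of the kind noted above.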
  • The fact extraction process in accordance with the present invention will now be described. Fact extraction focuses on identifying and extracting potentially interesting pieces of information in an annotated text by finding patterns in the attributes stored by the FEX annotators. To find these patterns and extract the interesting facts from the aligned annotations, the user creates a file in the GUI-DE using a Rule-Based Information Extraction language (“the FEX RuBIE pattern recognition language”). This file (“the RuBIE application file”) comprises a set of instructions for extracting pieces of text from some text file. The RuBIE application file can also comprise comments and blanks. The instructions are at the heart of a RuBIE-based extraction application, while the comments and blanks are useful for helping organize and present these instructions in a readable way.
  • The instructions in the RuBIE application file are represented in RuBIE in two different types of rules or statements, (1) a pattern recognition rule or statement, and (2) an auxiliary definition statement. A RuBIE pattern recognition rule is used to describe what text should be located by its pattern, and what should be done when such a pattern is found.
  • RuBIE application files are flat text files that can be created and edited using a text editor. The RuBIE pattern recognition language is not limited to the basic 26-letter Roman alphabet, but at least minimally also supports characters found in major European languages, thus enabling it to be used in a multilingual context.
  • Ideally, RuBIE application files can contain any number of rules and other components of the RuBIE pattern recognition language. They can support any number of comments and any amount of white space, within size limits of the text editor. Any limits on scale are due to text editor size restrictions or operational performance considerations.
  • A RuBIE pattern recognition rule comprises three components: (1) a pattern that describes the text of interest, perhaps in context, (2) a label that names the pattern for testing and debugging purposes; and (3) an action that indicates what should be done in response to a successful match.
  • A pattern is a regular expression-like description of a number of base tokens or other constituents that should be recognized in some way, where the recognition of the tokens is primarily driven by targeted attributes that have been assigned to the text through annotation processes. One or more annotation value tests, zero or more recognition shifts, and zero or more regular expression operators may all be included in a pattern.
  • Only one label may be assigned to a pattern. Exemplary syntax used to capture the functionality of the RuBIE pattern recognition language is set forth in Tables 1 through 14. The notation in Tables 1 through 14 is exemplary only, it being understood that other notation could be designed by those of skill in the art. In the examples used herein, a RuBIE pattern recognition rule begins with a label and ends with a semicolon (;).
    TABLE 1
    Binary Operators
    Exemplary syntax Functionality
    BOP = & logical AND
    BOP = | logical OR
    BOP = <space> is followed by a.k.a. concatenation
  • TABLE 2
    Unary Pre-operators
    Exemplary syntax Functionality
    UOPre = - complement
  • TABLE 3
    Unary Post-operators
    Exemplary syntax Functionality
    UOPost = * zero closure
    UOPost = + positive closure
    UOPost = {m,n} repetition range specification
  • TABLE 4
    Shifts
    Exemplary syntax Functionality
    Shift = govtov verb group to start of verb group
    Shift = gonton noun phrase to start of noun phrase
    Shift = govtos verb group to start of subject
    Shift = gostov subject to start of verb group
    Shift = govtoo verb group to start of object
    Shift = gootov object to start of verb group
    Shift = gosntov sentence to start of verb group
    Shift = goctov clause to start of verb group
    Shift = goleftn left n tokens
    Shift = gorightn right n tokens
    Shift = retest leftmost base token tested
    Shift = redo leftmost base token in scope just matched
    Shift = gosns start of current sentence
    Shift = gosne end of current sentence
    Shift = gonpton noun phrase to start of head noun
    Shift = corefa start of next coreferring basal noun phrase
    Shift = corefc start of preceding coreferring basal noun phrase
  • TABLE 5
    Constituent Attribute Tests
    Exemplary syntax Functionality
    T = present placeholder for when any token will do
    T = Company company name
    T = Person person name
    T = Organization organization name
    T = City city name
    T = County county name
    T = State state name
    T = Region region name
    T = Country country name
    T = Phone phone number
    T = Address street address or address fragment
    T = Zipcode zip code
    T = Case case citation
    T = Statute statute citation
    T = Money monetary amounts
    T = BasalNP basal noun phrase
    T = MaxNP maximal noun phrase
    T = Time time amount
    T = NPNum basal noun phrase that contains a number
    T = Subject subject
    T = VerbGp verb group
    T = DirObj direct object
    T = Negated negated
    T = Sentence sentence
    T = Paragraph paragraph
    T = EOF end of file, a.k.a. end of text input
    T = attribute attributes assigned using a system or user-defined phrase dictionary, where attribute is some attribute defined by the system or dictionary
  • TABLE 6
    Constituent Attribute Tests for Syntactically-related Constituents
    % T(s), where T is any T from the above list, and s is any s from a list of possible syntactically-related constituents.
  • TABLE 7
    Token Attribute Tests
    Exemplary syntax Functionality
    T(v) = literal case-sensitive form of the base token string
    T(v) = word case-insensitive form of the base token string
    T(v) = token base level token orthographic attribute
    T(v) = POS part of speech tag
    T(v) = root inflectional morphological root
    T(v) = morphfeat inflectional morphology attributes
    T(v) = morphdfeat derivational morphology attributes
    T(v) = space representation of white space following the token in the original text
    T(v) = region segment name or region in the document
    T(v) = tokennum sequential number of token in text
    T(v) = startchar position of token's first character
    T(v) = endchar position of token's last character
    T(v) = length character length of token
    T(v) = DocID document identification number that includes the current base token
    T(v) = attribute attributes assigned using a system or user-defined word dictionary
  • TABLE 8
    Token attribute Test Values
    Exemplary syntax Functionality
    v = value specific character string or numeric value
    v = >value greater than the specified character string or numeric value
    v = >=value greater than or equal to the specified character string or numeric value
    v = <=value less than or equal to the specified character string or numeric value
    v = <value less than the specified character string or numeric value
    v = value..value greater than or equal to the first character string or numeric value, and less than or equal to the second character string or numeric value, where the second value is greater than or equal to the first value
    v = -value not the character string or numeric value
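  • The exemplary test values of Table 8 can be read as small predicates on an attribute value. The following sketch shows one possible reading; the function name make_test is hypothetical, numeric-looking strings are compared as numbers and all others as character strings, and a leading hyphen is always treated as negation (so negative numeric literals are not supported in this sketch).

```python
import operator

def make_test(spec):
    """Turn a Table 8 style value specification into a predicate."""
    def coerce(v):
        # Compare as a number when possible, otherwise as a string.
        try:
            return float(v)
        except (TypeError, ValueError):
            return v

    if ".." in spec:  # value..value (inclusive range)
        lo, hi = (coerce(s) for s in spec.split("..", 1))
        return lambda x: lo <= coerce(x) <= hi

    # Two-character operators are checked before one-character ones.
    prefixes = ((">=", operator.ge), ("<=", operator.le),
                (">", operator.gt), ("<", operator.lt),
                ("-", operator.ne))
    for prefix, compare in prefixes:
        if spec.startswith(prefix):
            ref = coerce(spec[len(prefix):])
            return lambda x, compare=compare, ref=ref: compare(coerce(x), ref)

    ref = coerce(spec)  # plain value: equality
    return lambda x: coerce(x) == ref

is_short = make_test("<=3")      # e.g., for length(<=3)
in_range = make_test("10..20")   # e.g., for tokennum(10..20)
not_mark = make_test("-mark")    # e.g., for word(-mark)
```

  • A predicate built this way could be applied to a token attribute such as length or tokennum; mixing a numeric specification with a non-numeric attribute value is outside the scope of this sketch.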
  • TABLE 9
    Auxiliary Definition Patterns
    Exemplary syntax Functionality
    <AUXPAT> = T constituent attribute test
    <AUXPAT> = T(v) token attribute test, single value
    <AUXPAT> = T(v,...,v) token attribute test, multiple values
    <AUXPAT> = ( <AUXPAT> ) grouping a sub-pattern for an operator
    <AUXPAT> = <AUXPAT> BOP <AUXPAT> using binary operators in patterns
    <AUXPAT> = UOPre <AUXPAT> using unary pre-operators in patterns
    <AUXPAT> = <AUXPAT> UOPost using unary post-operators in patterns
  • TABLE 10
    Labels
    Exemplary syntax Functionality
    LABEL unique_alphanumeric_string
  • TABLE 11
    Auxiliary Definitions
    Exemplary syntax Functionality
    <AUXDEF> = <LABEL>: <AUXPAT> define an auxiliary pattern and give it a unique label
  • TABLE 12
    Basic Patterns for Pattern recognition rules
    Exemplary syntax Functionality
    P = T constituent attribute test
    P = % T(s) constituent attribute test on a specified, syntactically-related constituent
    P = T(v) token attribute test, single value
    P = T(v,...,v) token attribute test, multiple values
    P = <LABEL> using an auxiliary definition
    P = ( P ) grouping a sub-pattern for an operator
    P = P BOP P using binary operators in patterns
    P = UOPre P using unary pre-operators in patterns
    P = P UOPost using unary post-operators in patterns
    P = P shift+ P using shifts in patterns
  • TABLE 13
    Actions
    Exemplary syntax Functionality
    A = #match(operands) extract the text and return name and attributes
    A = #matchS(operands) extract sentence that contains the text and return name and attributes
    A = #matchP(operands) extract paragraph that contains the text and return name and attributes
    A = #matchn(operands) extract the text and surrounding n tokens on both sides
    A = #return(message, ...) return the message(s) when text is matched
    A = #return(format, message, ...) return the message(s) when text is matched, formatted as indicated
    A = #block hide the matched text from all other pattern recognition rules
    A = #blockS hide the sentences that include the matched text from all other pattern recognition rules
    A = #blockP hide the paragraphs that include the matched text from all other pattern recognition rules
    A = #blockD hide the documents that include the matched text from all other pattern recognition rules
    A = #musthaveS only process sentences matched by the corresponding pattern or scope
    A = #musthaveP only process paragraphs matched by the corresponding pattern or scope
    A = #musthaveD only process documents matched by the corresponding pattern or scope
  • TABLE 14
    Pattern recognition rules with Labels
    (L), Patterns (P) and Actions (A)
    Exemplary syntax Functionality
    #LABEL: P A+ the scope of the actions is the entire pattern
    #LABEL: [ P ] A+ the scope of the actions is the explicitly delineated entire pattern
    #LABEL: ( P* [ P ] A+ )+ P* actions are associated with one or more explicitly delineated sub-patterns
  • One or more actions may be associated with a pattern or with one or more specified sub-patterns in the pattern. Generally, a sub-pattern is any pattern fragment that is less than or equal to a full pattern. An auxiliary definition statement is used to name and define a sub-pattern for use elsewhere. This named sub-pattern may then be used in any number of recognition statements located in the same RuBIE application file. Auxiliary definitions provide convenient shorthand for sub-patterns that may be used in several different patterns.
  • Although a single pattern may match several base tokens, whether sequential or otherwise related, the user may only be interested in one or more subsets of the matched tokens. The larger pattern provides context for the smaller pattern of interest. For example, in a text
    aaa bbb ccc

    the user may want to match bbb every time that it appears, only when it follows aaa, only when it precedes ccc, or only when it follows aaa and precedes ccc.
  • When the entire pattern can be matched regardless of context, the full pattern—as specified—is used to match a specific piece of text of interest. However, when only part of the pattern is of interest for recognition purposes and the rest is only provided for context, then it must be possible to mark off the interesting sub-pattern.
  • In the following example, square brackets ([ ]) are used to do this. Thus, a pattern that tries to find bbb only when it follows aaa might look something like
    aaa [ bbb ]
  • Recognition shifts can significantly impact the text that actually corresponds to a bracketed sub-pattern. Because shifts do not actually recognize tokens, shifts at the start or end of a bracketed sub-pattern do not alter the tokens that are included in the bracket. In other words,
    aaa govtos [ bbb ]
    and
    aaa [ govtos bbb ]

    would perform the same way, identifying bbb.
  • However, because a piece of extracted text must be a sequence of adjacent base tokens, a shift can result in non-specified text being matched. For example, given a piece of text
    aaa bbb ccc ddd eee fff ggg
  • the pattern
    aaa [ bbb goright2 eee ] fff
  • will match
    bbb ccc ddd eee
  • whereas the pattern
    aaa [ bbb ] goright2 [ eee ] fff

    will match bbb and eee separately.
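  • The bracket and shift behavior described above can be sketched in Python. This is an illustrative sketch only, not the RuBIE implementation: the list-based pattern encoding, the function names, and the reading of goright2 as "skip exactly two tokens" (inferred from the example) are all assumptions.

```python
def match_at(tokens, pattern, start):
    """Match `pattern` at token index `start`.

    Pattern items: a string (literal token test), "[" / "]" (sub-pattern
    markers), or ("goright", n), a hypothetical encoding of a shift that
    skips n tokens. Returns the text of each bracketed span, or None.
    """
    pos = start
    spans = []
    in_bracket, first, last = False, None, None
    for item in pattern:
        if item == "[":
            in_bracket, first, last = True, None, None
        elif item == "]":
            if first is not None:
                spans.append((first, last))
            in_bracket = False
        elif isinstance(item, tuple):
            pos += item[1]  # a shift moves the position but recognizes no token
        else:
            if pos >= len(tokens) or tokens[pos] != item:
                return None
            if in_bracket:
                first = pos if first is None else first
                last = pos
            pos += 1
    # Extracted text is the run of adjacent tokens from first to last recognized.
    return [" ".join(tokens[a:b + 1]) for a, b in spans]


def find(tokens, pattern):
    """Return the bracketed spans of the leftmost match, or None."""
    for start in range(len(tokens)):
        result = match_at(tokens, pattern, start)
        if result is not None:
            return result
    return None


tokens = "aaa bbb ccc ddd eee fff ggg".split()
one_span = find(tokens, ["aaa", "[", "bbb", ("goright", 2), "eee", "]", "fff"])
two_spans = find(tokens, ["aaa", "[", "bbb", "]", ("goright", 2), "[", "eee", "]", "fff"])
print(one_span)   # ['bbb ccc ddd eee']
print(two_spans)  # ['bbb', 'eee']
```

    Because only recognized tokens set the span boundaries, a shift placed just inside or just outside a bracket yields the same extracted text in this sketch.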
  • A pattern must adhere to the following requirements:
      • (1) It is possible for a pattern to have one or more annotation tests.
      • (2) It is possible for a pattern to have zero or more shifts.
      • (3) A shift is preceded and followed by annotation tests.
      • (4) It is possible for a pattern to have sequences of two or more annotation tests.
      • (5) It is possible to identify a sequence of annotation tests and shifts that are to be treated as one item with respect to some operator. In many regular expression languages, parentheses are used to group such sequences.
      • (6) It is possible to test a constituent for any one of a list of attribute values for an individual annotation test. A list of values can be used for some tests, e.g., test (value, . . . , value), but this format is not a requirement.
      • (7) It is possible to test a constituent for any of one or more different annotation tests. In the examples used herein, a vertical bar is used, e.g., test(value) | test(value), but this format is not a requirement.
      • (8) It is possible to test a constituent for all of one or more different annotations. In the examples used herein, the ampersand is used, e.g., test (value) & test (value), but this format is not a requirement. (The annotation tests will typically represent different annotations since an annotation attribute generally has a single value for a given token.)
      • (9) It is possible to complement a test. In the examples used herein, the hyphen is used, e.g., −test (value) and test (−value), but this format is not a requirement.
      • (10) It is possible to complement a set of tests. In the examples used herein, the hyphen is used, e.g., −(test(value)|test(value)), but this format is not a requirement.
      • (11) It is possible to specify that a pattern or part of a pattern be repeated zero or more times (zero closure). In the examples herein, the asterisk is used, e.g., test (value)*, but this notation is not a requirement.
      • (12) It is possible to specify that a pattern or part of a pattern be repeated one or more times (positive closure). In the examples herein, the plus sign is used, e.g., test(value)+, but this notation is not a requirement.
      • (13) It is possible to specify that a pattern or part of a pattern be optional. In the examples herein, the question mark is used, e.g., test (value)?, but this notation is not a requirement.
      • (14) It is possible to specify that a pattern or part of a pattern be repeated at least m times and no more than n times, where m must be greater than or equal to zero, n must be greater than or equal to one, and n must be greater than or equal to m. In the examples herein, braces are used, e.g., test (value) {m,n}, but this notation is not a requirement.
      • (15) In general, patterns must process text from left to right, except when shifts direct the pattern recognition process to other parts of the text.
      • (16) It is possible to use an entire pattern to define a piece of text to extract. In this case, the entire pattern represents the scope of some action or actions.
      • (17) It is possible to define one or more sub-patterns that define pieces of text to extract. In this case, each sub-pattern represents the scope of some corresponding action or actions. Square brackets are used herein to mark sub-patterns, but this notation is not a requirement.
      • (18) If a sub-pattern is specified, no action can have the entire pattern in its scope.
      • (19) No token can be included in the scope of more than one action unless the actions share the same scope (no overlapping or nesting of sub-patterns is allowed).
      • (20) An action must follow the patterns relevant to that action.
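  • The repetition requirements (11) through (14) can be illustrated with a small backtracking matcher. This is a sketch under stated assumptions: the (test, (m, n)) item encoding, the helper names, and the greedy-with-backtracking strategy are hypothetical choices, and grouping, alternation, and shifts are deliberately left out.

```python
def bounds(m, n):
    """Validate a {m,n} repetition range per requirement (14)."""
    if m < 0 or n < 1 or n < m:
        raise ValueError(f"invalid repetition range {{{m},{n}}}")
    return m, n


STAR = (0, None)    # zero closure, test*
PLUS = (1, None)    # positive closure, test+
OPTIONAL = (0, 1)   # test?


def match_seq(tokens, items, pos=0):
    """Return the end index of a match of `items` starting at `pos`, or None.

    Each item is (test, (m, n)): `test` must hold for between m and n
    consecutive tokens; n=None means no upper limit.
    """
    if not items:
        return pos
    test, (m, n) = items[0]
    # Count how many consecutive tokens satisfy the test, capped at n.
    count = 0
    while (pos + count < len(tokens)
           and (n is None or count < n)
           and test(tokens[pos + count])):
        count += 1
    # Try the longest repetition first, then back off down to m.
    for k in range(count, m - 1, -1):
        end = match_seq(tokens, items[1:], pos + k)
        if end is not None:
            return end
    return None


def is_title(token):
    return token.lower() in {"mr", "ms", "mrs"}


def is_period(token):
    return token == "."


def is_cap(token):
    return token[:1].isupper()


# A flattened approximation of: title? period? capitalized{1,4}
person = [(is_title, OPTIONAL), (is_period, OPTIONAL), (is_cap, bounds(1, 4))]
print(match_seq(["Mr", ".", "John", "Smith", "resigned"], person))  # 4
```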
  • A label is an alphanumeric string that uniquely identifies a pattern recognition rule or auxiliary definition. As the name of a RuBIE pattern recognition rule, a label supports debugging, because the name can be passed to the calling program when the corresponding pattern matches some piece of text. As the name of an auxiliary definition, a label can be used in a pattern to represent a sub-pattern that has been defined. Auxiliary definitions are a convenience for when the same sub-pattern is used repeatedly in one or more patterns.
  • A label that is associated with some pattern may look something like this:
    <person>:  ( word(“mr”, “ms”, “mrs”) literal(“.”)? )? token(capstring, capinitial){1,4}
    #officer:  <person> literal(“,”)? jobtitle

    The auxiliary definition <person> begins with an optional title: a title word optionally followed by a period. The title is followed by one to four capitalized words or strings. The #officer pattern recognition rule uses the <person> label to represent the definition of a person, followed by an optional comma and then by a job title, to identify and extract a reference to a corporate officer. Note the distinction between this sample auxiliary definition “<person>” and the “Person” constituent attribute test found in Table 5.
  • The requirements for labels are as follows:
      • (1) Each pattern recognition rule must have a unique label that distinguishes it from all other pattern recognition rules. In the examples herein, #alphanumeric: is used as a pattern label, but this format is not a requirement.
      • (2) Each auxiliary definition must have a unique label that distinguishes the auxiliary definition from all other auxiliary definitions. In the examples used herein, <alphanumeric>: is used as an auxiliary definition label, but this format is not a requirement.
      • (3) It is possible to use an auxiliary definition label, excluding the colon, in a pattern recognition rule to represent a sub-pattern as defined by the auxiliary definition pattern.
  • An action is an instruction to the RuBIE pattern recognition language concerning what to do with some matched text. Typically the user will want RuBIE to return the matched piece of text and some attributes of that text so that the calling application can process it further. However, the user may want to return other information or context in some cases. A selection of actions gives the user increased flexibility in what the user does when text is matched.
  • Each action has a scope: the pattern or clearly delineated sub-pattern that, when matched to some piece of text, determines the text to which the action applies. Each pattern recognition rule must have at least one action (otherwise, there would be no reason for having the statement in the first place). A statement may have more than one action associated with it, each with a sub-pattern that defines its scope. More than one action may share the same scope; that is, the successful recognition of some piece of text may result in executing more than one action. For any given RuBIE pattern recognition rule, individual parts of the rule may successfully match attributes assigned to some text. However, actions are triggered only when the entire rule is successful, even if the scope of the action is limited to a subset of the rule. If the entire pattern recognition rule successfully matches some pattern of text attributes, all associated actions are triggered; if any part of the rule fails, none of its associated actions is triggered.
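  • The all-or-nothing triggering of actions can be sketched as follows. The representation of a rule as a list of (test, action) steps applied to consecutive tokens is an assumption made for illustration; it is not taken from the RuBIE implementation.

```python
def apply_rule(tokens, steps):
    """Apply a rule to the start of `tokens`.

    `steps` is a list of (test, action) pairs, one per token; `action` may
    be None for context-only steps. Actions are deferred and run only if
    every test succeeds.
    """
    if len(tokens) < len(steps):
        return []
    pending = []
    for token, (test, action) in zip(tokens, steps):
        if not test(token):
            return []  # any failing part discards all pending actions
        if action is not None:
            pending.append((action, token))
    return [action(token) for action, token in pending]


steps = [
    (lambda t: t == "aaa", None),                     # context only
    (lambda t: t == "bbb", lambda t: ("match", t)),   # scoped action
    (lambda t: t == "ccc", None),                     # context only
]
print(apply_rule(["aaa", "bbb", "ccc"], steps))  # [('match', 'bbb')]
print(apply_rule(["aaa", "bbb", "xxx"], steps))  # []
```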
  • The requirements for actions are as follows:
      • (1) The entire pattern must match before any actions associated with scopes in that pattern are triggered.
      • (2) There is a #match action that labels and returns a piece of text matched by some pattern or sub-pattern that falls within the scope of the #match action. The specific #match notation is used herein for example purposes, but it is not a requirement.
      • (3) A pair of markers around some sub-pattern that has a corresponding #match action defines the scope of that #match action. Square brackets are used herein for example purposes, but this notation is not a requirement.
      • (4) If no sub-pattern within a pattern is delimited by a pair of markers, the entire pattern is within the scope of the corresponding #match action.
      • (5) A #match action must take one or more operands.
      • (6) A #match action must specify a category name or some user-defined text to describe the matched text.
      • (7) A #match action must specify a list of zero or more annotation types. For each matched pattern, corresponding annotation values are returned for each annotation tag in the specified list.
        • Suppose for example that #match (operand, . . . , operand) is used to represent a #match action. Suppose also that POS is the part of speech annotation tag and root is the morphological root annotation tag. Then something like
          • #match(victim,POS,root)
        • could return the extracted text with the label victim, followed by the part of speech for each base token in the extracted text, followed by the morphological root for each base token in the extracted text.
      • (8) The #matchS action is the same as the #match action, except that #matchS returns text associated with the smallest sequence of one or more sentences that include the matched text (“context sentences”).
      • (9) The #matchP action is the same as the #match action, except that #matchP returns the text associated with the smallest sequence of one or more paragraphs that include the matched text (“context paragraphs”).
      • (10) The #matchn action, where n is an unsigned positive integer, is the same as the #match action, except that #matchn returns the matched text and the text associated with up to n base tokens on either side of the matched text if those tokens are available (“context tokens”).
      • (11) The #return action provides a means for the application to return one or more messages when a successful match occurs, where the operands are the messages to return.
        • Suppose for example that #return (message) is used to represent a #return action. It would then be possible to provide the calling program with a return code or “normalized” text, as in the pattern
          • [literal (“schnauzer”, “poodle”)] #return(“Found a dog.”)
        • returning the text “Found a dog.” every time it finds the literal values schnauzer or poodle.
      • (12) The scope of a #return action is defined the same way as the scope of a #match action.
      • (13) The #return action allows formatted strings to be returned. For example, if the first operand is a “format string”, i.e., contains placeholders for string substitution, then the following operands must be text strings for substitution into the format string at the respective placeholder locations.
      • (14) The #return action allows an explicit destination file.
        • For example, if the placeholders in the format string look like “%s” and $1 and $2 represent the text matched by the first and second bracketed sub-patterns respectively, a #return (format, message, . . . ) action might look like:
          • [Person] root(“sue”) [Person] #return(“%s sued %s”, $1, $2)
        • returning some text like “John Smith sued Bob Johnson”.
      • (15) The #block action provides the application with a means to “hide” a piece of text from other pattern recognition rules. Text that is matched within the scope of a #block action may not be matched by any other pattern recognition rule. It is noted that this creates an exception to the general principle that all recognition statements can apply to all the text (“block text”).
      • (16) The scope of a #block action is defined the same way as the scope of a #match action.
      • (17) The #blockS action provides the application with a means to “hide” a piece of text from other pattern recognition rules. Sentences that include text that is matched within the scope of a #blockS action may not be matched by any other pattern recognition rule. It is noted that this creates an exception to the general principle that all recognition statements can apply to all the text (“block sentences”).
      • (18) The scope of a #blockS action is defined the same way as the scope of a #match action.
      • (19) The #blockP action provides the application with a means to “hide” a piece of text from other pattern recognition rules. Paragraphs that include text that is matched within the scope of a #blockP action may not be matched by any other pattern recognition rule. It is noted that this creates an exception to the general principle that all recognition statements can apply to all the text (“block paragraphs”).
      • (20) The scope of a #blockP action is defined the same way as the scope of a #match action.
      • (21) The #blockD action provides the application with a means to “hide” a piece of text from other pattern recognition rules. A document that contains text that is matched within the scope of a #blockD action may not be matched by any other pattern recognition rule. This effectively hides the entire document from all other pattern recognition rules in a RuBIE application file. It is noted that this creates an exception to the general principle that all recognition statements can apply to all the text (“block document”).
      • (22) The scope of a #blockD action is defined the same way as the scope of a #match action.
      • (23) The #musthaveS action provides the application with a means to “hide” a piece of text from other pattern recognition rules. Only those sentences within a document that contain text that is matched within the scope of a #musthaveS action may be processed by other pattern recognition rules. It is noted that this creates an exception to the general principle that all recognition statements can apply to all the text (“sentence must have”).
      • (24) The scope of a #musthaveS action is defined the same way as the scope of a #match action.
      • (25) The #musthaveP action provides the application with a means to “hide” a piece of text from other pattern recognition rules. Only those paragraphs within a document that contain text that is matched within the scope of a #musthaveP action may be processed by other pattern recognition rules. It is noted that this creates an exception to the general principle that all recognition statements can apply to all the text (“paragraph must have”).
      • (26) The scope of a #musthaveP action is defined the same way as the scope of a #match action.
      • (27) The #musthaveD action provides the application with a means to “hide” a piece of text from other pattern recognition rules. Only those documents that contain text that is matched within the scope of a #musthaveD action may be processed by other pattern recognition rules. If an application file includes a #musthaveD action that has not been triggered for some document, no other patterns in the RuBIE application file may find anything in that document. It is noted that this creates an exception to the general principle that all recognition rules can apply to all the text (“document must have”).
      • (28) The scope of a #musthaveD action is defined the same way as the scope of a #match action.
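  • The behavior of the #match, #matchn and #return actions listed above can be approximated in Python. This is a hedged sketch: the dictionary-based token representation, the function names, the half-open span convention, and the POS and root values shown are all hypothetical and serve only to illustrate what each action returns.

```python
# Hypothetical annotated tokens; the POS and root values are illustrative.
SENTENCE = [
    {"text": "The", "POS": "DT", "root": "the"},
    {"text": "gunman", "POS": "NN", "root": "gunman"},
    {"text": "shot", "POS": "VBD", "root": "shoot"},
    {"text": "a", "POS": "DT", "root": "a"},
    {"text": "clerk", "POS": "NN", "root": "clerk"},
]


def match_action(tokens, start, end, label, annotations=()):
    """Sketch of #match: label, text, and one value list per annotation tag
    for the matched tokens tokens[start:end]."""
    span = tokens[start:end]
    result = {"label": label, "text": " ".join(t["text"] for t in span)}
    for name in annotations:
        result[name] = [t.get(name) for t in span]
    return result


def match_n_action(tokens, start, end, label, n, annotations=()):
    """Sketch of #matchn: widen the span by up to n tokens on either side,
    clipped to the tokens that are actually available."""
    lo = max(0, start - n)
    hi = min(len(tokens), end + n)
    return match_action(tokens, lo, hi, label, annotations)


def return_action(fmt, *operands):
    """Sketch of #return: a plain message, or a format string whose %s
    placeholders are filled from the remaining operands."""
    return fmt % operands if operands else fmt


print(match_action(SENTENCE, 3, 5, "victim", ("POS", "root")))
# {'label': 'victim', 'text': 'a clerk', 'POS': ['DT', 'NN'], 'root': ['a', 'clerk']}
print(return_action("%s sued %s", "John Smith", "Bob Johnson"))
# John Smith sued Bob Johnson
```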
  • An auxiliary definition provides a shorthand notation for writing and maintaining a sub-pattern that will be used multiple times in the pattern recognition rules. It is somewhat analogous to macros in some programming languages.
  • Auxiliary definitions are a convenience for when the same sub-pattern is used repeatedly in one or more pattern recognition rules. Repeating an example that was used earlier, note how the auxiliary definition label <person> is used in the pattern recognition rule labeled #officer:
    <person>:  ( word(“mr”, “ms”, “mrs”) literal(“.”)? )? token(capstring, capinitial){1,4}
    #officer:  <person> literal(“,”)? jobtitle
  • The auxiliary definition label may be used repeatedly in one or more pattern recognition rules.
  • The requirements for auxiliary definition statements are as follows:
      • (1) An auxiliary definition must consist of a uniquely labeled pattern.
      • (2) Shifts are not permitted in auxiliary definitions.
      • (3) Actions are not permitted in auxiliary definitions.
      • (4) Because actions are not permitted in auxiliary definitions, sub-pattern scopes cannot be delineated in auxiliary definitions.
      • (5) Except for the exclusions noted in the above requirements, an auxiliary pattern must have all of the same characteristics of a pattern in a pattern recognition rule.
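  • The macro-like role of auxiliary definitions can be sketched as textual substitution. The code below is an assumption-laden illustration, not the RuBIE compiler: it expands a single level of <label> references, and its check for shifts, actions and scope markers is purely textual (it knows only the shift names govtos and gorightN that appear in the examples, and it would misfire on literal values that contain "#" or square brackets).

```python
import re

# Textual check for what requirements (2)-(4) exclude from auxiliary
# definitions: actions (#name), sub-pattern brackets, and the two shift
# names used in the examples (govtos, gorightN).
FORBIDDEN = re.compile(r"#\w+|[\[\]]|\bgo(?:vtos|right\d+)\b")


def expand(pattern, definitions):
    """Replace each <label> in `pattern` with its parenthesized definition."""
    for label, body in definitions.items():
        if FORBIDDEN.search(body):
            raise ValueError(f"<{label}> contains a shift, action, or scope marker")

    def substitute(match):
        label = match.group(1)
        if label not in definitions:
            raise KeyError(f"undefined auxiliary definition <{label}>")
        return "(" + definitions[label] + ")"

    return re.sub(r"<(\w+)>", substitute, pattern)


definitions = {
    "person": '(word("mr","ms","mrs") literal(".")?)? token(capstring,capinitial){1,4}',
}
officer = expand('<person> literal(",")? jobtitle', definitions)
print(officer)
```

    Using a dictionary keyed by label also enforces, as a side effect, that each auxiliary definition label is unique.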
  • In a preferred embodiment, application-specific dictionaries in the RuBIE pattern recognition language can be separate annotators. Alternatively, lexical entries can be provided in the same file in which pattern recognition rules are defined. In this alternative embodiment, the RuBIE application file has syntax for defining lexical entries within the file. One advantage of this alternative embodiment is that there is a clear relationship between the dictionaries and the applications that use them. Also, application-specific development work is focused more closely on RuBIE application files. However, large word and phrase lists can make RuBIE application files difficult to read. Also, the alternative embodiment does not promote the idea of shared or common dictionaries.
  • In general, a free order among the patterns and auxiliary definitions may be assumed. All patterns generally apply simultaneously. However, there are two general recognition order requirements, as follows:
      • (1) For #block actions to have any meaning, recognition statements with #block actions must apply before other recognition statements.
      • (2) If two recognition statements match overlapping text, both recognition statements must apply, except when recognition is prohibited for other reasons, such as #block, #musthave, and so on.
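  • The two ordering requirements can be sketched as a two-pass process: rules with #block actions run first, and all remaining rules then apply to whatever text is left, including overlapping text. The (name, action, finder) rule encoding and the phrase helper below are hypothetical, and only token-level #block is modeled.

```python
def phrase(*words):
    """Build a finder that returns (start, end) spans where `words` occur."""
    def finder(tokens):
        n = len(words)
        return [(i, i + n) for i in range(len(tokens) - n + 1)
                if tuple(tokens[i:i + n]) == words]
    return finder


def run_rules(tokens, rules):
    """rules: list of (name, action, finder), where action is "block" or "match"."""
    blocked = set()
    results = []
    # Pass 1: block rules apply before everything else.
    for name, action, finder in rules:
        if action == "block":
            for start, end in finder(tokens):
                blocked.update(range(start, end))
    # Pass 2: every other rule applies, even where matches overlap,
    # unless the match touches blocked tokens.
    for name, action, finder in rules:
        if action != "block":
            for start, end in finder(tokens):
                if blocked.isdisjoint(range(start, end)):
                    results.append((name, start, end))
    return results


tokens = "New York Times reported New York flooding".split()
rules = [
    ("city", "match", phrase("New", "York")),
    ("paper", "block", phrase("New", "York", "Times")),
    ("york", "match", phrase("York")),
]
print(run_rules(tokens, rules))  # [('city', 4, 6), ('york', 5, 6)]
```

    The first "New York" is hidden by the block rule even though that rule is listed second, while the overlapping city and york matches on the second occurrence are both reported.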
  • As noted above, RuBIE-based application files may vary from a few pattern recognition rules to hundreds or even thousands of rules. Individual rules may be rather simple, or they may be quite complex. Clear, well-organized and well-presented RAFs make applications easier to develop and maintain. The RuBIE pattern recognition language provides users with the flexibility to organize their RAFs their own way in support of producing RAFs in a style that is most appropriate for the application and its maintenance.
  • The format requirements for RuBIE pattern recognition rules are as follows:
      • (1) There is not a one-to-one relationship between RuBIE pattern recognition rules and lines in a RuBIE application file. A single statement may span multiple lines.
      • (2) White space may be used for formatting purposes anywhere in between the components that make up a RuBIE pattern recognition rule and in between RuBIE pattern recognition rules themselves. Because literal values are considered single components, a white space in a literal value is considered literal space character and not white space used for formatting purposes.
      • (3) Any line may be blank or consist only of spaces.
      • (4) There are no column restrictions in a RuBIE application file (as required in some technical languages).
      • (5) It is possible to specify comments in a RuBIE application file.
  • Because the Fact Extraction Tool Set has API interfaces, ownership of input annotated text, output extraction results and output report files is the responsibility of the invoking program and not the RuBIE application file. When a statement successfully identifies and extracts a piece of text, the RAF needs to communicate those results.
  • The fact extraction application that applies a RuBIE application file against some annotated text routinely has access to some standard results. Also, it optionally has access to all the annotations that supported the extraction process.
  • The input and output requirements for the RuBIE pattern recognition language are as follows:
      • (1) When an extraction successfully matches a piece of text for extraction purposes, it provides the calling program with (1) the name of the rule, (2) the extracted text, (3) the start and end token numbers matched, and (4) any message generated by actions in the rule.
      • (2) One pattern recognition rule may extract more than one piece of text. This is why each piece of extracted information must contain some type of label, in addition to the name of the statement.
      • (3) Because one pattern recognition rule may extract more than one piece of text, it must also provide the calling program with a total extent. The total extent is a token range from the earliest (leftmost) token that a statement recognizes to the latest (rightmost) token that the statement recognizes for each piece of extracted text.
      • (4) The user may identify zero or more types of annotations to be reported for output purposes. For each annotation specified, the attributes for that annotation and the tokens or token sequences that they annotate are also returned or made accessible in some other way to the calling program.
      • (5) For debugging purposes, there is a switch that allows the user to write extraction results and related information to a browsable report file for viewing and analysis purposes.
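  • One possible shape for the results handed back to the calling program is sketched below. The class and field names are hypothetical, and token numbers are assumed to be inclusive on both ends; the source specifies only which pieces of information are returned, not how they are structured.

```python
from dataclasses import dataclass, field


@dataclass
class Extraction:
    label: str        # distinguishes pieces extracted by the same rule
    text: str         # the extracted text
    start_token: int  # first matched token number (assumed inclusive)
    end_token: int    # last matched token number (assumed inclusive)


@dataclass
class RuleResult:
    rule_name: str
    extractions: list
    messages: list = field(default_factory=list)

    @property
    def total_extent(self):
        """Token range from the leftmost to the rightmost extracted token."""
        return (min(e.start_token for e in self.extractions),
                max(e.end_token for e in self.extractions))


result = RuleResult(
    rule_name="officer",
    extractions=[
        Extraction("person", "John Smith", 3, 4),
        Extraction("title", "chief executive", 6, 7),
    ],
)
print(result.total_extent)  # (3, 7)
```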
  • Other functionalities implemented by the RuBIE pattern recognition language are as follows:
      • (1) Users may insert new annotation processes at will without the need for software changes to the RuBIE pattern recognition language to accommodate the additional annotation types and values that are introduced. For this reason, new annotation processes must provide access to their annotations in a way that is consistent with the various requirements of the RuBIE pattern recognition language. All annotation processes must provide their names and acceptable values (including digit ranges where appropriate) to the process that compiles or processes a RAF.
      • (2) Users do not have to use all of the annotations available to them in a given application.
  • At the user's request, the FEX server (described in greater detail hereinafter) compiles the RuBIE application file and runs it against the aligned annotations to extract facts.
  • The RuBIE pattern recognition language is a pattern recognition language that applies to text that has been tokenized into its base tokens (words, numbers, punctuation symbols, formatting information, etc.) and annotated with a number of attributes that indicate the form, function, and semantic role of individual tokens, patterns of tokens, and related tokens. Text structure is usually defined or controlled by some type of markup language; that is, the RuBIE pattern recognition language applies to one or more sets of annotations that have been aligned with a piece of tokenized text.
  • Although text annotation in accordance with the present invention uses XML as a basis for representing the annotated text, the RuBIE pattern recognition language itself places no restrictions on the markup language used in the source text because the RuBIE pattern recognition language actually applies to sets of annotations that have been aligned with the base tokens of the text rather than directly to the source text itself. The RuBIE pattern recognition language is rule-based, as opposed to machine learning-based.
  • The RuBIE pattern recognition language can exploit any attributes with which a text representation has been annotated. Through a dictionary lookup process, a user can create new attributes specific to some application. For example, in an executive changes extraction application that targets corporate executive change information in business news stories, a dictionary may be used to assign the attribute ExecutivePosition to any of a number of job titles, such as President, CEO, Vice President of Marketing, Senior Director and Partner. A RuBIE pattern recognition rule can then simply use the attribute name rather than list all of the possible job titles.
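  • The dictionary lookup described above can be sketched as a longest-phrase match over the token sequence. The function name, the half-open span format, and the case-insensitive comparison are assumptions made for illustration.

```python
def annotate(tokens, dictionary, attribute):
    """Return (start, end, attribute) spans, preferring the longest
    dictionary phrase that begins at each token position."""
    phrases = {tuple(p.lower().split()) for p in dictionary}
    if not phrases:
        return []
    longest = max(len(p) for p in phrases)
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in phrases:
                spans.append((i, i + n, attribute))
                i += n  # continue after the annotated phrase
                break
        else:
            i += 1      # no phrase starts here
    return spans


TITLES = ["President", "CEO", "Vice President of Marketing",
          "Senior Director", "Partner"]
tokens = "Acme named Jane Doe Vice President of Marketing".split()
print(annotate(tokens, TITLES, "ExecutivePosition"))
# [(4, 8, 'ExecutivePosition')]
```

    A rule can then test for the ExecutivePosition attribute on a span instead of enumerating every job title.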
  • For the sentence
      • Mark Benson read a book.
        the tokens Mark and Benson may each be annotated with orthographic attributes indicating their form (e.g., alphabetic string and capitalized string). The sequence Mark Benson may further be annotated with attributes such as proper name, noun phrase, person, male, subject, and agent. The individual terms may also be annotated with positional information attributes (1 for Mark and 2 for Benson), indicating their relative position within a sentence, document or other text.
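  • The annotations in this example can be pictured as two aligned layers: attributes on individual tokens and attributes on token sequences. The data layout and helper names below are hypothetical; the attribute names are the ones listed in the text.

```python
def orthographic(token):
    """Assign simple form-based attributes to a single token."""
    attrs = set()
    if token.isalpha():
        attrs.add("alphabetic string")
    if token[:1].isupper() and token[1:].islower():
        attrs.add("capitalized string")
    return attrs


TOKENS = ["Mark", "Benson", "read", "a", "book", "."]

# Token-level layer, keyed by 1-based position within the sentence.
token_annotations = {
    i: {"text": t, "attrs": orthographic(t)}
    for i, t in enumerate(TOKENS, start=1)
}

# Sequence-level layer: attributes that apply to a span of positions.
constituents = [
    {"start": 1, "end": 2,
     "attrs": {"proper name", "noun phrase", "person", "male", "subject", "agent"}},
]


def attributes_at(position):
    """Collect every attribute that applies to the token at `position`."""
    attrs = set(token_annotations[position]["attrs"])
    for c in constituents:
        if c["start"] <= position <= c["end"]:
            attrs |= c["attrs"]
    return attrs


print(sorted(attributes_at(3)))  # ['alphabetic string']
```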
  • An application that targets corporate executive change information in business news stories may have rules that attempt to identify each of the following pieces of information in news stories that have been categorized as being relevant to the topic of executive changes:
      • (1) The executive position in question
      • (2) The type of change(s) that occurred in that position (e.g., hired, fired, resigned, retired, etc.)
      • (3) The person or persons affected by the executive change (the old or new person in the position)
      • (4) The name of the company, subsidiary or division where the change took place
      • (5) The date of the change
      • (6) Related comments from a company spokesperson
  • The semantic agent of a “retired” action (the person performs the action of retiring) or the semantic patient of a “hired” or “fired” action (the person's executive status changes because someone else performs the action of hiring or firing them) is likely the person affected by the change. It may take multiple rules to capture all of the appropriate executives based on all the possible action-semantic role combinations possible. That is why a RuBIE application file may include many rules for a single application.
  • Other possible extraction applications could include the following:
      • (1) Identify the buyer, the target, friendly vs. hostile, and the amount of money and stock involved in a corporate acquisition
      • (2) Identify a weather event, where it occurred, how many people were injured or killed, and the amount of damage done
      • (3) For a story about a terrorist attack, identify where the attack occurred, the type of attack, how many people were injured or killed, information on the damage done, and who claimed responsibility
      • (4) Identify the name of the company, revenues, earnings or losses, and the time periods for which the figures apply.
  • Information extraction applications can be developed for any topic area where information about the topic is explicitly stated in the text.
  • A pattern recognition language used as a basis for applications that apply to text fundamentally tests the tokens and constituents in that text for their values or attributes in some combination. In some applications, the attributes are limited to little more than orthographic attributes of the text, e.g., What is the literal value of a token? Is it an alphabetic string, a digit string or a punctuation symbol? Is the string capitalized, upper case or lower case? And so on.
  • Many pattern recognition languages rely on a regular expression-based description of the attribute patterns that should be matched. Typically, the simplest example of a regular expression in annotated text processing is a rule that tests for the presence of a single attribute or the complement of that attribute assigned to some part of the text, such as a base token. More complex regular expressions look for some combination of tests, such as sequences of different tests, choices between multiple tests, or optional tests among required tests. Regular expression-based pattern recognition processes often progress left-to-right through the text. Some regular expression-based pattern recognition languages will have additional criteria for selecting between two pattern recognition rules that each could match the same text, such as the rule listed first in the rule set has priority, or the rule that matches the longest amount of text has priority. Regular expression-based pattern recognition languages are often implemented using finite state machines, which are highly efficient for text processing.
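  • The two tie-breaking criteria mentioned above, first-listed rule and longest match, can be contrasted with Python's re module. This sketch works on raw character strings rather than annotated tokens, and the rule names and expressions are invented for the example.

```python
import re


def select(text, rules, policy):
    """Pick one rule among those matching at the start of `text`.

    rules: ordered list of (name, regex). policy: "first" prefers the rule
    listed earliest; "longest" prefers the rule matching the most text.
    """
    candidates = []
    for name, regex in rules:
        m = re.match(regex, text)
        if m:
            candidates.append((name, m.group(0)))
    if not candidates:
        return None
    if policy == "first":
        return candidates[0]
    if policy == "longest":
        return max(candidates, key=lambda c: len(c[1]))
    raise ValueError(f"unknown policy: {policy}")


rules = [
    ("company", r"[A-Z]\w+ Corp"),
    ("company_full", r"[A-Z]\w+ Corp of America"),
]
text = "Acme Corp of America rose"
print(select(text, rules, "first"))    # ('company', 'Acme Corp')
print(select(text, rules, "longest"))  # ('company_full', 'Acme Corp of America')
```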
  • A number of applications, especially in identifying many categories of named entities, can be highly successful even with such limited annotations. The LexisNexis® LEXCITE® case citation recognition process, SRA's NetOwl® technology and Inxight's ThingFinder™ all rely on this level of annotation in combination with the use of dictionaries that assign attributes based on literal values (e.g., LEXCITE® uses a dictionary of case reporter abbreviations; named entity recognition processes such as NetOwl® and ThingFinder™ commonly use dictionaries of company cues such as Inc, Corp, Co, and PLC, people titles such as Mr, Dr, and Jr, and place names).
  • Similar to prior art regular expression-based pattern recognition tools like SRA's NetOwl® technology, Perl™, and the Penn Tools, the RuBIE pattern recognition language supports common, regular expression-based functionality. However, the results of more sophisticated linguistics processes that annotate a text with syntactic attributes are best represented using a tree-based representation. XML has emerged as a popular standard for creating a representation of a text that captures its structure. As noted above, the FEX tool set uses XML as a basis for annotating text with numerous attributes, including linguistic structure and other linguistic attributes.
  • The relationship between two elements in the tree-based representation can be determined by following the path through the tree between the two elements. Some important relationships can easily be anticipated—finding the subject and object (or agent and patient) of some verb, for example. Because sentences can come in an infinite variety, there can be an infinite number of possible ways to specify the relationships between all possible entity pairs. The RuBIE pattern recognition language exploits some of the more popular syntactic relationships common to texts.
  • In the approach taken by the Penn Tools, a predefined set of specific shift operators based on those relationships was included in the language. However, that approach limited users to only those relationships that were predefined. The RuBIE pattern recognition language avoids similar restrictions. XPath provides a means for traversing the tree-like hierarchy represented by XML document markup. It is possible to create predefined functions and operators for popular relationships based on XPath as part of the RuBIE pattern recognition language, both as part of the RuBIE language and through application-specific auxiliary definitions, but it is also possible to give RuBIE pattern recognition rule writers direct access to XPath so that they can create information extraction rules based on any syntactic relationship that could be represented in XML. Thus a RuBIE pattern recognition rule can combine traditional regular expression pattern recognition functionality with the ability to exploit any syntactic relationship that can be expressed using XPath.
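  • As an illustration of XPath-style tree traversal over annotated text, the sketch below uses the limited XPath subset supported by Python's xml.etree.ElementTree. The XML schema (S, NP, VP and W elements with role, pos and root attributes) is entirely hypothetical and is not the FEX annotation format; the point is only that a subject-of-verb relationship can be read off the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for "Mark Benson read a book."
DOC = """
<doc>
  <S>
    <NP role="subject"><W pos="NNP">Mark</W><W pos="NNP">Benson</W></NP>
    <VP>
      <W pos="VBD" root="read">read</W>
      <NP role="object"><W pos="DT">a</W><W pos="NN">book</W></NP>
    </VP>
    <W pos=".">.</W>
  </S>
</doc>
""".strip()


def subjects_of(root, verb_root):
    """Return the subject text of every sentence whose verb phrase is
    headed by a word with the given morphological root."""
    found = []
    for sentence in root.iter("S"):
        # Path expression: a W child of VP whose root attribute matches.
        if sentence.find(f"VP/W[@root='{verb_root}']") is not None:
            for np in sentence.findall("NP[@role='subject']"):
                found.append(" ".join(w.text for w in np.iter("W")))
    return found


tree = ET.fromstring(DOC)
print(subjects_of(tree, "read"))  # ['Mark Benson']
```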
  • The RuBIE pattern recognition language is unique in its combination of traditional regular expression pattern recognition capabilities and XPath-based tree traversal capabilities, both of which are used to match patterns in an annotated text in support of information extraction.
  • The RuBIE pattern recognition language allows users to combine attribute tests together using traditional regular expression functionality and XPath's ability to traverse XML-based tree representations. Through the addition of macro-like auxiliary definitions, the RuBIE pattern recognition language also allows users to create application-specific matching functions based on regular expressions or XPath.
  • A single RuBIE pattern recognition rule can use traditional regular expression functionality, XPath-based functionality, and auxiliary definitions in any combination. The pattern recognition functionality that is deployed as part of the FEX tool set for tests, regular expression-based operators, and shift operators will now be described.
  • A test verifies that a token or constituent:
      • (1) Has the presence of an attribute
      • (2) Has the presence of an attribute that has a particular value
      • (3) Has the presence of an attribute that has one of a set of possible values
        If the test is successful, the corresponding text has been matched.
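  • The three kinds of test listed above can be sketched in Python as follows. This is only an illustration and not RuBIE syntax: the dictionary-based token model and the function names `has_attribute`, `has_value`, and `has_any_value` are assumptions introduced here.

```python
# Hypothetical token model: each token is a dict of annotation type -> value.
def has_attribute(token, ann_type):
    """Test (1): the attribute is present."""
    return ann_type in token

def has_value(token, ann_type, value):
    """Test (2): the attribute is present and has a particular value."""
    return ann_type in token and token[ann_type] == value

def has_any_value(token, ann_type, values):
    """Test (3): the attribute is present and has one of a set of values."""
    return ann_type in token and token[ann_type] in values

# Example token for the word "Dogs" (annotation names are illustrative).
token = {"literal": "Dogs", "root": "dog", "pos": "NNS"}
print(has_attribute(token, "pos"))
print(has_value(token, "root", "dog"))
print(has_any_value(token, "pos", {"NN", "NNS"}))
```

  • Keeping the annotation type explicit in every test mirrors the requirement, stated later in this description, that both the annotation type and the annotation value be specified.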
  • A RuBIE pattern recognition rule contains a single test or a combination of tests connected by RuBIE operators (a combination of regular expression and tree traversal functionality). If the test or combination of tests is successful within the logic of the operators used, then the rule has matched the text that corresponds to the tokens or constituents, and that text can be extracted or processed further in other ways.
  • Regular expression-based operators in the RuBIE pattern recognition language include the following:
      • (1) Apply a single test to a single token or constituent;
      • (2) Apply a sequence of tests to a pattern of tokens or constituents;
      • (3) Create a test by putting a valid sequence of tests in parentheses;
      • (4) Either of two tests must be true (logical OR);
      • (5) Both of two tests must be true (logical AND);
      • (6) Use the complement of the result of the test (logical NOT);
      • (7) Apply a test to a sequence of zero or more tokens or constituents (zero closure);
      • (8) Apply a test to a sequence of one or more tokens or constituents (positive closure);
      • (9) Indicate that a test is optional; and
      • (10) Apply a test to a sequence of at least m and no more than n tokens or constituents.
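  • As a rough illustration of how the operators above can apply to a stream of annotated tokens rather than to characters, the following Python sketch builds them as small backtracking matchers. It is not the RuBIE implementation; every name in it (`attr`, `tok_test`, `seq`, `alt`, `repeat`, `match_at`) and the token layout are assumptions made for this example.

```python
# Single-token predicates (hypothetical token model: dict of annotations).
def attr(ann_type, *values):
    return lambda tok: tok.get(ann_type) in values

def both(p, q):                      # logical AND of two tests
    return lambda tok: p(tok) and q(tok)

def neg(p):                          # logical NOT of a test
    return lambda tok: not p(tok)

# Matchers take (tokens, start) and yield every possible end position.
def tok_test(pred):                  # apply a single test to a single token
    def match(tokens, i):
        if i < len(tokens) and pred(tokens[i]):
            yield i + 1
    return match

def seq(*parts):                     # sequence of tests; also acts as grouping
    def match(tokens, i):
        if not parts:
            yield i
            return
        for j in parts[0](tokens, i):
            yield from seq(*parts[1:])(tokens, j)
    return match

def alt(a, b):                       # logical OR of two patterns
    def match(tokens, i):
        yield from a(tokens, i)
        yield from b(tokens, i)
    return match

def repeat(p, lo, hi=None):          # between lo and hi repetitions, greedy
    def match(tokens, i, count=0):
        if hi is None or count < hi:
            for j in p(tokens, i):
                if j > i:            # guard against zero-width loops
                    yield from match(tokens, j, count + 1)
        if count >= lo:
            yield i
    return match

def match_at(pattern, tokens, start=0):
    """Return the end position of the first match at start, or None."""
    return next(pattern(tokens, start), None)

words = [("the", "DT"), ("big", "JJ"), ("red", "JJ"), ("dog", "NN"), ("barked", "VBD")]
tokens = [{"literal": w, "pos": p} for w, p in words]

# Optional determiner, zero or more adjectives, one or more nouns.
np = seq(repeat(tok_test(attr("pos", "DT")), 0, 1),
         repeat(tok_test(attr("pos", "JJ")), 0),
         repeat(tok_test(attr("pos", "NN", "NNS")), 1))
print(match_at(np, tokens))
```

  • In this sketch zero closure is `repeat(p, 0)`, positive closure is `repeat(p, 1)`, an optional test is `repeat(p, 0, 1)`, and the bounded form is `repeat(p, m, n)`.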
  • Shift operators rely on syntactic and other hierarchical information such as that which can be gained from traversing the results of a parse tree. XML is used to capture this hierarchical information, and XPath is used as a basis for the following tree traversal operators:
      • (1) From within a noun phrase to the start of the noun phrase
      • (2) From within a verb group to the start of the verb group
      • (3) From within a subject to the start of its corresponding verb group
      • (4) From within a verb group to the start of its corresponding subject
      • (5) From within an object to the start of the verb group that governs it
      • (6) From within a sentence to the start of its verb group
      • (7) From within a clause to the start of its verb group
      • (8) From within a sentence to the start of that sentence
      • (9) From within a sentence to the end of that sentence
      • (10) From within a noun phrase to the start of its head noun
      • (11) From within a noun phrase to the start of the next co-referring noun phrase
      • (12) From within a noun phrase to the start of the previous co-referring noun phrase
  • There are many other similar relationships that can be captured in the RuBIE pattern recognition language's XML-based representation. Through direct use and programming macro-like auxiliary definitions, the RuBIE pattern recognition language allows users to create additional and new shift operations based on XPath in order to exploit any of a number of relationships between constituents as captured in the XML-based representation of the annotated text.
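  • Two of the tree traversal shifts listed above can be sketched with Python's standard `xml.etree.ElementTree` module. The element names (`S`, `CLAUSE`, `SUBJ`, `NP`, `VG`, `T`) and the helper functions are assumptions for illustration only; the actual FEX annotation schema is not reproduced here. Because ElementTree supports only a subset of XPath and keeps no parent pointers, the sketch builds its own parent map.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; T elements stand for base tokens.
XML = """
<S>
  <CLAUSE>
    <SUBJ><NP><T>John</T><T>Smith</T></NP></SUBJ>
    <VG><T>was</T><T>hired</T></VG>
  </CLAUSE>
</S>
"""
root = ET.fromstring(XML)
# Map each element to its parent so that upward moves are possible.
parent = {child: node for node in root.iter() for child in node}

def ancestor(node, tag):
    """Walk up from node to the nearest enclosing element with the given tag."""
    while node is not None and node.tag != tag:
        node = parent.get(node)
    return node

def first_token(node):
    """Return the first base token at or below node, or None."""
    if node is None:
        return None
    return node if node.tag == "T" else node.find(".//T")

def shift_to_np_start(tok):
    """From within a noun phrase to the start of the noun phrase."""
    return first_token(ancestor(tok, "NP"))

def shift_subject_to_verb_group(tok):
    """From within a subject to the start of its corresponding verb group."""
    if ancestor(tok, "SUBJ") is None:
        return None
    clause = ancestor(tok, "CLAUSE")
    return first_token(clause.find("VG")) if clause is not None else None

smith = root.findall(".//T")[1]
print(shift_to_np_start(smith).text)
print(shift_subject_to_verb_group(smith).text)
```

  • A new shift for some other relationship would follow the same pattern: move up to a shared ancestor, then down along a path to the start of the related constituent.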
  • The RuBIE pattern recognition language also has shift operators based on relative position, including
      • (1) Go left some number of base tokens
      • (2) Go right some number of base tokens
      • (3) Go to the leftmost base token most recently matched by the application of the rule (allows a second test starting from the same position)
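  • The position-based shifts can be pictured as operations on a cursor over the base token stream. The `Cursor` class below is a hypothetical sketch, not part of the FEX tool set; it assumes zero-based positions and makes a shift fail (returning False) when there are not enough tokens in the requested direction.

```python
class Cursor:
    """Hypothetical recognition cursor; pos is the index of the next base token to test."""

    def __init__(self, n_tokens, pos=0):
        self.n_tokens = n_tokens
        self.pos = pos
        self.last_match_start = None

    def go_left(self, count):
        """Shift left by count base tokens; fail if too few tokens remain."""
        if self.pos - count < 0:
            return False
        self.pos -= count
        return True

    def go_right(self, count):
        """Shift right by count base tokens; fail if too few tokens remain."""
        if self.pos + count > self.n_tokens:
            return False
        self.pos += count
        return True

    def record_match(self, start, end):
        """Remember the leftmost token of the latest match and move past it."""
        self.last_match_start = start
        self.pos = end

    def go_to_last_match_start(self):
        """Return to the leftmost token most recently matched, for a second test."""
        if self.last_match_start is None:
            return False
        self.pos = self.last_match_start
        return True

cursor = Cursor(n_tokens=5)
cursor.record_match(0, 3)        # a test matched tokens 0 through 2
cursor.go_to_last_match_start()  # retest the same text from token 0
print(cursor.pos)
```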
  • Because in the RuBIE pattern recognition language the same attribute values may be used with different annotations (e.g., the word dog may have dog as its literal form, its capitalization normalized form and its morphological root form), and because the user may introduce new annotation types to an application, it is necessary to specify both the annotation type and annotation value in RuBIE pattern recognition rules.
  • The RuBIE pattern recognition language allows a user to test a base token for the following attributes:
      • (1) An attribution, which includes a specific annotation type and corresponding annotation value.
      • (2) Any of a number of annotation values that correspond to a single specified annotation type. It is possible to do this at least two different ways. The preferred way is to allow the user to list multiple values in a single test. Another approach requires that the user OR together each individual annotation value test.
      • (3) One or more case sensitive literal values. Because this involves individual tokens, a literal phrase test involves testing the literal values of a sequence of individual tokens. It is possible to specify literals in at least two ways, such as literal (word) and “word.”
      • (4) One or more case-corrected values. Because this involves individual tokens, a literal phrase test involves testing the literal values of a sequence of individual tokens.
      • (5) One or more case insensitive literal values. Because this involves individual tokens, a literal phrase test involves testing the literal values of a sequence of individual tokens.
      • (6) One or more part of speech tag values.
      • (7) One or more inflectional morphological root values.
      • (8) One or more inflectional morphological attribute values.
      • (9) One or more derivational morphological attribute values.
      • (10) One or more orthographic attributes, such as capitalized string, upper case letter, upper case string, lower case letter, lower case string, digit, digit string, punctuation symbol, white space, and other.
      • (11) Information about the white space that follows the token in the original text (in the case where white space base tokens are eliminated from the base token stream, and information about those tokens is attached as attributes of the tokens that precede them). These attributes may include followed by white space and not followed by white space (e.g., followed by punctuation or markup language tags). White space that precedes all base tokens in the source text may be ignored, or designers may consider introducing start-of-document and end-of-document base tokens.
      • (12) The region of the document in which it appears. As used herein, regions generally correspond to segment functions. For news data, regions may include headline, byline, dateline, date, lead (non-table), body (non-lead, non-table), table (in lead or body), company (index term segment), and other.
      • (13) Token number in the sequence of tokens that comprise the input text. Annotation values consist of integer strings, as tokens are numbered from 1 to n (or 0 to n-1). This permits a recognition rule to return the attribute value to the calling program.
      • (14) Starting character position in the input text. Annotation values consist of integer strings, as characters are numbered from 1 to n (or 0 to n-1). This permits a recognition rule to return the attribute value to the calling program.
      • (15) Ending character position in the input text. Annotation values consist of integer strings, as characters are numbered from 1 to n (or 0 to n-1). This permits a recognition rule to return the attribute value to the calling program.
      • (16) Length in characters. This permits a recognition rule to return the attribute value to the calling program.
      • (17) The current document identification. This permits a recognition rule to return the attribute value to the calling program so that the system can report which document a recognition occurred in when processing collections of multiple documents.
      • (18) Arithmetic ranges of values, such as >n, >=n, =n, <n, <=n, m..n, and −=n, when testing an annotation type that has a numerical value.
  • When specifying literal values, users are able to indicate wildcard characters (.), superuniversal truncation (!), and optional characters (?). A wildcard character can match any character. Superuniversal truncation means that the term must match exactly anything up to the superuniversal operator, and then anything after that operator is assumed to match by default. An optional character is simply a character that is not relevant to a particular test, e.g., word-final -s for some nouns.
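  • One way to picture these literal operators is as a translation into an ordinary regular expression, sketched below with Python's `re` module. The translation rules are assumptions made for illustration: in particular, the sketch assumes that `?` is written immediately after the character it makes optional, which the description above does not specify.

```python
import re

def literal_to_regex(pattern):
    """Translate a literal pattern with ., ! and ? into a compiled regex (a sketch)."""
    parts = []
    for ch in pattern:
        if ch == ".":
            parts.append(".")            # wildcard: any single character
        elif ch == "!":
            parts.append(".*")           # truncation: anything after this point matches
        elif ch == "?" and parts:
            parts.append("?")            # assumed: preceding character is optional
        else:
            parts.append(re.escape(ch))  # every other character matches itself
    return re.compile("".join(parts))

def literal_matches(pattern, word):
    """True if the whole word matches the literal pattern."""
    return literal_to_regex(pattern).fullmatch(word) is not None

print(literal_matches("dogs?", "dog"))
print(literal_matches("acqui!", "acquisition"))
print(literal_matches("wom.n", "women"))
```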
  • Constituent attributes are those attributes that are assigned to a pattern of one or more base tokens that represent a single constituent. A proper name, a basal noun phrase, a direct object and other common linguistic attributes can consist of one or more base tokens, but RuBIE pattern recognition rules treat such a pattern as a single constituent. If for example the name
      • Mark David Benson
        has been identified as a proper name AND a noun phrase AND a subject, simply specifying one of these attributes in some statement would result in the matching of all three base tokens that comprise the constituent.
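  • The behavior described for this name can be sketched by storing constituents as spans over the base token stream, so that a successful constituent test consumes every token in the span. The record layout, the annotation names, and the function `match_constituent` are assumptions for illustration.

```python
tokens = ["Mark", "David", "Benson", "resigned"]

# Hypothetical constituent records: annotation type plus a half-open token span.
constituents = [
    {"type": "ProperName", "start": 0, "end": 3},
    {"type": "NounPhrase", "start": 0, "end": 3},
    {"type": "Subject", "start": 0, "end": 3},
    {"type": "VerbGroup", "start": 3, "end": 4},
]

def match_constituent(ann_type, pos):
    """If a constituent of ann_type starts at pos, return the position just past it."""
    for c in constituents:
        if c["type"] == ann_type and c["start"] == pos:
            return c["end"]
    return None

# Testing any one of the three annotations matches all three base tokens.
end = match_constituent("Subject", 0)
print(tokens[0:end])
```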
  • The emphasis for constituent attributes is on recognizing valid constituents. Examples of constituent attributes include, but are not limited to, the following: Company; Person; Organization; Place; Job Title; Citation; Monetary Amount; Basal Noun Phrase; Maximal Noun Phrase; Verb Group; Verb Phrase; Subject; Verb; Object; Employment Change Action Description Term; and Election Activity Descriptive Term. In some instances, the “pattern” may consist of a single base token. The RuBIE pattern recognition language has the ability to recognize non-contiguous (i.e., tree-structured) constituents via XPath in addition to the true left-right sequences on which the regular expression component of the RuBIE pattern recognition language focuses.
  • Annotations are defined and assigned robustly by the RuBIE pattern recognition language. No sort of taxonomical inheritance is assumed; otherwise a pattern recognition rule would have to draw information from sources in addition to the XML-based annotation representation.
  • In most respects, a constituent attribute generally behaves like a token attribute in patterns. The RuBIE pattern recognition language includes the following constituent attributes:
      • (1) The RuBIE pattern recognition language allows the user to identify some patterns of tokens as single constituents that consist of one or more sequential tokens.
      • (2) The RuBIE pattern recognition language allows annotations to be assigned to constituents.
      • (3) When a pattern token is identified in a recognition pattern, it is treated as a single constituent regardless of how many base tokens comprise it.
      • (4) The RuBIE pattern recognition language allows the user to test a constituent for a specific annotation type.
      • (5) Base tokens may be part of more than one constituent.
      • (6) Two constituents that have at least one base token in common do not necessarily have to have all base tokens in common.
      • (7) It is possible to test a constituent for a company name annotation type.
      • (8) It is possible to test a constituent for a person name annotation type.
      • (9) It is possible to test a constituent for an organization name annotation type.
      • (10) It is possible to test a constituent for a city name annotation type.
      • (11) It is possible to test a constituent for a county name annotation type.
      • (12) It is possible to test a constituent for a state or province name annotation type.
      • (13) It is possible to test a constituent for a geographic region name annotation type.
      • (14) It is possible to test a constituent for a country name annotation type.
      • (15) It is possible to test a constituent for a phone number annotation type.
      • (16) It is possible to test a constituent for a street address annotation type.
      • (17) It is possible to test a constituent for a zip code annotation type.
      • (18) It is possible to test a constituent for a case citation annotation type.
      • (19) It is possible to test a constituent for a statute citation annotation type.
      • (20) It is possible to test a constituent for a money amount annotation type.
      • (21) It is possible to test a constituent for a sentence annotation type.
      • (22) It is possible to test a constituent for a clause annotation type.
      • (23) It is possible to test a constituent for a paragraph annotation type.
      • (24) It is possible to test a constituent for a basal noun phrase annotation type.
      • (25) It is possible to test a constituent for a maximal noun phrase annotation type.
      • (26) It is possible to test a constituent for a time annotation type.
      • (27) It is possible to test a constituent for a noun phrase “Has Number” annotation type.
      • (28) It is possible to test a constituent for a subject annotation type.
      • (29) It is possible to test a constituent for a verb group annotation type.
      • (30) It is possible to test a constituent for a direct object annotation type.
      • (31) It is possible to test a constituent for a negated annotation type.
      • (32) It is possible to test a constituent for a person attribute.
      • (33) It is possible to test a constituent for an animate (living thing, perhaps at one time) attribute.
      • (34) It is possible to test a constituent for an inanimate attribute.
      • (35) It is possible to test for the presence of a token regardless of its other annotations. All tokens can be treated as if they have a “present” attribute.
      • (36) It is possible to introduce new constituent-based annotation type tests through general annotators.
      • (37) It is possible to introduce new constituent-based annotation type tests through one or more dictionaries.
  • Regular expressions are powerful tools for identifying patterns in text when all of the necessary information is located sequentially in the text. Natural language, however, does not always cooperate. A subject and its corresponding object may be separated by a verb. A pronoun and the person it refers to may be separated by paragraphs of text. And yet it is these relationships that are often the more interesting ones from a fact extraction perspective.
  • There are a number of approaches for storing relationship information. One common approach uses a direct link between the related items. Adding a common identifier to both related items is another way of accomplishing this. The Penn Tools used this to support shifts from one location to the start of a related constituent; this was accomplished using positional triples that identified the beginning and end positions of the starting point constituent and the position immediately in front of the related token to which pattern recognition was to shift. From anywhere in a verb phrase, for example, one can shift the recognition process to a point just before the main verb in that phrase. From the subject, one can shift the recognition process to a point just before the start of the corresponding verb group.
  • In the RuBIE pattern recognition language, the pattern recognition process can be shifted directly between two related constituents. The RuBIE pattern recognition language supports the following relationship shifts:
      • (1) It is possible to shift recognition from anywhere within a noun phrase to the start of the noun phrase.
      • (2) It is possible to shift recognition from anywhere within a verb group to the start of the verb group.
      • (3) It is possible to shift recognition from anywhere within a subject to the start of the verb group that governs the subject.
      • (4) It is possible to shift recognition from anywhere within a verb group to the start of the subject governed by that verb group.
      • (5) It is possible to shift recognition from anywhere within a verb group to the start of the object governed by that verb group.
      • (6) It is possible to shift recognition from anywhere within an object to the start of the verb group that governs the object.
      • (7) It is possible to shift recognition from anywhere within a sentence to the start of the governing verb group.
      • (8) It is possible to shift recognition from anywhere within a linguistic clause to the start of the verb group that governs that clause.
      • (9) It is possible to shift recognition from anywhere within a text to the left by some specified number of base tokens, except when there are not enough tokens to the left.
      • (10) It is possible to shift recognition from anywhere within a text to the right by some specified number of base tokens, except when there are not enough tokens to the right.
      • (11) It is possible to shift recognition from anywhere within a text to the start of the leftmost base token matched by the most recent test in a pattern recognition rule. This allows the same text to be retested for some other annotation or pattern of annotations. For example, after verifying that the subject is represented by a noun phrase, one might then want to test its components to extract any adjectives as descriptors.
      • (12) It is possible to shift recognition from anywhere within a text to the start of the leftmost base token matched by the most recent scope as indicated by brackets in the examples in this report. This allows the same piece of text to be matched more than once, perhaps to return different attributes with it.
      • (13) It is possible to shift recognition from anywhere within a sentence to the start of that sentence.
      • (14) It is possible to shift recognition from anywhere within a sentence to the end of that sentence.
      • (15) It is possible to shift recognition from anywhere within a noun phrase to the start of the head noun in that noun phrase.
      • (16) It is possible to shift recognition from anywhere within a basal noun phrase to the start of the next (right) coreferring noun phrase.
      • (17) It is possible to shift recognition from anywhere within a basal noun phrase to the start of the previous (left) coreferring noun phrase.
      • (18) Users may add new annotations that define other relationships between two constituents. It is possible to define functionality that moves the recognition process from within one of these constituents to the start of the other constituent, allowing for the introduction of new shifts.
  • For those shifts dependent on parse tree-based syntactic relationships—such as shifts between subjects, verbs, and objects in a sentence or clause—the adopted shift command takes arguments, specifically references to constituent objects. Due to the nature of language, there can often be more than one possible constituent that may fit the prose description of the shift. For example, consider the sentence John kissed Mary and dated Sue. There are two verbs here, each with one subject (John in both cases) and one object (Mary and Sue respectively). This type of complexity adds some ambiguity, e.g. deciding which verb to shift to. The ability to use indirection and compound constituent objects addresses this class of problems.
  • The RuBIE pattern recognition language therefore also includes the following capabilities:
      • (19) It is possible to pass references to constituents to RuBIE patterns.
      • (20) It is possible to pass references to compound constituent objects to RuBIE patterns (e.g., linking the subject “John” to both verbs “kissed” and “dated”—not just the first verb—in the example above).
      • (21) It is possible to access the elements and relationships found in a full parse tree.
  • The RuBIE shifts allow a RuBIE pattern recognition rule writer to shift the path of pattern recognition from one part of a text to another. For many of the shifts, however, there is a corresponding shift to return the path of pattern recognition back to where it was before the first shift occurred. A variation of the constituent attribute test could account for a number of cases where such shift-backs are likely to occur. The RuBIE pattern recognition language therefore also includes the following capabilities:
      • (22) It is possible to test a syntactically related constituent for the presence of an attribute without shifting the path of pattern recognition back and forth.
        • This might be realized with a test that, for constituent attributes, looks like:
          • % sequence-attribute-test(parse-tree-locatable-entity)
        • as in
          • % company(antecedent-of-pronoun)
        • or for token attributes:
          • % sequence-attribute-test(v, parse-tree-locatable-entity)
        • as in
          • % root(“kill”, verb-of-subject)
        • where the percent sign is simply an arbitrarily selected character that indicates the type of test that follows.
      • (23) It is possible to test a syntactically related constituent recursively for the presence of an attribute without shifting the path of pattern recognition back and forth (“nested tests”).
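  • Capabilities (19) through (23) can be illustrated together with a small Python sketch based on the sentence John kissed Mary and dated Sue. The constituent table, the relation names, and the functions `related`, `root_test`, and `attr_test` are all hypothetical; they only show how a test on a related constituent can succeed without moving the recognition position, and how one subject can be linked to both verbs.

```python
# Hypothetical constituents for "John kissed Mary and dated Sue."
constituents = {
    "np1": {"text": "John", "attrs": {"person"}},
    "vg1": {"text": "kissed", "attrs": {"verb"}, "root": "kiss"},
    "np2": {"text": "Mary", "attrs": {"person"}},
    "vg2": {"text": "dated", "attrs": {"verb"}, "root": "date"},
    "np3": {"text": "Sue", "attrs": {"person"}},
}

# A compound reference: the subject "John" is linked to both verbs.
relations = {
    ("np1", "verb-of-subject"): ["vg1", "vg2"],
    ("vg1", "object-of-verb"): ["np2"],
    ("vg2", "object-of-verb"): ["np3"],
}

def related(cid, relation):
    """Return the ids of all constituents linked to cid by the relation."""
    return relations.get((cid, relation), [])

def root_test(value, cid, relation):
    """Sketch of a token attribute test such as %root("kill", verb-of-subject)."""
    return any(constituents[r].get("root") == value for r in related(cid, relation))

def attr_test(attr, cid, relation):
    """Sketch of a constituent attribute test such as %company(antecedent-of-pronoun)."""
    return any(attr in constituents[r]["attrs"] for r in related(cid, relation))

# The test sees the second verb as well as the first.
print(root_test("date", "np1", "verb-of-subject"))
# Nested test: does any verb governed by the subject have a person as object?
print(any(attr_test("person", v, "object-of-verb")
          for v in related("np1", "verb-of-subject")))
```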
  • Following is a walkthrough of an example RuBIE pattern recognition rule that identifies a new hire and his or her new job position as expressed in a passive sentence. In this case, the new hire is a person who is the patient of a hire verb.
  • The sentences targeted by the example rule are:
      • John Smith was hired as CEO of IBM Corp.
      • IBM announced that John Smith was appointed CEO.
  • The example production RuBIE pattern recognition rule, named PassiveFact1 and written in an alternate syntax, is as follows:
    PassiveFact1:
    @hireverb
    ( @goUp(CLAUSE)atEl(“CLAUSE”,”passive=t”)
    goFromTo(“CLAUSE”,”pat”,”*”,”id”)
    <NewHire>@goDN(PERSON)</NewHire>)
    @goDN(T)
    <Position>atEl(“*”,”n-type=position”)</Position>
  • The rule first looks for a “hire” verb. In this case, an auxiliary definition was created so that @hireverb will match a verb whose stem is “hire”, “name”, “appoint”, “promote”, or some similar word.
  • Once such a verb is found, the rule goes up the tree to the nearest clause node and verifies that the clause contains a passive verb. If it is true that the clause contains a passive verb, the rule goes back down into the clause to find the patient of the clause verb. The patient of a verb is the object affected by the action of the verb, in this case the person being hired. In a passive sentence, the patient is typically the grammatical subject of the clause. Within the patient, the rule then looks for a specific Person as opposed to some descriptive phrase. If an actual person name is found as the patient of a hire verb, the rule can then mark it up with the XML tags <NewHire> and </NewHire>.
  • Finally, the rule tests other items in the clause until it finds a constituent that has the attribute position assigned to it. A dictionary of candidate job positions of interest is used to assign this attribute to the text. If a valid position is found, it can be marked up with the XML tags <Position> and </Position>.
  • The entire rule must succeed for the marking up of the text with both the <NewHire> and <Position> tags to take place.
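  • The walkthrough above can be approximated in Python with the standard `xml.etree.ElementTree` module. This is one reading of the rule, not the RuBIE engine: the XML layout, the `root` attribute on tokens, the list of hire verbs in `HIRE_ROOTS`, and the function `passive_fact1` are assumptions, and the sample sentence is shortened. As in the rule, nothing is returned unless both the new hire and the position are found.

```python
import xml.etree.ElementTree as ET

HIRE_ROOTS = {"hire", "name", "appoint", "promote"}  # stand-in for @hireverb

# Hypothetical annotation of "John Smith was hired as CEO".
SENTENCE = """
<S>
  <CLAUSE passive="t" pat="np1">
    <NP id="np1"><PERSON><T>John</T><T>Smith</T></PERSON></NP>
    <VG><T>was</T><T root="hire">hired</T></VG>
    <T>as</T>
    <NP n-type="position"><T>CEO</T></NP>
  </CLAUSE>
</S>
"""

def text_of(element):
    """Join the base tokens at or below an element."""
    return " ".join(t.text for t in element.iter("T"))

def passive_fact1(sentence_xml):
    root = ET.fromstring(sentence_xml)
    parent = {child: node for node in root.iter() for child in node}
    for tok in root.iter("T"):
        if tok.get("root") not in HIRE_ROOTS:
            continue                      # not a hire verb
        clause = tok                      # go up to the nearest CLAUSE
        while clause is not None and clause.tag != "CLAUSE":
            clause = parent.get(clause)
        if clause is None or clause.get("passive") != "t":
            continue                      # the clause must be passive
        pat = clause.get("pat")           # follow pat to the element with that id
        if pat is None:
            continue
        patient = next((e for e in clause.iter() if e.get("id") == pat), None)
        if patient is None:
            continue
        person = patient if patient.tag == "PERSON" else patient.find(".//PERSON")
        position = next((e for e in clause.iter()
                         if e.get("n-type") == "position"), None)
        if person is None or position is None:
            continue                      # the entire rule must succeed
        return {"NewHire": text_of(person), "Position": text_of(position)}
    return None

print(passive_fact1(SENTENCE))
```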
  • The FEX tool set system architecture and design will now be described.
  • The FEX tool set is not itself a free-standing “application”, in the sense that it does not, for example, provide functionality to retrieve documents for extraction or to store the extracted facts in any persistent store. Rather, the FEX tool set typically exists as part of a larger application. Because document retrieval and preparation and presentation of extracted facts will vary depending on product requirements, these functions are left to the product applications that use the FEX tool set (the “FEX product”). FIG. 5 is a diagrammatic illustration of a first scenario in which a FEX product “A” creates a database with the facts extracted from a document “A” by the FEX tool set and provides an entirely new customer interface (UI) to present these facts from the database. In the first scenario, the original document “A” remains untouched both by the FEX tool set and by the FEX product.
  • FIG. 6 is a diagrammatic illustration of a second scenario in which an FEX product “B” actually updates an original document “B” with the extracted facts metadata and leverages the existing customer interface—possibly updated—to present the facts. The second scenario allows for the existing search technology to access the facts, requiring no new retrieval mechanism.
  • FIG. 7 is a diagrammatic illustration of the FEX tool set architecture. In the present embodiment, the FEX tool set has a Windows NT client-server architecture, using Java® and ActiveState® Perl™. Windows NT was chosen because it is a standard operating environment and because the primary annotator (a linguistic parser called EngPars) currently runs exclusively on the Windows architecture. Java®, implemented with IBM Visual Age for Java®, is used primarily for the graphical user interface development environment (GUI-DE), because it is a LexisNexis®-internal standard and provides strong portability and scalability. ActiveState® Perl™ is used to implement some of the text-processing tasks, since Perl™ is also portable, and since it has strong regular expression handling and general text-processing capability. It will be appreciated by those of skill in the art that other architectures that provide equivalent functionality can be used.
  • The major hardware components in the FEX tool set are the FEX client and the FEX server. In the present embodiment, the client for the FEX tool set is a “thin” Java®-based Windows NT® Workstation or Windows 98/2000® system. The FEX server is a Windows NT Server system, a web server that provides the main functionality of the FEX tool set. Functionality is made available to the client via a standard HTTP interface, which can include SOAP (“Simple Object Access Protocol”, an HTTP protocol that uses XML for packaging).
  • While this architecture allows for true client-server interaction, it also allows for a reasonable migration to a single-machine solution, in which both the client and server parts are installed on the same workstation.
  • FIG. 8 is a high level diagram of the processing flow of the FEX tool set using the architecture shown in FIG. 7. The user's interface to the FEX tool set is the FEX GUI-DE on the FEX Workstation. Within GUI-DE, the user opens or creates a FEX Workspace to store his or her product application work. The user selects the appropriate annotators and may use available client Annotator Development Tools (not part of the FEX tool set) to troubleshoot and tune the FEX Annotators for the application. When satisfied with the results from the development tools, the user saves the annotation settings to the Annotation Configuration in his or her workspace. The user may then request Annotation Processing to run the relevant FEX Annotators (such as some lexical lookup tool or natural language parser) and Align Annotation results on the FEX Server. From these results, the user can further tune the Annotation Configuration, if necessary.
  • The FEX GUI-DE provides the user interface to the FEX tool set. The user uses editing tools in the FEX GUI-DE to create and maintain Annotation Configuration information, RuBIE annotation files (scripts), and possibly other annotation files like dictionaries or annotator parameter information. The FEX GUI-DE also allows the user to create and maintain Workspaces, in which the user stores annotation configurations, RuBIE application files, and other files for each logical work grouping. The user also uses the FEX GUI-DE to start annotation and RuBIE processing on the FEX Server and to “move up” files into production space on the network.
  • Once satisfied with the annotation results, the user writes a RuBIE application file in GUI-DE to define the patterns and relationships to extract from these annotations, and saves the file to the FEX Workspace. The user can then compile the RuBIE application file on the FEX Server and apply it against the annotations to extract the targeted facts. The user can then inspect the facts to troubleshoot and further tune the script or re-visit the annotations.
  • When the user is satisfied with the performance of the Annotation Configuration and the RuBIE application file, the resulting extracted facts become available for use by the product application.
  • The primary FEX annotators preferably run on the FEX server, since annotators can be very processor- and memory-intensive. It is these annotators that are actually run by FEX when documents are processed for facts, based on parameters provided by the user. Some FEX annotators may also reside in some form independently on the FEX client.
  • Modifications and variations of the above-described embodiments of the present invention are possible, as appreciated by those skilled in the art in light of the above teachings. For example, additional attributes may be introduced that can be exploited by the RuBIE pattern recognition language, such as the results of a semantic disambiguation process. Additional discourse processing may be used to identify additional related non-contiguous tokens, such as robust coreference resolution. Information extraction application-specific annotators may also be introduced. A pharmaceutical information extraction application, for example, may require annotators that recognize and classify gene names, drug names and chemical compounds. It is therefore to be understood that, within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described.

Claims (55)

1. A fact extraction tool set for extracting information from a document, comprising:
means for annotating a text; and
means for extracting facts from the annotated text.
2. The fact extraction tool set of claim 1, wherein the means for annotating a text comprises means for assigning syntactic and semantic attributes to a text passage by at least one of parsing the text passage and applying text annotation processes other than parsing the text passage.
3. The fact extraction tool set of claim 2, wherein the means for assigning syntactic and semantic attributes to a text passage comprises means for breaking the text passage into its base tokens and annotating the base tokens and patterns of base tokens with a number of orthographic, syntactic, semantic, pragmatic and dictionary-based attributes.
4. The fact extraction tool set of claim 3, wherein the attributes include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, semantic attribute tagging and other interesting attributes of the text.
5. The fact extraction tool set of claim 2, wherein the means for assigning syntactic and semantic attributes to a text passage comprises independent annotators.
6. The fact extraction tool set of claim 5, wherein the independent annotators use XML as a basis for representing annotated text.
7. The fact extraction tool set of claim 6, further comprising means for resolving conflicting annotation boundaries in the annotated text to produce well-formed XML from the results of independent annotators.
8. The fact extraction tool set of claim 3, wherein the means for breaking the text passage into its base tokens and annotating the base tokens and patterns of base tokens comprises independent annotators, wherein the annotators are of three types comprising:
token attributes, which have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute to each base token;
constituent attributes assigned yes-no values to patterns of base tokens, where the entire pattern is considered to be a single constituent with respect to some annotation value; and
links, which assign common identifiers to coreferring and other related patterns of base tokens.
9. The fact extraction tool set of claim 3, wherein the means for annotating a text further comprises means for associating all annotations assigned to a particular piece of text with the base tokens for that text to generate aligned annotations.
10. The fact extraction tool set of claim 9, wherein the means for extracting facts comprises means for identifying and extracting potentially interesting pieces of information in the aligned annotations by finding patterns in the attributes stored by the annotators.
11. The fact extraction tool set of claim 10, wherein the means for identifying and extracting potentially interesting pieces of information comprises means for recognizing both true left and right constituent attributes and non-contiguous constituent attributes.
12. The fact extraction tool set of claim 10, wherein the means for identifying and extracting potentially interesting pieces of information comprises at least one text pattern recognition rule written in a rule-based information extraction language, wherein the at least one text pattern recognition rule queries for at least one of literal text, attributes, and relationships found in the aligned annotations to define the facts to be extracted.
13. The fact extraction tool set of claim 12, wherein the at least one text pattern recognition rule can use regular expression functionality, XPath-based functionality, and auxiliary definitions in any combination.
14. The fact extraction tool set of claim 12, wherein the at least one text pattern recognition rule comprises a pattern that describes the text of interest, a label that names the pattern for testing and debugging purposes, and an action that indicates what should be done in response to a successful match.
15. The fact extraction tool set of claim 12, wherein the means for identifying and extracting potentially interesting pieces of information further comprises at least one auxiliary definition statement used to name and define a fragment of a pattern.
16. A rule-based information extraction language for use in identifying and extracting potentially interesting pieces of information in aligned annotations in a text, comprising at least one text pattern recognition rule that queries for at least one of literal text, attributes, and relationships found in the aligned annotations to define the facts to be extracted.
17. The language of claim 16, wherein the at least one text pattern recognition rule can use regular expression functionality, XPath-based functionality, and auxiliary definitions in any combination.
18. The language of claim 16, wherein the at least one text pattern recognition rule comprises a pattern that describes the text of interest, a label that names the pattern for testing and debugging purposes, and an action that indicates what should be done in response to a successful match.
19. The language of claim 16, further comprising at least one auxiliary definition statement used to name and define a fragment of a pattern.
20. A text annotation tool comprising:
means for assigning syntactic and semantic attributes to a text passage by at least one of parsing the text passage and applying text annotation processes other than parsing the text passage, including means for breaking the text passage into its base tokens and annotating the base tokens and patterns of base tokens with a number of orthographic, syntactic, semantic, pragmatic and dictionary-based attributes; and
means for associating all annotations assigned to a particular piece of text with the base tokens for that text to generate aligned annotations.
21. The text annotation tool of claim 20, wherein the attributes include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, semantic attribute tagging and other interesting attributes of the text.
22. The text annotation tool of claim 20, wherein the means for assigning syntactic and semantic attributes to a text passage comprises independent annotators.
23. The text annotation tool of claim 22, wherein the independent annotators use XML as a basis for representing annotated text.
24. The text annotation tool of claim 23, further comprising means for resolving conflicting annotation boundaries in the annotated text to produce well-formed XML from the results of independent annotators.
25. The text annotation tool of claim 20, wherein the means for breaking the text passage into its base tokens and annotating the base tokens and patterns of base tokens comprises independent annotators, wherein the annotators are of three types comprising:
token attributes, which have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute to each base token;
constituent attributes assigned yes-no values to patterns of base tokens, where the entire pattern is considered to be a single constituent with respect to some annotation value; and
links, which assign common identifiers to coreferring and other related patterns of base tokens.
26. A computer program product for extracting information from a document, the computer program product comprising a computer usable storage medium having computer readable program code means embodied in the medium, the computer readable program code means comprising:
computer readable program code means for annotating a text; and
computer readable program code means for extracting facts from the annotated text.
27. The computer program product of claim 26, wherein the computer readable program code means for annotating a text comprises computer readable program code means for assigning syntactic and semantic attributes to a text passage by at least one of parsing the text passage and applying text annotation processes other than parsing the text passage.
28. The computer program product of claim 27, wherein the computer readable program code means for assigning syntactic and semantic attributes to a text passage comprises computer readable program code means for breaking the text passage into its base tokens and annotating the base tokens and patterns of base tokens with a number of orthographic, syntactic, semantic, pragmatic and dictionary-based attributes.
29. The computer program product of claim 28, wherein the attributes include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, semantic attribute tagging and other interesting attributes of the text.
30. The computer program product of claim 27, wherein the computer readable program code means for assigning syntactic and semantic attributes to a text passage comprises independent annotators.
31. The computer program product of claim 30, wherein the independent annotators use XML as a basis for representing annotated text.
32. The computer program product of claim 31, further comprising computer readable program code means for resolving conflicting annotation boundaries in the annotated text to produce well-formed XML from the results of independent annotators.
33. The computer program product of claim 28, wherein the computer readable program code means for breaking the text passage into its base tokens and annotating the base tokens and patterns of base tokens comprises independent annotators, wherein the annotators are of three types comprising:
token attributes, which have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute to each base token;
constituent attributes assigned yes-no values to patterns of base tokens, where the entire pattern is considered to be a single constituent with respect to some annotation value; and
links, which assign common identifiers to coreferring and other related patterns of base tokens.
34. The computer program product of claim 28, wherein the computer readable program code means for annotating a text further comprises computer readable program code means for associating all annotations assigned to a particular piece of text with the base tokens for that text to generate aligned annotations.
35. The computer program product of claim 34, wherein the computer readable program code means for extracting facts comprises computer readable program code means for identifying and extracting potentially interesting pieces of information in the aligned annotations by finding patterns in the attributes stored by the annotators.
36. The computer program product of claim 35, wherein the computer readable program code means for identifying and extracting potentially interesting pieces of information further comprises computer readable program code means for recognizing both true left and right constituent attributes and non-contiguous constituent attributes.
37. The computer program product of claim 35, wherein the computer readable program code means for identifying and extracting potentially interesting pieces of information comprises at least one text pattern recognition rule written in a rule-based information extraction language, wherein the at least one text pattern recognition rule queries for at least one of literal text, attributes, and relationships found in the aligned annotations to define the facts to be extracted.
38. The computer program product of claim 37, wherein the at least one text pattern recognition rule can use regular expression functionality, XPath-based functionality, and auxiliary definitions in any combination.
39. The computer program product of claim 37, wherein the at least one text pattern recognition rule comprises a pattern that describes the text of interest, a label that names the pattern for testing and debugging purposes, and an action that indicates what should be done in response to a successful match.
40. The computer program product of claim 37, wherein the computer readable program code means for identifying and extracting potentially interesting pieces of information further comprises at least one auxiliary definition statement used to name and define a fragment of a pattern.
41. A method of extracting information from a document, comprising the steps of:
annotating a text; and
extracting facts from the annotated text.
42. The method of claim 41, wherein the step of annotating a text comprises assigning syntactic and semantic attributes to a text passage by at least one of parsing the text passage and applying text annotation processes other than parsing the text passage.
43. The method of claim 42, wherein the parsing of the text passage comprises breaking it into its base tokens and annotating the base tokens and patterns of base tokens with a number of orthographic, syntactic, semantic, pragmatic and dictionary-based attributes.
44. The method of claim 43, wherein the attributes include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, semantic attribute tagging and other interesting attributes of the text.
45. The method of claim 42, wherein the parsing of the text passage is carried out by independent annotators.
46. The method of claim 45, wherein the independent annotators use XML as a basis for representing annotated text.
47. The method of claim 46, further comprising the step of resolving conflicting annotation boundaries in the annotated text to produce well-formed XML from the results of independent annotators.
48. The method of claim 43, wherein the step of breaking the text passage into its base tokens and annotating the base tokens and patterns of base tokens is carried out using independent annotators, wherein the annotators are of three types comprising:
token attributes, which have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute to each base token;
constituent attributes assigned yes-no values to patterns of base tokens, where the entire pattern is considered to be a single constituent with respect to some annotation value; and
links, which assign common identifiers to coreferring and other related patterns of base tokens.
49. The method of claim 43, wherein the step of annotating a text further comprises the step of associating all annotations assigned to a particular piece of text with the base tokens for that text to generate aligned annotations.
50. The method of claim 49, wherein the step of extracting facts comprises identifying and extracting potentially interesting pieces of information in the aligned annotations by finding patterns in the attributes stored by the annotators.
51. The method of claim 50, wherein the step of identifying and extracting potentially interesting pieces of information comprises recognizing both true left and right constituent attributes and non-contiguous constituent attributes.
52. The method of claim 50, wherein the patterns are found using at least one text pattern recognition rule written in a rule-based information extraction language, wherein the at least one text pattern recognition rule queries for at least one of literal text, attributes, and relationships found in the aligned annotations to define the facts to be extracted.
53. The method of claim 52, wherein the at least one text pattern recognition rule can use regular expression functionality, XPath-based functionality, and auxiliary definitions in any combination.
54. The method of claim 52, wherein the at least one text pattern recognition rule describes the text of interest, names the pattern for testing and debugging purposes, and indicates what should be done in response to a successful match.
55. The method of claim 52, wherein the patterns are found further using at least one auxiliary definition statement used to name and define a fragment of a pattern.
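The annotation types and rule structure recited in the claims can be pictured with a short sketch. The code below is only an illustration under stated assumptions, not the RuBIE language or the claimed implementation: the `Token` and `Rule` classes, the helper predicates, `apply_rule`, and the sample sentence and attribute names are all hypothetical. Each base token carries its token attributes, the constituents that cover it, and any link identifiers (the aligned annotations); a rule bundles a label, a pattern, and an action.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Token:
    """A base token with all annotations aligned to it (hypothetical model)."""
    text: str
    attrs: dict = field(default_factory=dict)        # token attributes, e.g. {"pos": "NNP"}
    constituents: set = field(default_factory=set)   # constituent attributes covering this token
    links: set = field(default_factory=set)          # identifiers shared by related tokens


@dataclass
class Rule:
    """A text pattern recognition rule: a label, a pattern, and an action."""
    label: str          # names the pattern for testing and debugging
    pattern: list       # one predicate per consecutive base token
    action: Callable    # called with the matched tokens on success


def has_constituent(name):
    """Predicate: the token is covered by the named constituent."""
    return lambda tok: name in tok.constituents


def has_attr(key, value):
    """Predicate: the token attribute `key` equals `value`."""
    return lambda tok: tok.attrs.get(key) == value


def apply_rule(rule, tokens):
    """Slide the pattern over the tokens and run the action on every match."""
    results = []
    width = len(rule.pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(pred(tok) for pred, tok in zip(rule.pattern, window)):
            results.append(rule.action(window))
    return results


# Hypothetical aligned annotations for the sentence "Acme acquired Widgetco."
tokens = [
    Token("Acme", {"pos": "NNP"}, {"COMPANY"}),
    Token("acquired", {"pos": "VBD", "lemma": "acquire"}),
    Token("Widgetco", {"pos": "NNP"}, {"COMPANY"}),
    Token(".", {"pos": "."}),
]

acquisition = Rule(
    label="acquisition",
    pattern=[has_constituent("COMPANY"), has_attr("lemma", "acquire"), has_constituent("COMPANY")],
    action=lambda m: {"rule": "acquisition", "acquirer": m[0].text, "target": m[2].text},
)

facts = apply_rule(acquisition, tokens)
```

The sketch matches only fixed-width, contiguous token sequences. Regular expression operators, XPath-style queries over parse trees, auxiliary definitions, and matching across linked non-contiguous tokens would each need additional machinery that is not shown here.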
US10/716,202 2003-11-19 2003-11-19 Extraction of facts from text Abandoned US20050108630A1 (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
US10/716,202 US20050108630A1 (en) 2003-11-19 2003-11-19 Extraction of facts from text
EP04796351A EP1695170A4 (en) 2003-11-19 2004-10-26 Extraction of facts from text
AU2004294094A AU2004294094B2 (en) 2003-11-19 2004-10-26 Extraction of facts from text
PCT/US2004/035359 WO2005052727A2 (en) 2003-11-19 2004-10-26 Extraction of facts from text
CA2546896A CA2546896C (en) 2003-11-19 2004-10-26 Extraction of facts from text
NZ547871A NZ547871A (en) 2003-11-19 2004-10-26 Extraction of facts from text
US12/689,629 US7912705B2 (en) 2003-11-19 2010-01-19 System and method for extracting information from text using text annotation and fact extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/716,202 US20050108630A1 (en) 2003-11-19 2003-11-19 Extraction of facts from text

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/689,629 Continuation US7912705B2 (en) 2003-11-19 2010-01-19 System and method for extracting information from text using text annotation and fact extraction

Publications (1)

Publication Number Publication Date
US20050108630A1 true US20050108630A1 (en) 2005-05-19

Family

ID=34574367

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/716,202 Abandoned US20050108630A1 (en) 2003-11-19 2003-11-19 Extraction of facts from text
US12/689,629 Expired - Lifetime US7912705B2 (en) 2003-11-19 2010-01-19 System and method for extracting information from text using text annotation and fact extraction

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/689,629 Expired - Lifetime US7912705B2 (en) 2003-11-19 2010-01-19 System and method for extracting information from text using text annotation and fact extraction

Country Status (6)

Country Link
US (2) US20050108630A1 (en)
EP (1) EP1695170A4 (en)
AU (1) AU2004294094B2 (en)
CA (1) CA2546896C (en)
NZ (1) NZ547871A (en)
WO (1) WO2005052727A2 (en)

Cited By (194)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040225646A1 (en) * 2002-11-28 2004-11-11 Miki Sasaki Numerical expression retrieving device
US20050055343A1 (en) * 2003-09-04 2005-03-10 Krishnamurthy Sanjay M. Storing XML documents efficiently in an RDBMS
US20050066271A1 (en) * 2002-06-28 2005-03-24 Nippon Telegraph And Telephone Corporation Extraction of information from structured documents
US20050228818A1 (en) * 2004-04-09 2005-10-13 Ravi Murthy Method and system for flexible sectioning of XML data in a database system
US20050228768A1 (en) * 2004-04-09 2005-10-13 Ashish Thusoo Mechanism for efficiently evaluating operator trees
US20050228791A1 (en) * 2004-04-09 2005-10-13 Ashish Thusoo Efficient queribility and manageability of an XML index with path subsetting
US20050237227A1 (en) * 2004-04-27 2005-10-27 International Business Machines Corporation Mention-synchronous entity tracking system and method for chaining mentions
US20050278613A1 (en) * 2004-06-09 2005-12-15 Nec Corporation Topic analyzing method and apparatus and program therefor
US20060129584A1 (en) * 2004-12-15 2006-06-15 Thuvan Hoang Performing an action in response to a file system event
US20060184551A1 (en) * 2004-07-02 2006-08-17 Asha Tarachandani Mechanism for improving performance on XML over XML data using path subsetting
US20060200556A1 (en) * 2004-12-29 2006-09-07 Scott Brave Method and apparatus for identifying, extracting, capturing, and leveraging expertise and knowledge
US20070016604A1 (en) * 2005-07-18 2007-01-18 Ravi Murthy Document level indexes for efficient processing in multiple tiers of a computer system
US20070016605A1 (en) * 2005-07-18 2007-01-18 Ravi Murthy Mechanism for computing structural summaries of XML document collections in a database system
US20070022131A1 (en) * 2003-03-24 2007-01-25 Duncan Gregory L Production of documents
US20070067320A1 (en) * 2005-09-20 2007-03-22 International Business Machines Corporation Detecting relationships in unstructured text
US20070088734A1 (en) * 2005-10-14 2007-04-19 International Business Machines Corporation System and method for exploiting semantic annotations in executing keyword queries over a collection of text documents
US20070112803A1 (en) * 2005-11-14 2007-05-17 Pettovello Primo M Peer-to-peer semantic indexing
US20070143317A1 (en) * 2004-12-30 2007-06-21 Andrew Hogue Mechanism for managing facts in a fact repository
US20070143282A1 (en) * 2005-03-31 2007-06-21 Betz Jonathan T Anchor text summarization for corroboration
US20070150464A1 (en) * 2005-12-27 2007-06-28 Scott Brave Method and apparatus for predicting destinations in a navigation context based upon observed usage patterns
US20070174309A1 (en) * 2006-01-18 2007-07-26 Pettovello Primo M Mtreeini: intermediate nodes and indexes
US20070185837A1 (en) * 2006-02-09 2007-08-09 Microsoft Corporation Detection of lists in vector graphics documents
US20070198480A1 (en) * 2006-02-17 2007-08-23 Hogue Andrew W Query language
US20070214134A1 (en) * 2006-03-09 2007-09-13 Microsoft Corporation Data parsing with annotated patterns
US20070276792A1 (en) * 2006-05-25 2007-11-29 Asha Tarachandani Isolation for applications working on shared XML data
US20080027888A1 (en) * 2006-07-31 2008-01-31 Microsoft Corporation Optimization of fact extraction using a multi-stage approach
US20080065979A1 (en) * 2004-11-12 2008-03-13 Justsystems Corporation Document Processing Device, and Document Processing Method
US20080065646A1 (en) * 2006-09-08 2008-03-13 Microsoft Corporation Enabling access to aggregated software security information
US20080071806A1 (en) * 2006-09-20 2008-03-20 Microsoft Corporation Difference analysis for electronic data interchange (edi) data dictionary
US20080071817A1 (en) * 2006-09-20 2008-03-20 Microsoft Corporation Electronic data interchange (edi) data dictionary management and versioning system
US20080072160A1 (en) * 2006-09-20 2008-03-20 Microsoft Corporation Electronic data interchange transaction set definition based instance editing
US20080126385A1 (en) * 2006-09-19 2008-05-29 Microsoft Corporation Intelligent batching of electronic data interchange messages
US20080126386A1 (en) * 2006-09-20 2008-05-29 Microsoft Corporation Translation of electronic data interchange messages to extensible markup language representation(s)
US20080140679A1 (en) * 2006-12-11 2008-06-12 Microsoft Corporation Relational linking among resoures
US20080162449A1 (en) * 2006-12-28 2008-07-03 Chen Chao-Yu Dynamic page similarity measurement
US20080168081A1 (en) * 2007-01-09 2008-07-10 Microsoft Corporation Extensible schemas and party configurations for edi document generation or validation
US20080168109A1 (en) * 2007-01-09 2008-07-10 Microsoft Corporation Automatic map updating based on schema changes
US20080262927A1 (en) * 2007-04-19 2008-10-23 Hiroshi Kanayama System, method, and program for selecting advertisements
US20080270887A1 (en) * 2004-11-12 2008-10-30 Justsystems Corporation Document Processing Device And Document Processing Method
US20080300862A1 (en) * 2007-06-01 2008-12-04 Xerox Corporation Authoring system
US20080320411A1 (en) * 2007-06-21 2008-12-25 Yen-Fu Chen Method of text type-ahead
US20090007271A1 (en) * 2007-06-28 2009-01-01 Microsoft Corporation Identifying attributes of aggregated data
US20090007272A1 (en) * 2007-06-28 2009-01-01 Microsoft Corporation Identifying data associated with security issue attributes
US20090037355A1 (en) * 2004-12-29 2009-02-05 Scott Brave Method and Apparatus for Context-Based Content Recommendation
US20090063426A1 (en) * 2007-08-31 2009-03-05 Powerset, Inc. Identification of semantic relationships within reported speech
US20090063550A1 (en) * 2007-08-31 2009-03-05 Powerset, Inc. Fact-based indexing for natural language search
US20090063473A1 (en) * 2007-08-31 2009-03-05 Powerset, Inc. Indexing role hierarchies for words in a search index
US20090070308A1 (en) * 2007-08-31 2009-03-12 Powerset, Inc. Checkpointing Iterators During Search
US20090070706A1 (en) * 2007-09-12 2009-03-12 Google Inc. Placement Attribute Targeting
US20090070322A1 (en) * 2007-08-31 2009-03-12 Powerset, Inc. Browsing knowledge on the basis of semantic relations
US20090076799A1 (en) * 2007-08-31 2009-03-19 Powerset, Inc. Coreference Resolution In An Ambiguity-Sensitive Natural Language Processing System
US20090077069A1 (en) * 2007-08-31 2009-03-19 Powerset, Inc. Calculating Valence Of Expressions Within Documents For Searching A Document Index
US20090089047A1 (en) * 2007-08-31 2009-04-02 Powerset, Inc. Natural Language Hypernym Weighting For Word Sense Disambiguation
US20090094267A1 (en) * 2007-10-04 2009-04-09 Muguda Naveenkumar V System and Method for Implementing Metadata Extraction of Artifacts from Associated Collaborative Discussions on a Data Processing System
US20090094019A1 (en) * 2007-08-31 2009-04-09 Powerset, Inc. Efficiently Representing Word Sense Probabilities
US20090125542A1 (en) * 2007-11-14 2009-05-14 Sap Ag Systems and Methods for Modular Information Extraction
US20090132521A1 (en) * 2007-08-31 2009-05-21 Powerset, Inc. Efficient Storage and Retrieval of Posting Lists
US20090138454A1 (en) * 2007-08-31 2009-05-28 Powerset, Inc. Semi-Automatic Example-Based Induction of Semantic Translation Rules to Support Natural Language Search
US20090157385A1 (en) * 2007-12-14 2009-06-18 Nokia Corporation Inverse Text Normalization
US20090172517A1 (en) * 2007-12-27 2009-07-02 Kalicharan Bhagavathi P Document parsing method and system using web-based GUI software
US20090182741A1 (en) * 2008-01-16 2009-07-16 International Business Machines Corporation Systems and Arrangements of Text Type-Ahead
US20090187567A1 (en) * 2008-01-18 2009-07-23 Citation Ware Llc System and method for determining valid citation patterns in electronic documents
US20090198646A1 (en) * 2008-01-31 2009-08-06 International Business Machines Corporation Systems, methods and computer program products for an algebraic approach to rule-based information extraction
US20090217243A1 (en) * 2008-02-26 2009-08-27 Hitachi, Ltd. Automatic software configuring system
US20090240487A1 (en) * 2008-03-20 2009-09-24 Libin Shen Machine translation
US20090271700A1 (en) * 2008-04-28 2009-10-29 Yen-Fu Chen Text type-ahead
US20090282401A1 (en) * 2008-05-09 2009-11-12 Mariela Todorova Deploying software modules in computer system
US20100083095A1 (en) * 2008-09-29 2010-04-01 Nikovski Daniel N Method for Extracting Data from Web Pages
US20100121631A1 (en) * 2008-11-10 2010-05-13 Olivier Bonnet Data detection
US20100131534A1 (en) * 2007-04-10 2010-05-27 Toshio Takeda Information providing system
US20100138211A1 (en) * 2008-12-02 2010-06-03 Microsoft Corporation Adaptive web mining of bilingual lexicon
US20100198756A1 (en) * 2009-01-30 2010-08-05 Zhang ling qin Methods and systems for matching records and normalizing names
US20100250235A1 (en) * 2009-03-24 2010-09-30 Microsoft Corporation Text analysis using phrase definitions and containers
US20100324885A1 (en) * 2009-06-22 2010-12-23 Computer Associates Think, Inc. INDEXING MECHANISM (Nth PHRASAL INDEX) FOR ADVANCED LEVERAGING FOR TRANSLATION
US20110035390A1 (en) * 2009-08-05 2011-02-10 Loglogic, Inc. Message Descriptions
US20110047153A1 (en) * 2005-05-31 2011-02-24 Betz Jonathan T Identifying the Unifying Subject of a Set of Facts
GB2475151A (en) * 2009-11-06 2011-05-11 Symantec Corp Indexing data for use by multiple applications by extracting tokens from data objects
US7958164B2 (en) 2006-02-16 2011-06-07 Microsoft Corporation Visual design of annotated regular expression
US20110145240A1 (en) * 2009-12-15 2011-06-16 International Business Machines Corporation Organizing Annotations
US7970766B1 (en) 2007-07-23 2011-06-28 Google Inc. Entity type assignment
US7991768B2 (en) 2007-11-08 2011-08-02 Oracle International Corporation Global query normalization to improve XML index based rewrites for path subsetted index
US20110221367A1 (en) * 2010-03-11 2011-09-15 Gm Global Technology Operations, Inc. Methods, systems and apparatus for overmodulation of a five-phase machine
US20110270856A1 (en) * 2010-04-30 2011-11-03 International Business Machines Corporation Managed document research domains
US20110295864A1 (en) * 2010-05-29 2011-12-01 Martin Betz Iterative fact-extraction
US20120016676A1 (en) * 2010-07-15 2012-01-19 King Abdulaziz City For Science And Technology System and method for writing digits in words and pronunciation of numbers, fractions, and units
US8122026B1 (en) * 2006-10-20 2012-02-21 Google Inc. Finding and disambiguating references to entities on web pages
US20120065959A1 (en) * 2010-09-13 2012-03-15 Richard Salisbury Word graph
US8150695B1 (en) 2009-06-18 2012-04-03 Amazon Technologies, Inc. Presentation of written works based on character identities and attributes
US20120102031A1 (en) * 2010-10-20 2012-04-26 Sap Ag Apparatus and method for entity expansion and grouping
US20120117077A1 (en) * 2006-02-17 2012-05-10 Tom Ritchford Annotation Framework
US8239349B2 (en) 2010-10-07 2012-08-07 Hewlett-Packard Development Company, L.P. Extracting data
CN102646128A (en) * 2012-03-06 2012-08-22 北京航空航天大学 Method for labeling word properties of emotional words based on extensible markup language (XML)
US8260785B2 (en) 2006-02-17 2012-09-04 Google Inc. Automatic object reference identification and linking in a browseable fact repository
AU2008292779B2 (en) * 2007-08-31 2012-09-06 Microsoft Technology Licensing, Llc Coreference resolution in an ambiguity-sensitive natural language processing system
US20120233534A1 (en) * 2011-03-11 2012-09-13 Microsoft Corporation Validation, rejection, and modification of automatically generated document annotations
US20120246176A1 (en) * 2011-03-24 2012-09-27 Sony Corporation Information processing apparatus, information processing method, and program
US20120253793A1 (en) * 2011-04-01 2012-10-04 Rima Ghannam System for natural language understanding
US8325974B1 (en) 2009-03-31 2012-12-04 Amazon Technologies Inc. Recognition of characters and their significance within written works
US8347202B1 (en) 2007-03-14 2013-01-01 Google Inc. Determining geographic locations for place names in a fact repository
US20130042200A1 (en) * 2011-08-08 2013-02-14 The Original Software Group Limited System and method for annotating graphical user interface
US20130091145A1 (en) * 2011-10-07 2013-04-11 Electronics And Telecommunications Research Institute Method and apparatus for analyzing web trends based on issue template extraction
US20130110818A1 (en) * 2011-10-28 2013-05-02 Eamonn O'Brien-Strain Profile driven extraction
US8463789B1 (en) 2010-03-23 2013-06-11 Firstrain, Inc. Event detection
WO2013159156A1 (en) * 2012-04-27 2013-10-31 Citadel Corporation Pty Ltd Method for storing and applying related sets of pattern/message rules
US20130297999A1 (en) * 2012-05-07 2013-11-07 Sap Ag Document Text Processing Using Edge Detection
US20130346856A1 (en) * 2010-05-13 2013-12-26 Expedia, Inc. Systems and methods for automated content generation
US8631028B1 (en) 2009-10-29 2014-01-14 Primo M. Pettovello XPath query processing improvements
US8650175B2 (en) 2005-03-31 2014-02-11 Google Inc. User interface for facts query engine with snippets from information sources that include query terms and answer terms
US8682913B1 (en) 2005-03-31 2014-03-25 Google Inc. Corroborating facts extracted from multiple sources
WO2014049186A1 (en) * 2012-09-26 2014-04-03 Universidad Carlos Iii De Madrid Method for generating semantic patterns
US8694510B2 (en) 2003-09-04 2014-04-08 Oracle International Corporation Indexing XML documents efficiently
US8738360B2 (en) 2008-06-06 2014-05-27 Apple Inc. Data detection of a character sequence having multiple possible data types
US20140164996A1 (en) * 2012-12-11 2014-06-12 Canon Kabushiki Kaisha Apparatus, method, and storage medium
US8782042B1 (en) 2011-10-14 2014-07-15 Firstrain, Inc. Method and system for identifying entities
US8781815B1 (en) * 2013-12-05 2014-07-15 Seal Software Ltd. Non-standard and standard clause detection
US8805834B2 (en) 2010-05-26 2014-08-12 International Business Machines Corporation Extensible system and method for information extraction in a data processing system
US8805840B1 (en) 2010-03-23 2014-08-12 Firstrain, Inc. Classification of documents
US8812435B1 (en) * 2007-11-16 2014-08-19 Google Inc. Learning objects and facts from documents
US20140236570A1 (en) * 2013-02-18 2014-08-21 Microsoft Corporation Exploiting the semantic web for unsupervised spoken language understanding
US8832092B2 (en) 2012-02-17 2014-09-09 Bottlenose, Inc. Natural language processing optimized for micro content
US20140278365A1 (en) * 2013-03-12 2014-09-18 Guangsheng Zhang System and methods for determining sentiment based on context
US20140358883A1 (en) * 2008-09-08 2014-12-04 Semanti Inc. Semantically associated text index and the population and use thereof
US8909569B2 (en) 2013-02-22 2014-12-09 Bottlenose, Inc. System and method for revealing correlations between data streams
US20150012262A1 (en) * 2007-06-27 2015-01-08 Abbyy Infopoisk Llc Method and system for generating new entries in natural language dictionary
US8954412B1 (en) 2006-09-28 2015-02-10 Google Inc. Corroborating facts in electronic documents
US8977613B1 (en) 2012-06-12 2015-03-10 Firstrain, Inc. Generation of recurring searches
US8990097B2 (en) 2012-07-31 2015-03-24 Bottlenose, Inc. Discovering and ranking trending links about topics
US8996470B1 (en) 2005-05-31 2015-03-31 Google Inc. System for ensuring the internal consistency of a fact repository
US20150142842A1 (en) * 2005-07-25 2015-05-21 Splunk Inc. Uniform storage and search of events derived from machine data from different sources
US9147271B2 (en) 2006-09-08 2015-09-29 Microsoft Technology Licensing, Llc Graphical representation of aggregated data
US9171100B2 (en) 2004-09-22 2015-10-27 Primo M. Pettovello MTree an XPath multi-axis structure threaded index
US9201868B1 (en) * 2011-12-09 2015-12-01 Guangsheng Zhang System, methods and user interface for identifying and presenting sentiment information
US9218568B2 (en) 2013-03-15 2015-12-22 Business Objects Software Ltd. Disambiguating data using contextual and historical information
US9262550B2 (en) 2013-03-15 2016-02-16 Business Objects Software Ltd. Processing semi-structured data
US9299041B2 (en) 2013-03-15 2016-03-29 Business Objects Software Ltd. Obtaining data from unstructured data for a structured data collection
US9348806B2 (en) * 2014-09-30 2016-05-24 International Business Machines Corporation High speed dictionary expansion
US20160203115A1 (en) * 2007-07-10 2016-07-14 International Business Machines Corporation Intelligent text annotation
US20160203233A1 (en) * 2015-01-12 2016-07-14 Microsoft Technology Licensing, Llc Storage and retrieval of structured content in unstructured user-editable content stores
US9495358B2 (en) 2006-10-10 2016-11-15 Abbyy Infopoisk Llc Cross-language text clustering
US9530229B2 (en) 2006-01-27 2016-12-27 Google Inc. Data object visualization using graphs
US20170011023A1 (en) * 2015-07-07 2017-01-12 Rima Ghannam System for Natural Language Understanding
US9614807B2 (en) 2011-02-23 2017-04-04 Bottlenose, Inc. System and method for analyzing messages in a network or across networks
US20170097988A1 (en) * 2015-10-05 2017-04-06 International Business Machines Corporation Hierarchical Target Centric Pattern Generation
US9626353B2 (en) 2014-01-15 2017-04-18 Abbyy Infopoisk Llc Arc filtering in a syntactic graph
US9626358B2 (en) 2014-11-26 2017-04-18 Abbyy Infopoisk Llc Creating ontologies by analyzing natural language texts
US9639818B2 (en) 2013-08-30 2017-05-02 Sap Se Creation of event types for news mining for enterprise resource planning
US9665454B2 (en) 2014-05-14 2017-05-30 International Business Machines Corporation Extracting test model from textual test suite
US9665617B1 (en) * 2014-04-16 2017-05-30 Google Inc. Methods and systems for generating a stable identifier for nodes likely including primary content within an information resource
US9747280B1 (en) * 2013-08-21 2017-08-29 Intelligent Language, LLC Date and time processing
US9805025B2 (en) 2015-07-13 2017-10-31 Seal Software Limited Standard exact clause detection
CN107342881A (en) * 2016-05-03 2017-11-10 中国移动通信集团四川有限公司 A kind of operation maintenance center's north direction interface data processing method and processing device
US20170344625A1 (en) * 2016-05-27 2017-11-30 International Business Machines Corporation Obtaining of candidates for a relationship type and its label
US9836765B2 (en) 2014-05-19 2017-12-05 Kibo Software, Inc. System and method for context-aware recommendation through user activity change detection
US9842161B2 (en) * 2016-01-12 2017-12-12 International Business Machines Corporation Discrepancy curator for documents in a corpus of a cognitive computing system
US9870356B2 (en) 2014-02-13 2018-01-16 Microsoft Technology Licensing, Llc Techniques for inferring the unknown intents of linguistic items
US20180018322A1 (en) * 2016-07-15 2018-01-18 Intuit Inc. System and method for automatically understanding lines of compliance forms through natural language patterns
US9898467B1 (en) * 2013-09-24 2018-02-20 Amazon Technologies, Inc. System for data normalization
US9898523B2 (en) 2013-04-22 2018-02-20 Abb Research Ltd. Tabular data parsing in document(s)
US10002117B1 (en) * 2013-10-24 2018-06-19 Google Llc Translating annotation tags into suggested markup
US10019437B2 (en) * 2015-02-23 2018-07-10 International Business Machines Corporation Facilitating information extraction via semantic abstraction
US10073840B2 (en) 2013-12-20 2018-09-11 Microsoft Technology Licensing, Llc Unsupervised relation detection model training
US10148547B2 (en) * 2014-10-24 2018-12-04 Tektronix, Inc. Hardware trigger generation from a declarative protocol description
RU2674331C2 (en) * 2014-09-03 2018-12-06 Дзе Дан Энд Брэдстрит Корпорейшн System and process for analysis, qualification and acquisition of sources of unstructured data by means of empirical attribution
US10191975B1 (en) * 2017-11-16 2019-01-29 The Florida International University Board Of Trustees Features for automatic classification of narrative point of view and diegesis
US20190065453A1 (en) * 2017-08-25 2019-02-28 Abbyy Development Llc Reconstructing textual annotations associated with information objects
US10235358B2 (en) 2013-02-21 2019-03-19 Microsoft Technology Licensing, Llc Exploiting structured content for unsupervised natural language semantic parsing
US10296573B2 (en) * 2012-11-16 2019-05-21 International Business Machines Corporation Building and maintaining information extraction rules
US10592480B1 (en) 2012-12-30 2020-03-17 Aurea Software, Inc. Affinity scoring
US10685189B2 (en) * 2016-11-17 2020-06-16 Goldman Sachs & Co. LLC System and method for coupled detection of syntax and semantics for natural language understanding and generation
US10725896B2 (en) 2016-07-15 2020-07-28 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on code coverage
US10778712B2 (en) 2015-08-01 2020-09-15 Splunk Inc. Displaying network security events and investigation activities across investigation timelines
US10848510B2 (en) 2015-08-01 2020-11-24 Splunk Inc. Selecting network security event investigation timelines in a workflow environment
CN112035408A (en) * 2020-09-01 2020-12-04 文思海辉智科科技有限公司 Text processing method and device, electronic equipment and storage medium
US10878020B2 (en) * 2017-01-27 2020-12-29 Hootsuite Media Inc. Automated extraction tools and their use in social content tagging systems
US20210042468A1 (en) * 2007-04-13 2021-02-11 Optum360, Llc Mere-parsing with boundary and semantic driven scoping
US10942958B2 (en) 2015-05-27 2021-03-09 International Business Machines Corporation User interface for a query answering system
US20210081602A1 (en) * 2019-09-16 2021-03-18 Docugami, Inc. Automatically Identifying Chunks in Sets of Documents
CN112819622A (en) * 2021-01-26 2021-05-18 深圳价值在线信息科技股份有限公司 Information entity relationship joint extraction method and device and terminal equipment
US11030402B2 (en) 2019-05-03 2021-06-08 International Business Machines Corporation Dictionary expansion using neural language models
US11030227B2 (en) 2015-12-11 2021-06-08 International Business Machines Corporation Discrepancy handler for document ingestion into a corpus for a cognitive computing system
US11048864B2 (en) * 2019-04-01 2021-06-29 Adobe Inc. Digital annotation and digital content linking techniques
US11049190B2 (en) 2016-07-15 2021-06-29 Intuit Inc. System and method for automatically generating calculations for fields in compliance forms
US11074286B2 (en) 2016-01-12 2021-07-27 International Business Machines Corporation Automated curation of documents in a corpus for a cognitive computing system
US11132111B2 (en) 2015-08-01 2021-09-28 Splunk Inc. Assigning workflow network security investigation actions to investigation timelines
US11140115B1 (en) * 2014-12-09 2021-10-05 Google Llc Systems and methods of applying semantic features for machine learning of message categories
US11163956B1 (en) 2019-05-23 2021-11-02 Intuit Inc. System and method for recognizing domain specific named entities using domain specific word embeddings
US20210403036A1 (en) * 2020-06-30 2021-12-30 Lyft, Inc. Systems and methods for encoding and searching scenario information
US11222266B2 (en) 2016-07-15 2022-01-11 Intuit Inc. System and method for automatic learning of functions
US11263408B2 (en) * 2018-03-13 2022-03-01 Fujitsu Limited Alignment generation device and alignment generation method
US20220101873A1 (en) * 2020-09-30 2022-03-31 Harman International Industries, Incorporated Techniques for providing feedback on the veracity of spoken statements
US20220245183A1 (en) * 2019-05-31 2022-08-04 Nec Corporation Parameter learning apparatus, parameter learning method, and computer readable recording medium
US11520975B2 (en) 2016-07-15 2022-12-06 Intuit Inc. Lean parsing: a natural language processing system and method for parsing domain-specific languages
US11783128B2 (en) 2020-02-19 2023-10-10 Intuit Inc. Financial document text conversion to computer readable operations
US11861294B2 (en) * 2013-09-10 2024-01-02 Embarcadero Technologies, Inc. Syndication of associations relating data and metadata

Families Citing this family (121)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8079086B1 (en) 1997-11-06 2011-12-13 Finjan, Inc. Malicious mobile code runtime monitoring system and methods
US7058822B2 (en) 2000-03-30 2006-06-06 Finjan Software, Ltd. Malicious mobile code runtime monitoring system and methods
US9219755B2 (en) 1996-11-08 2015-12-22 Finjan, Inc. Malicious mobile code runtime monitoring system and methods
US7975305B2 (en) * 1997-11-06 2011-07-05 Finjan, Inc. Method and system for adaptive rule-based content scanners for desktop computers
US8225408B2 (en) * 1997-11-06 2012-07-17 Finjan, Inc. Method and system for adaptive rule-based content scanners
US8656039B2 (en) 2003-12-10 2014-02-18 Mcafee, Inc. Rule parser
US8548170B2 (en) 2003-12-10 2013-10-01 Mcafee, Inc. Document de-registration
US7984175B2 (en) 2003-12-10 2011-07-19 Mcafee, Inc. Method and apparatus for data capture and analysis system
US8560534B2 (en) 2004-08-23 2013-10-15 Mcafee, Inc. Database for a capture system
US7949849B2 (en) 2004-08-24 2011-05-24 Mcafee, Inc. File system for a capture system
US9195766B2 (en) * 2004-12-14 2015-11-24 Google Inc. Providing useful information associated with an item in a document
US7907608B2 (en) 2005-08-12 2011-03-15 Mcafee, Inc. High speed packet capture
US7818326B2 (en) 2005-08-31 2010-10-19 Mcafee, Inc. System and method for word indexing in a capture system and querying thereof
US7730011B1 (en) 2005-10-19 2010-06-01 Mcafee, Inc. Attributes of captured objects in a capture system
US7487174B2 (en) * 2006-01-17 2009-02-03 International Business Machines Corporation Method for storing text annotations with associated type information in a structured data store
US8122019B2 (en) * 2006-02-17 2012-02-21 Google Inc. Sharing user distributed search results
US8862572B2 (en) * 2006-02-17 2014-10-14 Google Inc. Sharing user distributed search results
US7844603B2 (en) * 2006-02-17 2010-11-30 Google Inc. Sharing user distributed search results
US7949538B2 (en) * 2006-03-14 2011-05-24 A-Life Medical, Inc. Automated interpretation of clinical encounters with cultural cues
US8504537B2 (en) 2006-03-24 2013-08-06 Mcafee, Inc. Signature distribution in a document registration system
US8731954B2 (en) * 2006-03-27 2014-05-20 A-Life Medical, Llc Auditing the coding and abstracting of documents
US7958227B2 (en) 2006-05-22 2011-06-07 Mcafee, Inc. Attributes of captured objects in a capture system
US8996979B2 (en) 2006-06-08 2015-03-31 West Services, Inc. Document automation systems
US9098489B2 (en) 2006-10-10 2015-08-04 Abbyy Infopoisk Llc Method and system for semantic searching
US9069750B2 (en) 2006-10-10 2015-06-30 Abbyy Infopoisk Llc Method and system for semantic searching of natural language texts
US9075864B2 (en) 2006-10-10 2015-07-07 Abbyy Infopoisk Llc Method and system for semantic searching using syntactic and semantic analysis
US9892111B2 (en) 2006-10-10 2018-02-13 Abbyy Production Llc Method and device to estimate similarity between documents having multiple segments
US9189482B2 (en) 2012-10-10 2015-11-17 Abbyy Infopoisk Llc Similar document search
US8682823B2 (en) * 2007-04-13 2014-03-25 A-Life Medical, Llc Multi-magnitudinal vectors with resolution based on source vector features
US9946846B2 (en) 2007-08-03 2018-04-17 A-Life Medical, Llc Visualizing the documentation and coding of surgical procedures
KR101475339B1 (en) * 2008-04-14 2014-12-23 삼성전자주식회사 Communication terminal and method for unified natural language interface thereof
US8205242B2 (en) 2008-07-10 2012-06-19 Mcafee, Inc. System and method for data mining and security policy management
US9253154B2 (en) 2008-08-12 2016-02-02 Mcafee, Inc. Configuration management for a capture/registration system
TW201027375A (en) * 2008-10-20 2010-07-16 Ibm Search system, search method and program
US8850591B2 (en) 2009-01-13 2014-09-30 Mcafee, Inc. System and method for concept building
US8706709B2 (en) 2009-01-15 2014-04-22 Mcafee, Inc. System and method for intelligent term grouping
US8473442B1 (en) 2009-02-25 2013-06-25 Mcafee, Inc. System and method for intelligent state management
US8447722B1 (en) 2009-03-25 2013-05-21 Mcafee, Inc. System and method for data mining and security policy management
US8667121B2 (en) 2009-03-25 2014-03-04 Mcafee, Inc. System and method for managing data and policies
US8073718B2 (en) * 2009-05-29 2011-12-06 Hyperquest, Inc. Automation of auditing claims
US8346577B2 (en) * 2009-05-29 2013-01-01 Hyperquest, Inc. Automation of auditing claims
US8447632B2 (en) * 2009-05-29 2013-05-21 Hyperquest, Inc. Automation of auditing claims
US8255205B2 (en) * 2009-05-29 2012-08-28 Hyperquest, Inc. Automation of auditing claims
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
US9047283B1 (en) * 2010-01-29 2015-06-02 Guangsheng Zhang Automated topic discovery in documents and content categorization
JP5656585B2 (en) * 2010-02-17 2015-01-21 キヤノン株式会社 Document creation support apparatus, document creation support method, and program
US9460232B2 (en) * 2010-04-07 2016-10-04 Oracle International Corporation Searching document object model elements by attribute order priority
US8538916B1 (en) 2010-04-09 2013-09-17 Google Inc. Extracting instance attributes from text
JP2011232871A (en) * 2010-04-26 2011-11-17 Sony Corp Information processor, text selection method and program
GB201010545D0 (en) * 2010-06-23 2010-08-11 Rolls Royce Plc Entity recognition
US8527488B1 (en) * 2010-07-08 2013-09-03 Netlogic Microsystems, Inc. Negative regular expression search operations
TWI403304B (en) * 2010-08-27 2013-08-01 Ind Tech Res Inst Method and mobile device for awareness of linguistic ability
US20120101980A1 (en) * 2010-10-26 2012-04-26 Microsoft Corporation Synchronizing online document edits
US9015033B2 (en) * 2010-10-26 2015-04-21 At&T Intellectual Property I, L.P. Method and apparatus for detecting a sentiment of short messages
US8806615B2 (en) * 2010-11-04 2014-08-12 Mcafee, Inc. System and method for protecting specified data combinations
JP5197774B2 (en) * 2011-01-18 2013-05-15 株式会社東芝 Learning device, determination device, learning method, determination method, learning program, and determination program
US10048992B2 (en) * 2011-04-13 2018-08-14 Microsoft Technology Licensing, Llc Extension of schematized XML protocols
US20120265784A1 (en) * 2011-04-15 2012-10-18 Microsoft Corporation Ordering semantic query formulation suggestions
US8838992B1 (en) * 2011-04-28 2014-09-16 Trend Micro Incorporated Identification of normal scripts in computer systems
US20120303570A1 (en) * 2011-05-27 2012-11-29 Verizon Patent And Licensing, Inc. System for and method of parsing an electronic mail
US8630989B2 (en) 2011-05-27 2014-01-14 International Business Machines Corporation Systems and methods for information extraction using contextual pattern discovery
US9164983B2 (en) * 2011-05-27 2015-10-20 Robert Bosch Gmbh Broad-coverage normalization system for social media language
WO2013054348A2 (en) * 2011-07-20 2013-04-18 Tata Consultancy Services Limited A method and system for differentiating textual information embedded in streaming news video
WO2013026146A1 (en) * 2011-08-24 2013-02-28 Alexei Kamychev Method and apparatus for emulating short text fast-reading processes
US9785638B1 (en) 2011-08-25 2017-10-10 Infotech International Llc Document display system and method
US9633012B1 (en) 2011-08-25 2017-04-25 Infotech International Llc Construction permit processing system and method
US9116895B1 (en) 2011-08-25 2015-08-25 Infotech International Llc Document processing system and method
US8812301B2 (en) * 2011-09-26 2014-08-19 Xerox Corporation Linguistically-adapted structural query annotation
US20130246431A1 (en) 2011-12-27 2013-09-19 Mcafee, Inc. System and method for providing data protection workflows in a network environment
US9158754B2 (en) * 2012-03-29 2015-10-13 The Echo Nest Corporation Named entity extraction from a block of text
US20130298003A1 (en) * 2012-05-04 2013-11-07 Rawllin International Inc. Automatic annotation of content
CA2865187C (en) * 2012-05-15 2015-09-22 Whyz Technologies Limited Method and system relating to salient content extraction for electronic content
US9684648B2 (en) * 2012-05-31 2017-06-20 International Business Machines Corporation Disambiguating words within a text segment
US9460199B2 (en) 2013-05-01 2016-10-04 International Business Machines Corporation Application of text analytics to determine provenance of an object
EP3022662A4 (en) 2013-05-31 2017-06-14 Joshi, Vikas Balwant Method and apparatus for browsing information
US10037317B1 (en) 2013-07-17 2018-07-31 Yseop Sa Techniques for automatic generation of natural language text
US9411804B1 (en) * 2013-07-17 2016-08-09 Yseop Sa Techniques for automatic generation of natural language text
US10541053B2 (en) 2013-09-05 2020-01-21 Optum360, LLC Automated clinical indicator recognition with natural language processing
US10133727B2 (en) 2013-10-01 2018-11-20 A-Life Medical, Llc Ontologically driven procedure coding
US10075484B1 (en) 2014-03-13 2018-09-11 Issuu, Inc. Sharable clips for digital publications
US9659005B2 (en) 2014-05-16 2017-05-23 Semantix Technologies Corporation System for semantic interpretation
US9761222B1 (en) * 2014-06-11 2017-09-12 Albert Scarasso Intelligent conversational messaging
US9454695B2 (en) * 2014-10-22 2016-09-27 Xerox Corporation System and method for multi-view pattern matching
EP3029607A1 (en) * 2014-12-05 2016-06-08 PLANET AI GmbH Method for text recognition and computer program product
US10176163B2 (en) * 2014-12-19 2019-01-08 International Business Machines Corporation Diagnosing autism spectrum disorder using natural language processing
US20160299928A1 (en) * 2015-04-10 2016-10-13 Infotrax Systems Variable record size within a hierarchically organized data structure
US11010768B2 (en) * 2015-04-30 2021-05-18 Oracle International Corporation Character-based attribute value extraction system
WO2016203457A1 (en) * 2015-06-19 2016-12-22 Koninklijke Philips N.V. Efficient clinical trial matching
US9633048B1 (en) * 2015-11-16 2017-04-25 Adobe Systems Incorporated Converting a text sentence to a series of images
US10268750B2 (en) * 2016-01-29 2019-04-23 Cisco Technology, Inc. Log event summarization for distributed server system
US9836451B2 (en) * 2016-02-18 2017-12-05 Sap Se Dynamic tokens for an expression parser
WO2017147036A1 (en) 2016-02-23 2017-08-31 Carrier Corporation Extraction of policies from natural language documents for physical access control
JP2017167433A (en) * 2016-03-17 2017-09-21 株式会社東芝 Summary generation device, summary generation method, and summary generation program
US10169581B2 (en) 2016-08-29 2019-01-01 Trend Micro Incorporated Detecting malicious code in sections of computer files
US10769213B2 (en) * 2016-10-24 2020-09-08 International Business Machines Corporation Detection of document similarity
RU2636098C1 (en) * 2016-10-26 2017-11-20 Общество с ограниченной ответственностью "Аби Продакшн" Use of depth semantic analysis of texts on natural language for creation of training samples in methods of machine training
US10832000B2 (en) * 2016-11-14 2020-11-10 International Business Machines Corporation Identification of textual similarity with references
MX2019008257A (en) * 2017-01-11 2019-10-07 Koninklijke Philips Nv Method and system for automated inclusion or exclusion criteria detection.
US10565498B1 (en) 2017-02-28 2020-02-18 Amazon Technologies, Inc. Deep neural network-based relationship analysis with multi-feature token model
US10579719B2 (en) * 2017-06-15 2020-03-03 Turbopatent Inc. System and method for editor emulation
US10713519B2 (en) * 2017-06-22 2020-07-14 Adobe Inc. Automated workflows for identification of reading order from text segments using probabilistic language models
US10740560B2 (en) 2017-06-30 2020-08-11 Elsevier, Inc. Systems and methods for extracting funder information from text
US11475209B2 (en) * 2017-10-17 2022-10-18 Handycontract Llc Device, system, and method for extracting named entities from sectioned documents
US11055209B2 (en) * 2017-12-21 2021-07-06 Google Llc Application analysis with flexible post-processing
US10872122B2 (en) * 2018-01-30 2020-12-22 Government Of The United States Of America, As Represented By The Secretary Of Commerce Knowledge management system and process for managing knowledge
US11586955B2 (en) 2018-02-02 2023-02-21 Accenture Global Solutions Limited Ontology and rule based adjudication
US10733389B2 (en) * 2018-09-05 2020-08-04 International Business Machines Corporation Computer aided input segmentation for machine translation
US10936809B2 (en) * 2018-09-11 2021-03-02 Dell Products L.P. Method of optimized parsing unstructured and garbled texts lacking whitespaces
US11295083B1 (en) * 2018-09-26 2022-04-05 Amazon Technologies, Inc. Neural models for named-entity recognition
US11238215B2 (en) 2018-12-04 2022-02-01 Issuu, Inc. Systems and methods for generating social assets from electronic publications
US10977289B2 (en) * 2019-02-11 2021-04-13 Verizon Media Inc. Automatic electronic message content extraction method and apparatus
US11163954B2 (en) * 2019-09-18 2021-11-02 International Business Machines Corporation Propagation of annotation metadata to overlapping annotations of synonymous type
US11625555B1 (en) 2020-03-12 2023-04-11 Amazon Technologies, Inc. Artificial intelligence system with unsupervised model training for entity-pair relationship analysis
US11074402B1 (en) * 2020-04-07 2021-07-27 International Business Machines Corporation Linguistically consistent document annotation
US11514321B1 (en) 2020-06-12 2022-11-29 Amazon Technologies, Inc. Artificial intelligence system using unsupervised transfer learning for intra-cluster analysis
US11423072B1 (en) 2020-07-31 2022-08-23 Amazon Technologies, Inc. Artificial intelligence system employing multimodal learning for analyzing entity record relationships
US11620558B1 (en) 2020-08-25 2023-04-04 Amazon Technologies, Inc. Iterative machine learning based techniques for value-based defect analysis in large data sets
RU2751993C1 (en) * 2020-09-09 2021-07-21 Глеб Валерьевич Данилов Method for extracting information from unstructured texts written in natural language
CN112417161B (en) * 2020-11-12 2022-06-24 福建亿榕信息技术有限公司 Method and storage device for recognizing upper and lower relationships of knowledge graph based on mode expansion and BERT classification
EP4075320A1 (en) 2021-04-15 2022-10-19 Wonop Holding ApS A method and device for improving the efficiency of pattern recognition in natural language
CN113420149A (en) * 2021-06-30 2021-09-21 北京百度网讯科技有限公司 Data labeling method and device

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5890103A (en) * 1995-07-19 1999-03-30 Lernout & Hauspie Speech Products N.V. Method and apparatus for improved tokenization of natural language text
US6279017B1 (en) * 1996-08-07 2001-08-21 Randall C. Walker Method and apparatus for displaying text based upon attributes found within the text
US20010018697A1 (en) * 2000-01-25 2001-08-30 Fuji Xerox Co., Ltd. Structured document processing system and structured document processing method
US20020013694A1 (en) * 2000-07-26 2002-01-31 Toshiki Murata Apparatus and method for natural language processing
US6442545B1 (en) * 1999-06-01 2002-08-27 Clearforest Ltd. Term-level text with mining with taxonomies
US20020165717A1 (en) * 2001-04-06 2002-11-07 Solmer Robert P. Efficient method for information extraction
US20030007397A1 (en) * 2001-05-10 2003-01-09 Kenichiro Kobayashi Document processing apparatus, document processing method, document processing program and recording medium
US20030154070A1 (en) * 2002-02-12 2003-08-14 Naoyuki Tokuda System and method for accurate grammar analysis using a learners' model and part-of-speech tagged (POST) parser
US20030158723A1 (en) * 2002-02-20 2003-08-21 Fuji Xerox Co., Ltd. Syntactic information tagging support system and method
US20030167162A1 (en) * 2001-03-07 2003-09-04 International Business Machines Corporation System and method for building a semantic network capable of identifying word patterns in text
US20030229854A1 (en) * 2000-10-19 2003-12-11 Michel Lemay Text extraction method for HTML pages
US6714939B2 (en) * 2001-01-08 2004-03-30 Softface, Inc. Creation of structured data from plain text
US20040078190A1 (en) * 2000-09-29 2004-04-22 Fass Daniel C Method and system for describing and identifying concepts in natural language text for information retrieval and processing
US20040243645A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US20050066271A1 (en) * 2002-06-28 2005-03-24 Nippon Telegraph And Telephone Corporation Extraction of information from structured documents
US6910003B1 (en) * 1999-09-17 2005-06-21 Discern Communications, Inc. System, method and article of manufacture for concept based information searching
US20050154979A1 (en) * 2004-01-14 2005-07-14 Xerox Corporation Systems and methods for converting legacy and proprietary documents into extended mark-up language format

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6108698A (en) * 1998-07-29 2000-08-22 Xerox Corporation Node-link data defining a graph and a tree within the graph
SE524595C2 (en) * 2000-09-26 2004-08-31 Hapax Information Systems Ab Procedure and computer program for normalization of style throws
US6892189B2 (en) * 2001-01-26 2005-05-10 Inxight Software, Inc. Method for learning and combining global and local regularities for information extraction and classification
SE0101127D0 (en) * 2001-03-30 2001-03-30 Hapax Information Systems Ab Method of finding answers to questions
US20040243556A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS)

Cited By (367)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7730104B2 (en) * 2002-06-28 2010-06-01 Nippon Telegraph And Telephone Corporation Extraction of information from structured documents
US20050066271A1 (en) * 2002-06-28 2005-03-24 Nippon Telegraph And Telephone Corporation Extraction of information from structured documents
US20040225646A1 (en) * 2002-11-28 2004-11-11 Miki Sasaki Numerical expression retrieving device
US20090132384A1 (en) * 2003-03-24 2009-05-21 Objective Systems Pty Limited Production of documents
US8719696B2 (en) * 2003-03-24 2014-05-06 Accessible Publishing Systems Pty Ltd Production of documents
US20070022131A1 (en) * 2003-03-24 2007-01-25 Duncan Gregory L Production of documents
US9430555B2 (en) 2003-03-24 2016-08-30 Accessible Publishing Systems Pty Ltd Reformatting text in a document for the purpose of improving readability
US8229932B2 (en) 2003-09-04 2012-07-24 Oracle International Corporation Storing XML documents efficiently in an RDBMS
US8694510B2 (en) 2003-09-04 2014-04-08 Oracle International Corporation Indexing XML documents efficiently
US20050055343A1 (en) * 2003-09-04 2005-03-10 Krishnamurthy Sanjay M. Storing XML documents efficiently in an RDBMS
US20050228768A1 (en) * 2004-04-09 2005-10-13 Ashish Thusoo Mechanism for efficiently evaluating operator trees
US20050228791A1 (en) * 2004-04-09 2005-10-13 Ashish Thusoo Efficient queribility and manageability of an XML index with path subsetting
US20050228818A1 (en) * 2004-04-09 2005-10-13 Ravi Murthy Method and system for flexible sectioning of XML data in a database system
US7461074B2 (en) * 2004-04-09 2008-12-02 Oracle International Corporation Method and system for flexible sectioning of XML data in a database system
US7493305B2 (en) 2004-04-09 2009-02-17 Oracle International Corporation Efficient queribility and manageability of an XML index with path subsetting
US7603347B2 (en) 2004-04-09 2009-10-13 Oracle International Corporation Mechanism for efficiently evaluating operator trees
US20050237227A1 (en) * 2004-04-27 2005-10-27 International Business Machines Corporation Mention-synchronous entity tracking system and method for chaining mentions
US20080243888A1 (en) * 2004-04-27 2008-10-02 Abraham Ittycheriah Mention-Synchronous Entity Tracking: System and Method for Chaining Mentions
US8620961B2 (en) * 2004-04-27 2013-12-31 International Business Machines Corporation Mention-synchronous entity tracking: system and method for chaining mentions
US7398274B2 (en) * 2004-04-27 2008-07-08 International Business Machines Corporation Mention-synchronous entity tracking system and method for chaining mentions
US20050278613A1 (en) * 2004-06-09 2005-12-15 Nec Corporation Topic analyzing method and apparatus and program therefor
US20060184551A1 (en) * 2004-07-02 2006-08-17 Asha Tarachandani Mechanism for improving performance on XML over XML data using path subsetting
US7885980B2 (en) 2004-07-02 2011-02-08 Oracle International Corporation Mechanism for improving performance on XML over XML data using path subsetting
US9171100B2 (en) 2004-09-22 2015-10-27 Primo M. Pettovello MTree an XPath multi-axis structure threaded index
US20080065979A1 (en) * 2004-11-12 2008-03-13 Justsystems Corporation Document Processing Device, and Document Processing Method
US20080270887A1 (en) * 2004-11-12 2008-10-30 Justsystems Corporation Document Processing Device And Document Processing Method
US7921076B2 (en) 2004-12-15 2011-04-05 Oracle International Corporation Performing an action in response to a file system event
US20060129584A1 (en) * 2004-12-15 2006-06-15 Thuvan Hoang Performing an action in response to a file system event
US8176007B2 (en) 2004-12-15 2012-05-08 Oracle International Corporation Performing an action in response to a file system event
US7698270B2 (en) 2004-12-29 2010-04-13 Baynote, Inc. Method and apparatus for identifying, extracting, capturing, and leveraging expertise and knowledge
US8095523B2 (en) 2004-12-29 2012-01-10 Baynote, Inc. Method and apparatus for context-based content recommendation
US20080104004A1 (en) * 2004-12-29 2008-05-01 Scott Brave Method and Apparatus for Identifying, Extracting, Capturing, and Leveraging Expertise and Knowledge
US8601023B2 (en) 2004-12-29 2013-12-03 Baynote, Inc. Method and apparatus for identifying, extracting, capturing, and leveraging expertise and knowledge
US7702690B2 (en) 2004-12-29 2010-04-20 Baynote, Inc. Method and apparatus for suggesting/disambiguation query terms based upon usage patterns observed
US20090037355A1 (en) * 2004-12-29 2009-02-05 Scott Brave Method and Apparatus for Context-Based Content Recommendation
US20060200556A1 (en) * 2004-12-29 2006-09-07 Scott Brave Method and apparatus for identifying, extracting, capturing, and leveraging expertise and knowledge
US20070150466A1 (en) * 2004-12-29 2007-06-28 Scott Brave Method and apparatus for suggesting/disambiguation query terms based upon usage patterns observed
US20070143317A1 (en) * 2004-12-30 2007-06-21 Andrew Hogue Mechanism for managing facts in a fact repository
US20070143282A1 (en) * 2005-03-31 2007-06-21 Betz Jonathan T Anchor text summarization for corroboration
US9208229B2 (en) 2005-03-31 2015-12-08 Google Inc. Anchor text summarization for corroboration
US8650175B2 (en) 2005-03-31 2014-02-11 Google Inc. User interface for facts query engine with snippets from information sources that include query terms and answer terms
US8682913B1 (en) 2005-03-31 2014-03-25 Google Inc. Corroborating facts extracted from multiple sources
US8996470B1 (en) 2005-05-31 2015-03-31 Google Inc. System for ensuring the internal consistency of a fact repository
US20110047153A1 (en) * 2005-05-31 2011-02-24 Betz Jonathan T Identifying the Unifying Subject of a Set of Facts
US8719260B2 (en) 2005-05-31 2014-05-06 Google Inc. Identifying the unifying subject of a set of facts
US8825471B2 (en) 2005-05-31 2014-09-02 Google Inc. Unsupervised extraction of facts
US9558186B2 (en) 2005-05-31 2017-01-31 Google Inc. Unsupervised extraction of facts
US8078573B2 (en) 2005-05-31 2011-12-13 Google Inc. Identifying the unifying subject of a set of facts
US20070016604A1 (en) * 2005-07-18 2007-01-18 Ravi Murthy Document level indexes for efficient processing in multiple tiers of a computer system
US8762410B2 (en) 2005-07-18 2014-06-24 Oracle International Corporation Document level indexes for efficient processing in multiple tiers of a computer system
US20070016605A1 (en) * 2005-07-18 2007-01-18 Ravi Murthy Mechanism for computing structural summaries of XML document collections in a database system
US10318555B2 (en) 2005-07-25 2019-06-11 Splunk Inc. Identifying relationships between network traffic data and log data
US20150142842A1 (en) * 2005-07-25 2015-05-21 Splunk Inc. Uniform storage and search of events derived from machine data from different sources
US11126477B2 (en) 2005-07-25 2021-09-21 Splunk Inc. Identifying matching event data from disparate data sources
US11036567B2 (en) 2005-07-25 2021-06-15 Splunk Inc. Determining system behavior using event patterns in machine data
US11036566B2 (en) 2005-07-25 2021-06-15 Splunk Inc. Analyzing machine data based on relationships between log data and network traffic data
US11010214B2 (en) 2005-07-25 2021-05-18 Splunk Inc. Identifying pattern relationships in machine data
US9280594B2 (en) * 2005-07-25 2016-03-08 Splunk Inc. Uniform storage and search of events derived from machine data from different sources
US10339162B2 (en) 2005-07-25 2019-07-02 Splunk Inc. Identifying security-related events derived from machine data that match a particular portion of machine data
US10324957B2 (en) 2005-07-25 2019-06-18 Splunk Inc. Uniform storage and search of security-related events derived from machine data from different sources
US10318553B2 (en) 2005-07-25 2019-06-11 Splunk Inc. Identification of systems with anomalous behaviour using events derived from machine data produced by those systems
US11663244B2 (en) 2005-07-25 2023-05-30 Splunk Inc. Segmenting machine data into events to identify matching events
US10242086B2 (en) 2005-07-25 2019-03-26 Splunk Inc. Identifying system performance patterns in machine data
US9292590B2 (en) * 2005-07-25 2016-03-22 Splunk Inc. Identifying events derived from machine data based on an extracted portion from a first event
US9298805B2 (en) * 2005-07-25 2016-03-29 Splunk Inc. Using extractions to search events derived from machine data
US20150154250A1 (en) * 2005-07-25 2015-06-04 Splunk Inc. Pattern identification, pattern matching, and clustering for events derived from machine data
US9317582B2 (en) * 2005-07-25 2016-04-19 Splunk Inc. Identifying events derived from machine data that match a particular portion of machine data
US11204817B2 (en) 2005-07-25 2021-12-21 Splunk Inc. Deriving signature-based rules for creating events from machine data
US11599400B2 (en) 2005-07-25 2023-03-07 Splunk Inc. Segmenting machine data into events based on source signatures
US9361357B2 (en) * 2005-07-25 2016-06-07 Splunk Inc. Searching of events derived from machine data using field and keyword criteria
US9384261B2 (en) 2005-07-25 2016-07-05 Splunk Inc. Automatic creation of rules for identifying event boundaries in machine data
US11119833B2 (en) 2005-07-25 2021-09-14 Splunk Inc. Identifying behavioral patterns of events derived from machine data that reveal historical behavior of an information technology environment
US20150149460A1 (en) * 2005-07-25 2015-05-28 Splunk Inc. Searching of events derived from machine data using field and keyword criteria
US20080177740A1 (en) * 2005-09-20 2008-07-24 International Business Machines Corporation Detecting relationships in unstructured text
US20070067320A1 (en) * 2005-09-20 2007-03-22 International Business Machines Corporation Detecting relationships in unstructured text
US8001144B2 (en) 2005-09-20 2011-08-16 International Business Machines Corporation Detecting relationships in unstructured text
US7548933B2 (en) * 2005-10-14 2009-06-16 International Business Machines Corporation System and method for exploiting semantic annotations in executing keyword queries over a collection of text documents
US20070088734A1 (en) * 2005-10-14 2007-04-19 International Business Machines Corporation System and method for exploiting semantic annotations in executing keyword queries over a collection of text documents
US20100131564A1 (en) * 2005-11-14 2010-05-27 Pettovello Primo M Index data structure for a peer-to-peer network
US20070112803A1 (en) * 2005-11-14 2007-05-17 Pettovello Primo M Peer-to-peer semantic indexing
US7664742B2 (en) 2005-11-14 2010-02-16 Pettovello Primo M Index data structure for a peer-to-peer network
US8166074B2 (en) 2005-11-14 2012-04-24 Pettovello Primo M Index data structure for a peer-to-peer network
US20070150464A1 (en) * 2005-12-27 2007-06-28 Scott Brave Method and apparatus for predicting destinations in a navigation context based upon observed usage patterns
US7580930B2 (en) 2005-12-27 2009-08-25 Baynote, Inc. Method and apparatus for predicting destinations in a navigation context based upon observed usage patterns
US20070150465A1 (en) * 2005-12-27 2007-06-28 Scott Brave Method and apparatus for determining expertise based upon observed usage patterns
US7693836B2 (en) 2005-12-27 2010-04-06 Baynote, Inc. Method and apparatus for determining peer groups based upon observed usage patterns
US7546295B2 (en) 2005-12-27 2009-06-09 Baynote, Inc. Method and apparatus for determining expertise based upon observed usage patterns
US7856446B2 (en) 2005-12-27 2010-12-21 Baynote, Inc. Method and apparatus for determining usefulness of a digital asset
US20070150515A1 (en) * 2005-12-27 2007-06-28 Scott Brave Method and apparatus for determining usefulness of a digital asset
US20070174309A1 (en) * 2006-01-18 2007-07-26 Pettovello Primo M Mtreeini: intermediate nodes and indexes
US9530229B2 (en) 2006-01-27 2016-12-27 Google Inc. Data object visualization using graphs
US9092495B2 (en) 2006-01-27 2015-07-28 Google Inc. Automatic object reference identification and linking in a browseable fact repository
US7836399B2 (en) 2006-02-09 2010-11-16 Microsoft Corporation Detection of lists in vector graphics documents
US20070185837A1 (en) * 2006-02-09 2007-08-09 Microsoft Corporation Detection of lists in vector graphics documents
US7958164B2 (en) 2006-02-16 2011-06-07 Microsoft Corporation Visual design of annotated regular expression
US8260785B2 (en) 2006-02-17 2012-09-04 Google Inc. Automatic object reference identification and linking in a browseable fact repository
US20120117077A1 (en) * 2006-02-17 2012-05-10 Tom Ritchford Annotation Framework
US8682891B2 (en) 2006-02-17 2014-03-25 Google Inc. Automatic object reference identification and linking in a browseable fact repository
US20070198480A1 (en) * 2006-02-17 2007-08-23 Hogue Andrew W Query language
US8954426B2 (en) 2006-02-17 2015-02-10 Google Inc. Query language
US20070214134A1 (en) * 2006-03-09 2007-09-13 Microsoft Corporation Data parsing with annotated patterns
US7860881B2 (en) * 2006-03-09 2010-12-28 Microsoft Corporation Data parsing with annotated patterns
US8510292B2 (en) 2006-05-25 2013-08-13 Oracle International Corporation Isolation for applications working on shared XML data
US20070276792A1 (en) * 2006-05-25 2007-11-29 Asha Tarachandani Isolation for applications working on shared XML data
US8930348B2 (en) * 2006-05-25 2015-01-06 Oracle International Corporation Isolation for applications working on shared XML data
US20080027888A1 (en) * 2006-07-31 2008-01-31 Microsoft Corporation Optimization of fact extraction using a multi-stage approach
WO2008016491A1 (en) 2006-07-31 2008-02-07 Microsoft Corporation Optimization of fact extraction using a multi-stage approach
US7668791B2 (en) * 2006-07-31 2010-02-23 Microsoft Corporation Distinguishing facts from opinions using a multi-stage approach
JP2009545808A (en) * 2006-07-31 2009-12-24 マイクロソフト コーポレーション Optimizing fact extraction using a multi-stage approach
EP2050019A4 (en) * 2006-07-31 2012-03-21 Microsoft Corp Optimization of fact extraction using a multi-stage approach
AU2007281638B2 (en) * 2006-07-31 2011-10-06 Microsoft Technology Licensing, Llc Optimization of fact extraction using a multi-stage approach
EP2050019A1 (en) * 2006-07-31 2009-04-22 Microsoft Corporation Optimization of fact extraction using a multi-stage approach
US9147271B2 (en) 2006-09-08 2015-09-29 Microsoft Technology Licensing, Llc Graphical representation of aggregated data
US20080065646A1 (en) * 2006-09-08 2008-03-13 Microsoft Corporation Enabling access to aggregated software security information
US8234706B2 (en) 2006-09-08 2012-07-31 Microsoft Corporation Enabling access to aggregated software security information
US20080126385A1 (en) * 2006-09-19 2008-05-29 Microsoft Corporation Intelligent batching of electronic data interchange messages
US8161078B2 (en) 2006-09-20 2012-04-17 Microsoft Corporation Electronic data interchange (EDI) data dictionary management and versioning system
US20080071806A1 (en) * 2006-09-20 2008-03-20 Microsoft Corporation Difference analysis for electronic data interchange (edi) data dictionary
US20080072160A1 (en) * 2006-09-20 2008-03-20 Microsoft Corporation Electronic data interchange transaction set definition based instance editing
US20080126386A1 (en) * 2006-09-20 2008-05-29 Microsoft Corporation Translation of electronic data interchange messages to extensible markup language representation(s)
US8108767B2 (en) 2006-09-20 2012-01-31 Microsoft Corporation Electronic data interchange transaction set definition based instance editing
US20080071817A1 (en) * 2006-09-20 2008-03-20 Microsoft Corporation Electronic data interchange (edi) data dictionary management and versioning system
US9785686B2 (en) 2006-09-28 2017-10-10 Google Inc. Corroborating facts in electronic documents
US8954412B1 (en) 2006-09-28 2015-02-10 Google Inc. Corroborating facts in electronic documents
US9495358B2 (en) 2006-10-10 2016-11-15 Abbyy Infopoisk Llc Cross-language text clustering
US8122026B1 (en) * 2006-10-20 2012-02-21 Google Inc. Finding and disambiguating references to entities on web pages
US8751498B2 (en) 2006-10-20 2014-06-10 Google Inc. Finding and disambiguating references to entities on web pages
US9760570B2 (en) 2006-10-20 2017-09-12 Google Inc. Finding and disambiguating references to entities on web pages
US20080140679A1 (en) * 2006-12-11 2008-06-12 Microsoft Corporation Relational linking among resoures
US8099429B2 (en) 2006-12-11 2012-01-17 Microsoft Corporation Relational linking among resoures
US20080162449A1 (en) * 2006-12-28 2008-07-03 Chen Chao-Yu Dynamic page similarity measurement
US20080168081A1 (en) * 2007-01-09 2008-07-10 Microsoft Corporation Extensible schemas and party configurations for edi document generation or validation
US20080168109A1 (en) * 2007-01-09 2008-07-10 Microsoft Corporation Automatic map updating based on schema changes
US10459955B1 (en) 2007-03-14 2019-10-29 Google Llc Determining geographic locations for place names
US8347202B1 (en) 2007-03-14 2013-01-01 Google Inc. Determining geographic locations for place names in a fact repository
US9892132B2 (en) 2007-03-14 2018-02-13 Google Llc Determining geographic locations for place names in a fact repository
US20100131534A1 (en) * 2007-04-10 2010-05-27 Toshio Takeda Information providing system
US20210042468A1 (en) * 2007-04-13 2021-02-11 Optum360, Llc Mere-parsing with boundary and semantic driven scoping
US20080262927A1 (en) * 2007-04-19 2008-10-23 Hiroshi Kanayama System, method, and program for selecting advertisements
US9779079B2 (en) * 2007-06-01 2017-10-03 Xerox Corporation Authoring system
US20080300862A1 (en) * 2007-06-01 2008-12-04 Xerox Corporation Authoring system
US9251137B2 (en) 2007-06-21 2016-02-02 International Business Machines Corporation Method of text type-ahead
US20080320411A1 (en) * 2007-06-21 2008-12-25 Yen-Fu Chen Method of text type-ahead
US9239826B2 (en) * 2007-06-27 2016-01-19 Abbyy Infopoisk Llc Method and system for generating new entries in natural language dictionary
US20150012262A1 (en) * 2007-06-27 2015-01-08 Abbyy Infopoisk Llc Method and system for generating new entries in natural language dictionary
US20090007271A1 (en) * 2007-06-28 2009-01-01 Microsoft Corporation Identifying attributes of aggregated data
US20090007272A1 (en) * 2007-06-28 2009-01-01 Microsoft Corporation Identifying data associated with security issue attributes
US8250651B2 (en) * 2007-06-28 2012-08-21 Microsoft Corporation Identifying attributes of aggregated data
US8302197B2 (en) 2007-06-28 2012-10-30 Microsoft Corporation Identifying data associated with security issue attributes
US11556697B2 (en) * 2007-07-10 2023-01-17 International Business Machines Corporation Intelligent text annotation
US20160203115A1 (en) * 2007-07-10 2016-07-14 International Business Machines Corporation Intelligent text annotation
US7970766B1 (en) 2007-07-23 2011-06-28 Google Inc. Entity type assignment
US20120136859A1 (en) * 2007-07-23 2012-05-31 Farhan Shamsi Entity Type Assignment
US20090063550A1 (en) * 2007-08-31 2009-03-05 Powerset, Inc. Fact-based indexing for natural language search
KR101522049B1 (en) * 2007-08-31 2015-05-20 마이크로소프트 코포레이션 Coreference resolution in an ambiguity-sensitive natural language processing system
US8738598B2 (en) 2007-08-31 2014-05-27 Microsoft Corporation Checkpointing iterators during search
US8346756B2 (en) 2007-08-31 2013-01-01 Microsoft Corporation Calculating valence of expressions within documents for searching a document index
US8041697B2 (en) 2007-08-31 2011-10-18 Microsoft Corporation Semi-automatic example-based induction of semantic translation rules to support natural language search
US20090070308A1 (en) * 2007-08-31 2009-03-12 Powerset, Inc. Checkpointing Iterators During Search
US8712758B2 (en) 2007-08-31 2014-04-29 Microsoft Corporation Coreference resolution in an ambiguity-sensitive natural language processing system
US20090070322A1 (en) * 2007-08-31 2009-03-12 Powerset, Inc. Browsing knowledge on the basis of semantic relations
US20090076799A1 (en) * 2007-08-31 2009-03-19 Powerset, Inc. Coreference Resolution In An Ambiguity-Sensitive Natural Language Processing System
US20090077069A1 (en) * 2007-08-31 2009-03-19 Powerset, Inc. Calculating Valence Of Expressions Within Documents For Searching A Document Index
US20090089047A1 (en) * 2007-08-31 2009-04-02 Powerset, Inc. Natural Language Hypernym Weighting For Word Sense Disambiguation
US8868562B2 (en) * 2007-08-31 2014-10-21 Microsoft Corporation Identification of semantic relationships within reported speech
US20090094019A1 (en) * 2007-08-31 2009-04-09 Powerset, Inc. Efficiently Representing Word Sense Probabilities
US20090063473A1 (en) * 2007-08-31 2009-03-05 Powerset, Inc. Indexing role hierarchies for words in a search index
US8463593B2 (en) 2007-08-31 2013-06-11 Microsoft Corporation Natural language hypernym weighting for word sense disambiguation
US8229970B2 (en) 2007-08-31 2012-07-24 Microsoft Corporation Efficient storage and retrieval of posting lists
AU2008292779B2 (en) * 2007-08-31 2012-09-06 Microsoft Technology Licensing, Llc Coreference resolution in an ambiguity-sensitive natural language processing system
US20090132521A1 (en) * 2007-08-31 2009-05-21 Powerset, Inc. Efficient Storage and Retrieval of Posting Lists
US20090138454A1 (en) * 2007-08-31 2009-05-28 Powerset, Inc. Semi-Automatic Example-Based Induction of Semantic Translation Rules to Support Natural Language Search
US8316036B2 (en) 2007-08-31 2012-11-20 Microsoft Corporation Checkpointing iterators during search
US8639708B2 (en) * 2007-08-31 2014-01-28 Microsoft Corporation Fact-based indexing for natural language search
US8280721B2 (en) 2007-08-31 2012-10-02 Microsoft Corporation Efficiently representing word sense probabilities
US8229730B2 (en) 2007-08-31 2012-07-24 Microsoft Corporation Indexing role hierarchies for words in a search index
US20090063426A1 (en) * 2007-08-31 2009-03-05 Powerset, Inc. Identification of semantic relationships within reported speech
US9454776B2 (en) 2007-09-12 2016-09-27 Google Inc. Placement attribute targeting
US9058608B2 (en) * 2007-09-12 2015-06-16 Google Inc. Placement attribute targeting
US9679309B2 (en) 2007-09-12 2017-06-13 Google Inc. Placement attribute targeting
US20090070706A1 (en) * 2007-09-12 2009-03-12 Google Inc. Placement Attribute Targeting
US20090094267A1 (en) * 2007-10-04 2009-04-09 Muguda Naveenkumar V System and Method for Implementing Metadata Extraction of Artifacts from Associated Collaborative Discussions on a Data Processing System
US8326833B2 (en) * 2007-10-04 2012-12-04 International Business Machines Corporation Implementing metadata extraction of artifacts from associated collaborative discussions
US7991768B2 (en) 2007-11-08 2011-08-02 Oracle International Corporation Global query normalization to improve XML index based rewrites for path subsetted index
US20090125542A1 (en) * 2007-11-14 2009-05-14 Sap Ag Systems and Methods for Modular Information Extraction
US7987416B2 (en) * 2007-11-14 2011-07-26 Sap Ag Systems and methods for modular information extraction
US8812435B1 (en) * 2007-11-16 2014-08-19 Google Inc. Learning objects and facts from documents
US20090157385A1 (en) * 2007-12-14 2009-06-18 Nokia Corporation Inverse Text Normalization
US20090172517A1 (en) * 2007-12-27 2009-07-02 Kalicharan Bhagavathi P Document parsing method and system using web-based GUI software
US20090182741A1 (en) * 2008-01-16 2009-07-16 International Business Machines Corporation Systems and Arrangements of Text Type-Ahead
US8316035B2 (en) * 2008-01-16 2012-11-20 International Business Machines Corporation Systems and arrangements of text type-ahead
US8725753B2 (en) 2008-01-16 2014-05-13 International Business Machines Corporation Arrangements of text type-ahead
US8019769B2 (en) * 2008-01-18 2011-09-13 Litera Corp. System and method for determining valid citation patterns in electronic documents
US8219566B2 (en) 2008-01-18 2012-07-10 Litera Corp. System and method for determining valid citation patterns in electronic documents
US20090187567A1 (en) * 2008-01-18 2009-07-23 Citation Ware Llc System and method for determining valid citation patterns in electronic documents
US20090198646A1 (en) * 2008-01-31 2009-08-06 International Business Machines Corporation Systems, methods and computer program products for an algebraic approach to rule-based information extraction
US8387010B2 (en) * 2008-02-26 2013-02-26 Hitachi, Ltd. Automatic software configuring system
US20090217243A1 (en) * 2008-02-26 2009-08-27 Hitachi, Ltd. Automatic software configuring system
US8249856B2 (en) * 2008-03-20 2012-08-21 Raytheon Bbn Technologies Corp. Machine translation
US20090240487A1 (en) * 2008-03-20 2009-09-24 Libin Shen Machine translation
US20090271700A1 (en) * 2008-04-28 2009-10-29 Yen-Fu Chen Text type-ahead
US8359532B2 (en) 2008-04-28 2013-01-22 International Business Machines Corporation Text type-ahead
US20090282401A1 (en) * 2008-05-09 2009-11-12 Mariela Todorova Deploying software modules in computer system
US8869140B2 (en) * 2008-05-09 2014-10-21 Sap Se Deploying software modules in computer system
US8738360B2 (en) 2008-06-06 2014-05-27 Apple Inc. Data detection of a character sequence having multiple possible data types
US9454522B2 (en) 2008-06-06 2016-09-27 Apple Inc. Detection of data in a sequence of characters
US20140358883A1 (en) * 2008-09-08 2014-12-04 Semanti Inc. Semantically associated text index and the population and use thereof
US20100083095A1 (en) * 2008-09-29 2010-04-01 Nikovski Daniel N Method for Extracting Data from Web Pages
US20100121631A1 (en) * 2008-11-10 2010-05-13 Olivier Bonnet Data detection
US8489388B2 (en) * 2008-11-10 2013-07-16 Apple Inc. Data detection
US9489371B2 (en) 2008-11-10 2016-11-08 Apple Inc. Detection of data in a sequence of characters
US8306806B2 (en) 2008-12-02 2012-11-06 Microsoft Corporation Adaptive web mining of bilingual lexicon
US20100138211A1 (en) * 2008-12-02 2010-06-03 Microsoft Corporation Adaptive web mining of bilingual lexicon
US8190538B2 (en) 2009-01-30 2012-05-29 Lexisnexis Group Methods and systems for matching records and normalizing names
US20100198756A1 (en) * 2009-01-30 2010-08-05 Zhang ling qin Methods and systems for matching records and normalizing names
US8433559B2 (en) * 2009-03-24 2013-04-30 Microsoft Corporation Text analysis using phrase definitions and containers
US20100250235A1 (en) * 2009-03-24 2010-09-30 Microsoft Corporation Text analysis using phrase definitions and containers
US9495424B1 (en) 2009-03-31 2016-11-15 Amazon Technologies, Inc. Recognition of characters and their significance within written works
US8325974B1 (en) 2009-03-31 2012-12-04 Amazon Technologies Inc. Recognition of characters and their significance within written works
US8897486B1 (en) 2009-03-31 2014-11-25 Amazon Technologies, Inc. Recognition of characters and their significance within written works
US8150695B1 (en) 2009-06-18 2012-04-03 Amazon Technologies, Inc. Presentation of written works based on character identities and attributes
US9189475B2 (en) * 2009-06-22 2015-11-17 Ca, Inc. Indexing mechanism (nth phrasal index) for advanced leveraging for translation
US20100324885A1 (en) * 2009-06-22 2010-12-23 Computer Associates Think, Inc. INDEXING MECHANISM (Nth PHRASAL INDEX) FOR ADVANCED LEVERAGING FOR TRANSLATION
US20110035390A1 (en) * 2009-08-05 2011-02-10 Loglogic, Inc. Message Descriptions
US8386498B2 (en) * 2009-08-05 2013-02-26 Loglogic, Inc. Message descriptions
US8631028B1 (en) 2009-10-29 2014-01-14 Primo M. Pettovello XPath query processing improvements
GB2475151A (en) * 2009-11-06 2011-05-11 Symantec Corp Indexing data for use by multiple applications by extracting tokens from data objects
US8458186B2 (en) 2009-11-06 2013-06-04 Symantec Corporation Systems and methods for processing and managing object-related data for use by a plurality of applications
US20110145240A1 (en) * 2009-12-15 2011-06-16 International Business Machines Corporation Organizing Annotations
US20110221367A1 (en) * 2010-03-11 2011-09-15 Gm Global Technology Operations, Inc. Methods, systems and apparatus for overmodulation of a five-phase machine
US8463789B1 (en) 2010-03-23 2013-06-11 Firstrain, Inc. Event detection
US8463790B1 (en) 2010-03-23 2013-06-11 Firstrain, Inc. Event naming
US8805840B1 (en) 2010-03-23 2014-08-12 Firstrain, Inc. Classification of documents
US9760634B1 (en) 2010-03-23 2017-09-12 Firstrain, Inc. Models for classifying documents
US11367295B1 (en) 2010-03-23 2022-06-21 Aurea Software, Inc. Graphical user interface for presentation of events
US20180068018A1 (en) * 2010-04-30 2018-03-08 International Business Machines Corporation Managed document research domains
US9858338B2 (en) * 2010-04-30 2018-01-02 International Business Machines Corporation Managed document research domains
US20110270856A1 (en) * 2010-04-30 2011-11-03 International Business Machines Corporation Managed document research domains
US20130346856A1 (en) * 2010-05-13 2013-12-26 Expedia, Inc. Systems and methods for automated content generation
US10025770B2 (en) * 2010-05-13 2018-07-17 Expedia, Inc. Systems and methods for automated content generation
US8805834B2 (en) 2010-05-26 2014-08-12 International Business Machines Corporation Extensible system and method for information extraction in a data processing system
US9418069B2 (en) 2010-05-26 2016-08-16 International Business Machines Corporation Extensible system and method for information extraction in a data processing system
US20110295864A1 (en) * 2010-05-29 2011-12-01 Martin Betz Iterative fact-extraction
US20120016676A1 (en) * 2010-07-15 2012-01-19 King Abdulaziz City For Science And Technology System and method for writing digits in words and pronunciation of numbers, fractions, and units
US8468021B2 (en) * 2010-07-15 2013-06-18 King Abdulaziz City For Science And Technology System and method for writing digits in words and pronunciation of numbers, fractions, and units
US20120065959A1 (en) * 2010-09-13 2012-03-15 Richard Salisbury Word graph
US8977538B2 (en) * 2010-09-13 2015-03-10 Richard Salisbury Constructing and analyzing a word graph
US8239349B2 (en) 2010-10-07 2012-08-07 Hewlett-Packard Development Company, L.P. Extracting data
US20120102031A1 (en) * 2010-10-20 2012-04-26 Sap Ag Apparatus and method for entity expansion and grouping
US8312018B2 (en) * 2010-10-20 2012-11-13 Business Objects Software Limited Entity expansion and grouping
US9876751B2 (en) 2011-02-23 2018-01-23 Blazent, Inc. System and method for analyzing messages in a network or across networks
US9614807B2 (en) 2011-02-23 2017-04-04 Bottlenose, Inc. System and method for analyzing messages in a network or across networks
US8719692B2 (en) * 2011-03-11 2014-05-06 Microsoft Corporation Validation, rejection, and modification of automatically generated document annotations
US9880988B2 (en) * 2011-03-11 2018-01-30 Microsoft Technology Licensing, Llc Validation, rejection, and modification of automatically generated document annotations
US20120233534A1 (en) * 2011-03-11 2012-09-13 Microsoft Corporation Validation, rejection, and modification of automatically generated document annotations
US20140215305A1 (en) * 2011-03-11 2014-07-31 Microsoft Corporation Validation, rejection, and modification of automatically generated document annotations
US8543583B2 (en) * 2011-03-24 2013-09-24 Sony Corporation Information processing apparatus, information processing method, and program
US20120246176A1 (en) * 2011-03-24 2012-09-27 Sony Corporation Information processing apparatus, information processing method, and program
US20160041967A1 (en) * 2011-04-01 2016-02-11 Rima Ghannam System for Natural Language Understanding
US9710458B2 (en) * 2011-04-01 2017-07-18 Rima Ghannam System for natural language understanding
US20120253793A1 (en) * 2011-04-01 2012-10-04 Rima Ghannam System for natural language understanding
US9110883B2 (en) * 2011-04-01 2015-08-18 Rima Ghannam System for natural language understanding
US20130042200A1 (en) * 2011-08-08 2013-02-14 The Original Software Group Limited System and method for annotating graphical user interface
US8745521B2 (en) * 2011-08-08 2014-06-03 The Original Software Group Limited System and method for annotating graphical user interface
US20130091145A1 (en) * 2011-10-07 2013-04-11 Electronics And Telecommunications Research Institute Method and apparatus for analyzing web trends based on issue template extraction
US8782042B1 (en) 2011-10-14 2014-07-15 Firstrain, Inc. Method and system for identifying entities
US9965508B1 (en) 2011-10-14 2018-05-08 Ignite Firstrain Solutions, Inc. Method and system for identifying entities
US20130110818A1 (en) * 2011-10-28 2013-05-02 Eamonn O'Brien-Strain Profile driven extraction
US9201868B1 (en) * 2011-12-09 2015-12-01 Guangsheng Zhang System, methods and user interface for identifying and presenting sentiment information
US9304989B2 (en) * 2012-02-17 2016-04-05 Bottlenose, Inc. Machine-based content analysis and user perception tracking of microcontent messages
US8938450B2 (en) * 2012-02-17 2015-01-20 Bottlenose, Inc. Natural language processing optimized for micro content
US8832092B2 (en) 2012-02-17 2014-09-09 Bottlenose, Inc. Natural language processing optimized for micro content
US20150095021A1 (en) * 2012-02-17 2015-04-02 Bottlenose, Inc. Machine-based content analysis and user perception tracking of microcontent messages
CN102646128A (en) * 2012-03-06 2012-08-22 北京航空航天大学 Method for labeling word properties of emotional words based on extensible markup language (XML)
WO2013159156A1 (en) * 2012-04-27 2013-10-31 Citadel Corporation Pty Ltd Method for storing and applying related sets of pattern/message rules
US20130297999A1 (en) * 2012-05-07 2013-11-07 Sap Ag Document Text Processing Using Edge Detection
US9569413B2 (en) * 2012-05-07 2017-02-14 Sap Se Document text processing using edge detection
US8977613B1 (en) 2012-06-12 2015-03-10 Firstrain, Inc. Generation of recurring searches
US9292505B1 (en) 2012-06-12 2016-03-22 Firstrain, Inc. Graphical user interface for recurring searches
US9009126B2 (en) 2012-07-31 2015-04-14 Bottlenose, Inc. Discovering and ranking trending links about topics
US8990097B2 (en) 2012-07-31 2015-03-24 Bottlenose, Inc. Discovering and ranking trending links about topics
WO2014049186A1 (en) * 2012-09-26 2014-04-03 Universidad Carlos Iii De Madrid Method for generating semantic patterns
US10296573B2 (en) * 2012-11-16 2019-05-21 International Business Machines Corporation Building and maintaining information extraction rules
US20140164996A1 (en) * 2012-12-11 2014-06-12 Canon Kabushiki Kaisha Apparatus, method, and storage medium
US10592480B1 (en) 2012-12-30 2020-03-17 Aurea Software, Inc. Affinity scoring
US20140236570A1 (en) * 2013-02-18 2014-08-21 Microsoft Corporation Exploiting the semantic web for unsupervised spoken language understanding
US10235358B2 (en) 2013-02-21 2019-03-19 Microsoft Technology Licensing, Llc Exploiting structured content for unsupervised natural language semantic parsing
US8909569B2 (en) 2013-02-22 2014-12-09 Bottlenose, Inc. System and method for revealing correlations between data streams
US9697196B2 (en) * 2013-03-12 2017-07-04 Guangsheng Zhang System and methods for determining sentiment based on context
US10031910B1 (en) 2013-03-12 2018-07-24 Guangsheng Zhang System and methods for rule-based sentiment analysis
US20140278365A1 (en) * 2013-03-12 2014-09-18 Guangsheng Zhang System and methods for determining sentiment based on context
US9299041B2 (en) 2013-03-15 2016-03-29 Business Objects Software Ltd. Obtaining data from unstructured data for a structured data collection
US9262550B2 (en) 2013-03-15 2016-02-16 Business Objects Software Ltd. Processing semi-structured data
US9218568B2 (en) 2013-03-15 2015-12-22 Business Objects Software Ltd. Disambiguating data using contextual and historical information
US9898523B2 (en) 2013-04-22 2018-02-20 Abb Research Ltd. Tabular data parsing in document(s)
US9747280B1 (en) * 2013-08-21 2017-08-29 Intelligent Language, LLC Date and time processing
US9639818B2 (en) 2013-08-30 2017-05-02 Sap Se Creation of event types for news mining for enterprise resource planning
US11861294B2 (en) * 2013-09-10 2024-01-02 Embarcadero Technologies, Inc. Syndication of associations relating data and metadata
US9898467B1 (en) * 2013-09-24 2018-02-20 Amazon Technologies, Inc. System for data normalization
US10002117B1 (en) * 2013-10-24 2018-06-19 Google Llc Translating annotation tags into suggested markup
US8781815B1 (en) * 2013-12-05 2014-07-15 Seal Software Ltd. Non-standard and standard clause detection
US20150161102A1 (en) * 2013-12-05 2015-06-11 Seal Software Ltd. Non-Standard and Standard Clause Detection
US9268768B2 (en) * 2013-12-05 2016-02-23 Seal Software Ltd. Non-standard and standard clause detection
US10073840B2 (en) 2013-12-20 2018-09-11 Microsoft Technology Licensing, Llc Unsupervised relation detection model training
US9626353B2 (en) 2014-01-15 2017-04-18 Abbyy Infopoisk Llc Arc filtering in a syntactic graph
US9870356B2 (en) 2014-02-13 2018-01-16 Microsoft Technology Licensing, Llc Techniques for inferring the unknown intents of linguistic items
US9665617B1 (en) * 2014-04-16 2017-05-30 Google Inc. Methods and systems for generating a stable identifier for nodes likely including primary content within an information resource
US9665454B2 (en) 2014-05-14 2017-05-30 International Business Machines Corporation Extracting test model from textual test suite
US9836765B2 (en) 2014-05-19 2017-12-05 Kibo Software, Inc. System and method for context-aware recommendation through user activity change detection
RU2674331C2 (en) * 2014-09-03 2018-12-06 Дзе Дан Энд Брэдстрит Корпорейшн System and process for analysis, qualification and acquisition of sources of unstructured data by means of empirical attribution
US10621182B2 (en) 2014-09-03 2020-04-14 The Dun & Bradstreet Corporation System and process for analyzing, qualifying and ingesting sources of unstructured data via empirical attribution
US9348806B2 (en) * 2014-09-30 2016-05-24 International Business Machines Corporation High speed dictionary expansion
US10148547B2 (en) * 2014-10-24 2018-12-04 Tektronix, Inc. Hardware trigger generation from a declarative protocol description
US9626358B2 (en) 2014-11-26 2017-04-18 Abbyy Infopoisk Llc Creating ontologies by analyzing natural language texts
US11140115B1 (en) * 2014-12-09 2021-10-05 Google Llc Systems and methods of applying semantic features for machine learning of message categories
US20160203233A1 (en) * 2015-01-12 2016-07-14 Microsoft Technology Licensing, Llc Storage and retrieval of structured content in unstructured user-editable content stores
CN107209779A (en) * 2015-01-12 2017-09-26 微软技术许可有限责任公司 Storage and retrieval of structured content in unstructured user-editable content stores
US10706124B2 (en) * 2015-01-12 2020-07-07 Microsoft Technology Licensing, Llc Storage and retrieval of structured content in unstructured user-editable content stores
US10019437B2 (en) * 2015-02-23 2018-07-10 International Business Machines Corporation Facilitating information extraction via semantic abstraction
US10942958B2 (en) 2015-05-27 2021-03-09 International Business Machines Corporation User interface for a query answering system
US20170011023A1 (en) * 2015-07-07 2017-01-12 Rima Ghannam System for Natural Language Understanding
US9824083B2 (en) * 2015-07-07 2017-11-21 Rima Ghannam System for natural language understanding
USRE49576E1 (en) * 2015-07-13 2023-07-11 Docusign International (Emea) Limited Standard exact clause detection
US10185712B2 (en) * 2015-07-13 2019-01-22 Seal Software Ltd. Standard exact clause detection
US9805025B2 (en) 2015-07-13 2017-10-31 Seal Software Limited Standard exact clause detection
US11132111B2 (en) 2015-08-01 2021-09-28 Splunk Inc. Assigning workflow network security investigation actions to investigation timelines
US10848510B2 (en) 2015-08-01 2020-11-24 Splunk Inc. Selecting network security event investigation timelines in a workflow environment
US10778712B2 (en) 2015-08-01 2020-09-15 Splunk Inc. Displaying network security events and investigation activities across investigation timelines
US11363047B2 (en) 2015-08-01 2022-06-14 Splunk Inc. Generating investigation timeline displays including activity events and investigation workflow events
US11641372B1 (en) 2015-08-01 2023-05-02 Splunk Inc. Generating investigation timeline displays including user-selected screenshots
US20170097988A1 (en) * 2015-10-05 2017-04-06 International Business Machines Corporation Hierarchical Target Centric Pattern Generation
US20170097987A1 (en) * 2015-10-05 2017-04-06 International Business Machines Corporation Hierarchical Target Centric Pattern Generation
US11204951B2 (en) * 2015-10-05 2021-12-21 International Business Machines Corporation Hierarchical target centric pattern generation
US11157532B2 (en) * 2015-10-05 2021-10-26 International Business Machines Corporation Hierarchical target centric pattern generation
US11030227B2 (en) 2015-12-11 2021-06-08 International Business Machines Corporation Discrepancy handler for document ingestion into a corpus for a cognitive computing system
US9842161B2 (en) * 2016-01-12 2017-12-12 International Business Machines Corporation Discrepancy curator for documents in a corpus of a cognitive computing system
US11308143B2 (en) 2016-01-12 2022-04-19 International Business Machines Corporation Discrepancy curator for documents in a corpus of a cognitive computing system
US11074286B2 (en) 2016-01-12 2021-07-27 International Business Machines Corporation Automated curation of documents in a corpus for a cognitive computing system
CN107342881A (en) * 2016-05-03 2017-11-10 中国移动通信集团四川有限公司 Method and device for processing northbound interface data of an operation and maintenance center
US20170344625A1 (en) * 2016-05-27 2017-11-30 International Business Machines Corporation Obtaining of candidates for a relationship type and its label
US11163806B2 (en) * 2016-05-27 2021-11-02 International Business Machines Corporation Obtaining candidates for a relationship type and its label
US11663495B2 (en) 2016-07-15 2023-05-30 Intuit Inc. System and method for automatic learning of functions
US11663677B2 (en) 2016-07-15 2023-05-30 Intuit Inc. System and method for automatically generating calculations for fields in compliance forms
US11049190B2 (en) 2016-07-15 2021-06-29 Intuit Inc. System and method for automatically generating calculations for fields in compliance forms
US20180018322A1 (en) * 2016-07-15 2018-01-18 Intuit Inc. System and method for automatically understanding lines of compliance forms through natural language patterns
US10725896B2 (en) 2016-07-15 2020-07-28 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on code coverage
US11520975B2 (en) 2016-07-15 2022-12-06 Intuit Inc. Lean parsing: a natural language processing system and method for parsing domain-specific languages
US11222266B2 (en) 2016-07-15 2022-01-11 Intuit Inc. System and method for automatic learning of functions
US11138389B2 (en) 2016-11-17 2021-10-05 Goldman Sachs & Co. LLC System and method for coupled detection of syntax and semantics for natural language understanding and generation
US10685189B2 (en) * 2016-11-17 2020-06-16 Goldman Sachs & Co. LLC System and method for coupled detection of syntax and semantics for natural language understanding and generation
US10878020B2 (en) * 2017-01-27 2020-12-29 Hootsuite Media Inc. Automated extraction tools and their use in social content tagging systems
US20190065453A1 (en) * 2017-08-25 2019-02-28 Abbyy Development Llc Reconstructing textual annotations associated with information objects
US10191975B1 (en) * 2017-11-16 2019-01-29 The Florida International University Board Of Trustees Features for automatic classification of narrative point of view and diegesis
US11263408B2 (en) * 2018-03-13 2022-03-01 Fujitsu Limited Alignment generation device and alignment generation method
US11048864B2 (en) * 2019-04-01 2021-06-29 Adobe Inc. Digital annotation and digital content linking techniques
US11030402B2 (en) 2019-05-03 2021-06-08 International Business Machines Corporation Dictionary expansion using neural language models
US11163956B1 (en) 2019-05-23 2021-11-02 Intuit Inc. System and method for recognizing domain specific named entities using domain specific word embeddings
US20220245183A1 (en) * 2019-05-31 2022-08-04 Nec Corporation Parameter learning apparatus, parameter learning method, and computer readable recording medium
US11829722B2 (en) * 2019-05-31 2023-11-28 Nec Corporation Parameter learning apparatus, parameter learning method, and computer readable recording medium
US20210081602A1 (en) * 2019-09-16 2021-03-18 Docugami, Inc. Automatically Identifying Chunks in Sets of Documents
US11816428B2 (en) * 2019-09-16 2023-11-14 Docugami, Inc. Automatically identifying chunks in sets of documents
US11822880B2 (en) 2019-09-16 2023-11-21 Docugami, Inc. Enabling flexible processing of semantically-annotated documents
US11783128B2 (en) 2020-02-19 2023-10-10 Intuit Inc. Financial document text conversion to computer readable operations
US20210403036A1 (en) * 2020-06-30 2021-12-30 Lyft, Inc. Systems and methods for encoding and searching scenario information
CN112035408A (en) * 2020-09-01 2020-12-04 文思海辉智科科技有限公司 Text processing method and device, electronic equipment and storage medium
US20220101873A1 (en) * 2020-09-30 2022-03-31 Harman International Industries, Incorporated Techniques for providing feedback on the veracity of spoken statements
CN112819622A (en) * 2021-01-26 2021-05-18 深圳价值在线信息科技股份有限公司 Information entity relationship joint extraction method and device and terminal equipment

Also Published As

Publication number Publication date
AU2004294094A1 (en) 2005-06-09
US7912705B2 (en) 2011-03-22
NZ547871A (en) 2010-03-26
EP1695170A2 (en) 2006-08-30
CA2546896A1 (en) 2005-06-09
AU2004294094B2 (en) 2010-05-13
WO2005052727A2 (en) 2005-06-09
EP1695170A4 (en) 2010-06-02
US20100195909A1 (en) 2010-08-05
CA2546896C (en) 2012-08-07
WO2005052727A3 (en) 2007-12-21

Similar Documents

Publication Publication Date Title
US7912705B2 (en) System and method for extracting information from text using text annotation and fact extraction
Daud et al. Urdu language processing: a survey
Miłkowski Developing an open‐source, rule‐based proofreading tool
US6539348B1 (en) Systems and methods for parsing a natural language sentence
KR101139903B1 (en) Semantic processor for recognition of Whole-Part relations in natural language documents
US7937265B1 (en) Paraphrase acquisition
US9430742B2 (en) Method and apparatus for extracting entity names and their relations
Reese Natural language processing with Java
US7233891B2 (en) Natural language sentence parser
Jabbar et al. A survey on Urdu and Urdu like language stemmers and stemming techniques
Avner et al. Identifying translationese at the word and sub-word level
Mosavi Miangah FarsiSpell: A spell-checking system for Persian using a large monolingual corpus
Reese et al. Natural Language Processing with Java: Techniques for building machine learning and neural network models for NLP
Dione LFG parse disambiguation for Wolof
Yeshambel et al. Evaluation of corpora, resources and tools for Amharic information retrieval
Gakis et al. Design and implementation of an electronic lexicon for Modern Greek
Mohbey et al. Preprocessing and morphological analysis in text mining
Theijssen et al. Evaluating automatic annotation: automatically detecting and enriching instances of the dative alternation
Volk The automatic resolution of prepositional phrase attachment ambiguities in German
Weiss et al. From textual information to numerical vectors
Florea et al. Improving writing for Romanian language
Petrovčič et al. The New Chinese Corpus of Literary Texts Litchi
Vale et al. Building a large dictionary of abbreviations for named entity recognition in Portuguese historical corpora
Freihat et al. ALP: An Arabic Linguistic Pipeline
Shao et al. The Construction of a Chinese Semantic Dependency Graph Bank

Legal Events

Date Code Title Description
AS Assignment

Owner name: LEXISNEXIS, OHIO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WASSON, MARK D.;WILTSHIRE, JR., JAMES S.;LORITZ, DONALD;AND OTHERS;REEL/FRAME:015013/0693;SIGNING DATES FROM 20040106 TO 20040130

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION