US20100138388A1 - Mapping instances of a dataset within a data management system - Google Patents

Mapping instances of a dataset within a data management system

Info

Publication number
US20100138388A1
Authority
US
United States
Prior art keywords
dataset
data
mapping
datasets
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/628,521
Inventor
Tim Wakeling
Adam Weiss
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ab Initio Technology LLC
Ab Initio Original Works LLC
Original Assignee
Ab Initio Software LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ab Initio Software LLC
Priority to US12/628,521
Assigned to AB INITIO SOFTWARE LLC reassignment AB INITIO SOFTWARE LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: WAKELING, TIM, WEISS, ADAM
Assigned to AB INITIO TECHNOLOGY LLC reassignment AB INITIO TECHNOLOGY LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: AB INITIO ORIGINAL WORKS LLC
Assigned to AB INITIO ORIGINAL WORKS LLC reassignment AB INITIO ORIGINAL WORKS LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: AB INITIO SOFTWARE LLC
Publication of US20100138388A1
Priority to US16/902,949 (patent US11341155B2)
Legal status: Abandoned

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/40Data acquisition and logging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2365Ensuring data consistency and integrity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84Mapping; Conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/197Version control

Definitions

  • This description relates to mapping instances of a dataset within a data management system.
  • a modern data management system may include a multitude of elements representing different aspects of the system. Systems of lesser complexity often allow data to be viewed directly without additional processing for the purpose of accurate visualization. Systems of greater complexity may require additional mechanisms for the data to be meaningfully viewed.
  • a complex data management system made up of many elements may store data in many different forms and process data in many different ways. These forms of storage and processing may relate to each other in ways that are not apparent without a way to analyze the relationships.
  • a system for mapping data stored in a data storage system includes a data storage system storing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset; a mapper that identifies one or more sets of datasets associated with the dataflow graphs, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset; a user interface that receives a mapping between at least two datasets in a given set, and stores the mapping in the data storage system in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
  • a system for mapping data stored in a data storage system includes: means for processing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset; means for identifying one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset; means for providing a user interface to receive a mapping between at least two datasets in a given set; and means for storing the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
  • a computer program for mapping data stored in a data storage system is stored on a computer-readable medium, and includes instructions for causing a computer to process specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset; identify one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset; provide a user interface to receive a mapping between at least two datasets in a given set; and store the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
  • aspects can include one or more of the following features.
  • the set is presented over the user interface.
  • a list of possible mappings ordered according to a quantification of a match to the one or more criteria is presented over the user interface.
  • the list of possible mappings includes candidates that are more likely to be an instance of a given dataset ordered higher in the list.
  • One of the criteria is built into a mapper that identifies the one or more sets of datasets.
  • One of the criteria is received from the user interface.
  • At least one of the possible mappings indicates a component of a dataflow graph that represents a dataset, and at least one of the possible mappings indicates a component of a dataflow graph that does not represent a dataset.
  • a sub-graph of a dataflow graph including multiple components represents a dataset.
  • the sub-graph includes a data component.
  • the sub-graph includes an executable component.
  • Identifying one or more sets of datasets includes using heuristics for determining if a dataset in a given set has one or more characteristics in common with another dataset.
  • the characteristics include the quantity of bytes and records in a representation of the dataset.
  • the characteristics include the name of a representation of the dataset.
  • the characteristics include the date of creation of a representation of the dataset.
  • the characteristics include the data format of a representation of the dataset.
  • At least one of the datasets of the mapping belongs to a group of datasets known to a data management system.
  • a format mapping is provided between datasets in a given set.
  • the mapping includes an identifier that points to a record in the data management system that keeps track of the dataset.
  • the mapping is updated based on a change in a dataset.
  • aspects of the invention can include one or more of the following advantages.
  • a match between two instances of a dataset can be made more efficiently than by purely manual operation. Further, by providing a user interface to receive a mapping between at least two datasets, the mapping will be more accurate than if the system were purely automatic.
  • FIG. 1 is a dataflow graph.
  • FIG. 2 is an overview of a dataset mapper and associated components.
  • FIGS. 3A-3E are diagrams of different scenarios handled by a dataset mapper.
  • FIG. 4 is a flowchart of dataset mapper operation.
  • FIG. 5 is a dataset linkage mapping.
  • FIG. 6 is a dataset format mapping.
  • a data processing element may be in the form of a graph.
  • a graph-based computation is implemented using a “dataflow graph” that is represented by a directed graph, with vertices in the graph representing components (either data storage components corresponding to stored data or computation components corresponding to executable processes), and the directed links or “edges” in the graph representing flows of data between components.
  • a dataflow graph (also called simply a “graph”) is a modular entity. Each graph can be made up of one or more other graphs, and a particular graph can be a component in a larger graph.
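The graph model described above can be illustrated with a minimal Python sketch: components are vertices, directed links are flows of data, and a component may itself be implemented by another graph. The class and field names (`Component`, `DataflowGraph`, `kind`, `subgraph`) and the sample component names are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Component:
    """A vertex: a data storage component or a computation component."""
    name: str
    kind: str  # "data" or "executable" (labels assumed for illustration)
    subgraph: Optional["DataflowGraph"] = None  # set when the component is itself a graph


@dataclass
class DataflowGraph:
    name: str
    components: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (source, target) pairs: flows of data

    def add(self, component: Component) -> Component:
        self.components[component.name] = component
        return component

    def connect(self, source: str, target: str) -> None:
        # A link is directed: data flows from source to target.
        if source not in self.components or target not in self.components:
            raise KeyError("both endpoints must be components of this graph")
        self.links.append((source, target))

    def downstream(self, name: str) -> list:
        """Names of the components that receive data directly from `name`."""
        return [t for s, t in self.links if s == name]


# A graph used as a component inside a larger graph.
inner = DataflowGraph("cleanse")
inner.add(Component("dedupe", "executable"))

outer = DataflowGraph("daily_load")
outer.add(Component("customers_in", "data"))
outer.add(Component("cleanse", "executable", subgraph=inner))
outer.add(Component("customers_out", "data"))
outer.connect("customers_in", "cleanse")
outer.connect("cleanse", "customers_out")
```

Storing links as plain (source, target) pairs keeps the sketch small; a real system would likely also track ports and record formats on each link.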
  • a graphic development environment provides a user interface for specifying executable graphs and defining parameters for the graph components.
  • an example of a dataflow graph 101 includes an input component 102 providing a collection of data to be processed by the executable components 104 a - 104 j of the dataflow graph 101 .
  • the dataset 102 can include data records associated with a database system or transactions associated with a transaction processing system.
  • Each executable component is associated with a portion of the computation defined by the overall dataflow graph 101 .
  • Work elements (e.g., individual data records from the data collection) enter one or more input ports of a component, and output work elements typically leave one or more output ports of the component.
  • output work elements from components 104 e, 104 g, and 104 j are stored in output data components 102 a - 102 c.
  • a dataset is an object (e.g., stored in an object oriented database) that represents a particular collection of data.
  • a component is capable of representing a dataset.
  • a graph may interact with the component representing a dataset (or simply “dataset component”) in one or more ways.
  • a dataset component includes instructions for accessing the physical data represented by a given dataset, so a graph can accept input from a dataset using a dataset component, provide output to a dataset using a dataset component, and process data of a dataset using a dataset component at an intermediate step.
  • a dataset component can include various kinds of information associated with a given dataset object including an instance of the dataset object. Such a system could have many dozens, hundreds, or thousands of graphs and associated dataset components. As the complexity of such a system increases, the relationships between different graphs and dataset components become more difficult to manage. More than one dataset component in the system can represent the same data source and each such dataset component can be associated with a different graph, graph subset, or executable component.
  • a single dataset may be stored in more than one location associated with the data management system.
  • two or more data sources contain similar or identical versions of the same data.
  • Two graphs in the system might handle this single dataset, but each graph reads from and writes to a different data file, a different database table, or another type of dataset component.
  • the data (e.g., data files) represented by a given dataset may be not only stored in more than one location, but also interpreted using different data storage formats.
  • two graphs may operate on two separate data files containing the same data, differing only in format.
  • Each data file may have a different arrangement of data types, despite containing instances of the same data.
  • one graph may operate on a data file containing an instance of the dataset, and another graph may operate on a database table also containing an instance of the dataset.
  • a data file and a database table will generally have two different data formats.
  • the data management system may access different versions of the same dataset each in different ways.
  • One graph may access an instance of the dataset directly, such as by reading in a data file through a standard file input/output mechanism.
  • Another graph may retrieve a file by querying an external source, such as a data repository available via a network.
  • a graph may also access a database table retrieved through a similar external query, such as a query to a networked database.
  • the data management system may also make reference to different instances of the same dataset each in different ways.
  • a graph may be capable of accessing different data locations according to a parameter.
  • Such a parameter could point to any number of data locations over time.
  • a graph that operates multiple times may access different locations on different occasions if the parameter varies between executions of the graph.
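As a rough illustration of a parameter that points to different data locations on different executions, the sketch below expands a path from a parameter set using Python's standard `string.Template`. The function name `resolve_location`, the parameter names `INPUT_PATH` and `RUN_DATE`, and the paths are assumptions made for this example only.

```python
from string import Template


def resolve_location(parameters: dict) -> str:
    """Expand ${NAME} references in the INPUT_PATH parameter (hypothetical name)."""
    return Template(parameters["INPUT_PATH"]).substitute(parameters)


run_1 = {"INPUT_PATH": "/data/sales/${RUN_DATE}.dat", "RUN_DATE": "2009-12-01"}
run_2 = {"INPUT_PATH": "/data/sales/${RUN_DATE}.dat", "RUN_DATE": "2009-12-02"}

# The same graph definition reads a different physical location on each run.
locations = [resolve_location(p) for p in (run_1, run_2)]
```

Both locations would hold instances of the same dataset, which is why a tool that sees only the unresolved parameter cannot tell on its own which dataset the graph touches.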
  • the representation of a dataset within a graph may not be a single component, but rather a collection of components, such as a “sub-graph” component within a graph that is itself implemented as a graph with multiple components.
  • the collection may include one or more dataset components, and could also include one or more executable components.
  • One approach is an automatic mechanism that identifies multiple instances of the same dataset and creates linkages between them.
  • some automatic mechanisms have drawbacks, such as the following three drawbacks.
  • the mechanism may require that each instance of a dataset be stored in a particular manner, such as under a unified naming scheme and directory structure. This provides the mechanism with a way to identify and locate each one in the storage system associated with the data management system.
  • this arrangement limits the flexibility of the data management system and may be too restrictive for some uses of the system.
  • the mechanism may not properly identify instances of the same dataset and form the correct linkages. For example, this is likely if a dataset is accessed using an externally-referenced entity, and the automatic mechanism does not have access to that entity. Similarly, this is likely if a component accesses a dataset according to an independent parameter in a parameter list, and the mechanism does not have a way to access or interpret the parameter list. Further, this is likely if a dataset is represented by a complex entity made up of one or more dataset components and executable components, such as a sub-graph. An automatic mechanism may be unable to discern what particular combination of components represents a particular dataset.
  • the mechanism may form redundant or unnecessary linkages between dataset instances.
  • some of the datasets handled by the data management system may represent extraneous data, such as the contents of error logs. Any linkages between instances of these datasets are unnecessary.
  • some of the instances of a dataset handled by the data management system may be redundant instances, such as cached data or other temporary copies of data. A linkage that connects to this type of data quickly becomes obsolete and would be confusing to a user examining the data management system.
  • An alternative approach is a system in which a user manually consolidates instances of the same dataset via a user interface.
  • a user is less likely to miss essential linkages between instances of a dataset, and is also less likely to create redundant or unnecessary linkages between instances of a dataset.
  • If the data management system has hundreds or thousands of components, the amount of time needed for the user to manually create the necessary linkages is prohibitively large.
  • a dataset mapper is used to provide some automatic analysis, and to enable some interaction with a user in a way that is not prohibitive for a user of a large and/or complex system.
  • FIG. 2 is a block diagram of one embodiment of an exemplary dataset mapper 100 showing the interrelationship between associated principal elements.
  • a dataset mapper 100 is capable of analyzing a set of one or more graphs 180 , 180 a, 180 b, 180 c. Each graph is associated with one or more dataset components 182 , 182 a, 182 b, where each dataset component could correspond to a data file, a database table, a sub-graph, or another kind of component representing a dataset.
  • the mapper 100 analyzes the graphs for the purpose of forming linkages between dataset components that contain instances of the same dataset 186 .
  • the mapper 100 processes each dataset component according to a combination of built-in rules 110 , user-defined rules 120 , and heuristics 130 , to determine if a dataset component 182 may contain an instance of one of several datasets representing data sources 176 , 176 a, 176 b known to a data management system 170 .
  • the mapper 100 passes this information to a user interface 160 , which allows a user 162 to select the proper dataset, if any, that corresponds to the dataset component 182 .
  • the user interface 160 presents a list of possible candidate mappings based on a match to one or more criteria for identifying different versions or instances of a single dataset.
  • the list can be ordered according to quantification of the match to the one or more criteria (e.g., candidates that are more likely to be an instance of a given dataset are ordered higher in the list).
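One way to picture the ordered candidate list is a scoring function over shared characteristics (name, record count, data format) followed by a sort. The sketch below is only an assumption about how such a quantification might look: the `Profile` type, the function names, and the weights are all illustrative and do not come from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Characteristics of one representation of a dataset (illustrative subset)."""
    name: str
    record_count: int
    data_format: str


def match_score(candidate: Profile, known: Profile) -> float:
    """Return a score between 0 and 1; the weights are arbitrary illustrative choices."""
    score = 0.0
    if candidate.name.lower() == known.name.lower():
        score += 0.5
    if candidate.record_count == known.record_count:
        score += 0.3
    if candidate.data_format == known.data_format:
        score += 0.2
    return score


def rank_known_datasets(candidate: Profile, known: list) -> list:
    """Order known datasets so that likelier matches come first."""
    scored = [(match_score(candidate, k), k.name) for k in known]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


known_datasets = [
    Profile("orders", 1000, "csv"),
    Profile("customers", 250, "csv"),
    Profile("CUSTOMERS", 250, "fixed-width"),
]
candidate = Profile("customers", 250, "csv")
ranking = rank_known_datasets(candidate, known_datasets)
```

Here the exact match ranks first, the same-named dataset in a different format ranks second, and the unrelated dataset ranks last; the user still makes the final selection.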
  • the mapper 100 then generates a dataset linkage mapping 140 that indicates that the dataset component 182 contains an instance of the dataset representing a data source 176 .
  • the dataset component 182 can have a data format 184 that differs from the format 174 of a corresponding linked data source 176 .
  • the user may choose to establish a single data format for all instances of the dataset.
  • the system stores a format 174 , 174 a, 174 b for each data source 176 , 176 a, 176 b.
  • the user can choose to create an optional mapping 142 between the format 184 of the dataset component 182 and the established format 174 of the corresponding data source 176 .
  • the optional data format mapping 142 allows the system 170 to retain information about the data types for each instance of the dataset.
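A format mapping of this kind could be pictured as a field-by-field correspondence between the format of a dataset component and the established format of its data source. The following sketch assumes formats are simple name-to-type dictionaries; the function `build_format_mapping`, the `renames` argument, and the field and type names are hypothetical.

```python
def build_format_mapping(component_format: dict, established_format: dict, renames: dict) -> dict:
    """Map each field of a component's format to a field of the established format.

    `renames` gives explicit component-field -> established-field pairs;
    fields with identical names are matched automatically.
    """
    mapping = {}
    for field_name, field_type in component_format.items():
        target = renames.get(field_name, field_name)
        if target not in established_format:
            raise KeyError(f"no established field for {field_name!r}")
        mapping[field_name] = {
            "target": target,
            "source_type": field_type,  # type as stored in this instance
            "target_type": established_format[target],  # type in the established format
        }
    return mapping


established = {"customer_id": "integer", "signup_date": "date"}
table_format = {"CUST_ID": "decimal(10)", "signup_date": "string(10)"}

format_mapping = build_format_mapping(table_format, established, {"CUST_ID": "customer_id"})
```

Keeping both the source and target types in each entry reflects the stated purpose of retaining data type information for every instance of the dataset.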
  • the mapper 100 also enables a user to indicate a linkage between an executable component and a single dataset component, which may have no other linkages to it.
  • a dataset component may correspond to a source dataset with only one reader or a target dataset with only one writer. If the dataset object already exists in the system and has other relevant metadata, such as the correct record format, documentation, data profiles, etc., the linkage enables the dataset component to be mapped to the correct dataset.
  • the mapper 100 is capable of handling common scenarios that arise in a complex data management system.
  • In a first scenario (FIG. 3A), one graph 210 provides a dataset component 212 as output, and another graph 220 accepts a different dataset component 222 as input.
  • Each dataset component contains an instance of the same dataset 216 .
  • This dataset may be the same as a dataset representing a data source 176 known to the data management system.
  • the first dataset component 212 has a data format 214 that may be the same as the format belonging to the second dataset component 222 , or, alternatively, the second component may have a different format 224 .
  • the mapper 100 is capable of identifying the second dataset component 222 as being an instance of the dataset 216 represented by the first dataset component 212 and creating an appropriate linkage mapping 140 .
  • a graph 230 is associated with an external dataset component 232 using an external reference 238 to an external source 239 .
  • the external dataset component 232 has a data format 234 and is an instance of a dataset 236 .
  • the dataset 236 represented by the external dataset component may be a dataset representing a data source 176 known to the data management system 170 .
  • the mapper 100 is capable of identifying this external dataset component 232 as being an instance of another dataset and creating an appropriate linkage mapping 140 .
  • a graph 240 is associated with a dataset component 242 using a parameter 248 in a parameter list 247 .
  • the referenced dataset component 242 has a data format 244 and is an instance of a dataset 246 .
  • the dataset 246 represented by the referenced dataset component may be a dataset representing a data source 176 known to the data management system 170 .
  • the mapper 100 is capable of identifying this referenced dataset component 242 as being an instance of another dataset and creating an appropriate linkage mapping 140 .
  • a graph 250 is associated with an external component 251 using an external reference 258 to an external source 259 .
  • the external component 251 is not a dataset component, but rather another kind of component, such as an executable component.
  • the mapper 100 is capable of identifying this external component 251 as inapplicable to the dataset linkage mapping process.
  • a graph 260 is associated with a sub-graph component 263 , itself made up of several components. These components include at least one dataset component 262 , and, in this example, one or more executable components 261 a, 261 b, 261 c.
  • the sub-graph 263 as a single entity represents at least one dataset.
  • Other exemplary sub-graphs may include multiple dataset components, and any number of executable components, including zero.
  • this sub-graph 263 has multiple outputs 265 a, 265 b. Each output is capable of providing a different instance of a dataset to the component that receives the output.
  • Another exemplary sub-graph could also have any number of inputs.
  • a further exemplary sub-graph may have no inputs or outputs that correspond to a respective dataset.
  • the mapper 100 is capable of identifying the sub-graph 263 as being an instance of at least one dataset and creating at least one appropriate linkage mapping 140 .
  • In step 302, the mapper first identifies which of the elements associated with a graph represent datasets.
  • a graph will have one or more inputs and outputs, and each input and each output could be an instance of a dataset.
  • Each graph may also handle an instance of a dataset at some intermediate step.
  • each graph can be connected to multiple components that are capable of being dataset candidates.
  • the data management system has information about the characteristics of some of the components, including information about whether or not the component represents a dataset. In those cases, the mapper adds the potential dataset components to a table of dataset candidates in step 304 .
  • a component could be a sub-graph made up of multiple components, including dataset components and executable components.
  • a sub-graph could represent at least one instance of a dataset. Accordingly, the mapper compiles a list of all such sub-graphs and adds them to the table of dataset candidates as part of step 304 .
  • the nature of the component may not be available to the data management system.
  • the component could be accessed through a reference to an external entity, where the reference may be a query to a database table, a Uniform Resource Locator pointing to an Internet server, a parameter in a parameter list, or another type of reference.
  • the mapper generally has no means by which it can independently access the entity pointed to by the reference. Accordingly, the mapper compiles a list of all such references and adds them to the table of dataset candidates as part of step 304 .
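The three cases above (components known to represent datasets, sub-graphs, and references the mapper cannot inspect) can be sketched as a single pass that fills the table of dataset candidates. The dictionary keys, the `kind` labels, and the function name below are assumptions for illustration, not the system's actual data model.

```python
def build_candidate_table(components: list) -> list:
    """Collect dataset candidates from a graph's components (steps 302 and 304).

    Each component is a dict with a hypothetical "kind" key:
    "dataset", "executable", "subgraph", or "reference" (nature unknown).
    """
    candidates = []
    for comp in components:
        if comp["kind"] == "dataset":
            reason = "known to represent a dataset"
        elif comp["kind"] == "subgraph":
            reason = "sub-graph may represent a dataset"
        elif comp["kind"] == "reference":
            reason = "external reference cannot be inspected"
        else:
            continue  # components known to be executable are not candidates
        candidates.append({"component": comp["name"], "reason": reason})
    return candidates


components = [
    {"name": "customers.dat", "kind": "dataset"},
    {"name": "reformat", "kind": "executable"},
    {"name": "cleanse_subgraph", "kind": "subgraph"},
    {"name": "${INPUT_TABLE}", "kind": "reference"},
]
candidate_table = build_candidate_table(components)
```

Note that unresolved references are added rather than dropped: the decision about whether they represent datasets is deferred to the user.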
  • For a given dataset candidate, the mapper generates a list of known datasets that the dataset candidate could map to.
  • the mapper uses a combination of user-defined rules, built-in rules, and heuristics to evaluate which known datasets could map to a dataset candidate.
  • the user selects the known dataset that corresponds to the dataset candidate.
  • the user may also access a full list of all known datasets, if none of the suggested known datasets is the correct match.
  • the user can indicate that the dataset candidate is not a dataset.
  • a reference to a remote server could be a call to a remote executable procedure, which is not a data entity.
  • the dataset candidate may represent data, but it may be data of a kind not pertinent to the data management system, such as an error log. In this case, the user may indicate to the user interface that this data is to be ignored in the mapping process.
  • the user identifies the data format of the newly-mapped dataset.
  • the system may have a set of data format templates, one of which can be selected.
  • the user can create a new data format in the user interface.
  • In step 312, the mapper uses this information to generate a linkage mapping for the dataset candidate, and, optionally, a format mapping.
  • the mapper offers the next dataset candidate to the user for linkage generation in another iteration of steps 308 , 310 , and 312 , unless the mapper has processed all dataset candidates.
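The per-candidate iteration can be sketched as a loop in which the mapper proposes known datasets and a user decision is recorded for each candidate. In this sketch the user is stood in for by a `choose` callback, and returning `None` models the user indicating that the candidate is not a dataset or should be ignored. All names and sample values are hypothetical, and the optional format selection is omitted for brevity.

```python
def map_candidates(candidates, suggest, choose):
    """Offer each candidate to the user in turn and record the resulting linkage.

    suggest(candidate) -> list of known dataset names proposed by the mapper
    choose(candidate, suggestions) -> chosen dataset name, or None to ignore
    """
    linkages = []
    for candidate in candidates:
        suggestions = suggest(candidate)
        choice = choose(candidate, suggestions)
        linkages.append({
            "component": candidate,
            "dataset": choice,
            "ignore": choice is None,
        })
    return linkages


def suggest(candidate):
    # Stand-in for the mapper's rules and heuristics.
    proposals = {"cust_2009.dat": ["customers", "orders"], "errors.log": ["orders"]}
    return proposals.get(candidate, [])


scripted_choices = {"cust_2009.dat": "customers", "errors.log": None}


def choose(candidate, suggestions):
    # Stand-in for the user interface; a real user could also pick from the full list.
    return scripted_choices[candidate]


linkages = map_candidates(["cust_2009.dat", "errors.log"], suggest, choose)
```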
  • In step 314, the user views the components associated with the data management system, to ensure that a visualization of the associations between graphs and dataset components is accurate based on the new linkages between components.
  • In step 316, the user has the option of making any adjustments to the linkage and format mapping.
  • the mapper delivers the linkage and format mapping to the data management system.
  • the mappings can be stored alongside one or more graphs, or in a separate storage entity associated with the data management system, or by another means.
  • the mapper 100 is capable of handling multiple scenarios that may arise that affect the integrity of the dataset linkages.
  • the first scenario includes identifying new dataset candidates when new components are added to the data management system 170 .
  • the mapper 100 analyzes each component and presents possible linkages to the user.
  • the mapper 100 is capable of operating on any new components to generate the appropriate linkages as needed.
  • the second scenario includes maintaining the existing linkages as the data management system 170 changes over time. For example, new instances of a dataset may have come into existence over the course of the normal operation of the graphs associated with the system. As another example, a dataset may have changed its identity, such as its name or location in the system. As a further example, a dataset may have been deleted entirely. As another further example, a dataset candidate may have been overlooked in a previous round of linkage creation, and so the collection of linkages is incomplete.
  • the user interface 160 of the mapping system allows a user 162 to modify the existing linkages to remedy any mappings that are incomplete or outdated.
  • the third scenario includes automatically updating linkages for dataset references that invariably follow a known pattern.
  • a graph may handle a dataset that is referenced in a parameter list 247 .
  • Such a parameter list may change over time. If the parameter list follows a standard format known to the data management system, the mapper can identify changes in the parameter list and update the existing linkages accordingly.
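A minimal sketch of such an automatic update, assuming the parameter list is available as a name-to-location dictionary and that each linkage is keyed by the parameter that references the dataset. The function name and the sample parameters are illustrative assumptions.

```python
def refresh_linkages(linkages: dict, old_params: dict, new_params: dict) -> dict:
    """Return a copy of `linkages` with locations updated for changed parameters.

    `linkages` maps a parameter name to the data location currently linked to it.
    """
    updated = dict(linkages)
    for name, new_value in new_params.items():
        if name in linkages and old_params.get(name) != new_value:
            updated[name] = new_value
    return updated


current = {"INPUT_FILE": "/data/in/2009-11.dat", "LOOKUP": "/data/ref/codes.dat"}
old_params = dict(current)
new_params = {"INPUT_FILE": "/data/in/2009-12.dat", "LOOKUP": "/data/ref/codes.dat"}

updated = refresh_linkages(current, old_params, new_params)
```

Returning a new dictionary instead of mutating the existing one leaves the previous linkages available for comparison or rollback.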
  • A dataset linkage mapping 140 contains a component name 402, a dataset name 404, a dataset type 406, a format 408, a master dataset location 410, and a flag 412.
  • The component name 402 is the dataset component or sub-graph that represents this instance of the dataset.
  • The dataset name 404 is an identifier that points to the dataset represented by this component.
  • The dataset type 406 indicates the category that this instance of the dataset falls under, for example, a data file, a database table, or another type.
  • The format 408 is the format or arrangement that this instance of the dataset uses to represent its data.
  • The master dataset location 410 is an identifier that points to the record in the data management system that keeps track of this dataset.
  • The flag 412 indicates whether or not this instance of the dataset should be ignored, for example, if the user has identified this instance as not applicable to the data management system, so that it should be excluded from the set of linkages.
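  • The six fields of the dataset linkage mapping 140 could be represented as a simple record, as in the sketch below. The class name, the attribute names, and the choice of plain strings for every field are assumptions made for illustration; the description does not specify a concrete representation.

```python
from dataclasses import dataclass


@dataclass
class DatasetLinkageMapping:
    component_name: str            # 402: dataset component or sub-graph
    dataset_name: str              # 404: identifier of the represented dataset
    dataset_type: str              # 406: e.g. "data file" or "database table"
    data_format: str               # 408: format used by this instance
    master_dataset_location: str   # 410: record that tracks the dataset
    ignore: bool = False           # 412: exclude from the set of linkages


def active_linkages(mappings):
    """Return only the mappings that are not flagged to be ignored."""
    return [m for m in mappings if not m.ignore]


mappings = [
    DatasetLinkageMapping("customers_in", "customers", "data file",
                          "customer_v1", "/datasets/customers"),
    DatasetLinkageMapping("error_log", "errors", "data file",
                          "text", "/datasets/errors", ignore=True),
]
kept = active_linkages(mappings)
```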
  • The mapper 100 has a set of built-in rules 110 that operate according to standard conventions of the data management system.
  • The mapper can identify datasets corresponding to a dataset component with the highest degree of accuracy if the dataset component follows the built-in rules 110.
  • Externally-referenced database tables containing dataset candidates must be placed in persistent storage under a standardized directory structure used by the data management system.
  • A graph that accesses an externally-referenced dataset component according to a parameter must use a parameter that the data management system is also capable of accessing and resolving.
  • The format of a dataset component must be available in persistent storage and accessible by the data management system.
  • Other built-in rules are also possible, depending on the data management system.
  • The mapper 100 also has a collection of optional user-defined rules 120. These rules 120 may be enabled or disabled by a user, depending on which are applicable to the user's particular data management system.
  • The mapper has six user-defined optional rules. The mapper can ignore some of the information in the name of a database table, if some of the information in the name obscures the identity of the table, such as information about the user who defined the table. Further, the mapper can eliminate this information from the name of a database table. Further, the mapper can ignore a particular category of data files that are known to contain data that is not pertinent to the datasets associated with the data management system.
  • Such a category could be a data file type or data file extension.
  • The mapper can resolve references to a particular parameter in a parameter list and replace the reference with the name of the parameter itself. Further, the mapper can eliminate references to a parameter entirely. The user can also create other rules for the mapper to follow.
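  • A few of the user-defined rules above can be sketched as small name-normalization helpers. The sketch makes several assumptions that are not stated in the description: that the owner of a database table appears as an `owner.table` prefix, that ignored file categories are identified by extension, and that parameter references are written as `${NAME}`. All function names are hypothetical.

```python
import re

# Assumed reference syntax; the actual syntax is not given in the description.
PARAM_REF = re.compile(r"\$\{(\w+)\}")


def normalize_table_name(table_name, strip_owner=True):
    """Drop an assumed 'owner.' prefix that obscures the table's identity."""
    if strip_owner and "." in table_name:
        return table_name.split(".", 1)[1]
    return table_name


def is_ignored_file(path, ignored_extensions):
    """True if the file belongs to a category the user chose to ignore."""
    lowered = path.lower()
    return any(lowered.endswith(ext) for ext in ignored_extensions)


def rewrite_parameter_refs(text, drop=False):
    """Replace each ${NAME} with NAME, or remove it entirely when drop=True."""
    return PARAM_REF.sub("" if drop else r"\1", text)
```

  • Each helper corresponds to one rule that a user could enable or disable independently, for example `normalize_table_name("jsmith.customers")` or `rewrite_parameter_refs("${INPUT_DIR}/customers.dat")`.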
  • The mapper 100 also uses a set of heuristics 130.
  • The heuristics 130 allow the mapper to analyze the characteristics of a given dataset component and compare those characteristics to known datasets. A dataset component with similar characteristics to a known dataset is likely to be an instance of that dataset.
  • The mapper uses two heuristics. One heuristic is the characteristics of the data of a given dataset component. For example, if the data associated with a dataset component has the same quantity of bytes and records as does the data associated with a known dataset, then that dataset component is likely to be an instance of that dataset.
  • A second heuristic is the data format of a dataset component. If a dataset component shares a data format with a known dataset, then the dataset component is likely to be an instance of the dataset. This heuristic is less reliable in situations where multiple distinct datasets use the same data format.
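  • The two heuristics can be combined into a simple score used to order candidate matches. In the sketch below, the weights (2 for matching byte and record counts, 1 for a matching data format) are arbitrary assumptions that only reflect the statement that the format heuristic is less reliable. The `Profile` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    byte_count: int
    record_count: int
    data_format: str


def score(candidate, known):
    """Score how likely `candidate` is an instance of `known` (assumed weights)."""
    total = 0
    if (candidate.byte_count, candidate.record_count) == (
            known.byte_count, known.record_count):
        total += 2   # same quantity of bytes and records
    if candidate.data_format == known.data_format:
        total += 1   # same data format: weaker evidence
    return total


def rank_known_datasets(candidate, known_datasets):
    """Return names of plausible known datasets, most likely first."""
    scored = [(score(candidate, k), k.name) for k in known_datasets]
    plausible = [(s, n) for s, n in scored if s > 0]
    return [n for s, n in sorted(plausible, key=lambda p: (-p[0], p[1]))]


candidate = Profile("customers_copy", 1000, 10, "fmt_a")
known = [
    Profile("orders", 500, 5, "fmt_a"),
    Profile("customers", 1000, 10, "fmt_a"),
    Profile("logs", 1, 1, "fmt_b"),
]
ranking = rank_known_datasets(candidate, known)
```

  • Known datasets with a score of zero are left out of the ranking, so the user sees only plausible matches first and can still fall back to a full list.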
  • Each dataset representing a data source has an associated data format that indicates, for each element in the dataset, what type of data the element represents.
  • The data format of a database table indicates the data types of each field within a given record.
  • The data management system 170 retains a single data format 174, 174 a, 174 b for each dataset representing a data source 176, 176 a, 176 b.
  • If the mapper 100 has encountered a dataset component 182 that represents a new dataset 186, then the mapper 100 creates a corresponding data format to be stored by the data management system, based on the data format 184 of the dataset component 182.
  • A dataset component 182 may represent a known dataset representing a data source 176, while the dataset component 182 has a different data format 184 than the data format 174 of the known dataset representing a data source 176.
  • The data management system 170 handles the dataset representing a data source 176 as a single entity, independent of the number of instances of that dataset that may exist. Consequently, the data management system 170 relies on the mapper 100 to consolidate the different formats 174, 184 when these situations arise.
  • The mapper is capable of addressing each situation in one of four different ways depending on the requirements of the user and the data management system. The user 162 can choose any one of the four methods of consolidation for each situation.
  • The mapper 100 uses the data format 184 of the dataset component 182 as the master data format of the dataset and updates the data management system 170 accordingly.
  • The mapper 100 uses the data format 174 of the existing dataset as the master data format of the dataset and updates the data management system 170 accordingly.
  • The mapper 100 retains both data formats, and generates a mapping 142 between the fields of each data format.
  • The dataset format mapping 142 indicates which fields 512 a, 512 b, 512 c of the dataset format 510 correspond to which fields 522 a, 522 b, 522 c of the format of the dataset instance, e.g., the dataset component.
  • Under the fourth method of consolidation, the mapper generates a new union data format capable of acting as either data format.
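  • The four methods of consolidation can be sketched as a single function that returns the resulting master format and, where applicable, a field mapping. The sketch assumes that a data format is a dictionary from field name to type name, that fields correspond when their names are identical, and that a union format is the superset of both field sets. None of these details is specified in the description, and the method labels are hypothetical.

```python
def consolidate(master_format, component_format, method):
    """Return (master format, field mapping or None) for the chosen method."""
    if method == "use_component":
        # The component's format becomes the master format.
        return dict(component_format), None
    if method == "use_master":
        # The existing format stays the master format.
        return dict(master_format), None
    if method == "keep_both":
        # Keep both formats and map fields, here matched by identical name.
        field_map = {f: f for f in master_format if f in component_format}
        return dict(master_format), field_map
    if method == "union":
        # Superset of both field sets; on a name clash the master type wins.
        union = dict(master_format)
        for field_name, type_name in component_format.items():
            union.setdefault(field_name, type_name)
        return union, None
    raise ValueError(f"unknown consolidation method: {method}")


master = {"id": "int", "name": "string"}
component = {"id": "int", "email": "string"}
```

  • A real format mapping would likely need to pair differently named fields and reconcile differing types, which this sketch does not attempt.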
  • The dataset mapping approach described above can be implemented using software for execution on a computer.
  • The software forms procedures in one or more computer programs that execute on one or more programmed or programmable computer systems (which may be of various architectures such as distributed, client/server, or grid) each including at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port.
  • The software may form one or more modules of a larger program, for example, that provides other services related to the design and configuration of dataflow graphs.
  • The nodes and elements of the graph can be implemented as data structures stored in a computer readable medium or other organized data conforming to a data model stored in a data repository.
  • The software may be provided on a storage medium, such as a CD-ROM, readable by a general or special purpose programmable computer or delivered (encoded in a propagated signal) over a communication medium of a network to the computer where it is executed. All of the functions may be performed on a special purpose computer, or using special-purpose hardware, such as coprocessors.
  • The software may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computers.
  • Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein.
  • The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.

Abstract

Mapping data stored in a data storage system for use by a computer system includes processing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data. At least one of the dataflow graphs receives a flow of data from at least one input dataset and at least one of the dataflow graphs provides a flow of data to at least one output dataset. A mapper identifies one or more sets of datasets. Each dataset in a given set matches one or more criteria for identifying different versions of a single dataset. A user interface is provided to receive a mapping between at least two datasets in a given set. The mapping received over the user interface is stored in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims priority to U.S. Application Ser. No. 61/119,164, filed on Dec. 2, 2008, incorporated herein by reference.
  • BACKGROUND
  • This description relates to mapping instances of a dataset within a data management system.
  • A modern data management system may include a multitude of elements representing different aspects of the system. Systems of lesser complexity often allow data to be viewed directly without additional processing for the purpose of accurate visualization. Systems of greater complexity may require additional mechanisms for the data to be meaningfully viewed. A complex data management system made up of many elements may store data in many different forms and process data in many different ways. These forms of storage and processing may relate to each other in ways that are not apparent without a way to analyze the relationships.
  • SUMMARY
  • In a general aspect, a method for mapping data stored in a data storage system for use by a computer system includes processing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset; identifying one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset; providing a user interface to receive a mapping between at least two datasets in a given set; and storing the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
  • In another general aspect, a system for mapping data stored in a data storage system includes a data storage system storing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset; a mapper that identifies one or more sets of datasets associated with the dataflow graphs, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset; a user interface that receives a mapping between at least two datasets in a given set, and stores the mapping in the data storage system in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
  • In another general aspect, a system for mapping data stored in a data storage system includes: means for processing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset; means for identifying one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset; means for providing a user interface to receive a mapping between at least two datasets in a given set; and means for storing the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
  • In another general aspect, a computer program for mapping data stored in a data storage system is stored on a computer-readable medium, and includes instructions for causing a computer to process specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset; identify one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset; provide a user interface to receive a mapping between at least two datasets in a given set; and store the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
  • Aspects can include one or more of the following features.
  • The set is presented over the user interface.
  • A list of possible mappings ordered according to a quantification of a match to the one or more criteria is presented over the user interface.
  • The list of possible mappings includes candidates that are more likely to be an instance of a given dataset ordered higher in the list.
  • One of the criteria is built into a mapper that identifies the one or more sets of datasets.
  • One of the criteria is received from the user interface.
  • At least one of the possible mappings indicates a component of a dataflow graph that represents a dataset, and at least one of the possible mappings indicates a component of a dataflow graph that does not represent a dataset.
  • A sub-graph of a dataflow graph including multiple components represents a dataset.
  • The sub-graph includes a data component.
  • The sub-graph includes an executable component.
  • Identifying one or more sets of datasets includes using heuristics for determining if a dataset in a given set has one or more characteristics in common with another dataset.
  • The characteristics include the quantity of bytes and records in a representation of the dataset.
  • The characteristics include the name of a representation of the dataset.
  • The characteristics include the date of creation of a representation of the dataset.
  • The characteristics include the data format of a representation of the dataset.
  • At least one of the datasets of the mapping belongs to a group of datasets known to a data management system.
  • A format mapping is provided between datasets in a given set.
  • The mapping includes an identifier that points to a record in the data management system that keeps track of the dataset.
  • The mapping is updated based on a change in a dataset.
  • Aspects of the invention can include one or more of the following advantages.
  • By identifying sets of datasets according to version identification criteria, a match between two instances of a dataset can be made more efficiently than by purely manual operation. Further, by providing a user interface to receive a mapping between at least two datasets, the mapping will be more accurate than if the system were purely automatic.
  • Other features and advantages of the invention will become apparent from the following description, and from the claims.
  • DESCRIPTION OF DRAWINGS
  • FIG. 1 is a dataflow graph.
  • FIG. 2 is an overview of a dataset mapper and associated components.
  • FIGS. 3A-3E are diagrams of different scenarios handled by a dataset mapper.
  • FIG. 4 is a flowchart of dataset mapper operation.
  • FIG. 5 is a dataset linkage mapping.
  • FIG. 6 is a dataset format mapping.
  • DESCRIPTION
  • 1 Overview
  • A data processing element may be in the form of a graph. A graph-based computation is implemented using a “dataflow graph” that is represented by a directed graph, with vertices in the graph representing components (either data storage components corresponding to stored data or computation components corresponding to executable processes), and the directed links or “edges” in the graph representing flows of data between components. A dataflow graph (also called simply a “graph”) is a modular entity. Each graph can be made up of one or more other graphs, and a particular graph can be a component in a larger graph. A graphic development environment (GDE) provides a user interface for specifying executable graphs and defining parameters for the graph components.
  • Referring to FIG. 1, an example of a dataflow graph 101 includes an input component 102 providing a collection of data to be processed by the executable components 104 a-104 j of the dataflow graph 101. For example, the dataset 102 can include data records associated with a database system or transactions associated with a transaction processing system. Each executable component is associated with a portion of the computation defined by the overall dataflow graph 101. Work elements (e.g., individual data records from the data collection) enter one or more input ports of a component, and output work elements (which are in some cases the input work elements, or processed versions of the input work elements) typically leave one or more output ports of the component. In graph 101, output work elements from components 104 e, 104 g, and 104 j are stored in output data components 102 a-102 c.
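  • A dataflow graph of this kind can be modeled as a directed graph whose vertices are tagged as data or executable components. The sketch below is a simplified illustration and does not reproduce the graph of FIG. 1; the class name, method names, and component names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataflowGraph:
    components: dict = field(default_factory=dict)  # name -> "data" or "executable"
    flows: list = field(default_factory=list)       # directed (source, target) links

    def add_component(self, name, kind):
        if kind not in ("data", "executable"):
            raise ValueError(f"unknown component kind: {kind}")
        self.components[name] = kind

    def add_flow(self, source, target):
        # Both endpoints must already exist as components.
        for name in (source, target):
            if name not in self.components:
                raise KeyError(f"unknown component: {name}")
        self.flows.append((source, target))

    def dataset_components(self):
        """Names of components that correspond to stored data."""
        return sorted(n for n, k in self.components.items() if k == "data")

    def downstream(self, name):
        """Components that directly receive a flow of data from `name`."""
        return [t for s, t in self.flows if s == name]


graph = DataflowGraph()
graph.add_component("input", "data")
graph.add_component("filter", "executable")
graph.add_component("sort", "executable")
graph.add_component("output", "data")
graph.add_flow("input", "filter")
graph.add_flow("filter", "sort")
graph.add_flow("sort", "output")
```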
  • A dataset is an object (e.g., stored in an object oriented database) that represents a particular collection of data. In the context of a system of dataflow graphs, a component is capable of representing a dataset. In these cases, a graph may interact with the component representing a dataset (or simply “dataset component”) in one or more ways. A dataset component includes instructions for accessing the physical data represented by a given dataset, so a graph can accept input from a dataset using a dataset component, provide output to a dataset using a dataset component, and process data of a dataset using a dataset component at an intermediate step. A dataset component can include various kinds of information associated with a given dataset object including an instance of the dataset object. Such a system could have many dozens, hundreds, or thousands of graphs and associated dataset components. As the complexity of such a system increases, the relationships between different graphs and dataset components become more difficult to manage. More than one dataset component in the system can represent the same data source and each such dataset component can be associated with a different graph, graph subset, or executable component.
  • For example, in one possible scenario, a single dataset may be stored in more than one location associated with the data management system. In this scenario, two or more data sources contain similar or identical versions of the same data. Two graphs in the system might handle this single dataset, but each graph reads from and writes to a different data file, a different database table, or another type of dataset component.
  • In a similar scenario, the data (e.g., data files) represented by a given dataset may be not only stored in more than one location, but also interpreted using different data storage formats. As with the above example, two graphs may operate on two separate data files containing the same data, differing only in format. Each data file may have a different arrangement of data types, despite containing instances of the same data.
  • In an alternative scenario, one graph may operate on a data file containing an instance of the dataset, and another graph may operate on a database table also containing an instance of the dataset. In such a case, a data file and a database table will generally have two different data formats.
  • In another scenario, the data management system may access different versions of the same dataset each in different ways. One graph may access an instance of the dataset directly, such as by reading in a data file through a standard file input/output mechanism. Another graph may retrieve a file by querying an external source, such as a data repository available via a network. A graph may also access a database table retrieved through a similar external query, such as a query to a networked database.
  • The data management system may also make reference to different instances of the same dataset each in different ways. For example, a graph may be capable of accessing different data locations according to a parameter. Such a parameter could point to any number of data locations over time. A graph that operates multiple times may access different locations on different occasions if the parameter varies between executions of the graph.
  • In some scenarios, the representation of a dataset within a graph may not be a single component, but rather a collection of components, such as a “sub-graph” component within a graph that is itself implemented as a graph with multiple components. The collection may include one or more dataset components, and could also include one or more executable components.
  • All of these scenarios can potentially pose a problem for visualizing and analyzing the data handled by the data management system. If a user requires a consolidated view of the components that interact with a given dataset, various approaches can be used to reconcile the different instances of the dataset that may exist.
  • One approach is an automatic mechanism that identifies multiple instances of the same dataset and creates linkages between them. However, some automatic mechanisms have drawbacks, such as the following three. First, the mechanism may require that each instance of a dataset be stored in a particular manner, such as under a unified naming scheme and directory structure. This provides the mechanism with a way to identify and locate each instance in the storage system associated with the data management system. However, this arrangement limits the flexibility of the data management system and may be too restrictive for some uses of the system.
  • Second, under several scenarios of operation, the mechanism may not properly identify instances of the same dataset and form the correct linkages. For example, this is likely if a dataset is accessed using an externally-referenced entity, and the automatic mechanism does not have access to that entity. Similarly, this is likely if a component accesses a dataset according to an independent parameter in a parameter list, and the mechanism does not have a way to access or interpret the parameter list. Further, this is likely if a dataset is represented by a complex entity made up of one or more dataset components and executable components, such as a sub-graph. An automatic mechanism may be unable to discern what particular combination of components represents a particular dataset.
  • Third, the mechanism may form redundant or unnecessary linkages between dataset instances. For example, some of the datasets handled by the data management system may represent extraneous data, such as the contents of error logs. Any linkages between instances of these datasets are unnecessary. Further, some of the instances of a dataset handled by the data management system may be redundant instances, such as cached data or other temporary copies of data. A linkage that connects to this type of data quickly becomes obsolete and would be confusing to a user examining the data management system.
  • An alternative approach is a system in which a user manually consolidates instances of the same dataset via a user interface. A user is less likely to miss essential linkages between instances of a dataset, and is also less likely to create redundant or unnecessary linkages between instances of a dataset. However, if the data management system has hundreds or thousands of components, the amount of time needed for the user to manually create the necessary linkages is prohibitively large.
  • In a partially-automated approach, a dataset mapper is used to provide some automatic analysis, and to enable some interaction with a user in a way that is not prohibitive for a user of a large and/or complex system.
  • FIG. 2 is a block diagram of one embodiment of an exemplary dataset mapper 100 showing the interrelationship between associated principal elements. A dataset mapper 100 is capable of analyzing a set of one or more graphs 180, 180 a, 180 b, 180 c. Each graph is associated with one or more dataset components 182, 182 a, 182 b, where each dataset component could correspond to a data file, a database table, a sub-graph, or another kind of component representing a dataset. The mapper 100 analyzes the graphs for the purpose of forming linkages between dataset components that contain instances of the same dataset 186. The mapper 100 processes each dataset component according to a combination of built-in rules 110, user-defined rules 120, and heuristics 130, to determine if a dataset component 182 may contain an instance of one of several datasets representing data sources 176, 176 a, 176 b known to a data management system 170. The mapper 100 passes this information to a user interface 160, which allows a user 162 to select the proper dataset, if any, that corresponds to the dataset component 182. For example, the user interface 160 presents a list of possible candidate mappings based on a match to one or more criteria for identifying different versions or instances of a single dataset. Examples of such criteria, including criteria based on built-in rules, user-defined rules, and heuristics, are described in more detail below. The list can be ordered according to quantification of the match to the one or more criteria (e.g., candidates that are more likely to be an instance of a given dataset are ordered higher in the list). The mapper 100 then generates a dataset linkage mapping 140 that indicates that the dataset component 182 contains an instance of the dataset representing a data source 176.
  • Further, the dataset component 182 can have a data format 184 that differs from the format 174 of a corresponding linked data source 176. Depending on the requirements of the data management system 170, the user may choose to establish a single data format for all instances of the dataset. The system stores a format 174, 174 a, 174 b for each data source 176, 176 a, 176 b. Alternatively, the user can choose to create an optional mapping 142 between the format 184 of the dataset component 182 and the established format 174 of the corresponding data source 176. The optional data format mapping 142 allows the system 170 to retain information about the data types for each instance of the dataset.
  • The mapper 100 also enables a user to indicate a linkage between an executable component and a single dataset component, which may have no other linkages to it. For example, a dataset component may correspond to a source dataset with only one reader or a target dataset with only one writer. If the dataset object already exists in the system and has other relevant metadata, such as the correct record format, documentation, data profiles, etc., the linkage enables the dataset component to be mapped to the correct dataset.
  • 2 Mapping Process
  • The mapper 100 is capable of handling common scenarios that arise in a complex data management system. In a first scenario, shown in FIG. 3A, one graph 210 provides a dataset component 212 as output, and another graph 220 accepts a different dataset component 222 as input. Each dataset component contains an instance of the same dataset 216. This dataset may be the same as a dataset representing a data source 176 known to the data management system. Further, the first dataset component 212 has a data format 214 that may be the same as the format belonging to the second dataset component 222, or, alternatively, the second component may have a different format 224. The mapper 100 is capable of identifying the second dataset component 222 as being an instance of the dataset 216 represented by the first dataset component 212 and creating an appropriate linkage mapping 140.
  • In a second scenario, shown in FIG. 3B, a graph 230 is associated with an external dataset component 232 using an external reference 238 to an external source 239. The external dataset component 232 has a data format 234 and is an instance of a dataset 236. As in the first scenario, the dataset 236 represented by the external dataset component may be a dataset representing a data source 176 known to the data management system 170. The mapper 100 is capable of identifying this external dataset component 232 as being an instance of another dataset and creating an appropriate linkage mapping 140.
  • In a third scenario, shown in FIG. 3C, a graph 240 is associated with a dataset component 242 using a parameter 248 in a parameter list 247. The referenced dataset component 242 has a data format 244 and is an instance of a dataset 246. As in the first and second scenarios, the dataset 246 represented by the referenced dataset component may be a dataset representing a data source 176 known to the data management system 170. The mapper 100 is capable of identifying this referenced dataset component 242 as being an instance of another dataset and creating an appropriate linkage mapping 140.
  • In a fourth scenario, shown in FIG. 3D, a graph 250 is associated with an external component 251 using an external reference 258 to an external source 259. The external component 251 is not a dataset component, but rather another kind of component, such as an executable component. The mapper 100 is capable of identifying this external component 251 as inapplicable to the dataset linkage mapping process.
  • In a fifth scenario, shown in FIG. 3E, a graph 260 is associated with a sub-graph component 263, itself made up of several components. These components include at least one dataset component 262, and, in this example, one or more executable components 261 a, 261 b, 261 c. Under this scenario, the sub-graph 263 as a single entity represents at least one dataset. Other exemplary sub-graphs may include multiple dataset components, and any number of executable components, including zero. Further, this sub-graph 263 has multiple outputs 265 a, 265 b. Each output is capable of providing a different instance of a dataset to the component that receives the output. Another exemplary sub-graph could also have any number of inputs. A further exemplary sub-graph may have no inputs or outputs that correspond to a respective dataset. For cases where the sub-graph does represent at least one dataset, the mapper 100 is capable of identifying the sub-graph 263 as being an instance of at least one dataset and creating at least one appropriate linkage mapping 140.
  • An example of a sequence of operation of the mapper is shown in FIG. 4. In step 302, the mapper first identifies, of the elements associated with a graph, which elements represent datasets. Generally, a graph will have one or more inputs and outputs, and each input and each output could be an instance of a dataset. Each graph may also handle an instance of a dataset at some intermediate step. As a result, each graph can be connected to multiple components that are capable of being dataset candidates. In some cases, the data management system has information about the characteristics of some of the components, including information about whether or not the component represents a dataset. In those cases, the mapper adds the potential dataset components to a table of dataset candidates in step 304. In some cases, a component could be a sub-graph made up of multiple components, including dataset components and executable components. A sub-graph could represent at least one instance of a dataset. Accordingly, the mapper compiles a list of all such sub-graphs and adds them to the table of dataset candidates as part of step 304. In other cases, the nature of the component may not be available to the data management system. The component could be accessed through a reference to an external entity, where the reference may be a query to a database table, a Uniform Resource Locator pointing to an Internet server, a parameter in a parameter list, or another type of reference. In these cases, the mapper generally has no means by which it can independently access the entity pointed to by the reference. Accordingly, the mapper compiles a list of all such references and adds them to the table of dataset candidates as part of step 304.
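  • Steps 302 and 304 can be sketched as a filter that collects three kinds of components into the table of dataset candidates: components known to represent datasets, sub-graphs, and references to external entities that the mapper cannot resolve on its own. The kind labels, the function name, and the sample component names are assumptions for illustration.

```python
# Assumed labels for the kinds of component that become dataset candidates.
CANDIDATE_KINDS = {"dataset", "subgraph", "external_reference"}


def build_candidate_table(components):
    """Collect dataset candidates from (name, kind) pairs, in input order."""
    table = []
    for name, kind in components:
        if kind in CANDIDATE_KINDS:
            table.append({"component": name, "kind": kind})
    return table


components = [
    ("customers_file", "dataset"),
    ("dedupe", "executable"),
    ("load_orders", "subgraph"),
    ("${REMOTE_SOURCE}", "external_reference"),
]
table = build_candidate_table(components)
```

  • Keeping the kind alongside each candidate lets a later step treat unresolved references differently, for example by always asking the user to confirm them.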
  • Next, in step 306, for a given dataset candidate, the mapper generates a list of known datasets that the dataset candidate could map to. The mapper uses a combination of user-defined rules, built-in rules, and heuristics to evaluate which known datasets could map to a dataset candidate.
  • Next, in step 308, the user selects the known dataset that corresponds to the dataset candidate. If none of the suggested known datasets is the correct match, the user may also access a full list of all known datasets. In addition, the user can indicate that the dataset candidate is not a dataset. For example, a reference to a remote server could be a call to a remote executable procedure, which is not a data entity. As another example, the dataset candidate may represent data, but it may be data of a kind not pertinent to the data management system, such as an error log. In this case, the user may indicate to the user interface that this data is to be ignored in the mapping process.
  • Next, in step 310, the user identifies the data format of the newly-mapped dataset. The system may have a set of data format templates, one of which can be selected. Alternatively, the user can create a new data format in the user interface.
  • Next, in step 312 the mapper uses this information to generate a linkage mapping for the dataset candidate, and, optionally, a format mapping.
  • Next, the mapper offers the next dataset candidate to the user for linkage generation in another iteration of steps 308, 310, and 312, unless the mapper has processed all dataset candidates.
  • Next, in step 314, the user views the components associated with the data management system, to ensure that a visualization of the associations between graphs and dataset components is accurate based on the new linkages between components. In step 316, the user has the option of making any adjustments to the linkage and format mapping.
  • Finally, in step 318, the mapper delivers the linkage and format mapping to the data management system. The mappings can be stored alongside one or more graphs, or in a separate storage entity associated with the data management system, or by another means.
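  • The candidate-by-candidate loop of steps 306 through 312 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Candidate`, `run_mapper`, `suggest`, and `choose` are hypothetical, the user's selection is modeled as a callback, and the format selection of step 310 and the review of steps 314 and 316 are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """A dataset candidate collected in step 304 (hypothetical structure)."""
    name: str   # component, sub-graph, or external reference
    kind: str   # "component", "sub-graph", or "reference"


def run_mapper(
    candidates: List[Candidate],
    suggest: Callable[[Candidate], List[str]],
    choose: Callable[[Candidate, List[str]], Optional[str]],
) -> List[dict]:
    """Iterate over all candidates and build one linkage record per candidate.

    suggest: returns the known datasets a candidate could map to (step 306).
    choose:  stands in for the user; returns the selected dataset name,
             or None when the candidate is to be ignored (step 308).
    """
    linkages = []
    for cand in candidates:
        suggestions = suggest(cand)
        chosen = choose(cand, suggestions)
        # Step 312: record the linkage; ignored candidates are flagged.
        linkages.append(
            {"component": cand.name, "dataset": chosen, "ignore": chosen is None}
        )
    return linkages


if __name__ == "__main__":
    known = ["customers", "orders"]
    cands = [Candidate("customers.dat", "component"), Candidate("error.log", "reference")]
    result = run_mapper(
        cands,
        suggest=lambda c: [k for k in known if k in c.name],
        choose=lambda c, s: s[0] if s else None,
    )
    for row in result:
        print(row)
```

  • Modeling the user as a callback keeps the loop testable without a user interface; a real system would present the suggestions interactively.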
  • 3 Dataset Mapping Maintenance
  • The mapper 100 is capable of handling multiple scenarios that may arise that affect the integrity of the dataset linkages.
  • The first scenario includes identifying new dataset candidates when new components are added to the data management system 170. Under this scenario, the mapper 100 analyzes each component and presents possible linkages to the user. The mapper 100 is capable of operating on any new components to generate the appropriate linkages as needed.
  • The second scenario includes maintaining the existing linkages as the data management system 170 changes over time. For example, new instances of a dataset may have come into existence over the course of the normal operation of the graphs associated with the system. As another example, a dataset may have changed its identity, such as its name or location in the system. As a further example, a dataset may have been deleted entirely. As another further example, a dataset candidate may have been overlooked in a previous round of linkage creation, and so the collection of linkages is incomplete. The user interface 160 of the mapping system allows a user 162 to modify the existing linkages to remedy any mappings that are incomplete or outdated.
  • The third scenario includes automatically updating linkages for dataset references that invariably follow a known pattern. For example, a graph may handle a dataset that is referenced in a parameter list 247. Such a parameter list may change over time. If the parameter list follows a standard format known to the data management system, the mapper can identify changes in the parameter list and update the existing linkages accordingly.
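  • The third scenario can be illustrated with a short sketch. It assumes, purely for illustration, that each linkage records the parameter through which a component reaches its dataset together with the last resolved location, and that parameter lists are simple name-to-value dictionaries; `update_linkages` is a hypothetical helper, not part of the described system.

```python
from typing import Dict, Tuple

# component name -> (parameter name, resolved dataset location)
Linkages = Dict[str, Tuple[str, str]]


def update_linkages(
    linkages: Linkages,
    old_params: Dict[str, str],
    new_params: Dict[str, str],
) -> Linkages:
    """Return a copy of the linkages with locations refreshed for changed parameters."""
    updated = dict(linkages)
    for component, (param, _location) in linkages.items():
        new_value = new_params.get(param)
        # Only touch linkages whose parameter still exists and has changed.
        if new_value is not None and new_value != old_params.get(param):
            updated[component] = (param, new_value)
    return updated


if __name__ == "__main__":
    links = {"read_input": ("INPUT_FILE", "/data/2009/in.dat")}
    old = {"INPUT_FILE": "/data/2009/in.dat"}
    new = {"INPUT_FILE": "/data/2010/in.dat"}
    print(update_linkages(links, old, new))
```

  • A parameter that disappears from the list is left untouched here, on the assumption that such a case would be handed to the user under the second scenario.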
  • 4 Dataset Linkage Mapping
  • As shown in FIG. 5, a dataset linkage mapping 140 contains a component name 402, a dataset name 404, a dataset type 406, a format 408, a master dataset location 410, and a flag 412. The component name 402 identifies the dataset component or sub-graph that represents this instance of the dataset. The dataset name 404 is an identifier that points to the dataset represented by this component. The dataset type 406 indicates the category that this instance of the dataset falls under, for example, a data file, a database table, or another type. The format 408 is the format or arrangement that this instance of the dataset uses to represent its data. The master dataset location 410 is an identifier that points to the record in the data management system that keeps track of this dataset. Finally, the flag 412 indicates whether or not this instance of the dataset should be ignored, for example, because the user has identified this instance as not applicable to the data management system, in which case it is excluded from the set of linkages.
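  • One possible in-memory representation of the linkage mapping of FIG. 5 is sketched below. The field names follow the description; the class name `DatasetLinkageMapping`, the string types, and the helper `active_linkages` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DatasetLinkageMapping:
    component_name: str            # 402: dataset component or sub-graph
    dataset_name: str              # 404: identifier of the represented dataset
    dataset_type: str              # 406: e.g. "data file" or "database table"
    format: str                    # 408: format used by this instance
    master_dataset_location: str   # 410: record that keeps track of the dataset
    ignore: bool = False           # 412: exclude this instance from the linkages


def active_linkages(mappings: List[DatasetLinkageMapping]) -> List[DatasetLinkageMapping]:
    """Return only the mappings whose ignore flag is not set."""
    return [m for m in mappings if not m.ignore]


if __name__ == "__main__":
    mappings = [
        DatasetLinkageMapping("input_1", "customers", "data file", "fmt_a", "repo/customers"),
        DatasetLinkageMapping("errors", "error_log", "data file", "text", "repo/none", ignore=True),
    ]
    print([m.component_name for m in active_linkages(mappings)])
```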
  • 5 Built-In Rules
  • The mapper 100 has a set of built-in rules 110 that operate according to standard conventions of the data management system. The mapper can identify datasets corresponding to a dataset component with the highest degree of accuracy if the dataset component follows the built-in rules 110. In one exemplary implementation of a rule, externally-referenced database tables containing dataset candidates must be placed in persistent storage under a standardized directory structure used by the data management system. Further, a graph that accesses an externally-referenced dataset component according to a parameter must use a parameter that the data management system is also capable of accessing and resolving. Further, the format of a dataset component must be available in persistent storage and accessible by the data management system. Other built-in rules are also possible, depending on the data management system.
  • 6 User-Defined Rules
  • In addition to the built-in rules that the mapper uses to identify dataset candidates, the mapper 100 also has a collection of optional user-defined rules 120. These rules 120 may be enabled or disabled by a user, depending on which are applicable to the user's particular data management system. In one exemplary implementation, the mapper has six user-defined optional rules. The mapper can ignore some of the information in the name of a database table, if some of the information in the name obscures the identity of the table, such as information about the user who defined the table. Further, the mapper can eliminate this information from the name of a database table. Further, the mapper can ignore a particular category of data files that are known to contain data that is not pertinent to the datasets associated with the data management system. Such a category could be a data file type or data file extension. Further, the mapper can resolve references to a particular parameter in a parameter list and replace the reference with the name of the parameter itself. Further, the mapper can eliminate references to a parameter entirely. The user can also create other rules for the mapper to follow.
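  • Several of these optional rules amount to simple name normalizations, sketched below. The conventions used are assumptions chosen for illustration: an `owner.table` prefix for database table names, a `${NAME}` syntax for parameter references, and the extensions in `IGNORED_EXTENSIONS`. None of these conventions or function names comes from the described system.

```python
import re
from pathlib import PurePosixPath

# Hypothetical set of file extensions the user has chosen to ignore.
IGNORED_EXTENSIONS = {".log", ".tmp"}


def strip_owner(table_name: str) -> str:
    """Drop an assumed 'owner.' prefix: 'jsmith.customers' becomes 'customers'."""
    return table_name.split(".", 1)[-1]


def is_ignored_file(path: str) -> bool:
    """True when the file extension belongs to an ignored category."""
    return PurePosixPath(path).suffix.lower() in IGNORED_EXTENSIONS


def replace_param_refs(path: str) -> str:
    """Replace each ${NAME} reference with the bare parameter name."""
    return re.sub(r"\$\{(\w+)\}", r"\1", path)


def drop_param_refs(path: str) -> str:
    """Remove each ${NAME} reference, and a directly following slash, entirely."""
    return re.sub(r"\$\{\w+\}/?", "", path)


if __name__ == "__main__":
    print(strip_owner("jsmith.customers"))
    print(is_ignored_file("/var/run/error.log"))
    print(replace_param_refs("${DATA_DIR}/customers.dat"))
    print(drop_param_refs("${DATA_DIR}/customers.dat"))
```

  • Keeping each rule as an independent function mirrors the description's point that rules can be enabled or disabled individually.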
  • 7 Heuristics
  • In addition to following the built-in and user-defined rules to evaluate dataset candidates, the mapper 100 also uses a set of heuristics 130. The heuristics 130 allow the mapper to analyze the characteristics of a given dataset component and compare those characteristics to known datasets. A dataset component with similar characteristics to a known dataset is likely to be an instance of that dataset. In one exemplary implementation, the mapper uses two heuristics. One heuristic is the characteristics of the data of a given dataset component. For example, if the data associated with a dataset component has the same quantity of bytes and records as does the data associated with a known dataset, then that dataset component is likely to be an instance of that dataset. Further, if the dataset component has a name or date of creation similar to that of a known dataset, then the dataset component is likely to be an instance of that dataset. A second heuristic is the data format of a dataset component. If a dataset component shares a data format with a known dataset, then the dataset component is likely to be an instance of the dataset. This heuristic is less reliable in situations where multiple distinct datasets use the same data format.
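  • The two heuristics can be combined into a ranking of known datasets, as in the following sketch. The scoring weights, the `Profile` structure, and the use of `difflib.SequenceMatcher` for name similarity are assumptions; the description does not specify how characteristics are compared or weighted. Creation dates are left out for brevity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Profile:
    """Characteristics of a dataset component or known dataset (hypothetical)."""
    name: str
    byte_count: int
    record_count: int
    data_format: str


def score(candidate: Profile, known: Profile) -> float:
    """Higher scores mean the candidate is more likely an instance of `known`."""
    s = 0.0
    # First heuristic: characteristics of the data (size and name).
    if (candidate.byte_count, candidate.record_count) == (known.byte_count, known.record_count):
        s += 1.0
    s += SequenceMatcher(None, candidate.name, known.name).ratio()
    # Second heuristic: shared data format, weighted lower because
    # several distinct datasets may use the same format.
    if candidate.data_format == known.data_format:
        s += 0.5
    return s


def rank(candidate: Profile, known_datasets: List[Profile]) -> List[Profile]:
    """Order known datasets from most to least likely match."""
    return sorted(known_datasets, key=lambda k: score(candidate, k), reverse=True)


if __name__ == "__main__":
    cand = Profile("customers_copy", 1000, 10, "fmt_a")
    known = [Profile("orders", 500, 5, "fmt_a"), Profile("customers", 1000, 10, "fmt_a")]
    print([k.name for k in rank(cand, known)])
```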
  • 8 Dataset Formats and Mapping
  • Each dataset representing a data source has an associated data format that indicates, for each element in the dataset, what type of data the element represents. For example, the data format of a database table indicates the data types of each field within a given record. The data management system 170 retains a single data format 174, 174 a, 174 b for each dataset representing a data source 176, 176 a, 176 b.
  • If the mapper 100 has encountered a dataset component 182 that represents a new dataset 186, then the mapper 100 creates a corresponding data format to be stored by the data management system, based on the data format 184 of the dataset component 182.
  • In some cases where a dataset component 182 represents a known dataset representing a data source 176, the dataset component 182 has a different data format 184 than the data format 174 of the known dataset representing a data source 176. The data management system 170 handles the dataset representing a data source 176 as a single entity, independent of the number of instances of that dataset that may exist. Consequently, the data management system 170 relies on the mapper 100 to consolidate the different formats 174, 184 when these situations arise. In one implementation, the mapper is capable of addressing each situation in one of four different ways depending on the requirements of the user and the data management system. The user 162 can choose any one of the four methods of consolidation for each situation.
  • Under the first method of consolidation, the mapper 100 uses the data format 184 of the dataset component 182 as the master data format of the dataset and updates the data management system 170 accordingly.
  • Under the second method of consolidation, the mapper 100 uses the data format 174 of the existing dataset as the master data format of the dataset and updates the data management system 170 accordingly.
  • Under the third method of consolidation, the mapper 100 retains both data formats, and generates a mapping 142 between the fields of each data format. As shown in FIG. 6, the dataset format mapping 142 indicates which fields 512 a, 512 b, 512 c of the dataset format 510 correspond to which fields 522 a, 522 b, 522 c of the format of the dataset instance, e.g. the dataset component.
  • Under the fourth method of consolidation, the mapper generates a new union data format capable of acting as either data format.
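  • The four methods of consolidation can be sketched as follows, assuming for illustration that a data format is a dictionary from field name to type name and that fields correspond when their names match. The method labels and the functions `field_mapping`, `union_format`, and `consolidate` are hypothetical; the description does not state how field correspondence or type conflicts are resolved.

```python
from typing import Dict, Optional, Tuple

Format = Dict[str, str]  # field name -> type name


def field_mapping(master: Format, instance: Format) -> Dict[str, str]:
    """Third method: map master fields to instance fields (matched by name here)."""
    return {name: name for name in master if name in instance}


def union_format(master: Format, instance: Format) -> Format:
    """Fourth method: a format containing every field of both formats."""
    merged = dict(master)
    for name, type_name in instance.items():
        if name in merged and merged[name] != type_name:
            raise ValueError(f"conflicting types for field {name!r}")
        merged.setdefault(name, type_name)
    return merged


def consolidate(
    method: str, master: Format, instance: Format
) -> Tuple[Format, Optional[Dict[str, str]]]:
    """Return the resulting master format and, for 'map_fields', a field mapping."""
    if method == "use_instance":   # first method
        return dict(instance), None
    if method == "use_master":     # second method
        return dict(master), None
    if method == "map_fields":     # third method: both formats are retained
        return dict(master), field_mapping(master, instance)
    if method == "union":          # fourth method
        return union_format(master, instance), None
    raise ValueError(f"unknown method {method!r}")


if __name__ == "__main__":
    master = {"id": "int", "name": "string"}
    instance = {"id": "int", "email": "string"}
    for m in ("use_instance", "use_master", "map_fields", "union"):
        print(m, consolidate(m, master, instance))
```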
  • 9 General Computer Implementation
  • The dataset mapping approach described above can be implemented using software for execution on a computer. For instance, the software forms procedures in one or more computer programs that execute on one or more programmed or programmable computer systems (which may be of various architectures such as distributed, client/server, or grid) each including at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. The software may form one or more modules of a larger program, for example, that provides other services related to the design and configuration of dataflow graphs. The nodes and elements of the graph can be implemented as data structures stored in a computer readable medium or other organized data conforming to a data model stored in a data repository.
  • The software may be provided on a storage medium, such as a CD-ROM, readable by a general or special purpose programmable computer or delivered (encoded in a propagated signal) over a communication medium of a network to the computer where it is executed. All of the functions may be performed on a special purpose computer, or using special-purpose hardware, such as coprocessors. The software may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computers. Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.
  • A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, some of the steps described above may be order independent, and thus can be performed in an order different from that described.
  • It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. For example, a number of the function steps described above may be performed in a different order without substantially affecting overall processing. Other embodiments are within the scope of the following claims.

Claims (44)

1. A method for mapping data stored in a data storage system for use by a computer system, the method including:
processing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset;
identifying one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset, each version of the single dataset representing data received or provided by a different one of the dataflow graphs;
providing a user interface to receive a mapping between at least two datasets in a given set; and
storing the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
2. The method of claim 1, including presenting the set over the user interface.
3. The method of claim 1, including presenting over the user interface a list of possible mappings ordered according to a quantification of a match to the one or more criteria.
4. The method of claim 3, wherein the list of possible mappings includes candidates that are more likely to be an instance of a given dataset ordered higher in the list.
5. The method of claim 3, wherein one of the criteria is built into a mapper that identifies the one or more sets of datasets.
6. The method of claim 3, wherein one of the criteria is received from the user interface.
7. The method of claim 3, wherein at least one of the possible mappings indicates a component of a dataflow graph that represents a dataset, and at least one of the possible mappings indicates a component of a dataflow graph that does not represent a dataset.
8. The method of claim 1, wherein a sub-graph of a dataflow graph including multiple components represents a dataset.
9. The method of claim 8, wherein the sub-graph includes a data component.
10. The method of claim 8, wherein the sub-graph includes an executable component.
11. The method of claim 1, wherein identifying one or more sets of datasets includes using heuristics for determining if a dataset in a given set has one or more characteristics in common with another dataset.
12. The method of claim 11, wherein the characteristics include the quantity of bytes and records in a representation of the dataset.
13. The method of claim 11, wherein the characteristics include the name of a representation of the dataset.
14. The method of claim 11, wherein the characteristics include the date of creation of a representation of the dataset.
15. The method of claim 11, wherein the characteristics include the data format of a representation of the dataset.
16. The method of claim 1, wherein at least one of the datasets of the mapping belongs to a group of datasets known to a data management system.
17. The method of claim 1, further including providing a format mapping between datasets in a given set.
18. The method of claim 1, wherein the mapping includes an identifier that points to a record in the data management system that keeps track of the dataset.
19. The method of claim 1, further including updating the mapping based on a change in a dataset.
20. A system for mapping data stored in a data storage system, the system including:
a data storage system storing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset;
a mapper that identifies one or more sets of datasets associated with the dataflow graphs, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset, each version of the single dataset representing data received or provided by a different one of the dataflow graphs;
a user interface that receives a mapping between at least two datasets in a given set, and stores the mapping in the data storage system in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
21. The system of claim 20, wherein the user interface presents the set.
22. The system of claim 20, wherein the user interface presents a list of possible mappings ordered according to a quantification of a match to the one or more criteria.
23. The system of claim 22, wherein the list of possible mappings includes candidates that are more likely to be an instance of a given dataset ordered higher in the list.
24. The system of claim 22, wherein one of the criteria is built into the mapper.
25. The system of claim 22, wherein one of the criteria is received by the user interface.
26. The system of claim 22, wherein at least one of the possible mappings indicates a component of a dataflow graph that represents a dataset, and at least one of the possible mappings indicates a component of a dataflow graph that does not represent a dataset.
27. The system of claim 20, wherein a sub-graph of a dataflow graph including multiple components represents a dataset.
28. The system of claim 27, wherein the sub-graph includes a data component.
29. The system of claim 27, wherein the sub-graph includes an executable component.
30. The system of claim 20, wherein the mapper uses heuristics for determining if a dataset in a given set has one or more characteristics in common with another dataset.
31. The system of claim 30, wherein the characteristics include the quantity of bytes and records in a representation of the dataset.
32. The system of claim 30, wherein the characteristics include the name of a representation of the dataset.
33. The system of claim 30, wherein the characteristics include the date of creation of a representation of the dataset.
34. The system of claim 30, wherein the characteristics include the data format of a representation of the dataset.
35. The system of claim 20, wherein at least one of the datasets of the mapping belongs to a group of datasets known to a data management system.
36. The system of claim 20, wherein the mapper generates a format mapping between datasets in a given set.
37. The system of claim 20, wherein the mapping includes an identifier that points to a record in the data management system that keeps track of the dataset.
38. The system of claim 20, wherein the mapper updates the mapping based on a change in a dataset.
39. A system for mapping data stored in a data storage system, the system including:
means for processing specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset;
means for identifying one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset, each version of the single dataset representing data received or provided by a different one of the dataflow graphs;
means for providing a user interface to receive a mapping between at least two datasets in a given set; and
means for storing the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
40. A computer-readable medium storing a computer program for mapping data stored in a data storage system, the computer program including instructions for causing a computer to:
process specifications of dataflow graphs that include nodes representing computations interconnected by links representing flows of data, with at least one of the dataflow graphs receiving a flow of data from at least one input dataset and at least one of the dataflow graphs providing a flow of data to at least one output dataset;
identify one or more sets of datasets, where each dataset in a given set matches one or more criteria for identifying different versions of a single dataset, each version of the single dataset representing data received or provided by a different one of the dataflow graphs;
provide a user interface to receive a mapping between at least two datasets in a given set; and
store the mapping received over the user interface in association with a dataflow graph that provides data to or receives data from the datasets of the mapping.
41. The method of claim 1, wherein each version of a single dataset is associated with a different graph, graph subset, or executable component.
42. The method of claim 1, wherein each version of a single dataset is stored in a different location associated with the data storage system.
43. The method of claim 1, wherein each version of a single dataset is interpreted using a different data storage format.
44. The method of claim 1, wherein each version of a single dataset is accessed using a parameter that varies between executions of the dataflow graph.
US12/628,521 2008-12-02 2009-12-01 Mapping instances of a dataset within a data management system Abandoned US20100138388A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US12/628,521 US20100138388A1 (en) 2008-12-02 2009-12-01 Mapping instances of a dataset within a data management system
US16/902,949 US11341155B2 (en) 2008-12-02 2020-06-16 Mapping instances of a dataset within a data management system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11916408P 2008-12-02 2008-12-02
US12/628,521 US20100138388A1 (en) 2008-12-02 2009-12-01 Mapping instances of a dataset within a data management system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/902,949 Continuation US11341155B2 (en) 2008-12-02 2020-06-16 Mapping instances of a dataset within a data management system

Publications (1)

Publication Number Publication Date
US20100138388A1 (en) 2010-06-03

Family

ID=42223717

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/628,521 Abandoned US20100138388A1 (en) 2008-12-02 2009-12-01 Mapping instances of a dataset within a data management system
US16/902,949 Active US11341155B2 (en) 2008-12-02 2020-06-16 Mapping instances of a dataset within a data management system

Family Applications After (1)

Application Number Title Priority Date Filing Date
US16/902,949 Active US11341155B2 (en) 2008-12-02 2020-06-16 Mapping instances of a dataset within a data management system

Country Status (8)

Country Link
US (2) US20100138388A1 (en)
EP (1) EP2370892B1 (en)
JP (1) JP5525541B2 (en)
KR (2) KR20150042866A (en)
CN (1) CN102232212B (en)
AU (1) AU2009322602B2 (en)
CA (1) CA2744881C (en)
WO (1) WO2010065511A1 (en)

Cited By (63)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110066602A1 (en) * 2009-09-16 2011-03-17 Ab Initio Software Llc Mapping dataset elements
US20120054255A1 (en) * 2010-08-25 2012-03-01 Ab Initio Technology Llc Evaluating dataflow graph characteristics
WO2012061109A1 (en) * 2010-10-25 2012-05-10 Ab Initio Technology Llc Managing data set objects in a dataflow graph that represents a computer program
US8217945B1 (en) 2011-09-02 2012-07-10 Metric Insights, Inc. Social annotation of a single evolving visual representation of a changing dataset
US8538934B2 (en) * 2011-10-28 2013-09-17 Microsoft Corporation Contextual gravitation of datasets and data services
KR20140006862A (en) * 2011-01-07 2014-01-16 아브 이니티오 테크놀로지 엘엘시 Flow analysis instrumentation
WO2014209260A1 (en) * 2013-06-24 2014-12-31 Hewlett-Packard Development Company, L.P. Processing a data flow graph of a hybrid flow
US20150261694A1 (en) * 2014-03-14 2015-09-17 Ab Initio Technology Llc Mapping attributes of keyed entities
US20150310055A1 (en) * 2014-04-29 2015-10-29 Microsoft Corporation Using lineage to infer data quality issues
WO2016011442A1 (en) * 2014-07-18 2016-01-21 Ab Initio Technology Llc Managing lineage information
US9251225B2 (en) 2012-07-24 2016-02-02 Ab Initio Technology Llc Mapping entities in data models
CN105302843A (en) * 2014-08-01 2016-02-03 友劲科技股份有限公司 Management system and management method
US20160036621A1 (en) * 2014-08-01 2016-02-04 Cameo Communications, Inc. Management system and management method
US9418095B2 (en) 2011-01-14 2016-08-16 Ab Initio Technology Llc Managing changes to collections of data
US9444674B2 (en) 2012-10-02 2016-09-13 Microsoft Technology Licensing, Llc Heuristic analysis of responses to user requests
US20160292444A1 (en) * 2013-11-08 2016-10-06 Norman Shaw Data accessibility control
WO2016177405A1 (en) * 2015-05-05 2016-11-10 Huawei Technologies Co., Ltd. Systems and methods for transformation of a dataflow graph for execution on a processing system
US9626393B2 (en) 2014-09-10 2017-04-18 Ab Initio Technology Llc Conditional validation rules
US10089409B2 (en) 2014-04-29 2018-10-02 Microsoft Technology Licensing, Llc Event-triggered data quality verification
US20190012369A1 (en) * 2017-07-07 2019-01-10 Palantir Technologies Inc. Systems and methods for providing an object platform for a relational database
US10489360B2 (en) 2012-10-17 2019-11-26 Ab Initio Technology Llc Specifying and applying rules to data
US10540659B2 (en) 2002-03-05 2020-01-21 Visa U.S.A. Inc. System for personal authorization control for card transactions
US10592147B2 (en) 2017-07-26 2020-03-17 International Business Machines Corporation Dataset relevance estimation in storage systems
US10671303B2 (en) 2017-09-13 2020-06-02 International Business Machines Corporation Controlling a storage system
US11016931B2 (en) * 2016-06-19 2021-05-25 Data.World, Inc. Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets
EP3770774A4 (en) * 2018-03-23 2021-05-26 Huawei Technologies Co., Ltd. Control method for household appliance, and household appliance
US11023104B2 (en) 2016-06-19 2021-06-01 data.world,Inc. Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets
US11036716B2 (en) 2016-06-19 2021-06-15 Data World, Inc. Layered data generation and data remediation to facilitate formation of interrelated data in a system of networked collaborative datasets
US11036697B2 (en) * 2016-06-19 2021-06-15 Data.World, Inc. Transmuting data associations among data arrangements to facilitate data operations in a system of networked collaborative datasets
US11042560B2 (en) 2016-06-19 2021-06-22 data. world, Inc. Extended computerized query language syntax for analyzing multiple tabular data arrangements in data-driven collaborative projects
US11042556B2 (en) 2016-06-19 2021-06-22 Data.World, Inc. Localized link formation to perform implicitly federated queries using extended computerized query language syntax
US11042537B2 (en) * 2016-06-19 2021-06-22 Data.World, Inc. Link-formative auxiliary queries applied at data ingestion to facilitate data operations in a system of networked collaborative datasets
US11042548B2 (en) 2016-06-19 2021-06-22 Data World, Inc. Aggregation of ancillary data associated with source data in a system of networked collaborative datasets
US11068847B2 (en) 2016-06-19 2021-07-20 Data.World, Inc. Computerized tools to facilitate data project development via data access layering logic in a networked computing platform including collaborative datasets
US11068453B2 (en) * 2017-03-09 2021-07-20 data.world, Inc Determining a degree of similarity of a subset of tabular data arrangements to subsets of graph data arrangements at ingestion into a data-driven collaborative dataset platform
US11086896B2 (en) * 2016-06-19 2021-08-10 Data.World, Inc. Dynamic composite data dictionary to facilitate data operations via computerized tools configured to access collaborative datasets in a networked computing platform
US11093633B2 (en) 2016-06-19 2021-08-17 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11163755B2 (en) 2016-06-19 2021-11-02 Data.World, Inc. Query generation for collaborative datasets
US11210313B2 (en) 2016-06-19 2021-12-28 Data.World, Inc. Computerized tools to discover, form, and analyze dataset interrelations among a system of networked collaborative datasets
USD940169S1 (en) 2018-05-22 2022-01-04 Data.World, Inc. Display screen or portion thereof with a graphical user interface
USD940732S1 (en) 2018-05-22 2022-01-11 Data.World, Inc. Display screen or portion thereof with a graphical user interface
US11238109B2 (en) * 2017-03-09 2022-02-01 Data.World, Inc. Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform
US11243960B2 (en) 2018-03-20 2022-02-08 Data.World, Inc. Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures
US11246018B2 (en) 2016-06-19 2022-02-08 Data.World, Inc. Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets
US11327996B2 (en) 2016-06-19 2022-05-10 Data.World, Inc. Interactive interfaces to present data arrangement overviews and summarized dataset attributes for collaborative datasets
US11327991B2 (en) * 2018-05-22 2022-05-10 Data.World, Inc. Auxiliary query commands to deploy predictive data models for queries in a networked computing platform
US11334625B2 (en) 2016-06-19 2022-05-17 Data.World, Inc. Loading collaborative datasets into data stores for queries via distributed computer networks
US11341155B2 (en) 2008-12-02 2022-05-24 Ab Initio Technology Llc Mapping instances of a dataset within a data management system
US11366824B2 (en) 2016-06-19 2022-06-21 Data.World, Inc. Dataset analysis and dataset attribute inferencing to form collaborative datasets
US11373094B2 (en) 2016-06-19 2022-06-28 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11409802B2 (en) 2010-10-22 2022-08-09 Data.World, Inc. System for accessing a relational database using semantic queries
US11423039B2 (en) 2016-06-19 2022-08-23 Data.World, Inc. Collaborative dataset consolidation via distributed computer networks
US20220277004A1 (en) * 2016-06-19 2022-09-01 Data.World, Inc. Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets
US11442988B2 (en) 2018-06-07 2022-09-13 Data.World, Inc. Method and system for editing and maintaining a graph schema
US11573948B2 (en) 2018-03-20 2023-02-07 Data.World, Inc. Predictive determination of constraint data for application with linked data in graph-based datasets associated with a data-driven collaborative dataset platform
US20230169124A1 (en) * 2021-11-30 2023-06-01 Data.World, Inc. Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform
US11669540B2 (en) 2017-03-09 2023-06-06 Data.World, Inc. Matching subsets of tabular data arrangements to subsets of graphical data arrangements at ingestion into data-driven collaborative datasets
US11675808B2 (en) 2016-06-19 2023-06-13 Data.World, Inc. Dataset analysis and dataset attribute inferencing to form collaborative datasets
US11755602B2 (en) 2016-06-19 2023-09-12 Data.World, Inc. Correlating parallelized data from disparate data sources to aggregate graph data portions to predictively identify entity data
US11941140B2 (en) 2016-06-19 2024-03-26 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11947600B2 (en) 2021-11-30 2024-04-02 Data.World, Inc. Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures
US11947529B2 (en) 2018-05-22 2024-04-02 Data.World, Inc. Generating and analyzing a data model to identify relevant data catalog data derived from graph-based data arrangements to perform an action
US11947554B2 (en) 2016-06-19 2024-04-02 Data.World, Inc. Loading collaborative datasets into data stores for queries via distributed computer networks

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9262490B2 (en) * 2004-08-12 2016-02-16 Oracle International Corporation Adaptively routing transactions to servers
KR102148984B1 (en) * 2014-05-29 2020-08-27 삼성에스디에스 주식회사 System and method for processing data
JP6598973B2 (en) * 2015-03-23 2019-10-30 モルガン スタンレー サービシーズ グループ,インコーポレイテッド Tracking data flow in distributed computing systems
US11093703B2 (en) * 2016-09-29 2021-08-17 Google Llc Generating charts from data in a data table
KR20210046487A (en) * 2019-10-18 2021-04-28 삼성전자주식회사 Apparatus and method for analyzing data contained in the database
AU2022213419A1 (en) * 2021-01-31 2023-08-03 Ab Initio Technology Llc Data processing system with manipulation of logical dataset groups
CN115017251B (en) * 2022-08-05 2022-10-25 山东省计算中心(国家超级计算济南中心) Standard mapping map establishing method and system for smart city

Citations (74)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5758351A (en) * 1995-03-01 1998-05-26 Sterling Software, Inc. System and method for the creation and use of surrogate information system objects
US5966072A (en) * 1996-07-02 1999-10-12 Ab Initio Software Corporation Executing computations expressed as graphs
US20010014890A1 (en) * 1998-02-06 2001-08-16 Gwoho Liu Methods for mapping data fields from one data set to another in a data processing environment
US20020161799A1 (en) * 2001-02-27 2002-10-31 Microsoft Corporation Spreadsheet error checker
US6494159B2 (en) * 2001-05-11 2002-12-17 The United States Of America As Represented By The Secretary Of The Navy Submarine launched unmanned combat vehicle replenishment
US20030016246A1 (en) * 2001-07-18 2003-01-23 Sanjai Singh Graphical subclassing
US20030163597A1 (en) * 2001-05-25 2003-08-28 Hellman Ziv Zalman Method and system for collaborative ontology modeling
US20040015783A1 (en) * 2002-06-20 2004-01-22 Canon Kabushiki Kaisha Methods for interactively defining transforms and for generating queries by manipulating existing query data
US6708186B1 (en) * 2000-08-14 2004-03-16 Oracle International Corporation Aggregating and manipulating dictionary metadata in a database system
US20040056908A1 (en) * 2001-03-22 2004-03-25 Turbo Worx, Inc. Method and system for dataflow creation and execution
US20040239681A1 (en) * 2000-08-07 2004-12-02 Zframe, Inc. Visual content browsing using rasterized representations
US20050060317A1 (en) * 2003-09-12 2005-03-17 Lott Christopher Martin Method and system for the specification of interface definitions and business rules and automatic generation of message validation and transformation software
US20050060313A1 (en) * 2003-09-15 2005-03-17 Oracle International Corporation A California Corporation Data quality analyzer
US20050114369A1 (en) * 2003-09-15 2005-05-26 Joel Gould Data profiling
US20050178833A1 (en) * 2001-12-20 2005-08-18 Canon Information Systems Research Australia Pty Microprocessor card defining a custom user interface
US20050187984A1 (en) * 2004-02-20 2005-08-25 Tianlong Chen Data driven database management system and method
US6948154B1 (en) * 1999-03-22 2005-09-20 Oregon State University Methodology for testing spreadsheets
US20050234762A1 (en) * 2004-04-16 2005-10-20 Pinto Stephen K Dimension reduction in predictive model development
US20050262121A1 (en) * 1999-09-21 2005-11-24 International Business Machines Corporation Method, system, program, and data structure for cleaning a database table
US20060020570A1 (en) * 2004-07-23 2006-01-26 Yuh-Cherng Wu Conflict resolution engine
US20060095466A1 (en) * 2004-11-02 2006-05-04 Daniell Stevens Managing related data objects
US7080088B1 (en) * 2002-01-30 2006-07-18 Oracle International Corporation Automatic reconciliation of bindable objects
US20060200739A1 (en) * 2005-03-07 2006-09-07 Rishi Bhatia System and method for data manipulation
US7110924B2 (en) * 2002-05-15 2006-09-19 Caterpillar Inc. Method for controlling the performance of a target system
US20070011208A1 (en) * 2005-07-06 2007-01-11 Smith Alan R Apparatus, system, and method for performing semi-automatic dataset maintenance
US7164422B1 (en) * 2000-07-28 2007-01-16 Ab Initio Software Corporation Parameterized graphs with conditional components
US7167850B2 (en) * 2002-10-10 2007-01-23 Ab Initio Software Corporation Startup and control of graph-based computation
US20070027858A1 (en) * 2005-07-29 2007-02-01 Paul Weinberg Method for generating properly formed expressions
US20070050705A1 (en) * 2005-08-30 2007-03-01 Erxiang Liu Method of xml element level comparison and assertion utilizing an application-specific parser
US20070094060A1 (en) * 2005-10-25 2007-04-26 Angoss Software Corporation Strategy trees for data mining
US20070136692A1 (en) * 2005-12-09 2007-06-14 Eric Seymour Enhanced visual feedback of interactions with user interface
US20070198457A1 (en) * 2006-02-06 2007-08-23 Microsoft Corporation Accessing and manipulating data in a data flow graph
US20070226203A1 (en) * 2006-03-23 2007-09-27 Microsoft Corporation Generation of query and update views for object relational mapping
US20070239751A1 (en) * 2006-03-31 2007-10-11 Sap Ag Generic database manipulator
US20070271381A1 (en) * 2006-05-16 2007-11-22 Joseph Skeffington Wholey Managing computing resources in graph-based computations
US20070276787A1 (en) * 2006-05-15 2007-11-29 Piedmonte Christopher M Systems and Methods for Data Model Mapping
US20070294119A1 (en) * 2006-03-30 2007-12-20 Adaptive Alpha, Llc System, method and computer program product for evaluating and rating an asset management business and associate investment funds using experiential business process and performance data, and applications thereof
US20080049022A1 (en) * 2006-08-10 2008-02-28 Ab Initio Software Corporation Distributing Services in Graph-Based Computations
US20080162384A1 (en) * 2006-12-28 2008-07-03 Privacy Networks, Inc. Statistical Heuristic Classification
US20080228697A1 (en) * 2007-03-16 2008-09-18 Microsoft Corporation View maintenance rules for an update pipeline of an object-relational mapping (ORM) platform
US20080243772A1 (en) * 2007-03-29 2008-10-02 Ariel Fuxman Method and sytsem for generating nested mapping specifications in a schema mapping formalism and for generating transformation queries based thereon
US20080243891A1 (en) * 2007-03-30 2008-10-02 Fmr Corp. Mapping Data on a Network
US20080256014A1 (en) * 2007-04-10 2008-10-16 Joel Gould Editing and Compiling Business Rules
US20080313204A1 (en) * 2007-06-14 2008-12-18 Colorquick, L.L.C. Method and apparatus for database mapping
US20080312979A1 (en) * 2007-06-13 2008-12-18 International Business Machines Corporation Method and system for estimating financial benefits of packaged application service projects
US20090036749A1 (en) * 2007-08-03 2009-02-05 Paul Donald Freiburger Multi-volume rendering of single mode data in medical diagnostic imaging
US20090083313A1 (en) * 2007-09-20 2009-03-26 Stanfill Craig W Managing Data Flows in Graph-Based Computations
US20090089630A1 (en) * 2007-09-28 2009-04-02 Initiate Systems, Inc. Method and system for analysis of a system for matching data records
US20090094291A1 (en) * 2007-09-14 2009-04-09 Oracle International Corporation Support for compensation aware data types in relational database systems
US20090193046A1 (en) * 2008-01-24 2009-07-30 Oracle International Corporation Match rules to identify duplicate records in inbound data
US20090319494A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Field mapping for data stream output
US20090327196A1 (en) * 2008-06-30 2009-12-31 Ab Initio Software Llc Data Logging in Graph-Based Computations
US7661067B2 (en) * 2006-02-21 2010-02-09 International Business Machines Corporation Method for providing quick responses in instant messaging conversations
US20100083237A1 (en) * 2008-09-26 2010-04-01 Arm Limited Reducing trace overheads by modifying trace operations
US20100100220A1 (en) * 2005-06-09 2010-04-22 Belanger David G Arrangement for guiding user design of comprehensive product solution using on-the-fly data validation
US20100114833A1 (en) * 2008-10-31 2010-05-06 Netapp, Inc. Remote office duplication
US7716630B2 (en) * 2005-06-27 2010-05-11 Ab Initio Technology Llc Managing parameters for graph-based computations
US20100121890A1 (en) * 2008-11-12 2010-05-13 Ab Initio Software Llc Managing and automatically linking data objects
US20100145914A1 (en) * 2008-06-09 2010-06-10 Panasonic Corporation Database management server apparatus, database management system, database management method and database management program
US7765529B1 (en) * 2003-10-31 2010-07-27 The Mathworks, Inc. Transforming graphical objects in a graphical modeling environment
US20100198769A1 (en) * 2009-01-30 2010-08-05 Ab Initio Technology Llc Processing data using vector fields
US20100223218A1 (en) * 2007-01-10 2010-09-02 Radiation Watch Limited Data processing apparatus and method for automatically generating a classification component
US7840949B2 (en) * 2003-11-03 2010-11-23 Ramal Acquisition Corp. System and method for data transformation using dataflow graphs
US7853553B2 (en) * 2001-03-26 2010-12-14 Siebel Systems, Inc. Engine for converting data from a source format to a destination format using user defined mappings
US7890509B1 (en) * 2006-12-05 2011-02-15 First American Real Estate Solutions Llc Parcel data acquisition and processing
US7895586B2 (en) * 2004-06-21 2011-02-22 Sanyo Electric Co., Ltd. Data flow graph processing method, reconfigurable circuit and processing apparatus
US20110061057A1 (en) * 2009-09-04 2011-03-10 International Business Machines Corporation Resource Optimization for Parallel Data Integration
US20110066602A1 (en) * 2009-09-16 2011-03-17 Ab Initio Software Llc Mapping dataset elements
US20110295863A1 (en) * 2010-05-26 2011-12-01 Microsoft Corporation Exposing metadata relationships through filter interplay
US20120054164A1 (en) * 2010-08-27 2012-03-01 Microsoft Corporation Reducing locking during database transactions
US20120102029A1 (en) * 2010-10-25 2012-04-26 Ab Initio Technology Llc Managing data set objects
US20120158625A1 (en) * 2010-12-16 2012-06-21 International Business Machines Corporation Creating and Processing a Data Rule
US20120185449A1 (en) * 2011-01-14 2012-07-19 Ab Initio Technology Llc Managing changes to collections of data
US8484159B2 (en) * 2005-06-27 2013-07-09 Ab Initio Technology Llc Managing metadata for graph-based computations

Family Cites Families (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5168441A (en) 1990-05-30 1992-12-01 Allen-Bradley Company, Inc. Methods for set up and programming of machine and process controllers
US5446885A (en) 1992-05-15 1995-08-29 International Business Machines Corporation Event driven management information system with rule-based applications structure stored in a relational database
JPH0744368A (en) 1993-07-29 1995-02-14 Hitachi Ltd Editing system for combination model
US6216140B1 (en) * 1997-09-17 2001-04-10 Hewlett-Packard Company Methodology for the efficient management of hierarchically organized information
US6088702A (en) 1998-02-25 2000-07-11 Plantz; Scott H. Group publishing system
US6633875B2 (en) 1999-12-30 2003-10-14 Shaun Michael Brady Computer database system and method for collecting and reporting real estate property and loan performance information over a computer driven network
GB2358072B (en) 2000-01-07 2004-01-28 Mitel Corp Tabular range editing mechanism
US7143076B2 (en) 2000-12-12 2006-11-28 Sap Aktiengesellschaft Method and apparatus for transforming data
US6629098B2 (en) 2001-01-16 2003-09-30 Hewlett-Packard Development Company, L.P. Method and system for validating data submitted to a database application
JP2002279147A (en) 2001-03-22 2002-09-27 Sharp Corp In-house production determination support device, in- house determination support method, machine-readable recording medium with in-house production determination support program recorded thereon and in-house determination support program
US6732095B1 (en) 2001-04-13 2004-05-04 Siebel Systems, Inc. Method and apparatus for mapping between XML and relational representations
US6832366B2 (en) 2001-05-17 2004-12-14 Simdesk Technologies, Inc. Application generator
US7185317B2 (en) 2002-02-14 2007-02-27 Hubbard & Wells Logical data modeling and integrated application framework
US6820077B2 (en) 2002-02-22 2004-11-16 Informatica Corporation Method and system for navigating a large amount of data
US20050144189A1 (en) 2002-07-19 2005-06-30 Keay Edwards Electronic item management and archival system and method of operating the same
US7225301B2 (en) 2002-11-22 2007-05-29 Quicksilver Technologies External memory controller node
US20040225632A1 (en) 2003-05-08 2004-11-11 Microsoft Corporation Automated information management and related methods
US7257603B2 (en) 2003-05-08 2007-08-14 Microsoft Corporation Preview mode
US20050010896A1 (en) 2003-07-07 2005-01-13 International Business Machines Corporation Universal format transformation between relational database management systems and extensible markup language using XML relational transformation
US7536406B2 (en) 2004-06-23 2009-05-19 Microsoft Corporation Impact analysis in an object model
US20060007464A1 (en) 2004-06-30 2006-01-12 Percey Michael F Structured data update and transformation system
JP4550641B2 (en) 2005-03-30 2010-09-22 大陽日酸エンジニアリング株式会社 Data collation apparatus and method
US8255363B2 (en) 2005-06-08 2012-08-28 rPath Methods, systems, and computer program products for provisioning software using dynamic tags to identify and process files
US20070050750A1 (en) 2005-08-31 2007-03-01 Microsoft Corporation Extensible data-driven setup application for operating system
US20070179956A1 (en) 2006-01-18 2007-08-02 Whitmyer Wesley W Jr Record protection system for networked databases
US7970746B2 (en) 2006-06-13 2011-06-28 Microsoft Corporation Declarative management framework
US7689565B1 (en) 2006-06-28 2010-03-30 Emc Corporation Methods and apparatus for synchronizing network management data
US20080083237A1 (en) * 2006-10-06 2008-04-10 Hussmann Corporation Electronic head pressure control
US8423564B1 (en) 2006-10-31 2013-04-16 Ncr Corporation Methods and apparatus for managing and updating stored information
US20080126988A1 (en) 2006-11-24 2008-05-29 Jayprakash Mudaliar Application management tool
US8103704B2 (en) 2007-07-31 2012-01-24 ePrentise, LLC Method for database consolidation and database separation
US7860863B2 (en) 2007-09-05 2010-12-28 International Business Machines Corporation Optimization model for processing hierarchical data in stream systems
US20090234623A1 (en) 2008-03-12 2009-09-17 Schlumberger Technology Corporation Validating field data
EP2370901A4 (en) 2008-12-02 2014-04-09 Ab Initio Technology Llc Data maintenance system
KR20150042866A (en) 2008-12-02 2015-04-21 아브 이니티오 테크놀로지 엘엘시 Mapping instances of a dataset within a data management system
EP2221733A1 (en) 2009-02-17 2010-08-25 AMADEUS sas Method allowing validation in a production database of new entered data prior to their release
JP5401279B2 (en) 2009-11-26 2014-01-29 株式会社日立製作所 Check rule design support method, check rule design support system, and check rule design support program
US9805015B2 (en) 2009-12-16 2017-10-31 Teradata Us, Inc. System and method for enhanced user interactions with a grid
US8555265B2 (en) 2010-05-04 2013-10-08 Google Inc. Parallel processing of data
US20120310904A1 (en) 2011-06-01 2012-12-06 International Business Machine Corporation Data validation and service
US20130166515A1 (en) 2011-12-22 2013-06-27 David Kung Generating validation rules for a data report based on profiling the data report in a data processing tool
US8516008B1 (en) 2012-05-18 2013-08-20 Splunk Inc. Flexible schema column store
US10489360B2 (en) 2012-10-17 2019-11-26 Ab Initio Technology Llc Specifying and applying rules to data

Patent Citations (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5758351A (en) * 1995-03-01 1998-05-26 Sterling Software, Inc. System and method for the creation and use of surrogate information system objects
US5966072A (en) * 1996-07-02 1999-10-12 Ab Initio Software Corporation Executing computations expressed as graphs
US20010014890A1 (en) * 1998-02-06 2001-08-16 Gwoho Liu Methods for mapping data fields from one data set to another in a data processing environment
US6948154B1 (en) * 1999-03-22 2005-09-20 Oregon State University Methodology for testing spreadsheets
US20050262121A1 (en) * 1999-09-21 2005-11-24 International Business Machines Corporation Method, system, program, and data structure for cleaning a database table
US7164422B1 (en) * 2000-07-28 2007-01-16 Ab Initio Software Corporation Parameterized graphs with conditional components
US20040239681A1 (en) * 2000-08-07 2004-12-02 Zframe, Inc. Visual content browsing using rasterized representations
US6708186B1 (en) * 2000-08-14 2004-03-16 Oracle International Corporation Aggregating and manipulating dictionary metadata in a database system
US20020161799A1 (en) * 2001-02-27 2002-10-31 Microsoft Corporation Spreadsheet error checker
US20040056908A1 (en) * 2001-03-22 2004-03-25 Turbo Worx, Inc. Method and system for dataflow creation and execution
US7853553B2 (en) * 2001-03-26 2010-12-14 Siebel Systems, Inc. Engine for converting data from a source format to a destination format using user defined mappings
US6494159B2 (en) * 2001-05-11 2002-12-17 The United States Of America As Represented By The Secretary Of The Navy Submarine launched unmanned combat vehicle replenishment
US20030163597A1 (en) * 2001-05-25 2003-08-28 Hellman Ziv Zalman Method and system for collaborative ontology modeling
US20030016246A1 (en) * 2001-07-18 2003-01-23 Sanjai Singh Graphical subclassing
US20050178833A1 (en) * 2001-12-20 2005-08-18 Canon Information Systems Research Australia Pty Microprocessor card defining a custom user interface
US7080088B1 (en) * 2002-01-30 2006-07-18 Oracle International Corporation Automatic reconciliation of bindable objects
US7110924B2 (en) * 2002-05-15 2006-09-19 Caterpillar Inc. Method for controlling the performance of a target system
US20040015783A1 (en) * 2002-06-20 2004-01-22 Canon Kabushiki Kaisha Methods for interactively defining transforms and for generating queries by manipulating existing query data
US7167850B2 (en) * 2002-10-10 2007-01-23 Ab Initio Software Corporation Startup and control of graph-based computation
US20050060317A1 (en) * 2003-09-12 2005-03-17 Lott Christopher Martin Method and system for the specification of interface definitions and business rules and automatic generation of message validation and transformation software
US20050114369A1 (en) * 2003-09-15 2005-05-26 Joel Gould Data profiling
US20050060313A1 (en) * 2003-09-15 2005-03-17 Oracle International Corporation A California Corporation Data quality analyzer
US7765529B1 (en) * 2003-10-31 2010-07-27 The Mathworks, Inc. Transforming graphical objects in a graphical modeling environment
US7840949B2 (en) * 2003-11-03 2010-11-23 Ramal Acquisition Corp. System and method for data transformation using dataflow graphs
US20050187984A1 (en) * 2004-02-20 2005-08-25 Tianlong Chen Data driven database management system and method
US20050234762A1 (en) * 2004-04-16 2005-10-20 Pinto Stephen K Dimension reduction in predictive model development
US7895586B2 (en) * 2004-06-21 2011-02-22 Sanyo Electric Co., Ltd. Data flow graph processing method, reconfigurable circuit and processing apparatus
US20060020570A1 (en) * 2004-07-23 2006-01-26 Yuh-Cherng Wu Conflict resolution engine
US20060095466A1 (en) * 2004-11-02 2006-05-04 Daniell Stevens Managing related data objects
US20060200739A1 (en) * 2005-03-07 2006-09-07 Rishi Bhatia System and method for data manipulation
US20100100220A1 (en) * 2005-06-09 2010-04-22 Belanger David G Arrangement for guiding user design of comprehensive product solution using on-the-fly data validation
US7716630B2 (en) * 2005-06-27 2010-05-11 Ab Initio Technology Llc Managing parameters for graph-based computations
US8484159B2 (en) * 2005-06-27 2013-07-09 Ab Initio Technology Llc Managing metadata for graph-based computations
US20070011208A1 (en) * 2005-07-06 2007-01-11 Smith Alan R Apparatus, system, and method for performing semi-automatic dataset maintenance
US20070027858A1 (en) * 2005-07-29 2007-02-01 Paul Weinberg Method for generating properly formed expressions
US20070050705A1 (en) * 2005-08-30 2007-03-01 Erxiang Liu Method of xml element level comparison and assertion utilizing an application-specific parser
US20070094060A1 (en) * 2005-10-25 2007-04-26 Angoss Software Corporation Strategy trees for data mining
US20070136692A1 (en) * 2005-12-09 2007-06-14 Eric Seymour Enhanced visual feedback of interactions with user interface
US20070198457A1 (en) * 2006-02-06 2007-08-23 Microsoft Corporation Accessing and manipulating data in a data flow graph
US7661067B2 (en) * 2006-02-21 2010-02-09 International Business Machines Corporation Method for providing quick responses in instant messaging conversations
US20070226203A1 (en) * 2006-03-23 2007-09-27 Microsoft Corporation Generation of query and update views for object relational mapping
US20070294119A1 (en) * 2006-03-30 2007-12-20 Adaptive Alpha, Llc System, method and computer program product for evaluating and rating an asset management business and associate investment funds using experiential business process and performance data, and applications thereof
US20070239751A1 (en) * 2006-03-31 2007-10-11 Sap Ag Generic database manipulator
US20070276787A1 (en) * 2006-05-15 2007-11-29 Piedmonte Christopher M Systems and Methods for Data Model Mapping
US20070271381A1 (en) * 2006-05-16 2007-11-22 Joseph Skeffington Wholey Managing computing resources in graph-based computations
US20080049022A1 (en) * 2006-08-10 2008-02-28 Ab Initio Software Corporation Distributing Services in Graph-Based Computations
US7890509B1 (en) * 2006-12-05 2011-02-15 First American Real Estate Solutions Llc Parcel data acquisition and processing
US20080162384A1 (en) * 2006-12-28 2008-07-03 Privacy Networks, Inc. Statistical Heuristic Classification
US20100223218A1 (en) * 2007-01-10 2010-09-02 Radiation Watch Limited Data processing apparatus and method for automatically generating a classification component
US20080228697A1 (en) * 2007-03-16 2008-09-18 Microsoft Corporation View maintenance rules for an update pipeline of an object-relational mapping (ORM) platform
US20080243772A1 (en) * 2007-03-29 2008-10-02 Ariel Fuxman Method and sytsem for generating nested mapping specifications in a schema mapping formalism and for generating transformation queries based thereon
US20080243891A1 (en) * 2007-03-30 2008-10-02 Fmr Corp. Mapping Data on a Network
US20080256014A1 (en) * 2007-04-10 2008-10-16 Joel Gould Editing and Compiling Business Rules
US20080312979A1 (en) * 2007-06-13 2008-12-18 International Business Machines Corporation Method and system for estimating financial benefits of packaged application service projects
US20080313204A1 (en) * 2007-06-14 2008-12-18 Colorquick, L.L.C. Method and apparatus for database mapping
US20090036749A1 (en) * 2007-08-03 2009-02-05 Paul Donald Freiburger Multi-volume rendering of single mode data in medical diagnostic imaging
US20090094291A1 (en) * 2007-09-14 2009-04-09 Oracle International Corporation Support for compensation aware data types in relational database systems
US20090083313A1 (en) * 2007-09-20 2009-03-26 Stanfill Craig W Managing Data Flows in Graph-Based Computations
US20090089630A1 (en) * 2007-09-28 2009-04-02 Initiate Systems, Inc. Method and system for analysis of a system for matching data records
US20090193046A1 (en) * 2008-01-24 2009-07-30 Oracle International Corporation Match rules to identify duplicate records in inbound data
US20100145914A1 (en) * 2008-06-09 2010-06-10 Panasonic Corporation Database management server apparatus, database management system, database management method and database management program
US20090319494A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Field mapping for data stream output
US20090327196A1 (en) * 2008-06-30 2009-12-31 Ab Initio Software Llc Data Logging in Graph-Based Computations
US20100083237A1 (en) * 2008-09-26 2010-04-01 Arm Limited Reducing trace overheads by modifying trace operations
US20100114833A1 (en) * 2008-10-31 2010-05-06 Netapp, Inc. Remote office duplication
US20100121890A1 (en) * 2008-11-12 2010-05-13 Ab Initio Software Llc Managing and automatically linking data objects
US20100198769A1 (en) * 2009-01-30 2010-08-05 Ab Initio Technology Llc Processing data using vector fields
US20120167112A1 (en) * 2009-09-04 2012-06-28 International Business Machines Corporation Method for Resource Optimization for Parallel Data Integration
US20110061057A1 (en) * 2009-09-04 2011-03-10 International Business Machines Corporation Resource Optimization for Parallel Data Integration
US20110066602A1 (en) * 2009-09-16 2011-03-17 Ab Initio Software Llc Mapping dataset elements
US20110295863A1 (en) * 2010-05-26 2011-12-01 Microsoft Corporation Exposing metadata relationships through filter interplay
US20120054164A1 (en) * 2010-08-27 2012-03-01 Microsoft Corporation Reducing locking during database transactions
US20120102029A1 (en) * 2010-10-25 2012-04-26 Ab Initio Technology Llc Managing data set objects
US20120158625A1 (en) * 2010-12-16 2012-06-21 International Business Machines Corporation Creating and Processing a Data Rule
US20120185449A1 (en) * 2011-01-14 2012-07-19 Ab Initio Technology Llc Managing changes to collections of data

Cited By (95)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10540659B2 (en) 2002-03-05 2020-01-21 Visa U.S.A. Inc. System for personal authorization control for card transactions
US11341155B2 (en) 2008-12-02 2022-05-24 Ab Initio Technology Llc Mapping instances of a dataset within a data management system
US8825695B2 (en) * 2009-09-16 2014-09-02 Ab Initio Technology Llc Mapping dataset elements
US20110066602A1 (en) * 2009-09-16 2011-03-17 Ab Initio Software Llc Mapping dataset elements
US8930337B2 (en) 2009-09-16 2015-01-06 Ab Initio Technology Llc Mapping dataset elements
US9727438B2 (en) * 2010-08-25 2017-08-08 Ab Initio Technology Llc Evaluating dataflow graph characteristics
US20120054255A1 (en) * 2010-08-25 2012-03-01 Ab Initio Technology Llc Evaluating dataflow graph characteristics
US11409802B2 (en) 2010-10-22 2022-08-09 Data.World, Inc. System for accessing a relational database using semantic queries
WO2012061109A1 (en) * 2010-10-25 2012-05-10 Ab Initio Technology Llc Managing data set objects in a dataflow graph that represents a computer program
US9977659B2 (en) 2010-10-25 2018-05-22 Ab Initio Technology Llc Managing data set objects
CN103180826A (en) * 2010-10-25 2013-06-26 起元技术有限责任公司 Managing data set objects in a dataflow graph that represents a computer program
KR101894925B1 (en) * 2011-01-07 2018-09-04 아브 이니티오 테크놀로지 엘엘시 Flow analysis instrumentation
KR20140006862A (en) * 2011-01-07 2014-01-16 아브 이니티오 테크놀로지 엘엘시 Flow analysis instrumentation
US9418095B2 (en) 2011-01-14 2016-08-16 Ab Initio Technology Llc Managing changes to collections of data
US8217945B1 (en) 2011-09-02 2012-07-10 Metric Insights, Inc. Social annotation of a single evolving visual representation of a changing dataset
US8538934B2 (en) * 2011-10-28 2013-09-17 Microsoft Corporation Contextual gravitation of datasets and data services
US9251225B2 (en) 2012-07-24 2016-02-02 Ab Initio Technology Llc Mapping entities in data models
US9444674B2 (en) 2012-10-02 2016-09-13 Microsoft Technology Licensing, Llc Heuristic analysis of responses to user requests
US10489360B2 (en) 2012-10-17 2019-11-26 Ab Initio Technology Llc Specifying and applying rules to data
WO2014209260A1 (en) * 2013-06-24 2014-12-31 Hewlett-Packard Development Company, L.P. Processing a data flow graph of a hybrid flow
US10515118B2 (en) 2013-06-24 2019-12-24 Micro Focus Llc Processing a data flow graph of a hybrid flow
US20160292444A1 (en) * 2013-11-08 2016-10-06 Norman Shaw Data accessibility control
US10592680B2 (en) * 2013-11-08 2020-03-17 Exacttrak Limited Data accessibility control
US11281596B2 (en) 2014-03-14 2022-03-22 Ab Initio Technology Llc Mapping attributes of keyed entities
US20150261694A1 (en) * 2014-03-14 2015-09-17 Ab Initio Technology Llc Mapping attributes of keyed entities
US10191863B2 (en) * 2014-03-14 2019-01-29 Ab Initio Technology Llc Mapping attributes of keyed entities
US10191862B2 (en) 2014-03-14 2019-01-29 Ab Initio Technology Llc Mapping attributes of keyed entities
US10877955B2 (en) * 2014-04-29 2020-12-29 Microsoft Technology Licensing, Llc Using lineage to infer data quality issues
US10089409B2 (en) 2014-04-29 2018-10-02 Microsoft Technology Licensing, Llc Event-triggered data quality verification
US20150310055A1 (en) * 2014-04-29 2015-10-29 Microsoft Corporation Using lineage to infer data quality issues
US10318283B2 (en) 2014-07-18 2019-06-11 Ab Initio Technology Llc Managing parameter sets
US10175974B2 (en) 2014-07-18 2019-01-08 Ab Initio Technology Llc Managing lineage information
JP2017525039A (en) * 2014-07-18 2017-08-31 アビニシオ テクノロジー エルエルシー System information management
US11210086B2 (en) 2014-07-18 2021-12-28 Ab Initio Technology Llc Managing parameter sets
EP3742284A1 (en) * 2014-07-18 2020-11-25 AB Initio Technology LLC Managing lineage information
WO2016011442A1 (en) * 2014-07-18 2016-01-21 Ab Initio Technology Llc Managing lineage information
US20160036621A1 (en) * 2014-08-01 2016-02-04 Cameo Communications, Inc. Management system and management method
CN105302843A (en) * 2014-08-01 2016-02-03 友劲科技股份有限公司 Management system and management method
US9626393B2 (en) 2014-09-10 2017-04-18 Ab Initio Technology Llc Conditional validation rules
WO2016177405A1 (en) * 2015-05-05 2016-11-10 Huawei Technologies Co., Ltd. Systems and methods for transformation of a dataflow graph for execution on a processing system
US11042556B2 (en) 2016-06-19 2021-06-22 Data.World, Inc. Localized link formation to perform implicitly federated queries using extended computerized query language syntax
US11334625B2 (en) 2016-06-19 2022-05-17 Data.World, Inc. Loading collaborative datasets into data stores for queries via distributed computer networks
US11023104B2 (en) 2016-06-19 2021-06-01 Data.World, Inc. Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets
US11036716B2 (en) 2016-06-19 2021-06-15 Data.World, Inc. Layered data generation and data remediation to facilitate formation of interrelated data in a system of networked collaborative datasets
US11036697B2 (en) * 2016-06-19 2021-06-15 Data.World, Inc. Transmuting data associations among data arrangements to facilitate data operations in a system of networked collaborative datasets
US11042560B2 (en) 2016-06-19 2021-06-22 Data.World, Inc. Extended computerized query language syntax for analyzing multiple tabular data arrangements in data-driven collaborative projects
US11016931B2 (en) * 2016-06-19 2021-05-25 Data.World, Inc. Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets
US11042537B2 (en) * 2016-06-19 2021-06-22 Data.World, Inc. Link-formative auxiliary queries applied at data ingestion to facilitate data operations in a system of networked collaborative datasets
US11042548B2 (en) 2016-06-19 2021-06-22 Data.World, Inc. Aggregation of ancillary data associated with source data in a system of networked collaborative datasets
US11068847B2 (en) 2016-06-19 2021-07-20 Data.World, Inc. Computerized tools to facilitate data project development via data access layering logic in a networked computing platform including collaborative datasets
US11947554B2 (en) 2016-06-19 2024-04-02 Data.World, Inc. Loading collaborative datasets into data stores for queries via distributed computer networks
US11086896B2 (en) * 2016-06-19 2021-08-10 Data.World, Inc. Dynamic composite data dictionary to facilitate data operations via computerized tools configured to access collaborative datasets in a networked computing platform
US11093633B2 (en) 2016-06-19 2021-08-17 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11163755B2 (en) 2016-06-19 2021-11-02 Data.World, Inc. Query generation for collaborative datasets
US11941140B2 (en) 2016-06-19 2024-03-26 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11210313B2 (en) 2016-06-19 2021-12-28 Data.World, Inc. Computerized tools to discover, form, and analyze dataset interrelations among a system of networked collaborative datasets
US11928596B2 (en) 2016-06-19 2024-03-12 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11816118B2 (en) 2016-06-19 2023-11-14 Data.World, Inc. Collaborative dataset consolidation via distributed computer networks
US11755602B2 (en) 2016-06-19 2023-09-12 Data.World, Inc. Correlating parallelized data from disparate data sources to aggregate graph data portions to predictively identify entity data
US11734564B2 (en) 2016-06-19 2023-08-22 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11726992B2 (en) 2016-06-19 2023-08-15 Data.World, Inc. Query generation for collaborative datasets
US11246018B2 (en) 2016-06-19 2022-02-08 Data.World, Inc. Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets
US11277720B2 (en) 2016-06-19 2022-03-15 Data.World, Inc. Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets
US11675808B2 (en) 2016-06-19 2023-06-13 Data.World, Inc. Dataset analysis and dataset attribute inferencing to form collaborative datasets
US20230109821A1 (en) * 2016-06-19 2023-04-13 Data.World, Inc. Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets
US11314734B2 (en) 2016-06-19 2022-04-26 Data.World, Inc. Query generation for collaborative datasets
US11327996B2 (en) 2016-06-19 2022-05-10 Data.World, Inc. Interactive interfaces to present data arrangement overviews and summarized dataset attributes for collaborative datasets
US20230105459A1 (en) * 2016-06-19 2023-04-06 Data.World, Inc. Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets
US11609680B2 (en) 2016-06-19 2023-03-21 Data.World, Inc. Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets
US11468049B2 (en) * 2016-06-19 2022-10-11 Data.World, Inc. Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets
US11366824B2 (en) 2016-06-19 2022-06-21 Data.World, Inc. Dataset analysis and dataset attribute inferencing to form collaborative datasets
US11373094B2 (en) 2016-06-19 2022-06-28 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US11386218B2 (en) 2016-06-19 2022-07-12 Data.World, Inc. Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization
US20220277004A1 (en) * 2016-06-19 2022-09-01 Data.World, Inc. Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets
US11423039B2 (en) 2016-06-19 2022-08-23 Data.World, Inc. Collaborative dataset consolidation via distributed computer networks
US11669540B2 (en) 2017-03-09 2023-06-06 Data.World, Inc. Matching subsets of tabular data arrangements to subsets of graphical data arrangements at ingestion into data-driven collaborative datasets
US11068453B2 (en) * 2017-03-09 2021-07-20 Data.World, Inc. Determining a degree of similarity of a subset of tabular data arrangements to subsets of graph data arrangements at ingestion into a data-driven collaborative dataset platform
US11238109B2 (en) * 2017-03-09 2022-02-01 Data.World, Inc. Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform
US10691729B2 (en) * 2017-07-07 2020-06-23 Palantir Technologies Inc. Systems and methods for providing an object platform for a relational database
US20190012369A1 (en) * 2017-07-07 2019-01-10 Palantir Technologies Inc. Systems and methods for providing an object platform for a relational database
US11301499B2 (en) * 2017-07-07 2022-04-12 Palantir Technologies Inc. Systems and methods for providing an object platform for datasets
US10592147B2 (en) 2017-07-26 2020-03-17 International Business Machines Corporation Dataset relevance estimation in storage systems
US10671303B2 (en) 2017-09-13 2020-06-02 International Business Machines Corporation Controlling a storage system
US11243960B2 (en) 2018-03-20 2022-02-08 Data.World, Inc. Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures
US11573948B2 (en) 2018-03-20 2023-02-07 Data.World, Inc. Predictive determination of constraint data for application with linked data in graph-based datasets associated with a data-driven collaborative dataset platform
EP3770774A4 (en) * 2018-03-23 2021-05-26 Huawei Technologies Co., Ltd. Control method for household appliance, and household appliance
US11190618B2 (en) 2018-03-23 2021-11-30 Huawei Technologies Co., Ltd. Scheduling method, scheduler, storage medium, and system
US11327991B2 (en) * 2018-05-22 2022-05-10 Data.World, Inc. Auxiliary query commands to deploy predictive data models for queries in a networked computing platform
USD940732S1 (en) 2018-05-22 2022-01-11 Data.World, Inc. Display screen or portion thereof with a graphical user interface
USD940169S1 (en) 2018-05-22 2022-01-04 Data.World, Inc. Display screen or portion thereof with a graphical user interface
US11947529B2 (en) 2018-05-22 2024-04-02 Data.World, Inc. Generating and analyzing a data model to identify relevant data catalog data derived from graph-based data arrangements to perform an action
US11657089B2 (en) 2018-06-07 2023-05-23 Data.World, Inc. Method and system for editing and maintaining a graph schema
US11442988B2 (en) 2018-06-07 2022-09-13 Data.World, Inc. Method and system for editing and maintaining a graph schema
US20230169124A1 (en) * 2021-11-30 2023-06-01 Data.World, Inc. Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform
US11947600B2 (en) 2021-11-30 2024-04-02 Data.World, Inc. Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures

Also Published As

Publication number Publication date
CA2744881C (en) 2020-03-10
JP5525541B2 (en) 2014-06-18
US20200311098A1 (en) 2020-10-01
EP2370892A1 (en) 2011-10-05
KR101661532B1 (en) 2016-09-30
WO2010065511A1 (en) 2010-06-10
EP2370892B1 (en) 2020-11-04
AU2009322602B2 (en) 2015-06-25
US11341155B2 (en) 2022-05-24
KR20110097921A (en) 2011-08-31
EP2370892A4 (en) 2016-03-09
CN102232212A (en) 2011-11-02
AU2009322602A1 (en) 2010-06-10
JP2012510687A (en) 2012-05-10
KR20150042866A (en) 2015-04-21
CA2744881A1 (en) 2010-06-10
CN102232212B (en) 2015-11-25

Similar Documents

Publication Publication Date Title
US11341155B2 (en) Mapping instances of a dataset within a data management system
US11461294B2 (en) System for importing data into a data repository
US10678810B2 (en) System for data management in a large scale data repository
CN105144080B (en) System for metadata management
US8433673B2 (en) System and method for supporting data warehouse metadata extension using an extender
JP2021099819A (en) Specifying and applying logical adequacy inspection rule to data
US11726969B2 (en) Matching metastructure for data modeling
US5659723A (en) Entity/relationship to object oriented logical model conversion method
US8954375B2 (en) Method and system for developing data integration applications with reusable semantic types to represent and process application data
CA2723933C (en) Methods and systems for developing, debugging, and executing data integration applications
US7401085B2 (en) System and method for controlling the release of updates to a database configuration
US20060235899A1 (en) Method of migrating legacy database systems
KR20130130706A (en) Managing data set objects in a dataflow graph that represents a computer program
WO2014019093A1 (en) System and method for managing versions of program assets
US10417234B2 (en) Data flow modeling and execution
US20230004477A1 (en) Providing a pseudo language for manipulating complex variables of an orchestration flow
Buchgeher et al. A platform for the automated provisioning of architecture information for large-scale service-oriented software systems
JP6588988B2 (en) Business program generation support system and business program generation support method
EP4109287A1 (en) A collaborative system and method for multi-user data management
CN113918208A (en) Middleware management method and device, computing equipment and storage medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: AB INITIO SOFTWARE LLC, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: WAKELING, TIM; WEISS, ADAM; REEL/FRAME: 023597/0439

Effective date: 20091124

AS Assignment

Owner name: AB INITIO TECHNOLOGY LLC, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: AB INITIO ORIGINAL WORKS LLC; REEL/FRAME: 024377/0007

Effective date: 20100511

Owner name: AB INITIO ORIGINAL WORKS LLC, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: AB INITIO SOFTWARE LLC; REEL/FRAME: 024377/0009

Effective date: 20100511

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION