WO1992015066A1 - Method of evaluating a recursive query of a database - Google Patents

Method of evaluating a recursive query of a database Download PDF

Info

Publication number
WO1992015066A1
WO1992015066A1 PCT/US1992/001458 US9201458W WO9215066A1 WO 1992015066 A1 WO1992015066 A1 WO 1992015066A1 US 9201458 W US9201458 W US 9201458W WO 9215066 A1 WO9215066 A1 WO 9215066A1
Authority
WO
WIPO (PCT)
Prior art keywords
fixpoint
recursive
query
relation
operator
Prior art date
Application number
PCT/US1992/001458
Other languages
French (fr)
Inventor
Ming-Chien Shan
Marie-Anne Neimat
Original Assignee
Hewlett-Packard Company
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Company filed Critical Hewlett-Packard Company
Publication of WO1992015066A1 publication Critical patent/WO1992015066A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24537Query rewriting; Transformation of operators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24564Applying rules; Deductive queries
    • G06F16/24566Recursive queries

Definitions

  • the present invention relates generally to database systems and more particularly to a method of evaluating a recursive query of a database.
  • Database systems are being used to store and manage more and more different kinds of data. As the use of database systems has expanded and the quantities of data stored in a database have increased, much effort has been devoted to improving existing database systems and developing new systems with new and better capabilities.
  • Each table may be thought of as specifying a "relation" among the data in that table; therefore, each table is referred to as a "relation”.
  • Each row in a relation may be thought of as one data record. The rows are referred to as "tuples”.
  • a database means to request information from it.
  • a query means to obtain the requested information. Sometimes the requested information can be obtained directly by looking it up in one of the relations. If the requested information does not appear in any of the relations, it must be derived, for example by comparing two or more tuples in a relation or by combining tuples from two different relations. The following four examples will help to illustrate these concepts.
  • PNAME is the name of a parent of a person named NAME.
  • PARENT Andrew, William
  • tuple which says that William is a parent of Andrew.
  • FNAME is the name of a friend of a person named NAME.
  • NAME is the name of a person and SEX is the sex of that person.
  • SEX is the sex of that person.
  • Representative PARENT, FRIEND and PERSON relations are depicted in Tables I through III, respectively.
  • a query of the form "FIND PARENTS OF [X]" is an example of a request for information that is directly obtainable by looking up data in this particular database.
  • the response to such a query would be the names of the parents of X.
  • a query of the form "FIND MOTHER OF [X]" is an example of a request for information that cannot be obtained by looking it up.
  • the PARENT relation does not include the sex of the parents and, unlike a human, a computer does not know, for example, that "William” would ordinarily be a father and "Mary" a mother. Therefore, the requested information must be derived from information in the database, for example by (1) finding the parents of X in the PARENT relation, (2) finding the sex of each of those parents in the PERSON relation, and (3) selecting the female parent.
  • the response to such a query would be the name of the mother of X.
  • SUPPLIER contains data arranged in the form
  • SUPPLIER (PART, SUPPLIER, CITY)
  • SUPPLIER represents a name of a source of that part
  • CITY represents the location of that supplier.
  • SUPPLIER Naeedle Valve, Paul's Plumbing Mfg. Co., Pittsburgh
  • SPART represents the name of a part that requires a subpart
  • SSUBPART represents the name of that subpart
  • SQTY represents how many of that subpart are used in that part.
  • a query of the form "FIND SUPPLIERS OF [X]" is an example of a request for information that is directly obtainable by retrieving data from this particular database.
  • the response to such a query would be the names of all suppliers of part X.
  • a query of the form "HOW MANY SUPPLIERS SUPPLY [X]" is an example of a request for information that is not directly obtainable but that can be derived from information in the database, for example by finding all suppliers of X and then counting how many supplier names are found. The response to such a query would be the number of suppliers that supply part X.
  • FLIGHT FLIGHT (FROM, TO, DISTANCE, DTIME, ATIME, COST)
  • FROM is the name of a departure city
  • TO is the name of an arrival city
  • DISTANCE is the mileage between those two cities
  • DTIME is the time of departure
  • ATIME is the time of arrival
  • COST is the cost of a ticket on that flight.
  • a query of the form "FIND CHEAPEST FLIGHT BETWEEN [X] and [Y]" is an example of a request for information that cannot be obtained merely by retrieving data from this database but that can be derived from information in the database, for example by finding the fares of all flights between city X and city Y and then comparing the various fares to find which is lowest.
  • the response to such a query would be the flight number of the cheapest flight between X and Y.
  • VACCINATED is a yes-or-no entry indicating whether the person has been vaccinated against the virus.
  • Representative ENCOUNTER and PERSON relations are depicted in Tables V and VI, respectively.
  • a query that requests the names of all women who had encounters with a certain man is an example of a request for information that can be retrieved directly.
  • a query that seeks the names of all vaccinated women who had encounters with a certain man is an example of a request for information that can be derived by (1) retrieving the names of all women who had encounters with that man according to the ENCOUNTER relation and (2) checking the names of each of those women according to the PERSON relation to find out which of them have been vaccinated.
  • relational algebra A set of relational operators collectively referred to as "relational algebra" has been developed for use in optimizing the evaluation of a complicated query in a large database.
  • a query is translated into a relational algebra expression; the expression is simplified according to certain procedures; query plans for evaluating the simplified expression are generated; and the most efficient of these plans is selected and carried out to provide the desired response. See generally C. J. Date, An Introduction to Database Systems (4th Ed.) Vol. I, Addison-Wesley 1986, chapters 13 and 16, and references cited therein.
  • the relational algebra includes a set of operators. These operators can be compared with arithmetic operators such as " + " and " ⁇ ". Just as an arithmetic operator operates on one or two "input” numbers and provides a new "output” number, so each relational algebra operator takes one or two relations as "inputs” and provides a new relation as "output”.
  • a few examples of these operators are the SELECT, PROJECT, UNION, INTERSECTION, and JOIN operators.
  • the SELECT operator obtains specified rows from a relation. For example, a query seeking the names of Andrew's parents could be phrased as "SELECT entries respecting Andrew from the PARENT relation" (see Example 1 above). This SELECT operation would be represented in the relational algebra as
  • the PROJECT operator obtains specified columns from a relation. For example, a query seeking the names of all persons who have children could be phrased as
  • PROJECT PNAME entries from the PARENT relation This PROJECT operation would be represented in the relational algebra as
  • the UNION operator collects all the rows of each of two relations.
  • the UNION operation is represented in relational algebra as
  • a and B are relations.
  • PARENT U FRIEND in the historical database would provide a new relation containing parent-names and friend-names of everyone who has either a parent or a friend or both.
  • the NAME entries in the PARENT relation would be correlated with the NAME entries in the FRIEND relation; for example, the name Andrew in the PARENT relation would be considered to refer to the same person as the name Andrew in the FRIEND relation.
  • the INTERSECTION operator collects common rows of each of two relations. The INTERSECTION operation is represented in relational algebra as
  • PARENT D FRIEND in the historical database would provide a new relation containing parent-names and friend-names of those persons who have both a parent and a friend.
  • the JOIN operator combines rows from each of two relations according to a specified condition.
  • relational algebra the JOIN operation is represented as
  • a kind of database query which has grown more important in recent years is a recursive query.
  • Such a query can be described as a query which queries itself.
  • a recursive query can be evaluated only by deriving information recursively.
  • a general discussion of the mathematical concept of recursion can be found in J. Bradley,
  • information can be recursively derived from the information in the database, for example by a query of the form "FIND PARENTS OF [FIND PARENTS OF [X]]". If the number of iterations required to evaluate a recursive query is known in advance, then the evaluation process is relatively straight-forward. For example, the request to find the grandparents of X requires exactly two iterations ⁇ one to find the parents of X and one to find the parents of the parents of X. However, if the number of iterations is not known, then the evaluation becomes far more difficult; an example of a request in which the number of iterations is not known is a request to find all ancestors of X.
  • the relational algebra does not have recursion operators and hence cannot support recursive queries.
  • Some relatively simple recursive queries can be expressed in transitive closure form, and transitive closure operators have been proposed for use in translating such queries into relational algebra expressions (R. Agrawal, "Alpha: An Extension of Relational Algebra to Express a Class of Recursive Queries", Proceedings of the Third International Conference on Data Engineering, Los Angeles, California, February 3-5, 1987; S. Ceri et al., "Translation and Optimization of Logic Queries: the Algebraic Approach", Proceedings of the Eleventh International Conference on Very Large Data Bases, Kyoto, Japan, August 1986).
  • not all recursive queries can be expressed in transitive closure form. From the foregoing, it will be apparent that there is a need for a way to optimize recursive queries, especially those which cannot be expressed in transitive closure form, for efficient evaluation in large and complex database systems.
  • the present invention provides a method of evaluating a recursive query in a database system by translating the query into an expression that includes a novel fixpoint operator and using a novel set of transformation procedures to simplify the translated query.
  • a method of evaluating a recursive query includes the steps of translating the query into a relational algebra expression that includes a fixpoint operator, optimizing the expression according to a set of transformation procedures, and evaluating the optimized expression by reference to data in the database.
  • transformation procedures include commuting a projection operation with a fixpoint operation, commuting a selection operation with a fixpoint operation, distributing a join operation over a fixpoint operation, and regrouping a join operation and a fixpoint operation.
  • Regrouping means applying the commutation and association rules, typically to an expression having a fixpoint and several join operators.
  • the selection operation may be a selection predicate on a direct mapping column, a global selection predicate, or a selection predicate that includes a join operation.
  • FIGURE 1 is a data flow diagram depicting a preferred embodiment of a method of optimizing recursive queries according to the invention
  • FIG. 2 is a flow diagram depicting initial and recursive inputs of a fixpoint operator as referenced in the "translate using ⁇ " process of FIG. 1;
  • FIG. 3 is a flow diagram depicting a generalized version of the fixpoint operator shown in FIG. 2.
  • the invention is embodied in a novel method of evaluating recursive queries in a database system.
  • Relational algebra provides a powerful technique for optimizing the evaluation of database queries, but recursive queries have not been amenable to such optimization techniques.
  • the query can be evaluated much more efficiently than would otherwise be possible.
  • a method of evaluating a recursive query of a database 11 comprises translating a recursive query into an expression that includes a fixpoint operator, as indicated by a "translate using ⁇ ” process circle 13; optimizing the expression according to a set of transformation procedures as indicated by an “optimize” process circle 15; and evaluating the optimized expression, as indicated by an "evaluate” process circle 17, by reference to data in the database 11.
  • the recursive query is received from a user as indicated by an input box 19 and an arrow extending from the box 19 to the "translate” circle 13.
  • the "user” may be a person at a computer terminal, but the user could also be, for example, an electronic device, an application program, or the like.
  • the result of evaluating the optimized query is provided to the user as indicated by an output box 21 and an arrow extending from the "evaluate” circle 17 to the box 21.
  • the user who receives the output is usually the same user as the one who generated the query, but this need not be the case; the user that generates a query could specify that the result be sent somewhere else.
  • These new transformation procedures include commuting a projection operation with a fixpoint operation as indicated by a "commute projection” process circle 25, commuting a selection operation with a fixpoint operation as indicated by a “commute selection” process circle 27, distributing a join operation over a fixpoint operation as indicated by a “distribute join” process circle 29, and regrouping join and fixpoint operations as indicated by a "regroup” process circle 31.
  • Commuting a selection operation with a fixpoint operation comprises commuting a selection predicate on a direct mapping column with a fixpoint operation as indicated by a "direct map” process circle 33, commuting a global selection predicate with a fixpoint operation as indicated by a "global” process circle 35, and commuting a selection predicate that includes a join with a fixpoint operation as indicated by a "join” process circle 37.
  • Arrows extend in both directions between the "commute selection” circle 27 and each of the circles 33, 35 and 37 to indicate that one or more than one of the procedures respecting commuting a selection and a fixpoint operator may be used as needed; of course, none of the "commute selection" procedures may be required in some cases.
  • Regrouping a join operation and a fixpoint operation comprises commutation and association as indicated by a "commutation” process circle 39 and an “association” process circle 41, respectively. Arrows extending in both directions between the
  • transformation procedures may also be used to simplify a recursive query as indicated by a "transform procedures" process circle 43.
  • Arrows extending in both directions between the circle 43 and the circle 15 indicate that these previously-known transformation procedures may be used once, several times or not at all in any given optimization.
  • fixpoint operator enhances the declarative power of relational algebra by supporting recursive queries. It is expected that the introduction of the fixpoint operator will benefit many database and computer applications such as computer aided design and manufacture (CAD/CAM), software engineering (CASE), and artificial intelligence (AI) applications.
  • CAD/CAM computer aided design and manufacture
  • CASE software engineering
  • AI artificial intelligence
  • the inputs (operands) and output of the fixpoint operator are relations.
  • the fixpoint operator supports least fixed point semantics.
  • the fixpoint operator can compute both linear and mutually recursive relations. In a simple form the fixpoint operator is defined symbolically by an expression of the form
  • I j represents the j-th initial input
  • R j represents they j-th recursive input.
  • Each initial input I is a relation. There may be one or more such inputs; M initial inputs are shown in FIG. 2.
  • Each recursive input R is also a relation. There may be none, one or more of these recursive inputs; N recursive inputs are shown in FIG. 2.
  • a recursive input may, but need not, be the same as an initial input.
  • the output of the fixpoint operator is fed back as a recursive input R c ; this recursive input R c differs from the N other recursive inputs in that the input R c is derived by the fixpoint operator whereas the other recursive inputs are not.
  • the initial inputs are utilized to provide a first output. This output is fed back as the recursive input R c .
  • the recursive input R c and the N other recursive inputs are utilized to provide a second output, and so on for as many iterations as are required.
  • fixpoint operator simplifies to conventional join and union operations.
  • Each initial input I is a relation. There may be one or more such inputs; M initial inputs are shown in FIG. 3.
  • Each recursive input R is also a relation. There may be none, one or more of these recursive inputs; N recursive inputs are shown in FIG. 2.
  • a recursive input may, but need not, be the same as an initial input.
  • the fixpoint operator provides K recursive relations as outputs and these are fed back as recursive inputs R C1 through R CK ; these K recursive inputs R c differ from the N other recursive inputs in that the inputs R c are derived by the fixpoint operator whereas the other recursive inputs are not.
  • J may but need not be equal to K.
  • the sets of initial or recursive inputs for each recursive relation need not be disjoint.
  • the recursive inputs cannot be disjoint because if they were the relations would not be mutually recursive.
  • Cartesion product of the K mutually recursive output relations can be used to extract individual relations from the output.
  • the historical "people” database as described in Example 1 above includes the base relations PARENT, FRIEND and PERSON as given in Tables I through III.
  • DNAME is the name of a person and ANAME is the name of an ancestor of that person, is defined as
  • PARENT ⁇ name aname: dname, pname (PARENT, ANCESTOR).
  • fixpoint operator may then be used to express various recursive queries of the "people” database. Specifically, a query of the form “Find all ancestors of John” is expressed:
  • the factory database of Example 2 above includes the base relations SUPPLIER and SUBPART.
  • a derived relation COMP of the form
  • fixpoint operator may then be used to express various recursive queries such as a query of the form "Find the location and quantities of any parts that go into making a Locomotive" as follows:
  • FLIGHT ⁇ destination from, catime ⁇ dtime, distance ⁇ 1000: start, to, cdtime, atime, ccost, +cost ( FLIGHT, SHORT_CONNECTION )
  • the fixpoint operator is used, for example, to express a query of the form "Find the minimum cost of all flights between London and San Francisco under the condition that the distance between each pair of connecting points is less than 1,000 miles" as follows:
  • a main task to be performed by a query optimizer is to rearrange the sequence of operations in an expression of a query for more efficient evaluation.
  • the query expression Starting with an initial form generated by a parser, the query expression usually undergoes a sequence of transformations based upon certain heuristic rules or execution cost comparisons.
  • the transformations usually include:
  • Example 4 above (the study of the sexually transmitted virus) will be used to illustrate these rules.
  • EXPSDFML is defined as:
  • EXPSDML is defined as:
  • Tables V and VI above illustrate the ENCOUNTER(E) and PERSON(P) relations.
  • the E XPSDFML(F) AND EXPSDML(M) relations are illustrated in the following tables VII and VIII.
  • a selection predicate is said to be global if it is applied at each iteration during the generation of the recursive relation defined by the fixpoint operation. In other words, once a tuple of an input relation fails to satisfy a global selection predicate, it will be excluded from consideration in any of the subsequent recursive computations. (The detection of global predicates will be discussed later.)
  • a detailed semantic analysis is required to determine the legal transformations that can be applied to an algebraic expression. The present discussion is limited to transformation rules which do not require any semantic query analysis. Formal proofs of the rules are omitted because of their length and instead each rule is motivated either with a sketch of a proof or with detailed examples.
  • the purpose of this translation is to reduce the sizes of the operands for each operation.
  • the fixpoint operator ⁇ is viewed as a generator of a directed graph consisting of all paths leading to all possible answers, an early selection on the initial input and recursive input relations has the effect of eliminating the unqualified paths in the graph before they are generated. See, generally, Ioannidis, Y., and Wong, W., "On the Computation of the Transitive Closure of Relational Operators", Proc. of 12th Int. Conf.
  • direct mapping column is very similar to that of invariant column introduced in Devanbu, P. and Agrawal, R., "Moving Selections Into Fixpoint Queries", Proc. of 4th Int. Conf on Data Engineering, Los Angeles, February, 1988.
  • Direct mapping columns are actually a subset of invariant columns.
  • the detection of the more general invariant columns requires a detailed analysis of the selection predicates.
  • the advantage of concentrating on direct mapping columns is that their detection is trivial and they cover the majority of the cases for which the payoff for performing early selections is substantial.
  • this rule was used to move the selection predicate
  • this rule was used to move the global selection predicates
  • EXPSDFML and EXPSDML This will be detected by the query optimizer and query Q3 will be translated to:
  • join operator need not be converted to a right outer join operator.
  • the distribution of joins is performed over the initial inputs and non-recursive relations in the recursive inputs and only when the join columns are restricted to a single non-recursive relation in the inputs.
  • the invention provides an effective and efficient method of evaluating linear and recursive queries in large databases.

Abstract

A method of evaluating a recursive query in a database (11). A recursive query is translated (13) into a relational algebra expression which includes a new fixpoint operator. The fixpoint operator supports mutually recursive and linearly recursive queries. The resulting expression is optimized (15) according to a set of fixpoint procedures.

Description

METHOD OF EVALUATING A RECURSIVE QUERY OF A DATABASE
CROSS-REFERENCE TO RELATED APPLICATION
This patent application is a continuation of patent application serial number 07/487,346, filed March 1, 1990, now abandoned, which in turn is a continuation-in-part of copending patent application serial number 07/286,425, filed December 19, 1988, and assigned to the same assignee as the present application.
BACKGROUND OF THE INVENTION
The present invention relates generally to database systems and more particularly to a method of evaluating a recursive query of a database.
Database systems are being used to store and manage more and more different kinds of data. As the use of database systems has expanded and the quantities of data stored in a database have increased, much effort has been devoted to improving existing database systems and developing new systems with new and better capabilities.
Relational Database Concepts
Data in a relational database are perceived by the user as being arranged in tables. Each table may be thought of as specifying a "relation" among the data in that table; therefore, each table is referred to as a "relation". Each row in a relation may be thought of as one data record. The rows are referred to as "tuples".
To "query" a database means to request information from it. To "evaluate" a query means to obtain the requested information. Sometimes the requested information can be obtained directly by looking it up in one of the relations. If the requested information does not appear in any of the relations, it must be derived, for example by comparing two or more tuples in a relation or by combining tuples from two different relations. The following four examples will help to illustrate these concepts.
EXAMPLE 1
Consider an historical database that contains data about people. Some of the data are arranged in a table or "relation" called PARENT. The data in the PARENT relation are arranged in the form
PARENT (NAME, PNAME)
where PNAME is the name of a parent of a person named NAME. For example, "PARENT (Andrew, William)" represents a tuple which says that William is a parent of Andrew.
Others of the data indicate who is a friend of whom; these data are arranged in a FRIEND relation in the form
FRIEND (NAME, FNAME)
where FNAME is the name of a friend of a person named NAME.
Still others of the data are in a PERSON relation and are arranged in the form
PERSON (NAME, SEX)
where NAME is the name of a person and SEX is the sex of that person. Representative PARENT, FRIEND and PERSON relations are depicted in Tables I through III, respectively.
Figure imgf000004_0001
Figure imgf000005_0001
Figure imgf000005_0002
A query of the form "FIND PARENTS OF [X]" is an example of a request for information that is directly obtainable by looking up data in this particular database. The response to such a query would be the names of the parents of X.
A query of the form "FIND MOTHER OF [X]" is an example of a request for information that cannot be obtained by looking it up. This is because the PARENT relation does not include the sex of the parents and, unlike a human, a computer does not know, for example, that "William" would ordinarily be a father and "Mary" a mother. Therefore, the requested information must be derived from information in the database, for example by (1) finding the parents of X in the PARENT relation, (2) finding the sex of each of those parents in the PERSON relation, and (3) selecting the female parent. The response to such a query would be the name of the mother of X.
EXAMPLE 2
Consider a factory database for keeping track of parts that are used to make engines. The data are arranged in two relations. The first relation, SUPPLIER, contains data arranged in the form
SUPPLIER (PART, SUPPLIER, CITY)
where PART represents the name (or part number or other convenient identifier) of a particular part, SUPPLIER represents a name of a source of that part, and CITY represents the location of that supplier. A few typical entries might be
SUPPLIER (Needle Valve, Bill's Brass Works, Buffalo)
SUPPLIER (Needle Valve, Paul's Plumbing Mfg. Co., Pittsburgh)
SUPPLIER (1/4" Screw, Mac's Machine Shop, Milwaukee)
SUPPLIER (Short Spring, Sam's Spring Specialties, Springfield)
and so on. The second relation, SUBPART, contains data in the form
SUBPART (SPART, SSUBPART, SQTY)
where SPART represents the name of a part that requires a subpart, SSUBPART represents the name of that subpart, and SQTY represents how many of that subpart are used in that part. Some examples are
SUBPART (Carburetor, Valve Assembly, 2)
SUBPART (Carburetor, 1/4" Screw, 16)
SUBPART (Carburetor, Short Spring, 3)
SUBPART (Valve Assembly, Needle Valve, l)
SUBPART (Valve Assembly, 1/4" Screw, 4)
indicating that each carburetor requires two valve assemblies, sixteen 1/4" screws, three short springs, and so on.
A query of the form "FIND SUPPLIERS OF [X]" is an example of a request for information that is directly obtainable by retrieving data from this particular database. The response to such a query would be the names of all suppliers of part X.
A query of the form "HOW MANY SUPPLIERS SUPPLY [X]" is an example of a request for information that is not directly obtainable but that can be derived from information in the database, for example by finding all suppliers of X and then counting how many supplier names are found. The response to such a query would be the number of suppliers that supply part X. EXAMPLE 3
Consider an airline reservation system. Data in this system are arranged in a FLIGHT relation in the form
FLIGHT (FROM, TO, DISTANCE, DTIME, ATIME, COST)
where FROM is the name of a departure city, TO is the name of an arrival city, DISTANCE is the mileage between those two cities, DTIME is the time of departure, ATIME is the time of arrival, and COST is the cost of a ticket on that flight. TABLE IV presents a typical FLIGHT relation.
Figure imgf000007_0001
A query of the form "FIND CHEAPEST FLIGHT BETWEEN [X] and [Y]" is an example of a request for information that cannot be obtained merely by retrieving data from this database but that can be derived from information in the database, for example by finding the fares of all flights between city X and city Y and then comparing the various fares to find which is lowest. The response to such a query would be the flight number of the cheapest flight between X and Y.
EXAMPLE 4
Consider a study of the spread of a sexually transmitted virus. One of the questions under investigation is the spread of the virus by heterosexual transmission. Records have been compiled of all heterosexual encounters in a defined population. These records are arranged in an epidemic-study database in an ENCOUNTER relation in the form
ENCOUNTER (MALE, FEMALE, DATE, CITY, COUNTY).
Additional data respecting individuals in the population under study are arranged in a PERSON relation:
PERSON (PERSON, AGE, VACCINATED)
where VACCINATED is a yes-or-no entry indicating whether the person has been vaccinated against the virus. Representative ENCOUNTER and PERSON relations are depicted in Tables V and VI, respectively.
Figure imgf000008_0001
Figure imgf000008_0002
A query that requests the names of all women who had encounters with a certain man is an example of a request for information that can be retrieved directly. A query that seeks the names of all vaccinated women who had encounters with a certain man is an example of a request for information that can be derived by (1) retrieving the names of all women who had encounters with that man according to the ENCOUNTER relation and (2) checking the names of each of those women according to the PERSON relation to find out which of them have been vaccinated.
Relational Algebra
As the volume of data in a database grows larger and the nature of the relations grows more complex, evaluating complicated queries becomes more difficult and time-consuming. Simplifying the task of evaluating a given query is known as "optimizing" the query. There may be only a few, or many hundreds, of ways to evaluate a complicated query, all of which provide the same answer but some of which are much more efficient than others. "Optimizing" such a query may mean choosing the best of all possible ways of evaluating the query, but usually "optimizing" means finding a reasonable number of ways according to techniques that are known to increase
computational efficiency and choosing the best of those.
A set of relational operators collectively referred to as "relational algebra" has been developed for use in optimizing the evaluation of a complicated query in a large database. A query is translated into a relational algebra expression; the expression is simplified according to certain procedures; query plans for evaluating the simplified expression are generated; and the most efficient of these plans is selected and carried out to provide the desired response. See generally C. J. Date, An Introduction to Database Systems (4th Ed.) Vol. I, Addison-Wesley 1986, chapters 13 and 16, and references cited therein.
The relational algebra includes a set of operators. These operators can be compared with arithmetic operators such as " + " and " ÷ ". Just as an arithmetic operator operates on one or two "input" numbers and provides a new "output" number, so each relational algebra operator takes one or two relations as "inputs" and provides a new relation as "output". A few examples of these operators are the SELECT, PROJECT, UNION, INTERSECTION, and JOIN operators. The SELECT operator obtains specified rows from a relation. For example, a query seeking the names of Andrew's parents could be phrased as "SELECT entries respecting Andrew from the PARENT relation" (see Example 1 above). This SELECT operation would be represented in the relational algebra as
σname = "Andrew"(PARENT).
In response, the SELECT operator would provide the following new relation:
Andrew William
Andrew Mary
from the PARENT relation.
The PROJECT operator obtains specified columns from a relation. For example, a query seeking the names of all persons who have children could be phrased as
"PROJECT PNAME entries from the PARENT relation". This PROJECT operation would be represented in the relational algebra as
πpname(PARENT)
and would provide the following new relation:
William
Mary
John
Anne
Richard
Wilma
from the PARENT relation.
The UNION operator collects all the rows of each of two relations. The UNION operation is represented in relational algebra as
A∪ B
where A and B are relations. For example, PARENT U FRIEND in the historical database would provide a new relation containing parent-names and friend-names of everyone who has either a parent or a friend or both. In providing the new relation, the NAME entries in the PARENT relation would be correlated with the NAME entries in the FRIEND relation; for example, the name Andrew in the PARENT relation would be considered to refer to the same person as the name Andrew in the FRIEND relation. The INTERSECTION operator collects common rows of each of two relations. The INTERSECTION operation is represented in relational algebra as
A∩ B.
For example, PARENT D FRIEND in the historical database would provide a new relation containing parent-names and friend-names of those persons who have both a parent and a friend.
The JOIN operator combines rows from each of two relations according to a specified condition. In relational algebra the JOIN operation is represented as
A⋈condition B
For example, the join operation E⋈female =person P in the epidemic-study database of Example 4 above would provide a new relation by combining the information in each tuple of the ENCOUNTER relation with the information in that tuple of the PERSON relation having an entry under PERSON that matches the entry under FEMALE in the tuple of the ENCOUNTER relation.
Recursive Queries
A kind of database query which has grown more important in recent years is a recursive query. Such a query can be described as a query which queries itself. A recursive query can be evaluated only by deriving information recursively. A general discussion of the mathematical concept of recursion can be found in J. Bradley,
Introduction to Discrete Mathematics, ch. 6, Addison-Wesley 1988; see also E. Roberts, Thinking Recursively, John Wiley 1986.
As a simple example of a recursive query, consider a query of the form "FIND GRANDPARENTS OF [X]" directed to the historical database of example 1 above. This database contains no information about grandparents. However, the requested
information can be recursively derived from the information in the database, for example by a query of the form "FIND PARENTS OF [FIND PARENTS OF [X]]". If the number of iterations required to evaluate a recursive query is known in advance, then the evaluation process is relatively straight-forward. For example, the request to find the grandparents of X requires exactly two iterations╌one to find the parents of X and one to find the parents of the parents of X. However, if the number of iterations is not known, then the evaluation becomes far more difficult; an example of a request in which the number of iterations is not known is a request to find all ancestors of X.
As the volume of data in a database grows larger and the nature of the relations expressed by the data grows more complex, the time required for even a very powerful computer to respond to a complicated recursive query can become unacceptably long, especially when the number of iterations required to derive the response is not known in advance. Accordingly, the efficient evaluation of recursive queries has become a matter of critical importance in the design of modern database systems. A comprehensive survey of this problem is presented by F. Bancilhon and R. Ramakrishnan in "An Amateur's Introduction to Recursive Query Processing Strategies" in the Proceedings of the ACM-SIGMOD Conference, Washington, D.C., May 1986.
The relational algebra does not have recursion operators and hence cannot support recursive queries. Some relatively simple recursive queries can be expressed in transitive closure form, and transitive closure operators have been proposed for use in translating such queries into relational algebra expressions (R. Agrawal, "Alpha: An Extension of Relational Algebra to Express a Class of Recursive Queries", Proceedings of the Third International Conference on Data Engineering, Los Angeles, California, February 3-5, 1987; S. Ceri et al., "Translation and Optimization of Logic Queries: the Algebraic Approach", Proceedings of the Eleventh International Conference on Very Large Data Bases, Kyoto, Japan, August 1986). However, not all recursive queries can be expressed in transitive closure form. From the foregoing, it will be apparent that there is a need for a way to optimize recursive queries, especially those which cannot be expressed in transitive closure form, for efficient evaluation in large and complex database systems.
SUMMARY OF THE INVENTION
The present invention provides a method of evaluating a recursive query in a database system by translating the query into an expression that includes a novel fixpoint operator and using a novel set of transformation procedures to simplify the translated query.
Briefly and in general terms, a method of evaluating a recursive query includes the steps of translating the query into a relational algebra expression that includes a fixpoint operator, optimizing the expression according to a set of transformation procedures, and evaluating the optimized expression by reference to data in the database. The
transformation procedures include commuting a projection operation with a fixpoint operation, commuting a selection operation with a fixpoint operation, distributing a join operation over a fixpoint operation, and regrouping a join operation and a fixpoint operation.
Regrouping means applying the commutation and association rules, typically to an expression having a fixpoint and several join operators. The selection operation may be a selection predicate on a direct mapping column, a global selection predicate, or a selection predicate that includes a join operation.
Other aspects and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention. BRIEF DESCRIPTION OF THE DRAWINGS
FIGURE 1 is a data flow diagram depicting a preferred embodiment of a method of optimizing recursive queries according to the invention;
FIG. 2 is a flow diagram depicting initial and recursive inputs of a fixpoint operator as referenced in the "translate using⊗" process of FIG. 1; and
FIG. 3 is a flow diagram depicting a generalized version of the fixpoint operator shown in FIG. 2.
DESCRIPTION OF THE PREFERRED EMBODIMENT
As shown in the drawings for purposes of illustration, the invention is embodied in a novel method of evaluating recursive queries in a database system. Relational algebra provides a powerful technique for optimizing the evaluation of database queries, but recursive queries have not been amenable to such optimization techniques.
In accordance with the invention, a novel fixpoint operator and new transformation procedures are provided for translating a recursive query into a relational algebra expression and then simplifying that expression. In the form of this simplified
expression, the query can be evaluated much more efficiently than would otherwise be possible.
As shown in state diagram form in FIGURE 1, a method of evaluating a recursive query of a database 11 comprises translating a recursive query into an expression that includes a fixpoint operator, as indicated by a "translate using⊗" process circle 13; optimizing the expression according to a set of transformation procedures as indicated by an "optimize" process circle 15; and evaluating the optimized expression, as indicated by an "evaluate" process circle 17, by reference to data in the database 11.
The recursive query is received from a user as indicated by an input box 19 and an arrow extending from the box 19 to the "translate" circle 13. The "user" may be a person at a computer terminal, but the user could also be, for example, an electronic device, an application program, or the like. Similarly, the result of evaluating the optimized query is provided to the user as indicated by an output box 21 and an arrow extending from the "evaluate" circle 17 to the box 21. The user who receives the output is usually the same user as the one who generated the query, but this need not be the case; the user that generates a query could specify that the result be sent somewhere else.
Several novel transformation procedures are provided according to the invention. These new transformation procedures include commuting a projection operation with a fixpoint operation as indicated by a "commute projection" process circle 25, commuting a selection operation with a fixpoint operation as indicated by a "commute selection" process circle 27, distributing a join operation over a fixpoint operation as indicated by a "distribute join" process circle 29, and regrouping join and fixpoint operations as indicated by a "regroup" process circle 31.
Arrows extend in both directions between the "optimize" process circle 15 and each of the process circles 25, 27, 29 and 31. These arrows indicate that application of one transformation procedure may result in an expression that requires application of another transformation procedure and that the optimization process may require one or more than one application of any given transformation procedure during the course of simplifying a given expression. Thus, depending on the characteristics of the expression being optimized, various ones of the transformation procedures may be used at various times during the optimization. Of course, some of the procedures may not be used at all in a given case.
Commuting a selection operation with a fixpoint operation comprises commuting a selection predicate on a direct mapping column with a fixpoint operation as indicated by a "direct map" process circle 33, commuting a global selection predicate with a fixpoint operation as indicated by a "global" process circle 35, and commuting a selection predicate that includes a join with a fixpoint operation as indicated by a "join" process circle 37. Arrows extend in both directions between the "commute selection" circle 27 and each of the circles 33, 35 and 37 to indicate that one or more than one of the procedures respecting commuting a selection and a fixpoint operator may be used as needed; of course, none of the "commute selection" procedures may be required in some cases.
Regrouping a join operation and a fixpoint operation comprises commutation and association as indicated by a "commutation" process circle 39 and an "association" process circle 41, respectively. Arrows extending in both directions between the
"regroup" process circle 31 and each of the circles 39 and 41 indicate that one or more of the commutation and association procedures may be used. Again, some cases might not require either procedure.
In addition to the novel transformation procedures provided by the invention, a number of transformation procedures are already known in the relational algebra.
Various ones of these may also be used to simplify a recursive query as indicated by a "transform procedures" process circle 43. Arrows extending in both directions between the circle 43 and the circle 15 indicate that these previously-known transformation procedures may be used once, several times or not at all in any given optimization.
The fixpoint operator according to the invention enhances the declarative power of relational algebra by supporting recursive queries. It is expected that the introduction of the fixpoint operator will benefit many database and computer applications such as computer aided design and manufacture (CAD/CAM), software engineering (CASE), and artificial intelligence (AI) applications.
As with other relational operators, the inputs (operands) and output of the fixpoint operator are relations. The fixpoint operator supports least fixed point semantics. The fixpoint operator can compute both linear and mutually recursive relations. In a simple form the fixpoint operator is defined symbolically by an expression of the form
( I1 - IM ) ⊗ CE:OE ( R1 - RN , Rc )
where
Ij represents the j-th initial input,
CE represents a Condition Expression,
OE represents an Output Expression, and
Rj represents they j-th recursive input.
This form of the fixpoint operator is depicted diagrammatically in FIG. 2. Each initial input I is a relation. There may be one or more such inputs; M initial inputs are shown in FIG. 2. Each recursive input R is also a relation. There may be none, one or more of these recursive inputs; N recursive inputs are shown in FIG. 2. A recursive input may, but need not, be the same as an initial input. The output of the fixpoint operator is fed back as a recursive input R c ; this recursive input R c differs from the N other recursive inputs in that the input Rc is derived by the fixpoint operator whereas the other recursive inputs are not.
More particularly, during a first iteration the initial inputs are utilized to provide a first output. This output is fed back as the recursive input R c . The recursive input R c and the N other recursive inputs are utilized to provide a second output, and so on for as many iterations as are required.
If an output is not fed back as one of the recursive inputs, the fixpoint operator simplifies to conventional join and union operations.
In general, equality predicates in the condition expression are treated as join columns during processing of the⊗ operator while any other predicates in the condition expression represent additional conditions which the tuples of the recursively derived relation generated by the ⊗ operator must satisfy. A more general form of the fixpoint operator, corresponding to a plurality of K mutually recursive relations, is defined symbolically by an expression of the form
Figure imgf000018_0001
This form of the fixpoint operator is depicted diagrammatically in FIG. 3. Each initial input I is a relation. There may be one or more such inputs; M initial inputs are shown in FIG. 3. Each recursive input R is also a relation. There may be none, one or more of these recursive inputs; N recursive inputs are shown in FIG. 2. A recursive input may, but need not, be the same as an initial input. The fixpoint operator provides K recursive relations as outputs and these are fed back as recursive inputs RC1 through RCK ; these K recursive inputs Rc differ from the N other recursive inputs in that the inputs Rc are derived by the fixpoint operator whereas the other recursive inputs are not. In general, J may but need not be equal to K.
The sets of initial or recursive inputs for each recursive relation need not be disjoint. For mutually recursive relations the recursive inputs cannot be disjoint because if they were the relations would not be mutually recursive.
The ultimate output of the ⊗ operator is a single relation representing the
Cartesion product of the K mutually recursive output relations. Additional relational operators can be used to extract individual relations from the output.
The invention will now be more formally described in mathematical terms. The following examples will show how the fixpoint operator can be freely intermixed with other relational algebra operators to pose powerful queries.
The historical "people" database as described in Example 1 above includes the base relations PARENT, FRIEND and PERSON as given in Tables I through III. A derived relation ANCESTOR, of the form
ANCESTOR (DNAME, ANAME)
where DNAME is the name of a person and ANAME is the name of an ancestor of that person, is defined as
PARENT⊗name=aname: dname, pname (PARENT, ANCESTOR).
The fixpoint operator may then be used to express various recursive queries of the "people" database. Specifically, a query of the form "Find all ancestors of John" is expressed:
πanamedname= "John " ( PARENT ⊗name=aname: dname, pname
(PARENT,ANCESTOR )))
A query of the form "Find all people who are either friends of ancestors of John or friends of John" is expressed: πanamedname= "John " ( PARENT ⊗name=aname: dname, pname
(PARENT,ANCESTOR ))⋈aname=nameFRIEND )) U πframename= "John " ( FRIEND ))
A query of the form "Find the friends of John's ancestors" is expressed πfhamedname= "John " (( PARENT⊗name=a name: dname, pname
(PARENTANCESTOR ))⋈aname=nameFRIEND ))
The factory database of Example 2 above includes the base relations SUPPLIER and SUBPART. A derived relation COMP, of the form
COMP (CPART, CSUBPART, CQTY),
is defined as
SUBPART ⊗ spart=csubpart: ssubpart, cqty*sqty ( SUBPART,COMP ).
The fixpoint operator may then be used to express various recursive queries such as a query of the form "Find the location and quantities of any parts that go into making a Locomotive" as follows:
πcsubpan,city,sum(cqty) (SUPPLY⋈part=csubpart (GROUP_BY
csubpart:csubpart,sum(cqty)cpart= "Locomotive "(SU BPA RT⊗spart=csubpart: cpart, ssubpart, cqty*sqty (SUBPART.COMP ))))) wherein the subscripts in the GROUP_BY operator indicate the group_by column and output columns of the GROUP_BY operator.
Similarly, in the airline reservation database of Example 3, a derived relation SHORT_CONNECTTON of the form
SHORT_CONNECTION(START,DESTINAΉON,CDTIME,CATIME,CCOST),
is defined as
FLIGHT⊗ destination=from, catime < dtime, distance < 1000: start, to, cdtime, atime, ccost, +cost( FLIGHT, SHORT_CONNECTION )
The fixpoint operator is used, for example, to express a query of the form "Find the minimum cost of all flights between London and San Francisco under the condition that the distance between each pair of connecting points is less than 1,000 miles" as follows:
Mincoststart= "SF", destination= "London"distance < 1000 ( FLIGHT) ⊗destination=fr om, catime < dtime, distance < 1000: start, to, cdtime, atime, ccost + cost( FLIGHT,SHORT_CONNECTION )
A main task to be performed by a query optimizer is to rearrange the sequence of operations in an expression of a query for more efficient evaluation. Starting with an initial form generated by a parser, the query expression usually undergoes a sequence of transformations based upon certain heuristic rules or execution cost comparisons. The transformations usually include:
Performing selections and projections as early as possible,
Combining sequences of the same operation (e.g. selection or projection) into one operation,
Commuting selection or projection operations with join or Cartesian product operations,
Commuting join (or Cartesian product) operations, and
Re-associating join (or Cartesian product) operations. (See, generally, Ullman, J., Principles of Database Systems (2nd ed.), Computer Science Press, Maryland, 1982 for more discussion of transformations).
These transformations are made possible by properties (such as commutativity and associativity) which are inherent in these operators. In order to avoid the development of a new paradigm for dealing with recursive queries, the present invention stays within the realm of relational algebra so that existing efficient implementation algorithms and other useful techniques (e.g. optimal plan search mechanisms) can be employed. However, not all of the common algebraic properties are held between fixpoint operators and other relational algebra operators. To integrate the fixpoint operator into an existing query optimization process, the valid transformation rules in the extended relational algebra have to be identified.
Example 4 above (the study of the sexually transmitted virus) will be used to illustrate these rules.
Assume that one of the questions under investigation is the spread of the virus by heterosexual transmission. To this end, a record has been kept of all heterosexual encounters in a population under study. The ENCNTR relation is used to store these records. The relation contains additional information about each person in the population under study. Now consider the two derived relations EXPSDFML (MALE, FEMALE, DATE, CITY, COUNTY) and EXPSDML (FEMALE, MALE, DATE, CΓΓY, COUNTY). The relation EXPSDFML is defined as:
πE (([ πe1, e2, e3, e4, e5 E ], [ πe2, e3, e4, e5 E])
[f2 =m1,β<m3, ml~-el, m3<e3 :fl, e2, e3, e4, e5], [m2=f1, m3<f3,f2=e2, f3<e3 m1, e1, e3, e4, e5] ( ( E,F,M ), ( E,F,M )))
The projection is needed since the output of the fixpoint operator is, by definition, the cross product of relations F and M. There is no need to compute the cross product of F and M in such a case, and the query processor will detect that. Similarly, the relation
EXPSDML is defined as:
πE (([ πe1, e2, e3, e4, e5 E ], [ πe2, e1, e3, e4, e5 E ]) [f2=m1,f3<m3, m2=e1, m3<e3:f1,e2,e3,e4,e5], [m2=f1, m3<f3,f2 =e2,f3<e3 m1, e1, e3, e4, e5](( E,F,M ), ( E,F,M )))
Tables V and VI above illustrate the ENCOUNTER(E) and PERSON(P) relations. The E
Figure imgf000022_0003
XPSDFML(F) AND EXPSDML(M) relations are illustrated in the following tables VII and VIII.
Figure imgf000022_0001
Figure imgf000022_0002
Now, consider query Ql which finds all vaccinated females who might have been exposed to the virus either directly or indirectly through make carrier X and such that all encounters leading to each female have taken place in New York. The query is expresses in the extended relational algebra as follows:
πf2 ( σf1="X", G:f4= "NY", G: m4= "NY", p3=true( P⋈p1=f2(([πe1, e2, e3, e4, e5 E], [πe2, el, e3, e4, e5E ])⊗[f2=m1,f3<m3, m2=e1, m3<e3:f1, e2, e3, e4, e5], [m2=f1, m3<f3,f2=e2,f3<e3 : m1, e1, e3, e4, e5] ((E,F,M),(E,F,M)))))
Note that a "G" tag attached to a selection condition indicates that it is a global one. A selection predicate is said to be global if it is applied at each iteration during the generation of the recursive relation defined by the fixpoint operation. In other words, once a tuple of an input relation fails to satisfy a global selection predicate, it will be excluded from consideration in any of the subsequent recursive computations. (The detection of global predicates will be discussed later.) In general, for queries on recursive relations, a detailed semantic analysis is required to determine the legal transformations that can be applied to an algebraic expression. The present discussion is limited to transformation rules which do not require any semantic query analysis. Formal proofs of the rules are omitted because of their length and instead each rule is motivated either with a sketch of a proof or with detailed examples.
1. Commuting projection operation with fixpoint operation
Rule 1 : Let IINPi and RINPi be the initial and recursive input expressions for mutually recursive relation Ri of the fixpoint operation. PEi denotes the set of attributes of Ri in the projection operation. SE- denotes the set of attributes of Ri in the selection operation. CEi and OEi denote the sets of attributes in the condition expression and output expression of Ri in the fixpoint operation. A(IINPi) denotes the set of attributes in the initial input expression and A(RINPi) the set of attributes in the recursive input expression for relation Ri. Then
πPE1 , ... , PEn (σSE1 , ... , SEn ((IINP1 , ... , IINPn)⊗ CE1 : OE1 , ..., CEn : OEn (RINP1 , ... , RiNPn)))≡ πPE1 , ... , PEn (σSE1 , ..., SEn ((πPE'1 IINP1 , ..., πPE'n
IINPn ) ⊗ CE1 : OE'1 , ..., CEn : OE'n (πPE"1 RINP1, ..., πPE"n RINPn ))) where
PEi' = (PEi∪ SEi∪ (∪n j=1CEj))∩ A ( IINPi), PEi" = (PEi∪ SEi∪ (∪n j=1CEj )) n A ( RINPi ), and OEi' = (PEi∪ SEi∪ (∪^jCEj )) n OE:
Sketch of proof: In principle, values for columns not required in any subsequent operations can be discarded. Therefore, during the process of inputs for the fixpoint operation, only the values for columns being referenced in the condition or output expressions of the fixpoint operator or in other subsequent operations need to be retrieved. For example, using query Q1, this rule can be applied to transform it as follows: πf2f1="X", G:f4="NY", G: m4="NY", p3=true (P⋈p1=f2 (([πe1, el, e3, e4, e5E], [πe2, e1, e3,e4, e5E])⊗[f2=m1,f3<m3, m1=e1, m3<e3:f1, e2, e3, e4, e5], [m2=f1, m3<f3,f2=e3,f3<e3:m1, e1, e3, e4, e5]((E,FM,),(E,F,M)))))≡ πf2f1="X", G:f4="NY", G:m4="NY",p3=true(P⋈p1=f2(([πe1,e2, e3, e4E], [πe2, e1, e3, e4E])⊗[f2=m1,f3<m3,m2=e1, m3<e3:f1, e2,e3, e4], [m2=f1, m3<f3,f2=e2,f3<e3:m1, e1, e3, e4]((πe1, e2, e3, e4E,πf1,f2,f3,f4F,πm1, m2, m3, m4M), (πe1, e2,e3, e4E,πf1,f2,f3,f4F,πm1, m2, m3, m4M)))))
2. Commuting selection operations with fixpoint operations.
This discussion will explore the heuristic of "performing selection as early as possible". In general, it means to move the selections inside other operators as far as possible. For query Q1, all of its selections can be applied to base relations E and P directly rather than the final result composed by the join operation and fixpoint operation. The original expression is then translated into:
πf2((σp3=trueP)⋈p1=f2 (([σe1="X", e4="NY"e1, e2, e3, e4, e5E)]
e4="NY"e2, e1,e3,e4, e5E)])⊗[f2=m1,f3<m3, m2=e1, m3<e3:f 1 e2, e3, e4, e5],[m2=f1,m3<f3,f2=e2,f3<e3:m1, e1,e3, e4, e5]
(((σe4="NY"E),F,M),((σe4="NY"7E),F,M))))
In the above expression, the global selections (G:f4="NY",G:m4="NY") are converted to regular selections and applied to both initial and recursive inputs of the fixpoint operation. The purpose of this translation is to reduce the sizes of the operands for each operation. If the fixpoint operator ⊗ is viewed as a generator of a directed graph consisting of all paths leading to all possible answers, an early selection on the initial input and recursive input relations has the effect of eliminating the unqualified paths in the graph before they are generated. See, generally, Ioannidis, Y., and Wong, W., "On the Computation of the Transitive Closure of Relational Operators", Proc. of 12th Int. Conf. on VLDB, Tokyo, Japan, August, 1986; Jagadish, H., Agrawal, R., and Ness, L., "A Study of Transitive Closure as a Recursive Mechanism", Proc. of ACM-SIGMOD 1987 Int. Conf. on Management of Data, San Francisco, California, May 1987; and Lu, H. , "New Strategies for Computing the Transitive Closure of a Database
Relation", Proc. of 13th. Int. Conf. on VLDB, Brighton, England, September, 1987.
While the legality of moving global selections into the fixpoint operator will be apparent, the movement of the non-global selection (σf1 = "χ") into the initial input needs some explanation. After a careful examination, it can be found that the two forms of the query are semantically equivalent. The is because the selection (σf1 = "χ" ) is applied to a direct mapping column of the recursive relation EXPSDFML. A column of a recursive relation is considered a direct mapping column if the output expression for that column consists of that column itself only. The crucial characteristic of a direct mapping column is that is acquires its entire set of values from the initial input relation(s). The values it assumes during each subsequent recursive iteration are always taken from the value set of its own initial input and are not computed from values of other resources. Thus, once the set of values of the initial input is determined, no new values are added to the column. A careful look at the definition of EXPSDFML will show that new values are added to columnsf2, ... , f5 during each recursive iteration but not to column f1.
The concept of direct mapping column is very similar to that of invariant column introduced in Devanbu, P. and Agrawal, R., "Moving Selections Into Fixpoint Queries", Proc. of 4th Int. Conf on Data Engineering, Los Angeles, February, 1988. Direct mapping columns are actually a subset of invariant columns. However, the detection of the more general invariant columns requires a detailed analysis of the selection predicates. The advantage of concentrating on direct mapping columns is that their detection is trivial and they cover the majority of the cases for which the payoff for performing early selections is substantial.
Selections which are neither global nor applied to direct mapping columns will now be considered. Let Q2 be the query to find all females who might have been exposed to the virus either directly or indirectly through male carrier X and such that the last encounter took place in New York. The query is expressed in the extended relational algebra as follows:
πf2f1="X", f4="NY"((πe1, e2, e3, e4, e5E], [πe2, e1, e3, e4, e5E])
[f2=m1,f3<m3, m2=e1, m3<e3 :f1, e2, e3, e4, e5], [m2=f1, m3<f3,f2=e2, f3<e3:m1, e1, e3, e4, e5]((E,F,M),(E,F,M))))
Distributing the selection (σf4= "Nγ"5) over the fixpoint operator would result in:
πf2f1="X"(([(σe4="NY"e1, e2, e3, e4, e5E)],[σe4="NY"
e2, e1, e3, e4, e5E)])⊗[f2=m1,f3<m3, m2=e1, m3<e3 :f1, e2, e3, e4, e5],
[m2=f1,m3<f3,f2=e2,f3<e3:m1, e1,e3, e4, e5](((σe4="NY"E),F,M),
((σe4="NY"E),F,M))))
However, the above query expression will not generate the complete set of answers. This is due to the exclusion of qualified "bridge tuples" from the intermediate results used for computing the EXPSDFML and EXPSDML relations. According to the above expressions, only those tuples satisfying (σe4= "NY"E) participate in the initial and recursive inputs to the relation EXPSDFML. In order to produce the complete set of answers, all qualified "bridge tuples" need to be included. It means that all tuples which do not satisfy (σg4="NY" E ) must still be saved for subsequent computation to avoid the loss of certain answers.
The rules governing the movement of selections into fixpoint operators thus depend on whether the selections are global, and if not global whether they apply to direct mapping columns. They are formulated as follows:
A. Commuting selection predicates on direct mapping columns with fixpoint operations. Rule 2: Let SEDi be the set of selection predicates on direct mapping columns for mutually recursive relation Ri, where SEDi contains no reference to any column of mutually recursive relation R. where i≠ j. Then
σSED1 , ... , SEDn ((IINP1 , ... , IINPn ) CE1 : OE1 , ..., CEn : OEn (RINP1 ,
..., RINPn ))≡ (σSED1 IINP1 , ..., σSEDnlINPn ) CE1 : OE1 , ..., CEn : OEn
(RINP1 , ..., RINPn ))
For example, as seen earlier, this rule was used to move the selection predicate
f1= "X" ) into input relation E for query Q1.
B. Commuting global selection predicates with fixpoint operations
Rule 3: Let SEGi be the set of all global predicates in the selection expression referencing attributes of mutually recursive relation R: where SEGi contains no reference to any column of mutually recursive relation Rj where i≠ j; and let SEGi be the same set of predicates after the "G" tag has been removed. Then
σSEG1 , ..., SEGn ((IINP1 , ... , IINPn )⊗ CE1 : OE1 , ..., CEn : OEn (RINP1 , ..., RINPn))≡ (σSEG1 , IINP1 , ..., σSEGn,IINPn ) CE1 : OE1 , ..., CEn : OEn (σSEG1, RINP1 , ..., σSEGn ,RlNPn ))
For example, as seen earlier, this rule was used to move the global selection predicates
(f4 = "NY") and (m4 = "NY") in query Q1 into the initial and recursive inputs of the fixpoint operator.
The designation of global predicates is left to the user (see, generally, Rosenthal, A., Heiler, S., Dayal, U., and Manola, F., "Traversal Recursion: A Practical Approach to Support Recursive Applications", Proc. of ACM-SlGMOD 1986 Int. Conf. on
Management of Data, Washington, D.C., May 1986). As was exposed earlier, the detection of direct mapping columns is straightforward. The detection of such columns may be performed when a recursive relation is defined and that information permanently stored in the system catalogs. Thus far, selections wherein all the columns involved belong to a single mutually recursive relation have been considered. Selections of the form fi θfj where fi is a column of mutually recursive relation Ri, fj is a column of mutually recursive relation Rj where i≠ j, and θ is a comparison operator, will now be discussed.
Consider query Q3 which finds all males who might have been indirectly exposed to the virus through make carrier X. The query is expressed as follows:
πm2f1= "X",f2=m1 (([πe1, e2, e3, e4, e5E], [πe2, e1, e3, e4, e5 E]
[f2=m1,f3<m3, m2=e1, m3<e3 :f1 e2, e3, e4, e5], [m2=f1, m3<f3,f2=e2, f3<e3: m1, e1, e3, e4, e5] ((E,F,M), (E,F,M))))
The selection f2 = m1 is nothing but a join of the two recursive relations
EXPSDFML and EXPSDML. This will be detected by the query optimizer and query Q3 will be translated to:
πm2f1= "X"F((tπe1, e2, e3, e4, e5E],e2, e1, e3, e4, e5E])
[f2 =m1,f3<m3, m2=e1, m3<e3:f1, e2, e3, e4, e5], [m2=f1, m3<f3,f2=e2, f3<e3:m1, e1, e3, e4, e5]((E,F,M),(E,F,M)))⋈f2=m1πM(([πe1,e2, e3, e4, e5E], [πe2,e1, e3, e4, e5E])⊗[f3=m1,f3<m3,m2=e1, m3<e3:f1, e2, e3, e4, e5],
[m2=f1,m3<f3,f2=e2,f3<e3:m1, e1, e3, e4, e5]((E,F,M),(E,F,M)))))
At execution time, the fixpoint operation will be performed only once, generating both recursive relations EXPSDFML and EXPSDML simultaneously. A join will then be performed on both relations. That is, no cross products or projections of relations EXPSDFML and EXPSDML will actually take place to evaluate query Q3. This rule is formulated as:
C. Detection of join operations
Rule 4: Let fi θ fj be a selection where f: is a column of mutually recursive relation Ri , fj is a column of mutually recursive relation Rj where i≠ j, and θ is a comparison operator. Then σfi θfj(([lnp1], ..., [Inpn]) ⊗ [Cond1 : Out]], .., [Condn : Outn] (Recinp1,...,
Recinpn))≡ πRi(([Inp1],..., [Inpn]) ⊗[Cond] : Out1] [Condn : Outn]
(Recinp1 , .... Recinpn ))⋈fi θfj πRj(([Inp1],...,[lnpn]) ⊗ [Cond] : Out1 ], ...,
[Condn : Outn ] (Recinp1,..., Recinpn))
3. Distributing join operations over fixpoint operations
In traditional query optimization, selections are moved inside joins to reduce the size of the operands of the join operations. This is usually a good strategy because selections always reduce the sizes of their operands. On the other hand, the result of a join operation may be smaller or larger than the sizes of its operands. But the result of fixpoint operation is always larger than the size of its initial inputs. Therefore, a query optimizer should never consider moving a fixpoint operator inside a join operator, but should assess the value of moving a join operator inside a fixpoint operator.
Consider again query Ql. One can safely move the join operator (without the selection) inside the fixpoint operator. The query translates to:
πf2f1="X", G:f4= "NY", G: m4= "NY", p3=true ((P⋈p1=e2 E],[E])
[f2=m1,f3<m3, m2=e1, m3<e3 :f1, e2, e3, e4, e5, p2, p3], [m2=f1, m3<f3, f2=e2,f3<e3:m1, e1, e3, e4, e2]((P⋈p1=e2E'F,M),(E,F,M))))
However, the distribution of join operators over fixpoint operators is not always as trivial. As an example, consider a slightly modified schema where the relation PERSON is replaced by the relation VACCINATED-PERSONφerson, age) which contains the age of each vaccinated person in the population under study. The relation is depicted in the Table IX as:
Figure imgf000029_0001
Now, let Q4 be the query to find all vaccinated females and their age if they have been exposed to the virus either directly or indirectly through male carrier X. The query is expressed as follows:
π f2,v2f1="X" (V⋈v1 =f2(([E], [E])⊗[f2=m1,f3<m3, m1=e1,m3<e3:f1, e2, e3, e4, e5], [m2=f1, m3<f3,f2=e2,f3<e3 : m1, e1, e3, e4, e5]
((E,F,M),(E,F,M)))))
Attempting to distribute the join operator results in:
πf2,v2f1="X" (([V v1=e2E], [E])⊗[f2=m1,f3<m3, m2=e1, m3<e3:f1, e2, e3, e4, e5,v2], [m2=f1, m3<f3,f3=e2,f3<e3:m1, e1, e3, e4,e5]
((V⋈v1=e2E,F,M),(E,F,M))))
However, this transformation will not produce the correct result. This is again due to the loss of qualified "bridge tuples". Although non-vaccinated females are not requested in query Q4, their presence during the iterative process is essential to finding all vaccinated females who have been exposed indirectly to the virus carried by "X". In order to preserve the correct answer, the join operator has to transform to a right outer join operator (see, generally, Date, C., Relational Databases: Selected Writings,
Addison-Wesley Publishing Company, 1986) as it moves inside the fixpoint operator.
During query execution, all tuples which are strictly the result of the outer join may be marked so they are eliminated from the final result.
If the join is over a direct mapping column, the join operator need not be converted to a right outer join operator. Consider the modified schema again and let Q5 be the query to find all vaccinated males, their ages, and all females who have been exposed to them directly or indirectly. The query is expressed as follows:
πf1,v2,f3 (V⋈v1 =f1(([E],[E])⊗[f2=m1,f3<m3, m2=e1, m3<e3:f1, e2, e3, e4, e5], [m2=f1, m3<f3,f2=e2,f3<e3:m1, e1, e3, e4, e5]((E,F,M),(E,F,M)))) The distribution of the join operator results in: πf1,v2,f2 (([V⋈v1=e1E], [E])⊗[f2=m1,f3<m3, m2=e1, m3<e3:f1, e2, e3, e4, e5, v2], [m2=f1, m3<f3,f3=e2,f3<e3:m1, e1, e3, e4, e5]
((V⋈v1=e1E,F,M),(E,F,M)))
which will produce the correct answers because the join was over a direct mapping column.
For simplicity, the distribution of joins is performed over the initial inputs and non-recursive relations in the recursive inputs and only when the join columns are restricted to a single non-recursive relation in the inputs.
Rule 5: Let IINPi and RINPi be the initial and recursive input expressions for mutually recursive relation R: of a fixpoint operation. Let REGEXP be a regular relational expression whose output is, as usual, a single relation. Consider the
expression:
REGEXP⋈x θfj,((IINP1,....IINPN) CE1:OE1 ,...,CEn:En(RINP1,...,
RINPn))
where x is one of the columns of REGEXP, fj is one of the columns of mutually recursive relation Rj, and θ is a comparison operator. Consider the recursive input expression RINPj for relation Rj. It consists of the recursive relations R1,...,Rn, and some non-recursive relations NRj , ..., NRm . Similarly, the initial input expression IINPj for relation Rj consists of relations l1,..., ll. If column fj takes all its initial inputs from a single column of a single input relation Ik and all its recursive inputs from a single column of a non-recursive relation NRV , then the above expression is equivalent to:
(IINP1,...,IINPj-1, (l1,...,REGEXP∞xθfj'Ik,...,Il), IINPj+1,...,IINPn)
⊗CE1: OE1,....CEn:OEn(RINP1,...,RINpj-1.(NR1,...,RFGEXP∞xθfj" NRb, ...,NRm, R1, ...,Rn),RINPj+1,...,RINPn)
where∞ indicates a right outer join or regular join operator, depending upon whether the join column/ is a direct mapping column or not, and where fj' is the appropriate column in relation lk and fj" is the appropriate column in relation NRv . 4. Commuting and associating join operations
Rule 6: The commutative and associative algebraic laws for join operations still hold if each fixpoint operation is treated as a whole. That is,
A⋈ (B ⊗ (C,D ))⋈ E = A⋈E⋈(B ® (C,D )), and
(A⋈ B )⋈ (C ⊗ (D,E ))≡A⋈(B⋈(C ® (D,E )))
These commutative and associative properties allow the query optimizer to choose the best execution operations.
The algebraic properties of fixpoint operators developed above give the query optimizer more leeway in choosing efficient execution strategies. The transformation rules representing these properties can be added in a fairly straightforward manner to most existing query optimizers.
Most implementations of logic databases (see, for example, Morris, K., Ulman, J., and Gelder, A., "Design Overview of the NAIL System", Proc. 3rd Int. Conf on Logic Programming, 1986, and Zaniolo, C., and Sacca, D., "Rule Rewriting Methods for Efficient Implementation of Horn Logic", MCC Technical Report DB-084-87, March 1987) do not rely on any statistical information to determine their execution strategy. They commonly use simple heuristics which choose to extend the predicate with the largest number of bound arguments. One of the most promising features of the present approach is that it is targeted at existing relational query optimizers. Thus, the transformation rules presented herein become a tool for the query optimizer to choose among a menu of execution strategies based upon the estimated execution costs associated with each form a query can take.
From the foregoing it will be appreciated that the invention provides an effective and efficient method of evaluating linear and recursive queries in large databases.
Existing relational techniques are integrated with the novel fixpoint operator and transformation procedures provided by the invention to optimize even very complex recursive queries. Although a specific embodiment of the invention has been described and illustrated, the invention is not to be limited to the specific forms or arrangements of parts so described and illustrated, and various modifications and changes can be made without departing from the scope and spirit of the invention. Within the scope of the appended claims, therefore, the invention may be practiced otherwise than as specifically described and illustrated.

Claims

1. A method of evaluating a recursive query (19) of a computerized database (11), the method comprising: translating a recursive query into an expression that includes a fixpoint operator (13); optimizing the expression according to a set of transformation procedures (15); and evaluating the optimized expression (17) by reference to data in the database (11).
2. A method according to claim 1 wherein one of the procedures comprises commuting (25) a projection operation with a fixpoint operation.
3. A method according to claim 1 wherein one of the procedures comprises commuting (27) a selection operation with a fixpoint operation.
4. A method according to claim 3 wherein commuting a selection operation with a fixpoint operation comprises commuting a selection predicate on a direct mapping column (33) with a fixpoint operation.
5. A method according to claim 3 wherein commuting a selection operation with a fixpoint operation comprises commuting a global selection predicate (35) with a fixpoint operation.
6. A method according to claim 3 wherein commuting a selection operation with a fixpoint operation comprises commuting a selection predicate including a join (37) with a fixpoint operation.
7. A method according to claim 1 wherein one of the procedures comprises distributing a join operation (29) over a fixpoint operation.
8. A method according to claim 1 wherein one of the procedures comprises regrouping a join operation (31) and a fixpoint operation.
9. A method according to claim 8 wherein regrouping comprises commutation (39).
10. A method according to claim 8 wherein regrouping comprises
association (41).
11. A method of evaluating a recursive query of a computerized database (11), the method comprising: in a computer, translating (13) a recursive query provided by a user into an expression that includes a fixpoint operator; automatically optimizing (15) the expression in the computer according to a set of transformation procedures (43) stored in the computer; and automatically evaluating (17) the optimized expression by reference to data in the computer database.
PCT/US1992/001458 1991-02-26 1992-02-25 Method of evaluating a recursive query of a database WO1992015066A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US66025691A 1991-02-26 1991-02-26
US660,256 1991-02-26

Publications (1)

Publication Number Publication Date
WO1992015066A1 true WO1992015066A1 (en) 1992-09-03

Family

ID=24648750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1992/001458 WO1992015066A1 (en) 1991-02-26 1992-02-25 Method of evaluating a recursive query of a database

Country Status (2)

Country Link
EP (1) EP0532727A1 (en)
WO (1) WO1992015066A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5548755A (en) * 1995-02-17 1996-08-20 International Business Machines Corporation System for optimizing correlated SQL queries in a relational database using magic decorrelation
US7249120B2 (en) * 2003-06-27 2007-07-24 Microsoft Corporation Method and apparatus for selecting candidate statistics to estimate the selectivity value of the conditional selectivity expression in optimize queries based on a set of predicates that each reference a set of relational database tables
US7526468B2 (en) * 1999-01-08 2009-04-28 Computer Associates Think, Inc. System and method for recursive path analysis of DBMS procedures
US7548935B2 (en) * 2002-05-09 2009-06-16 Robert Pecherer Method of recursive objects for representing hierarchies in relational database systems
US20120084322A1 (en) * 2008-10-07 2012-04-05 Microsoft Corporation Recursive processing in streaming queries
US9158816B2 (en) 2009-10-21 2015-10-13 Microsoft Technology Licensing, Llc Event processing with XML query based on reusable XML query template
US10262269B2 (en) 2015-08-31 2019-04-16 Semmle Limited Evaluating recursive and exiversal predicates

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ACM SIGMOD RECORD vol. 15, no. 2, June 1986, LE CHESNAY-CéDEX, FRANCE pages 177 - 186; G. GARDARIN ET AL.: 'Evaluation of database recursive logic programs as recurrent function series' *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5548755A (en) * 1995-02-17 1996-08-20 International Business Machines Corporation System for optimizing correlated SQL queries in a relational database using magic decorrelation
US7526468B2 (en) * 1999-01-08 2009-04-28 Computer Associates Think, Inc. System and method for recursive path analysis of DBMS procedures
US7548935B2 (en) * 2002-05-09 2009-06-16 Robert Pecherer Method of recursive objects for representing hierarchies in relational database systems
US7249120B2 (en) * 2003-06-27 2007-07-24 Microsoft Corporation Method and apparatus for selecting candidate statistics to estimate the selectivity value of the conditional selectivity expression in optimize queries based on a set of predicates that each reference a set of relational database tables
US20120084322A1 (en) * 2008-10-07 2012-04-05 Microsoft Corporation Recursive processing in streaming queries
US9229986B2 (en) * 2008-10-07 2016-01-05 Microsoft Technology Licensing, Llc Recursive processing in streaming queries
US9158816B2 (en) 2009-10-21 2015-10-13 Microsoft Technology Licensing, Llc Event processing with XML query based on reusable XML query template
US9348868B2 (en) 2009-10-21 2016-05-24 Microsoft Technology Licensing, Llc Event processing with XML query based on reusable XML query template
US10262269B2 (en) 2015-08-31 2019-04-16 Semmle Limited Evaluating recursive and exiversal predicates

Also Published As

Publication number Publication date
EP0532727A1 (en) 1993-03-24

Similar Documents

Publication Publication Date Title
Ooi Efficient query processing in geographic information systems
Astrahan et al. Implementation of a structured English query language
Sellis Multiple-query optimization
Abiteboul et al. On the representation and querying of sets of possible worlds
US6088524A (en) Method and apparatus for optimizing database queries involving aggregation predicates
US7184998B2 (en) System and methodology for generating bushy trees using a left-deep tree join enumeration algorithm
Silva et al. The similarity join database operator
Gudes et al. Discovering frequent graph patterns using disjoint paths
CN107169033A (en) Relation data enquiring and optimizing method with parallel framework is changed based on data pattern
Halpin A fact-oriented approach to schema transformation
van Emde Boas et al. Storing and evaluating Horn-clause rules in a relational database
WO1992015066A1 (en) Method of evaluating a recursive query of a database
Mhedhbi et al. Modern techniques for querying graph-structured relations: foundations, system implementations, and open challenges
Dadashzadeh An improved division operator for relational algebra
Brodsky et al. On approximation-based query evaluation, expensive predicates and constraint objects
Gardarin et al. OFL: A functional execution model for object query languages
Sadreddini et al. Framework for query optimization in distributed statistical databases
Giannotti et al. Nondeterministic, nonmonotonic logic databases
Ooi et al. Query optimization in an extended DBMS
Malkemus et al. Predicate derivation and monotonicity detection in DB2 UDB
Shan et al. Optimization of relational algebra expressions containing recursion operators
Ngu et al. Heterogeneous Query Optimization Using Maximal Sub-Queries
Wilson A requirements and design aid for relational data bases
Qadah et al. The processing of a class of transitive-closure queries on uniprocessor and shared-nothing multiprocessor systems
Zhang Interactive query formulation techniques for databases

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU MC NL SE

WWE Wipo information: entry into national phase

Ref document number: 1992907452

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1992907452

Country of ref document: EP

WWR Wipo information: refused in national office

Ref document number: 1992907452

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1992907452

Country of ref document: EP