US20040059743A1 - Sampling statistics in a database system - Google Patents

Sampling statistics in a database system Download PDF

Info

Publication number
US20040059743A1
US20040059743A1 US10/255,734 US25573402A US2004059743A1 US 20040059743 A1 US20040059743 A1 US 20040059743A1 US 25573402 A US25573402 A US 25573402A US 2004059743 A1 US2004059743 A1 US 2004059743A1
Authority
US
United States
Prior art keywords
sample
data
size
database system
rows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US10/255,734
Other versions
US7778996B2 (en
Inventor
Louis Burger
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Teradata US Inc
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to US10/255,734 priority Critical patent/US7778996B2/en
Assigned to NCR CORPORATION reassignment NCR CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BURGER, LOUIS M.
Priority to EP03255225A priority patent/EP1403787A3/en
Publication of US20040059743A1 publication Critical patent/US20040059743A1/en
Assigned to TERADATA US, INC. reassignment TERADATA US, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NCR CORPORATION
Application granted granted Critical
Publication of US7778996B2 publication Critical patent/US7778996B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries

Definitions

  • a database is a collection of stored data that is logically related and that is accessible by one or more users or applications.
  • a popular type of database is the relational database management system (RDBMS), which includes relational tables made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
  • RDBMS relational database management system
  • SQL query language e.g., Structured Query Language or SQL
  • SQL query statements include INSERT, SELECT, UPDATE, and DELETE.
  • TERADATA® database mangement system from NCR Corporation.
  • the TERADATA® database systems are parallel processing systems capable of handling relatively large amounts of data.
  • a database system includes multiple nodes that manage access to multiple portions of data to enhance concurrent processing of data access and updates.
  • concurrent data processing is enhanced by the use of virtual processors, referred to as access module processors (AMPs), to further divide database tasks.
  • AMPs access module processors
  • Each AMP is responsible for a logical disk space.
  • one or more of the AMPs are invoked to perform database accesses, updates, and other manipulations.
  • One of the goals of a database management system is to optimize the performance of queries for access and manipulation of data stored in the database.
  • an optimal query plan is selected, with the optimal query plan being the one with the lowest cost (e.g., response time) as determined by an optimizer.
  • the response time is the amount of time it takes to complete the execution of a query on a given system.
  • an optimizer calculates cost and/or other useful metrics based on statistics of one or more columns (or attributes) of each table.
  • statistics are stored in the form of a histogram.
  • the cost of collecting statistics for such large tables can be quite high, especially if all rows of a table need to be scanned to collect the statistics.
  • some database users may choose not to collect statistics for columns of tables over a certain size.
  • the lack of statistics for some tables may adversely affect operation of certain components in the database system, such as the optimizer and other tools.
  • a mechanism for more effective collection of statistics in a database system.
  • a method for use in a database system includes scanning a sample containing less than all rows of the table. A characteristic of data (e.g., data skew) being read is determined, and a size of the sample is adjusted as the sample is scanned based on the determined characteristic.
  • a characteristic of data e.g., data skew
  • FIG. 1 is a block diagram of an example arrangement of a database system and client terminals coupled to the database system.
  • FIG. 2 is a flow diagram of a process of collecting a sample of statistics, according to one embodiment.
  • FIG. 3 is a block diagram of a storage containing various data structures.
  • FIG. 4 illustrates a portion of data stored as blocks.
  • a database system 14 is coupled over a data network 12 to one or more client terminals 10 .
  • the data network 12 include a local area network (LAN), a wide area network (WAN), the Internet, and so forth.
  • the client terminal 10 is capable of issuing queries according to a standard database query language to the database system 14 to access or update data or to create or alter data structures (e.g., tables, rows, and so forth) in the database system 14 .
  • Each client terminal 10 includes application software 11 through which a user can issue various queries to the database system 14 .
  • SQL Structured Query Language
  • ANSI American National Standards Institute
  • One version of SQL is the SQL-92 Standard, while another version is the SQL-99 Standard. In other embodiments, other versions of SQL or other standard database query languages are used.
  • a client terminal is able to issue a COLLECT STATISTICS query to the database system 14 to collect statistics about one or more tables stored in the database system 14 .
  • a sample (less than all) of the rows of each table is scanned so that the statistics are based on a percentage (less than 100%) of each table.
  • the sampled statistics collecting technique requires less processing time and input/output (I/O) accesses so that the user does not need to wait a long period of time while the database system 14 is collecting statistics.
  • the database system 14 is a parallel database system with multiple nodes 26 .
  • single-processor database systems are employed.
  • Certain of the nodes 26 of the database system 14 include access module processors (AMPs) 16 (more generally referred to as “access modules”) that are capable of concurrently managing access of data stored in respective storage modules 18 .
  • the storage modules 18 collectively make up a storage 20 in which various tables and other data structures are stored.
  • the AMPs 16 are software modules that are executable in respective nodes 26 . More than one AMP 16 is executable in each node 26 .
  • the AMPs 16 are based on AMPs used in TERADATA® systems from NCR Corporation.
  • One or more of the nodes 26 include a parsing engine 24 , which is coupled to the AMPs 16 over an interconnect layer 22 .
  • the AMPs respond to steps (in the form of instructions or commands) received from the parsing engine 24 . Responsibilities of the AMPs include locking databases, tables, and portions of tables; creating, modifying, or deleting definitions of tables; inserting, deleting, or modifying rows within tables; and retrieving information from definitions and tables.
  • the AMPs 16 also return responses (e.g., data accessed from a table stored in storage modules 18 ) to the parsing engine 24 .
  • the parsing engine 24 includes a parser 26 and a scheduler 28 .
  • the parser 28 interprets the SQL statement, checks it for proper SQL syntax, and evaluates it semantically.
  • the parser 28 also includes an optimizer 32 , whose role is to develop the least expensive plan to return the requested response. Several alternative query plans are evaluated and the most efficient query plan is selected by the optimizer 32 . As part of the query selection procedure, the optimizer 32 relies on statistics collected by the database system 14 . As discussed further below, such statistics are based on a sample of each table so that all rows of a table do not need to be scanned.
  • a selected query plan is converted into executable steps that are submitted to the AMPs 16 for execution.
  • the parser 26 passes the generated steps to a scheduler 24 , which controls the sequence in which the steps are communicated to the AMPs for execution.
  • One technique of query optimization performed by the optimizer 32 is to use a cost model to estimate the response times of a given query plan and to search the space of query plans to return a plan with low cost.
  • a cost model In the cost-based optimization model, different methods for doing a unit of work are compared and the most efficient method is selected (the plan with the lowest cost). Because the number of alternatives can be quite large, especially in a parallel database system with a large number of nodes storing a large relational database (with many tables), the optimizer 32 uses statistics to reduce the search space in optimizing queries.
  • FIG. 2 shows an embodiment of collecting statistics for use by the optimizer 32 based on a sample of a table (or a sample of multiple tables) stored in the database system 14 .
  • Collecting statistics based on a sample refers to collecting statistics based on reading a percentage (less than 100%) of the rows of a table.
  • Collecting statistics based on a sample reduces the number I/O accesses that are required of the storage 20 of the database system 14 , since a smaller number of rows need to be read. This is particularly important for a database system 14 with a large database, since the tables tend to be large and contain many rows.
  • sampling also reduces the subsequent CPU costs associated with aggregating and sorting the data.
  • the data contained in the tables of database system 14 is divided into blocks of data.
  • a “block” refers to a portion of data of a predetermined size.
  • the database system 14 reads rows contained in one block, skips N blocks, reads the next block, skips another N blocks, and so forth. By skipping N blocks, wherein N is greater than or equal to 1, only some predetermined percentage (which is dependent upon the value of N) of data of each table is read. For example, if N equals to 1, then the percentage of rows of a table read is 50%. If N equals to 2, then the percentage of rows read is 33%. N is referred to as the skip factor.
  • the value of N is varied by the database system 14 based on certain characteristics of the data being read from a given table.
  • One such characteristic is if an attribute data value is skewed.
  • a data value is considered to be skewed if there is a high frequency (higher than some predefined threshold) of the data value in the table.
  • Data distribution is skewed when some values have much higher frequencies than others. For example, given a column (or attribute) C in a table T with M rows, assume one of the column values of C is C1.
  • the value C1 of the column C is considered to be skewed if the frequency of occurrence of C1 in table T is greater than some predefined percentage (e.g., 1%, 2%, etc.) of the total table size (M). If skewed data is detected, then the value of N is reduced.
  • some predefined percentage e.g., 1%, 2%, etc.
  • the query optimizer 32 is more sensitive to a data distribution that is skewed than a data distribution that is uniform. Therefore, when it is determined that the data contained in a given table is skewed, then the value of N is reduced so that a larger number of rows from the given table are read. In effect, by reducing N in response to detection of skewed data, the sample size is increased or made larger for purposes of collecting statistics. Increasing the sample size in the presence of skewed data enhances the accuracy of the statistics that are collected.
  • the statistics collected for a given table are on columns or attributes that are arranged or ordered in the database system 14 according to some predefined algorithm, such as by a hash code or value of the attribute.
  • a given data value is stored on one of the storage modules 18 based on the hash value for the given data value.
  • the data value of an attribute is provided to a hashing algorithm, which produces a row hash.
  • the row hash is applied to select an entry of a hash map, with the entry containing a number that corresponds to a specific one of the AMPs 16 .
  • the corresponding row is then stored in the storage module associated with the selected AMP.
  • the attributes of a base table that are sorted according to their hash values are the primary indexes of the tables.
  • the sampled statistics collection technique according to some embodiments is thus applied to collect statistics on the primary index of a table, since the primary index is ordered according to the hash value.
  • the sampled statistics collection technique can also be applied to collect statistics of a secondary index in a secondary index table (referred to as a subtable).
  • a subtable contains an indexed column (the secondary index) that contains data values sorted by their hash values.
  • a secondary index is typically used to enhance the speed in which a database system is able to access data in a base table.
  • a subtable includes the indexed column as well as respective pointers to rows of the base table. Given a specific value of the indexed column, the pointer(s) to a row or rows of the base table can be used to more quickly retrieve the corresponding row(s) of the base table.
  • the data to be scanned is sorted by hash code on a given storage module (base table for primary indexes or subtable for secondary indexes). This ensures that equal data values are physically adjacent within the data blocks, which in turn makes it possible to compute accurate counts for those data values that are included in the sample.
  • sampled statistics collection technique can be applied to any attribute(s) that is sorted or ordered according to some predefined algorithm.
  • FIG. 2 shows a flow diagram of a process according to one embodiment of performing sampled statistics collection.
  • the process shown in FIG. 2 is performed by each of the AMPs 16 in parallel.
  • each AMP 16 scans a portion of the base table 202 or subtable 204 in collecting sampled statistics.
  • Each AMP is able to adjust the skip factor N independently (and thus adjust the sample size).
  • the sampled statistics collection technique can also be applied to a uni-processing system, e.g., a single-AMP system.
  • a variable i is initialized (at 102 ).
  • i can be initialized to the value 0.
  • the skip factor N is also initialized to a predefined value to achieve a specific sample size.
  • N can be set to some value that specifies a sample size of about 1%, 2%, 5%, 10%, 25%, and so forth.
  • the AMP 16 scans (at 104 ) block i of a base table 202 or subtable 204 (FIG. 3).
  • a (Value, Count) data structure is updated (at 106 ) based on the identification of a value and its duplicates. “Value” represents the value of the attribute on which statistics is being collected, and “Count” refers to the number of occurrences of the associated value.
  • the (Value, Count) data structure is stored in a temporary location, referred to as a spool 206 , as shown in FIG. 3. If a given value of an attribute occurs just once, Count is set to one.
  • the AMP 16 determines (at 108 ) if a skewed value has been detected. This occurs when a given attribute value exceeds a predetermined frequency of occurrence. In other words, Value is skewed if Count is greater than some predefined number. If this is detected, then the skip factor N is adjusted (at 110 ). In one embodiment, the skip factor N is reduced to increase the sample size. The value of N is adjusted based on the number of instances of detecting skew. The more instances of skewed values detected, the lower the value of N. In other words, if there is a lot of skew detected when scanning the table, the sample size continues to be increased by successively decreasing the value of N. In one embodiment, the minimum value of N is 1, so that the sample size is typically less than or equal to 50% of a given table. Under certain conditions, the sample size can be greater than 50%.
  • the AMP 16 increments (at 112 ) the value of i by adding to itself the value (N+1).
  • the AMP 16 determines (at 114 ) if all the blocks of the table portion have been scanned. If not, the AMP 16 scans (at 104 ) the next block i (which has been incremented by N+1). If all blocks have been processed, then the AMP 16 generates (at 116 ) a histogram 208 (FIG. 3). As shown in FIG. 3, the histogram 208 contains multiple intervals or buckets.
  • Each bucket contains the following information (according to one example implementation): a maximum value covered by an interval; a mode value (which is the most frequently occurring value in the interval); a number corresponding to the frequency of mode value; a number corresponding to the number of values not equal to the mode value in the interval; and a total number of rows in the bucket/interval.
  • a maximum value covered by an interval a maximum value covered by an interval
  • a mode value which is the most frequently occurring value in the interval
  • a number corresponding to the frequency of mode value a number corresponding to the number of values not equal to the mode value in the interval
  • a total number of rows in the bucket/interval In other implementations, other types of statistics can be collected.
  • the histogram 208 is based on a sample, the frequencies of the non-mode values and the total number of rows are multiplied by some factor to reflect estimated frequencies in the entire table. For example, if an effective sample size is 5%, and the total number of rows is 1000, then that frequency is multiplexed by a factor of 20 (1000 ⁇ 20) to obtain a value of 20000, which is the estimated total number of rows for the entire table.
  • each AMP builds a local histogram that is based on the (Value, Count) pairs stored in spool 206 .
  • block 0 contains attribute values 0, 1, and 2.
  • the attribute value 2 is also stored in block 1 .
  • the scan of block 0 extends into a portion of the adjacent block 1 to identify all duplicates of the value 2. Since data values are ordered by hash code in one embodiment, equal data values are physically adjacent.
  • the scan of block 3 can extend backwards into block 2 to retrieve all duplicate values that span blocks 2 and 3 .
  • the value 6 starts somewhere in block 2 and continues into block 3 .
  • portions of adjacent blocks are also scanned to retrieve duplicates so that an accurate duplicate count is recorded.
  • control units or processors include microprocessors, microcontrollers, processor modules or subsystems (including one or more microprocessors or microcontrollers), or other control or computing devices.
  • a “controller” refers to hardware, software, or a combination thereof.
  • a “controller” can refer to a single component or to plural components (whether software or hardware).
  • Data and instructions are stored in respective storage units, which can be implemented as one or more machine-readable storage media.
  • the storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs).
  • DRAMs or SRAMs dynamic or static random access memories
  • EPROMs erasable and programmable read-only memories
  • EEPROMs electrically erasable and programmable read-only memories
  • flash memories such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs).
  • the instructions of the software modules or layers are loaded or transported to each device or system in one of many different ways.
  • code segments including instructions stored on floppy disks, CD or DVD media, a hard disk, or transported through a network interface card, modem, or other interface device are loaded into the device or system and executed as corresponding software modules or layers.
  • data signals that are embodied in carrier waves (transmitted over telephone lines, network lines, wireless links, cables, and the like) communicate the code segments, including instructions, to the device or system.
  • carrier waves are in the form of electrical, optical, acoustical, electromagnetic, or other types of signals.

Abstract

A database system has the ability to perform sampled statistics collection, in which collection of statistics of a table is based on a sample (less than all rows of the table). The sample size is adjusted as the table is scanned in response to detecting a predetermined characteristic of the table (e.g., skewed data values are detected).

Description

    BACKGROUND
  • A database is a collection of stored data that is logically related and that is accessible by one or more users or applications. A popular type of database is the relational database management system (RDBMS), which includes relational tables made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information. [0001]
  • To extract data from, or to update, a relational table in an RDBMS, queries according to a standard database query language (e.g., Structured Query Language or SQL) are used. Examples of SQL query statements include INSERT, SELECT, UPDATE, and DELETE. [0002]
  • As applications become increasingly sophisticated, and data storage needs become greater, higher performance database systems are used. One such database system is the TERADATA® database mangement system from NCR Corporation. The TERADATA® database systems are parallel processing systems capable of handling relatively large amounts of data. In some arrangements, a database system includes multiple nodes that manage access to multiple portions of data to enhance concurrent processing of data access and updates. In TERADATA® database management systems, concurrent data processing is enhanced by the use of virtual processors, referred to as access module processors (AMPs), to further divide database tasks. Each AMP is responsible for a logical disk space. In response to a query, one or more of the AMPs are invoked to perform database accesses, updates, and other manipulations. [0003]
  • One of the goals of a database management system is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an optimal query plan is selected, with the optimal query plan being the one with the lowest cost (e.g., response time) as determined by an optimizer. The response time is the amount of time it takes to complete the execution of a query on a given system. [0004]
  • Typically, an optimizer calculates cost and/or other useful metrics based on statistics of one or more columns (or attributes) of each table. In some cases, statistics are stored in the form of a histogram. In database systems that store large tables, the cost of collecting statistics for such large tables can be quite high, especially if all rows of a table need to be scanned to collect the statistics. As a result, some database users may choose not to collect statistics for columns of tables over a certain size. The lack of statistics for some tables may adversely affect operation of certain components in the database system, such as the optimizer and other tools. [0005]
  • SUMMARY
  • In general, a mechanism is provided for more effective collection of statistics in a database system. For example, a method for use in a database system includes scanning a sample containing less than all rows of the table. A characteristic of data (e.g., data skew) being read is determined, and a size of the sample is adjusted as the sample is scanned based on the determined characteristic. [0006]
  • Other or alternative features will become more apparent from the following description, from the drawings, and from the claims.[0007]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of an example arrangement of a database system and client terminals coupled to the database system. [0008]
  • FIG. 2 is a flow diagram of a process of collecting a sample of statistics, according to one embodiment. [0009]
  • FIG. 3 is a block diagram of a storage containing various data structures. [0010]
  • FIG. 4 illustrates a portion of data stored as blocks. [0011]
  • DETAILED DESCRIPTION
  • In the following description, numerous details are set forth to provide an understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these details and that numerous variations or modifications from the described embodiments are possible. [0012]
  • As shown in FIG. 1, a [0013] database system 14 is coupled over a data network 12 to one or more client terminals 10. Examples of the data network 12 include a local area network (LAN), a wide area network (WAN), the Internet, and so forth. The client terminal 10 is capable of issuing queries according to a standard database query language to the database system 14 to access or update data or to create or alter data structures (e.g., tables, rows, and so forth) in the database system 14. Each client terminal 10 includes application software 11 through which a user can issue various queries to the database system 14. One example of a standard database query language is the Structured Query Language (SQL), provided by the American National Standards Institute (ANSI). One version of SQL is the SQL-92 Standard, while another version is the SQL-99 Standard. In other embodiments, other versions of SQL or other standard database query languages are used.
  • Also, according to some embodiments, a client terminal is able to issue a COLLECT STATISTICS query to the [0014] database system 14 to collect statistics about one or more tables stored in the database system 14. To save the amount of work needed of the database system 14 in collecting statistics, a sample (less than all) of the rows of each table is scanned so that the statistics are based on a percentage (less than 100%) of each table. For a database system 14 that stores large tables, the sampled statistics collecting technique requires less processing time and input/output (I/O) accesses so that the user does not need to wait a long period of time while the database system 14 is collecting statistics.
  • In one embodiment, the [0015] database system 14 is a parallel database system with multiple nodes 26. However, in other embodiments, single-processor database systems are employed. Certain of the nodes 26 of the database system 14 include access module processors (AMPs) 16 (more generally referred to as “access modules”) that are capable of concurrently managing access of data stored in respective storage modules 18. The storage modules 18 collectively make up a storage 20 in which various tables and other data structures are stored. The AMPs 16 are software modules that are executable in respective nodes 26. More than one AMP 16 is executable in each node 26. In one embodiment, the AMPs 16 are based on AMPs used in TERADATA® systems from NCR Corporation.
  • One or more of the [0016] nodes 26 include a parsing engine 24, which is coupled to the AMPs 16 over an interconnect layer 22. The AMPs respond to steps (in the form of instructions or commands) received from the parsing engine 24. Responsibilities of the AMPs include locking databases, tables, and portions of tables; creating, modifying, or deleting definitions of tables; inserting, deleting, or modifying rows within tables; and retrieving information from definitions and tables. The AMPs 16 also return responses (e.g., data accessed from a table stored in storage modules 18) to the parsing engine 24.
  • The [0017] parsing engine 24 includes a parser 26 and a scheduler 28. When the parsing engine 24 receives an SQL query from a client application, such as application software 11 running in a client terminal 10, the parser 28 interprets the SQL statement, checks it for proper SQL syntax, and evaluates it semantically. The parser 28 also includes an optimizer 32, whose role is to develop the least expensive plan to return the requested response. Several alternative query plans are evaluated and the most efficient query plan is selected by the optimizer 32. As part of the query selection procedure, the optimizer 32 relies on statistics collected by the database system 14. As discussed further below, such statistics are based on a sample of each table so that all rows of a table do not need to be scanned.
  • A selected query plan is converted into executable steps that are submitted to the [0018] AMPs 16 for execution. The parser 26 passes the generated steps to a scheduler 24, which controls the sequence in which the steps are communicated to the AMPs for execution.
  • One technique of query optimization performed by the [0019] optimizer 32 is to use a cost model to estimate the response times of a given query plan and to search the space of query plans to return a plan with low cost. In the cost-based optimization model, different methods for doing a unit of work are compared and the most efficient method is selected (the plan with the lowest cost). Because the number of alternatives can be quite large, especially in a parallel database system with a large number of nodes storing a large relational database (with many tables), the optimizer 32 uses statistics to reduce the search space in optimizing queries.
  • FIG. 2 shows an embodiment of collecting statistics for use by the [0020] optimizer 32 based on a sample of a table (or a sample of multiple tables) stored in the database system 14. Collecting statistics based on a sample refers to collecting statistics based on reading a percentage (less than 100%) of the rows of a table. Collecting statistics based on a sample reduces the number I/O accesses that are required of the storage 20 of the database system 14, since a smaller number of rows need to be read. This is particularly important for a database system 14 with a large database, since the tables tend to be large and contain many rows. In addition, sampling also reduces the subsequent CPU costs associated with aggregating and sorting the data.
  • In one implementation, the data contained in the tables of [0021] database system 14 is divided into blocks of data. A “block” refers to a portion of data of a predetermined size. In one embodiment, the database system 14 reads rows contained in one block, skips N blocks, reads the next block, skips another N blocks, and so forth. By skipping N blocks, wherein N is greater than or equal to 1, only some predetermined percentage (which is dependent upon the value of N) of data of each table is read. For example, if N equals to 1, then the percentage of rows of a table read is 50%. If N equals to 2, then the percentage of rows read is 33%. N is referred to as the skip factor.
  • In accordance with some embodiments of the invention, the value of N is varied by the [0022] database system 14 based on certain characteristics of the data being read from a given table. One such characteristic is if an attribute data value is skewed. A data value is considered to be skewed if there is a high frequency (higher than some predefined threshold) of the data value in the table. Data distribution is skewed when some values have much higher frequencies than others. For example, given a column (or attribute) C in a table T with M rows, assume one of the column values of C is C1. The value C1 of the column C is considered to be skewed if the frequency of occurrence of C1 in table T is greater than some predefined percentage (e.g., 1%, 2%, etc.) of the total table size (M). If skewed data is detected, then the value of N is reduced.
  • The [0023] query optimizer 32 is more sensitive to a data distribution that is skewed than a data distribution that is uniform. Therefore, when it is determined that the data contained in a given table is skewed, then the value of N is reduced so that a larger number of rows from the given table are read. In effect, by reducing N in response to detection of skewed data, the sample size is increased or made larger for purposes of collecting statistics. Increasing the sample size in the presence of skewed data enhances the accuracy of the statistics that are collected.
  • According to some embodiments, the statistics collected for a given table are on columns or attributes that are arranged or ordered in the [0024] database system 14 according to some predefined algorithm, such as by a hash code or value of the attribute. For example, in the database system 14 with multiple storage modules 18, a given data value is stored on one of the storage modules 18 based on the hash value for the given data value. The data value of an attribute is provided to a hashing algorithm, which produces a row hash. To select which storage module 18 the data value is to be stored on, the row hash is applied to select an entry of a hash map, with the entry containing a number that corresponds to a specific one of the AMPs 16. The corresponding row is then stored in the storage module associated with the selected AMP.
  • The attributes of a base table that are sorted according to their hash values are the primary indexes of the tables. The sampled statistics collection technique according to some embodiments is thus applied to collect statistics on the primary index of a table, since the primary index is ordered according to the hash value. Similarly, the sampled statistics collection technique can also be applied to collect statistics of a secondary index in a secondary index table (referred to as a subtable). A subtable contains an indexed column (the secondary index) that contains data values sorted by their hash values. A secondary index is typically used to enhance the speed in which a database system is able to access data in a base table. A subtable includes the indexed column as well as respective pointers to rows of the base table. Given a specific value of the indexed column, the pointer(s) to a row or rows of the base table can be used to more quickly retrieve the corresponding row(s) of the base table. [0025]
  • The data to be scanned is sorted by hash code on a given storage module (base table for primary indexes or subtable for secondary indexes). This ensures that equal data values are physically adjacent within the data blocks, which in turn makes it possible to compute accurate counts for those data values that are included in the sample. [0026]
  • Although reference is made to collecting sampled statistics on primary indexes and secondary indexes in this discussion, it is noted that the sampled statistics collection technique can be applied to any attribute(s) that is sorted or ordered according to some predefined algorithm. [0027]
  • FIG. 2 shows a flow diagram of a process according to one embodiment of performing sampled statistics collection. The process shown in FIG. 2 is performed by each of the [0028] AMPs 16 in parallel. Thus, each AMP 16 scans a portion of the base table 202 or subtable 204 in collecting sampled statistics. Each AMP is able to adjust the skip factor N independently (and thus adjust the sample size). Although described in the context of a parallel system, the sampled statistics collection technique can also be applied to a uni-processing system, e.g., a single-AMP system.
  • In the process of FIG. 2, a variable i is initialized (at [0029] 102). For example, i can be initialized to the value 0. The skip factor N is also initialized to a predefined value to achieve a specific sample size. For example, N can be set to some value that specifies a sample size of about 1%, 2%, 5%, 10%, 25%, and so forth. Next, the AMP 16 scans (at 104) block i of a base table 202 or subtable 204 (FIG. 3).
  • In scanning the block, all rows in the block are read. In reading the rows of the block, duplicate values of the attribute on which statistics are being collected are identified, with the number of duplicates counted. A (Value, Count) data structure is updated (at [0030] 106) based on the identification of a value and its duplicates. “Value” represents the value of the attribute on which statistics is being collected, and “Count” refers to the number of occurrences of the associated value. The (Value, Count) data structure is stored in a temporary location, referred to as a spool 206, as shown in FIG. 3. If a given value of an attribute occurs just once, Count is set to one.
  • Next, the [0031] AMP 16 determines (at 108) if a skewed value has been detected. This occurs when a given attribute value exceeds a predetermined frequency of occurrence. In other words, Value is skewed if Count is greater than some predefined number. If this is detected, then the skip factor N is adjusted (at 110). In one embodiment, the skip factor N is reduced to increase the sample size. The value of N is adjusted based on the number of instances of detecting skew. The more instances of skewed values detected, the lower the value of N. In other words, if there is a lot of skew detected when scanning the table, the sample size continues to be increased by successively decreasing the value of N. In one embodiment, the minimum value of N is 1, so that the sample size is typically less than or equal to 50% of a given table. Under certain conditions, the sample size can be greater than 50%.
  • Next, the [0032] AMP 16 increments (at 112) the value of i by adding to itself the value (N+1). Next, the AMP 16 determines (at 114) if all the blocks of the table portion have been scanned. If not, the AMP 16 scans (at 104) the next block i (which has been incremented by N+1). If all blocks have been processed, then the AMP 16 generates (at 116) a histogram 208 (FIG. 3). As shown in FIG. 3, the histogram 208 contains multiple intervals or buckets. Each bucket contains the following information (according to one example implementation): a maximum value covered by an interval; a mode value (which is the most frequently occurring value in the interval); a number corresponding to the frequency of mode value; a number corresponding to the number of values not equal to the mode value in the interval; and a total number of rows in the bucket/interval. In other implementations, other types of statistics can be collected.
  • Since the histogram [0033] 208 is based on a sample, the frequencies of the non-mode values and the total number of rows are multiplied by some factor to reflect estimated frequencies in the entire table. For example, if an effective sample size is 5%, and the total number of rows is 1000, then that frequency is multiplexed by a factor of 20 (1000×20) to obtain a value of 20000, which is the estimated total number of rows for the entire table.
  • Note that each AMP builds a local histogram that is based on the (Value, Count) pairs stored in [0034] spool 206.
  • When scanning the blocks of a table, it is often the case that a value of an attribute on which statistics is being collected spans multiple blocks. For example, as shown in FIG. 4, block [0035] 0 contains attribute values 0, 1, and 2. However, the attribute value 2 is also stored in block 1. To accurately count the duplicates of a given data value (e.g., 2), the scan of block 0 extends into a portion of the adjacent block 1 to identify all duplicates of the value 2. Since data values are ordered by hash code in one embodiment, equal data values are physically adjacent.
  • Similarly, when skipping to the next block ([0036] block 3 in the example) based on the value of the skip factor N (N=2 in the example), the scan of block 3 can extend backwards into block 2 to retrieve all duplicate values that span blocks 2 and 3. In the example shown, the value 6 starts somewhere in block 2 and continues into block 3.
  • Thus, generally, in performing the scan (at [0037] 104 in FIG. 2), portions of adjacent blocks are also scanned to retrieve duplicates so that an accurate duplicate count is recorded.
  • Instructions of the various software routines or modules discussed herein (such as the AMPs [0038] 16) are stored on one or more storage devices in the corresponding systems and loaded for execution on corresponding control units or processors. The control units or processors include microprocessors, microcontrollers, processor modules or subsystems (including one or more microprocessors or microcontrollers), or other control or computing devices. As used here, a “controller” refers to hardware, software, or a combination thereof. A “controller” can refer to a single component or to plural components (whether software or hardware).
  • Data and instructions (of the various software modules and layers) are stored in respective storage units, which can be implemented as one or more machine-readable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs). [0039]
  • The instructions of the software modules or layers are loaded or transported to each device or system in one of many different ways. For example, code segments including instructions stored on floppy disks, CD or DVD media, a hard disk, or transported through a network interface card, modem, or other interface device are loaded into the device or system and executed as corresponding software modules or layers. In the loading or transport process, data signals that are embodied in carrier waves (transmitted over telephone lines, network lines, wireless links, cables, and the like) communicate the code segments, including instructions, to the device or system. Such carrier waves are in the form of electrical, optical, acoustical, electromagnetic, or other types of signals. [0040]
  • While the invention has been disclosed with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover such modifications and variations as fall within the true spirit and scope of the invention. [0041]

Claims (22)

What is claimed is:
1. A method for use in a database system, comprising:
scanning a sample of a table, the sample containing less than all rows of the table;
determining a characteristic of data read by the scanning; and
adjusting a size of the sample as the table is scanned based on the determined characteristic; and
collecting statistics based on the sample.
2. The method of claim 1, wherein determining the characteristic comprises determining whether a data value is skewed.
3. The method of claim 2, wherein determining whether the data value is skewed comprises determining if a frequency of the data value is greater than a predetermined threshold.
4. The method of claim 2, wherein adjusting the size of the sample comprises increasing the size of the sample in response to detecting a data value is skewed.
5. The method of claim 4, wherein adjusting the size of the sample comprises successively increasing the size of the sample in response to plural instances of detecting skew.
6. The method of claim 1, wherein scanning the sample comprises reading one block, skipping N blocks, and then repeating the reading and skipping until complete.
7. The method of claim 6, wherein adjusting the size of the sample comprises adjusting a value of N.
8. The method of claim 7, wherein determining the characteristic comprises determining whether a data value is skewed
9. The method of claim 1, wherein collecting the statistics comprises building a histogram.
10. The method of claim 1, further comprising storing the table on plural storage modules that are managed by respective access modules, wherein the scanning, determining, adjusting, and collecting are performed by each of the access modules in parallel.
11. An article comprising at least one storage medium containing instructions that when executed cause a system to:
collect statistics of an attribute of a table based on a sample of the table, the sample having less than all rows of the table;
detect for a predetermined characteristic of the table; and
vary a size of the sample during statistics collection in response to detecting the predetermined characteristic.
12. The article of claim 11, wherein detecting for the predetermined characteristic comprises detecting for data skew.
13. The article of claim 12, wherein varying the size of the sample comprises increasing the size of the sample in response to detecting data skew.
14. The article of claim 13, wherein detecting for data skew comprises determining if a number of duplicates of a data value exceeds a predetermined threshold.
15. The article of claim 11, wherein the instructions when executed cause the system to skip rows of data to obtain the samples less than all rows of the table.
16. The article of claim 11, wherein the instructions when executed cause the system to read a block of data, skip N blocks, and repeat the reading and skipping until completed to obtain the sample less than all rows of the table.
17. A database system comprising:
a storage to store a table; and
a controller to:
read a sample of the table that is less than all rows of the table;
collect statistics based on the sample; and
adjust a size of the sample based on a predetermined characteristic of the table.
18. The database system of claim 17, the controller to collect statistics of an attribute of the table, the attribute being sorted by some predetermined algorithm.
19. The database system of claim 18, wherein the predetermined algorithm comprises a hash code.
20. The database system of claim 17, wherein the predetermined characteristic comprises data skew.
21. The database system of claim 20, the controller to increase the size of the sample in response to detecting data skew.
22. The database system of claim 17, wherein the storage contains multiple storage modules, and the controller comprises plural access modules to manage access of the storage modules, wherein each of the access modules performs the reading, collecting, and adjusting in parallel.
US10/255,734 2002-09-25 2002-09-25 Sampling statistics in a database system Active 2025-10-16 US7778996B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/255,734 US7778996B2 (en) 2002-09-25 2002-09-25 Sampling statistics in a database system
EP03255225A EP1403787A3 (en) 2002-09-25 2003-08-23 Sampling statistics in a database system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/255,734 US7778996B2 (en) 2002-09-25 2002-09-25 Sampling statistics in a database system

Publications (2)

Publication Number Publication Date
US20040059743A1 true US20040059743A1 (en) 2004-03-25
US7778996B2 US7778996B2 (en) 2010-08-17

Family

ID=31977850

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/255,734 Active 2025-10-16 US7778996B2 (en) 2002-09-25 2002-09-25 Sampling statistics in a database system

Country Status (2)

Country Link
US (1) US7778996B2 (en)
EP (1) EP1403787A3 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040128287A1 (en) * 2002-12-20 2004-07-01 International Business Machines Corporation Self tuning database retrieval optimization using regression functions
US20050131914A1 (en) * 2003-12-11 2005-06-16 International Business Machines Corporation Apparatus and method for estimating cardinality when data skew is present
US20060149695A1 (en) * 2004-12-30 2006-07-06 International Business Machines Corporation Management of database statistics
US20080052286A1 (en) * 2004-11-18 2008-02-28 International Business Machines Corporation Method and Apparatus for Predicting Selectivity of Database Query Join Conditions Using Hypothetical Query Predicates Having Skewed Value Constants
US20090030875A1 (en) * 2004-01-07 2009-01-29 International Business Machines Corporation Statistics management
US20090182706A1 (en) * 2008-01-16 2009-07-16 Christopher Olston Generating Example Data for Testing Database Queries
US20090192980A1 (en) * 2008-01-30 2009-07-30 International Business Machines Corporation Method for Estimating the Number of Distinct Values in a Partitioned Dataset
US20090292668A1 (en) * 2008-05-22 2009-11-26 Yu Xu System, method, and computer-readable medium for partial redistribution, partial duplication of rows of parallel join operation on skewed data
US20100138456A1 (en) * 2008-12-02 2010-06-03 Alireza Aghili System, method, and computer-readable medium for a locality-sensitive non-unique secondary index
US20100223269A1 (en) * 2009-02-27 2010-09-02 International Business Machines Corporation System and method for an efficient query sort of a data stream with duplicate key values
WO2011127158A1 (en) * 2010-04-06 2011-10-13 Justone Database, Inc. Data storage and/or retrieval based on a database model-agnostic, schema-agnostic and workload-agnostic data strorage and access models
US20120191639A1 (en) * 2011-01-26 2012-07-26 International Business Machines Corporation Statistics forecast for range partitioned tables
US20140244682A1 (en) * 2013-02-25 2014-08-28 Michael Curtiss Sampling a Set of Data
US9135300B1 (en) * 2012-12-20 2015-09-15 Emc Corporation Efficient sampling with replacement
US20160162536A1 (en) * 2012-07-03 2016-06-09 Salesforce.Com, Inc. Statistics mechanisms in multitenant database environments
US9489427B2 (en) 2008-10-06 2016-11-08 Teradata Us, Inc. System, method and computer-readable medium for optimization of multiple-parallel join operations on skewed data
US20170199895A1 (en) * 2016-01-13 2017-07-13 International Business Machines Corporation Sampling-based deduplication estimation
US20170199892A1 (en) * 2016-01-13 2017-07-13 International Business Machines Corporation Gauging accuracy of sampling-based distinct element estimation
US9811938B2 (en) 2013-03-14 2017-11-07 Business Objects Software Ltd. Methods, apparatus and system for analytics replay utilizing random sampling
US9870398B1 (en) * 2012-12-31 2018-01-16 Teradata Us, Inc. Database-table sampling-percentage selection
US11061898B2 (en) * 2016-09-15 2021-07-13 Sap Se Uncertainty-aware selection of query execution plan

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332791A1 (en) * 2009-06-25 2010-12-30 Yu Xu System, method, and computer-readable medium for optimizing processing of group-by queries featuring maximum or minimum equality conditions in a parallel processing system
US8830823B2 (en) 2010-07-06 2014-09-09 Nicira, Inc. Distributed control platform for large-scale production networks
US9525647B2 (en) 2010-07-06 2016-12-20 Nicira, Inc. Network control apparatus and method for creating and modifying logical switching elements
US9529849B2 (en) 2013-12-31 2016-12-27 Sybase, Inc. Online hash based optimizer statistics gathering in a database
US9354896B1 (en) * 2014-02-28 2016-05-31 Emc Corporation Hypervisor utilization to reduce bootstrap time for data storage system
US9798775B2 (en) * 2015-01-16 2017-10-24 International Business Machines Corporation Database statistical histogram forecasting
US10922314B2 (en) 2015-04-29 2021-02-16 Micro Focus Llc Incrementally updating a database statistic
US10558633B1 (en) * 2015-12-30 2020-02-11 Teradata Us, Inc. Hash-value-based single-pass data store statistics collection
US10324929B2 (en) * 2016-10-31 2019-06-18 Sap Se Provision of position data for query runtime errors

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5303383A (en) * 1991-05-01 1994-04-12 Ncr Corporation Multiprocessor computer system
US5511190A (en) * 1995-01-20 1996-04-23 Tandem Computers, Inc. Hash-based database grouping system and method
US5625815A (en) * 1995-01-23 1997-04-29 Tandem Computers, Incorporated Relational database system and method with high data availability during table data restructuring
US5627533A (en) * 1994-08-05 1997-05-06 Hayes Microcomputer Products, Inc. Adjusting encoding table size and memory allocation for data compression in response to input data
US5640584A (en) * 1994-12-12 1997-06-17 Ncr Corporation Virtual processor method and apparatus for enhancing parallelism and availability in computer systems
US5864842A (en) * 1995-10-23 1999-01-26 Ncr Corporation Optimization of SQL queries using hash star join operations
US5870752A (en) * 1997-08-21 1999-02-09 Lucent Technologies Inc. Incremental maintenance of an approximate histogram in a database system
US5884299A (en) * 1997-02-06 1999-03-16 Ncr Corporation Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations
US5950188A (en) * 1996-11-14 1999-09-07 Sybase, Inc. Database system with methods for executing system-created internal SQL command statements
US5970495A (en) * 1995-09-27 1999-10-19 International Business Machines Corporation Method and apparatus for achieving uniform data distribution in a parallel database system
US6223171B1 (en) * 1998-08-25 2001-04-24 Microsoft Corporation What-if index analysis utility for database systems
US20020120435A1 (en) * 2001-02-28 2002-08-29 Frazier John D. Implementing a neural network in a database system
US20030088579A1 (en) * 2001-10-12 2003-05-08 Brown Douglas P. Collecting statistics in a database system
US6691099B1 (en) * 2001-05-31 2004-02-10 Oracle International Corporation Method and system for histogram determination in a database
US20060036600A1 (en) * 2001-01-12 2006-02-16 Microsoft Corporation Database aggregation query result estimator

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5878426A (en) * 1996-12-23 1999-03-02 Unisys Corporation Statistical database query using random sampling of records
US5893090A (en) * 1997-01-31 1999-04-06 Informix Software, Inc. Method and apparatus for performing an aggregate query in a database system
US6278989B1 (en) * 1998-08-25 2001-08-21 Microsoft Corporation Histogram construction using adaptive random sampling with cross-validation for database systems

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5303383A (en) * 1991-05-01 1994-04-12 Ncr Corporation Multiprocessor computer system
US5627533A (en) * 1994-08-05 1997-05-06 Hayes Microcomputer Products, Inc. Adjusting encoding table size and memory allocation for data compression in response to input data
US5640584A (en) * 1994-12-12 1997-06-17 Ncr Corporation Virtual processor method and apparatus for enhancing parallelism and availability in computer systems
US5511190A (en) * 1995-01-20 1996-04-23 Tandem Computers, Inc. Hash-based database grouping system and method
US5625815A (en) * 1995-01-23 1997-04-29 Tandem Computers, Incorporated Relational database system and method with high data availability during table data restructuring
US5970495A (en) * 1995-09-27 1999-10-19 International Business Machines Corporation Method and apparatus for achieving uniform data distribution in a parallel database system
US5864842A (en) * 1995-10-23 1999-01-26 Ncr Corporation Optimization of SQL queries using hash star join operations
US5950188A (en) * 1996-11-14 1999-09-07 Sybase, Inc. Database system with methods for executing system-created internal SQL command statements
US5884299A (en) * 1997-02-06 1999-03-16 Ncr Corporation Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations
US5870752A (en) * 1997-08-21 1999-02-09 Lucent Technologies Inc. Incremental maintenance of an approximate histogram in a database system
US6223171B1 (en) * 1998-08-25 2001-04-24 Microsoft Corporation What-if index analysis utility for database systems
US20060036600A1 (en) * 2001-01-12 2006-02-16 Microsoft Corporation Database aggregation query result estimator
US20020120435A1 (en) * 2001-02-28 2002-08-29 Frazier John D. Implementing a neural network in a database system
US6691099B1 (en) * 2001-05-31 2004-02-10 Oracle International Corporation Method and system for histogram determination in a database
US20030088579A1 (en) * 2001-10-12 2003-05-08 Brown Douglas P. Collecting statistics in a database system

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080275841A1 (en) * 2002-12-20 2008-11-06 International Business Machines Corporation Self Tuning Database Retrieval Optimization Using Regression Functions
US20040128287A1 (en) * 2002-12-20 2004-07-01 International Business Machines Corporation Self tuning database retrieval optimization using regression functions
US7136850B2 (en) * 2002-12-20 2006-11-14 International Business Machines Corporation Self tuning database retrieval optimization using regression functions
US20070022136A1 (en) * 2002-12-20 2007-01-25 Martin Keller Self Tuning Database Retrieval Optimization Using Regression Functions
US20050131914A1 (en) * 2003-12-11 2005-06-16 International Business Machines Corporation Apparatus and method for estimating cardinality when data skew is present
US7203685B2 (en) * 2003-12-11 2007-04-10 International Business Machines Corporation Apparatus and method for estimating cardinality when data skew is present
US20090030875A1 (en) * 2004-01-07 2009-01-29 International Business Machines Corporation Statistics management
US7984024B2 (en) 2004-01-07 2011-07-19 International Business Machines Corporation Statistics management
US20080052286A1 (en) * 2004-11-18 2008-02-28 International Business Machines Corporation Method and Apparatus for Predicting Selectivity of Database Query Join Conditions Using Hypothetical Query Predicates Having Skewed Value Constants
US7483888B2 (en) * 2004-11-18 2009-01-27 International Business Machines Corporation Method and apparatus for predicting selectivity of database query join conditions using hypothetical query predicates having skewed value constants
US20080065594A1 (en) * 2004-11-18 2008-03-13 International Business Machines Corporation Method and Apparatus for Predicting Selectivity of Database Query Join Conditions Using Hypothetical Query Predicates Having Skewed Value Constants
US7987200B2 (en) 2004-11-18 2011-07-26 International Business Machines Corporation Method and apparatus for predicting selectivity of database query join conditions using hypothetical query predicates having skewed value constants
US20060149695A1 (en) * 2004-12-30 2006-07-06 International Business Machines Corporation Management of database statistics
US20090182706A1 (en) * 2008-01-16 2009-07-16 Christopher Olston Generating Example Data for Testing Database Queries
US7805447B2 (en) * 2008-01-16 2010-09-28 Yahoo! Inc. Generating example data for testing database queries
US7987177B2 (en) * 2008-01-30 2011-07-26 International Business Machines Corporation Method for estimating the number of distinct values in a partitioned dataset
US20090192980A1 (en) * 2008-01-30 2009-07-30 International Business Machines Corporation Method for Estimating the Number of Distinct Values in a Partitioned Dataset
US20090292668A1 (en) * 2008-05-22 2009-11-26 Yu Xu System, method, and computer-readable medium for partial redistribution, partial duplication of rows of parallel join operation on skewed data
US8131711B2 (en) * 2008-05-22 2012-03-06 Teradata Corporation System, method, and computer-readable medium for partial redistribution, partial duplication of rows of parallel join operation on skewed data
US9489427B2 (en) 2008-10-06 2016-11-08 Teradata Us, Inc. System, method and computer-readable medium for optimization of multiple-parallel join operations on skewed data
US20100138456A1 (en) * 2008-12-02 2010-06-03 Alireza Aghili System, method, and computer-readable medium for a locality-sensitive non-unique secondary index
US20100223269A1 (en) * 2009-02-27 2010-09-02 International Business Machines Corporation System and method for an efficient query sort of a data stream with duplicate key values
US9235622B2 (en) * 2009-02-27 2016-01-12 International Business Machines Corporation System and method for an efficient query sort of a data stream with duplicate key values
WO2011127158A1 (en) * 2010-04-06 2011-10-13 Justone Database, Inc. Data storage and/or retrieval based on a database model-agnostic, schema-agnostic and workload-agnostic data strorage and access models
US8738608B2 (en) 2010-04-06 2014-05-27 Justone Database, Inc. Apparatus, systems and methods for data storage and/or retrieval based on a database model-agnostic, schema-agnostic and workload-agnostic data storage and access models
US9965481B2 (en) 2010-04-06 2018-05-08 Edge Intelligence Software, Inc. Apparatus, systems and methods for data storage and/or retrieval based on a database model-agnostic, schema-agnostic and workload-agnostic data storage and access models
US20120191639A1 (en) * 2011-01-26 2012-07-26 International Business Machines Corporation Statistics forecast for range partitioned tables
US20160162536A1 (en) * 2012-07-03 2016-06-09 Salesforce.Com, Inc. Statistics mechanisms in multitenant database environments
US9760594B2 (en) * 2012-07-03 2017-09-12 Salesforce.Com, Inc. Statistics mechanisms in multitenant database environments
US9135300B1 (en) * 2012-12-20 2015-09-15 Emc Corporation Efficient sampling with replacement
US9870398B1 (en) * 2012-12-31 2018-01-16 Teradata Us, Inc. Database-table sampling-percentage selection
US10255318B2 (en) * 2013-02-25 2019-04-09 Facebook, Inc. Sampling a set of data
US20140244682A1 (en) * 2013-02-25 2014-08-28 Michael Curtiss Sampling a Set of Data
US9811938B2 (en) 2013-03-14 2017-11-07 Business Objects Software Ltd. Methods, apparatus and system for analytics replay utilizing random sampling
US10445920B2 (en) 2013-03-14 2019-10-15 Business Objects Software Ltd. Methods, apparatus and system for analytics replay utilizing random sampling
US20170199892A1 (en) * 2016-01-13 2017-07-13 International Business Machines Corporation Gauging accuracy of sampling-based distinct element estimation
US10169364B2 (en) * 2016-01-13 2019-01-01 International Business Machines Corporation Gauging accuracy of sampling-based distinct element estimation
US10198455B2 (en) * 2016-01-13 2019-02-05 International Business Machines Corporation Sampling-based deduplication estimation
US20170199895A1 (en) * 2016-01-13 2017-07-13 International Business Machines Corporation Sampling-based deduplication estimation
US11061898B2 (en) * 2016-09-15 2021-07-13 Sap Se Uncertainty-aware selection of query execution plan

Also Published As

Publication number Publication date
EP1403787A3 (en) 2009-04-29
US7778996B2 (en) 2010-08-17
EP1403787A2 (en) 2004-03-31

Similar Documents

Publication Publication Date Title
US7778996B2 (en) Sampling statistics in a database system
US6272487B1 (en) Query optimization through the use of multi-column statistics to avoid the problems of non-indexed column correlation
US6801903B2 (en) Collecting statistics in a database system
US10387411B2 (en) Determining a density of a key value referenced in a database query over a range of rows
Morfonios et al. ROLAP implementations of the data cube
US8108355B2 (en) Providing a partially sorted index
EP3748515B1 (en) Policy driven data placement and information lifecycle management
US7469241B2 (en) Efficient data aggregation operations using hash tables
US9805077B2 (en) Method and system for optimizing data access in a database using multi-class objects
US7069264B2 (en) Stratified sampling of data in a database system
JP2019530068A (en) Table incremental clustering maintenance
US8935233B2 (en) Approximate index in relational databases
Yu et al. Two birds, one stone: a fast, yet lightweight, indexing scheme for modern database systems
US20080040334A1 (en) Operation of Relational Database Optimizers by Inserting Redundant Sub-Queries in Complex Queries
US7447679B2 (en) Optimizing execution of a database query by using the partitioning schema of a partitioned object to select a subset of partitions from another partitioned object
US7949655B2 (en) Joins of relations in an object relational database system
US6636846B1 (en) Method for providing a system maintained materialized functionally dependent column for a database management system
US20030167275A1 (en) Computation of frequent data values
Gibbons et al. Aqua project white paper
CN111078705A (en) Spark platform based data index establishing method and data query method
Pena et al. Fast detection of denial constraint violations
US8255388B1 (en) Providing a progress indicator in a database system
US7321898B1 (en) Locking mechanism for materialized views in a database system
US7236971B1 (en) Method and system for deriving data through interpolation in a database system
US9378229B1 (en) Index selection based on a compressed workload

Legal Events

Date Code Title Description
AS Assignment

Owner name: NCR CORPORATION, OHIO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:BURGER, LOUIS M.;REEL/FRAME:013342/0251

Effective date: 20020924

AS Assignment

Owner name: TERADATA US, INC., OHIO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NCR CORPORATION;REEL/FRAME:020666/0438

Effective date: 20080228

Owner name: TERADATA US, INC.,OHIO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NCR CORPORATION;REEL/FRAME:020666/0438

Effective date: 20080228

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552)

Year of fee payment: 8

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 12