1/15/2012

Database concept

Study Notes ("Database management system" Third Edition)


1. Index
A database index is a data structure that improves the speed of data retrieval operations on a database table, at the cost of slower writes and increased storage space.
Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

Types:
Non-clustered:
  • The physical order of the rows is not the same as the index order.
  • Typically created on columns used in JOIN, WHERE, and ORDER BY clauses.
  • Good for tables whose values may be modified frequently.
Clustered: The physical records are stored on disk in the index's sort order, so a table can have only one clustered index.


Primary Index: An index on a set of fields that includes the primary key. Other indexes are called secondary indexes.


Index Data Structures:
  • Hash-based indexing (e.g., an index-organized file hashed on age, with an auxiliary index on sal).
  • Tree-based indexing (a tree-structured index, such as a B+ tree).
Index Specification in SQL:1999:
CREATE INDEX index_name ON table_name
            WITH STRUCTURE = BTREE,
                     KEY = (column1, column2)
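

The WITH STRUCTURE ... KEY form above is the textbook's illustrative syntax; in practice each system has its own variant of CREATE INDEX. A minimal sketch, assuming a hypothetical Sailors(sid, sname, rating, age) table:

-- Non-clustered B+ tree index on a single column (portable syntax).
CREATE INDEX idx_sailors_rating ON Sailors (rating);

-- Composite index; its search key has the prefix (age, rating).
CREATE INDEX idx_sailors_age_rating ON Sailors (age, rating);

-- SQL Server-style syntax for choosing clustered vs. non-clustered:
-- CREATE CLUSTERED INDEX idx_sailors_sid ON Sailors (sid);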


2. Outer Join
Some variants of the join operation rely on null values.
For example, consider the join of two tables, Sailors and Reserves. In an outer join, Sailors rows without a matching Reserves row appear once in the result, with the result columns inherited from Reserves assigned null values.
Left outer join: rows in the left table (Sailors) without a match in the right table (Reserves) appear in the result, but not vice versa.
Right outer join: rows in the right table (Reserves) without a match in the left table (Sailors) appear in the result, but not vice versa.
Full outer join: unmatched rows from both tables (Sailors and Reserves) appear in the result.
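
A minimal sketch, assuming hypothetical Sailors(sid, sname) and Reserves(sid, bid) tables joined on sid:

-- Left outer join: every sailor appears, even with no reservation.
SELECT S.sid, S.sname, R.bid
FROM Sailors S LEFT OUTER JOIN Reserves R ON S.sid = R.sid;
-- Sailors without a matching Reserves row get NULL in R.bid.

-- Full outer join: unmatched rows from both tables appear, padded with NULLs.
SELECT S.sid, S.sname, R.bid
FROM Sailors S FULL OUTER JOIN Reserves R ON S.sid = R.sid;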


3. Trigger
A trigger is a procedure that is automatically invoked by the DBMS in response to specified changes to the database.
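
A minimal sketch in MySQL-style syntax; the ReservesLog audit table and the Reserves columns are hypothetical:

-- Event: an insert on Reserves. Action: record it in an audit table.
CREATE TRIGGER log_reservation
AFTER INSERT ON Reserves
FOR EACH ROW
    INSERT INTO ReservesLog (sid, bid, day)
    VALUES (NEW.sid, NEW.bid, NEW.day);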


4. JDBC: Java Database Connectivity, a standard Java API for connecting to a DBMS and executing SQL statements.


5. Access Paths
An access path is a way of retrieving tuples from a table and consists of either (1) a file scan or (2) an index plus a matching selection condition.


Consider a simple selection that is a conjunction of conditions of the form attr op value, where op is one of the comparison operators (<, <=, =, <>, >=, >). Such selections are said to be in conjunctive normal form (CNF), and each condition is called a conjunct. Intuitively, an index matches a selection condition if the index can be used to retrieve just the tuples that satisfy the condition.
  • A hash index matches a CNF selection if there is a term of the form attribute=value in the selection for each attribute in the index's search key.
  • A tree index matches a CNF selection if there is a term of the form attr op value for each attribute in a prefix of the index's search key.
An index can match some subset of the conjuncts in a selection condition (in CNF), even though it does not match the entire condition. We refer to the conjuncts that the index matches as the primary conjuncts in the selection.


The most selective access path is the one that retrieves the fewest pages; using the most selective access path minimizes the cost of data retrieval. The selectivity of an access path depends on the primary conjuncts in the selection condition; each conjunct acts as a filter on the table. The fraction of tuples in the table that satisfy a given conjunct is called the reduction factor.
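
A minimal sketch of matching, assuming a Reserves(sid, bid, day, rname) table with a hash index on rname and a B+ tree index with search key (bid, day) (all names hypothetical):

SELECT *
FROM Reserves
WHERE rname = 'Joe' AND bid = 103 AND day > '2012-01-01';
-- The hash index matches the conjunct rname = 'Joe': an equality term
-- exists for every attribute in its search key.
-- The tree index matches bid = 103 AND day > '2012-01-01', since those
-- conjuncts cover a prefix of its search key.
-- The access path whose primary conjuncts have the smallest reduction
-- factor retrieves the fewest pages and is the most selective.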


6. Normalization
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
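
A minimal sketch of such a decomposition, using a hypothetical Orders table that repeats customer data in every row:

-- Before: Orders(order_id, cust_id, cust_name, cust_city, amount)
-- cust_name and cust_city are stored redundantly, once per order.

-- After: each customer fact is stored exactly once.
CREATE TABLE Customers (
    cust_id   INTEGER PRIMARY KEY,
    cust_name VARCHAR(40),
    cust_city VARCHAR(40)
);

CREATE TABLE Orders (
    order_id  INTEGER PRIMARY KEY,
    cust_id   INTEGER REFERENCES Customers (cust_id),
    amount    DECIMAL(10, 2)
);
-- Changing a customer's city is now a single-row update in Customers,
-- propagated to orders through the foreign-key relationship.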


7. The ACID Properties (Transaction Management)
A DBMS must ensure four important properties of transactions to maintain data in the face of concurrent access and system failure:
1. Atomicity: Users should be able to regard the execution of each transaction as atomic: either all actions are carried out or none are.
A DBMS ensures transaction atomicity by undoing the actions of incomplete transactions. To be able to do this, the DBMS maintains a record, called the log, of all writes to the database. The log is also used to ensure durability: if the system crashes before the changes made by a completed transaction are written to disk, the log is used to remember and restore these changes when the system restarts.
2. Consistency: Each transaction, run by itself with no concurrent execution of other transactions, must preserve the consistency of the database.
3. Isolation: Transactions are isolated, or protected, from the effects of concurrently scheduling other transactions.
The isolation property is ensured by guaranteeing that, even though actions of several transactions might be interleaved, the net effect is identical to executing all transactions one after the other in some serial order. 
4. Durability: Once the DBMS informs the user that a transaction has been successfully completed, its effects should persist even if the system crashes before all its changes are reflected on disk.
The DBMS component that ensures atomicity and durability is called the recovery manager.
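
A minimal sketch of atomicity at work, assuming a hypothetical Accounts(acct_id, bal) table:

BEGIN;
UPDATE Accounts SET bal = bal - 100 WHERE acct_id = 1;  -- debit
UPDATE Accounts SET bal = bal + 100 WHERE acct_id = 2;  -- credit
COMMIT;
-- If the system crashes between the two updates, the recovery manager uses
-- the log to undo the debit on restart: both updates happen or neither does.
-- If the crash comes after COMMIT but before the pages reach disk, the log
-- is replayed instead, which is what durability requires.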


8. Schedule
A schedule is a list of actions (reading, writing, aborting, committing) from a set of transactions, and the order in which two actions of a transaction T appear in a schedule must be the same as the order in which they appear in T.
A schedule that contains either an abort or a commit for each transaction whose actions are listed in it is called a complete schedule.
Serial Schedule: The actions of different transactions are not interleaved --- that is, transactions are executed from start to finish, one by one.

A serializable schedule over a set S of committed transactions is a schedule whose effect on any consistent database instance is guaranteed to be identical to that of some complete serial schedule over S. (Transactions can be interleaved.)  That is, the database instance that results from executing the given schedule is identical to the database instance that results from executing the transactions in some serial order.
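
For instance, writing R/W for reads and writes of objects A and B, the following interleaving of T1 and T2 is serializable, because every conflict orders T1 before T2 and its effect matches the serial schedule T1 then T2:

T1: R(A) W(A)            R(B) W(B) Commit
T2:           R(A) W(A)                   R(B) W(B) Commit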

All actions of aborted transactions are to be undone, and we can therefore imagine that they were never carried out to begin with.

We extend the definition of a serializable schedule as follows:
A serializable schedule over a set S of transactions is a schedule whose effect on any consistent database instance is guaranteed to be identical to that of some complete serial schedule over the set of committed transactions in S.

9. Concurrent execution of transactions
Motivation:
  • While one transaction is waiting for a page to be read in from disk, the CPU can process another transaction. (I/O activity can be done in parallel with CPU activity in a computer. Overlapping I/O and CPU activity reduces the amount of time disks and processors are idle and increases system throughput---the average number of transactions completed in a given time.)
  • Interleaved execution of a short transaction with a long transaction usually allows the short transaction to complete quickly. In serial execution, a short transaction could get stuck behind a long transaction, leading to unpredictable delays in response time, or average time taken to complete a transaction. 
Anomalies Due to Interleaved Execution:

Two actions on the same data object conflict if at least one of them is a write.


Write-read (WR) conflict (reading uncommitted data, a "dirty read"): T2 reads a data object previously written by T1.

Read-write (RW) conflict (unrepeatable reads):
A transaction T2 could change the value of an object A that has been read by a transaction T1, while T1 is still in progress.

Write-write (WW) conflict (overwriting uncommitted data):
A transaction T2 could overwrite the value of an object A, which has already been modified by a transaction T1, while T1 is still in progress.
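
A minimal two-session sketch of a WR conflict, reusing the hypothetical Accounts table from above and assuming an isolation level loose enough to permit dirty reads:

-- Session 1 (T1)
BEGIN;
UPDATE Accounts SET bal = bal - 100 WHERE acct_id = 1;

-- Session 2 (T2), interleaved while T1 is still in progress
BEGIN;
SELECT bal FROM Accounts WHERE acct_id = 1;  -- dirty read of T1's uncommitted write
COMMIT;

-- Session 1 (T1)
ROLLBACK;  -- T2 has acted on a value that was never committed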


In a recoverable schedule, transactions commit only after (and if!) all transactions whose changes they read commit.

10. Lock-Based Concurrency Control
A DBMS must be able to ensure that only serializable, recoverable schedules are allowed and that no actions of committed transactions are lost while undoing aborted transactions. A DBMS typically uses a locking protocol to achieve this. 
A lock is a small bookkeeping object associated with a database object.
A locking protocol is a set of rules to be followed by each transaction (and enforced by the DBMS) to ensure that, even though actions of several transactions might be interleaved, the net effect is identical to executing all transactions in serial order.

11. Strict Two-Phase Locking (Strict 2PL)
Two rules:

1. If a transaction T wants to read (respectively, modify) an object, it first requests a shared (respectively, exclusive) lock on the object.
2. All locks held by a transaction are released when the transaction is completed.

The Strict 2PL algorithm allows only serializable schedules.

Deadlocks: a cycle of transactions waiting for locks to be released.

Ex: Transaction T1 sets an exclusive lock on object A, T2 sets an exclusive lock on B, T1 requests an exclusive lock on B and is queued, and T2 requests an exclusive lock on A and is queued. Now, T1 is waiting for T2 to release its lock and T2 is waiting for T1 to release its lock.
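
A minimal two-session sketch of that cycle, using SELECT ... FOR UPDATE to request exclusive locks (Objects is a hypothetical table):

-- Session 1 (T1)
BEGIN;
SELECT * FROM Objects WHERE id = 'A' FOR UPDATE;  -- X-lock on A granted

-- Session 2 (T2)
BEGIN;
SELECT * FROM Objects WHERE id = 'B' FOR UPDATE;  -- X-lock on B granted

-- Session 1 (T1)
SELECT * FROM Objects WHERE id = 'B' FOR UPDATE;  -- blocks, queued behind T2

-- Session 2 (T2)
SELECT * FROM Objects WHERE id = 'A' FOR UPDATE;  -- blocks behind T1: deadlock
-- Most DBMSs detect the waits-for cycle and abort one transaction to break it.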


To be continued......
