1/15/2012

Database concept

Study Notes ("Database management system" Third Edition)


1. Index
A database index is a data structure that improves the speed of data retrieval operations on a database table, at the cost of slower writes and increased storage space.
Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

Type:
Non-clustered:

  • The physical order of the rows is not the same as the index order.
  • Typically created on columns used in JOIN, WHERE, and ORDER BY clauses.
  • Good for tables whose values may be modified frequently.
Clustered: The physical records are stored on disk in the index's sort order, so a table can have only one clustered index.


Primary Index: An index on a set of fields that includes the primary key. Other indexes are called secondary indexes.


Index Data Structures:
  • Hash-based indexing. (Figure: index-organized file hashed on age, with auxiliary index on sal.)
  • Tree-based indexing. (Figure: tree-structured index.)
Index Specification in SQL:1999 (index creation is not part of the SQL:1999 standard itself; this is a typical vendor-style syntax):
CREATE INDEX XXX ON XXXX (table name)
            WITH STRUCTURE = BTREE,
                     KEY = (XX, XX)


2. Outer Join
Some variants of the join operation rely on null values.
For example, consider the join of two tables: Sailors and Reserves. In an outer join, Sailors rows without a matching Reserves row appear once in the result, with the columns inherited from Reserves assigned null values.
Left outer join: rows in the left table (Sailors) without a match in the right table (Reserves) appear in the result, but not vice versa.
Right outer join: rows in the right table (Reserves) without a match in the left table (Sailors) appear in the result, but not vice versa.
Full outer join: unmatched rows from both tables (Sailors and Reserves) appear in the result.


3. Trigger
A trigger is a procedure that is automatically invoked by the DBMS in response to specified changes to the database.


4. JDBC: Java Database Connectivity.
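A minimal sketch of the core java.sql API. The driver URL, credentials, and the Sailors/Reserves schema are assumptions for illustration (the query reuses the left outer join from item 2); a JDBC driver for your DBMS must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder URL and credentials; use your DBMS's driver and connection string.
        String url = "jdbc:postgresql://localhost/test";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // The query reuses the Sailors/Reserves left outer join from item 2.
             ResultSet rs = stmt.executeQuery(
                     "SELECT S.sname, R.bid FROM Sailors S "
                     + "LEFT OUTER JOIN Reserves R ON S.sid = R.sid")) {
            while (rs.next()) {
                // bid comes back null for sailors with no matching reservation.
                System.out.println(rs.getString("sname") + " -> " + rs.getString("bid"));
            }
        }
    }
}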


5. Access Paths
An access path is a way of retrieving tuples from a table and consists of either (1) a file scan or (2) an index plus a matching selection condition.


Consider a simple selection that is a conjunction of conditions of the form attr op value, where op is one of the comparison operators (<, ≤, =, ≠, ≥, >). Such selections are said to be in conjunctive normal form (CNF), and each condition is called a conjunct. Intuitively, an index matches a selection condition if the index can be used to retrieve just the tuples that satisfy the condition.
  • A hash index matches a CNF selection if there is a term of the form attribute=value in the selection for each attribute in the index's search key.
  • A tree index matches a CNF selection if there is a term of the form attr op value for each attribute in a prefix of the index's search key.
An index can match some subset of the conjuncts in a selection condition (in CNF), even though it does not match the entire condition. We refer to the conjuncts that the index matches as the primary conjuncts in the selection.


The most selective access path is the one that retrieves the fewest pages; using the most selective access path minimizes the cost of data retrieval. The selectivity of an access path depends on the primary conjuncts in the selection condition; each conjunct acts as a filter on the table. The fraction of tuples in the table that satisfy a given conjunct is called the reduction factor.


6. Normalization
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.


7. The ACID Properties (Transaction Management)
A DBMS must ensure four important properties of transactions to maintain data in the face of concurrent access and system failure:
1. Atomicity: Users should be able to regard the execution of each transaction as atomic: either all actions are carried out or none are.
A DBMS ensures transaction atomicity by undoing the actions of incomplete transactions. To be able to do this, the DBMS maintains a record, called the log, of all writes to the database. The log is also used to ensure durability: if the system crashes before the changes made by a completed transaction are written to disk, the log is used to remember and restore these changes when the system restarts.
2. Consistency: Each transaction, run by itself with no concurrent execution of other transactions, must preserve the consistency of the database.
3. Isolation: Transactions are isolated, or protected, from the effects of concurrently scheduling other transactions.
The isolation property is ensured by guaranteeing that, even though actions of several transactions might be interleaved, the net effect is identical to executing all transactions one after the other in some serial order. 
4. Durability: Once the DBMS informs the user that a transaction has been successfully completed, its effects should persist even if the system crashes before all its changes are reflected on disk.
The DBMS component that ensures atomicity and durability is called the recovery manager.


8. Schedule
A schedule is a list of actions (reading, writing, aborting, committing) from a set of transactions, and the order in which two actions of a transaction T appear in a schedule must be the same as the order in which they appear in T.
A schedule that contains either an abort or a commit for each transaction whose actions are listed in it is called a complete schedule.
Serial Schedule: The actions of different transactions are not interleaved --- that is, transactions are executed from start to finish, one by one.

A serializable schedule over a set S of committed transactions is a schedule whose effect on any consistent database instance is guaranteed to be identical to that of some complete serial schedule over S. (Transactions can be interleaved.) That is, the database instance that results from executing the given schedule is identical to the database instance that results from executing the transactions in some serial order.

All actions of aborted transactions are to be undone, and we can therefore imagine that they were never carried out to begin with.

We extend the definition of a serializable schedule as follows:
A serializable schedule over a set S of transactions is a schedule whose effect on any consistent database instance is guaranteed to be identical to that of some complete serial schedule over the set of committed transactions in S.

9. Concurrent execution of transactions
Motivation:
  • While one transaction is waiting for a page to be read in from disk, the CPU can process another transaction. (I/O activity can be done in parallel with CPU activity in a computer. Overlapping I/O and CPU activity reduces the amount of time disks and processors are idle and increases system throughput---the average number of transactions completed in a given time.)
  • Interleaved execution of a short transaction with a long transaction usually allows the short transaction to complete quickly. In serial execution, a short transaction could get stuck behind a long transaction, leading to unpredictable delays in response time, or average time taken to complete a transaction. 
Anomalies Due to Interleaved Execution:

Two actions on the same data object conflict if at least one of them is a write.


Write-read (WR) conflict (reading uncommitted data: "dirty read"): T2 reads a data object previously written by T1, while T1 is still in progress.

Read-write (RW) conflict (unrepeatable reads):
T2 could change the value of an object A that has been read by T1, while T1 is still in progress.

Write-write (WW) conflict (overwriting uncommitted data):
T2 could overwrite the value of an object A, which has already been modified by T1, while T1 is still in progress.
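A loose analogy in plain Java, with threads standing in for transactions (this is an unsynchronized program, not a DBMS): both "transactions" read the same value before either writes back, so one update is lost.

public class LostUpdateDemo {
    static int balance = 100;

    public static void main(String[] args) throws InterruptedException {
        // Each "transaction" reads balance, pauses (so the two interleave), then writes back.
        Thread t1 = new Thread(() -> { int read = balance; pause(); balance = read + 10; });
        Thread t2 = new Thread(() -> { int read = balance; pause(); balance = read + 20; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        // A serial execution would print 130; here one write overwrites the other,
        // so the result is typically 110 or 120 -- an update was lost.
        System.out.println(balance);
    }

    static void pause() {
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}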


In a recoverable schedule, transactions commit only after (and if!) all transactions whose changes they read commit.

10. Lock-Based Concurrency Control
A DBMS must be able to ensure that only serializable, recoverable schedules are allowed and that no actions of committed transactions are lost while undoing aborted transactions. A DBMS typically uses a locking protocol to achieve this. 
A lock is a small bookkeeping object associated with a database object.
A locking protocol is a set of rules to be followed by each transaction (and enforced by the DBMS) to ensure that, even though actions of several transactions might be interleaved, the net effect is identical to executing all transactions in serial order.

11. Strict Two-Phase Locking (Strict 2PL)
Two rules:

1. If a transaction T wants to read (respectively, modify) an object, it first requests a shared (respectively, exclusive) lock on the object.
2. All locks held by a transaction are released when the transaction is completed.

The Strict 2PL algorithm allows only serializable schedules.

Deadlocks: a cycle of transactions waiting for locks to be released.

Ex: Transaction T1 sets an exclusive lock on object A, T2 sets an exclusive lock on B, T1 requests an exclusive lock on B and is queued, and T2 requests an exclusive lock on A and is queued. Now, T1 is waiting for T2 to release its lock and T2 is waiting for T1 to release its lock.
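The same wait cycle can be sketched in Java, with two ReentrantLocks standing in for exclusive locks on A and B. This is only an analogy of the cycle, not a DBMS lock manager, and the program intentionally hangs.

import java.util.concurrent.locks.ReentrantLock;

public class DeadlockDemo {
    static final ReentrantLock lockA = new ReentrantLock(); // stands in for object A
    static final ReentrantLock lockB = new ReentrantLock(); // stands in for object B

    public static void main(String[] args) {
        Thread t1 = new Thread(() -> {
            lockA.lock();                  // T1 takes an exclusive lock on A
            try {
                pause();                   // give T2 time to lock B
                lockB.lock();              // T1 now waits for B...
                try { /* work */ } finally { lockB.unlock(); }
            } finally { lockA.unlock(); }
        });
        Thread t2 = new Thread(() -> {
            lockB.lock();                  // T2 takes an exclusive lock on B
            try {
                pause();
                lockA.lock();              // ...while T2 waits for A: a deadlock cycle
                try { /* work */ } finally { lockA.unlock(); }
            } finally { lockB.unlock(); }
        });
        t1.start();
        t2.start();                        // the program hangs: each thread waits on the other
    }

    static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}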


To be continued......

1/03/2012

Java Collection and Map


Interface: Collection, List, Set, Map.
Class: ArrayList, LinkedList, Vector, HashSet, TreeSet, Hashtable, HashMap, TreeMap.
           (Stack inherits from Vector.)
Difference between List and Set:
List can have duplicated elements but Set cannot.


Difference between ArrayList, LinkedList and Vector:
ArrayList and LinkedList are not synchronized.
Vector is synchronized.


ArrayList and Vector are backed by an array, but with dynamic size.
LinkedList is based on a linked list; each node holds a pointer to the next element (in Java, the list is in fact doubly linked).


Looking up elements:
ArrayList and Vector support random access, but LinkedList does not; it must traverse from the head.


Inserting and deleting elements:
ArrayList must shift the elements behind the target index.
LinkedList only needs to adjust pointers (after walking to the position).
So if the target index is at the end of the list, ArrayList is fast; otherwise, LinkedList is fast.


So, when you need to add/remove many elements at random positions in the list and read sequentially from the head, LinkedList is better.
When you need to add/remove many elements at the end of the list and read at random positions, ArrayList is better.
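A small illustrative sketch of the trade-off (the loop sizes are arbitrary; this is not a rigorous benchmark):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListDemo {
    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();

        // Inserting at the head: ArrayList shifts every existing element (O(n) per insert),
        // LinkedList just links in a new node (O(1) per insert).
        for (int i = 0; i < 50_000; i++) {
            arrayList.add(0, i);
            linkedList.add(0, i);
        }

        // Random access: ArrayList indexes directly into its backing array (O(1)),
        // LinkedList must walk node by node from one end (O(n)).
        System.out.println(arrayList.get(25_000));
        System.out.println(linkedList.get(25_000));
    }
}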


Difference between HashSet and TreeSet:
HashSet: Elements are not sorted.
TreeSet: Elements are sorted; it is backed by a red-black tree.
               Either the elements in the set implement the Comparable interface, or a Comparator is supplied when the TreeSet object is constructed.
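A minimal sketch of both options:

import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetDemo {
    public static void main(String[] args) {
        // Option 1: the element type implements Comparable (String does),
        // so the natural ordering is used.
        TreeSet<String> natural = new TreeSet<>();
        natural.add("banana");
        natural.add("apple");
        System.out.println(natural);   // [apple, banana]

        // Option 2: supply a Comparator when constructing the TreeSet.
        TreeSet<String> byLength = new TreeSet<>(Comparator.comparingInt(String::length));
        byLength.add("banana");
        byLength.add("fig");
        System.out.println(byLength);  // [fig, banana]
    }
}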


Difference between Hashtable, HashMap and TreeMap:
TreeMap: Not synchronized.
                If it needs to be synchronized, use the Collections.synchronizedSortedMap method:
                SortedMap m = Collections.synchronizedSortedMap(new TreeMap(...));
                Elements are sorted.
The difference between Hashtable and HashMap is discussed in the Hash Table section below.



Hash Table (Reference WIKIPEDIA)

A Hash Table is a data structure for storing key/value pairs.


Slot: One unit for storing data; the container of data.
Hash function: maps a key to the index of a slot; it calculates an index from the data item's key and uses this index to place the data into the array.
Hash collision: hash function maps two different keys to one slot.
Load factor: the ratio n/s between the number n of stored entries and the size s of the array of buckets. It represents the portion of the s buckets in the structure that are filled with one of the n stored entries. If it approaches 0, many buckets are empty and memory is wasted.


Collision resolution:
1. Chaining
  • Each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to the same location.
  • Lookup worst case: O(n), when all the entries map to one single slot.
  (Figures: hash collision resolved by separate chaining; separate chaining with head records in the bucket array.)


2. Open Addressing
  • All entry records are stored in the bucket array itself.
  • When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found.

Advantage:
Lookup is fast, but when there are many collisions it slows down. So it is important to write a good hash function that reduces collisions while keeping the load factor reasonable. A minimal chaining sketch follows.
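A minimal Java sketch of separate chaining (a fixed-size toy table without resizing or null-key support, not the real java.util.HashMap):

import java.util.LinkedList;

public class ChainedHashTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    // Each slot of the bucket array points to a linked list (chain) of entries.
    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
    }

    // Hash function: map a key to a slot index (mask the sign bit, then mod).
    private int slot(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        for (Entry<K, V> e : buckets[slot(key)]) {
            if (e.key.equals(key)) { e.value = value; return; } // key exists: overwrite
        }
        buckets[slot(key)].add(new Entry<>(key, value));        // otherwise chain a new entry
    }

    public V get(K key) {
        for (Entry<K, V> e : buckets[slot(key)]) {              // walk the chain: O(n) worst case
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedHashTable<String, Integer> t = new ChainedHashTable<>(8);
        t.put("apple", 1);
        t.put("banana", 2);
        System.out.println(t.get("apple"));  // 1
        System.out.println(t.get("cherry")); // null
    }
}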


Difference between Hashtable and HashMap

  • Hashtable is synchronized, but HashMap is not.
  • Hashtable inherits from Dictionary, while HashMap implements the Map interface.
  • Hashtable does not allow a null key or null values, but HashMap does: it can have multiple keys mapped to null values, but only one null key. Also, with HashMap you must use containsKey() to check whether a specific key exists, because get() returning null could mean either that the key does not exist or that the key's value is null. (See the sketch after this list.)
  • Hashtable uses Enumeration, but HashMap uses Iterator.
  • In Hashtable, the default size of the hash array is 11, and it grows to 2*old + 1 each time. In HashMap, the default size is 16, and the size doubles when resizing.
  • Hashtable uses hashCode() directly, but HashMap recalculates the hash value and uses & (bit masking) instead of mod.
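A minimal sketch of the null-key difference (the final line throws NullPointerException):

import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> hashMap = new HashMap<>();
        hashMap.put(null, "ok");   // HashMap allows one null key
        hashMap.put("k", null);    // ...and any number of null values

        // get() returning null is ambiguous; containsKey() is not.
        System.out.println(hashMap.get("k"));          // null (the value is null)
        System.out.println(hashMap.containsKey("k"));  // true (the key exists)

        Map<String, String> hashtable = new Hashtable<>();
        hashtable.put(null, "boom");                   // throws NullPointerException
    }
}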

Algorithm: Trees (Reference WIKIPEDIA)


1. Binary Search Tree
    Feature:
  • The left subtree of a node contains only nodes with keys less than the node's key.
  • The right subtree of a node contains only nodes with keys greater than the node's key.
  • Both the left and right subtrees must also be binary search trees.
    The major advantage of binary search trees over other data structures is that the related sorting algorithms and search algorithms such as in-order traversal can be very efficient. A minimal sketch follows.
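In place of the figure, a minimal Java sketch of BST insert, search, and in-order traversal (integer keys, duplicates ignored; an illustration, not a library-grade implementation):

public class Bst {
    private static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    // Insert: smaller keys go into the left subtree, larger keys into the right.
    public void insert(int key) {
        root = insert(root, key);
    }

    private Node insert(Node node, int key) {
        if (node == null) return new Node(key);
        if (key < node.key) node.left = insert(node.left, key);
        else if (key > node.key) node.right = insert(node.right, key);
        return node; // duplicates are ignored
    }

    // Search: choose a side by comparison at each node; O(height).
    public boolean contains(int key) {
        Node node = root;
        while (node != null) {
            if (key == node.key) return true;
            node = (key < node.key) ? node.left : node.right;
        }
        return false;
    }

    // In-order traversal visits the keys in sorted order.
    private void inOrder(Node node) {
        if (node == null) return;
        inOrder(node.left);
        System.out.println(node.key);
        inOrder(node.right);
    }

    public static void main(String[] args) {
        Bst t = new Bst();
        for (int k : new int[] {4, 2, 6, 1, 3}) t.insert(k);
        System.out.println(t.contains(3)); // true
        t.inOrder(t.root);                 // prints 1 2 3 4 6
    }
}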
2. Red Black Tree
    Feature:
    A red–black tree is a binary search tree where each node has a color attribute, the value of which is either red or black. In addition to the ordinary requirements imposed on binary search trees, the following requirements apply to red–black trees:
  1. A node is either red or black.
  2. The root is black. (This rule is sometimes omitted from other definitions. Since the root can always be changed from red to black, but not necessarily vice versa, this rule has little effect on analysis.)
  3. All leaves are the same color as the root.
  4. Both children of every red node are black.
  5. Every simple path from a given node to any of its descendant leaves contains the same number of black nodes.

3. AVL Tree
    Feature:
    In an AVL tree, the heights of the two child subtrees of any node differ by at most one.
    AVL trees are more rigidly balanced than red-black trees, leading to slower insertion and removal but faster retrieval.
4. B Tree
    Feature:
    A B-tree is optimized for systems that read and write large blocks of data. It is commonly used in databases and filesystems.
    (Example figure: a B-tree of order 2 or order 5.)
5. Prefix Tree
    Feature:
  • A prefix tree, or trie, is an ordered tree data structure that is used to store an associative array where the keys are usually strings.
  • Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key it is associated with.
  • All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
    It saves storage space by sharing common prefixes, but can waste memory on sparsely filled child-pointer arrays. // Personal opinion
    The following are the main advantages of tries over binary search trees (BSTs):
  • Looking up keys is faster. Looking up a key of length m takes worst-case O(m) time. A BST performs O(log n) comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(m log n) time. Moreover, in the worst case log(n) will approach m. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines.
  • Tries are more space-efficient when they contain a large number of short keys, because nodes are shared between keys with common initial subsequences.
  • Tries facilitate longest-prefix matching, helping to find the key that shares the longest possible prefix with a given query string.
  • The number of internal nodes from root to leaf equals the length of the key. Balancing the tree is therefore of no concern.

    A trie can also be used to replace a hash table, over which it has the following advantages:
  • Looking up data in a trie is faster in the worst case, O(m) time, compared to an imperfect hash table. An imperfect hash table can have key collisions, where the hash function maps different keys to the same position in the hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically it is O(1), with O(m) time spent evaluating the hash.
  • There are no collisions of different keys in a trie.
  • Buckets in a trie, which are analogous to the hash-table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
  • There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  • A trie can provide an alphabetical ordering of the entries by key.

    Tries do have some drawbacks as well:
  • Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.
  • Some keys, such as floating-point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single and double format floating-point numbers.
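In place of the figure, a minimal Java sketch of a trie over lowercase keys 'a'–'z' (an assumption for illustration) with O(m) insert and lookup:

public class Trie {
    private static final int ALPHABET = 26; // assumes keys use only lowercase 'a'..'z'

    private static class Node {
        Node[] children = new Node[ALPHABET];
        boolean isKey; // marks a node whose path from the root spells a stored key
    }

    private final Node root = new Node(); // the root corresponds to the empty string

    // Insert: one node per character, created on demand; O(m) for a key of length m.
    public void insert(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Node();
            node = node.children[i];
        }
        node.isKey = true;
    }

    // Lookup: O(m), independent of how many keys the trie holds; no collisions possible.
    public boolean contains(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children[c - 'a'];
            if (node == null) return false;
        }
        return node.isKey;
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        t.insert("tea");
        t.insert("ten");
        System.out.println(t.contains("tea")); // true
        System.out.println(t.contains("te"));  // false: "te" is only a prefix
    }
}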

                                                                
Summary of time complexity for each kind of tree (m is the length of the key):

Name                 Construct (Best / Average / Worst)    Operation (Best / Average / Worst)    Self-Balanced
Binary Search Tree   O(n log n) / O(n log n) / O(n^2)      O(1) / O(log n) / O(log n)            No
Red Black Tree       ---- / O(n log n) / ----              O(1) / O(log n) / O(log n)            Yes
AVL Tree             ---- / O(n log n) / ----              O(1) / O(log n) / O(log n)            Yes
B Tree               ---- / O(n log n) / ----              O(1) / O(log n) / O(log n)            Yes
Prefix Tree          ---- / O(mn) / ----                   O(1) / O(m) / O(m)                    No