Wednesday, May 30, 2012

GIST NOTES 12 - Algorithms and Data Structures



[DISCLAIMER: This is solely for non-commercial use. I don't claim ownership of this content. This is the crux of all my reading, study, and analysis. Some of it consists of excerpts from famous books on the subject; some of it is my own contemplation upon experiments with hand-coded samples written in an IDE or notepad.


I've created this mainly to reduce an entire book into a few pages of critical content that we should never forget. Even after years, you won't need to read the entire book again to get back its philosophy. I hope these notes will help you replay the entire book in your mind once again.]


To understand these notes, you may first want to finish reading a book on this subject.


>One goal of sorting is to facilitate searching
>the number of comparisons is a rough measure of efficiency in any sorting algorithm
>simple sorts: straight selection sort, bubble sort and insertion sort [they use brute force]
>more complex (hence more efficient) sorts: merge sort, quick sort, heap sort
>Big O - Order of magnitude
>O(n) - when n is large, the other terms of the function are negligible compared to n
>O(n^2) - when n is large, the other terms of the function are negligible compared to n^2
>O(1) is called bounded time
>O(log2(n)) - logarithmic time
>O(n) - is linear time
>O(n*log2(n)) - n times log2(n) ; [quick/merge/heap sorts]
>O(n^2) - quadratic time
>O(2^n) - exponential time; these algorithms are extremely costly. An example of a problem for which the best known solution is exponential is the traveling salesman problem: given a set of cities and a set of roads that connect some of them, plus the lengths of the roads, find a route that visits every city exactly once and minimizes the total travel distance. Exponential times increase dramatically in relation to the size of N, growing so quickly that the computation time required for problems of this order may exceed the estimated life span of the universe!

>selection/bubble/insertion sorts are of order O(n^2)
>merge/quick/heap sorts employ a divide-and-conquer approach, because the N^2 cost shrinks rapidly when the problem is split: (N/2)^2 + (N/2)^2 is only N^2/2
>merge sort - log2(n) divide levels, and n comparisons at each merge level no matter how many merges are needed at that level (see the merge sort sketch below)
>quick sort - only if the pivot at each level divides the list into two equal halves do we get log2(n) levels; otherwise, the number of levels can grow up to n-1, degrading quick sort from its best performance of n*log2(n) to n*(n-1), which is n^2; so the choice of the pivot at each level matters a lot

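To make the log2(n) divide levels and the extra temp array concrete, here is a minimal top-down merge sort sketch (my own illustrative Java, not code from the book):

// Top-down merge sort: log2(n) levels of division, O(n) merge work per level.
public class MergeSortSketch {
    public static void sort(int[] a) {
        if (a.length < 2) return;
        int[] temp = new int[a.length];   // the extra temp array the notes mention
        sort(a, temp, 0, a.length - 1);
    }

    private static void sort(int[] a, int[] temp, int low, int high) {
        if (low >= high) return;
        int mid = (low + high) / 2;
        sort(a, temp, low, mid);          // divide: left half
        sort(a, temp, mid + 1, high);     // divide: right half
        merge(a, temp, low, mid, high);   // conquer: n comparisons per level
    }

    private static void merge(int[] a, int[] temp, int low, int mid, int high) {
        int i = low, j = mid + 1, k = low;
        while (i <= mid && j <= high)
            temp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];  // <= keeps the sort stable
        while (i <= mid) temp[k++] = a[i++];
        while (j <= high) temp[k++] = a[j++];
        System.arraycopy(temp, low, a, low, high - low + 1);
    }
}
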
>merge sort needs an extra temp array for sorting, but quick sort does not
>every recursive algorithm needs extra space to save the calls on the stack when recursing deeply
>divide-and-conquer algorithms that split the list in half in each recursion need log2(n) recursive calls, and hence log2(n) stack depth to save method-call context
>quick sort doesn't need to explicitly join the left list, pivot, and right list, because the algorithm is performed in place within the given list
>the in-place quick sort first selects a pivot index and calls the partition method, which returns a new pivot index; it then recurses on the sublists to the left and right of that new index, not around the originally chosen pivot index. The point is that, since partition rearranges elements in place, the pivot value itself has to be moved to accommodate values less than the pivot on the left side and values greater than the pivot on the right side (see the partition sketch below)
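
A minimal in-place quick sort using a Lomuto-style partition (an illustrative sketch, not the book's exact routine); note how partition moves the pivot itself into its final slot and returns that index:

// In-place quick sort; partition() returns the pivot's final resting index.
public class QuickSortSketch {
    public static void sort(int[] a, int low, int high) {
        if (low >= high) return;
        int p = partition(a, low, high);  // new pivot index after rearranging
        sort(a, low, p - 1);              // recurse left of the pivot
        sort(a, p + 1, high);             // recurse right of the pivot
    }

    // Lomuto partition: last element is the pivot; smaller values move left.
    private static int partition(int[] a, int low, int high) {
        int pivot = a[high];
        int i = low;                      // next slot for a value < pivot
        for (int j = low; j < high; j++)
            if (a[j] < pivot) swap(a, i++, j);
        swap(a, i, high);                 // move the pivot itself into place
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}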

>heap with complete binary tree - removing the top element and reheapifying (assume an array holding the heap): this process leaves a hole at the top of the tree, namely at the root's position. The intention now is to find the next biggest element in the tree (which will definitely be in the level below the hole, namely among the hole's children) and to move the last element of the array somewhere toward the front, so that the elements occupy one less slot in the array due to the removal of the root:

Here, we keep moving the hole down to the next level as long as one of the hole's children is greater than the last element in the array; this signifies that promoting such a child into the hole does not disturb the heap order. If neither of the hole's children is greater than the last element of the array, we move the last element into the current hole; this signifies that we have moved the hole down as far as possible and found a new home for the last element of the array
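
A sketch of this hole-moving removal for a max-heap stored in an array (illustrative Java; removeTop and the fixed-array API are my own choices):

// Remove the root of a max-heap held in heap[0..size-1] by moving a "hole"
// down from the root until the last array element can legally fill it.
public class MaxHeapSketch {
    public static int removeTop(int[] heap, int size) {
        int top = heap[0];
        int last = heap[size - 1];        // element that must be re-homed
        int hole = 0;                     // the root's position is now a hole
        while (true) {
            int child = 2 * hole + 1;     // left child of the hole
            if (child >= size - 1) break; // hole has reached the bottom
            // pick the larger of the hole's two children
            if (child + 1 < size - 1 && heap[child + 1] > heap[child]) child++;
            if (heap[child] <= last) break; // children no longer beat 'last'
            heap[hole] = heap[child];     // promote the child; hole moves down
            hole = child;
        }
        heap[hole] = last;                // last element finds its new home
        return top;                       // heap now occupies heap[0..size-2]
    }
}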

>no comparison-based sorting algorithm can do better than O(n*log2(n))
>to improve the performance of sorting algorithms, avoid extra method calls (prefer inlined code) and prefer the non-recursive version over the recursive one
>inherently unstable sorting algorithms are heap sort and quick sort

>binary search is generally better than linear search when N is large
>binary search works only on arrays; for linked structures, binary search takes the form of a binary search tree
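
A standard iterative binary search over a sorted array, for reference (illustrative Java):

// Iterative binary search: O(log2(n)) comparisons on a sorted array.
public class BinarySearchSketch {
    public static int indexOf(int[] sorted, int key) {
        int low = 0, high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;      // avoids overflow of (low + high)
            if (sorted[mid] == key) return mid;
            if (sorted[mid] < key) low = mid + 1;  // search the right half
            else high = mid - 1;                   // search the left half
        }
        return -1;  // not found
    }
}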

[Data Structures by Nell Dale - Good book, always concentrates on most complex methods]

>collision - when hashing produces same value for two items
>linear probing - when collision happens, start searching sequentially from the location returned by the hash function
>clustering - tendency of elements to become unevenly distributed in the hash table, with many elements clustering around a single location
>rehashing - resolving a collision by computing a new hash value based on the original location (the colliding hash value) rather than the item's key
>Quadratic probing - resolving a hash collision by using the rehashing formula (HashValue + I^2) % array-size, where I is the number of times that the rehash function has been applied (see the probing sketch after this list)
>Random probing - Resolving a hash collision by generating pseudo-random hash values in successive applications of the rehash function
>bucket - a collection of elements associated with a particular hash location
>chain - a linked list of elements that share the same hash location
>folding - a hash method that breaks the key into several pieces and concatenates or XORs some of them to form the hash value
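
A small open-addressing sketch of linear probing (illustrative Java; the class name and fixed capacity are my own choices, with the quadratic variant noted in a comment):

// Open addressing with linear probing: on a collision, scan forward
// sequentially from the slot the hash function returned.
public class LinearProbingTable {
    private final Integer[] slots;

    public LinearProbingTable(int capacity) { slots = new Integer[capacity]; }

    public void insert(int key) {
        int home = Math.floorMod(key, slots.length);  // the hash location
        for (int i = 0; i < slots.length; i++) {
            int probe = (home + i) % slots.length;    // linear probing
            // quadratic probing would use (home + i * i) % slots.length
            if (slots[probe] == null) { slots[probe] = key; return; }
        }
        throw new IllegalStateException("table full");
    }
}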

>In a binary search tree, while deleting or inserting nodes, reconnecting the tree is done through recursive calls and reassignment of nodes at each level. This is needed because children do not have back references to their parents; hence, as the recursive call descends from a parent, make sure to reassign the parent (or its child link) with the result returned by the recursive call (see the sketch below). Also, insertion and deletion don't care about the shape of the tree; one can rebalance the tree later
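
A sketch of that reassignment idiom for BST insertion (illustrative Java; the same pattern applies to deletion):

// BST insertion: each recursive call returns the (possibly new) subtree root,
// and the caller reassigns its child link, since nodes hold no parent pointer.
public class BstSketch {
    static class Node {
        int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    static Node insert(Node root, int value) {
        if (root == null) return new Node(value);   // new leaf replaces a null link
        if (value < root.value)
            root.left = insert(root.left, value);   // reassign on the way back up
        else
            root.right = insert(root.right, value); // reassign on the way back up
        return root;                                // unchanged subtree root
    }
}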

>to balance a tree, pick the element whose value is in the middle (sitting between the smallest and biggest elements of the tree), which divides the elements into two halves. Put the middle element at the root and insert the rest of the elements into the left and right subtrees without ever changing the root again. Use an INORDER traversal to store the tree's elements in an array, which yields the elements in ascending order automatically (need to check whether this approach works or not)

OFFICIAL ALGO:

Balance

Set count to tree.reset(INORDER).
For (int index = 0; index < count; index++)
    Set nodes[index] = tree.getNextItem(INORDER).
Set tree = new BinarySearchTree().
tree.InsertTree(0, count - 1).


InsertTree(low, high)

if (low == high)                    // Base case 1: one element
    tree.insert(nodes[low]).
else if ((low + 1) == high)         // Base case 2: two elements
    tree.insert(nodes[low]).
    tree.insert(nodes[high]).
else
    mid = (low + high) / 2.
    tree.insert(nodes[mid]).        // middle element becomes this subtree's root
    tree.InsertTree(low, mid - 1).
    tree.InsertTree(mid + 1, high).

>to replicate or clone a tree, save the tree in preorder into an array and recreate the new tree from that array; preorder always inserts a parent node before its children and hence maintains the parent-child relationships of the original tree
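
A sketch of cloning via preorder (illustrative Java; it reuses the Node and insert from the BST sketch above):

import java.util.ArrayList;
import java.util.List;

// Clone a BST by replaying its preorder sequence into a fresh tree:
// parents are recorded (and therefore reinserted) before their children.
public class CloneSketch {
    static void preorder(BstSketch.Node root, List<Integer> out) {
        if (root == null) return;
        out.add(root.value);         // parent first...
        preorder(root.left, out);    // ...then its children
        preorder(root.right, out);
    }

    static BstSketch.Node clone(BstSketch.Node root) {
        List<Integer> values = new ArrayList<>();
        preorder(root, values);
        BstSketch.Node copy = null;
        for (int v : values) copy = BstSketch.insert(copy, v);
        return copy;
    }
}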

>in trees, every node has only one parent; in graphs each node can have more than one parent

>graphs - a set of vertices and the edges that connect them

>graphs can be directed (called digraphs) or undirected (two-way)

>adjacent vertices - if the vertices are connected by an edge

>a binary tree is a special case of a directed graph(downward directed)

>consider an edge A----->B in which A is said to be "adjacent to B" and B is said to be "adjacent from A"

>A tree is a special case of a directed graph in which each vertex may only be adjacent from one
other vertex (its parent node) and one vertex (the root) is not adjacent from any other vertex.

>A complete graph is one in which every vertex is adjacent to every other vertex.

>A weighted graph is a graph in which each edge carries a value. Weighted graphs
can be used to represent applications in which the value of the connection between the
vertices is important, not just the existence of a connection.

>spanning tree -> a set of v-1 edges connecting all v vertices in the graph is called a spanning tree
>minimum spanning tree -> a spanning tree which has minimum total cost
>Minimum Spanning Tree (MST) - Kruskal's algorithm

This algorithm creates a forest of trees. Initially the forest consists of n single-node trees (and no edges). At each step, we add one edge (the cheapest one) so that it joins two trees together. If an edge were to form a cycle, it would simply link two nodes that are already part of a single connected tree, so that edge is not needed.
The basic algorithm looks like this:

Forest MinimumSpanningTree( Graph g, int n, double **costs ) {
   Forest T;
   Queue q;
   Edge e;
   int i;
   T = ConsForest( g );             /* n single-node trees, no edges */
   q = ConsEdgeQueue( g, costs );   /* priority queue of all edges   */
   for( i = 0; i < (n-1); i++ ) {
       do {
          e = ExtractCheapestEdge( q );
       } while ( Cycle( e, T ) );   /* discard edges that would form a cycle */
       AddEdge( T, e );
   }
   return T;
}
The steps are:
The forest is constructed - with each node in a separate tree.
The edges are placed in a priority queue.
Until we've added n-1 edges,
Extract the cheapest edge from the queue,
If it forms a cycle, reject it,
Else add it to the forest. Adding it to the forest will join two trees together.
Every step will have joined two trees in the forest together, so that at the end, there will only be one tree in T.
We can use a heap for the priority queue. The trick here is to detect cycles; for this, we need a union-find structure (a minimal sketch follows).
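
A minimal union-find sketch for that cycle test (illustrative Java); two endpoints sharing a root means the edge would close a cycle:

// Union-find with path compression: two vertices are in the same tree
// exactly when find() returns the same root for both.
public class UnionFindSketch {
    private final int[] parent;

    public UnionFindSketch(int n) {
        parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;  // each vertex is its own tree
    }

    public int find(int v) {
        if (parent[v] != v) parent[v] = find(parent[v]);  // path compression
        return parent[v];
    }

    // Returns false if u and v are already connected (edge would form a cycle).
    public boolean union(int u, int v) {
        int ru = find(u), rv = find(v);
        if (ru == rv) return false;   // same tree: reject this edge
        parent[ru] = rv;              // join the two trees
        return true;
    }
}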

>>Dijkstra's algorithm (named after its discoverer, E.W. Dijkstra) solves the problem of finding the shortest path from a point in a graph (the source) to a destination. It turns out that one can find the shortest paths from a given source to all points in a graph in the same time; hence this problem is sometimes called the single-source shortest-paths problem.

initialise_single_source( Graph g, Node s )
   for each vertex v in Vertices( g )
       g.d[v] := infinity
       g.pi[v] := nil
   g.d[s] := 0;
 

d  - array of best estimates of the shortest path to each vertex
pi - array of predecessors for each vertex

relax( Node u, Node v, double w[][] )
    if d[v] > d[u] + w[u,v] then
       d[v] := d[u] + w[u,v]
       pi[v] := u
The algorithm itself is now:
shortest_paths( Graph g, Node s )
    initialise_single_source( g, s )
    S := {}            /* Make S empty */
    Q := Vertices( g ) /* Put the vertices in a PQ */
    while not Empty(Q)
        u := ExtractCheapest( Q );
        AddNode( S, u ); /* Add u to S */
        for each vertex v in Adjacent( u )
            relax( u, v, w )
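
A compact Java rendering of the same algorithm (an illustrative sketch; the adjacency-matrix weight representation and the stale-entry priority queue are my own choices):

import java.util.Arrays;
import java.util.PriorityQueue;

// Dijkstra over an adjacency matrix w[u][v], with Double.POSITIVE_INFINITY
// marking "no edge". Returns d[], the shortest distance from s to each vertex.
public class DijkstraSketch {
    public static double[] shortestPaths(double[][] w, int s) {
        int n = w.length;
        double[] d = new double[n];          // best estimates (the d array)
        int[] pi = new int[n];               // predecessors (the pi array)
        Arrays.fill(d, Double.POSITIVE_INFINITY);
        Arrays.fill(pi, -1);
        d[s] = 0;

        // queue entries are {vertex, distance-at-insertion-time}
        PriorityQueue<double[]> q = new PriorityQueue<>(
                (a, b) -> Double.compare(a[1], b[1]));
        q.add(new double[] { s, 0 });

        while (!q.isEmpty()) {
            double[] entry = q.poll();       // ExtractCheapest
            int u = (int) entry[0];
            if (entry[1] > d[u]) continue;   // stale entry: u was already settled
            for (int v = 0; v < n; v++) {
                if (w[u][v] == Double.POSITIVE_INFINITY) continue;
                if (d[v] > d[u] + w[u][v]) { // relax( u, v, w )
                    d[v] = d[u] + w[u][v];
                    pi[v] = u;
                    q.add(new double[] { v, d[v] });
                }
            }
        }
        return d;
    }
}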

>>Huffman algorithm - uses a binary min-heap (a priority queue keyed on frequency) to encode symbols as efficient binary bit streams of varying length; infrequent symbols (chars) go farther from the root, frequent symbols get closer to the root. The weight of each node is the frequency of occurrence of its symbol, and a parent node's weight is the sum of its children's weights.

The algorithm keeps growing the tree by repeatedly removing the two lightest nodes from the heap and inserting a new parent node that combines them.
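
A sketch of that tree-building loop (illustrative Java; the Node type and symbol/frequency arrays are my own minimal choices):

import java.util.PriorityQueue;

// Build a Huffman tree: repeatedly merge the two lowest-weight nodes,
// so frequent symbols end up near the root (short codes).
public class HuffmanSketch {
    static class Node {
        final char symbol;       // meaningful only for leaves
        final int weight;        // frequency of occurrence
        final Node left, right;
        Node(char s, int w, Node l, Node r) {
            symbol = s; weight = w; left = l; right = r;
        }
    }

    static Node buildTree(char[] symbols, int[] freqs) {
        PriorityQueue<Node> heap =
                new PriorityQueue<>((a, b) -> a.weight - b.weight); // min-heap
        for (int i = 0; i < symbols.length; i++)
            heap.add(new Node(symbols[i], freqs[i], null, null));
        while (heap.size() > 1) {
            Node a = heap.poll(), b = heap.poll();   // two lightest subtrees
            heap.add(new Node('\0', a.weight + b.weight, a, b)); // merged parent
        }
        return heap.poll();   // root of the Huffman tree
    }
}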

> the order of insertion decides the shape of the tree; random order tends to give a balanced tree

>Heapifying - adding a new item to a heap involves moving the hole from a leaf upward toward the root
            - removing an item from a heap involves moving the hole (created by the removal of the node) downward


>>java HashMap uses linked lists (chaining) to resolve collisions

>>random number sequence - not an arbitrary sequence of numbers, but one with the property that every number in the range can occur equally often
>pseudo-random numbers - any man-made random number generator, once its implementation logic is known, is no longer truly random; but the sequence it generates has many of the mathematical properties of a real random number sequence; hence we call these pseudo-random number generators/sequences
>quasi-random number sequence - a sequence where certain chosen random-distribution properties are met, but not the others
>random number generation methods: 1. linear congruential method, and 2. additive congruential method

1. linear - using the modulus operator, a multiplier, and adding one to the current item produces the next item in the random sequence: (currentItem * c + 1) % m; here m decides the range within which the random numbers occur (0 to m-1); numbers of the form xxxE21 work well when chosen as the constant c, where E is an even digit
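
A minimal linear congruential generator sketch (illustrative Java; the constants are arbitrary samples of the suggested form, not prescribed values):

// Linear congruential generator: next = (current * c + 1) % m.
// m fixes the range [0, m-1]; c is a sample ending in "...421" (E21 with E even).
public class LcgSketch {
    private long current;
    private final long c = 1234421L;     // sample multiplier of the suggested form
    private final long m = 100000000L;   // range: 0 to m-1

    public LcgSketch(long seed) { current = seed % m; }

    public long next() {
        current = (current * c + 1) % m;
        return current;
    }
}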

2. additive - an arbitrary bit pattern is held in a (bit) register; XOR the rightmost two bits, right-shift the register by one position, and fill the leftmost void with the XOR result - voilà! you have a new random number. Keep doing this to produce a sequence of random numbers
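
A sketch of that shift-register step (illustrative Java; the 16-bit width and seed are my own choices):

// Additive-congruential / shift-register step: XOR the two rightmost bits,
// shift right by one, and feed the XOR result back in as the new top bit.
public class ShiftRegisterSketch {
    private int register = 0b1011_0110_1001_1101;  // arbitrary non-zero seed
    private static final int WIDTH = 16;

    public int next() {
        int xored = (register & 1) ^ ((register >> 1) & 1); // rightmost two bits
        register >>>= 1;                                    // right shift by one
        register |= xored << (WIDTH - 1);                   // fill the leftmost void
        return register;
    }
}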

>>sorting methods - Basically, the methods divide into three
types: those that sort in place and use no extra memory except perhaps for
a small stack or table; those that use a linked-list representation and so use
N extra words of memory for list pointers; and those that need enough extra
memory to hold another copy of the array to be sorted.

