CS 178, Spring 2009: Solutions to Homework
5 and Homework 6
Homework 5 Solutions:
Ques.1: Refer to examples in
notes, and
examples (and questions with solutions provided) in textbook.
Ques.2: Each record has 100
bytes,
and disk blocks are 512 bytes; therefore
number of records per block (blocking factor) is 512/100 = 5.
Since 5
records per block and
10,000 records, we have
file size N= 10,000/5=2,000 disk blocks.
- External Sorting has a complexity of 2N log_M
(N) (log to the base M), for a file of size N disk blocks and
using M input buffers. In the
example under consideration, we have N=2,000.
- If we use 2 input and 1 output buffer in the external merge
sort then M=2 and the number of disk accesses needed to sort the file
is 2*(2,000)*log_2(2,000) = 4,000 * log_2 (2,000) = 4,000*11= 44,000
disk accesses.
- If we use 10 input and 1 output buffers then M=10, and number
of disk I/O operations is 2*(2,000)*log_10 (2,000)= 4,000 *4= 16,000.
Ques.3: Successful lookup times for different file organizations.
(refer to textbook and notes).
In the example file, we have n=10,000 records, with file size N=2,000
disk blocks and p=5 records fit on the disk block of size 512 and 4
bytes
addressing for pointers to disk records.
- (1) Heap File: Average lookup time for a heap file is given by
N/2, and worst case is N. Therefore, average time= 2,000/2=1,000
accesses and worst case =2,000 disk accesses.
- (2) Hashing used with 1000 buckets. First note that if the file
is stored as sequential file then it does necessarily provide any
improvement since the hash function may not preserve the sequential
ordering. For a hash file with B buckets, we have n/B records per
bucket and therefore n/Bp disk blocks in each bucket. The lookup time
average case is 1/2(n/Bp). Therefore we have (n/Bp)= (10,000/1000)=10
records per bucket and 10/5=2 disk blocks per bucket. The average
lookup time is therefore 1 and worst case is 2 disk accesses.
- (3) Searching a sorted file (sorted according to the search key
field). The file has 2,000 disk blocks and binary search on this
sorted file takes log_2(2,000)=11 disk accesses.
- (4) Using a one-level clustered index and 10 bytes for search
key. If we have a clustered index, then we need only store the anchor
records (i.e., the index will contain search key value of only the
first record of each disk block in the file) and we have 2,000 index
records. Each record has 10 bytes for search key and 4 bytes for disk
address -- i.e., each index record needs 14 bytes. Therefore, 512/14=36
index records fit on a disk block and we will thus need
(2,000/36)=56 disk blocks to store the index file. Searching
this index file, which is sorted by search key field, will take
log_2(56)=6 disk accesses. Therefore we need 6 +1 to fetch the data
block.
- (6) Secondary/dense indexing using B+ trees. Observe that if it
is dense indexing then each record in the file has
an entry in the B+tree index and thus we need to store n=10,000 entries
in the index which is stored at the leaf level.
- If K is size (in bytes) of search key field and P is number of
bytes
for tree and data pointer, then for a degree m B+ tree we have
(2m-1)*(K) + 2m*P <= Disk block size. Therefore, in this example,
we have (2m-1)*(10) + 2m*4 <=512. Therefore, we get m <=18.
Worst case each leaf node is half full with (m-1)=17 entries in each;
for 10,000 index entries we get 10,000/17=588 leaf nodes.
- Asymptotic analysis now gives us lookup time as O(h) where h is
height of the tree and h<= 1 + log_m(number of tree nodes)= 1+
log_17(3125)= 4. Alternately, do the worst case analysis from scratch.
The root node at
level 0 can have one entry and two children at level 1 each with
(m-1)=17 entries for a total of 34 entires. Each node at level 1 has 18
children nodes at Level 2 for a total of (2*18)=36 nodes at level 2 and
36*17=612 entries. Each of these 36 nodes at Level 2 has 18 children
at Level 3 for a total of (36*18) =648 nodes at Level 3 with 17 entries
in each node with 648*17 = 11016 entries. Since we have 10,000
entires we will need 3 levels. The cost of a lookup is an access
at each level plus one more to access the data; for the example above
we get 4 for index access plus 1 for data = 5 total.
Homework 6 Solutions
Ques.1: The right hand side (RHS)
is a
selection
over the set r1 U r2 (r1 union r2) which could
be a large set since union if union
performed
first.
The left hand side (LHS) first
performs selection
on each relation, therefore
only a small set of tuples may
satisfy
sigma_P(r1) and
sigma_P(r2). Thus the intermediate
result is smaller, requiring less
disk space and fewer accesses
to the disk.
Ques.2:
For r1: 5,000 tuples and 10 per block = 500 blocks
For r2: 10,000 tuples and 25 per block = 400 blocks
-
(a) Simple block based iteration: for each block of r2, scan all blocks
of r1 ==> 400 X 500 accesses to r1 plus 400 for r2
Total no. of disk accesses = 200,400
However, if we have r1 at outer loop then: for each block of r1, read
every
tuple
of r2 ==> 500 X 400 accesses to r2 plus 500 for r2 = 200,500
-
(b) Merge Join: Two costs associated with sort-merge join- (1)
pre-processing
cost to sort the two files
and (2) cost of the join. For (2), the algorithm reads one block
of each file exactly once
==> 400 + 500 = 900 disk accesses
For sorting the two
files,
use the external sorting algorithms assuming M buffers (M-way merge
sort).
Time to sort a file with
n blocks is 2n log_M disk accesses--
to sort r1 we need
2*500 log_M 500
to sort r2 we need 2*400
log_M 400
Total time = 1800 + 1000
log_M 500 + 800 log_M 400
Since we are not given any size of
the main memory, we can assume that M=2 (i.e., 2 input buffers and 1
out put buffer). If M=2, log_2 400=9 and
log_2 500=9 therefore total time =
1800 + 9000 + 7200. (If
on the other hand, M=10 then we get log_10(400)=3 and log_10(500)=3
and the time is 1800+3000 + 2400.)
-
(c) Hash Join: First create hash index ==> read all blocks for
total of 400 +500 =900 disk accesses and write back all blocks after
computing hash value h(k) for another 900 accesses.
Next compute join using hash table ==> each tuple is read from
bucket,
and each block
in bucket is read exactly once ==> 500 accesses into r1 and 400
access
into r2.
Therefore total time for Hash join = 3*900 = 2700 disk accesses. Note
that we are assuming enough memory to
hold all the buckets.
- (d) Equality search (x=A) on r1 using clustered B-tree index on
r1. Assume search key field requires 10 bytes and disk block address is
10 bytes. Since we are going to use clustered index, we only need one
entry for each data block -- i.e., we need only 500 entries in the
index file. Since K=10, P=10 we can derive degree m of B tree as:
(2m-1)*(10+10) + 2m*10 <= 1024. Therefore m <= 17. Therefore, if
each node is half full we have 16 entires in each node and we need
500/16= 32 nodes. The root node can have 1 entry and we have 2 nodes
at Level 1. Each node at level 1 has 17 children for a total of 34
children at Level 2 which is the number of levels we need for a B-tree
with at least 32 nodes. For equality search, we need to access each
level of the B tree index plus one for the data block -- thus we get a
total of 3+1=4 disk accesses for a lookup.
- (e) Equality search (x=A) on r1 using Hash index with 1000
buckets. If we have 1000 buckets, we have 5,000/1000 = 5 records per
bucket. Since 10 tuples fit in each disk block, each bucket has one
disk block. For an equality lookup, we need to compute the hash and
look up the appropriate bucket -- assuming hash table is stored
in memory we have one disk access for a lookup. i.e., fetch the one
disk block in the bucket and search in that disk block.
Ques.3: We can provide a number of access plans for this.
Firstly, note that no statistics are provided for us -- this makes it
difficult to estimate the sizes of the intermediate results.
- In one plan, we can first do the equality selection of r1.A=Smith
using a Hash table with 1000 buckets (part e in previous question) and
hash on name. Assuming uniform distribution across the buckets, we can
do this in one disk access. We are only fetching one disk block (i.e.,
one bucket), so we can estimate that the result is only 1 disk block
with 10 records. Using a block nested join, we can lookup all 400
blocks of r2 to compute the join, thus completing the query in a total
of 401 disk accesses.
- In a second plan, we can use the clustered B+ tree index to do
the equality search. Since we have clustered we only need 500 entries
in the index file. We can derive the degree m of the B+ tree
as: (2m-1)*10 + 2m*10 <=1024 and therefore m<= 25. Since we
have 500 entries we need 500/24=21 leaf nodes and we have a height 2
(3 levels) tree. Therefore we need 3+1=4 disk accesses to do equality
search. For this tuple we search through all 400 blocks of r2 to
determine the join. Thus completing the query in 4+400 disk accesses.
- In a third plan, we can use the unclustered B+ tree index to do
equality search. For unclustered we need 5,000 entries in index file
and therefore 5,000/24=209 leaf nodes. Therefore we need height 3 tree
for a total of 5 disk accesses to do the equality search. The query now
takes 405 accesses.
- Using a sorted file means we can do the search in log_2(500)=9
accesses, and then another 400 to do the join for a total of 409
accesses.
- Note that if we do the join first,then we need a *much* larger
number of disk accesses..For example, even a hash join will take 2700
disk accesses (see question 2).
- Note that having an index on r2 can provide significant
improvements. For example, we do the equality using Hash as before to
get the tuples of Smith -- these are in one disk block and at most 5
tuples. Next use a clustered B+ tree index on attribute C in relation
r2 (similar to part d of previous question) to search for tuples with
same C value as tuples in r1 with name Smith.
- We have 10,000 tuples in r2 and 400 disk blocks. If we have
clustered index then we need only 400 entries in index. We can derive
the degree m of the B+ tree as: (2m-1)*10 + 2m*10 <=1024 and
therefore m<= 25. Since we have 400 entries we need 400/24= 17 leaf
nodes and we have a height 2 tree. Therefore we need 3+1 disk accesses
to fetch data block for each lookup. Since we have 5 lookups (one for
each record in the disk block of r1), we have 5*4=20 disk accesses to
compute the query.
- If we have unclustered index then we need 10,000/24 = 417 leaf
nodes and we need height 3 B+ tree and 4+1 disk accesses to fetch data
block for each lookup. Therefore we need 5*5=25 disk accesses to
compute query.
- To get a more precise complexities, we need to collect the
statistics on the data. i.e., we can estimate (instead of assuming as
we did in the answer) the number of tuples with name Smith, and thus
better estimate our intermediate results.