Module 13: Probability and Algorithms


Introduction

 

Probability is the study of "chance":

The mathematics of probability is used to answer questions such as:

 

Probability and computer science:


Simple Estimation

 

In-Class Exercise 13.1: Consider the following situation. I place three closed boxes A, B and C, in front of you, one of which contains "the treasure" and the others are empty. The box containing the treasure has a secret marking on the exterior that only I can identify (So I know where the treasure is). You guess one box by pointing to it. Suppose you pick Box A. Now one or both of boxes B or C must be empty. If both are empty, I pick one of them randomly, else I pick the only empty one. Then, I show you that it is empty by opening it. After this, there are two boxes left, one of which contains the treasure. I now let you change your selection before finally revealing where the treasure is. The question is: does the option to change your selection matter?
 

Consider a dice-estimation example:

 

A game example:

 

In-Class Exercise 13.2: Write a program to estimate which of the two strategies for the 3-box problem (in exercise 13.1) is better. To do this, you can use 2 players, one of whom bets on "changing", the other who "stays with the current box". Then see whether anyone tends to win in the long run (playing multiple games). You will need UniformRandom.java.
 

A "real-valued" game example:

 

In-Class Exercise 13.3: Download the above code and experiment: what value of R makes the game fair?


Estimation and Accuracy

 

About data:

 

Estimation:

  • Suppose we observe n samples
    => the outcomes of n experiments.

  • Denote the samples by s1, s2, ..., sn.

  • Example:
    • Suppose roll a die 8 times and outcomes are 1, 2, 1, 1, 3, 6, 4, 4.
    • Then, s1 = 1, s2 = 2, s3 = 1, ..., s8 = 4.

  • Definition: the sample mean M is the average of the n samples:

    M = (s1 + s2 + ... + sn) / n.

  • Definition: the sample variance V is the average of squared-distances from the mean.

    V = ( (s1-M)2 + (s2-M)2 + ... (sn-M)2 ) / n

  • Definition: the sample standard deviation D is the square-root of the variance.

    D = V 0.5

 

Accuracy in estimation:

  • Recall the Two-Point-Game: the value R was close to but not 0.5.
    => how can we tell for sure it's not 0.5?

  • Number of samples: how many samples are "sufficient"?

  • The theory of probability provides a way to determine how many samples are "sufficient":
    • Suppose n samples are used to obtain estimates of mean and variance, say M' and V'.
    • Then, compute d = 1.96 * sqrt (V'/n)
    • Then the probability the real mean lies in the range [M'-d, M'+d] is 0.95.
    • The interval [M'-d, M'+d] is called a confidence interval.

  • Many statistical techniques help in improving the quality of estimation.
    (We would need a course on statistics to cover these methods).

  • Sample Java code:
    • We will define a useful class to collect data, accumulate totals, variance and compute the confidence interval.
    • To use the class:
      
           Statistics stats = new Statistics ();
           for (int i=0; i < numTrials; i++) {
             // Produce i-th data. 
             stats.add ( ... );
           }
           // Summary: 
           System.out.println (stats);
           
    • The code: (source file)
      
      public class Statistics {
      
        long numSamples = 0;
        double sampleTotal = 0;
        double sampleMean = 0;
        double sampleVar = 0;
        
        public void addData (double x)
        {
          // Increment number of samples obtained. 
          numSamples ++;
      
          // Accumulate total for sample mean. 
          sampleTotal += x;
      
          // First, save current mean. 
          double oldMean = sampleMean;
      
          // Now update sample mean. 
          sampleMean = sampleTotal / numSamples;
      
          // Use recurrence relation for variance. 
          if (numSamples == 1) {
                  sampleVar = 0;
          }
          else {
            sampleVar = (1 - 1.0 / (double) (numSamples - 1)) * sampleVar 
            + numSamples * (sampleMean - oldMean) * (sampleMean - oldMean);
          }
        }
      
        public double getMean ()
        {
          return sampleMean;
        }
      
        public double getVariance ()
        {
          return sampleVar;
        }
      
        public double getConfidenceHalfInterval ()
        {
          // mean +- this value, i.e., this is half the interval. 
          return 1.96 * Math.sqrt (sampleVar / numSamples);
        }
      
        public String toString()
        {
          if (numSamples == 0) 
            return "No samples";
          else 
            return getMean() + " +- " + getConfidenceHalfInterval() + "  #samples=" + numSamples;
        }
      
      }
         
 

In-Class Exercise 13.4: Use the above Statistics.java class to obtain an accurate estimate for the mean distance between two points randomly generated in the unit square. (The code for generating the two points has been given prior to Exercise 13.3).


Histograms

 

What are they?

  • Sometimes, estimating just the mean (average value) does not provide a complete picture
    => would like some sense of the "spread" of data.

  • A histogram provides a way to estimate "spread".
 

Example: a group of coin-flips

  • Suppose we flip a coin n times: how many "heads" are obtained?

  • The average number of heads: n / 2.

  • However, what does this say about how likely it is to get n / 3 heads?

  • Let P(i) be the probability that i heads are obtained
    (Here, 0 <= i <= n).

  • We will write code to estimate each P(i)
    => a histogram.
    • How do we flip a coin programmatically?
      
            // Produce a random number between 0 and 1. 
            double u = UniformRandom.uniform ();
            if (u < 0.5) {
              // Heads 
            }
            else {
              // Tails. 
            }
            
    • We will a flip a coin n times and count the number of heads.
    • Let bins[i] = how many times i heads appeared.
    • Sample Java code: (source file)
    
      static void computeHistogram (int numFlips, int numTrials)
      {
        // Create space for the bins. 
        int[] bins = new int [numFlips+1];
    
    
        // Perform trials. 
    
        for (int i=0; i < numTrials; i++) {
    
          // Flip coins "numFlips" times and count. 
          int count = 0;
          for (int j=0; j < numFlips; j++) {
            if (UniformRandom.uniform() < 0.5) {
              // Heads. 
              count ++;
            }
          }
    
          // Update counter. 
          bins[count] ++;
        }
    
        System.out.println ("Histogram: ");
        System.out.println ("  Bin   #Trials   Fraction");
        for (int i=0; i <= numFlips; i++) {
          System.out.println ("  " + i + "      " + bins[i] + "        " 
                              + ((double)bins[i]/(double)numTrials) );
        }
      }
        

  • The usage of "bin" will become clear below.
 

Example: evaluating whether a distribution is uniform

  • This example is for "real-valued" data.
    • For the coin flip case, there were only n+1 outcomes: { 0, 1, 2, ..., n }.
    • For real-valued data, the outcomes are infinite.
      => we will create "intervals" or "bins"

  • Sample Java code:
    
      static void computeHistogram (int numBins, int numTrials)
      {
        // Space for the bins. 
        int[] bins = new int [numBins];
    
        // Length of a single small interval. 
        double interval = 1.0 / (double) numBins;
    
        // Perform trials. 
        for (int i=0; i < numTrials; i++) {
    
          // Generate a value. 
          double u = UniformRandom.uniform ();
    
          // Find out where this value lies (which bin or interval). 
          int whichBin = (int) Math.floor (u / interval);
    
          // Update counter. 
          bins[whichBin] ++;
        }
    
        System.out.println ("Histogram: ");
        System.out.println ("  Bin   #Trials   Fraction");
        for (int i=0; i < numBins; i++) {
          System.out.println ("  " + i + "      " + bins[i] + "        " 
                              + ((double)bins[i]/(double)numTrials) );
        }
      }
       

  • Example with 10 bins:
    • With 10 bins and 10 samples, here is the output of a sample run:
      
      Histogram: 
        Bin   #Trials   Fraction
        0      0        0.0
        1      0        0.0
        2      1        0.1
        3      1        0.1
        4      2        0.2
        5      1        0.1
        6      1        0.1
        7      2        0.2
        8      2        0.2
        9      0        0.0
          
      Note:
      • We have also printed out the fraction of occurences for each bin.
      • The fraction should be around 0.1 (about 1/10-th) for a large number of samples.
    • With 1000 samples:
      
      Histogram: 
        Bin   #Trials   Fraction
        0      100       0.1
        1      98        0.098
        2      108       0.108
        3      105       0.105
        4      117       0.117
        5      103       0.103
        6      112       0.112
        7      81        0.081
        8      92        0.092
        9      84        0.084
          
    • With 100000 samples:
      
      Histogram: 
        Bin   #Trials   Fraction
        0      9854       0.09854
        1      10141      0.10141
        2      9914       0.09914
        3      9967       0.09967
        4      10101      0.10101
        5      9998       0.09998
        6      10144      0.10144
        7      10016      0.10016
        8      9895       0.09895
        9      9970       0.0997
          
 

In-Class Exercise 13.5: Obtain a histogram (use 10 bins) of the distance between two randomly generated points in the unit square. Is the distribution uniform?


Monte Carlo Estimation

 

What is it?

  • Monte Carlo estimation is a way of computing functions (usually definite integrals) using random generation.

  • Key ideas:
    • Consider a function that is difficult to integrate, e.g,

    • First, draw a box around it of any height:

    • Now suppose we generate points randomly in the box:

    • Let N be the total number of points generated.
    • Let Nf be the number points that fall below the curve.
    • Let A be the area of the box.
    • Let Af be the area under the curve.
    • Then, for large enough N

      Nf / N = Af / A

      Therefore

      Af = (A * Nf) / N

  • Note that all quantities on the right are easy to determine.
 

Example:

  • Let's compute the integral of an easy function like f(x) = x2 over the interval [0, 1].

  • Sample Java code: (source file)
    
          // Initialize count. 
          int count = 0;
    
          // Box dimensions and area. 
          double minY = 0;
          double maxY = 1;
          double boxArea = 1.0;
    
          // Perform trials. 
          for (int i=0; i < numTrials; i++) {
    
            // First pick an x value randomly. 
            double x = UniformRandom.uniform();
    
            // Now pick a y-value randomly in range. 
            double y = UniformRandom.uniform (minY, maxY);
    
            // Check if y < f(x) 
            if (y < x*x)
              count ++;
    
          }
          
          // Apply formula. 
          double area = boxArea * ((double) count / (double) numTrials);
    
          // Output. 
          System.out.println ("Area estimate: " + area);
       
 

In-Class Exercise 13.6: Find the area under the curve of the function

f(x) = e- x2/2

in the interval [-3, 3].


Generating Data Randomly

 

About random data generation:

  • Random generation is a subtantial topic in itself (several books on the subject).

  • There are efficient ways of using a standard uniform generator to generate from most distributions of interest.

  • There are many ways to generate uniform random numbers
    => the Lehmer modulo-generator is one of the most popular.
    (That is what we use in UniformRandom.java).

  • We will only consider a few simple cases.
 

Example: biased coin

  • Suppose we want to simulate coin flips with a biased coin
    => the probability of heads is not 0.5

  • Suppose instead the probability of "heads" is some number probHeads.

  • Sample Java code: (source file)
    
          if (UniformRandom.uniform() < probHeads) {
            // Heads. 
          }
       

  • Why does this work?

 

In-Class Exercise 13.7: Suppose we want to generate outcomes from rolling a biased die where the probabilities of getting {1, 2, 3, 4, 5, 6} are respectively {0.1, 0.2, 0.3, 0.15, 0.13, 0.12}. Use the following template and write code to generate from this distrbution.
 

Random permutations:

  • It is fairly easy to generate a random permutation of an array of values.

  • Sample code for permuting integer arrays: (source file)
    
      public static void permute (int[] A) 
      {
        for (int i=0; i < A.length; i++) {
    
          // Pick a random index in i+1 ... A.length-1 
          int k = (int) UniformRandom.uniform ( (int) i,  (int) A.length-1);
    
          // Swap. 
          int temp = A[i];
          A[i] = A[k];
          A[k] = temp;
    
        }
      }
       


Randomized Algorithms

 
Can an algorithm actually exploit randomness?

  • Recall Quicksort:
    • Pick the leftmost element as the median element.
    • Find the right place for the median.
    • Re-arrange elements to the left (smaller) and right (larger) of the median.
    • Recurse on left and right parts.

  • What could go wrong: leftmost element could be a poor choice.

  • Randomized quicksort: pick a random element as the median.

  • Does this really make a difference?
 
A different interpretation of worst-case performance:
  • For standard quicksort, we ask a devil's advocate question:
         => Can we pick a worst-case input to make the algorithm perform badly?

  • Answer: yes
         => Elements already sorted.

  • An important point: for standard quicksort, we always have such a worst-case input at our disposal.

  • For randomized quicksort:
         => Can we pick a worst-case input to make randomized-quicksort perform badly?

  • Answer: only with extremely low probability.

  • Only if we knew the random choices, could we pick such a worst-case input at will.
         => But we don't know the random choices ahead of time.
 
Let's look at a more revealing example of how randomness can help:
  • Consider a group of devices that need to communicate wirelessly:
    • Each device has a transmitter/receiver pair.
    • Each device has "stuff" (data) to send to others.

  • Without coordination, there can be collisions
         => Two or more devices transmitting simultaneously.

  • One possible solution:

    • Divide time into fixed slots.
    • Each device is assigned its own slot.
           => Equivalently, the devices take turns.
 

In-Class Exercise 13.8: Identify some inefficiencies with the above round-robin approach.
 

A randomized algorithm (protocol):

  • Divide time into fixed-size slots (as before).

  • Any device can transmit in any slot.

  • Allow collisions to occur, but re-transmit whenever there's a collision.

  • A device randomly decides not to transmit with probability (1-p).
         => Equivalently, a device transmits with probability p.

  • Why might this be useful?
         => If p is small, the chances of a collision are small.
 

In-Class Exercise 13.9: Suppose there are n devices and Device 3 wants to transmit in a given slot. What is the probability that the transmission is successful?

  • How would you write a simulation to determine this number? Write pseudocode.
  • Use this template to implement your idea.
  • Can you write down a mathematical expression for the result?
 

In-Class Exercise 13.10: Suppose there are n=5 devices. What is the optimal value of p?

  • Start by suggesting what "optimal" might mean.
  • Write a small program to compute the optimal value.
  • Can you derive a formula for the optimal value?


Evaluating Algorithms

 

What to evaluate:

  • Performance: running time of an algorithm.

  • Quality: quality of solutions produced (if it's an optimization algorithm).
 

General approach:

  • Generate many random data sets.

  • Run algorithm on each data set.

  • Evaluate (performance or quality) on each data set.

  • Output average performance or quality.
 

Simple example: sorting

  • We will evaluate the running time of Java's sorting algorithm using several randomly created arrays (of integers).

  • Comparison with theoretical running time:
    • Java uses MergeSort
      => theoretical running time is O(n log(n)).
    • We will compute the ratio of the actual running time to theoretical time.
      => should be approximately constant.

  • Sample code: (source file)
    
      static double performTrials (int numTrials, int arraySize)
      {
        // Initialize array A. 
        int[] A = new int [arraySize];
        for (int i=0; i < arraySize; i++)
          A[i] = i;
    
        // We will measure the total time taken over all trials. 
        double totalTimeTaken = 0;
    
        // Repeat "numTrials" times: 
    
        for (int n=0; n < numTrials; n++) {
    
          // Randomly permute elements. 
          RandomPermutation.permute (A);
    
          // Start timing. 
          double startTime = System.currentTimeMillis();
    
          // Sort. 
          Arrays.sort (A);
    
          // Measure time taken. 
          double timeTaken = System.currentTimeMillis() - startTime;
    
          // Accumulate, for average. 
          totalTimeTaken += timeTaken;
          
        }
        
        // Return average time taken. 
        return totalTimeTaken / numTrials;
      }
       

  • To compare with theoretical time:
    
          // Perform comparison with each array size. 
          // Here, arraySizes[i] = size of i-th array. 
          for (int i=0; i < arraySizes.length; i++) {
    
            // Get estimate using randomly generated data. 
            double avgTimeTaken = performTrials (numTrials, arraySizes[i]);
    
            // Get theoretical running time. 
            double theoretical = theoreticalTime (arraySizes[i]);
    
            // Print ratio. 
            System.out.println ("Actual=" + avgTimeTaken + " theoretical=" + theoretical 
                                + " ratio=" + (avgTimeTaken/theoretical));
          }
        
 

More interesting example: the Marriage Problem

  • The Marriage Problem:
    • There are n men and n women.
    • Each man ranks the women in order of preference.
    • Each woman ranks the men in order of preference.
    • A (group) marriage is a 1-1 pairing of men and women.
    • A marriage is unstable if there exists two couples (Ma, Wb) and (Mc, Wd) such that
      1. Ma prefers Wd over Wb
      2. Wd prefers Ma over Mc
      If this were the case, Ma and Wd would elope.
      => unstable solution.
    • Goal: write an algorithm to produce a stable marriage.

  • The Proposal algorithm: "Man proposes, woman disposes"
    
    Algorithm: proposalAlgorithm (menPrefs, womenPrefs)
    Input: preferences of the men and women
    
         // Initially, nobody is married. 
    1.   UnmarriedList = place all men in list sorted by ID;
    
         // As long as there are unmarried guys (and therefore, unmarried gals) 
         // we'll find pairings. 
    
    2.   while UnmarriedList not empty
    
    3.     Mi = lowest-numbered guy in UnmarriedList
    
           // Mi tries his luck. 
    4.     for j=0 to n-1
    
    5.       Wk = woman who is Mi's j-th preference.
    
    6.       // Mi proposes to Wk. 
    7.       if Wk is not married
               // She accepts. 
    8.         Mi and Wk are married
    9.         exit for-loop
    10.      else if Wk prefers Mi to current husband
               // She elopes. 
    11.        Wk's current husband is dumped back on the UnmarriedList;
    12.        Mi and Wk are married
    13.        exit for-loop
    14.      endif
    
    15.    endfor
    
    16.  endwhile
    
    Output: stable marriage
        

  • Analysis:
    • At every inner most step (the if-statement), a pairing occurs only if Wk has not rejected Mi before.
    • But every woman rejects each man only once.
      => maximum O(n2) rejections.
    • Or, O(n2) executions in the if-statement
      => O(n2) time overall.
    • We have assumed operations in the if-statement take O(1) time
      => can be done if ranks are computed ahead of time (one time sorting cost of O(n log(n)).

  • What about random data?
    • O(n2) is worst-case.
    • Best-case: O(n) (first try).
    • What is the average running time?
      => use randomly generated data.
    • What is the average "quality"?
      => find the average ranking of each mate.

  • Estimating the time taken:
    • Create several random data sets: random preferences.
    • Run the algorithm each time.
    • Accumulate total running time, and find the average.
    • Sample code: (source file)
      
          double totalTimeTaken = 0;
      
          // Repeat "numTrials" times: 
      
          for (int n=0;  n< numTrials; n++) {
      
            // Create set of random preferences for men and women. 
            for (int i=0; i < numCouples; i++) {
              // Create random prefs. 
              RandomPermutation.permute (menPrefs[i]);
              RandomPermutation.permute (womenPrefs[i]);
            }
      
            // Start timing. 
            double startTime = System.currentTimeMillis();
      
            // Perform matching. 
            marry (menPrefs, womenPrefs);
      
            // Measure time taken. 
            double timeTaken = System.currentTimeMillis() - startTime;
      
            // Accumulate, for average. 
            totalTimeTaken += timeTaken;
      
          }
        
          // Return average time taken. 
          return totalTimeTaken / numTrials;
            

  • It is possible to show that the average running time is no worse than O(n log(n)).
 

In-Class Exercise 13.11: Download this template and estimate the "quality" of solution for both men and women. To do this, use the following:

  • Once a solution (marriage) is produced, examine the spouse for each person, and find where in the ranking of that person the spouse occured. (Thus, the rank is equivalent to "how desirable in my list is my assignent spouse").
  • Call this the "spouse-rank" for a married person.
  • Then, compute the average "spouse-rank" over all the men.
  • Compute the average "spouse-rank" over all the women.
  • For each particular marriage, you will compute: (1) the average "spouse-rank" for the men, and (2) the average "spouse-rank" for the women. Now, over multiple data sets (and therefore, multiple marriages), compute the averages of each of these averages.
  • You will also need the classes UniformRandom and RandomPermutation. Both can be downloaded together as a jar file or as a gzipped tar file.