Tuesday, March 15, 2016

DATA SIMPLIFICATION: The Many Uses of Random Number Generators


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 17, 2016). I hope I can convince you that this is a book worth reading.


Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

If you are among the many students and professionals who are intimidated by statistics, then fear no more! With a little imagination, random number generators (to be accurate, pseudorandom number generators) can substitute for a wide range of statistical methods.

As it happens, modern computers can perform two simple processes, easily and very quickly. These two processes are: 1) generating random numbers, and 2) repeating sets of instructions thousands or millions of times. Using these two computational steps, we can accurately predict outcomes that would be intractable to any direct mathematical analysis. You are about to be rewarded with simple methods whereby every statistical test can be replicated and every probabilistic dilemma can be resolved; usually with a few lines of code (1-5).

To begin, let's perform a few very simple simulations that confirm what we already know, intuitively. Imagine that you have a pair of dice, and you would like to know how often you might expect each of the numbers (from one to six) to appear after you've thrown one die (5).

Let's simulate 600,000 throws of a die, using the Perl script, randtest.pl:
     #!/usr/bin/perl
     $count = 0;
     while ($count < 600000)
        {
        $count++;
        $one_of_six = (int(rand(6))+1);
        $hash{$one_of_six}++;
        }
      while(($key, $value) = each (%hash))
        {
      print "$key => $value\n";
        }
     exit;
The script, randtest.pl, begins by setting a loop that repeats 600,000 times, each repeat simulating the cast of a die. With each cast of the die, Perl generates a random integer, 1 through 6, simulating the outcome of a throw. The most important line of code is:
$one_of_six = (int(rand(6))+1);
The rand(6) command yields a pseudorandom number of value less than 6. We integerize the result using Perl's int() function, which truncates anything past the decimal point. This produces integer values of 0,1,2,3,4, or 5. We increment each value to produce 1,2,3,4,5 or 6. The script yields the total number die casts that would be expected for each of the possible outcomes.

Here is the output of randtest.pl.
C:\ftp>perl randtest.pl
1 => 100002
2 => 99902
3 => 99997
4 => 100103
5 => 99926
6 => 100070
As one might expect, each of the six equally likely outcomes of a thrown die occurred about 100,000 times, in our simulation.

Repeating the randtest.pl script produces a different set of outcome numbers, but the general result is the same. Each die outcome had about the same number of occurrences.
C:\ftp>perl randtest.pl
1 => 100766
2 => 99515
3 => 100157
4 => 99570
5 => 100092
6 => 99900
Let's get a little more practice with random number generators, before moving onto more challenging simulations. Occasionally in scripts, we need to create a new file, automatically, during the script's run time, and we want to be fairly certain that the file we create will not have the same filename as an existing file. An easy way of choosing a filename is to grab, at random, printable characters, concatenating them into an 11 character string suitable as a filename. The chance that you'll encounter two files with the same randomly chosen filename is very remote. In fact, the likelihood that any two selected filenames are identical exceeds to 2 to the 44th power.

Here is a Perl script, random_filenames.pl, that assigns a sequence of 11 randomly chosen uppercase alphabetic characters to a file name:
#!/usr/bin/perl
for ($count = 1; $count <= 12; $count++)
  {
  push(@listchar, chr(int(rand(26))+65));
  }
$listchar[8]= ".";
$randomfilename = join("",@listchar);
print "Your filename is $randomfilename\n";
exit;
Here is the output of the ranfile.pl script:
c:\ftp>random_filenames.pl
Your filename is OAOKSXAH.SIT
The key line of code in random_filenames.pl is:
push(@listchar, chr(int(rand(26))+65));
The rand(26) command yields a random value less than 26. The int() command converts the value to an integer. The number 65 is added to the value to produce a value ranging from 65 to 90, and the chr() command converts the numbers 65 through 90 to their ASCII equivalent; which just happen to be the uppercase alphabet from A to Z. The randomly chosen letter is pushed onto an array, and the process is repeated until a 12 character filename is generated.

Here is the equivalent Python script, random_filenames.py, that produces one random filename
#!/usr/bin/python
import random
filename = [0]*12
filename = map(lambda x: x is "" or chr(int(random.uniform(0,25) + 65)), filenam
e)
print ''.join(filename[0:8]) + "." + ''.join(filename[9:12])
exit
Here is the outcome of the random_filenames.py script:
c:\ftp>random_filenames.py
KYSDWKLF.RBA
In both these scripts, as in all of the scripts in this section, many outcomes may result from a small set of initial conditions. It's much easier to write these programs and observe their outcomes than to directly calculate all the possible outcomes of a set of governing equations.[jb outline.txt]

Let's use a random number generator to calculate the value of pi, without measuring anything, and without resorting to summing an infinite series of numbers. Here is a simple python script, pi.py, that does the job.
#!/usr/bin/python
import random
from math import sqrt
totr = 0
totsq = 0
for iterations in range(10000000):
  x= random.uniform(0,1)
  y= random.uniform(0,1)
  r= sqrt((x*x) + (y*y))
  if r < 1:
    totr = totr + 1
  totsq = totsq + 1
print float(totr)*4.0/float(totsq)
exit
The outcome of the pi.py script is:
output:
c:\ftp\py>pi.py
3.1414256
Here is an equivalent Ruby script. [jb]
#!/usr/local/bin/ruby
x = y = totr = totsq = 0.0
(1..100000).each do
  x = rand()
  y = rand()
  r = Math.sqrt((x*x) + (y*y))
  totr = totr + 1 if r < 1
  totsq = totsq + 1
end
puts (totr *4.0 / totsq)
exit
Here is an equivalent Perl script. [jb]
#!/usr/local/bin/perl
open(DATA,">pi.dat");
for (1..10000)
  {
  $x = rand();
  $y = rand();
  $r = sqrt(($x*$x) + ($y*$y));
  if ($r < 1)
    {
    $totr = $totr + 1;
    print DATA "$x\ $y\n";
    }
  $totsq = $totsq + 1;
  }
print eval(4 * ($totr / $totsq)); 
exit;
As one would hope, all three scripts produce approximately the same value of pi. The Perl script contains a few extra lines that produces an output file, named pi.dat, that will helps us visualize how these scripts work. The pi.dat script contains the x,y data points, generated by the random number generator, meeting the "if" statement's condition that the hypotenuse of the x,y coordinates must be less than one (i.e., less than a circle of radius 1). [jb]

We can plot the output of the script with a few lines of Gnuplot code:
gnuplot> set size square
gnuplot> unset key
gnuplot> plot 'c:\ftp\pi.dat'
The resulting graph is a quarter-circle within a square.


The data points produced by 10,000 random assignments
of x and y coordinates in a range of 0 to 1.
Randomly assigned data points whose hypotenuse
exceeds "1" are excluded.


The graph shows us, at a glance, how the ratio of the number of points in the quarter-circle, as a fraction of the total number of simulations, is related to the value of pi.


- Jules Berman (copyrighted material)

key words: computer science, data analysis, data repurposing, data simplification, simplifying data, random, pseudorandom, resampling, probability, simulations, Monte Carlo jules j berman

References:

[1] Simon JL. Resampling: The New Statistics. Second Edition, 1997. Available online at: http://www.resample.com/intro-text-online/, viewed on September 21, 2015.

[2] Efron B, Tibshirani RJ. An Introduction to the Bootstrap. CRC Press, Boca Raton, 1998.

[3] Diaconis P, Efron B. Computer-intensive methods in statistics. Scientific American, May, 116-130, 1983. Comment. Oft-cited explanation of resampling statistics, a field largely credited to Bradley Efron. The articles contains examples in Basic, Pascal, and Fortran source code.

[4] Anderson HL. Metropolis, Monte Carlo and the MANIAC. Los Alamos Science 14:96-108, 1986. Available at: http://library.lanl.gov/cgi-bin/getfile?00326886.pdf, viewed September 21, 2015.

[5] Berman JJ. Biomedical Informatics. Jones and Bartlett, Sudbury, MA, 2007.

No comments: