Saturday, March 19, 2016

DATA SIMPLIFICATION: Persistent Data


This is the last of my blog posts on topics selected from Data Simplification: Taming Information With Open Source Tools (released March, 2016). I hope that as you page back through my posts on Data Simplification topics, appearing throughout this month's blog, you'll find that this is a book worth reading.


Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

A file that big?
It might be very useful.
But now it is gone.

-Haiku by David J. Liszewski

Your scripts create data objects, and the data objects hold data. Sometimes, these data objects are transient, existing only during a block or subroutine. At other times, the data objects produced by scripts represent prodigious amounts of data, resulting from complex and time-consuming calculations. What happens to these data structures when the script finishes executing? Ordinarily, when a script stops, all the data produced by the script simply vanishes.

Persistence is the ability of data to outlive the program that produced it. The methods by which we create persistent data are sometimes referred to as marshalling or serializing. Some of the language-specific methods go by such colorful names as data dumping, pickling, freezing/thawing, and storable/retrieve.

Data persistence can be ranked by level of sophistication. At the bottom is the exportation of data to a simple flat-file, wherein each record is one line in length, and each line consists of a record key followed by a list of record attributes. The simple spreadsheet stores data as tab-delimited or comma-separated line records. Flat-files can contain a limitless number of line records, but spreadsheets are limited in the number of records they can import and manage. Scripts can be written that parse through flat-files line by line (i.e., record by record), selecting data as they go. Software programs that write data to flat-files achieve a crude but serviceable type of data persistence.
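
As a minimal sketch of flat-file persistence in Python, the following script writes a few records to a tab-delimited file, one record per line with the record key first, and then parses the file line by line to select records. The file name, record keys, and attributes are hypothetical, chosen only for illustration.

```python
import csv
import os
import tempfile

# Hypothetical records: a record key followed by a list of attributes.
records = {
    "rec001": ["apple", "red", "3"],
    "rec002": ["banana", "yellow", "7"],
}

# Illustrative file location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "records.tsv")

# Write one tab-delimited line per record: key first, then attributes.
with open(path, "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    for key, attributes in records.items():
        writer.writerow([key] + attributes)

# Later (possibly in another script), parse the file record by record,
# keeping only the records of interest.
selected = {}
with open(path, newline="") as infile:
    for row in csv.reader(infile, delimiter="\t"):
        key, attributes = row[0], row[1:]
        if attributes[1] == "yellow":
            selected[key] = attributes

print(selected)
```

Everything comes back as text, so numeric attributes must be converted by the script that reads the file.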

A middle-level technique for creating persistent data is the venerable database. If nothing else, databases are made to create, store, and retrieve data records. Scripts that have access to a database can achieve persistence by creating database records that accommodate data objects. When the script ends, the database persists, and the data objects can be fetched and reconstructed for use in future scripts.
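
One way to sketch database-backed persistence in Python is with the sqlite3 module from the standard library. The database file name, table name, and columns below are assumptions made for the example; any database reachable from a script would serve the same purpose.

```python
import os
import sqlite3
import tempfile

# Illustrative database file; the name and schema are assumptions.
db_path = os.path.join(tempfile.mkdtemp(), "objects.db")

# First "script": store the data object as database records.
fruit_colors = {"apple": "red", "banana": "yellow"}
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE fruit (name TEXT PRIMARY KEY, color TEXT)")
conn.executemany("INSERT INTO fruit VALUES (?, ?)", fruit_colors.items())
conn.commit()
conn.close()

# Second "script": reconnect later and reconstruct the object.
conn = sqlite3.connect(db_path)
restored = dict(conn.execute("SELECT name, color FROM fruit"))
conn.close()

print(restored == fruit_colors)
```

The two halves share nothing but the database file, which is what allows the data to outlive the script that created it.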

Perhaps the highest level of data persistence is achieved when complex data objects are saved in toto. Flat-files and databases may not be suited to storing complex data objects that hold encapsulated data values. Most languages provide built-in methods for storing complex objects, and a number of languages designed to describe complex forms of data have been developed. Data description languages, such as YAML (originally "Yet Another Markup Language", now officially "YAML Ain't Markup Language") and JSON (JavaScript Object Notation), can be adopted by any programming language.
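
To illustrate a data description language at work, here is a short Python sketch that saves a nested structure as JSON and reads it back, using the json module from the standard library. The structure mirrors the one used in the Perl example later in this post; the file name is an arbitrary choice.

```python
import json
import os
import tempfile

# A nested structure: a number, a string, a list, and another mapping.
data = {
    "number": 42,
    "string": "This is a string",
    "array": list(range(1, 11)),
    "hash": {"apple": "red", "banana": "yellow"},
}

# Illustrative file location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.json")

# Serialize the whole structure to a human-readable text file.
with open(path, "w") as outfile:
    json.dump(data, outfile, indent=2)

# Any later script, in any language with a JSON parser, can read it back.
with open(path) as infile:
    restored = json.load(infile)

print(restored["hash"]["banana"])
```

JSON covers only strings, numbers, booleans, null, lists, and string-keyed maps, so objects of other types need to be converted before they are saved.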

Data persistence is essential to data simplification. Without data persistence, all data created by scripts is volatile, obliging data scientists to waste time recreating data that has ceased to exist. Essential tasks such as script debugging and data verification become impossible. It is worthwhile reviewing some of the techniques for data persistence that are readily accessible to Perl, Python and Ruby programmers.

Perl will dump any data structure into a persistent, external file, for later use. Here, the Perl script, data_dump.pl, creates a complex associative array, "%hash", which nests within itself a string, an integer, an array, and another associative array. This complex data structure is dumped into a persistent structure (i.e., an external file named dump_struct).
#!/usr/local/bin/perl
use Data::Dump qw(dump);
%hash = (
    number => 42,
    string => 'This is a string',
    array  => [ 1 .. 10 ],
    hash   => { apple => 'red', banana => 'yellow'},);
open(OUT, ">dump_struct") or die "Cannot open dump_struct: $!";
print OUT dump \%hash;
close OUT;
exit;
The Perl script, data_slurp.pl picks up the external file, "dump_struct", created by the data_dump.pl script, and loads it into a variable.
#!/usr/local/bin/perl
use Data::Dump qw(dump);
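open(IN, "dump_struct") or die "Cannot open dump_struct: $!";
undef($/);          # slurp mode: read the whole file in one gulp
$data = eval <IN>;  # rebuild the data structure from the dumped text
close IN;
dump $data;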
exit;
Here is the output of the data_slurp.pl script, in which the contents in the variable "$data" are dumped onto the output screen:
c:\ftp>data_slurp.pl
{
  array  => [1 .. 10],
  hash   => { apple => "red", banana => "yellow" },
  number => 42,
  string => "This is a string",
}
Python pickles its data. Here, the Python script, pickle_up.py, pickles a string variable:
#!/usr/bin/python
import pickle
pumpkin_color = "orange"
# The with-block closes the file as soon as the pickle is written
with open("save.p", "wb") as outfile:
    pickle.dump(pumpkin_color, outfile)
The Python script, pickle_down.py, loads the pickle file, "save.p" and prints it to the screen.
#!/usr/bin/python
import pickle
# Reopen the pickle file in binary mode and rebuild the variable
with open("save.p", "rb") as infile:
    pumpkin_color = pickle.load(infile)
print(pumpkin_color)
The output of the pickle_down.py script is shown here:
c:\ftp\py>pickle_down.py
orange
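
Pickling is not limited to strings. As a rough sketch, the following script pickles a nested structure containing a tuple and a set, two Python types that a text format such as JSON would not preserve, and then confirms that they come back intact. The variable names, values, and file name are hypothetical.

```python
import os
import pickle
import tempfile

# A structure holding Python-specific types (illustrative values).
pumpkin = {
    "color": "orange",
    "dimensions": (30, 25),          # a tuple
    "tags": {"autumn", "squash"},    # a set
}

# Illustrative file location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "pumpkin.p")

# Pickle files are binary, so open them in "wb" and "rb" modes.
with open(path, "wb") as outfile:
    pickle.dump(pumpkin, outfile)

with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(type(restored["dimensions"]).__name__, type(restored["tags"]).__name__)
```

Loading a pickle can execute arbitrary code, so only unpickle files that come from a source you trust.
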
Where Python pickles, Ruby marshals. In Ruby, whole objects, with their encapsulated data, are marshalled into an external file and demarshalled at will. Here is a short Ruby script, object_marshal.rb, that creates a new class, "Shoestring", a new class object, "loafer", and marshals the new object into a persistent file, "output_file.per".
#!/usr/bin/ruby

class Shoestring < String   
  def initialize 
    @object_uuid = (`c\:\\cygwin64\\bin\\uuidgen.exe`).chomp
  end
  def object_uuid
    print @object_uuid
  end
end

loafer = Shoestring.new
output = File.open("output_file.per", "wb")
output.write(Marshal::dump(loafer))
exit
The script produces no output other than the binary file, "output_file.per". Notice that when we created the object, loafer, we included a method that encapsulates within the object a full uuid identifier, courtesy of cygwin's bundled utility, "uuidgen.exe".

We can demarshal the persistent "output_file.per" file, using the Ruby script, object_demarshal.rb:
#!/usr/bin/ruby

class Shoestring < String   
  def initialize 
    @object_uuid = `c\:\\cygwin64\\bin\\uuidgen.exe`.chomp
  end
  def object_uuid
    print @object_uuid
  end
end

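# Rebuild the object directly from the binary file. Splitting the
# file on "\n\n" could cut the marshalled data apart, so the file
# handle is passed to Marshal.load as a whole.
object = File.open("output_file.per", "rb") { |file| Marshal.load(file) }
puts object.object_uuid
puts object.class
puts object.class.superclass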
exit
The Ruby script, object_demarshal.rb, pulls the data object from the persistent file, "output_file.per" and directs Ruby to list the uuid for the object, the class of the object, and the superclass of the object.
c:\ftp>object_demarshal.rb
c2ace515-534f-411c-9d7c-5aef60f8c72a
Shoestring
String
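
For comparison, here is a hedged Python sketch of the same idea, using pickle in place of Ruby's Marshal. The class name and file name are borrowed from the Ruby example. The uuid module from Python's standard library stands in for cygwin's uuidgen.exe, which is a substitution made for portability and not part of the original scripts.

```python
import os
import pickle
import tempfile
import uuid


class Shoestring(str):
    """A string subclass whose instances carry their own identifier."""

    def __new__(cls, value=""):
        obj = super().__new__(cls, value)
        # uuid4() replaces the external uuidgen.exe call (an assumption).
        obj.object_uuid = str(uuid.uuid4())
        return obj


loafer = Shoestring()

# Illustrative file location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "output_file.per")

# Save the whole object, including its encapsulated identifier.
with open(path, "wb") as outfile:
    pickle.dump(loafer, outfile)

# Load it back; the saved identifier is restored along with the object.
with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored.object_uuid == loafer.object_uuid)
print(type(restored).__name__, type(restored).__mro__[1].__name__)
```

As with the Ruby scripts, the class definition must be available to whichever script loads the object, because the file stores the object's data and class name but not the class's code.
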
Perl, Python and Ruby all have access to external database modules that build database objects stored as external files, which persist after the script has finished executing. These database objects can be called from any script, and the data they contain can be accessed quickly, using a simple command syntax [1].

Here is a Perl script, lucy.pl, that creates an associative array and ties it to an external database file, using the SDBM_File (Simple DataBase Manager file) module.
#!/usr/local/bin/perl
use Fcntl;
use SDBM_File;
tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR|O_CREAT|O_EXCL, 0644;
$lucy_hash{"Fred Mertz"} = "Neighbor";
$lucy_hash{"Ethel Mertz"} = "Neighbor";
$lucy_hash{"Lucy Ricardo"} = "Star";
$lucy_hash{"Ricky Ricardo"} = "Band leader";
untie %lucy_hash;
exit;
The lucy.pl script produces a persistent, external file, from which any Perl script can access the associative array created in the prior script. If we look in the directory from which the lucy.pl script was launched, we will find two new SDBM (Simple DataBase Manager) files, lucy.dir and lucy.pag. These are the persistent files that will substitute for the %lucy_hash associative array when invoked within other Perl scripts.

Here is a short Perl script, lucy_untie.pl, that extracts the persistent %lucy_hash associative array from the SDBM file in which it is stored:
#!/usr/local/bin/perl
use Fcntl;
use SDBM_File;
tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR, 0644;
while(($key, $value) = each (%lucy_hash))
  {
  print "$key => $value\n";
  }
untie %lucy_hash;
exit;
Here is the output of the lucy_untie.pl script:
c:\ftp>lucy_untie.pl
Fred Mertz => Neighbor
Ethel Mertz => Neighbor
Lucy Ricardo => Star
Ricky Ricardo => Band leader
Here is the Python script, lucy.py, that creates a tiny external database.
#!/usr/local/bin/python
import dumbdbm
lucy_hash = dumbdbm.open('lucy', 'c')
lucy_hash["Fred Mertz"] = "Neighbor"
lucy_hash["Ethel Mertz"] = "Neighbor"
lucy_hash["Lucy Ricardo"] = "Star"
lucy_hash["Ricky Ricardo"] = "Band leader"
lucy_hash.close()
exit
Here is the Python script, lucy_untie.py, that reads all of the key/value pairs held in the persistent database created for the lucy_hash dictionary object.
#!/usr/local/bin/python
import dumbdbm
lucy_hash = dumbdbm.open('lucy')
for character in lucy_hash.keys():
  print character, lucy_hash[character]
lucy_hash.close()
exit
Here is the output produced by the Python script, lucy_untie.py.
c:\ftp>lucy_untie.py
Fred Mertz Neighbor
Ethel Mertz Neighbor
Lucy Ricardo Star
Ricky Ricardo Band leader
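
The lucy.py and lucy_untie.py scripts use the Python 2 module name, dumbdbm, and the Python 2 print statement. In Python 3, the same module is available as dbm.dumb, and keys and values come back as bytes. The sketch below is one possible Python 3 rendering of the two scripts combined; the temporary directory is an assumption added so that the example does not leave files behind in the working directory.

```python
import dbm.dumb
import os
import tempfile

# Illustrative location; dbm.dumb adds its own file extensions.
db_base = os.path.join(tempfile.mkdtemp(), "lucy")

# Create the persistent database ("c" creates it if it does not exist).
with dbm.dumb.open(db_base, "c") as lucy_hash:
    lucy_hash["Fred Mertz"] = "Neighbor"
    lucy_hash["Ethel Mertz"] = "Neighbor"
    lucy_hash["Lucy Ricardo"] = "Star"
    lucy_hash["Ricky Ricardo"] = "Band leader"

# Reopen it, as a separate script would, and read every key/value pair.
# Keys and values are stored as bytes, so decode them for display.
with dbm.dumb.open(db_base, "r") as lucy_hash:
    cast = {
        key.decode(): lucy_hash[key].decode()
        for key in lucy_hash.keys()
    }

for character, role in cast.items():
    print(character, role)
```
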
Ruby can also hold data in a persistent database, using the gdbm module. If you do not have the gdbm (GNU database manager) module installed in your Ruby distribution, you can install it as a Ruby GEM, using the following command line, from the system prompt:
c:\>gem install gdbm
The Ruby script, lucy.rb, creates an external database file, lucy.db:
#!/usr/local/bin/ruby
require 'gdbm'
lucy_hash = GDBM.new("lucy.db")
lucy_hash["Fred Mertz"] = "Neighbor"
lucy_hash["Ethel Mertz"] = "Neighbor"
lucy_hash["Lucy Ricardo"] = "Star"
lucy_hash["Ricky Ricardo"] = "Band leader"
lucy_hash.close
exit
The Ruby script, lucy_untie.rb, reads the associative array stored as the persistent database, lucy.db:
#!/usr/local/bin/ruby
require 'gdbm'
gdbm = GDBM.new("lucy.db")
gdbm.each_pair do |name, role|
  print "#{name}: #{role}\n"
end
gdbm.close
exit
The output from the lucy_untie.rb script is:
c:\ftp>lucy_untie.rb
Ethel Mertz: Neighbor
Lucy Ricardo: Star
Ricky Ricardo: Band leader
Fred Mertz: Neighbor
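
The database examples above store strings only. Python's standard library also offers the shelve module, which combines a dbm-style database with pickling, so that each key can hold a complex object. The following is a minimal sketch; the file name and the structure of the stored records are hypothetical.

```python
import os
import shelve
import tempfile

# Illustrative location; the underlying dbm backend may add extensions.
shelf_base = os.path.join(tempfile.mkdtemp(), "lucy_shelf")

# Store nested objects under string keys.
with shelve.open(shelf_base) as shelf:
    shelf["Lucy Ricardo"] = {
        "role": "Star",
        "neighbors": ["Fred Mertz", "Ethel Mertz"],
    }
    shelf["Ricky Ricardo"] = {"role": "Band leader", "neighbors": []}

# Reopen the shelf read-only, as a later script would.
with shelve.open(shelf_base, flag="r") as shelf:
    lucy = shelf["Lucy Ricardo"]
    names = sorted(shelf.keys())

print(lucy["role"], lucy["neighbors"])
print(names)
```

Because shelve relies on pickle, the same caution applies: open only shelf files from sources you trust.
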
Persistence is a simple and fundamental process ensuring that data created in your scripts can be recalled by yourself or by others who need to verify your results. Regardless of the programming language you use, or the data structures you prefer, you will need to familiarize yourself with at least one data persistence technique.


- Jules Berman (copyrighted material)

key words: computer science, data science, data analysis, data simplification, simplifying data, persistence, databases, jules j berman

References:

[1] Berman JJ. Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby. Chapman and Hall, Boca Raton 2010.
