This is the last of my blogs related to topics selected from Data Simplification: Taming Information With Open Source Tools (released March, 2016). I hope that as you page back through my posts on Data Simplification topics, appearing throughout this month's blog, you'll find that this is a book worth reading.
A file that big?
It might be very useful.
But now it is gone.
-Haiku by David J. Liszewski
Your scripts create data objects, and the data objects hold data. Sometimes, these data objects are transient, existing only during a block or subroutine. At other times, the data objects produced by scripts represent prodigious amounts of data, resulting from complex and time-consuming calculations. What happens to these data structures when the script finishes executing? Ordinarily, when a script stops, all the data produced by the script simply vanishes.
Persistence is the ability of data to outlive the program that produced it. The methods by which we create persistent data are sometimes referred to as marshalling or serializing. Some of the language-specific methods are called by such colorful names as data dumping, pickling, freezing/thawing, and storable/retrieve.
Data persistence can be ranked by level of sophistication. At the bottom is the exportation of data to a simple flat-file, wherein each record occupies one line, and each line consists of a record key followed by a list of record attributes. The simple spreadsheet stores data as tab-delimited or comma-separated line records. Flat-files can contain a limitless number of line records, but spreadsheets are limited in the number of records they can import and manage. Scripts can be written that parse through flat-files line by line (i.e., record by record), selecting data as they go. Software programs that write data to flat-files achieve a crude but serviceable type of data persistence.
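As a rough illustration of flat-file persistence, here is a minimal Python 3 sketch that writes tab-delimited line records (a key followed by its attributes) and then parses the file line by line, selecting records as it goes. The function names, the file name "records.tsv", and the sample records are all made up for this example.

```python
import csv
import os
import tempfile


def write_flat_file(path, records):
    """Write one tab-delimited line per record: the key, then its attributes."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for key, attributes in records.items():
            writer.writerow([key] + list(attributes))


def select_keys(path, wanted):
    """Read the file line by line, yielding keys of records that list `wanted`."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row and wanted in row[1:]:
                yield row[0]


# Made-up sample records, for illustration only.
records = {
    "rec001": ["fruit", "red"],
    "rec002": ["fruit", "yellow"],
    "rec003": ["vegetable", "red"],
}

# The flat-file is written to a temporary directory so the sketch leaves
# the working directory untouched.
flat_path = os.path.join(tempfile.mkdtemp(), "records.tsv")
write_flat_file(flat_path, records)

red_keys = list(select_keys(flat_path, "red"))
print(red_keys)
```

Because the reader is a generator, only one line is held in memory at a time, which is what lets this approach scale to files far larger than a spreadsheet will open.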
A middle-level technique for creating persistent data is the venerable database. If nothing else, databases are made to create, store, and retrieve data records. Scripts that have access to a database can achieve persistence by creating database records that accommodate data objects. When the script ends, the database persists, and the data objects can be fetched and reconstructed for use in future scripts.
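A minimal sketch of database-backed persistence, assuming Python 3 and its bundled sqlite3 module (the post does not name a particular database; SQLite is chosen here only because it needs no server). The table name, column names, and helper functions are hypothetical.

```python
import os
import sqlite3
import tempfile


def save_colors(db_path, colors):
    """Store each key/value pair of a dictionary as one database record."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS colors "
                "(fruit TEXT PRIMARY KEY, color TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO colors (fruit, color) VALUES (?, ?)",
                list(colors.items()),
            )
    finally:
        conn.close()


def load_colors(db_path):
    """Fetch the records and reconstruct the dictionary."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT fruit, color FROM colors"))
    finally:
        conn.close()


# Hypothetical database file, placed in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "colors.sqlite")
save_colors(db_path, {"apple": "red", "banana": "yellow"})

# A later script would only need the call below to get the data back.
restored_colors = load_colors(db_path)
print(restored_colors)
```

The save and load steps share nothing but the file path, so they could just as well live in two separate scripts run on different days.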
Perhaps the highest level of data persistence is achieved when complex data objects are saved in toto. Flat-files and databases may not be suited to storing complex data objects that hold encapsulated data values. Most languages provide built-in methods for storing complex objects, and a number of languages designed to describe complex forms of data have been developed. Data description languages, such as YAML (originally "Yet Another Markup Language", now "YAML Ain't Markup Language") and JSON (JavaScript Object Notation), can be adopted by any programming language.
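To make the idea of a language-neutral data description concrete, here is a small Python 3 sketch that saves a nested structure (a number, a string, a list, and a nested dictionary) as JSON and reads it back. The file name "dump_struct.json" and the variable names are assumptions made for this example; any language with a JSON parser could read the same file.

```python
import json
import os
import tempfile

# A nested structure chosen for illustration.
data = {
    "number": 42,
    "string": "This is a string",
    "array": list(range(1, 11)),
    "hash": {"apple": "red", "banana": "yellow"},
}

# Hypothetical output file, placed in a temporary directory.
json_path = os.path.join(tempfile.mkdtemp(), "dump_struct.json")

# Serialize the structure to a human-readable text file.
with open(json_path, "w", encoding="utf-8") as handle:
    json.dump(data, handle, indent=2)

# Deserialize it, as a later script would.
with open(json_path, encoding="utf-8") as handle:
    reloaded = json.load(handle)

print(reloaded == data)
```

JSON handles only a few basic types (strings, numbers, booleans, null, lists, and dictionaries with string keys), so objects with methods or custom classes still call for a language-specific technique such as pickling or marshalling.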
Data persistence is essential to data simplification. Without data persistence, all data created by scripts is volatile, obliging data scientists to waste time recreating data that has ceased to exist. Essential tasks such as script debugging and data verification become impossible. It is worthwhile reviewing some of the techniques for data persistence that are readily accessible to Perl, Python and Ruby programmers.
Perl will dump any data structure into a persistent, external file, for later use. Here, the Perl script, data_dump.pl, creates a complex associative array, "%hash", which nests within itself a string, an integer, an array, and another associative array. This complex data structure is dumped into a persistent structure (i.e., an external file named dump_struct).
#!/usr/local/bin/perl
use Data::Dump qw(dump);
%hash = (
  number => 42,
  string => 'This is a string',
  array  => [ 1 .. 10 ],
  hash   => { apple => 'red', banana => 'yellow' },
);
open(OUT, ">dump_struct");
print OUT dump \%hash;
exit;

The Perl script, data_slurp.pl, picks up the external file, "dump_struct", created by the data_dump.pl script, and loads it into a variable.
#!/usr/local/bin/perl
use Data::Dump qw(dump);
open(IN, "dump_struct");
undef($/);          # slurp mode: read the whole file in one pass
$data = eval <IN>;  # rebuild the data structure from the dumped text
close IN;
dump $data;
exit;

Here is the output of the data_slurp.pl script, in which the contents of the variable "$data" are dumped onto the output screen:
c:\ftp>data_slurp.pl
{
  array  => [1 .. 10],
  hash   => { apple => "red", banana => "yellow" },
  number => 42,
  string => "This is a string",
}

Python pickles its data. Here, the Python script, pickle_up.py, pickles a string variable:
#!/usr/bin/python
import pickle
pumpkin_color = "orange"
# The with-block closes the file, so the pickle is fully written to disk
with open("save.p", "wb") as pickle_file:
    pickle.dump(pumpkin_color, pickle_file)

The Python script, pickle_down.py, loads the pickle file, "save.p", and prints it to the screen.
#!/usr/bin/python
import pickle
# Read the pickled string back from the external file
with open("save.p", "rb") as pickle_file:
    pumpkin_color = pickle.load(pickle_file)
print(pumpkin_color)

The output of the pickle_down.py script is shown here:
c:\ftp\py>pickle_down.py
orange

Where Python pickles, Ruby marshals. In Ruby, whole objects, with their encapsulated data, are marshalled into an external file and demarshalled at will. Here is a short Ruby script, object_marshal.rb, that creates a new class, "Shoestring", a new class object, "loafer", and marshals the new object into a persistent file, "output_file.per".
#!/usr/bin/ruby
class Shoestring < String
  def initialize
    @object_uuid = (`c\:\\cygwin64\\bin\\uuidgen.exe`).chomp
  end
  def object_uuid
    print @object_uuid
  end
end
loafer = Shoestring.new
output = File.open("output_file.per", "wb")
output.write(Marshal::dump(loafer))
exit

The script produces no output other than the binary file, "output_file.per". Notice that when we created the object, loafer, we included a method that encapsulates within the object a full uuid identifier, courtesy of cygwin's bundled utility, "uuidgen.exe".
We can demarshal the persistent "output_file.per" file, using the Ruby script, object_demarshal.rb:
#!/usr/bin/ruby
class Shoestring < String
  def initialize
    @object_uuid = `c\:\\cygwin64\\bin\\uuidgen.exe`.chomp
  end
  def object_uuid
    print @object_uuid
  end
end
array = []
$/="\n\n"
out = File.open("output_file.per", "rb").each do |object|
  array << Marshal::load(object)
  array.each do |object|
    puts object.object_uuid
    puts object.class
    puts object.class.superclass
  end
end
exit

The Ruby script, object_demarshal.rb, pulls the data object from the persistent file, "output_file.per", and directs Ruby to list the uuid for the object, the class of the object, and the superclass of the object.
c:\ftp>object_demarshal.rb
c2ace515-534f-411c-9d7c-5aef60f8c72a
Shoestring
String

Perl, Python and Ruby all have access to external database modules that build database objects stored as external files, which persist after the script has executed. These database objects can be called from any script, and the data they contain can be accessed quickly with a simple command syntax [1].
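As one hedged illustration of this kind of persistent database object, here is a Python 3 sketch using the standard library's shelve module, which stores pickled values in a dbm-style file under string keys. The shelf name "demo_shelf" and the stored values are assumptions made for this example, and the two with-blocks stand in for two separate scripts.

```python
import os
import shelve
import tempfile

# Hypothetical shelf location; the dbm backend may add its own file extension.
shelf_path = os.path.join(tempfile.mkdtemp(), "demo_shelf")

# First "script": store values under string keys, then close the shelf.
with shelve.open(shelf_path) as shelf:
    shelf["fruit_colors"] = {"apple": "red", "banana": "yellow"}
    shelf["numbers"] = list(range(1, 11))

# Second "script": reopen the same external file and read the values back.
with shelve.open(shelf_path) as shelf:
    fruit_colors = shelf["fruit_colors"]
    numbers = shelf["numbers"]

print(fruit_colors, numbers)
```

The shelf behaves like a dictionary whose contents live on disk, so individual entries can be fetched by key without loading the whole file.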
Here is a Perl script, lucy.pl, that creates an associative array and ties it to an external database file, using the SDBM_File (Simple DataBase Manager file) module.
#!/usr/local/bin/perl
use Fcntl;
use SDBM_File;
tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR|O_CREAT|O_EXCL, 0644;
$lucy_hash{"Fred Mertz"} = "Neighbor";
$lucy_hash{"Ethel Mertz"} = "Neighbor";
$lucy_hash{"Lucy Ricardo"} = "Star";
$lucy_hash{"Ricky Ricardo"} = "Band leader";
untie %lucy_hash;
exit;

The lucy.pl script produces a persistent, external file, from which any Perl script can access the associative array created in the prior script. If we look in the directory from which the lucy.pl script was launched, we will find two new SDBM (Simple DataBase Manager) files, lucy.dir and lucy.pag. These are the persistent files that will substitute for the %lucy_hash associative array when invoked within other Perl scripts.
Here is a short Perl script, lucy_untie.pl, that extracts the persistent %lucy_hash associative array from the SDBM file in which it is stored:
#!/usr/local/bin/perl
use Fcntl;
use SDBM_File;
tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR, 0644;
while(($key, $value) = each (%lucy_hash))
  {
  print "$key => $value\n";
  }
untie %lucy_hash;
exit;

Here is the output of the lucy_untie.pl script:
c:\ftp>lucy_untie.pl
Fred Mertz => Neighbor
Ethel Mertz => Neighbor
Lucy Ricardo => Star
Ricky Ricardo => Band leader

Here is the Python script, lucy.py, that creates a tiny external database. The dumbdbm module used here belongs to Python 2; in Python 3 the same module is named dbm.dumb.
#!/usr/local/bin/python
import dumbdbm
lucy_hash = dumbdbm.open('lucy', 'c')
lucy_hash["Fred Mertz"] = "Neighbor"
lucy_hash["Ethel Mertz"] = "Neighbor"
lucy_hash["Lucy Ricardo"] = "Star"
lucy_hash["Ricky Ricardo"] = "Band leader"
lucy_hash.close()

Here is the Python script, lucy_untie.py, that reads all of the key/value pairs held in the persistent database created for the lucy_hash dictionary object.
#!/usr/local/bin/python
import dumbdbm
lucy_hash = dumbdbm.open('lucy')
for character in lucy_hash.keys():
    print character, lucy_hash[character]
lucy_hash.close()

Here is the output produced by the Python script, lucy_untie.py:
c:\ftp>lucy_untie.py
Fred Mertz Neighbor
Ethel Mertz Neighbor
Lucy Ricardo Star
Ricky Ricardo Band leader

Ruby can also hold data in a persistent database, using the gdbm module. If you do not have the gdbm (GNU database manager) module installed in your Ruby distribution, you can install it as a Ruby gem, using the following command line, from the system prompt:
c:\>gem install gdbm

The Ruby script, lucy.rb, creates an external database file, lucy.db:
#!/usr/local/bin/ruby
require 'gdbm'
lucy_hash = GDBM.new("lucy.db")
lucy_hash["Fred Mertz"] = "Neighbor"
lucy_hash["Ethel Mertz"] = "Neighbor"
lucy_hash["Lucy Ricardo"] = "Star"
lucy_hash["Ricky Ricardo"] = "Band leader"
lucy_hash.close
exit

The Ruby script, lucy_untie.rb, reads the associative array stored as the persistent database, lucy.db:
#!/usr/local/bin/ruby
require 'gdbm'
gdbm = GDBM.new("lucy.db")
gdbm.each_pair do |name, role|
  print "#{name}: #{role}\n"
end
gdbm.close
exit

The output from the lucy_untie.rb script is:
c:\ftp>lucy_untie.rb
Ethel Mertz: Neighbor
Lucy Ricardo: Star
Ricky Ricardo: Band leader
Fred Mertz: Neighbor

Persistence is a simple and fundamental process ensuring that data created in your scripts can be recalled by yourself or by others who need to verify your results. Regardless of the programming language you use, or the data structures you prefer, you will need to familiarize yourself with at least one data persistence technique.
- Jules Berman (copyrighted material)
key words: computer science, data science, data analysis, data simplification, simplifying data, persistence, databases, jules j berman
References:
[1] Berman JJ. Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby. Chapman and Hall, Boca Raton 2010.