For very large files, the built-in sort functions of most programming languages cannot do the job: every line is put into an array held in memory, and even computers with lots of memory tend to choke.
An easy shortcut is to sort on only the first few characters of each line (10 characters in the script provided below) instead of the entire line. The array then holds just ten characters per line, plus the line's position in the file, and this saves lots of memory. The price is that lines sharing the same first ten characters are not ordered by the rest of their text.
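To make the memory saving concrete, here is a minimal sketch of how one short sort key might be built from a long line. The helper name `make_key` and the constant `KEY_WIDTH` are hypothetical and are not part of the bigsort script; the sample lines are invented for illustration.

```ruby
# Hypothetical helper (not from the original script): build a short sort key
# from a line and the byte offset at which that line starts in the file.
KEY_WIDTH = 10

def make_key(line, offset, width = KEY_WIDTH)
  # chomp strips "\n" or "\r\n"; ljust pads short or empty lines with spaces
  prefix = line.chomp.ljust(width)[0, width]
  prefix + offset.to_s
end

long_line = "adenocarcinoma of the lung, poorly differentiated\r\n"
key = make_key(long_line, 1234)
puts key           # the prefix followed by the offset
puts key.bytesize  # much smaller than the line it stands for

# Caveat: lines that share their first ten characters get the same prefix,
# so their relative order is decided by the offset text, not by the content.
a = make_key("adenocarcinoma of the lung\n", 0)
b = make_key("adenocarcinoma of the colon\n", 27)
puts a[0, KEY_WIDTH] == b[0, KEY_WIDTH]
```

Only the key is kept in the array; the full line stays on disk and is fetched later by seeking to the stored offset.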
I provided a short Ruby script (bigsort.pl) on page 150 of my book, Ruby Programming for Medicine and Biology. That script has a few quirks. First, it assumes that the text file to be sorted is a DOS-style file with a two-character (carriage return, line feed) line break. Second, it assumes that every line in the file to be sorted contains alphanumeric text.
Provided here is a minor modification to the bigsort.pl Ruby script. It should work for any type of text file and does not require text to appear on each line of the file that is being sorted. As with all my posted scripts, the method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.
Thanks go to Dr. Tim Rand, who spotted the error and sent me his own version of a fix on February 6, 2008.
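#!/usr/local/bin/ruby
# Sort a large text file on the first ten characters of each line.
text = File.open("terms.txt", "r")
out = File.open("terms.put", "w")
linearray = Array.new
begin_position = 0
text.each_line do |line|
  old_position = begin_position   # byte offset where this line starts
  begin_position = text.pos       # byte offset where the next line starts
  # chomp (not chomp!) never returns nil; ljust pads short or empty lines
  line = line.chomp.ljust(10)
  linearray << line.slice(0..9) + old_position.to_s
end
linearray.sort!
linearray.each do |value|
  # everything after the ten-character key is the stored offset
  seekplace = value.slice(10..-1).to_i
  text.seek(seekplace, IO::SEEK_SET)
  out.puts(text.readline)
end
text.close
out.close
exit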
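The same approach can be wrapped in a reusable method and tried on a small throwaway file. This is only a sketch and not the script from the book: the method name `prefix_sort`, its parameters, the binary read mode, and the use of `[prefix, offset]` pairs (so that ties on the prefix keep their original file order) are all assumptions made for illustration.

```ruby
require "tmpdir"

# Sketch only: prefix_sort and its parameters are hypothetical names.
def prefix_sort(in_path, out_path, width = 10)
  keys = []
  # Binary mode is an assumption here, chosen so offsets are plain byte counts
  File.open(in_path, "rb") do |text|
    offset = 0
    text.each_line do |line|
      # Keep only a fixed-width prefix plus the line's starting byte offset
      keys << [line.chomp.ljust(width)[0, width], offset]
      offset = text.pos
    end
    # Arrays compare element by element, so equal prefixes fall back to
    # the numeric offset and stay in their original order
    keys.sort!
    File.open(out_path, "w") do |out|
      keys.each do |_prefix, pos|
        text.seek(pos, IO::SEEK_SET)
        out.puts(text.readline.chomp)
      end
    end
  end
end

# Try it on a tiny file with mixed line breaks, an empty line,
# and no line break after the final line.
Dir.mktmpdir do |dir|
  src = File.join(dir, "terms.txt")
  dst = File.join(dir, "terms.out")
  File.binwrite(src, "pear\r\n\r\napple\nfig")
  prefix_sort(src, dst)
  print File.read(dst)
end
```

In this sketch each output line is written with the line break that `puts` supplies, so a file with mixed DOS and Unix line breaks comes out with uniform ones. The empty line sorts first because its key is ten spaces.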
-Jules Berman