Linux Goodies

In Pursuit Of The Perfect O/S



Spreadsheet Pivot Tables the Linux Way

Amazon Computers

So What Is a Pivot Table?

Before I get started, I confess to not being a spreadsheet expert. In fact, in the 40 years I've worked with data, I've never used a spreadsheet professionally. I worked on time series data, and it often needed multiple and sometimes complex processes done on it before it could be analyzed. And often the analysis itself involved some complex processes not commonly found in spreadsheets. So I worked originally in fundamental languages like FORTRAN and C, and later, as computer memory size grew, matrix languages like Octave, Yorick, R, and PDL.

But at home, I do use spreadsheets some. Doing general bookkeeping, and for the last few years in helping to manage my websites. I download my Google report each month and do some work on it with a spreadsheet, usually the lowly but easy to install and use SC spreadsheet. Sometimes I use the ubiquitous to Linux Gnumeric. In order to get a data presentation and summary information that I wanted, I needed to pre-process the data with a custom Perl script.

It was in examining my website data awhile back that I found that Google Sheets had pivot tables. So I looked up this topic at the Wikipedia Pivot Tables article. That gave me enough information to see that a pivot table would be a pretty good way to examine my web page and web site performance, and in fact in many ways what I'd been struggling to create with cumbersome methods.

I will pause for those aghast at my ignorance of Pivot Tables to regather their composure.

From what I read I constructed a small example data set similar enough to my Google web site data to be instructive. I imagined being in charge of a sales force scattered over a few divisions, each division with a few sales people. The contrived data included the sales values for each sales person in each division, as well as the value of their returns and costs. The data constructed was the following:

division salesman sales returns costs
div1 bob 3000 400 250
div2 ted 2500 200 150
div3 mary 3200 150 200
div1 john 1600 200 100
div2 fred 3300 500 300
div3 billy 2000 150 300
div4 sue 1500 100 150
div1 jose 2200 300 95
div2 alice 4000 500 300
div4 debbie 1500 100 200
div3 ronnie 4500 360 200
div4 sam 5000 800 300
div1 james 3500 300 175
div2 suzie 4800 200 150

You see that the data is all mixed together, and that may not be the optimal way to visualize it. In a spreadsheet, one can easily sort the data by any of the columns and learn who the top sales people were, the ones with the most returns, etc. But it would be difficult to see how the divisions were performing.

That's where the Pivot Table concept comes in. It gives sort of a data-base light capacity to build reports from larger data sets.  Using the Pivot Table capacity in Google Sheets I was able to split out each division, with the separate divisions' data listed out in adjacent columns. A single division split-out looked like this:

division salesman sales returns costs
div1 bob 3000 400 250
div1 james 3500 300 175
div1 jose 2200 300 95
div1 john 1600 200 100

In the actual test scenario, each division, showing the sales person values, was listed as a sub-report in adjacent columns, so I could then easily examine each division. And through the Pivot Table options, I could choose to select from a number of column operations, like sort by sales values, or costs. The Pivot Table function still  viewed the data, though separated into sections, as connected. So if I chose to sort by sales amounts, all divisions in their respective spreadsheet positions were sorted by the selected sales amounts.

The Pivot Table function also constructed a summary table (the Pivot Table) that produced ensemble data for the divisions. It looked as follows:

division #personal sales returns costs
div1 4 10300 1200 620
div2 4 14600 1400 900
div3 3 9700 660 700
div4 3 8000 1000 650
Totals 14 42600 4260 2870

Again, once I figured out how to go through the motions in Google Sheets to fill in the Pivot Table form, the split-out divisions and the summary table were the results.

So, I tried it with my actual Google Adsense data. It didn't take me long to hit a snag.

In my test data, the sale persons and division they belonged to were handily in different columns, easily accessible with the Pivot Table utility. But with my Google data, the web site and web page were part of the same URL entry in a single column. So first I had to learn how to split out the web site and web page information from the URL name so that the Pivot Table selection criteria would work with the web site tag.

Once I pulled out of the URL the web site and web page information into separate columns, I was able to get my Pivot Table report generated for a selected month's data. Each web site's information appeared in a different part of the the spreadsheet (in adjacent columns), and the Pivot Table summary appeared in a different sheet (tab on the bottom of the spreadsheet).

Pretty close to what I wanted. I further wanted to have the summary data also reported as percentages so I could quickly see which web site was getting the greatest percentage of views, clicks, etc. While the Pivot Table forms in Google Sheets didn't offer that much flexibility, it was easy to make those additions to the Pivot Table summary sheet without negatively impacting any other part of the Pivot Table report.

To make the most use of the effort, I found that I could use the Google Sheets copy file function to make a copy of one month's report, rename the copy, and paste over the data portion with Adsense data from a different month's report. Thus all of the work would automatically be done on the new data.

Alas, Many Spreadsheets Don't Have Pivot Tables

This was handy in that Google Sheets is accessible from any computer, including my Chromebook. But being a browser-based spreadsheet, Google Sheets is a bit slower than a locally housed spreadsheet. So I wanted to be able to work with my data locally.

On my Linux computers, I have three spreadsheets available. Gnumeric, SC, and the Softmaker spreadsheet, Planmaker. As it happens, Planmaker has support for pivot tables, though going through its menus to setup a pivot table report was again a bit of a struggle. Planmaker also needed me to use some Planmaker string tools to pull out the web site and web page information from the URL entries. And Planmaker made just the summary table, not a sheet of the split out data for each criteria.

Gnumeric has no direct support for Pivot Tables. It has some filters for pulling out sub-sets of data, but it's not packaged up for anything so automatic and convenient as a pivot table. SC has no such support either, nor any real filtering support.

So I gave some thought as to what I might need if I wanted to pre-process data so that I could create something close to a Pivot Table with Gnumeric and SC (or any spreadsheet for that matter). And I didn't want to re-create the overly specific software that I'd used before. I wanted routines that were as generic as I could imagine. I looked what really is involved in creation of a Pivot Table. I found I needed the following:

  1. The ability to pull out sub-sets of data based on criteria column(s).
  2. The ability to sort the sub-sets on any columns desired.
  3. The ability to re-merge sub-sets of data into adjacent columns of merged file.
  4. The ability to create a summary (Pivot) table on the sub-set criteria, since in the merged file the sub-sets of data are easily accessible.

This didn't look to bad, really. Steps 1 through 3 could be done on tab or comma separated files prior to them being introduced into a spreadsheet. And item 4, assuming all of the other steps have already been applied to the data, would be a trivial exercise in a spreadsheet.

The Linux Way

So I decided to tackle this problem in what I call the Linux Way.

Say what?

The Linux Way, meaning a few simple, minimal function programs that do little, but can be ran in sequence to lead to a more complex solution. Being simple and limited in function, each program would be pretty simple to write, modify, and maintain. By running them in sequence, I could accomplish the things I needed to end up with a Pivot Table.

The scripts I chose to write were as follows:

  1. - a program that given a tab or comma separated data file, would pull out a sub-set of data based on a criteria column and a criteria value. It provides the additional option to output only specific columns.
  2. - a program that can sort a tab or comma separated file by any number of sort columns. It uses the fast, internal sort routine of Perl.
  3. - a program that combines several csv files into a single csv file, placing the data of each file in adjacent columns of the output file. It allows me to place one or more blank columns between data sets for clarity or additional use.
  4. For SC only, - a program to convert a tab or comma separated file to an SC format file, as SC can only read its own file format. The program additionally lets me specify a start column for the SC file data.

You may be thinking, "Isn't this taking a rather simple problem and making it more complicated?"

Well, not if you are working with a spreadsheet without the Pivot Table capability. And working with the little programs is really very easy. Just keep in mind that Perl starts its array indices at zero. I usually use a simple batch file, something like the following, a batch file for working with the previously described demo data:

# Batch file to create pivot table data
# 1st, split out each data sub-set as
# a separate file
# In this case, column 0 is the criteria column, search
# for div1, then div2, ... database.csv div1.csv 0 div1 database.csv div2.csv 0 div2 database.csv div3.csv 0 div3 database.csv div4.csv 0 div4

# Next, sort each sub-set csv file by columns 2 and 3,
# Which are sales and returns. div1.csv sorteddiv1.csv 2 3 div2.csv sorteddiv2.csv 2 3 div3.csv sorteddiv3.csv 2 3 div4.csv sorteddiv4.csv 2 3

# Next, Merge all sorted files together into adjacent columns.
# Leave 1 empty column between sub-sets of data 1 alldivs.csv sorteddiv1.csv sorteddiv2.csv sorteddiv3.csv sorteddiv4.csv

That doesn't look so bad, does it? One line is just copied to do the same operation with a different critera and output a different file. One line is just copied to sort each sub-set of data. And only one line is needed to put the data back together in a sensible form.

What's next? Pass the file to your favorite spreadsheet. If you want a summary table, that's easy to create with any spreadsheet, just taking account of the columns that contain the data of interest.

I created a generic Gnumeric (say that fast 3 times) spreadsheet file that has an empty sheet1 to hold the data from the batch file, and another sheet (sheet2) that holds the Pivot Table mechanics that refer to the data in sheet1. I can use this Gnumeric file for each new report, just pasting the data into sheet1 and saving the sheet to my desired report name.

In fact, I just injected the following two lines into the batch file to automatically load Gnumeric on the data and the Pivot Table sheet:

# Gnumeric references, added to end of batch file (after
gnumeric alldiv.csv pivot.gnumeric

So in operation, when I run the batch file, almost immediately two Gnumeric windows open. One has the processed and merged data with each sub-set in separate columns, the other has a blank sheet1 and a sheet2 with table mechanics in it. I just copy data from the data Gnumeric into sheet1 of the Pivot Sheet Gnumeric, and the work is done. I save the Pivot Sheet spreadsheet by the desired name for future reference.

With the Linux Way, Even SC Can Do It

The little Perl scripts do the bulk of the Pivot Table work prior to entering a spreadsheet, so ideally, it would give virtually any spreadsheet the capability. So I did it with the venerable Curses-based SC spreadsheet. This little spreadsheet is much like the old spreadsheets of DOS days, except its editing commands mimic the Linux favorite VI editor.

SC has all the basics, math functions, some financial functions, a few string functions and date functions, plus some range operators. So with it I can accomplish most of what I generally wish to do with a spreadsheet.

SC has one significant limitation, it can only read data in its native form. It can't even read in comma separated or tab separated files. It does often come packaged, however, with a tool to convert CSV files to SC format.

As it happens, the SC format is a quite simple ASCII format. So I wrote my own little Perl script to convert CSV or tab-separated files to SC files. I added one little refinement. I can specify on the command line what column in SC to begin the data.

This is handy because SC supports the merging of files. That means ether at run time or even interactively, SC can be told to merge SC files into the current spreadsheet, and as long as the files don't overlap one another when loaded, all is well. By being able to control what column CSV files will load in, I can have them pre-setup to be merged with other data.

So to do the magic with SC, I wrote a little SC sheet that is set to perform the summary Pivot Table statistics on data that will start in column A and extend to include data from each of my web sites. The Table mechanics are positioned to be past the columns used for the data, so it can be merged without messing up the data.

I just added a couple of lines to the end of my batch file (instead of the Gnumeric line) as follows:

csv2sc alldiv.csv

Recall that the "alldiv.csv" file was the separated, sorted, and merged data from the initial database. The "" is the name of the SC file that I created to contain the summary statistics mechanics, which assumes the same column positions for each month's merged CSV data.

In practice, the SC batch file is the quickest to use. I run the batch file indicating the current Adsense data, and almost immediately SC opens up with the properly positioned merged data, and  just past the last data column is the computed summary statistics. No cutting or pasting, just run a command and all the work is done and the spreadsheet  complete with summary table opens up.

Summary and Disclaimer

Available from this post is a tar file of the Perl scripts I use. Within each program is the following warning and disclaimer:

# ************************************************************
# This SOFTWARE PRODUCT is provided by THE PROVIDER "as is" and
# "with all faults." THE PROVIDER makes no representations or
# warranties of any kind concerning the safety, suitability,
# lack of viruses, inaccuracies, typographical errors,
# or other harmful components of this SOFTWARE PRODUCT.
# There are inherent dangers in the use of any software,
# and you are solely responsible for determining whether
# this SOFTWARE PRODUCT is compatible with your equipment
# and other software installed on your equipment. You are also
# solely responsible for the protection of your equipment
# and backup of your data, and THE PROVIDER will not be liable
# for any damages you may suffer in connection with using,
# modifying, or distributing this SOFTWARE PRODUCT.
# ************************************************************

I make the scripts available more or less to introduce a concept to you. Many old Linux hands will see nothing special here. Some intense spreadsheet users may be introduced to a new way of looking at how to use Linux.

You may loath Perl. But the scripts are so small that I think you can easily understand them and if desired, convert them to some other language you may like better. I used Perl because I'm familiar with it, and it is marvelous for dealing with tab or comma separated files. Also, the programs are commented, so that you can easily tune them to your liking, or if desired, extend them to be more general.

I offer them in this simple form so that taking them where you wish will be easy enough. In my particular application, they amazingly are sufficient. If you aren't into programming at all, I guess I'd suggest not using them, but just learning from this post a little about the Linux Way of doing things.

If your interested, here are the Perl CSV Routines and Demo Files packaged as a tar file.