Category Archives: programming

TeXlive install process, or developing an intelligent download timer: Part 2

In part one of this series, I presented the download predictions for a program installation. The ultimate goal here is to develop a method for accurately predicting total download times from the beginning of the download process. To elaborate: as the download progresses, it should become increasingly easy to make an accurate prediction of the time remaining, because there is less and less left to download, and more and more information available about the download history.

Naturally, the first thing we want to do is see if the data actually follow any sort of trend. In theory, larger packages in the TeXlive install should take longer to download, but the time per unit volume (i.e., the download speed) should be roughly the same for each package, and should remain roughly constant (or really, vary about some constant mean) over time. The easiest way to check this is to simply plot the download speed over the duration of the download.
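As a sketch of how this looks in R: the real analysis would use the parsed install log from part 1, so the toy data frame below (sizes, units, and noise levels are all my invention) just stands in for it.

set.seed(1)
df <- data.frame(size = runif(3188, min = 10, max = 1200))  # toy package sizes (kB, my guess at units)
df$time <- pmax(0.05, 3.0e-4 * df$size + rnorm(3188, mean = 0.366, sd = 0.2))  # toy per-package times (s)
df$speed <- df$size / df$time                               # per-package download speed
plot(cumsum(df$time) / 60, df$speed, type = "l",
     xlab = "elapsed time (min)", ylab = "download speed (kB/s)")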


The speed varies significantly, but that’s okay. Qualitatively, the distribution of speeds appears to vary randomly about a mean value. This is good for us, because it means there is no trend in the speed over time, as we would see, for example, if the mean speed were drifting as the download progressed.

This means that we can build a model of download speeds to predict the total download time. If we simply fit a linear model to the data above (i.e., elapsed package time ~ size), we find that the data are reasonably well explained (r^2 = 0.60) by a line with slope = 3.003797e-4 and intercept = 3.661338e-1.
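Fitting that model in R is one call (continuing with a data frame df of per-package size and time, as in the sketch above):

fit <- lm(time ~ size, data = df)    # elapsed package time ~ size
summary(fit)$r.squared               # the real-data fit gave r^2 = 0.60
coef(fit)                            # cf. intercept = 3.661338e-1, slope = 3.003797e-4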

Then, we can use our linear model to predict the time it will take to download each package based on its size, and sum those predictions to produce a prediction for the total download time. Evaluation produces a predicted total download time of 29:26 mm:ss (plotted as the dashed line below).
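The predict-and-sum step, with mm:ss formatting, might look like this (again continuing from the sketch above; the toy data won’t reproduce the real total):

total_s <- sum(predict(fit, newdata = df))  # sum of predicted per-package times (s)
mins <- as.integer(total_s) %/% 60L
secs <- as.integer(total_s) %% 60L
sprintf("%d:%02d", mins, secs)              # gives "29:26" for the real data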


29:26 happens to be the exact time that our download took. That means that despite all the variation in download speeds, the mean over time was so steady that a simple linear model (a constant download speed) perfectly predicts the observed data; perhaps this is not surprising when you see the roughly constant-slope red line above.

Now, this model was based on perfect information available at the end of the download. In the next post, we’ll explore a common, simple, and popular prediction algorithm as a test of an a priori, continuously updating prediction tool.

TeXlive install process, or developing an intelligent download timer: Part 1

I recently got a new laptop, and during the process of setting it up to my preferences, I installed LaTeX through TeXlive. This means a massive download of many small packages that get included in the LaTeX install. In effect, this is how all software downloads go: many small parts that make up the whole. Installing TeXlive on Linux gave me the chance to actually see the report of the download, and of course to save it and plot it up after completion. Here is what the data output to the console looks like during the install:

After 3 downloads, the installer makes a prediction of the total time, and then reports the elapsed time against the predicted time, along with some information about the current download. If we take this information for all 3188 packages and parse out the desired information, we can plot the actual time versus the predicted time to see how the prediction performs over time.
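Schematically, the parsing might look like this in R; the regular expression is a guess at the shape of the log (something like "mm:ss/mm:ss" stamps on each line), not its verbatim format, and the filename is a placeholder:

lines <- readLines("install-tl.log")   # hypothetical capture of the console output
stamps <- regmatches(lines, regexpr("[0-9]+:[0-9]+/[0-9]+:[0-9]+", lines))
to_s <- function(x) {                  # convert "mm:ss" to seconds
  p <- as.numeric(strsplit(x, ":")[[1]])
  60 * p[1] + p[2]
}
parts <- strsplit(stamps, "/")
elapsed   <- sapply(parts, function(p) to_s(p[1]))
predicted <- sapply(parts, function(p) to_s(p[2]))
plot(elapsed / 60, predicted / 60, type = "l",
     xlab = "elapsed time (min)", ylab = "predicted total time (min)")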

There are some pretty large swings in the predicted time at the beginning of the download, but by about 25% of the total download (by size), the prediction becomes pretty stable, making only minor corrections. The corrections continue until the very end of the downloads.

Download time prediction is a really interesting problem to work on, since you are attempting to control for download speed, which depends largely on factors outside the personal computer and is likely to vary over timescales longer than a few minutes. I’ll be making a few posts about this topic over the next months, culminating in what I hope is a simple, fast, and accurate download time prediction algorithm. More to come!

History of the Houston Rodeo performances

The Houston Livestock Show and Rodeo is one of Houston’s largest and most famous annual events. Now, I won’t claim to know much about the Houston Rodeo; heck, I’ve only been to the Rodeo once, and have lived in Houston for a little over a year and a half! I went to look for the lineup for 2016 to see what show(s) I may want to see, but they haven’t released the lineup yet (it comes out Jan 11, 2016). I got curious about what the history of the event was like, and conveniently, they have a past performers page; this page is the source for the data used in this post.

First, I pulled apart the data on the page and built a dataset of each performer and every year they performed. The code I used to do this is an absolute mess, so I’m not even going to share it, but I will post the dataset here (.rds file). Basically, I had to convert all the unformatted year data to clean, uniformly formatted lists of years for each artist.
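To give a flavor of the cleaning problem: the raw year fields mix single years and ranges (my guess at the shape of the mess), which need expanding into plain numeric vectors. Something like:

raw <- "1990, 1992-1994"                      # hypothetical raw field from the page
parts <- strsplit(raw, ",\\s*")[[1]]
years <- unlist(lapply(parts, function(p) {
  r <- as.numeric(strsplit(p, "-")[[1]])
  if (length(r) == 2) seq(r[1], r[2]) else r  # expand ranges, keep single years
}))
years                                         # 1990 1992 1993 1994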


Above is the histogram of the number of performances across all the performers. As expected, the distribution is skewed right, with a long tail toward higher numbers of performances per performer. Just over 51% of performers have performed only one time, and 75% of performers have performed fewer than 3 times. This actually surprised me; I expected to see even fewer repeat performers, given how many big names have come to the Rodeo over the years. The record for the most performances (25) is held by Wynonna Judd (Wynonna).
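With the dataset loaded, the summary numbers fall out quickly. This sketch assumes the .rds holds a named list with one vector of years per performer (my assumption about its structure), and the filename is a placeholder for the linked file:

perf <- readRDS("rodeo_performers.rds")      # placeholder filename for the linked .rds
n_shows <- sapply(perf, length)              # performances per performer
hist(n_shows, breaks = seq(0.5, max(n_shows) + 0.5))
mean(n_shows == 1)                           # just over 0.51
mean(n_shows < 3)                            # about 0.75
names(which.max(n_shows))                    # Wynonna, with 25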

I then wanted to see how the number of shows per year changed over time, since the start of the Rodeo.


The above plot shows every year from the beginning of the Rodeo (1931) to the most recent completed event (2015). The blue line is a loess smoothing of the data. Now, I think that the number of performances corresponds with the number of days of the Rodeo (i.e., one concert a night), but I don’t have any data to confirm this. It looks like the number of concerts in recent years has declined, but I’m not sure whether the event has also been shortened (e.g., from 30 to 20 days). Let’s compare that with the attendance figures from the Rodeo.
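For reference, here is a base-R version of the counts-per-year plot with a loess smooth, continuing from the perf list above (the post’s figure may have been made differently):

per_year <- table(unlist(perf))                        # shows per year
dfy <- data.frame(year = as.numeric(names(per_year)),
                  n = as.vector(per_year))
plot(dfy$year, dfy$n, xlab = "year", ylab = "performances")
lines(dfy$year, predict(loess(n ~ year, data = dfy)), col = "blue")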
Despite fewer performances per year since the mid 1990s, attendance has continued to climb. Perhaps the planners realized they could lower the number of performers (i.e., cost) and still have people come to the Rodeo. The Rodeo is a charity that raises money for scholarships and such, so more excess revenue means more scholarships! Even without knowing why the planners decided to reduce the number of performers per year, it looks like the decision was a good one.

If we look back at the 2016 concerts announcement page, you can see that they list the genre of the show for each night, but not yet the performers. I wanted to see how the division of performer genres has changed over the years of the Rodeo. So, I used my dataset and the API to get the top two user-submitted “tags” for each artist. I then classed the performers into 8 different genres based on these tags. Most of the tags are genres, so about 70% of the data was easy to class; I then manually binned all the remaining artists into the genres, trying to be as unbiased as possible.
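The tag-to-genre step can be a simple lookup table. The tags and mapping below are toy examples, not the real 8-genre scheme:

top_tag <- c("country", "pop", "latin", "stand-up")  # hypothetical top tags from the API
tag2genre <- c(country = "Country", pop = "Pop", latin = "Latin")  # toy lookup
genre <- unname(tag2genre[top_tag])
genre   # "Country" "Pop" "Latin" NA -- the NA entries were binned by hand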


It’s immediately clear that country music has dominated the Houston Rodeo lineup since the beginning. I think it’s interesting to see the increase in the variety of music since the late 1990s, beginning to include a lot more Latin music and pop. I should caveat, though, that the appearance of pop music may be complicated by the fact that what was once considered “pop” is now considered “oldies”. There have been a few comedians throughout the Rodeo’s run, but none in recent years. 2016 will feature 20 performances again, with a split that looks pretty darn similar to 2015, with a few substitutions:


Lal, 1991 in situ 10Be production rates

10Be is a cosmogenic radioactive nuclide that is produced when high-energy cosmic rays collide with atomic nuclei and cause spallation. 10Be produced in the atmosphere (and then transported down to the surface) is called “meteoric”, and 10Be produced within mineral lattices in soil and rock is called “in situ“. In 1991, Devendra Lal wrote a highly cited paper about the physics of in situ produced Beryllium-10 (10Be). In the paper he lays out an equation for the production rate of in situ 10Be (q) based on latitude and altitude. I’m currently working on an idea I have for using cosmogenic nuclides as tracers for basin-scale changes in uplift rate, so I wanted to see what his equation looked like when applied. The equation is a third-degree polynomial in altitude (y), with coefficients that depend on latitude (L).
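Schematically, the equation has the form q(y, L) = a1(L) + a2(L)*y + a3(L)*y^2 + a4(L)*y^3, where the four coefficients are tabulated against latitude in the paper. A minimal sketch in R, with placeholder coefficient values (NOT Lal’s published numbers, which you should take from the paper):

lal_q <- function(y, a) {
  # third-degree polynomial in altitude y, coefficients a = c(a1, a2, a3, a4)
  a[1] + a[2] * y + a[3] * y^2 + a[4] * y^3
}
a_lat <- c(1.0, 0.5, 0.1, 0.01)  # PLACEHOLDER coefficients for one latitude band
lal_q(2.5, a_lat)                # production rate at 2.5 km altitude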

I grabbed an old raster (GEBCO 2014, 30 arc-second) I had lying around for Eastern North America and plotted it up. First, the elevation map (obviously, latitude is on the y-axis…)


Elevation map for ENAM.

And then I apply the Lal, 1991 equation and find:


Plotting Lal’s 1991 in situ production rate equation for ENAM. Green to red indicates increasing production rate. Production rate = NA in water.
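Applying the polynomial over the raster might look roughly like this with the raster package, using the lal_q sketch above (the filename is a placeholder, and a full version would vary the coefficients with each cell’s latitude rather than using one set):

library(raster)
dem <- raster("gebco_2014_enam.tif")  # placeholder filename for the GEBCO subset
dem[dem < 0] <- NA                    # mask water, so production rate comes out NA
q <- calc(dem / 1000, function(y) lal_q(y, a_lat))  # altitude in km, placeholder coefs
plot(q)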

I think the interesting observation is how little of the mapped area shows any significant change in the production rate. Maybe this should be obvious, since the polynomial depends directly on altitude, and altitude doesn’t change that much across most of the map. Further, the dependence on latitude is not at all observable in this map; perhaps because the latitude range is not very large, or because the coefficients never change by more than an order of magnitude anyway. Next time, maybe a world elevation map! Not sure my computer has enough memory…

You can grab the code I used from here and Lal’s paper from here.

River hysteresis and plotting hysteresis data in R

Hysteresis is the concept that a system (be it mechanical, numerical, or whatever else) is dependent on the history of the system, and not only the present conditions. This is the case for rivers. For example, consider the following theoretical flood curve and accompanying sediment discharge curve (Fig. 1a).

Figure 1. Theoretical plots to demonstrate the hysteresis of a river.


With the onset of the flood, the increased sediment transport capacity of the system entrains more sediment and the sediment discharge curve (red, Fig. 1a) rises. However, the system may soon run out of sediment to transport (really just a reduction in easily transportable sediment), and the sediment discharge curve decreases although the water discharge curve remains high in flood.

In Fig. 1b, the sediment discharge is plotted against the water discharge through time, a typical way of visualizing the hysteresis of a system. Note that on the rising limb and falling limb of the river flood, the same water discharge produces two different sediment transport values.

Now, let’s imagine that we want to investigate how important the history of the system is to the present state of our study river. You can grab the data I’ll use here. This is data from one year of flow on the Huanghe (Yellow River) in China, and it has been smoothed with a moving-average function to make the hysteresis pattern more visible.

Making the plot

It is easy enough to plot a line with R (the lines function), but with a hysteresis plot, it is important to be able to determine which direction time moves forward along the curve. For this reason, we want to use arrows. So we plot the line data first, with something like the following (the exact call and filename are my reconstruction of the lost snippet):
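df <- readRDS("huanghe_smoothed.rds")  # placeholder filename; use the linked data file
plot(df[,"Qw"], df[,"Qs"], type = "l",
     xlab = "water discharge, Qw", ylab = "sediment discharge, Qs")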


and then, using a constructed vector of every 22nd index, we plot arrows over top of the lines using:

s <- seq(from = 1, to = length(df[,"Qw"]) - 1, by = 22)   # indices of every 22nd point
arrows(df[,"Qw"][s], df[,"Qs"][s], df[,"Qw"][s+1], df[,"Qs"][s+1],
       length = 0.05)   # arrowhead size; adjust to taste

Finally, with a few more modifications to the plot (see the full code here), we can produce Fig. 2 below. This plot is comparable to the theoretical one above.

Figure 2. Hysteresis plot from actual data.

Using the green lines and points, I have highlighted the observation that for the rising limb and falling limb of a flood, there can be substantially different sediment discharges for the same water discharge — this observation is not so easily made from the plot on the left, but it is immediately clear in the hysteresis plot on the right.

Be inspired while coding! — Matlab script

Here is a script I wrote a while ago that returns an inspirational quote when called; I made it as a joke to send to my lab group during a particularly grueling week of coding. You can simply call the function at the beginning of your script to have a quote printed to stdout, or you can wrap it inside a waitbar if you want.

Grab the code from my GitHub, here.


Example output from the inspire.m script.

Quick script for connecting to University VPN

I can be pretty lazy at times, so much so that I will go out of my way to write a bit of code to simplify my life. I don’t often connect to my university’s VPN, but when I eventually do want to, I can never remember the command needed to do it, so I have to take the time to look it up. Well, I cut that out today with a basic little bash script with a name I can remember when I need it; anything memorable to you would be fine. The script takes one argument (“on” or “off”) to either connect to or disconnect from the VPN. Stick the script in your bin folder and it will execute simply by calling its name with on.

The script is reproduced below, or you can grab it from here.

#!/bin/bash
# Andrew J. Moodie
# Feb 2015
# easily remembered interface for vpn connection
if [ "$1" = "on" ]; then
    sudo /usr/sbin/vpnc RiceVPN.conf
elif [ "$1" = "off" ]; then
    # tear down the connection (vpnc ships a companion disconnect command)
    sudo /usr/sbin/vpnc-disconnect
else
    echo " input error -- one argument must be given"
    echo " valid inputs: 'on' or 'off'"
fi
Note that you could further automate this if you wanted to (e.g., auto-enter passwords, connect on startup), but this is a bad idea both for security and for internet speed.