Category Archives: R/Matlab

History of the Houston Rodeo performances

The Houston Livestock Show and Rodeo is one of Houston’s largest and most famous annual events. Now, I won’t claim to know much about the Houston Rodeo. Heck, I’ve only been to the Rodeo once, and I’ve lived in Houston for a little over a year and a half! I went looking for the 2016 lineup to see what show(s) I might want to see, but they haven’t released it yet (it comes out Jan 11 2016). I got curious about the history of the event, and conveniently, they have a past performers page; this is the source for the data used in this post.

First, I pulled apart the data on the page and built a dataset of each performer and every year they performed. The code I used to do this is an absolute mess so I’m not even going to share it, but I will post the dataset here (.rds file). Basically, I had to convert all the unformatted year data into clean, uniformly formatted lists of years for each artist.


Above is the histogram of the number of performances for each performer. As expected, the distribution is skewed right, with a long tail toward higher numbers of performances. Just over 51% of performers have performed only once, and 75% of performers have performed fewer than 3 times. This actually surprised me; I expected to see even fewer repeat performers. A lot of big names have come to the Rodeo over the years. The record for the most performances (25) is held by Wynonna Judd (Wynonna).

I then wanted to see how the number of shows per year changed over time, since the start of the Rodeo.


The above plot shows every year from the beginning of the Rodeo (1931) to the most recent completed event (2015). The blue line is a Loess smoothing of the data. Now, I think that the number of performances corresponds with the number of days of the Rodeo (i.e. one concert a night), but I don’t have any data to confirm this. It looks like the number of concerts in recent years has declined, but I’m not sure if the event has also been shortened (e.g. from 30 to 20 days). Let’s compare that with the attendance figures from the Rodeo.

Despite fewer performances per year since the mid 1990s, the attendance has continued to climb. Perhaps the planners realized they could lower the number of performers (i.e. cost) and still have people come to the Rodeo. The Rodeo is a charity that raises money for scholarships and such, so more excess revenue means more scholarships! Even without knowing why the planners decided to reduce the number of performers per year, it looks like the decision was a good one.

If we look back at the 2016 concerts announcement page, you can see that they list the genre of the shows each night, but not yet the performers. I wanted to see how the division of genre of performers has changed over the years of the Rodeo. So, I used my dataset and the API to get the top two user-submitted “tags” for each artist. I then classed the performers into 8 different genres based on these tags. Most of the tags are genres, so about 70% of the data was easy to class. I then manually binned all the remaining artists into the genres, trying to be as unbiased as possible.


It’s immediately clear that since the beginning, country music has always dominated the Houston Rodeo lineup. I think it’s interesting to see the increase in variety of music since the late 1990s, beginning to include a lot more Latin music and pop. I should caveat though, that the appearance of pop music may be complicated by the fact that what was once considered “pop” is now considered “oldies”. There have been a few comedians throughout the Rodeo’s run, but none in recent years. 2016 will feature 20 performances again, with a split that looks pretty darn similar to 2015, with a few substitutions:


Lal, 1991 in situ 10-Be production rates

10Be is a cosmogenic radioactive nuclide that is produced when high energy cosmic rays collide with nuclei and cause spallation. 10Be is produced in the atmosphere (and then transported down to the surface) as “meteoric”, and produced within mineral lattices in soil and rocks as “in situ”. In 1991, Devendra Lal wrote a highly cited paper about the physics of in situ produced Beryllium-10 (10Be). In the paper he lays out an equation for the production of in situ 10Be (q) based on latitude and altitude. I’m currently working on an idea I have for using cosmogenic nuclides as tracers for basin scale changes in uplift rate, so I wanted to see what his equation looked like applied. The equation is a third degree polynomial in altitude (y), with coefficients that depend on latitude (L).

I grabbed an old raster (GEBCO 2014, 30 arc-second) I had lying around for Eastern North America and plotted it up. First, the elevation map (obviously latitude is on the y-axis…)


Elevation map for ENAM.

And then apply the Lal, 1991 equation and find


Plotting Lal’s 1991 in situ production rate equation for ENAM. Green–>red increasing production rate. Production rate = NA in water.

I think the interesting observation is how little of the mapped area shows any significant change in the production rate. Maybe this should be obvious, since the polynomial depends directly on altitude and altitude doesn’t change that much in most of the map. Further, the dependence on latitude is not at all observable with this map; perhaps because the latitude range is not very large, or the coefficients never change by more than an order of magnitude anyway. Next time, maybe a world elevation map! Not sure my computer has enough memory…
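To make the structure of the calculation concrete, here is a minimal sketch in Python (not the code linked below). It evaluates a cubic in altitude whose four coefficients are looked up by latitude, and returns NaN for cells below sea level, as in the map. The coefficient values are placeholders and are NOT the values from Lal (1991); the linear interpolation between tabulated latitudes and the function names are also assumptions made for illustration. Altitude must be in whatever units the real coefficient table expects.

```python
import math

# PLACEHOLDER coefficients (a, b, c, d) keyed by latitude in degrees.
# These are NOT the values from Lal (1991); substitute the published table.
PLACEHOLDER_COEFFS = {
    30: (1.0, 0.5, 0.2, 0.05),
    40: (1.2, 0.6, 0.25, 0.06),
}

def coeffs_at(lat_deg, table):
    """Linearly interpolate each coefficient between tabulated latitudes (an assumption)."""
    lats = sorted(table)
    lat = min(max(lat_deg, lats[0]), lats[-1])  # clamp to the tabulated range
    for lo, hi in zip(lats, lats[1:]):
        if lo <= lat <= hi:
            w = (lat - lo) / (hi - lo)
            return tuple((1 - w) * p + w * q for p, q in zip(table[lo], table[hi]))
    return table[lats[0]]  # table with a single latitude row

def production_rate(lat_deg, alt, table=PLACEHOLDER_COEFFS):
    """q(L, y) = a(L) + b(L)*y + c(L)*y^2 + d(L)*y^3; NaN below sea level (water)."""
    if alt < 0:
        return math.nan
    a, b, c, d = coeffs_at(lat_deg, table)
    return a + b * alt + c * alt**2 + d * alt**3
```

Applying `production_rate` to every cell of an elevation raster, using the cell's latitude and elevation, gives a map like the one above.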

You can grab the code I used from here and Lal’s paper from here.

River hysteresis and plotting hysteresis data in R

Hysteresis is the concept that a system (be it mechanical, numerical, or whatever else) is dependent on the history of the system, and not only the present conditions. This is the case for rivers. For example, consider the following theoretical flood curve and accompanying sediment discharge curve (Fig. 1a).

Figure 1. Theoretical plots to demonstrate the hysteresis of a river.


With the onset of the flood, the increased sediment transport capacity of the system entrains more sediment and the sediment discharge curve (red, Fig. 1a) rises. However, the system may soon run out of sediment to transport (really just a reduction in easily transportable sediment), and the sediment discharge curve decreases although the water discharge curve remains high in flood.

In Fig. 1b, sediment discharge is plotted against water discharge, with time advancing along the curve; this is a typical way of observing the hysteresis of a system. Note that on the rising limb and falling limb of the river flood, the same water discharge produces two different sediment discharge values.

Now, let’s imagine that we want to investigate how important the history of the system is to the present state of our study river. You can grab the data I’ll use, here. This is data from one year of flow on the Huanghe (Yellow River) in China, and it has been smoothed with a moving average function to make the hysteresis function more visible.

Making the plot

It is easy enough to plot a line with R (the lines function) but with a hysteresis plot, it is important to be able to determine which direction time is moving forward along the curve. For this reason we want to use arrows. So we plot the line data first, with:


and then using a constructed vector of every 22nd number, we plot an arrow over top of the lines using:

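s <- seq(from=1, to=length(df[,"Qw"])-1, by=22)  # every 22nd index; stops before the last point so s+1 is valid
arrows(df[,"Qw"][s], df[,"Qs"][s], df[,"Qw"][s+1], df[,"Qs"][s+1])  # arrow from point s to point s+1; styling arguments omitted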

Finally, with a few more modifications to the plot (see the full code here), we can produce Fig. 2 below. This plot is comparable to the theoretical one above.

Figure 2: Hysteresis plot from actual data.


Using the green lines and points, I have highlighted the observation that for the rising limb and falling limb of a flood, there can be substantially different sediment discharges for the same water discharge — this observation is not so easily made from the plot on the left, but it is immediately clear in the hysteresis plot on the right.

Be inspired while coding! — Matlab script

Here is a script I wrote a while ago that returns an inspirational quote when called. I made it as a joke to send to my lab group during a particularly grueling week of coding. You can simply call the function at the beginning of your script to have a quote printed to stdout, or you can wrap it inside a wbar if you want.

Grab the code from my GitHub, here.


example output from the inspire.m script.

Pint glass short-pours

Have you ever gotten a short pour in your pint glass at the bar but not said anything? Well, after reading this, you may decide you want to say something next time. I’m not the first one to look at the point I’m making here, but I didn’t like the way others have presented it, and wanted to run the numbers myself anyway. The problem is to determine how much beer you are really missing out on, by missing that top bit of the pour.

For a theoretical pint glass, the volume of the glass increases with increasing h non-linearly from the base of the glass to the top. This is because the area of a circle is defined by πr², where r changes linearly along h from r_b to r_t. L represents the vertical length of glass not filled with beer, measured down from the top of the glass.

schematic for terms used in problem.

I approached this problem two ways. First, I set up some simple relations in Matlab, and then numerically estimated the integral to a high spatial resolution, to determine how the volume of liquid in the glass changes with increasing h. I defined the glass geometry by crudely measuring a pint glass, and then fudging the measurements such that volume obtained for the full glass was 16 oz (one pint). Second, I actually filled my glass with 1 oz. slugs of water, and measured the height of the liquid in the glass.
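Here is a minimal sketch of the numerical approach, written in Python rather than Matlab. The two radii are assumed values for illustration, not the measurements behind the figures; the height is then chosen so that the full glass holds exactly 16 oz, which mirrors the fudging step described above. The function names are made up for this sketch.

```python
import math

PINT_CM3 = 473.176  # 16 US fluid ounces in cubic centimetres

# ASSUMED radii in cm (illustrative only, not the measured glass).
R_BASE, R_TOP = 3.0, 4.4
# Pick the height so the full glass holds exactly one pint (frustum volume formula).
HEIGHT = 3 * PINT_CM3 / (math.pi * (R_BASE**2 + R_BASE * R_TOP + R_TOP**2))

def radius(h):
    """Radius changes linearly with height from R_BASE (h=0) to R_TOP (h=HEIGHT)."""
    return R_BASE + (R_TOP - R_BASE) * h / HEIGHT

def volume_to(h, n=10_000):
    """Midpoint-rule estimate of the integral of pi*r(h)^2 from 0 to h, in cm^3."""
    dh = h / n
    return sum(math.pi * radius((i + 0.5) * dh) ** 2 * dh for i in range(n))

def fraction_missing(gap_cm):
    """Fraction of the full volume lost when the top gap_cm of the glass is empty."""
    return 1 - volume_to(HEIGHT - gap_cm) / volume_to(HEIGHT)

print(round(volume_to(HEIGHT), 1))  # full-glass volume in cm^3
print(fraction_missing(1.27))       # share of the pint lost to a 1/2 inch gap
```

Because the radius grows with height, each extra centimetre near the rim adds more volume than a centimetre near the base, so `fraction_missing` grows faster than the gap itself.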

Figure 1 shows the modeled and experimental results.

Figure 1: modeled and experimental results for the pint glass problem.


Since the experimental results closely overlay the model results, it is valid to assume the model calculations are accurate and reflect an actual pint glass, so I will proceed only considering the modeled results.

It’s immediately clear (and consistent with our expectation) that the top of the glass is where most of the liquid is held. This is seen in the data with the line slope; a shallow slope in the bottom of the glass means that an increase in the height of liquid equals a small percentage of total volume, whereas at the top of the glass, the same increase in height accounts for a much larger percentage of total volume. This has everything to do with the fact that the cross sectional area of the glass increases with increasing height (A_h = πr_h²).

But to address the question at hand (how much does a short pour really cheat you?), let’s look at Figure 2.

Figure 2: manipulated model results to demonstrate volume lost for small loss in total pour height.


You can see that for a pour in which the top 1/2 inch (1.27 cm) is left empty, the drinker misses out on about 15% of the total volume of the pint-sized beer he paid for! If you are a regular at a hypothetical bar that short pours, then for roughly every 7 beers you buy (1/0.15 ≈ 6.7), you would be paying for a beer you never got to drink. Now, maybe your bartender isn’t leaving 1/2 inch of empty space at the top of your glass (although I have had it happen), but I do hope that you may think twice about not saying anything if you’re given a bad pour in the future.



Following a suggestion from /u/myu42996: fraction per fraction


Reddit data — When is it really too soon to retail Christmas?

About this time every year, people begin to complain about retail stores having Christmas themed displays and merchandise out. Well, speaking objectively, I think it is totally fair game for retail stores to shift to Christmas-mode, once people begin to think and talk about Christmas. Can Reddit post topics act as a proxy to determine when people begin to talk about Christmas? In each of the following plots, the black open circles represent a single day’s value, and the red line is a 7-point moving average designed to eliminate the weekly periodicity of Reddit posting.
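For reference, a 7-point moving average can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the plots: it assumes a centered window (three days either side) and leaves the first and last three days empty, and the function name is made up for this sketch.

```python
def moving_average(values, window=7):
    """Centered moving average for an odd window; edges without a full window are None."""
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out

# Toy series with a purely weekly pattern: five quiet days, two busy days.
toy_counts = [10, 10, 10, 10, 10, 40, 40] * 4
smoothed = moving_average(toy_counts)
```

Any 7 consecutive days contain each day of the week exactly once, so a purely weekly pattern averages out to a flat line, which is why a 7-point window suits this data.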

posts with 'christmas' in title

Well, it looks like the beginning of an increase in Christmas related posts occurs in the middle of October, with a substantial increase at the very end of November (just after Thanksgiving). Let’s dig a little deeper though. In the plot below, I’ve taken the same data and plotted them on a logged y-axis to highlight the variability.

log plot of xmas posts

From the above plot, it seems that the steady increase begins as early as the middle of September! Is a steady increase really enough to conclude that the conversation has begun though? Well I decided to take a look at the variation in the data to try and answer that.

In the above plot, the data have been normalized per day to the percent of total posts that have Christmas in the title. On December 25th, 16% of posts to Reddit included the word Christmas in the title (over 16,000 posts)! Now, I took the period from April 1 to Aug 15, and determined the mean and standard deviation. The horizontal black line represents the mean for this period, and the gray box spans 2 standard deviations from the mean.

Taking 2 complete standard deviations from the mean to be a good indicator of significant change, the conversation about Christmas breaches this threshold right in the middle of September.
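The threshold test can be sketched as follows. This is a minimal Python illustration under some assumptions: it uses the sample standard deviation, it checks each day's value on its own, and the function names are made up for this sketch. The numbers in the example are toy values, not the Reddit data.

```python
from statistics import mean, stdev

def percent_of_posts(matches, totals):
    """Daily matching posts as a percent of that day's total posts."""
    return [100 * m / t for m, t in zip(matches, totals)]

def first_breach(series, baseline, k=2.0):
    """Index of the first value above mean(baseline) + k * sd(baseline), else None."""
    threshold = mean(baseline) + k * stdev(baseline)
    for i, value in enumerate(series):
        if value > threshold:
            return i
    return None

# Toy values only: a quiet baseline period, then a series that drifts upward.
toy_baseline = [1.0, 1.2, 0.8, 1.0]
toy_series = [1.0, 1.3, 1.4, 2.0]
breach_index = first_breach(toy_series, toy_baseline)
```

With daily values this noisy, applying the same test to the 7-point moving average instead of the raw values would make the breach date less sensitive to a single unusual day.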

Now, you are probably thinking “Well these posts in September and October are probably just all people complaining about the early retailing of Christmas!” Well yes, you may be right. But hey, even if that’s the case, the retail companies are succeeding at getting you to talk about Christmas which means it is worth their time to put up the merchandise early, since you then buy more!


Just for fun, here are the conversation plots for some other holidays!


Reddit data — Graduate School talk

This is the first post in a series I’ll be doing about posting on Reddit for 2013. The posts in this series search through every single post made to Reddit in 2013 — that’s over 50 GB worth of data, and over 39,000,000 posts!

For this post, I examined every post made to any subreddit for any word that related to graduate school (including law and medical school) for each day of 2013, in either the ‘title’ or the ‘self-text’. The key used for positive matches is at the end of this post.


posts made to Reddit with words relating to graduate school in 2013. 1 data point for each day, red line is 7 point moving-average.

This is maybe not as telling as I had expected; there’s a ton of variance day to day and week to week. The most obvious observation is the spike in graduate school related posts in the month of April, following a consistent increase in posts in March. I would suggest this is probably because this is the time of the year when a lot of acceptance decisions come out.

Normalizing the data against total posts for the day is not any more telling; the profile is just stretched a bit in the y-direction. The 7 point moving-average is an attempt to remove the weekly periodicity of Reddit posting.

The key used was [‘grad school’, ‘graduate school’, “master’s”, ‘masters’, ‘ phd ‘, ‘ gre ‘, ‘letter of recommendation’, ‘letters of recommendation’, ‘doctorate’, ‘law school’, ‘med school’, ‘medical school’, ‘transcript’, ‘undergraduate gpa’, ‘undergrad gpa’]. There are of course more keywords that could have been used, but many have multiple implications, and this list was used as an attempt to minimize false positives.
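A minimal Python sketch of how a key like this can be applied is shown here; it is not the original script. It assumes a case-insensitive substring match on the title plus self-text, and it pads the text with spaces so that the space-delimited keys (' phd ', ' gre ') can also match at the very start or end of a post. The function names and the (day, title, selftext) tuple layout are made up for illustration.

```python
from collections import Counter

KEYWORDS = ['grad school', 'graduate school', "master's", 'masters', ' phd ', ' gre ',
            'letter of recommendation', 'letters of recommendation', 'doctorate',
            'law school', 'med school', 'medical school', 'transcript',
            'undergraduate gpa', 'undergrad gpa']

def mentions_grad_school(title, selftext=''):
    """True if any keyword appears in the lowercased, space-padded title + self-text."""
    text = f' {title} {selftext} '.lower()
    return any(keyword in text for keyword in KEYWORDS)

def daily_counts(posts):
    """Count matching posts per day; posts is an iterable of (day, title, selftext)."""
    counts = Counter()
    for day, title, selftext in posts:
        if mentions_grad_school(title, selftext):
            counts[day] += 1
    return counts
```

The spaces around ' gre ' are what keep words such as "degree" or "agreed" from counting as matches, at the cost of missing cases like "GRE." where punctuation follows directly.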

CFD file sizes

I’m presently taking a computational fluid dynamics (CFD) course here at Rice (taught by Dr. Tayfun E. Tezduyar), and I was shocked to learn the sheer volume of data generated from a typical CFD simulation, and the digital storage required to be able to look back at the modeled results. A homework problem was the following:

“Consider a 3D computation of air circulation in a room with temperature effects. There are also 3 chemical species (e.g. 3 pollutants) we want to keep track of. The species concentrations are so low that they do not influence the fluid density or velocity. The number of grid points is 10 million, and the simulation takes 1000 time-steps. Assuming that a number takes 8 bytes, how much disk storage do you need to store all the computed data?”

By my calculations, you would generate approximately 640 Gigabytes (GB) of data in the process of solving this problem!

In the figure below you can see how mesh size influences the storage size of the generated data for three different simulation lengths. I have isolated the figure to the interesting data (everything with a mesh size below 10^4 points generates relatively small data volumes). All model parameters are provided below the figure.


Volume of data generated for model runs with different numbers of time-steps. Model has 8 unknowns, with every unknown value requiring 8 bytes of storage.

Model parameters:

  • 8 unknowns for every mesh point at every timestep
    • conservation of momentum = 3
    • conservation of energy = 1
    • conservation of mass (continuity) = 1
    • species to track = 3
  • simulation lengths of 1000, 1500, and 2000 evenly spaced time-steps
  • 8 bytes are required to store a single unknown (meaning one value at one point, at one time)

Interesting note: increasing the complexity of your problem without increasing the number of unknowns has no effect on the storage size, but only on the time required to complete the simulation!
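The arithmetic behind these numbers is a single product, sketched here in Python with the parameters listed above. The function name is made up for this sketch, and gigabytes are taken as decimal (10^9 bytes).

```python
def storage_bytes(mesh_points, time_steps, unknowns=8, bytes_per_value=8):
    """Bytes needed to store every unknown at every mesh point for every time-step."""
    return mesh_points * time_steps * unknowns * bytes_per_value

# Homework case: 10 million grid points, 1000 time-steps, 8 unknowns, 8 bytes each.
homework_gb = storage_bytes(10_000_000, 1000) / 1e9
print(homework_gb)  # 640.0

# The three simulation lengths used in the figure, at the same mesh size.
for steps in (1000, 1500, 2000):
    print(steps, storage_bytes(10_000_000, steps) / 1e9)
```

Storage scales linearly with each factor, so doubling the number of time-steps or the number of mesh points doubles the data volume.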