Category Archives: visualization

Groundwater flow demonstration slides

I recently came across a set of old slides from the National Water Well Association, created in 1977. The slides can be found here as an imgur album. These images show principles of groundwater flow that were not novel at the time, but they do a great job of showing (primarily through time series) how different media can drastically change groundwater flow. Further, the slides show the effects of groundwater extraction by both natural means (gaining streams) and anthropogenic ones (wells). Below is a gif I put together of the first ~20 slides, which shows the experimenter's setup over the duration of the experiment. There is a constant "head" (a standing supply of water) at the right side of the setup that flows toward the left side, with dye injected at a few depths to show the flow lines.

[Figure: groundwaterexperiment]

I've included my favorite image from the set below, which shows (beautifully) a cone of depression. When drilling a well, it is (obviously) critical to extend the well down into the water table, such that the pumping end of the well is entirely submerged. Once the well is "switched on", there is a drawdown of the water table above the well pump head.

[Figure: groundwater14]

The cone forms around the well extraction point with its deepest point at the well head, as the well pulls water in from ALL directions (below included!). This disturbs the flow lines, which is shown really well by this experiment. Note that this experiment just shows a 2D slice of the cone, when in reality the “aquifer” is 3D and would extend into and out of your computer screen, making the wedge you see here a cone in three dimensions.

TeXlive install process, or developing an intelligent download timer: Part 2

In part one of this series, I presented the download predictions for a program installation. The ultimate goal here is to develop a method for accurately predicting total download times from the beginning of the download process. As the download progresses, it should become increasingly easy to make an accurate prediction of the time remaining, because there is less and less left to download and more and more information available about the download history.

Naturally, the first thing we want to do is see if the data actually follow any sort of trend. In theory, larger packages in the TeXlive install should take longer to download, but the time per unit volume (the download speed) should be roughly the same for each package and should remain roughly constant (or, really, vary about some constant mean) over time. The easiest way to check this is to simply plot the download speed over the duration of the download.

[Figure: secperkB_overtime]

The speed varies significantly, but that's okay. Qualitatively, the speeds appear to vary randomly about a mean value. This is good for us, because it means there is no trend in speed over time, as there would be if, for example, the mean speed were drifting over the course of the install.

This means that we can build a model of download speeds to predict the total download time. If we simply fit a linear model to the data above (i.e., elapsed package time ~ size), we find that the data are reasonably well explained (r² = 0.60) by a line with slope = 3.003797e-4 and intercept = 3.661338e-1.

[Figure: model]

Then we can evaluate our linear model, in essence predicting the time it will take to download each package based on its size, and sum those predictions to produce an estimate of the total download time. This evaluation produces a predicted total download time of 29:26 mm:ss (plotted as the dashed line below).
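For concreteness, here is a minimal sketch of that fit-and-sum procedure in Python (not the script I actually used), with simulated per-package data standing in for the parsed install log:

```python
import numpy as np

# Hypothetical per-package data: sizes in kB and measured download times in s.
# In the real case these would be parsed from the saved installer output;
# here they are simulated just so the example runs.
rng = np.random.default_rng(0)
sizes_kb = rng.uniform(5, 5000, size=3188)
times_s = 3.0e-4 * sizes_kb + 0.37 + rng.normal(0, 0.3, size=sizes_kb.size)

# Fit time ~ size with an ordinary least-squares line (degree-1 polynomial).
slope, intercept = np.polyfit(sizes_kb, times_s, 1)

# Predict the time for every package from its size, then sum for the total.
predicted_total_s = np.sum(slope * sizes_kb + intercept)
minutes, seconds = divmod(predicted_total_s, 60)
print(f"slope={slope:.3e}, intercept={intercept:.3e}, "
      f"predicted total = {int(minutes)}:{int(seconds):02d} (mm:ss)")
```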

[Figure: timeseries_witheval]

29:26 happens to be the exact time that our download took. That means that despite all the variations in download speeds, the mean over time was so constant that a simple linear model (a constant download speed) perfectly predicts the observed data; perhaps this is not surprising when you see the roughly constant-slope red line above.

Now, this model was based on perfect information at the end of the download, but in the next post, we’ll explore a common, simple, and popular prediction algorithm as a test of an a priori and ongoing prediction tool.

TeXlive install process, or developing an intelligent download timer: Part 1

I recently got a new laptop, and during the process of setting it up to my preferences, I installed LaTeX through TeXlive. This means a massive download of many small packages that get included in the LaTeX install; in effect, this is how most software installs go, many small parts that make up the whole. Installing TeXlive on Linux gave me the chance to actually see the report of the download and, of course, to save it and plot it up after completion. Here is what the data output to the console looks like during install:

[Figure: data]

After 3 downloads, the installer makes a prediction of the total time, and then reports the elapsed time against the predicted time, along with some information about the current download. If we take this information for all 3188 packages and parse out the desired fields, we can plot the actual time versus the predicted time to see how the prediction performs over the course of the download.

[Figure: timeseries]

There are some pretty large swings in the predicted time at the beginning of the download, but by about 25% of the total download by size, the prediction becomes pretty stable, making only minor corrections. The corrections continue until the very end of the downloads.

Download time prediction is a really interesting problem to work on, since you are attempting to account for download speed, which depends largely on things outside the control of your own computer and is likely to vary over timescales longer than a few minutes. I'll be making a few posts about this topic over the next months, culminating in what I hope is a simple, fast, and accurate download time prediction algorithm. More to come!
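To give a flavor of where this is headed, here is a sketch of the simplest possible running estimator (my own illustration, not the TeXlive installer's actual algorithm): divide what is left to download by the mean speed observed so far.

```python
def remaining_time_estimate(downloaded_bytes, total_bytes, elapsed_s):
    """Naive ETA: assume the mean speed so far holds for the rest of the download."""
    if downloaded_bytes == 0:
        return None  # no information yet, so no prediction can be made
    mean_speed = downloaded_bytes / elapsed_s          # bytes per second so far
    return (total_bytes - downloaded_bytes) / mean_speed

# Example: 25% of a 1 GB download finished in 5 minutes.
print(remaining_time_estimate(250e6, 1000e6, 300))  # -> 900.0 seconds remaining
```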

Google Opinion Rewards

Anyone who knows me knows that I'm a Google fanboy. I don't own any Apple products and will jump at any opportunity to tell anyone around me why Google is superior to Apple. Let me share with you one reason: Google Opinion Rewards is an app for your phone that periodically offers you the option to take a short survey.

Sometimes the survey is about a place you have visited recently (based on location history) and sometimes it is something completely random; but always, it only takes a few moments. You get paid (“rewarded”) for your time with Google Play Store credit. I take the surveys when I get a free second and I actually enjoy them; I think it’s kind of fun to know what companies want to know from you (the surveys are sponsored by other non-Google companies).

I first signed up for Google Opinion Rewards just over two years ago and wanted to see how much money I had made for what is basically zero investment of time and effort. The app allows you to see your rewards history, but only on your phone, so I had to get a little creative here… I ended up taking screen grabs of all of the "data" and putting them through an image-to-text converter (free-ocr.com) to get the data into a text form I could use.
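If you wanted to do that conversion locally rather than through free-ocr.com, a sketch with pytesseract (not what I actually used) would look something like this:

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR engine to be installed

# Hypothetical list of screenshot files of the rewards history.
screenshots = ["rewards_page_1.png", "rewards_page_2.png"]

# Run OCR on each screen grab and collect the raw text for later parsing.
raw_text = "\n".join(pytesseract.image_to_string(Image.open(f)) for f in screenshots)
print(raw_text)
```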

[Figure: 2016-01-06 17.48.19]
The screen-captured form of the data.

Below is the distribution of reward amounts and a cumulative total of the reward amounts received from the service.

[Figure: google_rewards_hist_hist]
[Figure: google_rewards_hist]
Unsurprisingly, the bulk of reward amounts are on the low end of the spectrum. 10 cents is the mode, which is essentially the “thanks but no thanks” reward for your opinion. 25 cents is a popular number too, but interestingly 50 cents is not.

The cumulative plot shows that there have been some “dry spells” and some “hot streaks” to my survey responses, but I’ve been pretty consistently rewarded, especially in the 2015 year. There is no trend in the amount of a reward over time since I signed up for the service.

I have earned a total of $52.50 in Google Play credit for doing what I would effectively call nothing. That's a pretty great deal. Of course, you can only use the credits in Google's own store, but that's still pretty good considering you can get apps, music, movies, and books there. For me, it has meant that I can buy any app whenever I want one, without hesitation, for "free". I have about $20 of credit sitting around right now; maybe I'll go buy some Snapchat head effects…

History of the Houston Rodeo performances

The Houston Livestock Show and Rodeo is one of Houston's largest and most famous annual events. Now, I won't claim to know much about the Houston Rodeo; heck, I've only been to the Rodeo once, and I've lived in Houston for a little over a year and a half! I went to look for the lineup for 2016 to see what show(s) I may want to see, but they haven't released the lineup yet (it comes out Jan 11, 2016). I got curious about what the history of the event was like, and conveniently, they have a past performers page; this is the base source for the data used in this post.

First, I pulled apart the data on the page and built a dataset of each performer and every year they performed. The code I used to do this is an absolute mess, so I'm not even going to share it, but I will post the dataset here (.rds file). Basically, I had to convert all the non-uniformly formatted year data into clean, uniformly formatted lists of years for each artist.
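That cleaning basically meant turning free-form year strings into uniform lists of years; a sketch of that kind of conversion (with a hypothetical input format, since the real page data were messier, and in Python rather than the R I used) could look like this:

```python
import re

def parse_years(raw):
    """Turn a free-form string like '1995, 1997-1999' into a sorted list of years."""
    years = set()
    for token in re.findall(r"\d{4}(?:\s*-\s*\d{4})?", raw):
        if "-" in token:
            start, end = (int(y) for y in re.split(r"\s*-\s*", token))
            years.update(range(start, end + 1))
        else:
            years.add(int(token))
    return sorted(years)

print(parse_years("1995, 1997-1999, 2004"))  # -> [1995, 1997, 1998, 1999, 2004]
```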

[Figure: hr_hist]

Above is the histogram of the number of performances across all the performers. As expected, the distribution is skewed right, with a long tail toward higher numbers of performances per performer. Just over 51% of performers have performed only one time, and 75% of performers have performed fewer than 3 times. This actually surprised me; I expected to see even fewer repeat performers, since a lot of big names have come to the Rodeo over the years. The record for the most performances (25) is held by Wynonna Judd (Wynonna).

I then wanted to see how the number of shows per year changed over time, since the start of the Rodeo.

[Figure: hr_peryr]

The above plot shows every year from the beginning of the Rodeo (1931) to the most recent completed event (2015). The blue line is a loess smoothing of the data. Now, I think that the number of performances corresponds to the number of days of the Rodeo (i.e., one concert a night), but I don't have any data to confirm this. It looks like the number of concerts in recent years has declined, but I'm not sure whether the event has also been shortened (e.g., from 30 to 20 days). Let's compare that with the attendance figures from the Rodeo.
[Figure: hr_comps]

Despite fewer performances per year since the mid-1990s, the attendance has continued to climb. Perhaps the planners realized they could lower the number of performers (i.e. cost) and still have people come to the Rodeo. The Rodeo is a charity that raises money for scholarships and such, so more excess revenue means more scholarships! Even without knowing why the planners decided to reduce the number of performers per year, it looks like the decision was a good one.

If we look back at the 2016 concerts announcement page, you can see that they list the genre of the show for each night, but not yet the performers. I wanted to see how the genre breakdown of performers has changed over the years of the Rodeo. So, I used my dataset and the Last.fm API to get the top two user-submitted "tags" for each artist, and then classed the performers into 8 different genres based on these tags. Most of the tags are genres, so about 70% of the data was easy to class; I then manually binned all the remaining artists into the genres, trying to be as unbiased as possible.
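The tag lookup itself uses the Last.fm artist.getTopTags endpoint; a rough sketch of that step (you would need your own API key, and this is not my original script) is below:

```python
import requests

API_KEY = "YOUR_LASTFM_API_KEY"  # placeholder; get one from last.fm/api

def top_two_tags(artist):
    """Return the two most popular user-submitted tags for an artist on Last.fm."""
    resp = requests.get(
        "http://ws.audioscrobbler.com/2.0/",
        params={
            "method": "artist.gettoptags",
            "artist": artist,
            "api_key": API_KEY,
            "format": "json",
        },
        timeout=10,
    )
    tags = resp.json().get("toptags", {}).get("tag", [])
    return [t["name"] for t in tags[:2]]

print(top_two_tags("Wynonna Judd"))  # e.g. ['country', ...]
```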

[Figure: hr_breakdown]

It's immediately clear that, since the beginning, country music has always dominated the Houston Rodeo lineup. I think it's interesting to see the increase in the variety of music since the late 1990s, with a lot more Latin music and pop entering the lineup. I should caveat, though, that the appearance of pop music may be complicated by the fact that what was once considered "pop" is now considered "oldies". There have been a few comedians throughout the Rodeo's run, but none in recent years. 2016 will feature 20 performances again, with a split that looks pretty darn similar to 2015, with a few substitutions:

[Figure: hr_2016]

Lal, 1991 in situ 10-Be production rates

10Be is a cosmogenic radioactive nuclide that is produced when high-energy cosmic rays collide with atomic nuclei and cause spallation. 10Be produced in the atmosphere (and then transported down to the surface) is called "meteoric", and 10Be produced within mineral lattices in soil and rocks is called "in situ". In 1991, Devendra Lal wrote a highly cited paper about the physics of in situ produced 10Be. In the paper he lays out an equation for the production rate of in situ 10Be (q) as a function of latitude and altitude. I'm currently working on an idea I have for using cosmogenic nuclides as tracers for basin-scale changes in uplift rate, so I wanted to see what his equation looks like when applied. The equation is a third-degree polynomial in altitude (y), with coefficients that depend on latitude (L).
[Figure: Lal_1991_table1]
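To make the structure of the calculation explicit, here is a minimal sketch of how the polynomial gets evaluated over a set of elevations. The coefficients below are placeholders for illustration, not Lal's published values; in the real calculation you pick (or interpolate) the coefficient row from the table above for the latitude of each grid cell.

```python
import numpy as np

def lal_q(altitude, coeffs):
    """Evaluate q = a + b*y + c*y**2 + d*y**3 for altitude y at one latitude.

    `coeffs` stands in for one latitude row of Lal's table; the values used
    in the example below are placeholders, NOT the published coefficients.
    """
    a, b, c, d = coeffs
    return a + b * altitude + c * altitude**2 + d * altitude**3

# Hypothetical elevations and a placeholder coefficient row (illustrative only).
elevations_km = np.linspace(0, 2, 5)
placeholder_row = (3.9, 3.0, 1.0, 0.4)

print(lal_q(elevations_km, placeholder_row))
```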

I grabbed an old raster (GEBCO 2014, 30 arc-second) I had lying around for Eastern North America and plotted it up. First, the elevation map (obviously latitude is on the y-axis…):

[Figure: elev_map]
Elevation map for ENAM.

And then, applying the Lal (1991) equation, we find:

[Figure: Lal_map]
Lal's (1991) in situ production rate equation applied to ENAM. Green to red indicates increasing production rate; production rate is NA over water.

I think the interesting observation is how little of the mapped area shows any significant change in the production rate. Maybe this should be obvious, since the polynomial depends directly on altitude and altitude doesn't change that much across most of the map. Further, the dependence on latitude is not at all observable in this map; perhaps because the latitude range is not very large, or because the coefficients never change by more than an order of magnitude anyway. Next time, maybe a world elevation map! Not sure my computer has enough memory…

You can grab the code I used from here and Lal’s paper from here.

Pint glass short-pours

Have you ever gotten a short pour in your pint glass at the bar but not said anything? Well, after reading this, you may decide you want to say something next time. I'm not the first one to look at the point I'm making here, but I didn't like the way others have presented it, and I wanted to run the numbers myself anyway. The problem is to determine how much beer you really miss out on when that top bit of the pour is left unfilled.

For a theoretical pint glass, the volume of the glass increases non-linearly with height h above the base. This is because the cross-sectional area at any height is that of a circle, πr², where r increases linearly with h from r_b at the base to r_t at the top. L represents the vertical length of glass not filled with beer, measured down from the top of the glass.

[Figure: schematic for the terms used in the problem]

I approached this problem two ways. First, I set up some simple relations in Matlab and numerically estimated the integral at high spatial resolution to determine how the volume of liquid in the glass changes with increasing h. I defined the glass geometry by crudely measuring a pint glass and then fudging the measurements such that the volume obtained for the full glass was 16 oz (one pint). Second, I actually filled my glass with 1 oz. slugs of water and measured the height of the liquid in the glass.
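For anyone wanting to reproduce the model, here is a minimal sketch of the numerical approach (in Python rather than the Matlab I used, and with made-up glass dimensions rather than my fudged measurements): the radius is interpolated linearly from base to top, and the cross-sectional area is summed up the glass in thin slices.

```python
import numpy as np

# Approximate pint-glass geometry (made-up numbers, roughly pint-glass-shaped).
r_base, r_top = 2.6, 4.3   # radii at base and top (cm)
height = 15.0              # interior height (cm)

# Thin horizontal slices from base to top.
z = np.linspace(0, height, 10_000)
r = r_base + (r_top - r_base) * z / height   # radius varies linearly with height
area = np.pi * r**2                          # cross-sectional area of each slice
volume = np.cumsum(area) * (z[1] - z[0])     # cumulative volume up to height z

total = volume[-1]
# Fraction of the glass lost if the top 1.27 cm (1/2 inch) is left empty.
lost = 1 - np.interp(height - 1.27, z, volume) / total
print(f"total volume ~= {total:.0f} cm^3, short-pour loss ~= {lost:.1%}")
```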

Figure 1 shows the modeled and experimental results.

Figure 1: modeled and experimental results for the pint glass problem.

Since the experimental results closely overlay the model results, it is reasonable to assume the model calculations are accurate and reflect an actual pint glass, so I will proceed considering only the modeled results.

It's immediately clear (and consistent with our expectation) that the top of the glass is where most of the liquid is held. This is seen in the slope of the line: a shallow slope in the bottom of the glass means that an increase in the height of liquid corresponds to a small percentage of the total volume, whereas at the top of the glass, the same increase in height accounts for a much larger percentage of the total volume. This has everything to do with the fact that the cross-sectional area of the glass increases with increasing height (A(h) = πr(h)²).

But, to address the question at hand (how much does a short pour really cheat you?), let's look at Figure 2.

Figure 2: manipulated model results to demonstrate volume lost for small loss in total pour height.

You can see that for a pour in which the top 1/2 inch (1.27 cm) is left empty, the drinker misses out on about 15% of the total volume of the pint-sized beer he paid for! If you are a regular at a hypothetical bar that short-pours, then for roughly every 7 beers you buy, you pay for one that you never get to drink. Now, maybe your bartender isn't leaving 1/2 inch of empty space at the top of your glass (although I have had it happen), but I do hope that you may think twice about not saying anything if you're given a bad pour in the future.

 

——————————————————————————————–

Following a suggestion from /u/myu42996: fraction per fraction

[Figure: fracfrac]

Reddit data — When is it really too soon to retail Christmas?

About this time every year, people begin to complain about retail stores having Christmas themed displays and merchandise out. Well, speaking objectively, I think it is totally fair game for retail stores to shift to Christmas-mode, once people begin to think and talk about Christmas. Can Reddit post topics act as a proxy to determine when people begin to talk about Christmas? In each of the following plots, the black open circles represent a single day’s value, and the red line is a 7-point moving average designed to eliminate the weekly periodicity of Reddit posting.
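The 7-point moving average is just a rolling mean over the daily counts; something like the following (a sketch with pandas, using simulated counts in place of the real daily totals) is all it takes:

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts of posts with 'christmas' in the title; in the
# real analysis these come from scanning the full 2013 post archive.
days = pd.date_range("2013-01-01", "2013-12-31", freq="D")
counts = pd.Series(np.random.default_rng(1).poisson(200, len(days)), index=days)

# 7-day centered rolling mean, which smooths out the weekly posting cycle.
smoothed = counts.rolling(window=7, center=True).mean()
print(smoothed.dropna().head())
```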

[Figure: posts with 'christmas' in title]

Well, it looks like the beginning of an increase in Christmas-related posts occurs in the middle of October, with a substantial increase at the very end of November (just after Thanksgiving). Let's dig a little deeper though. In the plot below, I've taken the same data and plotted them on a logged y-axis to highlight the variability.

[Figure: log plot of xmas posts]

From the above plot, it seems that the steady increase begins as early as the middle of September! Is a steady increase really enough to conclude that the conversation has begun though? Well I decided to take a look at the variation in the data to try and answer that.

[Figure: xmas_nmlzd]

In the above plot, the data have been normalized per day to the percent of total posts that have Christmas in the title. On December 25th, 16% of posts to Reddit included the word Christmas in the title (over 16,000 posts)! Now, I took the period from April 1 to Aug 15 and determined the mean and standard deviation. The horizontal black line represents the mean for this period, and the gray box is 2 standard deviations from the mean.

Taking 2 complete standard deviations from the mean to be a good indicator of significant change, the conversation about Christmas breaches this threshold right in the middle of September.
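Mechanically, finding where the data first breach that threshold takes only a few lines; here is a self-contained sketch (again with simulated counts standing in for the real normalized series):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the normalized daily series, then find the first day
# that exceeds the quiet-period mean + 2 standard deviations.
days = pd.date_range("2013-01-01", "2013-12-31", freq="D")
normalized = pd.Series(np.random.default_rng(1).poisson(200, len(days)),
                       index=days, dtype=float)

baseline = normalized["2013-04-01":"2013-08-15"]   # quiet period
threshold = baseline.mean() + 2 * baseline.std()   # 2-sigma threshold

later = normalized["2013-08-16":]
breach = later[later > threshold]
print(breach.index[0].date() if len(breach) else "no breach in the simulated data")
```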

Now, you are probably thinking “Well these posts in September and October are probably just all people complaining about the early retailing of Christmas!” Well yes, you may be right. But hey, even if that’s the case, the retail companies are succeeding at getting you to talk about Christmas which means it is worth their time to put up the merchandise early, since you then buy more!

 

Just for fun, here are the conversation plots for some other holidays!

[Figure: all_holiday]

Manning and Favre: Career TD leaders

In honor of Peyton Manning becoming the second player in NFL history to throw more than 500 touchdown passes, I put together a little graphic to show how he compares to the other QB with more than 500, Brett Favre.

[Figure: manning_favre_cum_tds]
Regular season cumulative touchdown passes for Peyton Manning and Brett Favre.

Looks like Favre slowed at the end of his career, but Peyton has only accelerated in the last few seasons with the Broncos. I have no doubts he will overtake Favre and become the leader in the next month.

Data from The Football Database

 

UPDATE: After sharing to Reddit, I have made a few more plots, and I include them below.

smoothed raw touchdown per game count. Not really much trend besides maybe the recent high per game TD period for Manning

cumulative interceptions for the two QBs

Reddit data — Graduate School talk

This is the first post in a series I'll be doing about posting on Reddit in 2013. For this series, I search through every single post made to Reddit in 2013: that's over 50 GB of data and over 39,000,000 posts!

For this post, I examined every post made to any subreddit, for each day of 2013, for any word related to graduate school (including law and medical school), in either the 'title' or the 'self-text'. The key used for positive matches is at the end of this post.

[Figure: grad_talk]
Posts made to Reddit with words relating to graduate school in 2013; one data point for each day, red line is a 7-point moving average.

Maybe not as telling as I had expected: there's a ton of variance day to day and week to week, but the most obvious observation is the spike in graduate-school-related posts in the month of April, following a consistent increase in posts in March. I would suggest this is probably because this is the time of year when a lot of acceptance decisions come out.

Normalizing the data against total posts for the day is not any more telling; the profile is just stretched a bit in the y-direction. The 7-point moving average is an attempt to remove the weekly periodicity of Reddit posting.

The key used was [‘grad school’, ‘graduate school’, “master’s”, ‘masters’, ‘ phd ‘, ‘ gre ‘, ‘letter of recommendation’, ‘letters of recommendation’, ‘doctorate’, ‘law school’, ‘med school’, ‘medical school’, ‘transcript’, ‘undergraduate gpa’, ‘undergrad gpa’]. There are of course more keywords that could have been used, but many have multiple implications, and this list was used as an attempt to minimize false positives.
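For the curious, the matching step itself is just a case-insensitive substring search over each post's title and self-text; a simplified sketch of that step (with a hypothetical post record, not my actual processing script) looks like this:

```python
KEY = ['grad school', 'graduate school', "master's", 'masters', ' phd ', ' gre ',
       'letter of recommendation', 'letters of recommendation', 'doctorate',
       'law school', 'med school', 'medical school', 'transcript',
       'undergraduate gpa', 'undergrad gpa']

def mentions_grad_school(post):
    """Return True if any keyword appears in the post's title or self-text."""
    # Pad with spaces so keys like ' phd ' and ' gre ' can match at the edges.
    text = f" {post.get('title', '')} {post.get('selftext', '')} ".lower()
    return any(kw in text for kw in KEY)

# Hypothetical post record in the style of the Reddit data dump.
example = {"title": "Got my GRE scores back!", "selftext": ""}
print(mentions_grad_school(example))  # -> True
```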