Category Archives: programming

Dietrich Settling Velocity Matlab code

Just about two years ago, I published a code for calculating the settling velocity of a particle in water by the Dietrich, 1982 method. I later realized an error in the code that made it produce erroneous estimates. I replaced it with a corrected version of the code written in R, that you can still find on my Github page, but won’t link here. I’ve taken down the old Matlab code and unpublished the original post.

Since the first publication two years ago, I’ve dramatically improved my technical coding skills, and, just as importantly, learned a ton about how to share code with others. A new version of the code can be found linked a the end of this post, but first some philosophy about coding.

As scientists, we want to be able to reliably reproduce results and we have an attraction to openness and access to knowledge, as such, we inevitably share code with one another. I’ve shared a good bit of my code thus far in my career, and I’ve made a few notes about the matter below:

  1. It is better to organize your code in a way that may not be the most succinct (when speed is not an issue) while improving the readability of the code. In essence, making your code readable to those unfamiliar with it, so they can follow along with your logic.
  2. Comment everything. It is so easy to get into the groove of writing something and blow through it until it’s finished and then never return to clean up and document the code. When others (or even you) go back to look at it, simple comments explaining what a line does go miles further than a detailed explanation in the documentation.
  3. Write your code to withstand more that you are currently using it for. What I mean by this is to not write code that works only when you are manipulating this particular dataset or working this particular problem, but code that will work over a broader range of data or can be applied to another problem. Along the same line, use explicit values as sparingly as possible, e.g., if X = [1:12] is a 12 element vector and you want the middle value, do not use X(6), but instead use X(length(X)/2); or slightly better yet, X(ceil(length(X)/2)) to at least not throw an error and give you the first of the two middle rows if X is an odd number of elements in length.

 

Anyway, the code is uploaded to my Github and you can find it there. The documentation output on a help get_DSV call in Matlab is produced below.

get_dsv_documentation

 

References:
Dietrich, W. E. Settling velocity of natural particles. Water Resources Research 18, 1615–1626 (1982).

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No.1450681. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors(s) and do not necessarily reflect the views of the National Science Foundation.

Yellow River shoreline traces

For some ongoing research I needed a set of shoreline traces of the Yellow River delta that I could measure some properties of to compare modeling results to. My model predicts delta lobe growth rates, for example the growth of the Qingshuiguo lobe on the eastern side of the delta from 1976 to 1996.

early growth stage of Qingshuiguo lobe of the Yellow River delta, figure from van Gelder et al., 1994.

early growth stage of Qingshuiguo lobe of the Yellow River delta, figure from van Gelder et al., 1994.

It’s a little early for me to release the data I’ve collected from the shoreline traces (n=296), or the code I used to generate the traces — but below is a figure mapping all the shoreline traces I have from 1855 to present. My model can also predict rates of delta system growth (i.e., growth over many lobe cycles) and I have generated data from these traces that I can compare to as well. I like the below figure, because I think it nicely demonstrates the rapid reworking of delta lobes (yes even on “river dominated” deltas!); evidenced most clearly by the retreat of the Diakou lobe to the north side of the delta.

xom_shoreline_map

 

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No.1450681. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors(s) and do not necessarily reflect the views of the National Science Foundation.

TeXlive install process, or developing an intelligent download timer: Part 2

In part one of this series, I presented the download predictions for a program installation. The ultimate goal here, is to develop a method for accurately predicting total download times from the beginning of the download process. To elaborate, as the download progresses, it should become increasingly easy to make an accurate prediction of the time remaining, because there is increasingly less and less to download, and you have increasingly more and more information about the download history.

Naturally, the first thing we want to do is see if the data actually follow any sort of trend. In theory, larger packages in the TexLive install should take longer, but the time/volume of each should be roughly the same, and should remain roughly constant (or really vary about some mean constantly) over time. The easiest way to determine this would be to simply plot the download speed over the duration of the download.

secperkB_overtime

The speed varies significantly, but that’s okay. It, qualitatively, appears that the distribution of speeds randomly varies about a mean value. This is good for us, because it means that there is no trend in the speed over time, like we would see for example if the mean speed were changing over time.

This means that we can build a model of download speeds to predict the total download time. If we simply fit a linear model to the data above (i.e., elapsed package time ~ size) we find that the data are reasonably well explained (r^2 = 0.60) by a line of slope = 3.003797e-4 and intercept = 3.661338e-1.model

Then, we can use our linear model to evaluate, in essence, to predict the time it will take to download each package based on the size of each package and then sum them to produce a prediction for the total download time. Evaluation produces a predicted total download time of 29:26 mm:ss (plotted as dashed line below).

timeseries_witheval

29:26 happens to be the exact time that our download took. That means that despite all the variations in download speeds, the mean over time was so constant that a simple linear model (a constant download speed) perfectly predicts the observed data; perhaps this is not surprising when you see the roughly constant-slope red line above.

Now, this model was based on perfect information at the end of the download, but in the next post, we’ll explore a common, simple, and popular prediction algorithm as a test of an a priori and ongoing prediction tool.

TeXlive install process, or developing an intelligent download timer: Part 1

I recently got a new laptop and during the process of setting up to my preferences, I install LaTeX through TeXlive. This means a massive download of many small packages that get included in the LaTeX install. In effect, this is how all software downloads go, many small parts that make up the whole. Installing TeXlive on Linux gave me the chance to actually see the report of the download, and of course to save it and plot it up after completion. Here is what the data output to the console looks like during install:data

After 3 downloads, the installer makes a prediction of the total time, and then reports the elapsed time against predicted time, along with some information about the current download. If we take this information for all 3188 packages and parse it apart for the desired information, we can plot the actual time, versus predicted time, so see how the prediction performs over time.timeseries

There are some pretty large swings in the predicted time at the beginning of the model, but by about 25% of the total download by size, the prediction becomes pretty stable, making only minor corrections. The corrections continue until the very end of the downloads.

Download time prediction is a really interesting problem to work on, since you are attempting to control for download speed which is largely dependent on things outside the realm of the personal computer and is likely to vary over timescales longer than a few minutes. I’ll be making a few posts about this topic over the next months, culminating with what I hope is a simple, fast, and accurate download time prediction algorithm. More to come!

History of the Houston Rodeo performances

The Houston Livestock Show and Rodeo is one of Houston’s largest and most famous annual events. Now, I won’t claim to know much about the Houston Rodeo, heck, I’ve only been to the Rodeo once, and have lived in Houston for a little over a year and a half! I went to look for the lineup for 2016 to see what show(s) I may want to see, but they haven’t released the lineup yet (comes out Jan 11 2016). I got curious of what the history of the event was like, and conveniently, they have a past performers page; this is the base source for the data used in this post.

First, I pulled apart the data on the page and built a dataset of each performer and every year they performed. The code I used to do this is an absolute mess so I’m not even going to share it, but I will post the dataset here (.rds file). Basically, I had to convert all the non-formatted year data, to clean uniformly formatted lists of years for each artist.

hr_hist

Above is the histogram of the number of performances across all the performers. As expected, the distribution is skewed right, towards the higher number of performances per performer. Just over 51% of performers have only performed one time, and 75% of performers have performed fewer than 3 times. This actually surprised me, I expected to see even fewer repeat performers. There have been a lot of big names come to the Rodeo over the years. The record for the most performances (25) is held by Wynonna Judd (Wynonna).

I then wanted to see how the number of shows per year changed over time, since the start of the Rodeo.

hr_peryr

The above plot shows every year since the beginning of the Rodeo (1931) to the most recent completed event (2015). The blue line is a Loess smoothing of the data. Now, I think that the number of performances corresponds with the number days of the Rodeo (i.e. one concert a night), but I don’t have any data to confirm this. It looks like the number of concerts in recent years has declined, but I’m not sure if the event has also been shortened (e.g. from 30 to 20 days). Let’s compare that with the attendance figures from the Rodeo.
hr_compsDespite fewer performances per year since the mid 1990s, the attendance has continued to climb. Perhaps the planners realized they could lower the number of performers (i.e. cost) and still have people come to the Rodeo. The Rodeo is a charity that raises money for scholarships and such, so more excess revenue means more scholarships! Even without knowing why the planners decided to reduce the number of performers per year, it looks like the decision was a good one.

If we look back at the 2016 concerts announcement page, you can see that they list the genre of the shows each night, but not yet the performers. I wanted to see how the division of genre of performers has changed over the years of the Rodeo. So, I used my dataset and the Last.fm API to get the top two user submitted “tags” for each artist. I then classed the performers into 8 different genres based on these tags. Most of the tags are genres so about 70% of the data was easy to class, I then manually binned all the remaining artists into the genres, trying to be as unbiased as possible.

hr_breakdown

It’s immediately clear that since the beginning, country music has always dominated the Houston Rodeo lineup. I think it’s interesting to see the increase in variety of music since the late 1990s, beginning to include a lot more Latin music and pop. I should caveat though, that the appearance of pop music may be complicated by the fact that what was once considered “pop” is now considered “oldies”. There have been a few comedians throughout the Rodeo’s run, but none in recent years. 2016 will feature 20 performances again, with a split that looks pretty darn similar to 2015, with a few substitutions:

hr_2016

Lal, 1991 in situ 10-Be production rates

10Be is a cosmogenic radioactive nuclide that is produced when high energy cosmic rays collide with nuclides and cause spallation. 10Be is produced in the atmosphere (and then transported down to the surface) as “meteoric”, and produced within mineral lattices in soil and rocks as “in situ“. In 1991, Devendra Lal wrote a highly cited paper about the physics of in situ produced Beryllium-10 (10Be). In the paper he lays out an equation for the production of in situ 10Be (q) based on latitude and altitude. I’m currently working on an idea I have for using cosmogenic nuclides as tracers for basin scale changes in uplift rate, so I wanted to see what his equation looked like applied. The equation is a third degree polynomial, with coefficients that depend on latitude (L), and direct dependency on altitude (y).
Lal_1991_table1

I grabbed an old raster (GEBCO 2014 30 arc second) I had laying around for Eastern North America and plotted it up. First, the elevation map (obviously latitude is on the y-axis…)

elev_map

Elevation map for ENAM.

And then apply the Lal, 1991 equation and find

Lal_map

Plotting Lal’s 1991 in situ production rate equation for ENAM. Green–>red increasing production rate. Production rate = NA in water.

I think the interesting observation is for how little of the mapped area there is any significant change in the production rate. Maybe this should be obvious since the polynomial has direct dependence on altitude and altitude doesn’t change that much in most of the map. Further the dependence of latitude is not all all observable with this map; perhaps because the latitude range is not very large, or the coefficients never change by more than an order of magnitude anyway. Next time, maybe a world elevation map! Not sure my computer has enough memory…

You can grab the code I used from here and Lal’s paper from here.

River hysteresis and plotting hysteresis data in R

Hysteresis is the concept that a system (be it mechanical, numerical, or whatever else) is dependent of the history of the system, and not only the present conditions. This is the case for rivers. For example, consider the following theoretical flood curve and accompanied sediment discharge curve (Fig. 1a).

Figure 1. Theoretical plots to demonstrate the hysteresis of a river.

Figure 1. Theoretical plots to demonstrate the hysteresis of a river.

With the onset of the flood, the increased sediment transport capacity of the system entrains more sediment and the sediment discharge curve (red, Fig. 1a) rises. However, the system may soon run out of sediment to transport (really just a reduction in easily transportable sediment), and the sediment discharge curve decreases although the water discharge curve remains high in flood.

In Fig. 1b, the sediment discharge and water discharge are plotted through time, a typical way of observing the hysteresis of a system. Note that for the rising limb and falling limb of the river flood, the same water discharge produces two different sediment transport values.

Now, let’s imagine that we want to investigate how important the history of the system is to the present state of our study river. You can grab the data I’ll use, here. This is data from one year of flow on the Huanghe (Yellow River) in China, and it has been smoothed with a moving average function to make the hysteresis function more visible.

Making the plot

It is easy enough to plot a line with R (the lines function) but with a hysteresis plot, it is important to be able to determine which direction time is moving forward along the curve. For this reason we want to use arrows. So we plot the line data first, with:

lines(df[1:366,"Qw"],df[1:366,"Qs"],
        lwd=2,
        )

and then using a constructed vector of every 22nd number, we plot an arrow over top of the lines using:

s <- seq(from=1, to=length(df[,"Qw"])-1, by=22)
arrows(df[,"Qw"][s], df[,"Qs"][s], df[,"Qw"][s+1], df[,"Qs"][s+1], 
        length=0.1,
        code=2,
        lwd=2
        )

Finally, with a few more modifications to the plot (see the full code here), we can produce Fig. 2 below. This plot is comparable to the theoretical one above.

Figure 2: Hysteresis plot from actual data.

Figure 2. Hysteresis plot from actual data.

Using the green lines and points, I have highlighted the observation that for the rising limb and falling limb of a flood, there can be substantially different sediment discharges for the same water discharge — this observation is not so easily made from the plot on the left, but it is immediately clear in the hysteresis plot on the right.

Be inspired while coding! — Matlab script

Here is a joke script I wrote a while ago that returns an inspirational quote when called. I made it as a joke to send to my lab group during a particularly grueling week of coding. You can simply call the function at the beginning of your script to have a quote printed to the stdout or you can wrap it inside a wbar if you want.

Grab the code from my GitHub, here.

inspire_output_example

example output from the inspire.m script.

Quick script for connecting to University VPN

I can be pretty lazy at times, so much that I will go out of my way a bit to write a bit of code to simplify my life. I don’t often connect to my university’s VPN, but when I eventually do want to, I can never remember the command needed to do it. So I have to take the time and look it up. Well I cut that out today with a basic little bash script with a name I can remember when I need it. I titled my script “rice_proxy.sh” but anything memorable to you would be fine. The script takes one argument (“on” or “off”) to either connect or disconnect with the proxy. Stick the script in your bin folder and it will execute simply with rice_proxy.sh on.

The script is reproduced below, or you can grab it from here.

#!/bin/bash
# Andrew J. Moodie
# Feb 2015
# easily remembered interface for vpn connection
if [ "$1" = "on" ]
    then
    sudo /usr/sbin/vpnc RiceVPN.conf
elif [ "$1" = "off" ]
    then
    vpnc­-disconnect
else
    echo " input error -- one argument must be given"
    echo " valid inputs: 'on' or 'off'"
fi

Note that you could further automate this if you wanted to (e.g. auto-enter passwords, connect on startup), but this is a bad idea for both security and for internet speed.