James A. Rising

Entries categorized as ‘Research’

Scripts for Twitter Data

April 22, 2015 · Leave a Comment

Twitter data– the endless stream of tweets, the user network, and the rise and fall of hashtags– offers a flood of insight into the minute-by-minute state of society. Or at least of one self-selecting part of it. A lot of people want to use it for research, and it turns out to be pretty easy to do so.

You can either purchase twitter data or collect it in real time. Purchased data comes organized and extends back historically, but it contains essentially nothing you couldn't collect yourself by monitoring twitter in real time. I've used GNIP, where the going rate was about $500 per million tweets in 2013.

There are two main ways to collect data directly from twitter: “queries” and the “stream”. A query returns up to 1000 tweets at the moment you make it– the most recent tweets that match your search criteria. The stream delivers a fraction of a percent of all tweets continuously, filtered by whatever criteria you give it, and that adds up very quickly.

Scripts for doing these two options are below, but you need to decide on the search/streaming criteria. Typically, these are search terms and geographical constraints. See Twitter’s API documentation to decide on your search options.

Twitter uses an authentication system to identify both the individual collecting the data and the tool helping them do it. It is easy to register a new tool, whereby you pretend that you’re a startup with a great new app. Here are the steps:

  1. Install python’s twitter package, using “easy_install twitter” or “pip install twitter”.
  2. Create an app at https://apps.twitter.com/. Leave the callback URL blank, but fill in the rest.
  3. Set the CONSUMER_KEY and CONSUMER_SECRET in the code below to the values you get on the keys and access tokens tab of your app.
  4. Fill in the name of the application.
  5. Fill in any search terms or structured searches you like.
  6. If you’re using the downloaded scripts, which output data to a CSV file, change the output path (where it says “twitter/us_”) to a directory on your machine.
  7. Run the script from your computer’s terminal (e.g., python search.py)
  8. The script will pop up a browser for you to log into twitter and accept permissions from your app.
  9. Get data.

Here is what a simple script looks like:

import os, twitter

APP_NAME = "Your app name"
CONSUMER_KEY = 'Your consumer key'
CONSUMER_SECRET = 'Your consumer secret'

# Do we already have a token saved?
MY_TWITTER_CREDS = os.path.expanduser('~/.class_credentials')
if not os.path.exists(MY_TWITTER_CREDS):
    # This will ask you to accept the permissions and save the token
    twitter.oauth_dance(APP_NAME, CONSUMER_KEY, CONSUMER_SECRET,
                        MY_TWITTER_CREDS)

# Read the token
oauth_token, oauth_secret = twitter.read_token_file(MY_TWITTER_CREDS)

# Open up an API object, with the OAuth token
api = twitter.Twitter(api_version="1.1",
                      auth=twitter.OAuth(oauth_token, oauth_secret,
                                         CONSUMER_KEY, CONSUMER_SECRET))

# Perform our query
tweets = api.search.tweets(q="risky business")

# Print the results
for tweet in tweets['statuses']:
    if 'text' not in tweet:
        continue

    print tweet
    break

For automating twitter collection, I’ve put together scripts for queries (search.py), streaming (filter.py), and bash scripts that run them repeatedly (repsearch.sh and repfilter.sh). Download the scripts.

To use the repetition scripts, first make them executable by running "chmod a+x repsearch.sh repfilter.sh". Then run them by typing ./repfilter.sh or ./repsearch.sh. Note that these will create many, many files over time, which you’ll have to merge together.
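For reference, here is a minimal sketch of what the streaming side looks like with the same python twitter package. This is not the filter.py from the download– just a sketch of the same idea, following the python 2 style of the script above. The track term, output filename, and CSV fields are placeholders; fill in your own credentials as before.

import csv, os, twitter

APP_NAME = "Your app name"
CONSUMER_KEY = 'Your consumer key'
CONSUMER_SECRET = 'Your consumer secret'

# Reuse (or create) the same saved OAuth token as the search script
MY_TWITTER_CREDS = os.path.expanduser('~/.class_credentials')
if not os.path.exists(MY_TWITTER_CREDS):
    twitter.oauth_dance(APP_NAME, CONSUMER_KEY, CONSUMER_SECRET,
                        MY_TWITTER_CREDS)
oauth_token, oauth_secret = twitter.read_token_file(MY_TWITTER_CREDS)

# Open a streaming connection instead of the REST API
stream = twitter.TwitterStream(auth=twitter.OAuth(oauth_token, oauth_secret,
                                                  CONSUMER_KEY, CONSUMER_SECRET))

# Filter the stream; a locations= argument would add a geographic bounding box
iterator = stream.statuses.filter(track="your search terms")

with open("tweets.csv", 'w') as fp:
    writer = csv.writer(fp)
    writer.writerow(['created_at', 'user', 'text'])
    for tweet in iterator:
        if 'text' not in tweet:
            continue  # skip keep-alives, deletions, and limit notices
        writer.writerow([tweet['created_at'], tweet['user']['screen_name'],
                         tweet['text'].encode('utf-8')])  # encode for python 2's csv writer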

Categories: Research · Software

US Water Network

April 20, 2015 · Leave a Comment

The America’s Water project, coordinated at Columbia’s Water Center by Upmanu Lall, is trying to understand the US water system as an integrated whole, and to project how that system will evolve over the coming decades. Doing so will require a comprehensive model, incorporating agriculture, energy, cities, policy, and more.

We are just beginning to lay the foundation for that model. A first step is to create a network of links between station gauges around the US, representing upstream and downstream flows and counties served. The ultimate form of that model will rely on physical flow data, but I created a first pass using simple rules:

  1. Every gauge can only be connected to one downstream gauge (though it may have several upstream gauges).
  2. Upstream gauges must be at a higher elevation than downstream gauges.
  3. Upstream gauges must be fed by a smaller drainage basin than downstream gauges.
  4. Of the candidate gauges that satisfy the constraints above, the chosen downstream gauge is the one with the shortest distance and the most “plausible” streamflow.

The full description is available on Overleaf. I’ve applied the algorithm to the GAGES II database from the USGS, which includes all station gauges with at least 20 years of data.
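To make the rules concrete, here is a toy sketch of the matching step in python. It is not the actual algorithm– the Overleaf write-up has the real criteria, and the distance and plausibility functions here are stand-ins– but it shows how rules 1 through 4 fit together.

def pick_downstream(gauge, gauges, distance, plausibility):
    """Choose at most one downstream gauge for `gauge` (rule 1).

    `gauges` is a list of dicts with 'elevation' and 'drainage' keys;
    `distance` and `plausibility` are caller-supplied stand-in functions."""
    # Rules 2 and 3: a downstream gauge is lower and drains a larger basin
    candidates = [other for other in gauges
                  if other is not gauge
                  and other['elevation'] < gauge['elevation']
                  and other['drainage'] > gauge['drainage']]
    if not candidates:
        return None  # terminal gauge (e.g., at the coast)

    # Rule 4: prefer nearby gauges with plausible streamflow; this particular
    # scoring is only illustrative, not the one used in the write-up
    return max(candidates,
               key=lambda other: plausibility(gauge, other)
                                 - distance(gauge, other))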

[Figure: map of the US gauge network]

Every red dot is a gauge, black lines are upstream-downstream connections between gauges, and the blue and green lines connect counties to gauges by rules similar to the ones above (green edges where the link is forced to be longer than 100 km).

This kind of network opens the door for a lot of interesting analyses. For example, if agricultural withdrawals increase in the midwest, how much less water will be available downstream? We’re working now to construct a full optimization model that accounts for upstream dependencies.

Another simple question is, how much of the demand in each county is satisfied by the flows available to it? Here are the results: many cities show up in sharp red, showing that their demands exceed the available surface water by a factor of 10 or more.

[Figure: ratio of county water demand to available surface water]

Categories: Research

Negative result: country-wide growing degree-days

December 15, 2014 · Leave a Comment

I put this here as a warning. While growing degree-days (GDD) are well-known as an effective model for predicting yields, they don’t perform so hot at the country scale.

I used mean-temperature GDDs between 8 and 24 degrees C, estimated at many locations from station data and then aggregated to a production-weighted average within each country. Here are the results:

Statistical models

                Barley     Maize      Millet     Rice       Sorghum    Wheat
GDD / 1000      -0.03       0.01      -0.07**     0.08       0.04*     -0.08***
                (0.01)     (0.01)     (0.03)     (0.06)     (0.02)     (0.02)
Precip. (m)      0.09       0.11***    0.12*      0.02       0.14***   -0.04
                (0.05)     (0.03)     (0.05)     (0.03)     (0.04)     (0.04)
Country Cubic    Y          Y          Y          Y          Y          Y
R2               0.95       0.97       0.91       0.97       0.92       0.96
Adj. R2          0.94       0.96       0.90       0.97       0.91       0.95
Num. obs.        1639       3595       1516       1721       2300       1791

***p < 0.001, **p < 0.01, *p < 0.05

As you can see, for most crops these GDDs aren’t even significant, and the coefficients are as frequently negative as positive. This defies a century of agricultural research, but the same data at a finer spatial scale seems to work just fine.
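For concreteness, here is roughly what that country-level specification looks like, sketched in python with statsmodels. The synthetic data, the log-yield outcome, and the exact form of the country-specific cubic time trend are my assumptions for illustration (the “Country Cubic” row in the table), not the actual analysis, which may well have been run in R.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic placeholder panel (one row per country-year), just to run the sketch
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'country': np.repeat(['A', 'B', 'C'], 30),
    'year': np.tile(np.arange(1980, 2010), 3),
    'gdd': rng.normal(2.0, 0.3, 90),      # GDD / 1000
    'precip': rng.normal(0.8, 0.2, 90)})  # meters
df['log_yield'] = 0.5 + 0.05*df['gdd'] + 0.1*df['precip'] + rng.normal(0, 0.1, 90)
df['trend'] = df['year'] - 1995

# GDD and precipitation effects, with country-specific cubic time trends
model = smf.ols("log_yield ~ gdd + precip"
                " + C(country)*(trend + I(trend**2) + I(trend**3))",
                data=df).fit()
print(model.params[['gdd', 'precip']])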

Categories: Research

Growing Degree-Day Calculations

December 11, 2014 · 2 Comments

Schlenker and Roberts (2009) use daily minimum and maximum temperatures to calculate growing degrees, rather than daily mean temperatures. This is important when the effect of extreme temperatures is an issue, since these often will not show up in mean temperatures.

Growing degree days form a useful model of crop productivity. DMAS has examples of these for maize, soybeans, and cotton.

To do this, they use a sinusoidal approximation, integrating the area of a curve through the minimum and maximum temperatures:
[Figure: sinusoidal temperature curve through the daily minimum and maximum, with degree-days as the area above a threshold]
(adapted from here— but don’t use their calculations!)

The calculations aren’t very difficult, but they require some careful math. I needed to write them in python and translate them to R, so I’m providing them here for anyone’s benefit.

import numpy as np
import warnings

# arcsin below produces NaNs (and RuntimeWarnings) for days entirely above
# or below the threshold; those cases are overwritten explicitly, so silence them
warnings.simplefilter("ignore", RuntimeWarning)

def above_threshold(mins, maxs, threshold):
    """Use a sinusoidal approximation to estimate the number of Growing
Degree-Days above a given threshold, using daily minimum and maximum
temperatures.

mins and maxs are numpy arrays; threshold is in the same units."""

    # Determine crossing points, as a fraction of the day
    plus_over_2 = (mins + maxs)/2
    minus_over_2 = (maxs - mins)/2
    two_pi = 2*np.pi
    # d0s is the times of crossing above; d1s is when cross below
    d0s = np.arcsin((threshold - plus_over_2) / minus_over_2) / two_pi
    d1s = .5 - d0s

    # If always above or below threshold, set crossings accordingly
    aboves = mins >= threshold
    belows = maxs <= threshold

    d0s[aboves] = 0
    d1s[aboves] = 1
    d0s[belows] = 0
    d1s[belows] = 0

    # Calculate integral
    F1s = -minus_over_2 * np.cos(2*np.pi*d1s) / two_pi + plus_over_2 * d1s
    F0s = -minus_over_2 * np.cos(2*np.pi*d0s) / two_pi + plus_over_2 * d0s
    return np.sum(F1s - F0s - threshold * (d1s - d0s))

def get_gddkdd(mins, maxs, gdd_start, kdd_start):
    """Get the Growing Degree-Days, as degree-days between gdd_start and
kdd_start, and Killing Degree-Days, as the degree-days above
kdd_start.

mins and maxs are numpy arrays; gdd_start and kdd_start are in the same units."""

    dd_lowup = above_threshold(mins, maxs, gdd_start)
    dd_above = above_threshold(mins, maxs, kdd_start)
    dd_lower = dd_lowup - dd_above

    return (dd_lower, dd_above)

Download the code for R or python.
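As a quick sanity check, here is how the functions above can be called on a week of made-up temperatures (the 8 C and 29 C thresholds are only illustrative):

import numpy as np

# Hypothetical daily minimum and maximum temperatures, in degrees C
mins = np.array([12., 14., 15., 11., 18., 20., 16.])
maxs = np.array([24., 27., 30., 22., 33., 35., 28.])

# Degree-days above 8 C, using the sinusoidal approximation
print(above_threshold(mins, maxs, 8.))

# GDDs between 8 and 29 C, and KDDs above 29 C
print(get_gddkdd(mins, maxs, 8., 29.))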

Categories: Research · Software

The Distribution of Possible Climates

October 21, 2014 · Leave a Comment

There’s a lot we don’t know about future climates. Despite continuous improvements, our models still do notoriously poorly with some fairly fundamental dynamics: ocean energy transport, the Indian monsoon, and even just reproducing past climate. Worse, we only have the computing resources to run each model a handful of times. We haven’t even estimated the distribution of possible temperatures that a single model predicts.

Instead, we treat the range of models as a kind of distribution over beliefs. This is useful, but a poor approximation to an actual probability distribution. For one, most of the models are incestuously related. For another, climate is chaotic no matter how you model it, so changing the inputs within measurement error can produce a whole range of future temperatures.

This is all motivation for a big innovation from the Risky Business report, on the part of Bob Kopp and DJ Rasmussen.

First, they collected the full range of 44 CMIP5 models (downscaled and adjusted to predict every US county). They extracted each one’s global 2080-2099 temperature prediction.

Then they ran a *hemisphere*-scale model (MAGICC), for which it’s possible to do enough runs to produce a whole distribution of future temperatures.

Here’s the result: a matching between models and the distribution of end-of-century global temperatures.

[Figure: maps of model output arranged along the MAGICC distribution of end-of-century warming]

[From American Climate Prospectus, Technical Appendix I: Physical Climate Projections]

The opaque maps are each the output of a different GCM. The ghosted maps aren’t– at least, not alone. In fact, every ghosted map represents a portion of the probability distribution which is currently missing from our models.

To fix this, they made “surrogate” models, by decomposing existing GCM outputs into “forced” seasonal patterns which scale with temperature, and unforced variability. Then they scaled up the forced pattern, added back in the variability, and bingo! A new surrogate model.
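In sketch form, that pattern-scaling step looks something like the code below. This is a toy version with made-up numbers and array shapes, not the actual procedure– the decomposition in the Prospectus appendix handles seasons, downscaling, and variability far more carefully.

import numpy as np

def make_surrogate(gcm_local, gcm_global_dT, target_global_dT):
    """Toy pattern scaling: build a surrogate projection for a warming level
    that the GCM ensemble does not cover.

    gcm_local: (years x locations) array of local temperature changes
    gcm_global_dT: the GCM's global mean warming over the same period
    target_global_dT: the global warming level we want a surrogate for"""
    # "Forced" pattern: average local response per degree of global warming
    forced_per_degree = gcm_local.mean(axis=0) / gcm_global_dT
    # Unforced variability: what is left after removing the forced response
    variability = gcm_local - gcm_local.mean(axis=0)
    # Scale the forced pattern to the target warming; add the variability back
    return target_global_dT * forced_per_degree + variability

# Made-up example: 20 years of local warming at 3 locations, from a 2.5 C world
rng = np.random.default_rng(1)
gcm_local = 3.0 + rng.normal(0., 0.5, (20, 3))
surrogate = make_surrogate(gcm_local, gcm_global_dT=2.5, target_global_dT=5.0)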

Surrogates can’t tell us the true horrors of a high-temperature world, but without them we’d be navigating without a tail.

Categories: Research

Tweets and Emotions

October 10, 2014 · Leave a Comment

This animation is dedicated to my advisor Upmanu Lall, who’s been so supportive of all my many projects, including this one! Prabhat Barnwal and I have been studying the effects of Hurricane Sandy on the New York area through the lens of twitter emotions, ever since we propitiously began collecting tweets a week before the storm emerged (see the working paper).

The animation shows every geolocated tweet between October 20 and November 5, across much of the Northeastern seaboard– about 2 million tweets in total.

Each tweet is colored based on our term-usage analysis of its emotional content. The hue reflects happiness, varying from green (happy) to red (sad); greyer colors reflect a lack of authority; and darker colors reflect a lack of engagement. The hurricane, shown as a red circle, jitters through the bottom-left of the screen on the night of Oct. 29.

The first thing to notice is how much more tweeting activity there is near the hurricane and the election. Look at the difference between seconds 22 (midnight on Oct. 25) and 40 (midnight on Oct. 30).

The night of the hurricane, the tweets edge toward pastel as people get excited. The green glow near NYC that finishes each night becomes brighter, but the next morning the whole region turns red as people survey the disaster.

Categories: Data · Research

Simple Robust Standard Errors Library for R

September 2, 2014 · Leave a Comment

R doesn’t have a built-in method for calculating heteroskedasticity-robust or clustered standard errors. Since this is a common need for us, I’ve put together a little library to provide these functions.

Download the library.

The file contains three functions:

  • get.coef.robust: Calculate heteroskedasticity-robust standard errors
  • get.coef.clust: Calculate clustered standard errors
  • get.conf.clust: Calculate confidence intervals for clustered standard errors

The arguments for the functions are below, and then an example is at the end.

get.coef.robust(mod, var=c(), estimator="HC")
Calculate heteroskedasticity-robust standard errors

mod: the result of a call to lm()
     e.g., lm(satell ~ width + factor(color))
var: a list of indexes to extract only certain coefficients (default: all)
     e.g., 1:2 [to drop FE from above]
estimator: an estimator type passed to vcovHC (default: White's estimator)
Returns an object of type coeftest, with estimate, std. err, t and p values

get.coef.clust(mod, cluster, var=c())
Calculate clustered standard errors

mod: the result of a call to lm()
     e.g., lm(satell ~ width + factor(color))
cluster: a cluster variable, with a length equal to the observation count
     e.g., color
var: a list of indexes to extract only certain coefficients (default: all)
     e.g., 1:2 [to drop FE from above]
Returns an object of type coeftest, with estimate, std. err, t and p values

get.conf.clust(mod, cluster, xx, alpha, var=c())
Calculate confidence intervals for clustered standard errors

mod: the result of a call to lm()
     e.g., lm(satell ~ width + factor(color))
cluster: a cluster variable, with a length equal to the observation count
     e.g., color
xx: new values for each of the variables (only needs to be as long as 'var')
     e.g., seq(0, 50, length.out=100)
alpha: the level of confidence, as an error rate
     e.g., .05 [for 95% confidence intervals]
var: a list of indexes to extract only certain coefficients (default: all)
     e.g., 1:2 [to drop FE from above]
Returns a data.frame of yhat, lo, and hi

Example in Stata and R

The following example compares the results from Stata and R for an example from section 3.3.2 of “An Introduction to Categorical Data Analysis” by Alan Agresti. The data describe 173 female horseshoe crabs and the number of male “satellites” that live near each of them.

Download the data as a CSV file. The columns are each crab’s color (2-5), spine condition (1-3), carapace width (cm), number of satellite crabs, and weight (kg).

The code below fits a simple model with crab-color fixed effects, relating width (which is thought to be an explanatory variable) to the number of satellites. The first model is a straight OLS regression; then we add robust errors, and finally errors clustered by crab color.

The analysis in Stata.

First, let’s do the analysis in Stata, to make sure that we can reproduce it in R.

import delim data.csv

xi: reg satell width i.color
xi: reg satell width i.color, vce(hc2)
xi: reg satell width i.color, vce(cluster color)

Here are excerpts of the results.

The OLS model:

------------------------------------------------------------------------------
      satell |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       width |   .4633951   .1117381     4.15   0.000     .2428033    .6839868
       _cons |  -8.412887   3.133074    -2.69   0.008    -14.59816   -2.227618
------------------------------------------------------------------------------

The robust errors:

------------------------------------------------------------------------------
             |             Robust HC2
      satell |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       width |   .4633951   .1028225     4.51   0.000     .2604045    .6663856
       _cons |  -8.412887   2.939015    -2.86   0.005    -14.21505   -2.610726
------------------------------------------------------------------------------

Note that we’ve used HC2 robust errors, to provide a direct comparison with the R results.

The clustered errors:

------------------------------------------------------------------------------
             |               Robust
      satell |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       width |   .4633951   .0466564     9.93   0.002     .3149135    .6118767
       _cons |  -8.412887   1.258169    -6.69   0.007    -12.41694   -4.408833
------------------------------------------------------------------------------

The robust and clustered errors are a little tighter for this example.

The analysis in R.

Here’s the equivalent code in R.

source("tools_reg.R")
tbl <- read.csv("data.csv")

mod <- lm(satell ~ width + factor(color), data=tbl)

summary(mod)
get.coef.robust(mod, estimator="HC2")
get.coef.clust(mod, tbl$color)

And excerpts of the results. The OLS model:

               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -8.4129     3.1331  -2.685  0.00798 ** 
width            0.4634     0.1117   4.147 5.33e-05 ***

The coefficient estimates will be the same for all three models. Stata gives us -8.412887 + .4633951 w and R gives us -8.4129 + 0.4634 w. These are the same up to reporting precision, and the same holds for the standard errors.

The robust errors:

(Intercept)    -8.41289    2.93902 -2.8625  0.004739 ** 
width           0.46340    0.10282  4.5067 1.229e-05 ***

Again, we use HC2 robust errors. Stata provides 2.939015 and .1028225 for the errors, matching these exactly.

The clustered errors:

                Estimate Std. Error  t value  Pr(>|t|)    
(Intercept)    -8.412887   1.258169  -6.6866 3.235e-10 ***
width           0.463395   0.046656   9.9321 < 2.2e-16 ***

Stata gives the exact same values as R.

Categories: Research

Risky Business Report released today!

June 25, 2014 · Leave a Comment

Sol, Amir, and I have been slaving away over a report on the business case for fighting climate change.  And it was released this morning!  The media outlets give a sense of the highlights:

Forbes:
Today’s report from Risky Business – the project chaired by Steyer, former U.S. Treasury Secretary Hank Paulson, and former NYC Mayor Michael Bloomberg – puts actual numbers on the financial risk the United States faces from unmitigated climate change.
New York Times:
[Quotes our guy:] “the most detailed modeling ever done on the impact of climate change on specific sectors of the U.S. economy.”

Huffington Post:
Parts Of America Will Be ‘Unsuited For Outdoor Activity’ Thanks To Climate Change, Report Finds

Financial Times:
For example, by the last two decades of the century, if greenhouse gas emissions carry on rising unchecked, the net number of heat and cold-related deaths in the US is forecast as likely to be 0.9 per cent to 18.8 per cent higher. But the analysis also shows a one in 20 chance that the number of deaths will rise more than 32.56 per cent, and another one in 20 chance that it will fall by more than 7.77 per cent.
Twitter:

#RiskyBusiness: By end of the century, OR, WA & ID could have more days > 95°F/yr than there are currently in Texas | http://riskybusiness.org/uploads/files/RiskyBusiness_PrintedReport_FINAL_WEB_OPTIMIZED.pdf …

2050: $66b-$106b of US coastal property likely under water, $238b-$507b worth by 2100 #ClimateChange #riskybusiness http://bit.ly/1sANaJj

Also Huff Po:
Higher temperatures will reduce Midwest crop yields by 19 percent by midcentury and by 63 percent by the end of the century.

The region, which has averaged eight days of temperatures over 95 degrees each year, will likely see an additional 17 to 52 of these days by midcentury and up to four months of them by the end of the century. This could lead to 11,000 to 36,000 additional deaths per year.

There’s also some over-the-top praise from the choir– Amir can send you some gems from Capitalists Take on Climate Change.

Take a look!  Here’s the media report:
http://riskybusiness.org/report/overview/executive-summary

And the scientific report (what we actually helped write):
http://rhg.com/reports/climate-prospectus

Categories: Research

R Packages for Bayesian Networks

June 22, 2014 · Leave a Comment

I’m in a group studying Bayesian techniques for environmental research, under Upmanu Lall at the Earth Institute. We recently did a session on the packages available in R for modeling Bayesian Networks. Here’s the spreadsheet that came out of the meeting:

Categories: Research

SSA and NINO

June 14, 2014 · Leave a Comment

NINO 3 is a measure of El Niño/La Niña (ENSO) intensity. It’s often said that ENSO has a period of 3-7 years. Why is it so hard to identify a single period? This is a job for singular spectrum analysis (SSA)!

[Figure: scree plot of the NINO 3 SSA eigenvalues]

The scree plot shows that there are a ton of significant eigenvectors– this is a very complicated signal– but I want to focus on the first four, which are a shade more significant than the rest. They come in two pairs, which for SSA means that they represent two sinusoids. Here are the eigenvectors.

[Figure: the first four SSA eigenvectors]

How well do these two capture the ENSO signal?

[Figure: the NINO 3 series and its reconstruction from the first two SSA pairs]

Not great, but it gets most of the ups and downs. Just not the peaks.

So what are the periods of these two? The first pair has a period of 5.8 years, and the second 3.5 years. Note that neither period is a multiple of the other, so the two frequencies are always drifting in and out of sync. So it’s not so much that ENSO has a clear frequency (it doesn’t), nor that that frequency is 3-7 years (what would that even mean?). It has two main frequencies, like a dial tone.
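For anyone who wants to play with this, here is a bare-bones sketch of the SSA machinery in python. It is not the code behind the figures above– just the core idea: embed the series in a trajectory matrix, decompose it, and reconstruct each component by anti-diagonal averaging. The toy series and window length are placeholders.

import numpy as np

def ssa(series, window):
    """Bare-bones singular spectrum analysis.

    Returns the component variances (squared singular values) and the
    elementary reconstructed components, one per row."""
    series = np.asarray(series, dtype=float)
    N = len(series)
    K = N - window + 1
    # Trajectory (Hankel) matrix: each column is a lagged window of the series
    X = np.column_stack([series[ii:ii + window] for ii in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    components = []
    for k in range(len(s)):
        Xk = s[k] * np.outer(U[:, k], Vt[k])
        # Average along anti-diagonals to recover a length-N component
        flipped = Xk[::-1]
        components.append([flipped.diagonal(offset).mean()
                           for offset in range(-window + 1, K)])
    return s**2, np.array(components)

# Toy stand-in for the NINO 3 series: two sinusoids plus noise, monthly
tt = np.arange(600.)
toy = (np.sin(2*np.pi*tt/(5.8*12)) + 0.7*np.sin(2*np.pi*tt/(3.5*12))
       + 0.3*np.random.randn(600))
variances, components = ssa(toy, window=120)
# Each sinusoid shows up as a pair of components; summing a pair reconstructs it
pair1 = components[0] + components[1]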

Categories: Research