Sunday, December 18, 2016

Holiday Haze

Your dedicated blogger is about to vanish in the holiday haze, returning early in the new year. Meanwhile, all best wishes for the holidays.  If you're at ASSA Chicago, I hope you'll come to the Penn Economics party, Sat. Jan. 7, 6:00-8:00, Sheraton Grand Chicago, Mayfair Room.  Thanks so much for your past, present and future support.

[Photo credit:  Public domain, by Marcus Quigmire, from Florida, USA (Happy Holidays  Uploaded by Princess Mérida) [CC-BY-SA-2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons]

Sunday, December 11, 2016

Varieties of RCT Extensibility

Even internally-valid RCT's have issues. They reveal the treatment effect only for the precise experiment performed and situation studied. Consider, for example, a study of the effects of fertilizer on crop yield, done for region X during a heat wave. Even if internally valid, the estimated treatment effect is that of fertilizer on crop yield in region X during a heat wave. The results do not necessarily generalize -- and in this example surely do not generalize -- to times of "normal" weather, even in region X. And of course, for a variety of reasons, they may not generalize to regions other than X, even in heat waves.

Note the interesting time-series dimension to the failure of external validity (extensibility) in the example above. (The estimate is obtained during this year's heat wave, but next year may be "normal", or "cool". And this despite the lack of any true structural change. But of course there could be true structural change, which would only make matters worse.) This contrasts with the usual cross-sectional focus of extensibility discussions (e.g., we get effect e in region X, but what effect would we get in region Z?).

In essence, we'd like panel data, to account both for cross-section effects and time-series effects, but most RCT's unfortunately have only a single cross section.

Mark Rosenzweig and Chris Udry have a fascinating new paper, "External Validity in a Stochastic World", that grapples with some of the time-series extensibility issues raised above.

Monday, December 5, 2016

Exogenous vs. Endogenous Volatility Dynamics

I always thought putting exogenous volatility dynamics in macro-model shocks was a cop-out.  Somehow it seemed more satisfying for volatility to be determined endogenously, in equilibrium.  Then I came around:  We allow for shocks with exogenous conditional-mean dynamics (e.g., AR(1)), so why shouldn't we allow for shocks with exogenous conditional-volatility dynamics?  Now I might shift back, at least in part, thanks to new work by Sydney Ludvigson, Sai Ma, and Serena Ng, "Uncertainty and Business Cycles: Exogenous Impulse or Endogenous Response?", which attempts to sort things out. The October 2016 version is here.  It turns out that real (macro) volatility appears largely endogenous, whereas nominal (financial market) volatility appears largely exogenous.

Monday, November 28, 2016

Gary Gorton, Harald Uhlig, and the Great Crisis

Gary Gorton has made clear that the financial crisis of 2007 was in essence a traditional banking panic, not unlike those of the nineteenth century.  A key corollary is that the root cause of the Panic of 2007 can't be something relatively new, like "Too Big to Fail".  (See this.)  Lots of people blame residential mortgage-backed securities (RMBS's), but they're also too new.  Interestingly, in new work Juan Ospina and Harald Uhlig examine RMBS's directly.  Sure enough, and contrary to popular impression, they performed quite well through the crisis.

Sunday, November 20, 2016

Dense Data for Long Memory

From the last post, you might think that efficient learning about low-frequency phenomena requires tall data. Certainly efficient estimation of trend, as stressed in the last post, does require tall data. But it turns out that efficient estimation of other aspects of low-frequency dynamics sometimes requires only dense data. In particular, consider a pure long memory, or "fractionally integrated", process, $$(1-L)^d x_t = \epsilon_t$$, 0 < $$d$$ < 1/2. (See, for example, this or this.) In a general $$I(d)$$ process, $$d$$ governs only low-frequency behavior (the rate of decay of long-lag autocorrelations toward zero, or equivalently, the rate of explosion of low-frequency spectra toward infinity), so tall data are needed for efficient estimation of $$d$$. But in a pure long-memory process, one parameter ($$d$$) governs behavior at all frequencies, including arbitrarily low frequencies, due to the self-similarity ("scaling law") of pure long memory. Hence for pure long memory a short but dense sample can be as informative about $$d$$ as a tall sample. (And pure long memory often appears to be a highly-accurate approximation to financial asset return volatilities, as for example in ABDL.)

Monday, November 7, 2016

Big Data for Volatility vs. Trend

Although largely uninformative for some purposes, dense data (high-frequency sampling) are highly informative for others.  The massive example of recent decades is volatility estimation.  The basic insight traces at least to Robert Merton's early work. Roughly put, as we sample returns arbitrarily finely, we can infer underlying volatility (quadratic variation) arbitrarily well.

So, what is it for which dense data are "largely uninformative"?  The massive example of recent decades is long-term trend.  Again roughly put and assuming linearity, long-term trend is effectively a line segment drawn between a sample's first and last observations, so for efficient estimation we need tall data (long calendar span), not dense data.

Assembling everything, for estimating yesterday's stock-market volatility you'd love to have yesterday's 1-minute intra-day returns, but for estimating the expected return on the stock market (the slope of a linear log-price trend) you'd much rather have 100 years of annual returns, despite the fact that a naive count would say that 1 day of 1-minute returns is a much "bigger" sample.

So different aspects of Big Data -- in this case dense vs. tall -- are of different value for different things.  Dense data promote accurate volatility estimation, and tall data promote accurate trend estimation.
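
The dense-vs-tall contrast can be made concrete with two textbook standard-error formulas. The sketch below assumes a log price with constant drift and constant volatility, no microstructure noise, and negligible drift within each sampling interval; the function names and the 20% annual volatility are hypothetical choices made for illustration.

```python
import math

def drift_standard_error(sigma_annual, span_years):
    """Std. error of the estimated annual drift of a log price.

    The estimator is (last log price - first log price) / span, so its
    variance is sigma^2 / span no matter how many points lie in between.
    """
    return sigma_annual / math.sqrt(span_years)

def realized_variance_relative_se(n_returns):
    """Approximate relative std. error of realized variance from n returns."""
    return math.sqrt(2.0 / n_returns)

sigma = 0.20  # hypothetical 20% annual volatility

# Tall: 100 years of annual returns. Dense: one trading day of 390 one-minute returns.
print("drift SE, 100 years:", drift_standard_error(sigma, 100.0))
print("drift SE, 1 day    :", drift_standard_error(sigma, 1.0 / 252.0))
print("RV relative SE, 390 returns:", realized_variance_relative_se(390))
print("RV relative SE, 100 returns:", realized_variance_relative_se(100))
```

Only the calendar span enters the drift formula, and only the number of returns enters the realized-variance formula, which is the post's point in miniature.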

Thursday, November 3, 2016

StatPrize

Check out this new prize, http://statprize.org/ (Thanks, Dave Giles, for informing me via your tweet.) It should be USD 1 Million, ahead of the Nobel, as statistics is a key part (arguably the key part) of the foundation on which every science builds.

And obviously check out David Cox, the first winner. Every time I've given an Oxford econometrics seminar, he has shown up. It's humbling that he evidently thinks he might have something to learn from me. What an amazing scientist, and what an amazing gentleman.

And also obviously, the new StatPrize can't help but remind me of Ted Anderson's recent passing, not to mention the earlier but recent passings, for example, of Herman Wold, Edmond Malinvaud, and Arnold Zellner. Wow -- sometimes the Stockholm gears just grind too slowly. Moving forward, StatPrize will presumably make such econometric recognition failures less likely.

Monday, October 31, 2016

Econometric Analysis of Recurrent Events

Don Harding and Adrian Pagan have a fascinating new book (HP) that just arrived in the snail mail.  Partly HP has a retro feel (think: Bry-Boschan (BB)) and partly it has a futurist feel (think: taking BB to wildly new places).  Notwithstanding the assertion in the conclusion of HP's first chapter (here), I remain of the Diebold-Rudebusch view that Hamilton-style Markov switching remains the most compelling way to think about nonlinear business-cycle events like "expansions" and "recessions" and "peaks" and "troughs".  At the very least, however, HP has significantly heightened my awareness and appreciation of alternative approaches.  Definitely worth a very serious read.

Monday, October 24, 2016

Machine Learning vs. Econometrics, IV

Some of my recent posts on this topic emphasized that (1) machine learning (ML) tends to focus on non-causal prediction, whereas econometrics and statistics (E/S) has both non-causal and causal parts, and (2) E/S tends to be more concerned with probabilistic assessment of forecast uncertainty. Here are some related thoughts.

As for (1), it's wonderful to see the ML and E/S literatures beginning to cross-fertilize, driven in significant part by E/S. Names like Athey, Chernozhukov, and Imbens come immediately to mind. See, for example, the material here under "Econometric Theory and Machine Learning", and here under "Big Data: Post-Selection Inference for Causal Effects" and "Big Data: Prediction Methods".

As for (2) but staying with causal prediction, note that the traditional econometric approach treats causal prediction as an estimation problem (whether by instrumental variables, fully-structural modeling, or whatever...) and focuses not only on point estimates, but also on inference (standard errors, etc.) and hence implicitly on interval prediction of causal effects (by inverting the test statistics).  Similarly, the financial-econometric "event study" approach, which directly compares forecasts of what would have happened in the absence of an intervention to what happened with the intervention, also focuses on inference for the treatment effect, and hence implicitly on interval prediction.

Sunday, October 16, 2016

Machine Learning vs. Econometrics, III

I emphasized here that both machine learning (ML) and econometrics (E) prominently feature prediction, one distinction being that ML tends to focus on non-causal prediction, whereas a significant part of E focuses on causal prediction. So they're both focused on prediction, but there's a non-causal vs. causal distinction.  [Alternatively, as Dean Foster notes, you can think of both ML and E as focused on estimation, but with different estimands.  ML tends to focus on estimating conditional expectations, whereas the causal part of E focuses on estimating partial derivatives.]

In any event, there's another key distinction between much of ML and Econometrics/Statistics (E/S):   E/S tends to be more concerned with probabilistic assessment of uncertainty.  Whereas ML is often satisfied with point forecasts, E/S often wants interval, and ultimately density, forecasts.

There are at least two classes of reasons for the difference.

First, E/S recognizes that uncertainty is often of intrinsic economic interest.  Think market risk, credit risk, counter-party risk, systemic risk, inflation risk, business cycle risk, etc.

Second, E/S is evidently uncomfortable with ML's implicit certainty-equivalence approach of simply plugging point forecasts into decision rules obtained under perfect foresight.  Evidently the linear-quadratic-Gaussian world in which certainty equivalence holds resonates less than completely with E/S types.  That sounds right to me.  [By the way, see my earlier piece on optimal prediction under asymmetric loss.]
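
A small illustration of why plugging in a single "central" point forecast can mislead once loss is not quadratic: under asymmetric linear ("lin-lin") loss, the best point forecast is a quantile of the outcome distribution, not its center, so the whole distribution matters. This is only a sketch; the loss weights, the function names, and the equally likely outcomes 1, ..., 100 are hypothetical.

```python
def linlin_loss(error, a, b):
    """Loss a*|e| when e > 0 (forecast too low), b*|e| when e < 0."""
    return a * error if error > 0 else -b * error

def best_point_forecast(outcomes, a, b):
    """Candidate (taken from the outcomes) minimizing average lin-lin loss."""
    def average_loss(f):
        return sum(linlin_loss(y - f, a, b) for y in outcomes) / len(outcomes)
    return min(sorted(set(outcomes)), key=average_loss)

outcomes = list(range(1, 101))          # hypothetical equally likely outcomes
print(best_point_forecast(outcomes, a=1, b=1))  # symmetric loss
print(best_point_forecast(outcomes, a=3, b=1))  # under-prediction costs 3x more
```

With symmetric weights the search lands at the median; with under-prediction three times as costly it moves up to roughly the 75th percentile, that is, the a/(a+b) quantile.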

Monday, October 10, 2016

Machine Learning vs. Econometrics, II

My last post focused on one key distinction between machine learning (ML) and econometrics (E):   non-causal ML prediction vs. causal E prediction.  I promised later to highlight another, even more important, distinction.  I'll get there in the next post.

But first let me note a key similarity.  ML vs. E in terms of non-causal vs. causal prediction is really only comparing ML to "half" of E (the causal part).  The other part of E (and of course statistics, so let's call it E/S), going back a century or so, focuses on non-causal prediction, just like ML.  The leading example is time-series E/S.  Just take a look at an E/S text like Elliott and Timmermann (contents and first chapter here; index here).  A lot of it looks like parts of ML.  But it's not "E/S people chasing ML ideas"; rather, E/S has been in the game for decades, often well ahead of ML.

For this reason the E/S crowd sometimes wonders whether "ML" and "data science" are just the same old wine in a new bottle.  (The joke goes, Q: What is a "data scientist"?  A: A statistician who lives in San Francisco.)  ML/DataScience is not the same old wine, but it's a blend, and a significant part of the blend is indeed E/S.

To be continued...

Sunday, October 2, 2016

Machine Learning vs. Econometrics, I

[If you're reading this in email, remember to click through on the title to get the math to render.]

Machine learning (ML) is almost always centered on prediction; think "$$\hat{y}$$".   Econometrics (E) is often, but not always, centered on prediction.  Instead it's also often interested in estimation and associated inference; think "$$\hat{\beta}$$".

Or so the story usually goes. But that misses the real distinction. Both ML and E as described above are centered on prediction.  The key difference is that ML focuses on non-causal prediction (if a new person $$i$$ arrives with covariates $$X_i$$, what is my minimum-MSE guess of her $$y_i$$?), whereas the part of econometrics highlighted above focuses on causal prediction (if I intervene and give person $$i$$ a certain treatment, what is my minimum-MSE guess of $$\Delta y_i$$?). It just happens that, assuming linearity, a "minimum-MSE guess of $$\Delta y_i$$" is the same as a "minimum-MSE estimate of $$\beta_i$$".

So there is a ML vs. E distinction here, but it's not "prediction vs. estimation" -- it's all prediction.  Instead, the issue is non-causal prediction vs. causal prediction.

But there's another ML vs. E difference that's even more fundamental.  TO BE CONTINUED...

Monday, September 26, 2016

Fascinating Conference at Chicago

I just returned from the University of Chicago conference, "Machine Learning: What's in it for Economics?"  Lots of cool things percolating.  I'm teaching a Penn Ph.D. course later this fall on aspects of the ML/econometrics interface.  Feeling really charged.

By the way, hadn't yet been to the new Chicago economics "cathedral" (Saieh Hall for Economics) and Becker-Friedman Institute.  Wow.  What an institution, both intellectually and physically.

Tuesday, September 20, 2016

On "Shorter Papers"

Journals should not corral shorter papers into sections like "Shorter Papers".  Doing so sends a subtle (actually unsubtle) message that shorter papers are basically second-class citizens, somehow less good, or less important, or less something -- not just less long -- than longer papers.  If a paper is above the bar, then it's above the bar, and regardless of its length it should then be published simply as a paper, not a "shorter paper", or a "note", or anything else.  Many shorter papers are much more important than the vast majority of longer papers.

Monday, September 12, 2016

Time-Series Econometrics and Climate Change

It's exciting to see time series econometrics contributing to the climate change discussion.

Check out the upcoming CREATES conference, "Econometric Models of Climate Change", here.

Here are a few good examples of recent time-series climate research, in chronological order.  (There are many more.  Look through the reference lists, for example, in the 2016 and 2017 papers below.)

Jim Stock et al. (2009) in Climatic Change.

Pierre Perron et al. (2013) in Nature.

Peter Phillips et al. (2016) in Nature.

Proietti and Hillebrand (2017), forthcoming in Journal of the Royal Statistical Society.

Tuesday, September 6, 2016

Inane Journal "Impact Factors"

Why are journals so obsessed with "impact factors"? (The five-year impact factor is average citations/article in a five-year window.)  They're often calculated to three decimal places, and publishers trumpet victory when they go from (say) 1.225 to 1.311!  It's hard to think of a dumber statistic, or dumber over-interpretation.  Are the numbers after the decimal point anything more than noise, and for that matter, are the numbers before the decimal much more than noise?

Why don't journals instead use the same citation indexes used for individuals? The leading index seems to be the h-index, which is the largest integer h such that an individual has h papers, each cited at least h times. I don't know who cooked up the h-index, and surely it has issues too, but the gurus love it, and in my experience it tells the truth.
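
The h-index definition translates directly into a few lines of code. A minimal sketch, with a hypothetical list of citation counts:

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

example = [25, 8, 5, 3, 3, 1, 0]  # hypothetical citation counts
print(h_index(example))  # 3
```

The same function works for a journal if the list holds the citation counts of its articles in whatever window is chosen.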

Even better, why not stop obsessing over clearly-insufficient statistics of any kind? I propose instead looking at what I'll call a "citation signature plot" (CSP), simply plotting the number of cites for the most-cited paper, the number of cites for the second-most-cited paper, and so on. (Use whatever window(s) you want.) The CSP reveals everything, instantly and visually. How high is the CSP for the top papers? How quickly, and with what pattern, does it approach zero? etc., etc. It's all there.

Google-Scholar CSP's are easy to make for individuals, and they're tremendously informative. They'd be only slightly harder to make for journals. I'd love to see some.
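
For what it's worth, a CSP needs nothing more than a sort. The sketch below builds one from a hypothetical list of citation counts and renders it as a crude text plot; the function names are made up for illustration, and a real version would presumably use a proper plotting library.

```python
def citation_signature(citations):
    """Citation counts sorted from most-cited to least-cited."""
    return sorted(citations, reverse=True)

def text_plot(csp, width=40):
    """Render the CSP as rows of '#' bars, one row per paper."""
    top = max(csp) if csp and max(csp) > 0 else 1
    lines = []
    for rank, cites in enumerate(csp, start=1):
        bar = "#" * round(width * cites / top)
        lines.append(f"{rank:>3} {cites:>5} {bar}")
    return "\n".join(lines)

cites = [3, 120, 0, 45, 8, 45, 1]  # hypothetical counts
print(text_plot(citation_signature(cites)))
```

The h-index is visible in the same picture: it is the last rank at which the citation count is still at least as large as the rank.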

Monday, August 29, 2016

On Credible Cointegration Analyses

I may not know whether some $$I(1)$$ variables are cointegrated, but if they are, I often have a very strong view about the likely number and nature of cointegrating combinations. Single-factor structure is common in many areas of economics and finance, so if cointegration is present in an $$N$$-variable system, for example, a natural benchmark is 1 common trend ($$N-1$$ cointegrating combinations).  And moreover, the natural cointegrating combinations are almost always spreads or ratios (which of course are spreads in logs). For example, log consumption and log income may or may not be cointegrated, but if they are, then the obvious benchmark cointegrating combination is $$(\ln C - \ln Y)$$. Similarly, the obvious benchmark for $$N$$ government bond yields $$y$$ is $$N-1$$ cointegrating combinations, given by term spreads relative to some reference yield; e.g., $$y_2 - y_1$$, $$y_3 - y_1$$, ..., $$y_N - y_1$$.
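
Here is a small simulation of the single-common-trend benchmark, as a sketch only. The data-generating process (one random-walk trend, series-specific constants, white-noise deviations) and all parameter values are assumptions chosen for illustration, not a model of actual yields.

```python
import random
import statistics

random.seed(0)
T, N = 2000, 4  # sample length and number of yields (both hypothetical)

# One common random-walk trend shared by every series.
trend = [0.0]
for _ in range(T - 1):
    trend.append(trend[-1] + random.gauss(0.0, 1.0))

# Each yield = common trend + its own constant + stationary noise.
yields = [
    [f + 0.5 * i + random.gauss(0.0, 0.1) for f in trend]
    for i in range(N)
]

# The N-1 benchmark cointegrating combinations: spreads over the first yield.
spreads = [
    [y_i - y_1 for y_i, y_1 in zip(yields[i], yields[0])]
    for i in range(1, N)
]

print("variance of y_1 level :", statistics.pvariance(yields[0]))
for i, s in enumerate(spreads, start=2):
    print(f"variance of y_{i} - y_1:", statistics.pvariance(s))
```

The levels wander with the common trend, while each of the $$N-1$$ spreads has a small, stable variance because the trend cancels.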

There's not much literature exploring this perspective. (One notable exception is Horvath and Watson, "Testing for Cointegration When Some of the Cointegrating Vectors are Prespecified", Econometric Theory, 11, 952-984.) We need more.

Sunday, August 21, 2016

More on Big Data and Mixed Frequencies

I recently blogged on Big Data and mixed-frequency data, arguing that Big Data (wide data, in particular) leads naturally to mixed-frequency data.  (See here for the tall data / wide data / dense data taxonomy.)  The obvious just occurred to me, namely that it's also true in the other direction. That is, mixed-frequency situations also lead naturally to Big Data, and with a subtle twist: the nature of the Big Data may be dense rather than wide. The theoretically-pure way to set things up is as a state-space system laid out at the highest observed frequency, appropriately treating most of the lower-frequency data as missing, as in ADS.  By construction, the system is dense if any of the series are dense, as the system is laid out at the highest frequency.
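
To fix ideas, here is what "laid out at the highest observed frequency, treating lower-frequency data as missing" looks like for the data alone. The sketch uses a monthly and a quarterly series with made-up values and a hypothetical function name; it only builds the measurement grid, and the state-space filtering that would consume it is not shown.

```python
def stack_on_monthly_grid(monthly, quarterly):
    """Return one (monthly, quarterly) pair per month.

    The quarterly value is placed in the last month of its quarter and is
    None (missing) in the other two months.
    """
    rows = []
    for m, value in enumerate(monthly):
        is_quarter_end = (m + 1) % 3 == 0
        q_index = m // 3
        q_value = quarterly[q_index] if is_quarter_end and q_index < len(quarterly) else None
        rows.append((value, q_value))
    return rows

monthly = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]   # hypothetical monthly indicator
quarterly = [10.0, 11.0]                   # hypothetical quarterly series
for row in stack_on_monthly_grid(monthly, quarterly):
    print(row)
```

Two thirds of the quarterly column is missing by construction, and the grid has as many rows as the densest series.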

Wednesday, August 17, 2016

On the Evils of Hodrick-Prescott Detrending

[If you're reading this in email, remember to click through on the title to get the math to render.]

Jim Hamilton has a very cool new paper, "Why You Should Never Use the Hodrick-Prescott (HP) Filter".

Of course we've known of the pitfalls of HP ever since Cogley and Nason (1995) brought them into razor-sharp focus decades ago.  The title of the even-earlier Nelson and Kang (1981) classic, "Spurious Periodicity in Inappropriately Detrended Time Series", says it all.  Nelson-Kang made the spurious-periodicity case against polynomial detrending of I(1) series.  Hamilton makes the spurious-periodicity case against HP detrending of many types of series, including I(1).  (Or, more precisely, Hamilton adds even more weight to the Cogley-Nason spurious-periodicity case against HP.)

But the main contribution of Hamilton's paper is constructive, not destructive.  It provides a superior detrending method, based only on a simple linear projection.

Here's a way to understand what "Hamilton detrending" does and why it works, based on a nice connection to Beveridge-Nelson (1981) detrending not noticed in Hamilton's paper.

First consider Beveridge-Nelson (BN) trend for I(1) series.  BN trend is just a very long-run forecast based on an infinite past.  [You want a very long-run forecast in the BN environment because the stationary cycle washes out from a very long-run forecast, leaving just the forecast of the underlying random-walk stochastic trend, which is also the current value of the trend since it's a random walk.  So the BN trend at any time is just a very long-run forecast made at that time.]  Hence BN trend is implicitly based on the projection: $$y_t ~ \rightarrow ~ c, ~ y_{t-h}, ~...,~ y_{t-h-p}$$, for $$h \rightarrow \infty$$ and $$p \rightarrow \infty$$.

Now consider Hamilton trend.  It is explicitly based on the projection: $$y_t ~ \rightarrow ~ c, ~ y_{t-h}, ~...,~ y_{t-h-p}$$, for $$p = 3$$.  (Hamilton also uses a benchmark of  $$h = 8$$.)

So BN and Hamilton are both "linear projection trends", differing only in choice of $$h$$ and $$p$$!  BN takes an infinite forecast horizon and projects on an infinite past.  Hamilton takes a medium forecast horizon and projects on just the recent past.

Much of Hamilton's paper is devoted to defending the choice of $$p = 3$$, which turns out to perform well for a wide range of data-generating processes (not just I(1)).  The BN choice of $$h = p = \infty$$, in contrast, although optimal for I(1) series, is less robust to other DGP's.  (And of course estimation of the BN projection as written above is infeasible, which people avoid in practice by assuming low-ordered ARIMA structure.)
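
The projection is easy to compute. The following is a minimal sketch, not Hamilton's own code: it regresses $$y_t$$ on a constant and $$y_{t-h}, ..., y_{t-h-p}$$ by ordinary least squares, using a small hand-written linear solver so that it runs without external libraries (in practice one would use a standard least-squares routine). The simulated random walk with drift is a hypothetical input.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def projection_detrend(y, h=8, p=3):
    """Regress y_t on a constant and y_{t-h}, ..., y_{t-h-p}.

    Returns (trend, cycle): fitted values and residuals for t = h+p, ..., T-1.
    """
    rows, targets = [], []
    for t in range(h + p, len(y)):
        rows.append([1.0] + [y[t - h - j] for j in range(p + 1)])
        targets.append(y[t])
    k = p + 2  # constant plus p+1 lagged values
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    trend = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    cycle = [v - f for v, f in zip(targets, trend)]
    return trend, cycle

# Hypothetical input: a random walk with drift.
random.seed(1)
y = [0.0]
for _ in range(299):
    y.append(y[-1] + 0.1 + random.gauss(0.0, 1.0))

trend, cycle = projection_detrend(y)
print(len(cycle), sum(cycle) / len(cycle))
```

The fitted values play the role of trend and the residuals the role of cycle; the first $$h + p$$ observations are lost to the lags.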

Monday, August 15, 2016

More on Nonlinear Forecasting Over the Cycle

Related to my last post, here's a new paper that just arrived from Rachidi Kotchoni and Dalibor Stevanovic, "Forecasting U.S. Recessions and Economic Activity". It's not non-parametric, but it is non-linear. As Dalibor put it, "The method is very simple: predict turning points and recession probabilities in the first step, and then augment a direct AR model with the forecasted probability." Kotchoni-Stevanovic and Guerron-Quintana-Zhong are usefully read together.

Sunday, August 14, 2016

Nearest-Neighbor Forecasting in Times of Crisis

Nonparametric K-nearest-neighbor forecasting remains natural and obvious and potentially very useful, as it has been since its inception long ago.

[Most crudely: Find the K-history closest to the present K-history, see what followed it, and use that as a forecast. Slightly less crudely: Find the N K-histories closest to the present K-history, see what followed each of them, and take an average. There are many obvious additional refinements.]
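
The "slightly less crude" version can be sketched in a few lines. Euclidean distance between K-histories and a simple equal-weight average of successors are assumptions made here for illustration; the function and parameter names are hypothetical.

```python
import math

def knn_forecast(series, k, n_neighbors=1):
    """Forecast the next value from the n past K-histories closest to the latest one."""
    present = series[-k:]
    candidates = []
    # Each candidate K-history must have an observed successor,
    # which also excludes the present K-history itself.
    for start in range(len(series) - k):
        history = series[start:start + k]
        successor = series[start + k]
        candidates.append((math.dist(history, present), successor))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:n_neighbors]
    return sum(successor for _, successor in nearest) / len(nearest)

cycle = [0, 1, 2, 3] * 5  # hypothetical perfectly periodic series
print(knn_forecast(cycle, k=2, n_neighbors=3))  # 0.0
```

On the periodic example the latest history (2, 3) has exact matches earlier in the sample, each followed by 0, so the forecast is 0.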

Overall, nearest-neighbor forecasting remains curiously under-utilized in dynamic econometrics. Maybe that will change. In an interesting recent development, for example, new Federal Reserve System research by Pablo Guerron-Quintana and Molin Zhong puts nearest-neighbor methods to good use for forecasting in times of crisis.

Monday, August 8, 2016

NSF Grants vs. Improved Data

Lots of people are talking about the Cowen-Tabarrok Journal of Economic Perspectives piece, "A Skeptical View of the National Science Foundation’s Role in Economic Research". See, for example, John Cochrane's insightful "A Look in the Mirror".

A look in the mirror indeed. I was a 25-year ward of the NSF, but for the past several years I've been on the run. I bolted in part because the economics NSF reward-to-effort ratio has fallen dramatically for senior researchers, and in part because, conditional on the ongoing existence of NSF grants, I feel strongly that NSF money and "signaling" are better allocated to young assistant and associate professors, for whom the signaling value from NSF support is much higher.

Cowen-Tabarrok make some very good points. But I can see both sides of many of their issues and sub-issues, so I'm not taking sides. Instead let me make just one observation (and I'm hardly the first).

If NSF funds were to be re-allocated, improved data collection and dissemination looks attractive. I'm not talking about funding cute RCTs-of-the-month. Rather, I'm talking about funding increased and ongoing commitment to improving our fundamental price and quantity data (i.e., the national accounts and related statistics). They desperately need to be brought into the new millennium. Just look, for example, at the wealth of issues raised in recent decades by the.

Ironically, it's hard to make a formal case (at least for data dissemination as opposed to creation), as Chris Sims has emphasized with typical brilliance. His "The Futility of Cost-Benefit Analysis for Data Dissemination" explains "why the apparently reasonable idea of applying cost-benefit analysis to government programs founders when applied to data dissemination programs." So who knows how I came to feel that NSF funds might usefully be re-allocated to data collection and dissemination. But so be it.

Monday, August 1, 2016

On the Superiority of Observed Information

Earlier I claimed that "Efron-Hinkley holds up -- observed information dominates estimated expected information for finite-sample MLE inference." Several of you have asked for elaboration.

The earlier post grew from a 6 AM Hong Kong breakfast conversation with Per Mykland (with both of us suffering from 12-hour jet lag), so I wanted to get some detail from him before elaborating, to avoid erroneous recollections. But it's basically as I recalled -- mostly coming from the good large-deviation properties of the likelihood ratio. The following is adapted from that conversation and a subsequent email exchange. (Any errors or omissions are entirely mine.)

There was quite a bit of work in the 1980s and 1990s. It was kicked off by Efron and Hinkley (1978). The main message is in their plot on p. 460, suggesting that the observed info was a more accurate estimator. Research gradually focused on the behavior of the likelihood ratio ($$LR$$) statistic and its signed square root $$R=\mathrm{sgn}(\hat{\theta} - \theta ) \sqrt{LR}$$, which was seen to have good conditionality properties, local sufficiency, and most crucially, good large-deviation properties.  (For details see Mykland (1999), Mykland (2001), and the references there.)
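
To make the signed square root concrete, here is a sketch that computes $$R$$ in about the simplest possible setting, i.i.d. exponential data with unknown rate. The model choice, the function name, and the data are assumptions for illustration only; nothing here reproduces the Efron-Hinkley or Mykland calculations.

```python
import math

def signed_root_lr(xs, rate0):
    """R = sgn(rate_hat - rate0) * sqrt(LR) for i.i.d. exponential data."""
    n, total = len(xs), sum(xs)
    rate_hat = n / total                      # MLE of the rate
    loglik = lambda rate: n * math.log(rate) - rate * total
    # LR = 2 * [loglik(MLE) - loglik(hypothesized value)], floored at 0
    # to guard against tiny negative values from rounding.
    lr = max(2.0 * (loglik(rate_hat) - loglik(rate0)), 0.0)
    sign = (rate_hat > rate0) - (rate_hat < rate0)
    return sign * math.sqrt(lr)

xs = [0.5, 1.5, 1.0, 1.0]  # hypothetical sample; the MLE of the rate is 1.0
for rate0 in (0.5, 1.0, 2.0):
    print(rate0, signed_root_lr(xs, rate0))
```

$$R$$ is zero when the hypothesized value equals the MLE, positive when the MLE lies above it, and negative when the MLE lies below it.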

The large-deviation situation is as follows.  Most statistics have cumulant behavior as in Mykland (1999) eq. (2.1).  In contrast, $$R$$ has cumulant behavior as in Mykland (1999) eq. (2.2), which yields the large deviation properties of Mykland (1999) Theorem 1. (Also see Theorems 1 and 2 of Mykland (2001).)

Tuesday, July 26, 2016

An Important Example of Simultaneously Wide and Dense Data

By the way, related to my last post on wide and dense data, an important example of analysis of data that are both wide and dense is the high-frequency high-dimensional factor modeling of Pelger and Ait-Sahalia and Xiu.  Effectively they treat wide sets of realized volatilities, each of which is constructed from underlying dense data.

Monday, July 25, 2016

The Action is in Wide and/or Dense Data

I recently blogged on varieties of Big Data: (1) tall, (2) wide, and (3) dense.

Presumably tall data are the least interesting insofar as the only way to get a long calendar span is to sit around and wait, in contrast to wide and dense data, which now appear routinely.

But it occurs to me that tall data are also the least interesting for another reason:  wide data make tall data impossible from a certain perspective. In particular, non-parametric estimation in high dimensions (that is, with wide data) is always subject to the fundamental and inescapable "curse of dimensionality":  the rate at which estimation error vanishes gets hopelessly slow, very quickly, as dimension grows.  [Wonks will recall that the Stone-optimal rate in $$d$$ dimensions is $$\sqrt{T^{1- \frac{d}{d+4}}}$$.]
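
A quick calculation shows how fast the rate deteriorates. Note that $$\sqrt{T^{1- \frac{d}{d+4}}} = T^{\frac{2}{d+4}}$$. The sketch below asks how large a sample in $$d$$ dimensions must be to match the rate achieved by a given sample in one dimension; the function names and the reference sample of 100 observations are hypothetical.

```python
def convergence_rate(T, d):
    """Stone-optimal rate T^(2/(d+4)), i.e. sqrt(T^(1 - d/(d+4)))."""
    return T ** (2.0 / (d + 4.0))

def equivalent_sample_size(T_one_dim, d):
    """Sample size in d dimensions matching the rate of T_one_dim in 1 dimension."""
    return T_one_dim ** ((d + 4.0) / 5.0)

for d in (1, 2, 6, 11):
    print(d, equivalent_sample_size(100, d))
```

Matching the rate delivered by 100 observations in one dimension takes 100 squared observations in six dimensions and 100 cubed in eleven, which is the sense in which wider data are implicitly less tall.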

The upshot:  As our datasets get wider, they also implicitly get less tall. That's all the more reason to downplay tall data.  The action is in wide and dense data (whether separately or jointly).

Monday, July 18, 2016

The HAC Emperor has no Clothes: Part 2

The time-series kernel-HAC literature seems to have forgotten about pre-whitening. But most of the action is in the pre-whitening, as stressed in my earlier post. In time-series contexts, parametric allowance for good-old ARMA-GARCH disturbances (with AIC order selection, say) is likely to be all that's needed, cleaning out whatever conditional-mean and conditional-variance dynamics are operative, after which there's little/no need for anything else. (And although I say "parametric" ARMA/GARCH, it's actually fully non-parametric from a sieve perspective.)
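
As a bare-bones illustration of the parametric route, the sketch below estimates a long-run variance by fitting an AR(1) and using the implied formula, residual variance divided by $$(1-\rho)^2$$. Restricting to AR(1), with no order selection and no GARCH component, is a simplifying assumption made here for brevity; the function name and the simulated data are hypothetical.

```python
import random

def ar1_prewhitened_lrv(x):
    """Long-run variance of x from an AR(1) fit: s2 / (1 - rho)^2."""
    mean = sum(x) / len(x)
    z = [v - mean for v in x]
    # Least-squares AR(1) coefficient from the demeaned series.
    rho = sum(a * b for a, b in zip(z[1:], z[:-1])) / sum(b * b for b in z[:-1])
    resid = [a - rho * b for a, b in zip(z[1:], z[:-1])]
    s2 = sum(e * e for e in resid) / len(resid)
    return s2 / (1.0 - rho) ** 2

# Hypothetical data: AR(1) with coefficient 0.5 and unit innovation variance.
random.seed(42)
rho_true, x = 0.5, [0.0]
for _ in range(20000):
    x.append(rho_true * x[-1] + random.gauss(0.0, 1.0))

print(ar1_prewhitened_lrv(x))  # true value is 1 / (1 - 0.5)^2 = 4
```

Once the fitted dynamics have been removed, the residuals are close to white noise and there is no truncation lag left to choose.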

Instead, people focus on kernel-HAC sans prewhitening, and obsess over truncation lag selection. Truncation lag selection is indeed very important when pre-whitening is forgotten, as too short a lag can lead to seriously distorted inference, as emphasized in the brilliant early work of Kiefer-Vogelsang and in important recent work by Lewis, Lazarus, Stock and Watson. But all of that becomes much less important when pre-whitening is successfully implemented.

[Of course spectra need not be rational, so ARMA is just an approximation to a more general Wold representation (and remember, GARCH(1,1) is just an ARMA(1,1) in squares). But is that really a problem? In econometrics don't we feel comfortable with ARMA approximations 99.9 percent of the time? The only econometrically-interesting process I can think of that doesn't admit a finite-ordered ARMA representation is long memory (fractional integration). But that too can be handled parametrically by introducing just one more parameter, moving from ARMA(p,q) to ARFIMA(p,d,q).]
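
For readers who want the GARCH-as-ARMA claim spelled out, here is the standard two-line argument. The symbols $$\omega$$, $$\alpha$$, $$\beta$$ and $$\nu_t$$ are notation introduced here, not taken from the post. Start from the GARCH(1,1) conditional variance,

$$\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2,$$

and define $$\nu_t = \varepsilon_t^2 - \sigma_t^2$$, which is serially uncorrelated with mean zero. Substituting $$\sigma_t^2 = \varepsilon_t^2 - \nu_t$$ on both sides gives

$$\varepsilon_t^2 = \omega + (\alpha + \beta) \varepsilon_{t-1}^2 + \nu_t - \beta \nu_{t-1},$$

an ARMA(1,1) in $$\varepsilon_t^2$$ with autoregressive coefficient $$\alpha + \beta$$ and moving-average coefficient $$-\beta$$.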

My earlier post linked to the key early work of Den Haan and Levin, which remains unpublished. I am confident that their basic message remains intact. Indeed recent work revisits and amplifies it in important ways; see Kapetanios and Psaradakis (2016) and new work in progress by Richard Baillie to be presented at the September 2016 NBER/NSF time-series meeting at Columbia ("Is Robust Inference with OLS Sensible in Time Series Regressions?").

Sunday, July 10, 2016

Contemporaneous, Independent, and Complementary

You've probably been in a situation where you and someone else discovered something "contemporaneously and independently". Despite the initial sinking feeling, I've come to realize that there's usually nothing to worry about.

First, normal-time science has a certain internal momentum -- it simply must evolve in certain ways -- so people often identify and pluck the low-hanging fruit more-or-less simultaneously.

Second, and crucially, such incidents are usually not just the same discovery made twice. Rather, although intimately-related, the two contributions usually differ in subtle but important ways, rendering them complements, not substitutes.

Here's a good recent example in financial econometrics, working out asymptotics for high-frequency high-dimensional factor models. On the one hand, consider Pelger, and on the other hand consider Ait-Sahalia and Xiu.  There's plenty of room in the world for both, and the whole is even greater than the sum of the (individually-impressive) parts.

Sunday, July 3, 2016

DAG Software

Some time ago I mentioned the DAG (directed acyclic graph) primer by Judea Pearl et al.  As noted in Pearl's recent blog post, a manual will be available with software solutions based on a DAGitty R package.  See http://dagitty.net/primer/

More generally -- that is, quite apart from the Pearl et al. primer -- check out DAGitty at http://dagitty.net.  Click on "launch" and play around for a few minutes. Very cool.

Sunday, June 26, 2016

Regularization for Long Memory

Two earlier regularization posts focused on panel data and generic time series contexts. Now consider a specific time-series context: long memory. For exposition consider the simplest case of a pure long memory DGP,  $$(1-L)^d y_t = \varepsilon_t$$ with  $$|d| < 1/2$$.  This $$ARFIMA(0,d,0)$$ process is $$AR(\infty)$$ with very slowly decaying coefficients due to the long memory. If you KNEW the world was $$ARFIMA(0,d,0)$$ you'd just fit $$d$$ using GPH or Whittle or whatever, but you're not sure, so you'd like to stay flexible and fit a very long $$AR$$ (an $$AR(100)$$, say). But such a profligate parameterization is infeasible or at least very wasteful. A solution is to fit the $$AR(100)$$ but regularize by estimating with ridge or a LASSO variant, say.
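
To see just how slowly the autoregressive coefficients decay, one can compute them from the binomial expansion of $$(1-L)^d$$. A minimal sketch follows; the function name and the choice $$d = 0.4$$ are hypothetical.

```python
def arfima_ar_coefficients(d, n_lags):
    """AR(infinity) coefficients phi_1..phi_n of (1-L)^d y_t = eps_t.

    Writing (1-L)^d = sum_j pi_j L^j with pi_0 = 1 and
    pi_j = pi_{j-1} * (j - 1 - d) / j, the AR form is
    y_t = sum_j phi_j y_{t-j} + eps_t with phi_j = -pi_j.
    """
    pi, phis = 1.0, []
    for j in range(1, n_lags + 1):
        pi *= (j - 1 - d) / j
        phis.append(-pi)
    return phis

phi = arfima_ar_coefficients(0.4, 100)
print(phi[0], phi[1], phi[9], phi[99])
```

The coefficients shrink hyperbolically rather than geometrically, so even lag 100 is not negligible relative to what a short-memory AR would imply, which is why a long AR is needed and why regularizing it is attractive.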

Related, recall the Corsi "HAR" approximation to long memory. It's just a long autoregression subject to coefficient restrictions. So you could do a LASSO estimation, as in Audrino and Knaus (2013). (Related analysis and references are in a Humboldt University 2015 master's thesis.)

Finally, note that in all of the above it might be desirable to change the LASSO centering point for shrinkage/selection to match the long-memory restriction. (In standard LASSO it's just 0.)
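To make the idea concrete, here is a minimal sketch in Python (NumPy only). It is an illustration under stated assumptions, not a recommended estimator: it uses ridge rather than LASSO because ridge has a closed form, the penalty value 50 is arbitrary (in practice you'd pick it by cross validation, say), and the centering value d0 = 0.4 is a hypothetical preliminary guess. All function names are invented for the example.

```python
import numpy as np

def simulate_arfima_0d0(n, d, rng, burn=500):
    """Simulate (1-L)^d y_t = eps_t via a truncated MA(infinity) representation."""
    m = n + burn
    psi = np.ones(m)  # MA weights of (1-L)^(-d): psi_k = psi_{k-1} * (k - 1 + d) / k
    for k in range(1, m):
        psi[k] = psi[k - 1] * (k - 1 + d) / k
    eps = rng.standard_normal(m)
    return np.convolve(eps, psi)[:m][burn:]

def long_memory_ar_coefs(d, p):
    """First p AR(infinity) coefficients implied by (1-L)^d y_t = eps_t."""
    pi = np.ones(p + 1)  # (1-L)^d = sum_k pi_k L^k, with pi_0 = 1
    for k in range(1, p + 1):
        pi[k] = pi[k - 1] * (k - 1 - d) / k
    return -pi[1:]  # y_t = sum_k a_k y_{t-k} + eps_t with a_k = -pi_k

def lagged_design(y, p):
    """Row t of X holds (y_{t-1}, ..., y_{t-p}); target holds y_t."""
    n = len(y)
    X = np.column_stack([y[p - k : n - k] for k in range(1, p + 1)])
    return X, y[p:]

def ridge_ar(y, p, lam, center=None):
    """Ridge AR(p) that shrinks toward `center` (toward zero if center is None)."""
    X, target = lagged_design(y, p)
    b0 = np.zeros(p) if center is None else center
    resid0 = target - X @ b0
    return b0 + np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ resid0)

rng = np.random.default_rng(0)
y = simulate_arfima_0d0(n=2000, d=0.4, rng=rng)
p = 100

ols = ridge_ar(y, p, lam=0.0)            # unregularized AR(100)
ridge_zero = ridge_ar(y, p, lam=50.0)    # standard centering at 0
ridge_lm = ridge_ar(y, p, lam=50.0,      # centering at the long-memory restriction
                    center=long_memory_ar_coefs(0.4, p))
```

As the penalty grows, ridge_zero collapses to white noise, whereas ridge_lm collapses to the pure long-memory autoregression, which is the point of moving the centering.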

Wednesday, June 22, 2016

Observed Info vs. Estimated Expected Info

All told, after decades of research, it seems that Efron-Hinkley holds up -- observed information dominates estimated expected information for finite-sample MLE inference. It's both easier to calculate and more accurate. Let me know if you disagree.

[Efron, B. and Hinkley, D.V. (1978), "Assessing the Accuracy of the Maximum Likelihood Estimator: Observed Versus Expected Fisher Information", Biometrika, 65, 457–487.]
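For readers who want to see the two quantities side by side, here is a small Python sketch for the Cauchy location model, where expected information is n/2 regardless of the parameter but observed information varies from sample to sample. The sample size, seed, and the grid-plus-Newton optimizer are illustrative assumptions of mine, and the function names are invented for the example.

```python
import numpy as np

def cauchy_score_and_hessian(theta, x):
    """First and second derivatives of the Cauchy location log-likelihood."""
    u = x - theta
    score = np.sum(2.0 * u / (1.0 + u**2))
    hessian = -np.sum(2.0 * (1.0 - u**2) / (1.0 + u**2) ** 2)
    return score, hessian

def cauchy_location_mle(x, newton_steps=20):
    """Coarse grid search near the sample median, then Newton refinement."""
    grid = np.median(x) + np.linspace(-2.0, 2.0, 4001)
    loglik = -np.log1p((x[None, :] - grid[:, None]) ** 2).sum(axis=1)
    theta = grid[np.argmax(loglik)]
    for _ in range(newton_steps):
        score, hessian = cauchy_score_and_hessian(theta, x)
        theta = theta - score / hessian
    return theta

rng = np.random.default_rng(0)
x = rng.standard_cauchy(50)   # true location is 0, scale is 1
theta_hat = cauchy_location_mle(x)

# Observed information: minus the Hessian of the log-likelihood at the MLE.
observed_info = -cauchy_score_and_hessian(theta_hat, x)[1]
# Expected information: n times the per-observation Fisher information (1/2 here).
expected_info = len(x) / 2.0

se_observed = 1.0 / np.sqrt(observed_info)
se_expected = 1.0 / np.sqrt(expected_info)
```

Note that observed information needs only the Hessian you already have from the optimizer, with no expectation to work out, which is the "easier to calculate" part.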

Tuesday, June 21, 2016

Mixed-Frequency High-Dimensional Time Series

Notice that high dimensions and mixed frequencies go together in time series. (If you're looking at a huge number of series, it's highly unlikely that all will be measured at the same frequency, unless you arbitrarily exclude all frequencies but one.) So high-dim MIDAS vector autoregression (VAR) will play a big role moving forward. The MIDAS literature is starting to go multivariate, with MIDAS VAR's appearing; see Ghysels (2015, in press) and Mikosch and Neuwirth (2016 w.p.).

But the multivariate MIDAS literature is still low-dim rather than high-dim. Next steps will be:

(1) move to high-dim VAR estimation by using regularization methods (e.g. LASSO variants),

(2) allow for many observational frequencies (five or six, say),

(3) allow for the "rough edges" that will invariably arise at the beginning and end of the sample, and

(4) visualize results using network graphics.

Conditional Dependence and Partial Correlation

In the multivariate normal case, conditional independence is the same as zero partial correlation.  (See below.) That makes a lot of things a lot simpler.  In particular, determining ordering in a DAG is just a matter of assessing partial correlations. Of course in many applications normality may not hold, but still...

Aust. N.Z. J. Stat. 46(4), 2004, 657–664
PARTIAL CORRELATION AND CONDITIONAL CORRELATION AS MEASURES OF CONDITIONAL INDEPENDENCE
Kunihiro Baba, Ritei Shibata and Masaaki Sibuya
Keio University and Takachiho University
Summary
This paper investigates the roles of partial correlation and conditional correlation as measures of the conditional independence of two random variables. It first establishes a sufficient condition for the coincidence of the partial correlation with the conditional correlation. The condition is satisfied not only for multivariate normal but also for elliptical, multivariate hypergeometric, multivariate negative hypergeometric, multinomial and Dirichlet distributions. Such families of distributions are characterized by a semigroup property as a parametric family of distributions. A necessary and sufficient condition for the coincidence of the partial covariance with the conditional covariance is also derived. However, a known family of multivariate distributions which satisfies this condition cannot be found, except for the multivariate normal. The paper also shows that conditional independence has no close ties with zero partial correlation except in the case of the multivariate normal distribution; it has rather close ties to the zero conditional correlation. It shows that the equivalence between zero conditional covariance and conditional independence for normal variables is retained by any monotone transformation of each variable. The results suggest that care must be taken when using such correlations as measures of conditional independence unless the joint distribution is known to be normal. Otherwise a new concept of conditional independence may need to be introduced in place of conditional independence through zero conditional correlation or other statistics.
Keywords: elliptical distribution; exchangeability; graphical modelling; monotone transformation.
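The normal-case equivalence is easy to see numerically. The sketch below is a Python (NumPy) illustration under assumptions of my own choosing: a three-variable Gaussian chain X -> Y -> Z with made-up coefficients 0.8 and 0.5 and unit-variance shocks. It works with the population covariance matrix, so there is no sampling noise, and it reads partial correlations off the inverse covariance (precision) matrix.

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlation of each pair given all other variables,
    computed from the precision matrix: -P_ij / sqrt(P_ii * P_jj)."""
    prec = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(prec))
    pc = -prec / np.outer(scale, scale)
    np.fill_diagonal(pc, 1.0)
    return pc

# Hypothetical Gaussian chain: X = e1, Y = 0.8 X + e2, Z = 0.5 Y + e3,
# with independent unit-variance shocks e1, e2, e3.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
A = np.linalg.inv(np.eye(3) - B)   # (X, Y, Z)' = A e
cov = A @ A.T                      # population covariance matrix

std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)    # ordinary (marginal) correlations
pcorr = partial_correlations(cov)

corr_xz = corr[0, 2]               # nonzero: X and Z are marginally dependent
pcorr_xz = pcorr[0, 2]             # zero up to rounding: X and Z independent given Y
```

X and Z are correlated, but their partial correlation given Y vanishes, matching the missing X-Z edge in the chain. Outside the normal case, as the abstract above cautions, a zero partial correlation need not mean conditional independence.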

Saturday, June 18, 2016

A Little Bit More on Dave Backus

In the days since his passing, lots of wonderful things have been said about Dave Backus. (See, for example, the obituary by Tom Cooley, posted on David Levine's page.) They're all true. But none sufficiently stress what was for me his essence: complete selflessness. We've all had a few good colleagues, even great colleagues, but Dave took it to an entirely different level.

The "Teaching" section of his web page begins, "I have an open-source attitude toward teaching materials". Dave had an open-source attitude toward everything. He lived for team building, cross-fertilization, mentoring, and on and on. A lesser person would have traded the selflessness for a longer c.v., but not Dave. And we're all better off for it.

SoFiE 2016 Hong Kong (and 2017 New York)

Hats off to all those who helped make the Hong Kong SoFiE meeting such a success. Special thanks (in alphabetical order) to Charlotte Chen, Yin-Wong Cheung, Jianqing Fan, Eric Ghysels, Ravi Jagannathan, Yingying Li, Daniel Preve, and Giorgio Valente. The conference web site is here.

Mark your calendars now for what promises to be a very special tenth-anniversary meeting next year in New York, hosted by Rob Engle at NYU's Stern School. The dates are June 20-23, 2017.

Tuesday, June 14, 2016

Indicator Saturation Estimation

In an earlier post, "Fixed Effects Without Panel Data",  I argued that you could allow for (and indeed estimate) fixed effects in pure cross sections (i.e., no need for panel data) by using regularization estimators like LASSO. The idea is to fit a profligately-parameterized model but then to recover d.f. by regularization.

Note that you can use the same idea in time-series contexts.  Even in a pure time series, you can allow for period-by-period time effects, broken polynomial trend with an arbitrary number of breakpoints, etc., via regularization.

It turns out that a fascinating small literature on so-called "indicator saturation estimation" pursues this idea.  The "indicators" are things like period-by-period time dummies, break-date location dummies, etc., and "saturation" refers to the profligate parameterization.  Prominent contributors include David Hendry and Soren Johansen; see this new paper and those that it cites.  (Very cool application, by the way, to detecting historical volcanic eruptions.)

Monday, June 6, 2016

Fixed Effects Without Panel Data

Consider a pure cross section (CS) of size N.  Generally you'd like to allow for individual effects, but you can't, because OLS with a full set of N individual dummies is conceptually infeasible. (You'd exhaust degrees of freedom.) That's usually what motivates the desirability/beauty of panel data -- there you have NxT observations, so including N individual dummies becomes conceptually feasible.

But there's no need to stay with OLS.  You can recover d.f. using regularization estimators like ridge (shrinkage) or LASSO (shrinkage and selection).  So including a full set of individual dummies, even in a pure CS, is completely feasible!  For implementation you just have to select the ridge or lasso penalty parameter, which is reliably done by cross validation (say).

There are two key points.  The first is that you can allow for individual fixed effects even in a pure CS; that is, there's no need for panel data.  That's what I've emphasized so far.

The second is that the proposed method actually gives estimates of the fixed effects.  Sometimes they're just nuisance parameters that can be ignored; indeed standard panel estimation methods "difference them out", so they're not even estimated.  But estimates of the fixed effects are crucial for forecasting:  to forecast y_i, you need not only Mr. i's covariates and estimates of the "slope parameters", but also an estimate of Mr. i's intercept!  That's why forecasting is so conspicuously absent from most of the panel literature -- the fixed effects are not estimated, so forecasting is hopeless.  Regularized estimation, in contrast, delivers estimates of fixed effects, thereby facilitating forecasting, and you don't even need a panel.
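Here is a minimal Python (NumPy) sketch of the idea, under assumptions that are mine rather than part of the argument: simulated data in which only 10 of 200 individuals have nonzero effects, a LASSO penalty on the N individual dummies only (the common intercept and slope are left unpenalized), and a penalty level fixed by hand instead of chosen by cross validation. Because the dummy design is the identity matrix, the LASSO problem can be solved by alternating two exact updates, so no LASSO library is needed. Function and variable names are invented for the example.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise LASSO shrinkage: move z toward 0 by lam, truncating at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_individual_effects(y, X, lam, iters=500):
    """Minimize 0.5*||y - X b - a||^2 + lam*||a||_1 over (b, a), where a holds
    one effect per individual, by alternating exact minimization in b and a."""
    a = np.zeros(len(y))
    for _ in range(iters):
        b, *_ = np.linalg.lstsq(X, y - a, rcond=None)   # OLS step given effects
        a = soft_threshold(y - X @ b, lam)              # LASSO step given slopes
    return b, a

rng = np.random.default_rng(0)
N = 200
x = rng.standard_normal(N)
alpha_true = np.zeros(N)
alpha_true[:10] = 5.0                                   # a few large individual effects
y = 1.0 + 2.0 * x + alpha_true + 0.5 * rng.standard_normal(N)

X = np.column_stack([np.ones(N), x])                    # common intercept and slope
b_hat, alpha_hat = lasso_individual_effects(y, X, lam=2.0)

# Prediction for individual i uses the slope terms plus i's own estimated effect.
y_hat = X @ b_hat + alpha_hat
```

Most estimated effects are selected out to exactly zero, while the large ones survive (shrunken), and each individual's prediction uses his own estimated intercept. Replacing the soft-threshold step with division of the residual by (1 + lam) gives the ridge version, which shrinks without selecting.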

Friday, June 3, 2016

Causal Estimation and Millions of Lives

This just in from a fine former Ph.D. student.  He returned to India many years ago and made his fortune in finance.  He's now devoting himself to the greater good, working with the Bill and Melinda Gates Foundation.

I reminded him that I'm not likely to be a big help, as I generally don't do causal estimation or experimental design. But he kindly allowed me to post his communication below (abridged and slightly edited). Please post comments for him if you have any suggestions. [As you know, I write this blog more like a newspaper column, neither encouraging nor receiving many comments -- so now's your chance to comment!]

He writes:

One of the key challenges we face in our work is that causality is not known, and while theory and large scale studies, such as those published in the Lancet, do provide us with some guidance, it is far from clear that they reflect the reality on the ground when we are intervening in field settings with markedly different starting points from those that were used in the studies. However, while we observe the ground situation imperfectly and with large error, the inertia in the underlying system that we are trying to impact is so high that it would perhaps be safe to say that, unlike in the corporate world, there isn’t a lot of creative destruction going on here. In such a situation it would seem to me that the best way to learn about the “true but unobserved” reality and how to permanently change it and scale the change cost-effectively (such as nurse behavior in facilities) is to go on attempting different interventions which are structured in such a way as to allow for a rapid convergence to the most effective interventions (similar to the famous Runge-Kutta iterative methods for rapidly and efficiently arriving at solutions to differential equations to the desired level of accuracy).

However, while the need is for rapid learning, the most popular methods proceed by collecting months or years of data in both intervention and control settings, and at the end of it all, if done very-very carefully, all that they can tell you is that there were some links (or not) between the interventions and results without giving you any insight into why something happened or what can be done to improve it. In the meanwhile one is expected to hold the intervention steady and almost discard all the knowledge that is continuously being generated and be patient even while lives are being lost because the intervention was not quite designed well. While the problems with such an approach are apparent, the alternative cannot be instinct or gut feeling and a series of uncoordinated actions in the name of “being responsive”.

I am writing to request your help in pointing us to literature that can act as a guide to how we may do this better. ... I have indeed found some ideas in the literature that may be somewhat useful, ... [and] while very interesting and informative, I’m afraid it is not yet clear to me how we will apply these ideas in our actual field settings, and how we will design our Measurement, Learning, and Evaluation approaches differently so that we can actually implement these ideas in difficult on-ground settings in remote parts of our country involving, literally, millions of lives.

Saturday, May 28, 2016

No Hesitations at 500K

Some company just emailed to inform me that No Hesitations had made its list of the Top 100 Economics Blogs.  I was pretty happy until I decided that there were probably only 70 or 80 economics blogs.

But seriously, thanks a lot for your wonderful support.  No Hesitations has about 500,000 pageviews since launching in summer 2013, and the trend (below) looks good.  The time has flown by, and I look forward to continuing.

Monday, May 23, 2016

Here's a continuation of this recent post (for students) on listening to writing.

OK, you say, Martin Amis interviews are entertaining, but Martin Amis is not a mere mortal, so what's the practical writing advice for the rest of us? Read this, from Gary Provost (evidently the highlighting is keyed to different sentence lengths):

Sunday, May 22, 2016

Martin Amis on How to Write a Great Sentence

It's been a while since I did a piece on good writing, for students.   In an old post I said "Listen to your words; push your prose toward poetry."  That's perhaps a bit much -- you don't need to write poetry, but you do need to listen to your writing.

On the listening theme, check out this Martin Amis clip, even if I don't see why you shouldn't repeat prefixes or suffixes in the same sentence (in fact I think the repetition can sometimes be poetic, a sort of alliteration, when done tastefully).  And while you're at it, take a look at this marvelous older clip too.

Friday, May 20, 2016

Hazard Functions for U.S. Expansions

Glenn Rudebusch has a very nice 2016 FRBSF Letter, "Will the Economic Recovery Die of Old Age?".  He draws on perspective and results from our joint work of 25 years ago (including a paper we did with Dan Sichel -- see below), and he applies them to the present expansion.  He correctly emphasizes that U.S. expansion hazard functions are basically flat, so "old" expansions are no more likely to end than "young" ones. That's of some comfort, since the present expansion, which started in mid-2009, is getting long in the tooth!

Actually, the flat expansion hazard is only for post-WWII expansions; the prewar expansion hazard is sharply increasing. Here's how they compare (copied from Glenn's FRBSF Letter):

Probability of an Expansion ending within a month

Perhaps the massive difference is due to "good policy", that is, post-war policy success in "keeping expansions alive".  Or perhaps it's just "good luck" -- but it's so big and systematic that luck alone seems an unlikely explanation.

For more on all this, and to see the equally-fascinating and very different results for recession hazards, see Diebold, Rudebusch and Sichel (1992), which I consider to be the best statement of our work in the area.

[Footnote:  I wrote this post about three days ago, intending to release it next week. I just learned that The Economist (May 21st issue) also reports on the Rudebusch FRBSF Letter (see http://www.economist.com/news/finance-and-economics/21699124-when-periods-economic-growth-come-end-old-age-rarely-blame-murder), so I'm releasing it early.  Interesting that both The Economist and I are not only slow -- Glenn sent me his Letter in February, when it was published! -- but also identically slow.]

R/Finance 2016: Applied Finance with R

At R/Finance 2016: Applied Finance with R.  Interesting group, with many constituencies, and interesting program, which appears below (or go to http://www.rinfinance.com/agenda/).

Tuesday, May 17, 2016

Statistical Machine Learning Circa 1989

I've always been a massive fan of statisticians whose work is rigorous yet practical, with emphasis on modeling. People like Box, Cox, Hastie, and Tibshirani obviously come to mind.  So too, of course, do Leo Breiman and Jerry Friedman.

I had the good luck to stumble into a week-long intensive lecture series with Jerry Friedman in 1989, a sort of summer school for twenty-something assistant professors and the like.  At the time I was a young economist in DC at the Federal Reserve Board, and the lectures were just down the street at GW.

I thought I would attend to learn some non-parametrics, and I definitely did learn some non-parametrics.  But far more than that, Jerry opened my eyes to what would be unfolding for the next half-century -- flexible, algorithmic, high-dimensional methods -- the statistics of "Big Data" and "machine learning".

I just found the binder containing his lecture notes.  The contents appear below.  Read the opening overview, "Modern Statistics and the Computer Revolution".  Amazingly prescient.  Remember, this was 1989!

[Side note:  There I also had the pleasure of first meeting Bob Stine, who has now been my esteemed Penn Statistics colleague for more than 25 years.]