26 Matching Annotations
  1. Jan 2020
    1. Suppose the algorithm chooses a tree that splits on education but not on age. Conditional on this tree, the estimated coefficients are consistent. But that does not imply that treatment effects do not also vary by age, as education may well covary with age; on other draws of the data, in fact, the same procedure could have chosen a tree that split on age instead

      a caveat

    2. These heterogeneous treatment effects can be used to assign treatments; Misra and Dubé (2016) illustrate this on the problem of price targeting, applying Bayesian regularized methods to a large-scale experiment where prices were randomly assigned

      todo -- look into the implication for treatment assignment with heterogeneity

    3. Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2016) take care of high-dimensional controls in treatment effect estimation by solving two simultaneous prediction problems, one in the outcome and one in the treatment equation.

      this seems similar to my idea of regularizing on only a subset of the variables
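
      To make the "two simultaneous prediction problems" idea concrete, here is a minimal partialling-out sketch on simulated data of my own (not the paper's or Chernozhukov et al.'s code): predict the outcome from the controls, predict the treatment from the controls, then regress residual on residual. A full double-ML implementation would also cross-fit, i.e. form each prediction on held-out folds.

      ```python
      # Sketch of the "two prediction problems" idea (partialling-out).
      # Simulated data; all names here are my own illustration.
      import numpy as np
      from sklearn.linear_model import LassoCV, LinearRegression

      rng = np.random.default_rng(0)
      n, p = 1000, 50
      X = rng.normal(size=(n, p))                        # high-dimensional controls
      d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # treatment depends on controls
      y = 1.0 * d + 2.0 * X[:, 0] + rng.normal(size=n)   # true treatment effect = 1.0

      # Prediction problem 1: outcome equation, y ~ X
      y_hat = LassoCV(cv=5).fit(X, y).predict(X)
      # Prediction problem 2: treatment equation, d ~ X
      d_hat = LassoCV(cv=5).fit(X, d).predict(X)

      # Regress the residualized outcome on the residualized treatment
      res = LinearRegression().fit((d - d_hat).reshape(-1, 1), y - y_hat)
      print("estimated treatment effect:", res.coef_[0])  # should be near 1.0
      ```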

    4. These same techniques applied here result in split-sample instrumental variables (Angrist and Krueger 1995) and “jackknife” instrumental variables

      some classical solutions to IV bias are akin to ML solutions

    5. Understood this way, the finite-sample biases in instrumental variables are a consequence of overfitting.

      traditional 'finite sample bias of IV' is really overfitting

    6. Even when we are interested in a parameter β̂, the tool we use to recover that parameter may contain (often implicitly) a prediction component. Take the case of linear instrumental variables understood as a two-stage procedure: first regress x = γ′z + δ on the instrument z, then regress y = β′x + ε on the fitted values x̂. The first stage is typically handled as an estimation step. But this is effectively a prediction task: only the predictions x̂ enter the second stage; the coefficients in the first stage are merely a means to these fitted values.

      first stage of IV -- handled as an estimation problem, but really it's a prediction problem!
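
      A minimal illustration of that point on simulated data of my own (not the paper's example): two-stage least squares done by hand, where the first stage is nothing but a prediction of x from z, and only the fitted values x̂ are carried into the second stage.

      ```python
      # 2SLS by hand: the first stage matters only through the fitted values x_hat.
      # Simulated data; illustrative only.
      import numpy as np
      from sklearn.linear_model import LinearRegression

      rng = np.random.default_rng(1)
      n = 5000
      z = rng.normal(size=(n, 1))                   # instrument
      u = rng.normal(size=n)                        # unobserved confounder
      x = 0.8 * z[:, 0] + u + rng.normal(size=n)    # endogenous regressor
      y = 2.0 * x + u + rng.normal(size=n)          # true beta = 2.0

      # First stage: a pure prediction task, x as a function of z
      x_hat = LinearRegression().fit(z, x).predict(z)

      # Second stage: regress y on the fitted values only
      beta_iv = LinearRegression().fit(x_hat.reshape(-1, 1), y).coef_[0]
      beta_ols = LinearRegression().fit(x.reshape(-1, 1), y).coef_[0]
      print("OLS (biased):", beta_ols, "  2SLS:", beta_iv)  # 2SLS should be near 2.0
      ```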

    7. Prediction in the Service of Estimation

      This is especially relevant to economists across the board, even the ML skeptics

    8. New Data

      The first application: constructing variables and meaning from high-dimensional data, especially outcome variables

      • satellite images (of energy use, lights, etc.) --> economic activity
      • cell phone data, Google Street View to measure wealth
      • extract similarity of firms from 10-K reports
      • even traditional data ... matching individuals in historical censuses
    9. Zhao and Yu (2006) who establish asymptotic model-selection consistency for the LASSO. Besides assuming that the true model is “sparse”—only a few variables are relevant—they also require the “irrepresentable condition” between observables: loosely put, none of the irrelevant covariates can be even moderately related to the set of relevant ones.

      Basically unrealistic for microeconomic applications imho
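
      A small simulated sketch of what a violation looks like (my own construction, not the paper's): an irrelevant covariate that is almost a linear combination of the relevant ones. Whether it gets selected will vary across draws; the point is only to make the "irrepresentable condition" concrete.

      ```python
      # When an irrelevant covariate is highly correlated with the relevant ones
      # (violating the irrepresentable condition), LASSO's selected support need not
      # recover the true model.  Simulated data; illustrative only.
      import numpy as np
      from sklearn.linear_model import LassoCV

      rng = np.random.default_rng(2)
      n = 500
      x1 = rng.normal(size=n)
      x2 = rng.normal(size=n)
      x3 = 0.7 * x1 + 0.7 * x2 + 0.1 * rng.normal(size=n)  # irrelevant, but nearly spanned by x1, x2
      X = np.column_stack([x1, x2, x3])
      y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)          # true support: {x1, x2}

      fit = LassoCV(cv=5).fit(X, y)
      print("selected support:", np.flatnonzero(fit.coef_))  # check whether x3 sneaks in
      ```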

    10. First, it encourages the choice of less complex, but wrong models. Even if the best model uses interactions of number of bathrooms with number of rooms, regularization may lead to a choice of a simpler (but worse) model that uses only number of fireplaces. Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regularization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.

      Is this equally a problem for procedures that do not assume sparsity, such as the Ridge model?
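
      The "cousin of omitted variable bias" mechanism is easy to see even without LASSO in the loop. In this simulated sketch (my own, purely illustrative) a relevant, correlated, observed variable is dropped by hand, standing in for exclusion via regularization, and the coefficient on the kept variable absorbs part of its effect.

      ```python
      # Excluding an observed variable that belongs in the model (here by hand, as a
      # stand-in for exclusion via regularization) biases the coefficient on a kept
      # variable it is correlated with.  Simulated data; illustrative only.
      import numpy as np
      from sklearn.linear_model import LinearRegression

      rng = np.random.default_rng(3)
      n = 10000
      x1 = rng.normal(size=n)
      x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)     # observed, correlated with x1
      y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

      both = LinearRegression().fit(np.column_stack([x1, x2]), y)
      only_x1 = LinearRegression().fit(x1.reshape(-1, 1), y)
      print("coef on x1, both included:", both.coef_[0])     # ~1.0
      print("coef on x1, x2 excluded:  ", only_x1.coef_[0])  # ~1.8, absorbs x2's effect
      ```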

    11. the variables are correlated with each other (say the number of rooms of a house and its square-footage), then such variables are substitutes in predicting house prices. Similar predictions can be produced using very different variables. Which variables are actually chosen depends on the specific finite sample.

      Lasso-chosen variables are unstable because of what we usually call 'multicollinearity.' This presents a problem for making inferences from estimated coefficients.
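
      A quick way to see this instability (my own simulated sketch, not the paper's exercise): draw bootstrap resamples from data with strongly correlated covariates and compare which variables LASSO selects each time.

      ```python
      # With correlated covariates, the set of variables LASSO selects can change
      # from one resample to the next.  Simulated data; illustrative only.
      import numpy as np
      from sklearn.linear_model import LassoCV

      rng = np.random.default_rng(4)
      n, p = 400, 30
      cov = 0.9 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-style correlation
      X = rng.multivariate_normal(np.zeros(p), cov, size=n)
      beta = np.zeros(p)
      beta[:5] = 1.0                                # true support: first five variables
      y = X @ beta + rng.normal(size=n)

      for rep in range(5):                          # refit on bootstrap resamples
          idx = rng.integers(0, n, size=n)
          fit = LassoCV(cv=5).fit(X[idx], y[idx])
          print("run", rep, "selected:", np.flatnonzero(np.abs(fit.coef_) > 1e-8))
      ```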

    12. Through its regularizer, LASSO produces a sparse prediction function, so that many coefficients are zero and are “not used”—in this example, we find that more than half the variables are unused in each run

      This is true, but they fail to mention that LASSO also shrinks the coefficients on the variables it keeps towards zero (relative to OLS). I think this is commonly misunderstood (judging from people I've spoken with).
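
      A small simulated illustration of that shrinkage (my own sketch, not the paper's code): compare LASSO's coefficients on the variables it keeps with an OLS refit on the same kept set.

      ```python
      # LASSO not only zeroes out some coefficients; it also shrinks the ones it
      # keeps toward zero relative to OLS.  Simulated data; illustrative only.
      import numpy as np
      from sklearn.linear_model import Lasso, LinearRegression

      rng = np.random.default_rng(5)
      n, p = 500, 20
      X = rng.normal(size=(n, p))
      beta = np.zeros(p)
      beta[:3] = [2.0, -1.5, 1.0]
      y = X @ beta + rng.normal(size=n)

      lasso = Lasso(alpha=0.2).fit(X, y)
      kept = np.flatnonzero(lasso.coef_)
      ols_on_kept = LinearRegression().fit(X[:, kept], y)   # OLS restricted to the kept set

      print("kept variables:           ", kept)
      print("LASSO coefficients:       ", lasso.coef_[kept].round(2))
      print("OLS on the same variables:", ols_on_kept.coef_.round(2))
      # The LASSO coefficients are pulled toward zero relative to the OLS refit
      # (the refit is the "post-LASSO" estimator in Belloni et al.'s terminology).
      ```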

    13. One obvious problem that arises in making such inferences is the lack of standard errors on the coefficients. Even when machine-learning predictors produce familiar output like linear functions, forming these standard errors can be more complicated than seems at first glance as they would have to account for the model selection itself. In fact, Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to obtain (uniformly) consistent estimates of the distribution of model parameters after data-driven selection.

      This is a very serious limitation for academic work in Economics.

    14. First, econometrics can guide design choices, such as the number of folds or the function class.

      How would Econometrics guide us in this?

    15. These choices about how to represent the features will interact with the regularizer and function class: A linear model can reproduce the log base area per room from log base area and log room number easily, while a regression tree would require many splits to do so.

      The choice of 'how to represent the features' is consequential ... it's not just 'throw it all in' (kitchen sink approach)
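
      A toy version of the paper's example, on simulated data of my own: the target log(base area per room) is exactly linear in log(base area) and log(rooms), so a linear model recovers it immediately, while a shallow regression tree can only approximate it with piecewise-constant splits.

      ```python
      # Feature representation interacts with the function class: a linear model
      # reproduces log(area per room) from log(area) and log(rooms) exactly, while a
      # shallow tree only approximates it.  Simulated data; illustrative only.
      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.tree import DecisionTreeRegressor

      rng = np.random.default_rng(6)
      n = 2000
      log_area = rng.uniform(3.0, 6.0, size=n)
      log_rooms = rng.uniform(0.5, 2.5, size=n)
      X = np.column_stack([log_area, log_rooms])
      y = log_area - log_rooms                      # log base area per room

      lin = LinearRegression().fit(X, y)
      tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
      print("linear model in-sample R^2:", lin.score(X, y))   # exactly 1.0
      print("depth-3 tree in-sample R^2:", tree.score(X, y))  # below 1.0; a plane needs many splits
      ```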

    16. Table 2: Some Machine Learning Algorithms

      This is a very helpful table!

    17. Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning (as we saw in cross-validating the depth of the tree).

      ML explained while standing on one leg.
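
      The two steps in code, as a sketch on simulated data of my own (not the paper's exercise): conditional on a complexity level (here tree depth), fitting minimizes in-sample loss; the complexity level itself is then chosen by cross-validation.

      ```python
      # Step 1: for each candidate depth, fit the tree by minimizing in-sample loss.
      # Step 2: choose the depth itself by cross-validated performance.
      # Simulated data; illustrative only.
      import numpy as np
      from sklearn.model_selection import GridSearchCV
      from sklearn.tree import DecisionTreeRegressor

      rng = np.random.default_rng(7)
      n, p = 1000, 10
      X = rng.normal(size=(n, p))
      y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

      search = GridSearchCV(
          DecisionTreeRegressor(),                        # step 1 happens inside each fit
          param_grid={"max_depth": list(range(1, 15))},   # step 2: tune the complexity level
          cv=5,
          scoring="neg_mean_squared_error",
      )
      search.fit(X, y)
      print("depth chosen by cross-validation:", search.best_params_["max_depth"])
      ```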

    18. Regularization combines with the observability of prediction quality to allow us to fit flexible functional forms and still find generalizable structure.

      But we can't really make statistical inferences about the structure, can we?

    19. This procedure works because prediction quality is observable: both predictions ŷ and outcomes y are observed. Contrast this with parameter estimation, where typically we must rely on assumptions about the data-generating process to ensure consistency.

      I'm not clear what the implication they are making here is. Does it in some sense 'not work' with respect to parameter estimation?

    20. In empirical tuning, we create an out-of-sample experiment inside the original sample.

      remember that tuning is done within the training sample

    21. Performance of Different Algorithms in Predicting House Values

      Any reason they didn't try a Ridge or an Elastic net model here? My instinct is that these will beat LASSO for most Economic applications.
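
      For what it's worth, adding them to the horse race would be straightforward. A sketch, assuming training and hold-out matrices have already been built from the AHS extract (X_train, y_train, X_hold, y_hold are placeholders of mine, not objects from the paper's appendix):

      ```python
      # Compare LASSO, ridge, and elastic net by hold-out fit.
      # The data matrices passed in are assumed, not provided by the paper.
      import numpy as np
      from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

      def compare_penalized(X_train, y_train, X_hold, y_hold):
          models = {
              "lasso": LassoCV(cv=5),
              "ridge": RidgeCV(alphas=np.logspace(-3, 3, 25)),
              "elastic net": ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.9]),
          }
          for name, model in models.items():
              model.fit(X_train, y_train)                         # tune penalty on the training data
              print(f"{name:12s} hold-out R^2: {model.score(X_hold, y_hold):.3f}")
      ```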

    22. We consider 10,000 randomly selected owner-occupied units from the 2011 metropolitan sample of the American Housing Survey. In addition to the values of each unit, we also include 150 variables that contain information about the unit and its location, such as the number of rooms, the base area, and the census region within the United States. To compare different prediction techniques, we evaluate how well each approach predicts (log) unit value on a separate hold-out set of 41,808 units from the same sample. All details on the sample and our empirical exercise can be found in an online appendix available with this paper at http://e-jep.org

      Seems a useful example for trying/testing/benchmarking. But the link didn't work for me. Can anyone find it? Is it interactive? (This is why I think papers should be HTML and not PDFs...)

    23. Making sense of complex data such as images and text often involves a prediction pre-processing step.

      In using 'new kinds of data' in Economics we often need to do a 'classification step' first

    24. The fundamental insight behind these breakthroughs is as much statistical as computational. Machine intelligence became possible once researchers stopped approaching intelligence tasks procedurally and began tackling them empirically.

      I hadn't thought about how this unites the 'statistics to learn stuff' part of ML and the 'build a tool to do a task' part. Well-phrased.

    25. In another category of applications, the key object of interest is actually a parameter β, but the inference procedures (often implicitly) contain a prediction task. For example, the first stage of a linear instrumental variables regression is effectively prediction. The same is true when estimating heterogeneous treatment effects, testing for effects on multiple outcomes in experiments, and flexibly controlling for observed confounders.

      This is the most relevant tool for me. Before I learned about ML I often thought about using 'stepwise selection' for such tasks... to find the best set of 'control variables' etc. But without regularisation this seemed problematic.

    26. Machine Learning: An Applied Econometric Approach

      Shall we use Hypothesis to have a discussion?