Fall 2020 Seminars

Spatially-coupled hidden Markov models for short-term forecasting of wind speeds

A seminar by Vianey Leos Barajas, Assistant Professor at the University of Toronto, Dept. of Statistical Sciences and School of the Environment

November 18, 2020

Watch Recording


Hidden Markov models (HMMs) provide a flexible framework to model time series data where the observation process, Y_t, is taken to be driven by an underlying latent state process, Z_t. In this talk, we will focus on discrete-time, finite-state HMMs, as their basic structure can be extended in many interesting ways.

HMMs can accommodate multivariate processes by (i) assuming that a single state governs the M observations at time t, (ii) assuming that each observation process is governed by its own HMM, irrespective of what occurs elsewhere, or (iii) striking a balance between the two, as in the coupled HMM framework. Coupled HMMs assume that a collection of M observation processes is governed by its respective M state processes. However, the mth state process at time t, Z_{m,t}, depends not only on Z_{m,t-1} but also on the collection of other state processes, Z_{-m,t-1}. We introduce spatially-coupled hidden Markov models whereby the state processes interact according to an imposed neighborhood structure and the observations are collected across S spatial locations. We outline an application to short-term forecasting of wind speed using data collected across multiple wind turbines at a wind farm.
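
The dependence structures described in this abstract can be written out as follows. This is only a sketch of the standard conditional-independence statements: the emission density f, the history notation \mathbf{Z}_{1:t-1}, and the neighborhood set \mathcal{N}(s) are notation assumed here for illustration and are not taken from the talk.

```latex
% Basic HMM: first-order Markov state process, state-dependent observations
\Pr(Z_t \mid Z_{t-1}, \dots, Z_1) = \Pr(Z_t \mid Z_{t-1}),
\qquad Y_t \mid Z_t \sim f(\,\cdot \mid Z_t)

% Coupled HMM: process m depends on its own previous state and on the
% previous states of the other M - 1 processes
\Pr(Z_{m,t} \mid \mathbf{Z}_{1:t-1}) = \Pr(Z_{m,t} \mid Z_{m,t-1}, Z_{-m,t-1})

% Spatially-coupled HMM (assumed notation): only the neighbors of
% location s enter the transition, s = 1, \dots, S
\Pr(Z_{s,t} \mid \mathbf{Z}_{1:t-1})
  = \Pr\bigl(Z_{s,t} \mid Z_{s,t-1}, \{Z_{s',t-1} : s' \in \mathcal{N}(s)\}\bigr)
```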

Digital Trace Data: Modes of Data Collection, Applications, and Errors

A seminar by Frauke Kreuter, Professor of Statistics and Data Science at the Ludwig-Maximilians-University of Munich

October 28, 2020

Watch Recording of Frauke Kreuter's PRIISM seminar


Digital traces left by individuals when they act or interact online provide researchers with new opportunities for studying social and behavioral phenomena. This talk covers digital trace data and their use in the computational social sciences. Key to a successful use of digital trace data is a clear vision of the research goals. Knowing how to match available data and research needs is just as important as evaluating data quality and respecting respondents' privacy. We will discuss inferential challenges and possible ways to deal with them, finding the right measures to ensure reproducibility and replicability, and how to create sufficient transparency when working with digital trace data.

Bayesian Canonicalization of Voter Registration Files 

A seminar by Andee Kaplan, Assistant Professor, Colorado State University

October 14, 2020

Watch Recording of PRIISM Seminar with Andee Kaplan


Entity resolution (record linkage or de-duplication) is the process of merging noisy databases to remove duplicate entities in the absence of a unique identifier. One major challenge of utilizing linked data is identifying the canonical (or representative) records without duplicate information to pass to an inferential downstream task. The canonicalization step is particularly crucial after entity resolution, as a multi-stage approach allows for multiple analyses to be performed on the same linked data. While this approach can be scalable, the uncertainty from each stage of the entity resolution process is not naturally propagated throughout the pipeline and into the downstream task. In this talk, Dr. Kaplan presented five fully unsupervised methods to choose canonical records from linked data, including a fully Bayesian approach which propagates the error from linkage through to the downstream inference. This multi-stage approach is illustrated and evaluated on simulated entity resolution data sets as well as voter registration data available from the North Carolina State Board of Elections (NCSBE). The NCSBE has released a snapshot of their voter registration databases regularly since 2005, providing a changing view of the voter registration information over time as new voters register, voters are dropped from the register, and voter information is updated. Dr. Kaplan and her team compared the proposed canonicalization methods after performing entity resolution on five snapshots and examined the relationship between demographic information and party affiliation on the resulting canonical data sets.

A unified framework for the latent variable approach to statistical social network analysis

A seminar by Samrachana Adhikari, Assistant Professor, NYU School of Medicine

September 30, 2020

Watch Recording of PRIISM Seminar with Samrachana Adhikari


While social network data provide new opportunities to understand complex relational mechanisms, they also present modeling challenges. Units of observation in social networks are often not independent and identically distributed, as commonly assumed in many statistical models, and hence require new tools to analyze the data, to make inference, and to address issues of model selection and goodness of fit, while accounting for the complex dependence structures. Many recent developments have been made in statistical methodologies to account for such complications. In particular, latent variable network models that accommodate edge correlations implicitly, by assuming an underlying latent factor, are increasing in popularity. Although these models are examples of what is a growing body of research, much of the research is focused on proposing new models or extending others. There has been very little work on unifying the models in a single framework.

In this talk, Dr. Adhikari first reviewed different latent variable network models for analyzing social network data. She then introduced a complete framework that organizes existing latent variable network models within an integrative generalized additive model, called the Conditionally Independent Dyad (CID) models. The class of CID models includes existing network models that assume dyad (or edge) independence conditional on latent variables and other components in the model. By presenting an analysis of an advice-seeking network of teachers as an example, she illustrated the utility of the proposed framework. Dr. Adhikari ended with a discussion of existing and future extensions of the proposed class of network models to incorporate multiple related networks.

Understanding reasons for differences in intervention effects across sites

A seminar by Kara Rudolph, Assistant Professor, Columbia University

September 16, 2020

Watch Recording of Kara Rudolph's seminar


Multi-site interventions are common in public health, public policy, and economics. Do we expect an intervention effect in one site to be the same as the intervention effect in another site? In many cases, we would answer “no”. First, there could be differences in site-level variables related to intervention design/implementation or contextual variables, like the economy, that would modify intervention effectiveness. Such variables suggest that the intervention either is not the same or does not work the same in the two sites. Second, there could be differences in person-level variables—population composition—across sites that also modify intervention effectiveness. There could also be differences in the mechanisms producing intermediate variables on the pathway from the intervention to the outcome. The latter two reasons could cause intervention effects to differ across sites even if the interventions are structured and implemented in an identical fashion. An example of this, which we use to motivate this work, is from the Moving to Opportunity (MTO) trial. MTO is a five-site, encouragement-design intervention in which families in public housing were randomized to receive housing vouchers and logistical support to move to low-poverty neighborhoods. When we started this work, there had been no quantitative examination of the underlying reasons for differences in MTO’s effects across sites. We propose doubly robust and efficient estimators to predict total, direct, and indirect effects of treatment in a new site based on data in source sites. The extent to which these predicted estimates correspond with the observed estimates can shed light on reasons for site differences in intervention effects.

Event Archive

Inferential LASSO in Single Case Experimental Design to Estimate Effect Size

A seminar by Jay Verkuilen, Associate Professor, CUNY

February 26, 2020


Single case experimental design (SCED) is widely used in areas such as behavior modification, rehabilitation medicine, or training of special participants. The general goal in most analyses of SCED data is to assess the effect size of the intervention, often for subsequent processing in, for example, a meta-analysis. The design itself is a strong one for this purpose, particularly randomized multiple baseline designs (MBD). However, SCEDs tend to be fairly small-N designs due to the nature of the target population and intervention. This means that regression analysis of SCED data is already running fairly lean in terms of error df even before one factors in covariates needed to model trend, external events, or other nuisance data features. These additional variables are not typically of substantive interest but need to be accounted for properly. Linear and generalized linear models do not cope with this situation well and fail completely when the number of predictors, P, exceeds N. Many machine learning methods have been devised to address this problem, which occurs in other areas such as genetic microarray studies, text mining, and neuroscience. These methods all incorporate some kind of regularization to provide enough structure to allow estimation to proceed. They are also useful even when N > P to avoid overfitting or to help determine if a visual object on a graph is "real" or not. However, they are fundamentally predictive in nature and do not work well when the goal involves interpreting the resulting regression components. Recent research has focused on adapting methods such as tree-based analyses or regularized regression to causal effect estimation. While the talk provides a broad conceptual overview of the problem of high dimensionality, I focus on the inferential LASSO (e.g., Belloni, Chernozhukov, & Hansen, 2014) as a promising new method that is a straightforward extension of multiple regression. (Joint work with Mariola Moeyaert, SUNY Albany).

Measuring Poverty

A seminar by Chaitra Nagaraja, Associate Professor, Fordham University

February 12, 2020


The development of any measure, particularly an official one, is a mixture of ideology, convenience, and chance. To illustrate this, I will use poverty measurement as an example. The primary choice in measuring poverty is between taking an absolute versus a relative approach. The former is favored by the U.S. whereas the latter is preferred by the European Union. In this talk, I will compare various measures (with a focus on the U.S.) using a historical perspective to understand the effects of those statistics on a country’s residents.

Born in the Wrong Months? The role of Kindergarten entrance age cut-off in students’ academic progress in NYC public schools

Date, Time, Location:
11/20/19 (WEDS)
11:00 am - 12:00 pm,
Kimball 3rd flr conf rm 

Talk Category:
Data Science for Social Impact

Ying Lu (NYU)

Abstract: The age cut-off for public Kindergarten entrance in New York City is December 31, while the common practice in the country is to have an age cut-off in September or earlier. This means that on average, about a quarter of NYC public school Kindergarteners start formal schooling younger than five in a public school setting. Extensive research has suggested that children exhibit different social and cognitive development growth trajectories in early childhood. In particular, students who start Kindergarten at an older age (earlier birth month) are better prepared socially and cognitively for formal schooling. On the other hand, other research also argues that the relative advantage of age disappears as students get older. In this paper, we use proprietary data from NYC DOE to show how birth month plays an important role in determining the path of children’s academic progress. Following a birth cohort of students (born in 2005) from Kindergarten through 7th grade, and using discrete event history analysis, we show that students who are born in later birth months, especially those who were born after September 1 (entering Kindergarten before turning 5), show a higher risk of repeating grades (whether voluntarily or involuntarily) and of being classified into the special education category throughout elementary school. The academic progression gap widens further when considering other factors such as students’ race, gender, and socio-economic backgrounds. We further use longitudinal growth curve models to explore the patterns of students’ grade-level achievement over time (3rd to 7th grade Common Core tests), considering their ages at Kindergarten entrance as well as their experiences of academic progression (ever having been held back), and the interplay of these factors with students’ demographic and socioeconomic characteristics. A regression discontinuity design was also employed to explore the impact of holding very young students back at earlier grades on their academic achievement trajectories.

Sensitivity analyses for unobserved effect moderation when generalizing from trial to population

Date, Time, Location:
11/6/19 (WEDS)
11:00 am - 12:00 pm,
Kimball 3rd flr conf rm

Talk Category:
Statistical Methodology

Elizabeth Stuart (JHU)

Abstract: In the presence of treatment effect heterogeneity, the average treatment effect (ATE) in a randomized controlled trial (RCT) may differ from the average effect of the same treatment if applied to a target population of interest. But for policy purposes we may desire an estimate of the target population ATE. If all treatment effect moderators are observed in the RCT and in a dataset representing the target population, then we can obtain an estimate for the target population ATE by adjusting for the difference in the distribution of the moderators between the two samples. However, that is often an unrealistic assumption in practice. This talk will discuss methods for generalizing treatment effects under that assumption, as well as sensitivity analyses for two situations: (1) where we cannot adjust for a specific moderator observed in the RCT because we do not observe it in the target population; and (2) where we are concerned that the treatment effect may be moderated by factors not observed even in the RCT. Outcome-model and weighting-based sensitivity analysis methods are presented. The methods are applied to examples in drug abuse treatment. Implications for study design and analyses are also discussed, when interest is in a target population ATE.
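
The adjustment described in this abstract, for the case where all moderators are observed in both samples, can be illustrated with a small worked example. This is a minimal sketch of one simple version of the idea (reweighting stratum-specific trial effects to the target population's moderator distribution), not the speaker's method. The moderator levels, effects, and shares are hypothetical numbers invented for illustration.

```python
# Hypothetical numbers for illustration only; none come from the talk.
# Stratum-specific treatment effects estimated in the trial, by moderator level.
trial_effects = {"low_severity": 2.0, "high_severity": 6.0}

# Share of each moderator level in the trial sample and in the target population.
trial_shares = {"low_severity": 0.75, "high_severity": 0.25}
population_shares = {"low_severity": 0.25, "high_severity": 0.75}


def weighted_ate(effects, shares):
    """Average the stratum effects using the given moderator distribution."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(effects[level] * shares[level] for level in effects)


# The same stratum effects give different averages under the two distributions.
trial_ate = weighted_ate(trial_effects, trial_shares)
population_ate = weighted_ate(trial_effects, population_shares)

print(f"Trial ATE: {trial_ate:.2f}")
print(f"Target population ATE: {population_ate:.2f}")
```

The sensitivity analyses in the talk concern the harder case where a moderator like this one is missing from the population data, or from both data sets, so that this direct reweighting is not available.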

Permutation Weighting: A classification-based approach to balancing weights

Date, Time, Location:
10/23/19 (WEDS)
11:00 am - 12:00 pm,
Kimball 3rd flr conf rm

Talk Category:
Statistical Methodology

Drew Dimmery (Facebook)

Abstract: This work provides a new lens through which to view balancing weights for observational causal inference as approximating a notional target trial. We formalize this intuition and show that our approach -- Permutation Weighting -- provides a new way to estimate many existing balancing weights. This allows the estimation of weights through a standard binary classifier (no matter the cardinality of treatment). Arbitrary probabilistic classifiers may be used in this method; the hypothesis space of the classifier corresponds to the nature of the balance constraints imposed through the resulting weights. We provide theoretical results which bound bias and variance in terms of the regret of the classifier, show that these disappear asymptotically and demonstrate that our classification problem directly minimizes imbalance. Since a wide variety of existing methods may be estimated through this regime, the approach allows for direct model comparison between balancing weights (both existing methods and new ones) based on classifier loss as well as hyper-parameter tuning using cross-validation. We compare estimating weights with permutation weighting to minimizing the classifier risk of a propensity score model for inverse propensity score weighting and show that the latter does not necessarily imply minimal imbalance on covariates. Finally, we demonstrate how the classification-based view provides a flexible mechanism to define new balancing weights; we demonstrate this with balancing weights based on gradient-boosted decision trees and neural networks. Simulation and empirical evaluations indicate that permutation weighting outperforms existing weighting methods for causal effect estimation.
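
The classification idea in this abstract can be sketched roughly as follows: stack the observed data with a copy in which treatment has been permuted, train a binary classifier to tell the two apart, and use the classifier's odds on the observed rows as weights. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The simulated data, the hand-rolled logistic regression, the feature set, and all function names are assumptions made here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(features, labels, steps=2000, learning_rate=0.1):
    """Plain gradient-descent logistic regression without regularization."""
    coef = np.zeros(features.shape[1])
    for _ in range(steps):
        gradient = features.T @ (sigmoid(features @ coef) - labels) / len(labels)
        coef -= learning_rate * gradient
    return coef


def design(treatment, covariate):
    """Intercept, main effects, and the treatment-by-covariate interaction."""
    return np.column_stack(
        [np.ones_like(covariate), treatment, covariate, treatment * covariate]
    )


def permutation_weights(treatment, covariate, rng):
    """Odds that an observed row looks like it came from the permuted data."""
    permuted = rng.permutation(treatment)  # breaks the treatment-covariate link
    features = np.vstack([design(treatment, covariate), design(permuted, covariate)])
    labels = np.concatenate([np.zeros(len(treatment)), np.ones(len(treatment))])
    coef = fit_logistic(features, labels)
    prob_permuted = sigmoid(design(treatment, covariate) @ coef)
    return prob_permuted / (1.0 - prob_permuted)


# Simulated observational data (assumed): treatment is more likely at high x.
rng = np.random.default_rng(0)
n = 2000
covariate = rng.normal(size=n)
treatment = rng.binomial(1, sigmoid(covariate)).astype(float)
weights = permutation_weights(treatment, covariate, rng)


def mean_gap(w):
    """Weighted difference in covariate means, treated minus control."""
    treated = treatment == 1
    return (np.average(covariate[treated], weights=w[treated])
            - np.average(covariate[~treated], weights=w[~treated]))


raw_gap = mean_gap(np.ones(n))
weighted_gap = mean_gap(weights)
print(f"Covariate gap, unweighted: {raw_gap:.3f}")
print(f"Covariate gap, permutation-weighted: {weighted_gap:.3f}")
```

Any probabilistic classifier could stand in for the logistic regression here; the abstract notes that the classifier's hypothesis space determines which balance constraints the weights impose.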


Secrecy, Criminal Justice, and Variable Importance

Date, Time, Location:
9/25/19 (WEDS)
11:00 am - 12:00 pm,
239 Greene (East Bldng) rm 320 

Talk Category:
Data Science for Social Impact

Cynthia Rudin (Duke)

Abstract: The US justice system often uses a combination of (biased) human decision makers and complicated black box proprietary algorithms for high stakes decisions that deeply affect individuals. All of this is still happening, despite the fact that for several years, we have known that interpretable machine learning models were just as accurate as any complicated machine learning methods for predicting criminal recidivism. It is much easier to debate the fairness of an interpretable model than a proprietary model. The most popular proprietary model, COMPAS, was accused by the ProPublica group of being racially biased in 2016, but their analysis was flawed and the true story is much more complicated; their analysis relies on a flawed definition of variable importance that was used to identify the race variable as being important. In this talk, I will start by introducing a very general form of variable importance, called model class reliance. Model class reliance measures how important a variable is to any sufficiently accurate predictive model within a class. I will use this and other data-centered tools to provide our own investigation of whether COMPAS depends on race, and what else it depends on. Through this analysis, we find another problem with using complicated proprietary models, which is that they seem to be often miscomputed. An easy fix to all of this is to use interpretable (transparent) models instead of complicated or proprietary models in criminal justice.

Health Benefits of Reducing Air Traffic Pollution: Evidence from Changes in Flight Paths

Date, Time, Location:
9/18/19 (WEDS)
11:00 am - 12:00 pm,
Kimball 3rd flr conf rm

Talk Category:
Data Science for Social Impact

Augustin de Coulon (IZA)

Abstract: This paper investigates externalities generated by air transportation pollution on health. As a source of exogenous variation, we use an unannounced five-month trial that reallocated early morning aircraft landings at London Heathrow airport. Our measure of health is prescribed medications for conditions known to be aggravated by pollution, especially sleep disturbances. Compared to the control regions, we observe a significant and substantial decrease in prescribed drugs for respiratory and central nervous system disorders in the areas subjected to reduced air traffic between 4:30am and 6:00am. Our findings therefore suggest a causal influence of air traffic on health conditions.

Data Tripper: Authorship Attribution Analysis of Lennon-McCartney Songs

Date, Time, Location:
9/6/19 (FRI)
3:00 pm - 4:15 pm,
Kimball 1st floor lounge

Talk Category:
General Interest

Mark Glickman (Harvard)

Abstract: The songwriting duo of John Lennon and Paul McCartney, the two founding members of the Beatles, have composed some of the most popular and memorable songs of the last century. Despite having authored songs under the joint credit agreement of Lennon-McCartney, it is well-documented that most of their songs or portions of songs were primarily written by exactly one of the two. Some Lennon-McCartney songs are actually of disputed authorship. For Lennon-McCartney songs of known and unknown authorship written and recorded over the period 1962-66, we extracted musical features from each song or song portion. These features consist of the occurrence of melodic notes, chords, melodic note pairs, chord change pairs, and four-note melody contours. We developed a prediction model based on variable screening followed by logistic regression with elastic net regularization. We applied our model to the prediction of songs and song portions with unknown or disputed authorship.

Speaker: Dr. Glickman is a Senior Lecturer in Statistics at Harvard University, a Fellow of the American Statistical Association (ASA), and a Senior Statistician at the Center for Healthcare Organization and Implementation Research. His research interests are primarily in the areas of statistical modeling for rating competitors in games and sports, and in statistical methods applied to problems in health services research. Besides publishing extensively in these areas, Dr. Glickman invented the Glicko and Glicko-2 rating systems (and is a U.S. national master in Chess), and has served as Chair and Program Chair of the ASA's Section on Statistics in Sports.

This talk is co-sponsored by the NYU Stern Department of Technology, Operations, and Statistics. A mixer will follow from 4:15 - 5:30.

Urban Modeling's Future - a Big Data Reality

Date, Time, Location: 5/8/2019 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Research Generation

Speaker: Debra Laefer (NYU)

Abstract: Until recently, the history of urban modeling has relied on relatively simplified models. This has been a function of data collection limitations and computing barriers. Consequently, two streams of modeling have emerged. At a local level, highly detailed Building Information Modeling has dominated. At a broader scale, CityGML has been the major player. The absence of key pieces of data and major inconsistencies in the respective schema of the systems prevent their interoperability. While efforts continue to align the systems, recent tandem advancements in remote sensing technology and distributed computing now offer a complete circumvention of those problems and have lifted the previous restrictions in data acquisition and processing. This lecture will show the emerging state-of-technology in remote sensing and BigData computing and present some of the clear value of such a workflow, as well as the remaining challenges from both the remote sensing and the computing side.

Bio: With degrees from the University of Illinois Urbana-Champaign (MS, PhD), NYU (MEng), and Columbia University (BS, BA), Prof. Debra Laefer has a wide-ranging background spanning from geotechnical and structural engineering to art history and historic preservation. Not surprisingly, Prof. Laefer’s work often stands at the crossroads of technology creation and community values such as devising technical solutions for protecting architecturally significant buildings from sub-surface construction. Her work has been featured in National Geographic, Forbes, and TechCrunch.

How Education Systems Undermine Gender Equity

Date, Time, Location: 5/1/2019 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Data and Social Impact

Speaker: Joseph Cimpian (NYU)

Abstract: From the time students enter kindergarten, teachers overestimate the abilities of boys in math, relative to behaviorally and academically matched girls, contributing to a gender gap favoring boys in both math achievement and confidence. Using data from numerous nationally representative studies spanning kindergarten through university level, as well as experimental evidence, I demonstrate how girls and young women face discrimination and bias throughout their academic careers and suggest that a substantial portion of the growth in the male–female math achievement gap is socially constructed. Each of the studies leads to a broader set of considerations about why females are viewed as less intellectually capable than their male peers. The studies also demonstrate that biases can be exhibited and perpetuated by members of negatively stereotyped groups (e.g., female teachers demonstrate greater bias against girls than do male teachers), and raise questions about the root causes of their biases and the long-term effects of being negatively stereotyped oneself. This research also suggests that comparing boys and girls on metrics such as standardized tests and grades may contribute to a false belief that education systems promote the success of females. Together, the studies suggest several implications for research, teacher professional development, and policy.

Modelling intergenerational exchanges using models for multivariate longitudinal data with latent variables in the presence of zero excess

Date, Time, Location: 4/17/2019 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Statistical Methodology

Speaker: Irini Moustaki (LSE)

Abstract: In this talk we will discuss some primary results from the modelling of dyadic data that provide information on intergenerational exchanges in the UK. We will use longitudinal data from three waves of the UK Household Longitudinal Survey to study and explain associations between exchanges of support from the respondent to their parents and to their children. The data resemble the structure of dyadic data: they are collected across time, and they are also multivariate because constructs of interest are measured by multiple indicators. Support is measured by a set of binary indicators of different kinds of help.

We propose two different joint models of bidirectional exchanges with support given and support received treated as a multivariate response, and covariances between responses measuring the extent of reciprocation between generations. Moreover, joint modelling of longitudinal data allows for the possibility that reciprocation may occur contemporaneously or may be postponed until the donor is in need of help or the recipient is in a position to reciprocate.

Difference-in-Differences Estimates of Demographic Processes

Date, Time, Location: 4/10/2019 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Statistical Methodology

Speaker: Lawrence Wu (NYU)

Abstract: We examine difference-in-differences procedures for estimating the causal effect of treatment when the outcome is a single-decrement demographic process. We use the classic case of two groups and two periods to contrast a standard and widely-used linear probability difference-in-differences estimator with an analogous proportional hazard difference-in-differences estimator. Formal derivations and illustrative examples show that the linear probability estimator is inconsistent, yielding estimates that, for example, evolve with time since treatment. We conclude that knowledge of how the data are generated is a necessary component for causal inference.
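
The contrast drawn in this abstract can be illustrated numerically. This is a minimal sketch under assumptions made here, not the speaker's derivation: event times follow constant hazards with a proportional (log-additive) group effect, period effect, and treatment effect, and all parameter values are hypothetical. Under that assumed data-generating process, the difference-in-differences contrast on the log-hazard scale recovers the treatment effect, while the contrast in event probabilities depends on how long units are followed.

```python
import math

# Hypothetical parameters on the log-hazard scale; none come from the talk.
BASE_HAZARD = 0.10        # control group, first period
GROUP_EFFECT = 0.5        # shift for the treated group
PERIOD_EFFECT = 0.3       # shift for the second period
TREATMENT_EFFECT = -0.4   # effect of treatment (treated group, second period)


def hazard(treated_group, post_period):
    """Constant hazard for one of the four group-by-period cells."""
    log_hazard = (math.log(BASE_HAZARD)
                  + GROUP_EFFECT * treated_group
                  + PERIOD_EFFECT * post_period
                  + TREATMENT_EFFECT * treated_group * post_period)
    return math.exp(log_hazard)


def event_probability(treated_group, post_period, duration):
    """Probability that the event has occurred by `duration`."""
    return 1.0 - math.exp(-hazard(treated_group, post_period) * duration)


def linear_probability_did(duration):
    """Difference-in-differences contrast in event probabilities."""
    def p(group, period):
        return event_probability(group, period, duration)
    return (p(1, 1) - p(1, 0)) - (p(0, 1) - p(0, 0))


def log_hazard_did():
    """Difference-in-differences contrast in log hazards."""
    def h(group, period):
        return math.log(hazard(group, period))
    return (h(1, 1) - h(1, 0)) - (h(0, 1) - h(0, 0))


for duration in (1, 5, 10, 20):
    print(f"duration={duration:>2}  "
          f"probability DiD={linear_probability_did(duration):+.4f}  "
          f"log-hazard DiD={log_hazard_did():+.4f}")
```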

Statistics of Police Shootings and Racial Profiling

Date, Time, Location: 4/3/2019 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Data and Social Impact

Speaker: Gregory Ridgeway (UPenn)

Abstract: The police are chronically a topic of heated debate. However, most statistical analyses brought to bear on questions of police fairness rarely provide clarity on or solutions to the problems. This talk will cover statistical methods for estimating racial bias in traffic stops, identifying problematic cops, and determining which officers are most at risk for police shootings. All of these methods have been part of investigations of police departments in Oakland, Cincinnati, and New York and show that statistics has an important role in prominent crime and justice policy questions.

Statistical Intuitions and the Reproducibility Crisis in Science

Date, Time, Location: 2/27/2019 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Data and Social Impact

Speaker: Eric Loken (UConn)

Abstract: Science is responding well to the so-called reproducibility crisis with positive improvements in methodology and transparency. Another area for improvement is awareness of statistical issues impacting inference. We explore how some problematic intuitions about measurement, statistical power, multiple analyses, and levels of analysis can affect the interpretation of research results, perhaps leading to mistaken claims.

Speaker: Eric Loken is in the Neag School of Education at The University of Connecticut. He studies advanced statistical models including hierarchical models, measurement models, factor and mixture models, and their applications in health and education research. He works extensively in educational measurement with applications to large scale testing. Recent work has addressed issues surrounding statistical inference, and the relationship to failures to replicate research results.

Quantitative Measures to Assess Community Engagement in Research

Date, Time, Location: 2/6/2019 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Data and Social Impact

Speaker: Melody Goodman (NYU)

Abstract: The utility of community-engaged health research has been well established. However, measurement and evaluation of community engagement in research activities (patient/stakeholder perceptions of the benefit of collaborations that indicate how engaged the patient/stakeholder feels) has been limited. The level of community engagement across studies can vary greatly from minimal engagement to fully collaborative partnerships. Methods for measuring the level of community engagement in research are still emerging in the field due to the methodological gap in the assessment of stakeholder engagement, likely due to the lack of existing measures. There is a need to rigorously evaluate the impact of community/stakeholder engagement on the development, implementation and outcomes of research studies, which requires the development, validation, and implementation of tools that can be used to assess stakeholder engagement.

We use community-engaged research approaches and a mixed-methods (qualitative/quantitative) study design to validate a measure to assess the level of community engagement in research studies from the stakeholder perspective. As part of the measurement validation process, we are conducting a series of web-based surveys of community members/community health stakeholders who have participated in previous community-engaged research studies. The surveys examine construct validity and internal consistency of the measure. We examined content validity through a five-round modified Delphi process to reach consensus among experts, and we assess construct validity through participant surveys.

Research that develops standardized, reliable, and accurate measures to assess community engagement is essential to understanding the impact of community engagement on the scientific process and scientific discovery. Implementation of gold standard quantitative measures to assess community engagement in research would make a major contribution to community-engaged science. These measures are necessary to assess associations between community engagement and research outcomes.

Decision-driven sensitivity analyses via Bayesian optimization

Date, Time, Location: 12/5/2018 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Statistical Methodology

Speaker: Russell Steele (McGill)

Abstract: Every statistical analysis requires at least some subjective or untestable assumptions. For example, in Bayesian modelling, the analysis requires specification of hyperparameters for prior distributions which are either intended to reflect subjective beliefs about the model or to reflect relative ignorance about the model under a certain notion of ignorance. Similarly, causal models require assumptions about parameters related to unmeasured confounding. Violations of these untestable or subjective assumptions can invalidate the conclusions of analyses or lead to conclusions that only hold for a narrow range of choices for those assumptions. Currently, researchers compute several estimates based on either multiple “reasonable” values or a wide range of “possible” values for these inestimable parameters. Even when the dimension of the inestimable parameter space is relatively small, the sensitivity analyses generally are not systematically conducted and may either waste valuable computational time on choices that lead to roughly the same inference or will miss examining values of those parameters that would change the conclusions of the analysis.

In this talk, I will propose the use of Bayesian optimization approaches for decision-driven sensitivity analyses. We assume that a decision will be made as a function of the model estimates or predictions from a particular model that relies on inestimable parameters. We use a Bayesian optimization approach to identify partitions of the space of inestimable parameter values where the decision based on the observed data and assumed parameter values changes, rather than relying on non-systematically chosen values for the sensitivity analysis. We will illustrate our proposed approach on a hierarchical Bayesian meta-analysis example from the literature.

The work that will be presented was done in collaboration with Louis Arsenault-Mahjoubi, an undergraduate mathematics and statistics student at McGill University.

Omitted and included variable bias in tests for disparate impact

Date, Time, Location: 11/14/2018 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Data and Social Impact

 Speaker: Ravi Shroff (NYU)

Abstract: Policymakers often seek to gauge discrimination against groups defined by race, gender, and other protected attributes. A common strategy is to estimate disparities after controlling for observed covariates in a regression model. However, not all relevant factors may be available to researchers, leading to omitted variable bias. Conversely, controlling for all available factors may also skew results, leading to so-called "included variable bias". We introduce a simple strategy, which we call risk-adjusted regression, that addresses both concerns in settings where decision makers have clear and measurable policy objectives. First, we use all available covariates to estimate the expected utility of possible decisions. Second, we measure disparities after controlling for these utility estimates alone, omitting other factors. Finally, we examine the sensitivity of results to unmeasured confounding. We demonstrate this method on a detailed dataset of 2.2 million police stops of pedestrians in New York City.
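
The two steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speaker's implementation: it substitutes a linear probability model for whatever risk model the authors use, uses a single hypothetical covariate, assumes the outcome is observed for every unit, and runs on invented toy data. The function names are made up for the example.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope of y on a single regressor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def risk_adjusted_disparity(group, covariate, outcome, decision):
    """Coefficient on `group` when `decision` is regressed on group and estimated risk."""
    # Step 1: estimate risk from the covariate (assumed linear probability model).
    a, b = fit_line(covariate, outcome)
    risk = [a + b * x for x in covariate]
    # Step 2: regress the decision on group membership and risk alone.
    # By the Frisch-Waugh-Lovell theorem, the group coefficient equals the slope
    # between the two variables after each is residualized on risk.
    g_res = residuals(risk, group)
    d_res = residuals(risk, decision)
    return sum(g * d for g, d in zip(g_res, d_res)) / sum(g * g for g in g_res)

# Invented toy data: 0/1 group indicator, one covariate, binary outcome and decision.
group = [0, 0, 0, 0, 1, 1, 1, 1]
covariate = [0, 1, 2, 3, 0, 1, 2, 3]
outcome = [0, 0, 1, 1, 0, 0, 1, 1]
decision = [0, 0, 1, 1, 0, 1, 1, 1]
print(risk_adjusted_disparity(group, covariate, outcome, decision))
```

A positive value means that members of group 1 receive the decision more often than members of group 0 at the same estimated risk. The sensitivity analysis for unmeasured confounding mentioned in the abstract is not covered by this sketch.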

Structural Equation Modeling in Stata

Date, Time, Location: 10/31/2018 (Weds)
10:30 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Didactic

 Speaker: Chuck Huber (Stata Corp)

Abstract: This talk introduces the concepts and jargon of structural equation modeling (SEM) including path diagrams, latent variables, endogenous and exogenous variables, and goodness of fit. I demonstrate how to fit many familiar models such as linear regression, multivariate regression, logistic regression, confirmatory factor analysis, and multilevel models using -sem-. I wrap up by demonstrating how to fit structural equation models that contain both structural and measurement components. *Co-sponsored with CUNY Grad Center EPSY

Adaptive Designs in Clinical Trials: An Introduction and Example

Date, Time, Location: 10/24/2018 (Weds)
11:00 am - 12:00 pm
Kimball 3rd Fl Conf Rm 

Talk Category: Statistical Methodology/Didactic

 Speaker: Leslie McClure (Drexel)

Abstract: Planning for randomized clinical trials relies on assumptions that are often incorrect, leading to inefficient designs that could spend resources unnecessarily. Recently, trialists have been advocating for implementation of adaptive designs, which allow researchers to modify some aspect of their trial part-way through the study based on accumulating data. In this talk, I will introduce the concept of adaptive designs and describe several different adaptations that can be made in clinical trials. I will then describe a real-life example of a sample size re-estimation from the Secondary Prevention of Small Subcortical Strokes (SPS3) study, describe the statistical impact of implementing this design change, and describe the effect of the adaptation on the practical aspects of the study.

Disrupting Education? Experimental Evidence on Technology-Aided Instruction in India

Date, Time, Location: 5/2/2018, (Weds.) 11:00 am - 12:00 pm 
3rd Fl. Conf. Rm, Kimball

Talk Category: Data for Social Impact

 Speaker: Alejandro Ganimian

Abstract: We present experimental evidence on the impact of a personalized technology-aided after-school instruction program on learning outcomes. Our setting is middle-school grades in urban India, where a lottery provided winning students with a voucher to cover program costs. We find that lottery winners scored 0.36σ higher in math and 0.22σ higher in Hindi relative to lottery losers after just 4.5 months of access to the program. IV estimates suggest that attending the program for 90 days would increase math and Hindi test scores by 0.59σ and 0.36σ respectively. We find similar absolute test score gains for all students, but the relative gain was much greater for academically-weaker students because their rate of learning in the control group was close to zero. We show that the program was able to effectively cater to the very wide variation in student learning levels within a single grade by precisely targeting instruction to the level of student preparation. The program was cost effective, both in terms of productivity per dollar and unit of time. Our results suggest that well-designed technology-aided instruction programs can sharply improve productivity in delivering education.

BART for Causal Inference

Date, Time, Location: 4/25/2018, (Weds.) 11:00 am - 12:00 pm 
3rd Fl. Conf. Rm, Kimball

Talk Category: Didactic

 Speaker: Jennifer Hill

Abstract: There has been increasing interest in the past decade in the use of machine learning tools in causal inference to help reduce reliance on parametric assumptions and allow for more accurate estimation of heterogeneous effects. This talk reviews the work in this area that capitalizes on Bayesian Additive Regression Trees, an algorithm that embeds a tree-based machine learning technique within a Bayesian framework to allow for flexible estimation and valid assessments of uncertainty. It will further describe extensions of the original work to address common issues in causal inference: lack of common support, violations of the ignorability assumption, and generalizability of results to broader populations. It will also describe existing R packages for traditional BART implementation as well as debut a new R package for causal inference using BART, bartCause.

Simulating a Marginal Structural Model

Date, Time, Location: 2/28/2018, (Weds.) 11:00 am - 12:00 pm 
3rd Fl. Conf. Rm, Kimball

Talk Category: Didactic

 Speaker: Keith Goldfeld

Abstract: In so many ways, simulation is an extremely useful tool to learn, teach, and understand the theory and practice of statistics. A series of examples (interspersed with minimal theory) will hopefully illuminate the underbelly of confounding, colliding, and marginal structural models. Drawing on the potential outcomes framework, the examples will use the R simstudy package, a tool that is designed to make data simulation as painless as possible.

Graphs as Poetry

Date, Time, Location: 2/7/2018, (Weds.) 11:00 am - 12:00 pm 
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

 Speaker: Howard Wainer

Abstract: Visual displays of empirical information are too often thought to be just compact summaries that, at their best, can clarify a muddled situation. This is partially true, as far as it goes, but it omits the magic. We have long known that data visualization is an alchemist that can make good scientists great and transform great scientists into giants. In this talk we will see that sometimes, albeit too rarely, the combination of critical questions addressed by important data and illuminated by evocative displays can achieve a transcendent, and often wholly unexpected, result. At their best, visualizations can communicate emotions and feelings in addition to cold, hard facts.

Unraveling and Anticipating Heterogeneity: Single Subject Designs & Individualized Treatment Protocols

Date, Time, Location: 11/3/17 (Fri.) ALL DAY (9-5, tent.), 3rd Fl. Conf. Rm, Kimball

Talk Category: Mixed

Speaker: Leading experts in SSD, Causal & Bayesian Inference

Abstract: This will be a 1-day symposium on the topic of Single Subject Design (SSD) and methods for their analysis.  It will bring together leading researchers in the areas of multilevel models, Bayesian modeling, and meta-analysis to discuss best practices with leading practitioners who utilize SSDs as well as how to use results from single case designs to better inform larger scale clinical trials in this field.   These practitioners will be drawn from the fields of special education and rehabilitation science.  In particular, the areas of Physical Therapy, Occupational Therapy and Communication Science Disorders will be invited.

Panel discussions will be convened in which methodologists are paired with practitioners to discuss each phase of the science, from exploratory data analysis (related to designs employing graphical methods), more general design aspects, and analysis.  Particular emphasis will be given to research supporting Individualized Treatment Protocols.  In addition, there will be individual presentations representing new methodology for these designs, and reports from practitioners on their ongoing clinical trials to spur additional discussion of appropriate methodology.

Introduction to Bayesian Analysis Using Stata

Date, Time, Location: 10/18/17 (Weds.), 10:30am-12:00pm, 3rd Fl. Conf. Rm, Kimball

Talk Category: Didactic

Speaker: Chuck Huber (Stata Corp.)

Abstract: Bayesian analysis has become a popular tool for many statistical applications. Yet many data analysts have little training in the theory of Bayesian analysis and software used to fit Bayesian models. This talk will provide an intuitive introduction to the concepts of Bayesian analysis and demonstrate how to fit Bayesian models using Stata. No prior knowledge of Bayesian analysis is necessary and specific topics will include the relationship between likelihood functions, prior, and posterior distributions, Markov Chain Monte Carlo (MCMC) using the Metropolis-Hastings algorithm, and how to use Stata's Bayes prefix to fit Bayesian models.
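
The Metropolis-Hastings algorithm named in the abstract can be illustrated outside Stata with a short random-walk sampler. The following is a minimal sketch in Python, not Stata's implementation. It assumes a normal likelihood with known unit variance, a normal prior with mean 0 and standard deviation 10 on the unknown mean, and invented toy data; the function names and tuning values are chosen for the example.

```python
import math
import random

def log_posterior(mu, data, prior_sd=10.0, sigma=1.0):
    """Unnormalized log posterior: log prior plus log likelihood."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = -0.5 * sum(((y - mu) / sigma) ** 2 for y in data)
    return log_prior + log_lik

def metropolis_hastings(data, n_draws=20000, step=1.0, seed=1):
    """Random-walk Metropolis-Hastings draws for the mean mu."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    draws = []
    for _ in range(n_draws):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # The proposal is symmetric, so the acceptance ratio reduces to the
        # ratio of posterior densities. 1 - random() lies in (0, 1], so the
        # logarithm is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            mu, current = proposal, candidate
        draws.append(mu)
    return draws

data = [1.2, 0.8, 1.0, 1.4, 0.6]          # invented observations
draws = metropolis_hastings(data)
kept = draws[2000:]                        # discard burn-in draws
print(sum(kept) / len(kept))               # Monte Carlo estimate of the posterior mean
```

For this conjugate setup the posterior mean is available in closed form (the sum of the data divided by n + 1/100), which gives a simple check on the sampler's output.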

Embedding the Analysis of Observational Data for Causal Effects within a Hypothetical Randomized Experiment

Date, Time, Location: 9/14/2017 (Thurs.), 12:30-2:00pm, 295 Lafayette Street, 2nd Floor, The Rudin Forum* (NYU Wagner)

Talk Category: Statistical Methodology

Speaker: Don Rubin (Harvard)

Abstract: Consider a statistical analysis that draws causal inferences using an observational data set, inferences that are presented as being valid in the standard frequentist senses; that is an analysis that produces (a) point estimates, which are presented as being approximately unbiased for their estimands, (b) p-values, which are presented as being valid in the sense of rejecting true null hypotheses at the nominal level or less often, and/or (c) confidence intervals, which are presented as having at least their nominal coverage for their estimands. For the hypothetical validity of these statements (that is, if certain explicit assumptions were true, then the validity of the statements would follow), the analysis must embed the observational study in a hypothetical randomized experiment that created the observed data, or a subset of that data set. This effort is a multistage effort with thought-provoking tasks, especially in the first stage, which is purely conceptual. Other stages may often rely on modern computing to implement efficiently, but the first stage demands careful scientific argumentation to make the embedding plausible to thoughtful readers of the proffered statistical analysis. Otherwise, the resulting analysis is vulnerable to criticism for being simply a presentation of scientifically meaningless arithmetic calculations. In current practice, this perspective is rarely implemented with any rigor, for example, completely eschewing the first stage. Instead, often analyses appear to be conducted using computer programs run with limited consideration of the assumptions of the methods being used, producing tables of numbers with recondite interpretations, and presented using jargon, which may be familiar but also may be scientifically impenetrable. Somewhat paradoxically, the conceptual tasks, which are usually omitted in publications, often would be the most interesting to consumers of the analyses. 
These points will be illustrated using the analysis of an observational data set addressing the causal effects of parental smoking on their children’s lung function. This presentation may appear provocative, but it is intended to encourage applied researchers, especially those working on problems with policy implications, to focus on important conceptual issues rather than on minor technical ones.

Multilevel modeling of single-subject experimental data: Handling data and design complexities

Date, Time, Location: 5/10/2017, 11:00 - 12:00 
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Mariola Moeyaert
(University at Albany)

Abstract: There has been a substantial increase in the use of single-subject experimental designs (SSEDs) over the last decade of research to provide detailed examination of the effect of interventions. Whereas group comparison designs focus on the average treatment effect at one point of time, SSEDs allow researchers to investigate at the individual level the size and evolution of intervention effects. In addition, SSED studies may be more feasible than group experimental studies due to logistical and resource constraints, or due to studying a low incidence or highly fragmented population.

To enhance generalizability, researchers replicate across subjects and use meta-analysis to pool effects from individuals. Our research group was one of the first to propose, develop and promote the use of multilevel models to synthesize data across subjects, allowing for estimation of the mean treatment effect, variation in effects over subjects and studies, and subject and study characteristic moderator effects (Moeyaert, Ugille, Ferron, Beretvas, & Van den Noortgate, 2013a, 2013b, 2014). Moreover, multilevel models can handle unstandardized and standardized raw data or effect sizes, linear and nonlinear time trends, treatment effects on time trends, autocorrelation and other complex covariance structures at each level.

This presentation considers multiple complexities in the context of hierarchical linear modeling of SSED studies including the estimation of the variance components, which tend to be biased and imprecisely estimated. Results of a recent simulation study using Bayesian estimation techniques to deal with this issue will be discussed (Moeyaert, Rindskopf, Onghena & Van den Noortgate, 2017).

Collaborative targeted learning using regression shrinkage

Date, Time, Location: 5/3/2017, 11:00 - 12:00 
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Mireille Schnitzer
(University of Montreal)

Abstract: Causal inference practitioners are routinely presented with the challenge of wanting to adjust for large numbers of covariates despite limited sample sizes. Collaborative Targeted Maximum Likelihood Estimation (CTMLE) is a general framework for constructing doubly robust semiparametric causal estimators that data-adaptively reduce model complexity in the propensity score in order to optimize a preferred loss function. This stepwise complexity reduction is based on a loss function placed on a strategically updated model for the outcome variable, assessed through cross-validation. New work involves integrating penalized regression methods into a stepwise CTMLE procedure that may allow for a more flexible type of model selection than existing variable selection techniques. Two new algorithms are presented and assessed through simulation. The methods are then used in a pharmacoepidemiology example of the evaluation of the safety of asthma medication during pregnancy.

Remarks on the Mean-Difference Transformation and Bland-Altman Plot

Date, Time, Location: 4/26/2017, 11:00 - 12:00 
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Jay Verkuilen

Abstract: Tukey's mean-difference transformation and the Bland-Altman plot (e.g., Bland & Altman, 1986) are widely used in method comparison studies throughout the sciences, particularly in the health sciences. While intuitively appealing, easy to compute, and giving some notable advantages over simply reporting coefficients such as the concordance coefficient or intraclass correlations, they exhibit unusual behavior. In particular, one often observes systematic trends in the BA plot, and the plot is highly sensitive to outliers, among other issues. The purpose of this talk is to propose and study a generative model that lays out the logic of the mean-difference transformation and hence the BA plot, indicating when and why systematic trends may occur. The model provides insight into when users should expect problems with the BA plot and suggests that it should not be applied in circumstances when a more informative design such as instrumental variables is necessary. I also suggest some improvements to the graphics based on semi-parametric regression methods and discuss how putting the BA plot in a Bayesian framework could be helpful.
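
For readers unfamiliar with the transformation, the quantities plotted in a Bland-Altman plot can be computed as follows. This is a minimal sketch of the standard construction (pairwise means against pairwise differences, with limits of agreement at the mean difference plus or minus 1.96 standard deviations); it does not implement the generative model proposed in the talk, and the measurements are invented.

```python
from statistics import mean, stdev

def mean_difference(a, b):
    """Tukey mean-difference transformation of paired measurements."""
    means = [(x + y) / 2 for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    return means, diffs

def limits_of_agreement(diffs, z=1.96):
    """Lower limit, bias (mean difference), and upper limit of agreement."""
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias - z * sd, bias, bias + z * sd

# Invented paired readings from two measurement methods.
method_a = [10.0, 12.0, 14.0, 16.0]
method_b = [9.0, 13.0, 13.0, 17.0]

means, diffs = mean_difference(method_a, method_b)
lower, bias, upper = limits_of_agreement(diffs)
print(means, diffs)
print(lower, bias, upper)
```

The plot itself is a scatter of `diffs` against `means` with horizontal lines at the three returned values. A visible slope in that scatter is the kind of systematic trend the abstract refers to.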

Bayesian Causal Forests: Heterogeneous Treatment Effects from Observational Data

Date, Time, Location: 4/19/2017, 11:00 - 12:00
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Carlos Carvalho
(UT Austin)

Abstract: This paper develops a semi-parametric Bayesian regression model for estimating heterogeneous treatment effects from observational data. Standard nonlinear regression models, which may work quite well for prediction, can yield badly biased estimates of treatment effects when fit to data with strong confounding. Our Bayesian causal forests model avoids this problem by directly incorporating an estimate of the propensity function in the specification of the response model, implicitly inducing a covariate-dependent prior on the regression function. This new parametrization also allows treatment heterogeneity to be regularized separately from the prognostic effect of control variables, making it possible to informatively “shrink to homogeneity”, in contrast to existing Bayesian non- and semi-parametric approaches. Joint work with P. Richard Hahn and Jared Murray.

Log-Linear Bayesian Additive Regression Trees

Date, Time, Location: 4/5/2017, 11:00 - 12:00
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Jared Murray

Abstract: Bayesian additive regression trees (BART) have been applied to nonparametric mean regression and binary classification problems in a range of applied areas. To date BART models have been limited to models for Gaussian "data", either observed or latent, and with good reason - the Bayesian backfitting MCMC algorithm for BART is remarkably efficient in Gaussian models. But while many useful models are naturally cast in terms of observed or latent Gaussian variables, many others are not. In this talk I extend BART to a range of log-linear models including multinomial logistic regression and count regression models with zero-inflation and overdispersion. Extending to these non-Gaussian settings requires a novel prior distribution over BART's parameters. Like the original BART prior, this new prior distribution is carefully constructed and calibrated to be flexible while avoiding overfitting. With this new prior distribution and some data augmentation techniques I am able to implement an efficient generalization of the Bayesian backfitting algorithm for MCMC in log-linear (and other) BART models. I demonstrate the utility of these new methods with several examples and applications.

Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman's Critique

Date, Time, Location: 3/23/2017, 11:00 - 12:30
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Winston Lin

Abstract: This talk will be mostly based on my 2013 Annals of Applied Statistics paper, which reexamines David Freedman's critique of ordinary least squares regression adjustment in randomized experiments. Random assignment is intended to create comparable treatment and control groups, reducing the need for dubious statistical models. Nevertheless, researchers often use linear regression models to adjust for random treatment-control differences in baseline characteristics. The classic rationale, which assumes the regression model is true, is that adjustment tends to reduce the variance of the estimated treatment effect. In contrast, Freedman used a randomization-based inference framework to argue that under model misspecification, OLS adjustment can lead to increased asymptotic variance, invalid estimates of variance, and small-sample bias. My paper shows that in sufficiently large samples, those problems are either minor or easily fixed. Neglected parallels between regression adjustment in experiments and regression estimators in survey sampling turn out to be very helpful for intuition.
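
The abstract does not spell out the fixes. One adjustment examined in the 2013 paper is OLS with a full set of treatment-by-covariate interactions, and the sketch below illustrates that estimator for the simplest case. It assumes a single covariate and invented data, and it relies on the fact that a fully interacted regression is equivalent to fitting a separate line in each arm and comparing the two fitted values at the overall covariate mean. Variance estimation is not shown.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope of y on a single regressor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def difference_in_means(treat, y):
    """Unadjusted estimate: treated mean minus control mean."""
    treated = [yi for yi, t in zip(y, treat) if t == 1]
    control = [yi for yi, t in zip(y, treat) if t == 0]
    return mean(treated) - mean(control)

def interacted_adjusted_effect(treat, x, y):
    """Treatment coefficient from OLS of y on treatment, centered x, and their interaction."""
    x_bar = mean(x)
    fitted = {}
    for arm in (0, 1):
        xs = [xi for xi, t in zip(x, treat) if t == arm]
        ys = [yi for yi, t in zip(y, treat) if t == arm]
        intercept, slope = fit_line(xs, ys)
        # Predict each arm's outcome at the overall covariate mean.
        fitted[arm] = intercept + slope * x_bar
    return fitted[1] - fitted[0]

# Invented data in which the covariate happens to be imbalanced across arms.
treat = [0, 0, 0, 1, 1, 1]
x = [1.0, 2.0, 3.0, 2.0, 3.0, 7.0]
y = [1.0 + 2.0 * t + 3.0 * xi for t, xi in zip(treat, x)]

print(difference_in_means(treat, y))
print(interacted_adjusted_effect(treat, x, y))
```

In this noiseless toy example the outcome is built with a treatment effect of 2, so the gap between the two printed estimates reflects the chance covariate imbalance that adjustment is meant to correct.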

Finding common support through largest connected components and predicting counterfactuals for causal inference

Date, Time, Location: 3/22/2017, 2:00 - 3:30 
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Sharif Mahmood

Location: Kimball Hall, 246 Greene Street, 3rd Floor conference room
Abstract: Finding treatment effects in observational studies is complicated by the need to control for confounders. Common approaches for controlling include using prognostically important covariates to form groups of similar units containing both treatment and control units (e.g. statistical matching) and/or modeling responses through interpolation. Hence, treatment effects are only reliably estimated for a subpopulation under which a common support assumption holds--one in which treatment and control covariate spaces overlap. Given a distance metric measuring dissimilarity between units, we use techniques in graph theory to find common support. We construct an adjacency graph where edges are drawn between similar treated and control units. We then determine regions of common support by finding the largest connected components (LCC) of this graph. We show that LCC improves on existing methods by efficiently constructing regions that preserve clustering in the data while ensuring interpretability of the region through the distance metric. We apply our LCC method on a study of the effectiveness of right heart catheterization (RHC). To further control for confounders, we implement six matching algorithms for analyses. We find that RHC is a risky procedure for the patients and that clinical outcomes are significantly worse for patients that undergo RHC.
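
The graph construction described above can be sketched as follows. This is an illustrative simplification rather than the speaker's algorithm: it assumes Euclidean distance as the dissimilarity metric and a hypothetical fixed threshold for "similar", and it returns only the single largest component, whereas the abstract refers to components in the plural. The data are invented.

```python
import math
from collections import deque

def largest_connected_component(treated, control, threshold):
    """Indices of units in the largest component of the treated-control adjacency graph.

    Units are indexed with treated units first, then control units.
    """
    points = list(treated) + list(control)
    n_treated, n = len(treated), len(points)
    adjacency = {i: [] for i in range(n)}
    # Draw an edge only between a treated and a control unit that are close enough.
    for i in range(n_treated):
        for j in range(n_treated, n):
            if math.dist(points[i], points[j]) <= threshold:
                adjacency[i].append(j)
                adjacency[j].append(i)
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        if len(component) > len(best):
            best = component
    return sorted(best)

# Invented one-dimensional covariates, stored as tuples.
treated = [(0.0,), (1.0,), (10.0,)]
control = [(0.5,), (1.4,), (20.0,)]
print(largest_connected_component(treated, control, threshold=0.6))
```

Units outside the returned component, such as the treated unit at 10.0 and the control unit at 20.0 here, have no sufficiently similar counterpart in the other group and would be excluded from the region of common support.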

Simple Rules for Decision-Making

Date, Time, Location: 3/9/2017, 11:30 - 12:30 
3rd Fl. Conf. Rm, Kimball

Talk Category: Data for Social Impact

Speaker: Ravi Shroff

Abstract: Doctors, judges, and other experts typically rely on experience and intuition rather than statistical models when making decisions, often at the cost of significantly worse outcomes. I'll present a simple and intuitive strategy for creating statistically informed decision rules that are easy to apply, easy to understand, and perform on par with state-of-the-art machine learning methods in many settings. I'll illustrate these rules with two applications to the criminal justice system: investigatory stop decisions and pretrial detention decisions.

Scaling Latent Quantities from Text: From Black-and-White to Shades of Gray

Date, Time, Location: 3/1/2017, 12:30 - 1:30
3rd Fl. Conf. Rm, Kimball

Talk Category: Statistical Methodology

Speaker: Patrick Perry
(NYU Stern)

Abstract: Probabilistic methods for classifying texts according to the likelihood of class membership form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is either uninteresting, because it is known, or uninformative, because it yields poor information about a latent quantity of interest. In scaling political speeches, for instance, party membership is both known and uninformative, in the sense that in systems with party discipline, what is interesting is a latent trait in the speech, such as ideological position, often at odds with party membership. Predictive tools common in machine learning, where the goal is to predict a black-or-white class--such as spam, sentiment, or authorship--are not directly designed for the measurement problem of estimating latent quantities, especially those that are not inherently unobservable through direct means.

In this talk, I present a method for modeling texts not as black or white representations, but rather as explicit mixtures of perspectives. The focus shifts from predicting an unobserved discrete label to estimating the mixture proportions expressed in a text. In this "shades of gray" worldview, we are able to estimate not only the graynesses of texts but also those of the words making up a text, using likelihood-based inference. While this method is novel in its application to text, it can be situated in and compared to known approaches such as dictionary methods, topic models, and the wordscores scaling method. This new method has a fundamental linguistic and statistical foundation, and exploring this foundation exposes implicit assumptions found in previous approaches. I explore the robustness properties of the method and discuss issues of uncertainty quantification. My motivating application throughout the talk will be scaling legislative debate speeches.

Large, Sparse Optimal Matching in an Observational Study of Surgical Outcomes

Abstract: How do health outcomes for newly-trained surgeons' patients compare with those for patients of experienced surgeons? To answer this question using data from Medicare, we introduce a new form of matching that pairs patients of 1252 new surgeons to patients of experienced surgeons, exactly balancing 176 surgical procedures and closely balancing 2.9 million finer patient categories. The new matching algorithm (which uses penalized network flows) exploits a sparse network to quickly optimize a match two orders of magnitude larger than usual in statistical matching, while allowing for extensive use of a new form of marginal balance constraint.

Generalized Ridge Regression Using an Iterative Solution

Speaker: Kathryn is a postdoc at Columbia University's Earth Institute. Her PhD is in applied economics with interests in development economics, and applied statistics.

Location: Kimball Hall, 246 Greene Street, 3rd Floor conference room

Abstract: An iterative method is introduced for solving noisy, ill-conditioned inverse problems, where the standard ridge regression is just the first iteration of the iterative method to be presented. In addition to the regularization parameter, lambda, we introduce an iteration parameter k, which generalizes the ridge regression. The derived noise damping filter is a generalization of the standard ridge regression filter (also known as Tikhonov). Application of the generalized solution performs better than the pseudo-inverse (the default solution to OLS in most statistical packages), and better than standard ridge regression (L-2 regularization), when the covariate matrix or design matrix is ill-conditioned, or highly collinear. A few examples are presented using both simulated and real data.
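
The abstract does not give the iteration itself. One standard scheme whose first iteration is ordinary ridge regression is iterated Tikhonov regularization, sketched below. Whether it coincides with the method presented in the talk is an assumption, and the function names and data are invented. Starting from zero, each step adds a ridge-type correction computed from the current residuals, so `k = 1` reproduces ridge regression and larger `k` moves the estimate toward the least-squares solution.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def iterated_ridge(X, y, lam, k):
    """k steps of beta <- beta + (X'X + lam*I)^{-1} X'(y - X beta), starting from zero."""
    p = len(X[0])
    # Regularized Gram matrix X'X + lam*I, reused at every iteration.
    gram = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    beta = [0.0] * p
    for _ in range(k):
        resid = [yi - sum(b * xij for b, xij in zip(beta, row))
                 for row, yi in zip(X, y)]
        xt_resid = [sum(row[j] * r for row, r in zip(X, resid)) for j in range(p)]
        step = solve(gram, xt_resid)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Invented design matrix (intercept column plus one covariate) and response.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
print(iterated_ridge(X, y, lam=1.0, k=1))    # standard ridge estimate
print(iterated_ridge(X, y, lam=1.0, k=200))  # close to the least-squares fit
```

In this scheme the iteration count acts as a second tuning parameter alongside lambda: fewer iterations give more shrinkage, more iterations give less.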

Latent Space Models for Affiliation Networks

Speaker: Catherine (“Kate”) Calder is professor of statistics at The Ohio State University, where she has served on the faculty since 2003. Her research interests include spatial statistics, Bayesian modeling and computation, and network analysis, with application to problems in the social, environmental, and health sciences.

Location: Kimball Hall, 246 Greene Street, 3rd Floor conference room

Abstract: An affiliation network is a particular type of two-mode social network that consists of a set of 'actors' and a set of 'events' where ties indicate an actor's participation in an event. Methods for the analysis of affiliation networks are particularly useful for studying patterns of segregation and integration in social structures characterized by both people and potentially shared activities (e.g., parties, corporate board memberships, church attendance, etc.). One way to analyze affiliation networks is to consider one-mode network matrices that are derived from an affiliation network, but this approach may lead to the loss of important structural features of the data. The most comprehensive approach is to study both actors and events simultaneously. Statistical methods for studying affiliation networks, however, are less well developed than methods for studying one-mode, or actor-actor, networks. In this talk, I will describe a bilinear generalized mixed-effects model, which contains interacting random effects representing common activity pattern profiles and shared patterns of participation in these profiles. I will demonstrate how the proposed model is able to capture fourth-order dependence, a common feature of affiliation networks, and describe a Markov chain Monte Carlo algorithm for Bayesian inference. I then will use the latent space interpretation of model components to explore patterns in extracurricular activity membership of students in a racially-diverse high school in a Midwestern metropolitan area. Using techniques from spatial point pattern analysis, I will show how our model can provide insight into patterns of racial segregation in the voluntary extracurricular activity participation profiles of adolescents. This talk is based on joint work with Yanan Jia and Chris Browning.

Why so many research hypotheses are mostly false and how to test

Speaker: Paul De Boeck is a professor of quantitative psychology at the Ohio State University. Before moving to OSU in 2012 he was a professor of psychological methods at the University of Amsterdam (Netherlands) and a professor of psychological assessment at the KULeuven (Belgium). He was president of the Psychometric Society in 2008 and he is the founding editor of the Applied Research and Case Studies section of Psychometrika. His research interests are generalized linear mixed models and explanatory item response theory, and applications of these approaches in the domains of individual differences in cognition, emotion, and psychopathology. More recently he tries to get his work published on the credibility crisis in psychology and feasible but perhaps uncommon methods that may be useful as a response to the crisis.

Location: Kimball Hall, 246 Greene Street, 3rd floor

Abstract: From a recent Science article with a large number of replications of psychological studies, the base rate of the null hypothesis of no effect can be estimated. It turns out to be extremely high, which implies that many research hypotheses are false. As I will explain, they are perhaps not fully false but mostly false. A possible explanation for why unlikely hypotheses tend to be selected for empirical studies can be found in expected utility theory. It can be shown that for low to moderately high power rates, the expected utility of studies increases with the probability of the null hypothesis being true. A high probability of the null hypothesis being true can be understood as reflecting a contextual variation of effects that are in general not much different from zero. Increasing the power of studies has become a popular remedy to counter the replicability crisis, but this strategy is highly misleading if effects vary. Meta-analysis is considered another remedy, but it is a suboptimal and labor-intensive approach and it is only a long-term method. Two more feasible methods will be discussed to deal with contextual variation.

Be the Data and More: Using interactive, analytic methods to enhance learning from data for students

Speaker: Leanna House is an Associate Professor of Statistics at Virginia Tech (VT), Blacksburg, Virginia and has been at VT since 2008. Prior to VT, she worked at Battelle Memorial Institute, Columbus, Ohio; received her Ph.D. in Statistics from Duke University, Durham, North Carolina in 2006; and subsequently served as a post-doctoral research associate for two years in the Department of Mathematical Sciences at Durham University, Durham, United Kingdom. Dr. House has authored or co-authored 25 journal papers and has been a strong statistical contributor to successful grant proposals including "NRT-DESE: UrbComp: Data Science for Modeling, Understanding, and Advancing Urban Populations", "Usable Multiple Scale Big Data Analytics Through Interactive Visualization", "Critical Thinking with Data Visualization", "Examining the Taxonomic, Genetic, and Functional Diversity of Amphibian Skin Microbiota", and "Bayesian Analysis and Visual Analytics".

Location: Kimball Hall, 246 Greene Street, 3rd Floor conference room

Abstract: Datasets, no matter how big, are just tables of numbers without individuals to learn from the data, i.e., discover, process, assess, and communicate information in the data. Data visualizations are often used to present data to individuals, but most are created independently of human learning processes and lack transparency. To bridge the gap between people thinking critically about data and the utility of visualizations, we developed Bayesian Visual Analytics (BaVA) and its deterministic form, Visual to Parametric Interaction (V2PI). BaVA and V2PI transform static images of data to dynamic versions that respond to expert feedback. When applied iteratively, experts may explore data progressively in a sequence that parallels their personal sense-making processes. BaVA and V2PI have proven useful in both industry settings and the classroom. For example, we merged V2PI with motion detection software to create Be the Data. In Be the Data, students physically move in a space to communicate their expert feedback about data projected overhead. The idea is that participants have an opportunity to explore analytical relationships between data points by exploring relationships between themselves. This talk will focus on presenting the BaVA paradigm and its education applications.

Bayesian Inference and Stan Tutorial

Speaker: Vince is a postdoc in the NYU PRIISM program working on causal inference and nonparametrics. His recent work includes the causal inference competition at the 2016 Atlantic Causal Inference Conference and software to perform semiparametric sensitivity analyses evaluating the validity of the ignorability assumption in causal inference.

Location: Kimball Hall, 246 Greene Street, 3rd Floor conference room

Abstract: This two-hour session is focused on getting started with Stan and how to use it in your research. Stan is an open-source Bayesian probabilistic programming environment that takes a lot of the work out of model fitting so that researchers can focus on model building and interpretation. Topics will include: an overview of Bayesian statistics, an overview of Stan and MCMC, writing models in Stan, and a tutorial session where participants can write a model on their own or develop models that they have been working on independently. Stan has interfaces to numerous programming languages, but the talk will focus on R.
NOTE: Please bring a laptop with RStudio and RStan installed to this session

Basing Causal Inferences about Policy Impacts on Non-Representative Samples of Sites – Risks, Consequences, and Fixes

Speaker: Dr. Stephen Bell is an Abt Associates Fellow who holds a Ph.D. in Economics from the University of Wisconsin-Madison. He has designed and analyzed more than a dozen large-scale social experiments of policy interventions to assist disadvantaged Americans, with current work focusing on a slate of papers for IES and NSF on making findings of rigorous impact evaluations more generalizable to the nation and other inference populations. His research on methodologies for measuring social program impacts, both experimental and quasi-experimental econometric techniques, has been widely published. The work presented comes from collaborative work with Elizabeth A. Stuart, Robert B. Olsen, and Larry L. Orr.

Location: Kimball Hall, 246 Greene Street, 3rd Floor conference room

Abstract: Randomized impact evaluations of social and educational interventions—while constituting the “gold standard” of internal validity due to the lack of selection bias between treated and untreated cases—usually lack external validity. Due to cost and convenience, or local resistance, they are almost always conducted in a set of sites that are not a probability sample of the desired inference population— the nation as a whole for social programs or a given state or school district for educational innovations. We use statistical theory and data from the Reading First evaluation to examine the risks and consequences for social experiments of non-representative site selection, asking when and to what degree policy decisions are led astray by tarnished “gold standard” evidence. We also explore possible ex ante design-based solutions to this problem and the performance of ex post methods in the literature for overcoming non-representative site selection through analytic adjustments after the fact.

Mediation: From Intuition to Data Analysis

Speaker: Ilya Shpitser is an Assistant Professor in the Department of Computer Science at Johns Hopkins University. His research includes all areas of causal inference and missing data, particularly using graphical models. Many of the recent applications of his work have involved teasing out causation from association in observational medical data.

Location: Center for Data Science, 726 Broadway, 7th floor

Abstract: Modern causal inference links the "top-down" representation of causal intuitions and "bottom-up" data analysis with the aim of choosing policy. Two innovations that proved key for this synthesis were a formalization of Hume's counterfactual account of causation using potential outcomes (due to Jerzy Neyman), and viewing cause effect relationships via directed acyclic graphs (due to Sewall Wright). I will briefly review how a synthesis of these two ideas was instrumental in formally representing the notion of "causal effect" as a parameter in the language of potential outcomes, and discuss a complete identification theory linking these types of causal parameters and observed data, as well as approaches to estimation of the resulting statistical parameters. I will then describe, in more detail, how my collaborators and I are applying the same approach to mediation, the study of effects along particular causal pathways. I consider mediated effects at their most general: I allow arbitrary models, the presence of hidden variables, multiple outcomes, longitudinal treatments, and effects along arbitrary sets of causal pathways. As was the case with causal effects, there are three distinct but related problems to solve -- a representation problem (what sort of potential outcome does an effect along a set of pathways correspond to), an identification problem (can a causal parameter of interest be expressed as a functional of observed data), and an estimation problem (what are good ways of estimating the resulting statistical parameter). I report a complete solution to the first two problems, and progress on the third. In particular, my collaborators and I show that for some parameters that arise in mediation settings, triply robust estimators exist, which rely on an outcome model, a mediator model, and a treatment model, and which remain consistent if any two of these three models are correct. 
Some of the reported results are joint work with Eric Tchetgen Tchetgen, Caleb Miles, Phyllis Kanki, and Seema Meloni.

Bayes vs Maximum Likelihood: The case of bivariate probit models

Speaker: Adriana Crespo-Tenorio, PhD is on a mission to connect people’s online behavior to their offline lives. As a researcher in the Ads Research team at Facebook, her work focuses on finding the best ways for digital advertising to break through to audiences in a mobile world and link users' Feed experience to outcomes IRL. Adriana joined Facebook after working at The New York Times’s Customer Insights Group. She holds a PhD in political economy and applied statistics from Washington University in St Louis.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room
Abstract: Bivariate probit models are a common choice for scholars wishing to estimate causal effects in instrumental variable models where both the treatment and outcome are binary. However, standard maximum likelihood approaches for estimating bivariate probit models are problematic. Numerical routines in common software suites frequently generate inaccurate parameter estimates, and even when estimated correctly, maximum likelihood routines provide no straightforward way to produce estimates of uncertainty for causal quantities of interest. In this article, we show that adopting a Bayesian approach provides more accurate estimates of key parameters and facilitates the direct calculation of causal quantities along with their attendant measures of uncertainty.

Scalable Bayesian Inference with Hamiltonian Monte Carlo

Speaker: Michael Betancourt earned his PhD in Physics from MIT and is currently a Postdoctoral Research Associate at Warwick.

Location: Center for Data Science, 726 Broadway, 7th floor
Abstract: The modern preponderance of data has fueled a revolution in data science, but the complex nature of those data also limits naive inferences. To truly take advantage of these data we also need tools for building and fitting statistical models that capture those complexities. In this talk I’ll discuss some of the practical challenges of building and fitting such models in the context of real analyses. I will particularly emphasize the importance of Hamiltonian Monte Carlo and Stan, state-of-the-art computational tools that allow us to tackle these contemporary data without sacrificing the fidelity of our inferences.

Improving Human Learning with Unified Machine Learning Frameworks: Towards Faster, Better, and Less Expensive Education

Speaker: Dr. José González-Brenes is currently engaged in his research agenda as a scientist at Pearson. His work studies methods that enable faster, better, and less expensive education with principled quantitative methods.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room
Abstract: Seminal results from cognitive science suggest that personalized education is effective to improve learners’ outcomes. However, the effort for instructors to create content for each of their students can sometimes be prohibitive. Recent progress in machine learning has enabled technology for teachers to deliver personalized education. Unfortunately, the statistical models used by these systems are often tailored for ad-hoc domains and do not generalize across applications. In this talk, I will discuss my work towards the goal of a unified statistical framework of human learning. This line of work is more flexible, more efficient, and more accurate than previous technology. Moreover, it generalizes previous popular models from the literature. Additionally, I will outline recent progress on novel methodology to evaluate statistical models for education with a learner-centric perspective. My findings suggest that prior work often uses evaluation methods that may misrepresent the educational value of educational systems. My work is a promising alternative that improves the evaluation of machine learning models in education.

Probabilistic Cause-of-death Assignment using Verbal Autopsies

Speaker: Tyler McCormick is an Assistant Professor in the Departments of Statistics and Sociology at the University of Washington, Seattle.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room
Abstract: In regions without complete-coverage civil registration and vital statistics systems there is uncertainty about even the most basic demographic indicators. In such areas the majority of deaths occur outside hospitals and are not recorded. Worldwide, fewer than one-third of deaths are assigned a cause, with the least information available from the most impoverished nations. In populations like this, verbal autopsy (VA) is a commonly used tool to assess cause of death and estimate cause-specific mortality rates and the distribution of deaths by cause. VA uses an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. This paper develops a new statistical tool known as InSilicoVA to classify cause of death using information acquired through VA. InSilicoVA shares uncertainty between cause of death assignments for specific individuals and the distribution of deaths by cause across the population. Using side-by-side comparisons with both observed and simulated data, we demonstrate that InSilicoVA has distinct advantages compared to currently available methods.

Topic-adjusted visibility metric for scientific articles

Speaker: Tian Zheng is an Associate Professor of Statistics in the Statistics Department at Columbia University. Her research focuses on developing novel methods and improving existing methods for exploring and analyzing interesting patterns in complex data from different application domains.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room
Abstract: Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. While citations are raw data for constructing impact measures, there exist biases and potential issues if factors affecting citation patterns are not properly accounted for. In this talk, I present a new model that aims to address the problem of field variation and introduce an article level metric useful for evaluating individual articles’ topic-adjusted visibility. This measure derives from joint probabilistic modeling of the content in the articles and the citations amongst them using latent Dirichlet allocation (LDA) and the mixed membership stochastic blockmodel (MMSB). This proposed model provides a visibility metric for individual articles adjusted for field variation in citation rates, a structural understanding of citation behavior in different fields, and article recommendations which take into account article visibility and citation patterns. For this work, we also developed an efficient algorithm for model fitting using variational methods. To scale up to large networks, we developed an online variant using stochastic gradient methods and case-control likelihood approximation. Results from an application of our methods to the benchmark KDD Cup 2003 dataset with approximately 30,000 high energy physics papers will also be presented.

Small sample adjustments to F-tests for cluster robust standard errors

Speaker: Elizabeth Tipton is an Assistant Professor of Applied Statistics in the Human Development Department at Teachers College, Columbia University. Her research focuses on the design and analysis of field experiments; issues of external validity and generalizability in experiments; and meta-analysis, particularly of dependent estimates.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room
Abstract: Data analysts commonly ‘cluster’ their standard errors to account for correlations arising from the sampling of aggregate units (e.g., states), each containing multiple observations. When the number of clusters is small to moderate, however, this approach can lead to biased standard errors and hypothesis tests with inflated Type I error. One solution that is receiving increased attention is the use of the bias-reduced linearization (BRL). In this paper, we extend the BRL approach to include an F-test that can be implemented in a wide range of applications. A simulation study reveals that this test has Type I error close to nominal even with a very small number of clusters, and importantly, that it outperforms the usual estimator even when the number of clusters is moderate (e.g., 50 – 100).

The Controversies over Null Hypothesis Testing and Replication

Speaker: Barry Cohen received his PhD in Experimental Psychology from NYU, and is currently a clinical associate professor in the (GSAS) department of psychology at NYU, where he teaches courses in statistics and research design at the graduate level.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room
Abstract: The arguments against null hypothesis significance testing (NHST) have been greatly exaggerated, and do not apply equally to all types of psychological research. I will discuss the conditions under which NHST serves several useful purposes, which may outweigh its undeniable drawbacks. In brief, NHST works best when the null hypothesis is rarely true, the direction of the results is more important than the magnitude, extremely large samples are not used, and tiny effects have no serious consequences. Priming studies in social psychology will be used as an example of this type of research. Part of the controversy over failures to replicate notable psychological studies is related to misunderstandings and misuses of NHST. I will conclude by discussing the resistance to banning NHST and its p values in favor of reports of effect sizes and/or confidence intervals, and describing some of the possible solutions to the drawbacks of NHST.

The Curious Case of the Instrumental Variable Estimator for the Complier Average Causal Effect

Speaker: Russell Steele is an Associate Professor in the Department of Mathematics and Statistics at McGill University. Prof. Steele’s primary statistical methodological interests lie in the areas of methods for analyzing data with missing values and model selection, although he is more broadly interested in statistical applications. He has a broad range of substantive interests in medicine, publishing work in rheumatology, sports medicine, and design and interpretation of meta-analyses.

Location: Pless Hall, 32 Washington Pl, 5th Floor Conference Room

Abstract: In randomized clinical trials, subjects often do not comply with their randomized treatment arm. Although one can still unbiasedly estimate the causal effect of being assigned to treatment using the common Intention-to-Treat (ITT) estimator, there is now potential confounding of the causal effect of actually *receiving* treatment. Basic alternative estimators such as the per protocol or as treated estimators have been used, but are generally biased for estimating the causal effect of interest. Balke and Pearl (1997) and Angrist, et al. (1996) independently proposed an instrumental variable (IV) estimator that would estimate the causal effect (the Complier Average Causal Effect — CACE) of receiving treatment in a subpopulation of people who would comply with treatment assignment (i.e. the compliers). In this talk, I will first review the CACE and the IV estimator. I will then dissect the instrumental variable estimator in order to compare it to the per protocol and as treated estimators. I will show that the basic IV estimator and its confidence interval can be computed from basic summary statistics that should be reported in any randomized trial. My formulation of the IV estimator will also allow for simple sensitivity analyses that can be done using a basic Excel spreadsheet. I will then describe future interesting directions for compliance research that I am currently working on. Most of this work appears in a recently published article at the American Journal of Epidemiology and is co-authored by Ian Shrier, Jay Kaufmann and Robert Platt.
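As a rough illustration of how an IV estimate of the CACE can be built from trial summary statistics, the sketch below uses the standard ratio (Wald) form: the ITT effect on the outcome divided by the difference in treatment receipt between arms. It is not the speaker's formulation, it assumes the usual IV conditions (randomized assignment, exclusion restriction, no defiers), and the function name, argument names, and example numbers are all hypothetical.

```python
def cace_iv(mean_y_assigned, mean_y_control, p_received_assigned, p_received_control):
    """Ratio (Wald) IV estimate of the complier average causal effect.

    All four inputs are arm-level summary statistics (hypothetical names):
    the mean outcome in each randomized arm and the proportion in each arm
    that actually received treatment.
    """
    # ITT effect of assignment on the outcome
    itt_outcome = mean_y_assigned - mean_y_control
    # ITT effect of assignment on treatment receipt (estimated share of compliers)
    itt_receipt = p_received_assigned - p_received_control
    if itt_receipt == 0:
        raise ValueError("assignment did not change treatment receipt; CACE is undefined")
    return itt_outcome / itt_receipt


if __name__ == "__main__":
    # Made-up numbers: 80% of the assigned arm took treatment, nobody in control did.
    print(round(cace_iv(0.60, 0.50, 0.80, 0.00), 3))
```

Because the denominator is at most 1 in absolute value, the IV estimate is never smaller in magnitude than the ITT estimate, which is why weak uptake of treatment inflates both the estimate and its uncertainty.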

Covariate Selection with Observational Data: Simulation Results and Discussion

Speaker: Bryan Keller is Assistant Professor of Applied Statistics at Teachers College, Columbia University. His current research interests include causal inference and applications of data mining methods to social and education sciences. His scholarly work has been published in Structural Equation Modeling, Psychometrika, and Multivariate Behavioral Research.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room

Abstract: In an effort to protect against omitted variable bias, statisticians have traditionally favored an inclusive approach to covariate selection for causal inference, so long as covariates were measured before any treatment was administered. There are, however, three classes of variables, which, if conditioned upon, are known to degrade either the bias or efficiency of an estimate of a causal effect: non-informative variables (NVs), instrumental variables (IVs), and collider variables. The decision about whether to control for a potential collider variable must be based on theory about how the data were generated. In contrast, one need only establish a lack of association with the outcome variable in order to identify an NV or an IV. We investigate three empirical methods – forward stepwise selection, the lasso, and recursive feature elimination with random forests – for detection of NVs and IVs through simulation studies in which we judge their efficacy by (a) sensitivity and specificity in identifying true or near NVs and IVs and (b) the overall effect on bias and mean-squared error of the causal effect estimator, relative to inclusion of all pretreatment variables. Results and implications are discussed.

The End of Intelligence? What might Big Data, Learning Analytics and the Information Age Mean for how we Measure Education

Speaker: Charles Lang is a Postdoctoral Associate in the Department of Administration, Leadership & Technology at Steinhardt School of Culture, Education & Human Development, NYU. He received his doctorate in Human Development and Education from the Harvard Graduate School of Education and studies methodologies for capturing learning within the nascent field of learning analytics.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room

Abstract: For over a century educational measurement has developed analytical tools designed to maximize the inferential power of limited samples: a biannual state test, a regular accreditation exam, a once in a lifetime SAT. But can this methodology adapt to a world in which previous limitations on data collection have been dramatically reduced? A world with a greater variety of data formats, representing a larger number of conditions, on a finer timescale, with a larger sample of students. Starting from a methodological basis, Charles will discuss the implications that changes in data collection may have on how education is measured and the impact that this might have on the disciplines, institutions, and practitioners that utilize educational measurement.

Studying Change with Difference Scores versus ANCOVA: Issues, Perspectives and Advances

Speaker: Pat Shrout is a Professor of Psychology at NYU. His methodologic research has been primarily in psychometrics, sampling, and multilevel models for analysis of growth and change.

Location: Kimball Hall, 246 Greene St, 3rd Floor Conference Room

Abstract: Nearly 50 years ago, Lord (1967) described a so-called paradox in statistical analysis whereby two reasonable analyses of pre-treatment/post-treatment data lead to different results. I revisit the issues, review some of the historical discussion, and present an analysis of the alternate analyses with a causal model that distinguishes treatment effects from trait, state, and error variation. In addition to comparing numerical results from difference score and ANCOVA adjustment for pre-treatment group differences, I consider results based on propensity score adjustment.

Seminars 2008 – 2015

WHO: Matthew Steinberg

WHAT: Classroom Context and Observed Teacher Performance: What Do Teacher Observation Scores Really Measure?

WHEN: March 25, 2015

WHERE: 246 Greene Street, 3rd floor conference room

ABSTRACT: As federal, state, and local policy reforms mandate the implementation of more rigorous teacher evaluation systems, measures of teacher performance are increasingly being used to support improvements in teacher effectiveness and inform decisions related to teacher retention. Observations of teachers’ classroom instruction take a central role in these systems, accounting for the majority of a teacher’s summative evaluation rating upon which accountability decisions are based. This study explores the extent to which classroom context influences measures of teacher performance based on classroom observation scores. Using data from the Measures of Effective Teaching (MET) study, we find that the context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ measured performance, even after accounting for teachers’ endowed instructional abilities. The influence of student achievement on measured teacher performance is particularly salient for English Language Arts (ELA) instruction; for aspects of classroom practice that depend on a teacher’s interactions with her students; and for subject-specific teachers compared with their generalist counterparts. Further, evidence suggests that the intentional sorting of teachers to students has a significant influence on measured ELA (though not math) instruction. Implications for high-stakes teacher-accountability policies are discussed.

BIO: Dr. Steinberg is an Assistant Professor of Education, with appointments in the Education Policy and Teaching, Learning and Leadership Divisions. He is the Faculty Methodologist for the University of Pennsylvania IES Pre-Doctoral Training Program, as well as a Faculty Fellow with the University of Pennsylvania Institute for Urban Research and an Affiliated Researcher with the University of Chicago Consortium on Chicago School Research.

WHO: Lixing Zhu, Department of Mathematics/Hong Kong Baptist University

WHEN: Wednesday October 29, 2014

ABSTRACT: For a factor model, the involved covariance matrix often has no row sparse structure because the common factors may lead some variables to strongly associate with many others. Under the ultra-high dimensional paradigm, this feature means that existing methods for sparse covariance matrices in the literature are not directly applicable. In this paper, for a general covariance matrix, a novel approach is suggested to detect these variables, which are called the pivotal variables. Then, two-stage estimation procedures are proposed to handle ultra-high dimensionality in a factor model. In these procedures, pivotal variable detection is performed as a screening step and then existing approaches are applied to refine the working model. The estimation efficiency can be improved under weaker assumptions on the model structure. Simulations are conducted to examine the performance of the new method.

WHO: Theo Damoulas (CUSP)

WHAT: Mining NYPD’s 911 Call Data: Resource Allocation, Crimes, and Civic Engagement

WHEN: Weds. Dec. 10, 2014, 11am-12pm

WHERE: Kimball 3rd Fl. Conference Room

ABSTRACT: NYPD’s 911 calls capture some of the most interesting urban activity in New York City such as serious crimes, family disputes, bombing attacks, natural disasters, and of course prank phone calls. In this talk I will describe research in progress conducted at the Center for Urban Science and Progress at NYU, in collaboration with NYPD. The work spans multiple areas of applied statistical interest such as sampling bias, time series analysis, and spatial statistics. The domain is very rich and offers many opportunities for research in core statistical and computational areas such as causal inference, search and pattern matching algorithms, evidence and data integration, ensemble models, and uncertainty quantification. At the same time there is great potential for positively impacting the quality of life of New Yorkers, and the day-to-day operation of NYPD.

WHO: Luke Keele

WHAT: Estimating Post-Treatment Effect Modification With Generalized Structural Mean Models

WHEN: Monday, February 24, 2014, 12-1:15pm

WHERE: 246 Greene St (Kimball), 3rd floor conference room 301W

ABSTRACT: In randomized controlled trials, the evaluation of an overall treatment effect is often followed by effect modification or subgroup analyses, where the possibility of a different magnitude or direction of effect for varying values of a covariate is explored. While studies of effect modification are typically restricted to pretreatment covariates, longitudinal experimental designs permit the examination of treatment effect modification by intermediate outcomes, where intermediates are measured after treatment but before the final outcome. We present a generalized structural mean model (GSMM) for analyzing treatment effect modification by post-treatment covariates. The model can accommodate post-treatment effect modification with both full compliance and noncompliance to assigned treatment status. The methods are evaluated using a simulation study that demonstrates that our approach retains unbiased estimation of effect modification by intermediate variables which are affected by treatment and also predict outcomes. We illustrate the method using a randomized trial designed to promote re-employment through teaching skills to enhance self-esteem and inoculate job seekers against setbacks in the job search process. Our analysis provides some evidence that the intervention was much less successful among subjects that displayed higher levels of depression at intermediate post-treatment waves of the study.

BIO: Dr. Keele received his PhD in Political Science from the University of North Carolina at Chapel Hill.

WHO: Luke Keele

WHAT: Didactic Talk: Causal Mediation Analysis

WHEN: Tuesday, February 25, 2014, 10:45-11:45am

WHERE: 246 Greene St (Kimball), 3rd floor conference room 301W

ABSTRACT: Causal analysis in the social sciences has largely focused on the estimation of treatment effects. Researchers often also seek to understand how a causal relationship arises. That is, they wish to know why a treatment works. In this talk, I introduce causal mediation analysis, a statistical framework for analyzing how a specific treatment changes an outcome. Using the potential outcomes framework, I outline both the counterfactual comparison implied by a causal mediation analysis and exactly what assumptions are sufficient for identifying causal mediation effects. I highlight that commonly used statistical methods for identifying causal mechanisms rely upon untestable assumptions and may be inappropriate even under those assumptions. Causal mediation analysis is illustrated via an intervention study that seeks to understand whether single-sex classrooms improve academic performance.

WHO: Daphna Harel

WHAT: Research Talk: The Effect of Collapsing Categories on the Estimation of the Latent Trait

WHEN: Wednesday, February 26, 2014, 12-1:15pm

WHERE: 246 Greene St (Kimball), 3rd floor conference room 301W

ABSTRACT: Researchers often collapse categories of ordinal data out of convenience or in an attempt to improve model performance. Collapsing categories is quite common when fitting item response theory (IRT) models when items are deemed to behave poorly. In this talk, I define the true model for the collapsed data both from a marginal and conditional perspective and develop a new paradigm for thinking about the problem of collapsing categories. I explore the issue of collapsing categories through the lens of model misspecification and explore the asymptotic behaviour of the parameter estimates from the misspecified model. I review and critique several current methods for deciding when to collapse categories and present simulation results on the effect of collapsing on the estimation of the latent trait.

BIO: Daphna Harel is a PhD candidate in Probability and Statistics at McGill University, and will be graduating in August of this year.

WHO: Daphna Harel

WHAT: Didactic Talk: An Introduction to Item Response Theory and Its Applications

WHEN: Thursday, February 27, 2014, 11am-12pm

WHERE: 246 Greene St (Kimball), 3rd floor conference room 301W

ABSTRACT: When a trait or construct cannot be measured directly, researchers often use multi-item questionnaires or tests to collect data that can provide insight about the underlying (or latent) trait. Item Response Theory (IRT) provides a class of statistical models that relate these observed responses to the latent trait allowing for inference to be made while still accounting for item-level characteristics. In this talk, I will introduce four commonly used IRT models: the Rasch model, the two-parameter model, the Partial Credit model and the Generalized Partial Credit model. My comparison will focus on the interpretation of and selection amongst these four models. One common use of IRT models is to determine whether an item functions the same for all types of people. This issue of Differential Item Functioning will be explored in the case of dichotomous items for both the Rasch model and two-parameter model. Lastly, three important summary statistics, the empirical Bayes estimator, the summed score and the weighted summed score will be presented and the use of each will be explained, specifically for the Partial Credit model and Generalized Partial Credit model.

WHO: Ivan Diaz

WHAT: Research Talk: Definition and estimation of causal effects for continuous exposures: theory and applications

WHEN: Thursday, February 13, 2014, 11am-12:15pm

WHERE: 246 Greene St (Kimball), 3rd floor conference room

ABSTRACT: The definition of a causal effect typically involves counterfactual variables resulting from interventions that modify the exposure of interest deterministically. However, this approach might yield infeasible interventions in some applications. A stochastic intervention generalizes the framework to define counterfactuals in which the post-intervention exposure is stochastic rather than deterministic. In this talk I will present a new approach to causal effects based on stochastic interventions. I will focus on an application of this methodology to the definition and estimation of the causal effect of a shift of a continuous exposure. This parameter is of general interest since it generalizes the interpretation of the coefficient in a main effects regression model to a nonparametric model. I will discuss two estimators of the causal effect: an M-estimator and a targeted minimum loss based estimator (TMLE), both of them efficient in the nonparametric model. I will discuss the methods in the context of an application to the evaluation of the effect of physical activity on all-cause mortality in the elderly.

BIO: Dr. Diaz received his PhD in Biostatistics from the School of Public Health, University of California at Berkeley, under the direction of Mark van der Laan and is completing a postdoc at Johns Hopkins this year.

WHO: Ivan Diaz

WHAT: Definition and estimation of causal effects for continuous exposures

WHEN: Thursday, February 13, 2014, 2-3pm

WHERE: 246 Greene St (Kimball), 3rd floor conference room

ABSTRACT: In this talk I will discuss some important practical aspects of the definition and interpretation of potential (also called counterfactual) outcomes. These aspects must be considered with care when defining estimands in causal inference for observational studies. In particular, when working with continuous exposures, interventions that result in the usual potential outcomes are often inconceivable. As a consequence, the standard framework fails to provide relevant answers to scientific questions about interventions on the exposure. To solve this problem, I will present a proposal that defines counterfactual outcomes in terms of plausible interventions. I will define the causal effect of such interventions, and present an outcome regression estimator whose implementation is straightforward using existing regression software. The methods will be illustrated using an application to the evaluation of the effect of physical activity on all-cause mortality in the elderly.

WHO: David Ong (Assistant Professor of Economics at Peking University Business School)

WHAT: Income attraction: An online dating field experiment

WHEN: Thursday, November 21, 11am-12:15pm

WHERE: 3rd floor conference room in Kimball Hall (246 Greene St)

ABSTRACT: Marriage rates have been decreasing in the US at the same time as women’s relative wages have been increasing. We found the opposite pattern in China. Prior empirical studies with US marriage data indicate that women marry up (and men marry down) economically. Furthermore, if the wife earns more, less happiness and greater strife are reported, the gender gap in housework increases, and the couple is more likely to divorce. However, these observational studies cannot identify whether these consequences were due to men’s preference for lower income women, or women’s preference for higher income men, or to other factors. We complement this literature by measuring income-based attraction in a field experiment. We randomly assigned income levels to 360 unique artificial profiles on a major online dating website and recorded the incomes of nearly 4000 visits. We found that men of all income levels visited women’s profiles with different income levels at roughly equal rates. In contrast, women at all income levels visited men with higher income at higher rates, and surprisingly, these higher rates increased with the women’s own income. Men with the highest level of income got ten times more visits than the lowest. We discuss how the gender difference in “income attraction” might shed light on marriage and gender wage patterns, the wage premium for married men, and other stylized facts, e.g., why the gender gap in housework is higher for women who earn more than their husbands. This is the first field experimental study of gender differences in preferences for mate income.

WHO: Adam Glynn, Harvard University

WHAT: Front-door Difference-in-Differences Estimators: The Effects of Early In-person Voting on Turnout

WHEN: Thursday, Nov. 7, 2013, 11am-12:15pm

WHERE: 3rd floor conference room in Kimball Hall (246 Greene St)

ABSTRACT: In this talk, we develop front-door difference-in-differences estimators that utilize mechanistic information from post-treatment variables in addition to information from pre-treatment covariates. Even when the front-door criterion does not hold, these estimators allow the identification of causal effects by utilizing assumptions that are analogous to standard difference-in-differences assumptions. We also demonstrate that causal effects can be bounded by front-door and front-door difference-in-differences estimators under relaxed assumptions. We illustrate these points with an application to the effects of early in-person voting on turnout. Despite recent claims that early in-person voting had either an undetectable effect or a negative effect on turnout in 2008, we find evidence that early in-person voting had small positive effects on turnout in Florida in 2008. Moreover, we find evidence that early in-person voting disproportionately benefits African-American turnout.

WHO: Vincent Dorie, IES Postdoctoral Fellow, PRIISM Center

WHAT: Gaussian Processes for Causal Inference

WHEN: Thursday, October 24, 2013, 11am-12:15pm

WHERE: 3rd floor conference room in Kimball Hall (246 Greene St)

ABSTRACT: This brown bag talk will provide a mathematical and literature background for Gaussian Processes (GP) and discuss the use of GP in non-parametric modeling of the response surface for use in making straightforward causal comparisons. Additional topics include scalability, incorporating treatment levels as a spatial dimension, and the requirements for a fully-automated "black box" system for causal inference.

WHO: Nicole Carnegie, Harvard University

WHAT: Linkage of viral sequences among HIV-infected village residents in Botswana: estimation of clustering rates in the presence of missing data

WHEN: Thursday, Sept. 19, 2013, 11am-12:15pm

ABSTRACT: Linkage analysis is useful in investigating disease transmission dynamics and the effect of interventions on them, but estimates of probabilities of linkage between infected people from observed data can be biased downward when missingness is informative. We investigate variation in the rates at which subjects' viral genotypes link by viral load (low/high) and ART status using blood samples from household surveys in the Northeast sector of Mochudi, Botswana. The probability of obtaining a sequence from a sample varies with viral load; samples with low viral load are harder to amplify. Pairwise genetic distances were estimated from aligned nucleotide sequences of HIV-1C env gp120. It is first shown that the probability that randomly selected sequences are linked can be estimated consistently from observed data. This is then used to develop maximum likelihood estimates of the probability that a sequence from one group links to at least one sequence from another group under the assumption of independence across pairs. Furthermore, a resampling approach is developed that adjusts for the presence of correlation within individuals, with diagnostics for assessing the reliability of the method.

Sequences were obtained for 65% of subjects with high viral load (HVL, n=117), 54% of subjects with low viral load but not on ART (LVL, n=180), and 45% of subjects on ART (ART, n=126). The probability of linkage between two individuals is highest if both have HVL, and lowest if one has LVL and the other has LVL or is on ART. Linkage across groups is high for HVL and lower for LVL and ART. Adjustment for missing data increases the group-wise linkage rates by 40-100%, and changes the relative rates between groups. Bias in inferences regarding HIV viral linkage that arise from differential ability to genotype samples can be reduced by appropriate methods for accommodating missing data.

WHO: Daphna Harel, McGill University, Department of Mathematics and Statistics

WHAT: The Inadequacy of the Summed Score (and How You Can Fix It!)

WHEN: Thursday, October 17, 2013, 11am-12:15pm

WHERE: 3rd floor conference room in Kimball Hall (246 Greene St)

ABSTRACT: Health researchers often use patient and physician questionnaires to assess certain aspects of health status. Item Response Theory (IRT) provides a set of tools for examining the properties of the instrument and for estimation of the latent trait for each individual. In my research, I critically examine the usefulness of the summed score over items, and of a weighted summed score (using weights computed from the IRT model), as alternatives to both the empirical Bayes estimator and the maximum likelihood estimator for the Generalized Partial Credit Model. First, I will talk about two useful theoretical properties of the weighted summed score that I have proven as part of my work. Then I will relate the weighted summed score to other commonly used estimators of the latent trait. I will demonstrate the importance of these results in the context of both simulated and real data on the Center for Epidemiological Studies Depression Scale.

WHO: Juan Bello, NYU
WHAT:  Brown Bag talk: Information Extraction from Music Audio
WHEN: Wed, April 18, 2012, 11:15am-12:15pm
WHERE: 246 Greene Street, Floor 3, Conference Room
ABSTRACT:  This talk will overview a mix of concepts, problems and techniques at the crossroads between signal processing, machine learning and music. I will start by motivating the use of content-based methods for the analysis and retrieval of music. Then, I will introduce work in three projects being investigated at the Music and Audio Research Lab (MARL): automatic chord recognition using hidden Markov models, music structure analysis using probabilistic latent component analysis, and feature learning using convolutional neural networks. In the process of doing so, I hope to illustrate some of the challenges and opportunities in the field of music informatics.

Read more about Professor Bello and the lab:


WHO: Drew Conway
WHAT:  The impact of data science on the social sciences: perspective of a political scientist
WHEN: Wed, April 4, 2012, 11am-12pm
WHERE: 246 Greene Street, Floor 3, Conference Room
ABSTRACT: As an emergent discipline, "data science" is by its very nature interdisciplinary. But what separates this new discipline from traditional data mining work is a fundamental interest in human behavior. Data science has been born out of the proliferation of massive records of online human behavior, e.g., Facebook, Twitter, LinkedIn, etc. It is the very presence of this data, and the accompanying tools for processing it, which have led to the meteoric rise in demand for data science. As such, principles from social science and a deep understanding of the data's substance represent core components in most data science endeavors. In this talk I will describe this and the other core components of data science through examples from my own experience, highlighting the role of social science.

WHO: Ji Seung Yang
WHAT: Talk: "Estimation of Contextual Effects through Multilevel Latent Variable Modeling with a Metropolis-Hastings Robbins-Monro Algorithm" 
WHEN: Tuesday, March 20, 2012, 1pm-2pm
WHERE: Pless Hall, 82 Washington Square East, 5th Floor Conference Room
ABSTRACT: Since human beings are social, their behaviors are naturally influenced by social groups such as one’s family, classroom, school, workplace, and country. Therefore, understanding human behaviors through not only an individual level perspective but also the lens of social context helps social researchers obtain a more complete picture of the individuals as well as society. The main theme of this talk is the definition and estimation of a contextual effect using nonlinear multilevel latent variable modeling  in which measurement error and sampling error are more properly addressed. The discussion is centered around an on-going research project that adopts a new algorithm, Metropolis-Hastings Robbins-Monro (MH-RM), to improve estimation efficiency in obtaining full-information maximum likelihood estimates (FIML) of the contextual effect. The MH-RM combines Markov chain Monte Carlo (MCMC) sampling and Stochastic Approximation to obtain FIML estimates more efficiently in complex models. This talk considers contextual effects not only as compositional effects but also as cross-level interactions, in which latent predictors are measured by categorical manifest variables. 

WHO: Preeti Raghavan and Ying Lu
WHAT: Brown Bag Discussion: Statistical modelling strategies for analyzing human movement data
WHEN: Tuesday, March 20, 2012, 10:30am-12pm
WHERE: 246 Greene Street, Floor 3, Conference Room
ABSTRACT: Recent collaborations between Dr. Preeti Raghavan (Motor Recovery Lab, Rusk Institute) and Dr. Ying Lu (member of PRIISM) will be discussed in this talk. Using the rich kinematic and EMG data collected at the Motor Recovery Lab, we are interested in movement patterns and how they change when the physiology is modified due to training, injury, disease and disability. We have explored Principal Component Analysis as a tool for dimension reduction to identify common patterns. Since the movement data are typically recorded over a period of time, it is important to model the movement pattern over time. We will discuss two approaches: treating the movement data as functional data (the functional approach) or as time series data. Accordingly we will discuss the use of functional PCA and dynamic factor analysis. Future directions of connecting EMG (muscle activities) with kinematic measures in these two contexts will also be discussed.

WHO: Ji Seung Yang
WHAT:  Talk: "An Introduction to Item Response Theory"
WHEN: Mon, March 19, 2012, 1pm – 2pm
WHERE: Pless Hall, 82 Washington Square East, 4th Floor, Payne Conference Room
ABSTRACT: Item Response Theory (IRT) is a state-of-the-art method that has been widely used in large-scale educational assessments. Recently there has been an increased awareness of the potential benefits of IRT methodology not only in education but also in other fields such as health-related outcomes research and mental health assessment. This talk introduces the fundamentals of IRT to an audience that is not acquainted with it. In addition to the key concepts of IRT, the three most popular IRT models for dichotomously scored responses will be illustrated, using an empirical data example extracted from the Programme for International Student Assessment (PISA, OECD). This talk covers the principles of item analysis and scoring people in the IRT framework and provides a list of advanced IRT topics at the end to sketch out the current methodological research stream in IRT.

WHO: Peter Halpin, University of Amsterdam
WHAT:  Talk: "Three perspectives on item response theory"
WHEN: Tue, March 6, 2012, 2:30pm – 3:30pm
WHERE: Payne Room, 4th Floor, Pless Hall
ABSTRACT:  In this talk I introduce item response theory (IRT) to a general audience through consideration of three different perspectives. Firstly, I outline how IRT can be motivated with reference to classical test theory (CTT). This gives us the conventional view of IRT as a theory of test scores. Secondly, I compare IRT and discrete factor analysis (DFA). From a statistical perspective, the differences are largely a matter of emphasis. This situates IRT in the more general domain of latent variable modelling. Thirdly, I show how IRT can be represented in terms of generalized (non-) linear models. This leads to the notion of explanatory IRT, or the inclusion of covariates to model individual differences. Comparison of these perspectives allows for a relatively up-to-date “big picture” of IRT.

WHO: Peter Halpin, University of Amsterdam
WHAT:  Talk: "Point process models of human dynamics"
WHEN: Mon, March 5, 2012, 2:30pm – 3:30pm
WHERE: Payne Room, 4th Floor, Pless Hall
ABSTRACT: There is an increasing demand for the analysis of intensive time series data collected on relatively few observational units. In this presentation I address the case of discrete events observed at irregular time points. In particular I discuss a class of models for coupled streams of events. These models have many natural applications in the study of human behaviour, of which I emphasize relationship counselling and classroom dynamics. I summarize my own results on parameter estimation and illustrate the model using an example from post graduate training. I also discuss ongoing developments regarding inclusion of random, time-varying covariates with measurement error and various other topics.

WHO: Jay Verkuilen (CUNY Graduate Center, Educational Psychology)
WHAT: Brown Bag Seminar: Model Comparison is Judgment, Model Selection is Decision Making
WHEN: February 15th, 2012, 11am-12pm
WHERE: 246 Greene Street, Floor 3, Conference Room
ABSTRACT: Model Comparison (MC) and Model Selection (MS) are now commonly used procedures in the statistical analysis of data in the behavioral and biological sciences. However, a number of puzzling questions seem to remain largely unexamined, many of which parallel issues that have been studied empirically in the judgment and decision making literature. In general, both MC and MS involve multiple criteria and are thus likely to be subject to the same difficulties as many other multi-criteria decision problems. For example, standard MS rules based upon Akaike weights employ a variation of Luce’s choice rule. The fact that Luce’s choice rule was constructed to encapsulate a probabilistic version of the ‘independence of irrelevant alternatives’ (IIA) condition has a number of consequences for the choice set of models to be compared. Contractions and dilations of the choice set are likely to be problematic, particularly given that information criteria measure only predictive success and not other aspects of the problem that are meaningful but more difficult to quantify, such as interpretability. In addition, in many models it is not entirely clear how to properly define quantities such as sample size or the number of parameters, and there are a number of key assumptions that are likely to be violated in common models, such as that of a regular likelihood. We consider some alternative ways of thinking about the problem. We offer some examples to illustrate, one using loglinear analysis and the other a binary mixed model.

WHO: Cyrus Samii (Department of Politics, NYU)
WHAT: Dealing with Attrition in Randomized Experiments: Non-parametric and Semi-Parametric Approaches
WHEN: December 7th, 2011, 11am-12noon
WHERE: 246 Greene Street, Floor 3, Conference Room
ABSTRACT: Uncontrolled missingness in experimental data may undermine randomization as the basis for unbiased inference of average treatment effects. This paper reviews methods that attempt to address this problem for inference on average treatment effects. I review inference with non-parametric bounds and inference with semi-parametric adjustment through inverse-probability weighting, imputation, and their combination. The analysis is rooted in the Neyman-Rubin potential outcomes model, which helps to expose key assumptions necessary for identification and also for valid statistical inference (e.g., interval construction).

WHO: Eric Loken, Research Associate Professor, Department of Human Development and Family Studies, Pennsylvania State University
WHAT: The Psychometrics of College Testing: Why Don't We Practice What We Teach?
WHEN: November 9th, 2011, 11:30am-12:30pm
WHERE: 246 Greene Street, Floor 3, Conference Room
ABSTRACT: Universities with large introductory classes are essentially operating like major testing organizations. The college assessment model, however, is many decades old, and almost no attention is given to evaluating the psychometric properties of classroom testing. This is surprising considering risks in accountability, and lost opportunities for innovation in pedagogy. As used in colleges, multiple choice tests are often guaranteed to provide unequal information across the ability spectrum, and almost nothing is known about the consistency of measurement properties across subgroups. Course management systems that encourage testing from item banks can expose students to dramatically unequal assessment. Aside from issues of fairness and validity, the neglect of research on testing in undergraduate classes represents a missed opportunity to take an empirical approach to pedagogy. Years of testing have generated vast amounts of data on student performance. These data can be leveraged to inform pedagogical approaches. They can also be leveraged to provide novel assessments and tools to better encourage and measure student learning.

WHO: Krista Gile (Department of Mathematics and Statistics, University of Massachusetts/Amherst)
WHAT: An "introduction" to Respondent Driven Sampling (RDS) methodology.
WHEN: October 13th, 2011, 1pm-3pm
WHERE: 246 Greene Street, Floor 3, Conference Room
ABSTRACT: Krista Gile (Department of Mathematics and Statistics University of Massachusetts/Amherst) is a statistician who works closely with social and behavioral scientists in the area of RDS. RDS is an innovative sampling technique for studying hidden and hard-to-reach populations for which no sampling frame can be obtained. RDS has been widely used to  sample populations at high risk of HIV infection and has also been used to survey undocumented workers and migrants.

In addition to providing an introduction to RDS for the PRIISM community, Krista will also be giving a statistical methodology talk at the NYU Stern/IOMS Dept. on Friday. Details are available here

WHO: Roderick J. Little, Department of Biostatistics, University of Michigan, and Associate Director for Research and Methodology, Bureau of the Census
WHAT: Subsample Ignorable Likelihood for Regression Analysis with Missing Data
WHEN: April 15, 2011, 1:00-2:30pm
WHERE: Tisch Bldg. LC-21
ABSTRACT: Two common approaches to regression with missing covariates are complete-case analysis (CC) and ignorable likelihood (IL) methods. We review these approaches, and propose a hybrid class, subsample ignorable likelihood (SSIL) methods, which apply an IL method to the subsample of observations that are complete on one set of variables, but possibly incomplete on others. Conditions on the missing data mechanism are presented under which SSIL gives consistent estimates, but both CC and IL are inconsistent. We motivate the proposed method and apply it to data from the National Health and Nutrition Examination Survey, and illustrate properties of the methods by simulation. Extensions to non-likelihood analyses are also mentioned. (Joint work with Nanhua Zhang)

WHO: Pat Sharkey, NYU Sociology
WHAT: Confronting selection into and out of social settings: Neighborhood change and children's economic outcomes
WHEN: Wednesday, March 23rd, 2011, 10:45am-12noon
WHERE: Kimball Hall (246 Greene St) Room 506W
ABSTRACT: Selection bias continues to be a central methodological problem facing observational research estimating the effects of social settings on individuals. This article develops a method to estimate the impact of change in a particular social setting, the residential neighborhood, that is designed to address non-random selection into a neighborhood and non-random selection out of a neighborhood. Utilizing matching to confront selection into neighborhood environments and instrumental variables to confront selection out of changing neighborhoods, the method is applied to assess the effect of a decline in neighborhood concentrated disadvantage on the economic fortunes of African American children living within changing neighborhoods. Substantive findings indicate that a one standard deviation decline in concentrated disadvantage leads to increases in African American children's adult economic outcomes, but has no effect on educational attainment or health.

WHO: Russ Steel, McGill University
WHAT: Modelling Birthweight in the Presence of Gestational Age Measurement Error - A Semi-parametric Multiple Imputation Model
WHEN: March 2nd, 2011
WHERE: 285 Mercer Floor 3, Conference Room
ABSTRACT:  Gestational age is an important variable in perinatal research, as it is a strong predictor of mortality and other adverse outcomes, and is also a component of measures of fetal growth. However, gestational ages measured using the date of the last menstrual period (LMP) are prone to substantial errors. These errors are apparent in most population-based data sources, which often show such implausible features as a bimodal distribution of birth weight at early preterm gestational ages (≤ 34 weeks) and constant or declining mean birth weight at postterm gestational ages (≥ 42 weeks). These features are likely consequences of errors in gestational age. Gestational age plays a critical role in measurement of outcome (preterm birth, small for gestational age) and is an important predictor of subsequent outcomes. It is important in the development of fetal growth standards. Therefore, accurate measurement of gestational age, or, failing that, a reasonable understanding of the structure of measurement error in the gestational age variable, is critical for perinatal research. In this talk, I will discuss the challenges in adjusting for gestational age measurement error via multiple imputation. In particular, I will emphasize the tension between flexibly modelling the distribution of birthweights within a gestational age and allowing for gestational age measurement error. I will also discuss strategies for incorporating prior information about the measurement error distribution and averaging over uncertainty in the distribution of the birthweights conditional on the true gestational age.

WHO: Professor Jack Buckley, Department of Applied Statistics, Social Science, and Humanities, NYU Steinhardt / PRIISM Center
WHAT: Didactic Talk: Using Multilevel Data to Control for Unobserved Confounders: Fixed and Random Effects Approaches
WHEN: November 3rd, 2010, 10:45AM-12PM
WHERE: 246 Greene Street, 3rd Floor
ABSTRACT: A didactic talk is a lecture on a topic of importance to applied researchers. The presentation will have a greater focus on either teaching the basic properties of a less familiar method or emphasizing aspects of a more familiar methodology that are essential to good practice. The presentation level should be appropriate for faculty working in the quantitative social, behavioral, policy and allied health sciences, as well as their advanced graduate students.

WHO: Guido Imbens, Harvard University, Dept. of Economics
WHEN: Friday, October 29th, 2010, 3:15pm-4:45pm
WHAT: Methods lecture: An Empirical Model for Strategic Network Formation. This talk is co-sponsored with the NYU Department of Economics.
WHERE: 246 Greene Street, 1st Floor Lounge, just south of Waverly.
ABSTRACT: We develop and analyze a tractable empirical model for strategic network formation that can be estimated with data from a single network at a single point in time. We model the network formation as a sequential process where in each period a single randomly selected pair of agents has the opportunity to form a link. Conditional on such an opportunity, a link will be formed if both agents view the link as beneficial to them. They base their decision on their own characteristics, the characteristics of the potential partner, and on features of the current state of the network, such as whether the two potential partners already have friends in common. A key assumption is that agents do not take into account possible future changes to the network. This assumption avoids complications with the presence of multiple equilibria, and also greatly reduces the computational burden of analyzing these models. We use Bayesian Markov chain Monte Carlo methods to obtain draws from the posterior distribution of interest. We apply our methods to a social network of 669 high school students, with, on average, 4.6 friends each. We then use the model to evaluate the effect of an alternative assignment to classes on the topology of the network.

Paper: An Empirical Model for Strategic Network Formation 
This is joint work with Nicholas Christakis, James Fowler, and Karthik Kalyanaraman.

WHO: Pat Shrout, New York University, Dept. of Psychology
WHEN: Wednesday, October 27th, 2010, 10:45am-12noon
WHAT: Brown Bag. Coffee will be provided. This will be an informal discussion of the methodology associated with a work in progress.
WHERE: 246 Greene Street, 3rd floor Conference Room, just south of Waverly.
TOPIC: Pat Shrout, New York University, Dept. of Psychology will present work-in-progress that examines lagged effects of conflict in intimate couples on same-day closeness. The data is derived from daily diaries, and as such is more intensive (dense) than traditional longitudinal data. Pat will discuss open issues arising in model selection, which highlight the tension between model choice, substantive questions, interpretation and causality.

WHO: Jianqing Fan, Frederick L. Moore '18 Professor of Finance and Professor of Statistics, Princeton University
WHAT: Statistics in Society lecture. Forecasting Large Panel Data with Penalized Least-Squares. This talk is also co-sponsored by the Stern IOMS-Statistics Group.
WHEN: September 17, 2010 12:00pm-1pm 
ABSTRACT: Large panel data arise in many diverse fields, such as economics, finance, meteorology, energy demand management and ecology, where spatial-temporal data are collected. Neighborhood correlations allow us to better forecast future outcomes, yet neighborhood selection becomes an important and challenging task. In this talk, we introduce penalized least-squares to select the neighborhood variables that have an impact on the forecasting power. An iterative two-scale approach will be introduced. The inherent error (noise level) will also be estimated in the high-dimensional regression problems, which serves as the benchmark for forecasting errors. The techniques will be illustrated in forecasting the US house price indices at various Core Based Statistical Area (CBSA) levels.

Jianqing Fan is the Frederick L. Moore '18 Professor of Finance and Director of the Committee of Statistical Studies at Princeton University, past president of the Institute of Mathematical Statistics (2006-2009), and president of the International Chinese Statistical Association. He has coauthored two highly regarded books, "Local Polynomial Modeling" (1996) and "Nonlinear Time Series: Parametric and Nonparametric Methods" (2003), and authored or coauthored over 150 articles on computational biology, financial econometrics, semiparametric and non-parametric modeling, statistical learning, nonlinear time series, survival analysis, longitudinal data analysis, and other aspects of theoretical and methodological statistics. He has been consistently ranked as a top 10 highly-cited mathematical scientist since the inception of such a ranking. His published work has been recognized by the 2000 COPSS Presidents' Award, given annually to an outstanding statistician under age 40; the Humboldt Research Award for lifetime achievement in 2006; the Morningside Gold Medal of Applied Mathematics in 2007; a Guggenheim Fellowship in 2009; and election as a fellow of the American Association for the Advancement of Science, the Institute of Mathematical Statistics, and the American Statistical Association.

WHO: Professor Jennifer Hill, Department of Applied Statistics, Social Science, and Humanities, NYU Steinhardt / PRIISM Center
WHAT: Didactic Talk: An introduction to multiple imputation: a more principled missing data solution
WHEN: May 5th, 2010, 10:45AM-12PM
WHERE: 246 Greene Street, 3rd Floor
ABSTRACT: A didactic talk is a lecture on a topic of importance to applied researchers. The presentation will have a greater focus on either teaching the basic properties of a less familiar method or emphasizing aspects of a more familiar methodology that are essential to good practice. The presentation level should be appropriate for faculty working in the quantitative social, behavioral, policy and allied health sciences, as well as their advanced graduate students.

WHO: Ying Lu, Assistant Professor of Applied Statistics, Department of Applied Statistics, Social Science, and Humanities, NYU Steinhardt / PRIISM Center
WHAT: Brown Bag Talk - Variable Selection For Linear Mixed Effect Models
WHEN: March 24th, 2010, 10:50AM-12PM
WHERE: 246 Greene Street, 3rd Floor
ABSTRACT: Mixed effect models are fundamental tools for the analysis of longitudinal data, panel data and cross-sectional data. They are widely used across the social, medical and biological sciences. However, the complex nature of these models has made variable selection and parameter estimation a challenging problem. In this paper, we propose a simple iterative procedure that estimates and selects fixed and random effects for linear mixed models. In particular, we propose to utilize the partial consistency property of the random effect coefficients and select groups of random effects simultaneously via a data-oriented penalty function (the smoothly clipped absolute deviation penalty function). We show that the proposed method is a consistent variable selection procedure and possesses the oracle properties. Simulation studies and a real data analysis are also conducted to empirically examine the performance of this procedure.
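
For readers unfamiliar with the smoothly clipped absolute deviation (SCAD) penalty named in the abstract above, the following minimal sketch evaluates the standard SCAD penalty function for a single coefficient. It illustrates the penalty only, not the speaker's iterative selection procedure; the function name `scad_penalty` is hypothetical, and the default `a = 3.7` is the value commonly suggested in the SCAD literature rather than something stated in the abstract.

```python
def scad_penalty(theta, lam, a=3.7):
    """Evaluate the SCAD penalty at a single coefficient value.

    The penalty grows like the lasso (lam * |theta|) near zero, bends
    quadratically between lam and a * lam, and is constant beyond a * lam,
    so large coefficients are not shrunk further.
    """
    if lam <= 0 or a <= 2:
        raise ValueError("SCAD requires lam > 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        return lam * t                                   # lasso-like region
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))  # quadratic blend
    return lam * lam * (a + 1) / 2                       # flat region


if __name__ == "__main__":
    for value in (0.0, 0.5, 1.0, 2.0, 3.7, 10.0):
        print(value, scad_penalty(value, lam=1.0))
```

The flat region is what distinguishes SCAD from the lasso: once a coefficient is clearly nonzero, the penalty stops increasing, which is the property behind the oracle results mentioned in the abstract.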

WHO: Mark S. Handcock
WHAT: Statistical Methods for Sampling Hidden Networked Populations
WHEN: February 12, 2010, 11:30-12:30pm; Kaufman Management Center, 5-90 
ABSTRACT: Part of the Stern IOMS-Statistics Seminar Series, this talk will provide an overview of probability models and inferential methods for the analysis of data collected using Respondent Driven Sampling (RDS). RDS is an innovative sampling technique for studying hidden and hard-to-reach populations for which no sampling frame can be obtained. RDS has been widely used to sample populations at high risk of HIV infection and has also been used to survey undocumented workers and migrants. RDS solves the problem of sampling from hidden populations by replacing independent random sampling from a sampling frame by a referral chain of dependent observations: starting with a small group of seed respondents chosen by the researcher, the study participants themselves recruit additional survey respondents by referring their friends into the study. As an alternative to frame-based sampling, the chain-referral approach employed by RDS can be extremely successful as a means of recruiting respondents.

Current estimation relies on sampling weights estimated by treating the sampling process as a random walk on a graph, where the graph is the social network of relations among members of the target population.

These estimates are based on strong assumptions allowing the sample to be treated as a probability sample. In particular, the current estimator assumes a with-replacement sample or small sample fraction, while in practice samples are without-replacement, and often include a large fraction of the population. A large sample fraction, combined with different mean nodal degrees for infected and uninfected population members, induces substantial bias in the estimates. We introduce a new estimator which accounts for the without-replacement nature of the sampling process, and removes this bias. We then briefly introduce a further extension which uses a parametric model for the underlying social network to reduce the bias induced by the initial convenience sample.

This is joint work with Krista J. Gile, Nuffield College, Oxford. The research papers used as a basis for this talk can be found on Ms. Gile's website:
"Respondent-Driven Sampling: An Assessment of Current Methodology" (2010). Sociological Methodology forthcoming.
"Modeling Networks from Sampled Data" (2010). Annals of Applied Statistics forthcoming.
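
The random-walk weighting that the abstract above attributes to current RDS estimation can be sketched as follows. Under a random walk on an undirected network, the long-run chance of visiting a person is proportional to that person's degree (number of contacts), so each respondent is weighted by the inverse of their reported degree. This is a minimal sketch of that with-replacement idea only; it is not the new without-replacement estimator introduced in the talk, and the function name and toy data are hypothetical.

```python
def degree_weighted_proportion(outcomes, degrees):
    """Inverse-degree weighted estimate of a population proportion.

    outcomes: 0/1 indicators for each respondent (e.g. infected or not).
    degrees:  each respondent's reported number of network contacts.
    Respondents with many contacts are more likely to be recruited under
    the random-walk approximation, so they receive smaller weights.
    """
    if not outcomes or len(outcomes) != len(degrees):
        raise ValueError("outcomes and degrees must be non-empty and equal length")
    if any(d <= 0 for d in degrees):
        raise ValueError("every respondent must report a positive degree")
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # Toy sample: the two positive cases both have high degree.
    y = [0, 0, 1, 1]
    d = [1, 1, 4, 4]
    print("naive mean:     ", sum(y) / len(y))
    print("degree-weighted:", degree_weighted_proportion(y, d))
```

In the toy data the positive cases have higher degree, so the weighted estimate falls below the naive sample mean. The abstract's point is that this correction rests on the with-replacement approximation and can be biased when the sample is a large fraction of the population.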

WHO: Mark S. Handcock, Department of Statistics, University of California - Los Angeles
WHAT: The fifth PRIISM-organized Statistics in Society lecture, this talk is also co-sponsored by the Stern IOMS-Statistics Group.
WHEN: February 11, 2010, 12:00-1:30pm; 19 University Place, 1st floor lecture hall 
ABSTRACT: In many situations information from a sample of individuals can be supplemented by information from population level data on the relationship of the explanatory variable with the dependent variables. Sources of population level data include a census, vital events registration systems and other governmental administrative record systems. They contain too few variables, however, to estimate demographically interesting models. Thus in a typical situation the estimation is done by using sample survey data alone, and the information from complete enumeration procedures is ignored. Sample survey data, however, are subject to sampling error and bias due to non-response, whereas population level data are comparatively free of sampling error and typically less biased from the effects of non-response.

In this talk we will review statistical methods for the incorporation of population level information and show it can lead to statistically more accurate estimates and better inference. Population level information can be incorporated via constraints on functions of the model parameters. In general the constraints are non-linear, making the task of maximum likelihood estimation more difficult. We present an alternative approach exploiting the notion of an empirical likelihood.

We give an application to demographic hazard modeling by combining panel survey data with birth registration data to estimate annual birth probabilities by parity.

This is joint work with Sanjay Chaudhuri (National University of Singapore) and Michael S. Rendall (RAND Corporation). The talk is based on the research paper "Generalised Linear Models Incorporating Population Level Information: An Empirical Likelihood Based Approach" (2008), Journal of the Royal Statistical Society, B, 70, Part 2, pp. 311-328.

WHO: Michael Sobel, Columbia University
WHAT: Fixed Effects Models in Causal Inference, a work-in-progress that clarifies the role of fixed effects models in causal inference. He will make explicit the assumptions researchers implicitly make when using such models and what is actually being estimated, both of which are commonly misunderstood by those who use this strategy to identify causal effects. Coffee and some alternative beverages will be provided.
WHEN: December 9th, 2009, 12:00-1:30pm; 19 University Place, 1st floor lecture hall  
WHERE: We meet in the 3rd floor conference room in Kimball Hall, which is 246 Greene St., just south of Waverly.

WHO: Dr. Michael Foster, Professor of Maternal and Child Health in the School of Public Health, University of North Carolina, Chapel Hill.
WHAT: Dr. Foster will present the 4th Statistics in Society lecture, entitled: "Does Special Education Actually Work?" This talk will explore the efficacy of current special education policies while highlighting the role of new methods in causal inference in helping to answer it. Jointly sponsored by the Departments of Teaching and Learning and Applied Psychology, and by the Institute for Human Development and Social Change. The lecture will be followed immediately by a reception celebrating the official launch of the PRIISM Center.
WHEN: Thursday, October 1, 2009, 11:00 AM - 2:00 PM
WHERE: Room 900, Kimmel Center
ABSTRACT: This presentation assesses the effect of special education on school dropout (that is, the timing of a significant interruption in schooling) for children at risk for emotional and behavioral disorders (EBD). The analysis assesses the extent to which involvement in special education services raises the likelihood of an interruption in schooling in the presence of time-dependent confounding by aggression. By using a child's observed school interruption time and history of special education and aggression, this strategy for assessing causal effects (which relies on g-estimation) relates the observed timing of school interruption to the counterfactual; that is, what would have occurred had the child never been involved in special education. This analysis involves data on 1089 children collected by the Fast Track project. Subject to important assumptions, our results indicate that involvement in special education services reduces time to school interruption by a factor of 0.64 to 0.93. In conclusion, the efficacy of special education services is questionable, which suggests that more research should be devoted to developing effective school-based interventions for children with emotional and behavioral problems.

WHO: Dr. Michael Greenstone
WHAT: Weather & Death in India: Mechanisms and Implications for Climate Change - This Event is Free and Open to the Public
WHEN: May 5, 2009 4:15pm - 5:30pm
WHERE: NYU Kimmel Center, Room 914 (9th Floor), 60 Washington Square South
ABSTRACT: Is climate change truly a matter of life and death? Join us as acclaimed economist Dr. Michael Greenstone discusses revelatory new research on the impact of variations in weather on well-being in India. The results indicate that high temperatures dramatically increase mortality rates; for example, 1 additional day with a mean temperature above 32° C, relative to a day in the 22° - 24° C range, increases the annual mortality rate by 0.9% in rural areas. This effect appears to be related to substantial reductions in the income of agricultural laborers due to these same hot days. Finally, the estimated temperature-mortality relationship and state-of-the-art climate change projections reveal a substantial increase in mortality due to climate change, which greatly exceeds the expected impact in the US and other developed countries. Co-sponsored by the Global MPH program, the NYU Steinhardt School of Culture, Education and Human Development, and the NYU Environmental Studies program. Presented as part of the ongoing series Statistics in Society, organized by the Steinhardt PRIISM Center.

Michael Greenstone is the 3M Professor of Environmental Economics in the Department of Economics at the Massachusetts Institute of Technology. He is also a Research Associate at the National Bureau of Economic Research (NBER) and a Nonresident Senior Fellow at Brookings. His research is focused on estimating the costs and benefits of environmental quality. He has worked extensively on the Clean Air Act and examined its impacts on air quality, manufacturing activity, housing prices, and infant mortality to assess its costs and benefits. He is currently engaged in a large scale project to estimate the economic costs of climate change. Other current projects include examinations of: the benefits of the Superfund program; the economic and health impacts of indoor air pollution in Orissa, India; individuals' revealed value of a statistical life; the impact of air pollution on infant mortality in developing countries; and the costs of biodiversity.

Greenstone is also interested in the consequences of government regulation, more generally. He is conducting or has conducted research on: the effects of federal antidiscrimination laws on black infant mortality rates; the impacts of mandated disclosure laws on equity markets; and the welfare consequences of state and local subsidies given to businesses that locate within their jurisdictions. Greenstone received a Ph.D. in economics from Princeton University and a BA in economics with High Honors from Swarthmore College.

WHO: Mark Hansen, UCLA
WHAT: Data analysis in an 'expanded field'
WHEN: Thursday, February 12, 2009, 12:00-1:15
WHERE: Warren Weaver Hall, Room 1302

The Center for the Promotion of Research Involving Innovative Statistical Methodology (PRIISM) is delighted to announce the second Statistics in Society lecture for the academic year.

Mark Hansen, a UCLA statistician with joint appointments in Electrical Engineering and Design/Media Art, will be giving a talk that examines the interface between statistics, computing and society, entitled "Data analysis in an 'expanded field'".

Hansen is perhaps best known locally for his work co-creating a current art installation, "Movable Type", in the New York Times Building here in Manhattan. However, his research reaches far beyond this realm, drawing on fields as diverse as information theory, numerical analysis, computer science, and ecology.

For instance, Hansen currently serves as Co-PI for the Center for Embedded Networked Sensing (CENS), an NSF Science and Technology Center that describes itself as "a major research enterprise focused on developing wireless sensing systems and applying this revolutionary technology to critical scientific and societal pursuits. In the same way that the development of the Internet transformed our ability to communicate, the ever decreasing size and cost of computing components is setting the stage for detection, processing, and communication technology to be embedded throughout the physical world and, thereby, fostering both a deeper understanding of the natural and built environment and, ultimately, enhancing our ability to design and control these complex systems."

For an example of how the center's work on "urban sensing" can inform the interaction between society and the environment see

WHO: Andrew Gelman, Professor in the Departments of Statistics and Political Science at Columbia University
WHAT: Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do
WHEN: 10AM, Tuesday, October 14th, 2008
WHERE: 802 Kimmel Center for University Life, 60 Washington Square South
ABSTRACT: Andrew Gelman is a Professor in the Departments of Statistics and Political Science at Columbia University. His new book, "Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do," is receiving tremendous critical praise. Gelman has recently been featured on several radio programs, including WNYC's Leonard Lopate Show. His talk will draw from the book.