Improving Measurement in Longitudinal Studies of Aging
This is the newest thematic area of NIMLAS, first introduced in 2025. It focuses on methods for measuring and reducing measurement error in longitudinal studies of aging. Additional content is forthcoming, along with opportunities to join this new working group.
Current Critical Directions for Future Research on Improving Measurement in Longitudinal Studies of Aging:
- Interview Mode: What are the optimal modes of data collection for longitudinal studies of aging? How does mode affect measurement error as well as other errors in longitudinal survey data collection? How do mixed-mode designs improve data quality?
- Measurement Error: What is the prevalence of measurement error unique to longitudinal studies (e.g., learning effects/conditioning), and what are the best methods for measuring and reducing this measurement error? Is the measurement error systematic or variable? What are the sources of reduced reliability in self-administered items on the web or using IVR technology?
- Measuring Cognition: What are the best practices for obtaining the highest quality measures of cognition for different aging subpopulations?
- Mode Effects and Cognition: What data collection modes are optimal for measuring cognition, and which cognitive measures are optimal for which modes? How can researchers modify cognition measurement tools to best fit the mode? How can researchers adjust for mode effects when analyzing cognition?
- Differential Effectiveness of Measurement: Are the instruments that we use to collect survey measures (or passive measures, using new measurement technologies) equally accepted, effective, and suitable for measuring different subpopulations, resulting in broad representation? Do we need different measurement instruments for different subgroups / languages, based on cognitive interviewing, pre-testing, knowledge of technology, etc.?
- Rapid Assessment: How can data best be collected to address urgent public health needs? What are the options for population-based data collection under unprecedented circumstances faced by the population at large?
Bibliography
All bibliography entries below are tagged with colored shapes corresponding to the major thematic research areas of NIMLAS. Specific critical topics for future research that the particular product within each area is addressing are provided in text next to the colored shapes.
Data collection methods for improving representation
Addressing increasing attrition rates
New measurement technologies
Consent to additional data collection
Improving measurement in longitudinal studies of aging
Diemer, M. A., Frisby, M. B., Marchand, A. D., & Bardelli, E. (2024)
Illustrating and enacting a Critical Quantitative approach to measurement with MIMIC models
Journal of Research on Educational Effectiveness, 1–24. doi.org/10.1080/19345747.2024.2391774
Summary
This paper provides a how-to guide for planning, implementing, and evaluating the MIMIC (Multiple Indicators, Multiple Causes) method in diverse populations – in this case, large samples of Black and white respondents from the MADICS study. The MIMIC approach can also probe for differential item functioning (DIF) in longitudinal measures (i.e., temporal invariance). MIMIC models afford powerful claims about measurement – particularly in terms of biased items – and are relatively simple to specify and test. The authors argue that MIMIC models are sorely underutilized and serve important roles in ensuring sound and fair measurement. To increase their use, the tutorial carefully explains how to specify, interpret, and evaluate MIMIC models, and provides sample code in R (lavaan) and Mplus. MIMIC models are explained in accessible “plain English” and Greek (notation), with an OSF folder providing annotated code and output.
Measurement Error
Domingue, B. W., McCammon, R. J., West, B. T., Langa, K. M., Weir, D. R., & Faul, J. (2023)
The Mode Effect of Web-Based Surveying on the 2018 U.S. Health and Retirement Study Measure of Cognitive Functioning
The Journals of Gerontology: Series B, 78(9), 1466–1473. doi.org/10.1093/geronb/gbad068
Summary
This study examined the effect of mode on respondent performance on cognitive tests, where mode (web vs. telephone) was randomly assigned. Those assigned to the web mode scored higher than those assigned to the telephone mode, particularly on the Serial 7 task and numeracy items. The authors recommend a mode-dependent scoring system for classifying respondents as cognitively impaired but not demented.
Mode Effects and Cognition, Measuring Cognition
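For context, the Serial 7 task asks respondents to subtract 7 from 100 five times. A common scoring convention in HRS-style instruments (sketched below as an assumption, not this paper's exact algorithm) credits each response that equals the previous given response minus 7, so one slip does not cascade through later answers.

```python
def serial7_score(responses, start=100):
    """Score the Serial 7 subtraction task (0-5 points). Each response
    earns a point if it equals the previous *given* response minus 7,
    so an early mistake does not penalize later correct subtractions."""
    score, prev = 0, start
    for r in responses[:5]:       # only the first five responses count
        if r == prev - 7:
            score += 1
        prev = r                  # continue from what was actually said
    return score
```

For example, the sequence 93, 85, 78, 71, 64 scores 4: only the second response (85 instead of 86) is wrong.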
Gatz, M., Schneider, S., Meijer, E., Darling, J. E., Orriens, B., Liu, Y., & Kapteyn, A. (2022)
Identifying Cognitive Impairment Among Older Participants in a Nationally Representative Internet Panel
The Journals of Gerontology: Series B, 78(2), 201–209. doi.org/10.1093/geronb/gbac172
Summary
Cognitive impairment is a major health issue affecting many older adults, making it imperative to have good indicators of cognitive functioning. This study shows that cognitive measures administered by web and phone in a nationally representative internet panel are useful for measuring cognitive functioning, allowing for the development of a cognitive impairment score.
Measuring Cognition
Kumari, M., Andrayas, A., Al Baghal, T., Burton, J., Crossley, T. F., Kerry, S. J., Parkington, D. A., Koulman, A., & Benzeval, M. (2023)
A Randomised Study of Nurse Collected Venous Blood and Self-Collected Dried Blood Spots for the Assessment of Cardiovascular Risk Factors in the Understanding Society Innovation Panel
Scientific Reports, 13, 13008. doi.org/10.1038/s41598-023-39674-6
Summary
This study examined whether participants in the U.K. Understanding Society Innovation Panel were willing to provide a blood sample in different settings (randomly assigned to (1) nurse collection of dried blood spots (DBS), (2) nurse collection of a blood sample by venepuncture, or (3) self-collection of DBS), and how the resulting cardiovascular risk biomarkers compared. Although willingness was lowest for self-collected DBS, the demographic characteristics of participants in the self-collection mode did not differ from those in the nurse collection modes. Further, clinical biomarker information relevant to cardiovascular disease risk did not differ between the venepuncture blood samples and the DBS. This demonstrates that DBS collection offers acceptable measures of clinically relevant biomarkers, enabling the calculation of population levels of cardiovascular disease risk.
Differential Effectiveness of Measurement, Measurement Error
Nichols, E., Gross, A. L., Zhang, Y. S., Meijer, E., Hayat, S., Steptoe, A., Langa, K. M., & Lee, J. (2024)
Considerations for the use of the Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE) in cross-country comparisons of cognitive aging and dementia
Alzheimer’s & Dementia, 20, 4635–4648. https://doi.org/10.1002/alz.13895
Summary
This study compares the performance of the IQCODE in capturing objective cognition across the U.S., England, and India. It reports strong associations between the IQCODE and objective cognition at low levels of cognitive functioning in the U.S. and England. In India, however, the IQCODE was less sensitive to impairments at the lowest levels of cognitive functioning, particularly among those with no education. Informant characteristics may differentially affect informant reports across countries. Researchers should consider country-specific adjustments to IQCODE scoring based on informant characteristics to improve cross-national comparability.
Measuring Cognition, Differential Effectiveness of Measurement
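As background on the instrument: the IQCODE asks an informant to rate change in the respondent's everyday cognition on a 1-5 scale (3 = no change, above 3 = decline), and the score is conventionally the item mean. The sketch below shows that common short-form convention only; item counts, cutoffs, and the country-specific adjustments the paper discusses vary by version.

```python
from statistics import mean


def iqcode_score(ratings):
    """Conventional IQCODE score: the mean of informant ratings, each on
    a 1-5 scale where 3 means 'no change' and higher values mean decline.
    This is the common short-form convention, not a study-specific rule."""
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 5")
    return mean(ratings)
```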
Nichols, E., Jones, R. N., Gross, A. L., Hayat, S., Zaninotto, P., & Lee, J. (2024)
Development and assessment of analytic methods to improve the measurement of cognition in longitudinal studies of aging through the use of substudies with comprehensive neuropsychological testing
Alzheimer’s & Dementia, 20, 7024–7036. https://doi.org/10.1002/alz.14175
Summary
Regression modeling or confirmatory factor analysis (CFA) can be used to incorporate information from substudies with comprehensive neuropsychological testing into measures of cognition in broader aging surveys. Compared to a gold-standard measure based on comprehensive neuropsychological testing, both approaches had lower mean squared error than existing comparison approaches. Associations with example risk factors were similar across all approaches, though estimated standard errors were most accurate for the CFA models. The similarity across approaches may be due to the brevity of the available cognitive assessments in the example dataset.
Measuring Cognition, Differential Effectiveness of Measurement
Nichols, E. & Lee, J. (2024)
Considerations around the measurement of cognition in large-scale cross-national surveys: Lessons from the Health and Retirement International Network of Surveys (HRS INS) and the Harmonized Cognitive Assessment Protocol (HCAP)
CESR-Schaeffer Working Paper No. 2024-012, http://dx.doi.org/10.2139/ssrn.4986761
Summary
This paper presents key lessons on cognitive assessment from the international collaboration of aging studies. It discusses the challenges of administering cognitive tests across populations with different cultural and linguistic backgrounds, the importance of maintaining consistency across time and studies, the feasibility of test implementation in both high-income and low- and middle-income countries, and the value of comprehensive cognitive batteries for improved measurement precision.
Measuring Cognition, Differential Effectiveness of Measurement
Nichols, E., Markot, M., Gross, A. L., Jones, R. N., Meijer, E., Schneider, S., & Lee, J. (2025)
The Added Value of Metadata on Test Completion Time for the Quantification of Cognitive Functioning in Survey Research
Journal of the International Neuropsychological Society, 1-10. doi.org/10.1017/S1355617724000742.
Summary
This study examined the relationship between response times on cognitive tests in computerized in-person interviews and cognitive performance. Nonlinear associations between response time and cognitive functioning were reported after adjusting for traditional cognitive test scores. The results indicate that response times from cognitive testing may contain important information on cognition not captured by traditional scoring, and incorporating this information has the potential to improve existing estimates of cognitive functioning.
Measuring Cognition, Paradata
Sanders, S., Schofield, L. S., Schumm, L. P., & Waite, L. (2025)
Measuring Cognitive Function and Cognitive Decline With Response Time Data in the National Social Life, Health, and Aging Project
The Journals of Gerontology: Series B, Psychological sciences and social sciences, 80(Supplement_1), S66–S74. doi.org/10.1093/geronb/gbae037
Summary
Using data on the response times to standard cognition questions in the National Social Life, Health, and Aging Project, this study examined the relationship between response time and the Montreal Cognitive Assessment (MoCA). The results show that response time predicted current as well as future MoCA scores. This predictive power varied by race and age but not by gender.
Measuring Cognition, Paradata
Schneider, S., Junghaenel, D. U., Meijer, E., Stone, A. A., Orriens, B., Jin, H., Zelinski, E. M., Lee, P-J., Hernandez, R., & Kapteyn, A. (2023)
Using Item Response Times in Online Questionnaires to Detect Mild Cognitive Impairment
The Journals of Gerontology: Series B, 78(8), 1278–1283. doi.org/10.1093/geronb/gbad043
Summary
This study examined the utility of response times in online surveys for discriminating respondents’ cognitive health. Specifically, it analyzed response times from 1,053 items across 37 online surveys administered over 6.5 years, together with cognitive health measured at the end of this period, in a multilevel location-scale model. Both average response times and fluctuations in response times were associated with subsequent cognitive health, suggesting that response times on survey items may be a potential indicator of cognitive impairment.
Measuring Cognition, Paradata
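A location-scale model uses two person-level quantities: typical response speed (location) and variability in speed (scale). A minimal descriptive sketch of those two features, computed from per-survey response times, is below; it is an illustration of the concept, not the paper's multilevel estimation.

```python
from statistics import mean, stdev


def rt_features(times_by_survey):
    """Person-level response-time features in the spirit of a
    location-scale model: 'location' is the person's average response
    time across surveys, 'scale' is the between-survey fluctuation
    (SD of the per-survey mean times). times_by_survey is a list of
    lists, one inner list of item response times per survey."""
    survey_means = [mean(ts) for ts in times_by_survey]
    return mean(survey_means), stdev(survey_means)
```

In the full model, both quantities are estimated jointly with random effects rather than computed descriptively, which properly separates measurement noise from true fluctuation.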
Schneider, S., Junghaenel, D. U., Zelinski, E. M., Meijer, E., Stone, A. A., Langa, K. M., & Kapteyn, A. (2021)
Subtle mistakes in self-report surveys predict future transition to dementia
Alzheimer’s & Dementia (Amsterdam, Netherlands), 13(1), e12252. doi.org/10.1002/dad2.12252
Summary
This study examined the relationship between the errors respondents make in completing survey interviews (e.g., implausible responses, skipped questions) and subsequent incident dementia using Health and Retirement Study data. All response error variables showed an independent relationship with dementia, and the relationship was stronger for those who were younger and cognitively normal at baseline.
Measuring Cognition, Measurement Error, Paradata
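The response errors the study draws on can be operationalized as simple flags computed from the answer record. The sketch below counts two such flags, skipped questions and out-of-range values; the plausibility range is illustrative, not the study's actual rule set.

```python
def response_error_flags(answers, plausible_range):
    """Count two simple kinds of response error: skipped questions
    (recorded here as None) and implausible answers falling outside a
    given (low, high) range. The range is an illustrative assumption."""
    low, high = plausible_range
    skipped = sum(1 for a in answers if a is None)
    implausible = sum(
        1 for a in answers if a is not None and not (low <= a <= high)
    )
    return {"skipped": skipped, "implausible": implausible}
```

For example, with a plausible range of 0-120 for an age-like item, the answers [5, None, 300, 7] yield one skip and one implausible value.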
Stopczynski, A., Sekara, V., Sapiezynski, P., Cuttone, A., Madsen, M. M., Larsen, J. E., & Lehmann, S. (2014)
Measuring large scale social networks with high resolution
PLOS ONE, 9(4), e95978. doi.org/10.1371/journal.pone.0095978
Summary
Bluetooth and Wi-Fi networks can be very useful for collecting information about social networks within a specific location and could be used to map connections within aging populations residing in assisted living facilities. These social network data can also be linked to relevant health data.
Social Network Measurement