The CTS and Data Scientists in Training

Sep 15, 2023

For the third year in a row, the California Teachers Study (CTS) has had the pleasure of working with the University of Southern California (USC) Data Science program. Data Science is the practice of extracting meaning from data. As part of the USC Data Science program, Master of Science (MS) students are given the opportunity to apply the data science principles they have learned to a real-world health dataset.

This summer, three USC students applied their data science knowledge to CTS data by training machine learning models. Machine learning (ML) refers to computer systems that learn and adapt without explicit instructions: rather than following step-by-step human commands, these systems use algorithms and statistical models to learn from data. Machine learning models can help researchers identify patterns or trends they might not otherwise recognize.

The three USC students were all given the same two tasks: use CTS data to 1) determine whether they could identify predictors for short-term death after hospitalization, and 2) assess which ML models best identified those predictors. For this project, short-term death was defined as death during hospitalization or within 30 days of hospital discharge.
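To make the outcome definition concrete, here is a minimal Python sketch of how a short-term death label could be derived from dates. The function name, its parameters, and the choice to count day 30 as inside the window are illustrative assumptions, not the CTS data schema or the students' code.

```python
from datetime import date, timedelta
from typing import Optional


def is_short_term_death(
    admission: date,
    discharge: date,
    death: Optional[date],
    window_days: int = 30,  # assumption: day 30 after discharge is included
) -> bool:
    """Return True if death occurred during the hospital stay or within
    `window_days` days of discharge (hypothetical field names)."""
    if death is None:
        # No recorded death, so not a short-term death.
        return False
    return admission <= death <= discharge + timedelta(days=window_days)


# Hypothetical example: admitted Jan 1, discharged Jan 10, died Jan 25.
print(is_short_term_death(date(2020, 1, 1), date(2020, 1, 10), date(2020, 1, 25)))
```

A death recorded well after the 30-day window, or no recorded death at all, would be labeled False under this sketch.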

To complete their work, students had to move through the entire AI/ML data science lifecycle using the following steps:

examine and understand a real-world dataset,
determine which variables they would use in their analysis,
split their dataset into training and test sets (this is done so that the machine learning algorithms have a training set of data used to learn and then a separate test set of data to evaluate the effectiveness of the model),
decide which ML models to use,
run the models, and
evaluate the performance of their models using standard ML metrics.
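The split, train, and evaluate steps above can be sketched in a few lines of Python. This is an illustration under loud assumptions: the data are synthetic (not CTS data), the single feature is a hypothetical length of stay in days, and a simple threshold rule stands in for the real ML models the students used.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Pick the cut-off on the feature that maximizes training accuracy."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: sum((x >= t) == y for x, y in train))


def evaluate(threshold, test):
    """Compute accuracy, precision, and recall on the held-out test set."""
    tp = sum(1 for x, y in test if x >= threshold and y)
    fp = sum(1 for x, y in test if x >= threshold and not y)
    fn = sum(1 for x, y in test if x < threshold and y)
    tn = sum(1 for x, y in test if x < threshold and not y)
    return {
        "accuracy": (tp + tn) / len(test),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Synthetic data: (length_of_stay_days, died) pairs, where longer stays
# are made more likely to end in death. Purely illustrative.
rng = random.Random(42)
rows = []
for _ in range(400):
    stay = rng.randint(1, 30)
    rows.append((stay, rng.random() < stay / 40))

train, test = train_test_split(rows)   # model learns only from `train`
threshold = fit_threshold(train)       # "training" the stand-in model
metrics = evaluate(threshold, test)    # scored only on unseen `test` rows
print(threshold, metrics)
```

Because the model never sees the test rows during fitting, the metrics estimate how well it would perform on new data rather than how well it memorized the training set.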

Independently, all three students’ models identified some common risk factors for short-term death after hospitalization: age at hospitalization, length of hospital stay, and type of diagnosis were each associated with risk of short-term death. All three students found the Random Forest model to be a top-performing ML model for identifying these risk factors.

This chart by Misha Khan, a USC Master of Science candidate, shows the different models applied to CTS data and the accuracy of each model.

In addition to highlighting the readiness of CTS data for use in machine learning projects, these student projects also identified some interesting potential findings that could merit future exploration. We look forward to seeing what these talented students do next!