Import patient CSV

Feature coverage
Molecular fingerprint
Legend: Strong low, Low, Reference, High, Strong high
Key drivers
Legend: Low, Intermediate, High

Model outputs

Suggested actions

Clinical handoff

Biomarker explorer

Molecular fingerprint

Modelling workflow

01 Data collection

Three stage-specific GEO cohorts were used so each model matched a distinct clinical question: diagnosis, next-visit activity, and treatment response (Barrett et al., 2013).

02 Pre-processing and EDA

Expression and metadata were cleaned before modelling. Diagnosis RPKM values were transformed as log2(RPKM + 1); treatment microarray values were filtered for annotated, non-control probes and collapsed to one probe per gene (Bedre, 2023).
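The two transformations above can be sketched as follows. This is a minimal illustration, not the project's pipeline: the function names, the `control_probes` argument, and the toy probe identifiers are hypothetical, and the rule for choosing the representative probe (highest mean expression) is an assumption, since the text does not state which probe was kept per gene.

```python
import math
from statistics import mean


def log2_rpkm(rpkm):
    """Return log2(RPKM + 1); the +1 keeps zero counts finite (they map to 0)."""
    return math.log2(rpkm + 1.0)


def collapse_probes(probe_rows, probe_to_gene, control_probes=frozenset()):
    """Keep one probe per gene after dropping unannotated and control probes.

    probe_rows:     dict probe_id -> list of expression values across samples
    probe_to_gene:  dict probe_id -> gene symbol (None or "" if unannotated)
    control_probes: set of probe ids to exclude (hypothetical argument)

    Assumption: the probe with the highest mean expression represents the gene.
    """
    best = {}  # gene symbol -> id of the probe currently representing it
    for probe, values in probe_rows.items():
        gene = probe_to_gene.get(probe)
        if not gene or probe in control_probes:
            continue  # skip unannotated and control probes
        if gene not in best or mean(values) > mean(probe_rows[best[gene]]):
            best[gene] = probe
    return {gene: probe_rows[probe] for gene, probe in best.items()}


# Toy example: two probes map to the same gene, one probe is unannotated,
# and one is a control probe.
rows = {"p1": [1, 2], "p2": [5, 6], "p3": [9, 9], "c1": [7, 7]}
annotation = {"p1": "IFI44", "p2": "IFI44", "p3": None, "c1": "CTRL"}
collapsed = collapse_probes(rows, annotation, control_probes={"c1"})
```

In the toy example only `p2` survives for the gene, because it has the higher mean of the two annotated probes.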

  • Checked sample-level expression distributions and outcome balance.
  • Derived longitudinal t-to-t+1 labels for progression.
  • Standardised or imputed model inputs inside training workflows where required (NIST, n.d.).
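One way to derive t-to-t+1 labels is to pair the features at each visit with the outcome at the same patient's next recorded visit. The sketch below assumes a simple tuple layout and treats consecutive recorded visits as t and t+1 regardless of the time gap; both are illustrative assumptions rather than details taken from the project.

```python
from collections import defaultdict


def next_visit_pairs(visits):
    """Pair each visit t with the outcome observed at the patient's visit t+1.

    visits: iterable of (patient_id, visit_index, features, outcome) tuples.
    Returns a list of (patient_id, visit_index, features, next_outcome).
    A patient's last visit has no t+1 outcome, so it yields no pair.
    """
    by_patient = defaultdict(list)
    for patient_id, visit_index, features, outcome in visits:
        by_patient[patient_id].append((visit_index, features, outcome))

    pairs = []
    for patient_id, rows in by_patient.items():
        rows.sort(key=lambda row: row[0])  # order visits within the patient
        for (t, feats, _), (_, _, next_outcome) in zip(rows, rows[1:]):
            pairs.append((patient_id, t, feats, next_outcome))
    return pairs


# Toy example: patient A has three visits (given out of order), B has one.
toy_visits = [
    ("A", 2, {"x": 2}, "active"),
    ("A", 1, {"x": 1}, "inactive"),
    ("A", 3, {"x": 3}, "inactive"),
    ("B", 1, {"x": 9}, "active"),
]
labelled = next_visit_pairs(toy_visits)
```

Patient B contributes nothing because a single visit has no following outcome to predict.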

03 Feature selection

Feature selection was performed before final modelling to reduce high-dimensional expression matrices to interpretable panels and to avoid leakage from external or held-out data.

  • For diagnosis and treatment, the expression matrices were pre-filtered to highly variable, expressed genes; RF-Gini ranking, PCA checks, Boruta, and biological curation were then combined (Breiman, 2001; Kursa & Rudnicki, 2010; Liaw & Wiener, 2002).
  • For progression, expression probes were selected from the training cohort only and direct leakage variables were removed; immune, clinical, treatment, temporal, engineered, and gene features were then combined.
  • Final progression features were ranked with random-forest importance on training data only.
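The leakage-avoidance idea in this step can be illustrated with a small sketch: the panel is chosen from training samples only, and held-out data are merely subset to that panel. Ranking by variance and the cut-off `k` are assumptions made for illustration; the sketch does not reproduce the RF-Gini, Boruta, or curation steps.

```python
from statistics import pvariance


def top_variable_genes(train_matrix, k):
    """Return the k gene names with the largest variance across training samples.

    train_matrix: dict gene -> list of expression values (training samples only).
    """
    ranked = sorted(
        train_matrix,
        key=lambda gene: pvariance(train_matrix[gene]),
        reverse=True,
    )
    return ranked[:k]


def apply_panel(matrix, panel):
    """Subset any matrix (training or held-out) to a panel chosen on training data."""
    return {gene: matrix[gene] for gene in panel}


# Toy example: the held-out matrix never influences which genes are chosen.
train = {"G1": [1, 1, 1], "G2": [0, 5, 10], "G3": [2, 3, 4]}
held_out = {"G1": [9], "G2": [8], "G3": [7]}
panel = top_variable_genes(train, k=2)
held_out_panel = apply_panel(held_out, panel)
```

Keeping selection and application in separate functions makes it harder to rank features on data that will later be used for evaluation.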

04 Modelling

Candidate models were compared within each stage rather than forcing one algorithm across all tasks. The deployed app reads the selected R model objects and sends the frontend CSV payload to the backend for inference; probability displays are summarised in Predicted probability bands.

Diagnosis compared limma signature, random forest, and LASSO. Progression compared elastic net, random forest, and GBM. Treatment response compared limma signature, random forest, LASSO, elastic net, and linear SVM. See Validation metrics below.

05 Performance evaluation

Evaluation prioritised metrics that are more informative than raw accuracy for imbalanced clinical cohorts. Diagnosis and treatment used stratified 5-fold cross-validation; progression used patient-level cross-validation for tuning and one independent external test on GSE49454 (Fawcett, 2006; Brodersen et al., 2010). See Validation metrics and Metric notes below.
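Patient-level cross-validation means that all visits from one patient fall in the same fold, so a model is never validated on a patient it was trained on. A minimal sketch, assuming a simple round-robin assignment of patients to folds (the project's actual fold construction is not described here, and this version does not stratify by outcome):

```python
def patient_level_folds(patient_ids, n_folds=5):
    """Assign each sample to a fold so all samples of a patient share one fold.

    patient_ids: list with one patient identifier per sample (row).
    Returns a list of fold indices, one per sample.
    """
    unique_patients = sorted(set(patient_ids))
    # Round-robin over sorted patients keeps the assignment deterministic.
    fold_of = {p: i % n_folds for i, p in enumerate(unique_patients)}
    return [fold_of[p] for p in patient_ids]


def split(fold_ids, held_out):
    """Return (train_rows, validation_rows) as lists of sample indices."""
    train = [i for i, f in enumerate(fold_ids) if f != held_out]
    valid = [i for i, f in enumerate(fold_ids) if f == held_out]
    return train, valid


# Toy example: seven visits from four patients, two folds.
ids = ["P1", "P1", "P2", "P3", "P3", "P3", "P4"]
folds = patient_level_folds(ids, n_folds=2)
train_rows, valid_rows = split(folds, held_out=0)
train_patients = {ids[i] for i in train_rows}
valid_patients = {ids[i] for i in valid_rows}
```

In the toy example `train_patients` and `valid_patients` are disjoint, which is the property that distinguishes patient-level splitting from splitting individual visits at random.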

Validation metrics

Columns: Outcome, Model, Validation, AUROC, Macro F1, Balanced acc., Accuracy, MCC
Legend: Limited (< 0.70), Adequate (0.70-0.84), Strong (≥ 0.85)

Metric notes

  • AUROC summarises threshold-free ranking: it is the probability that a randomly chosen positive case is scored above a randomly chosen negative case (Fawcett, 2006).
  • Macro F1 averages class-wise F1 scores so minority and majority classes contribute evenly, while balanced accuracy averages class-wise recall to reduce majority-class bias (Sokolova & Lapalme, 2009; Brodersen et al., 2010).
  • Accuracy is the overall proportion of correct predictions and can look optimistic in imbalanced cohorts; MCC is a correlation-like summary of the full confusion matrix, where +1 is perfect agreement, 0 is no better than chance, and -1 is total disagreement (Matthews, 1975; Chicco & Jurman, 2020).
  • The legend bands follow common discrimination heuristics: values below 0.70 are treated as limited, 0.70-0.84 as adequate, and 0.85 or higher as strong. These are visual interpretation bands, not clinical acceptance thresholds (Hosmer et al., 2013).
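The metrics above can be written out for the binary case as a short sketch. The function names are illustrative and the labels are assumed to be coded 0/1; the formulas are the standard definitions, and `band` encodes the legend cut-offs stated above.

```python
import math


def _safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, macro F1 and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    recall_pos = _safe_div(tp, tp + fn)
    recall_neg = _safe_div(tn, tn + fp)
    f1_pos = _safe_div(2 * tp, 2 * tp + fp + fn)
    f1_neg = _safe_div(2 * tn, 2 * tn + fn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, len(pairs)),
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
        "macro_f1": (f1_pos + f1_neg) / 2,
        "mcc": _safe_div(tp * tn - fp * fn, mcc_den),
    }


def auroc(y_true, scores):
    """Share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def band(value):
    """Map a metric value to the legend band used in the table."""
    if value < 0.70:
        return "Limited"
    if value < 0.85:
        return "Adequate"
    return "Strong"


# Toy imbalanced cohort: 8 negatives, 2 positives, model predicts all negative.
always_negative = binary_metrics([0] * 8 + [1] * 2, [0] * 10)
```

For the toy cohort, accuracy is 0.8 while balanced accuracy is 0.5 and MCC is 0.0, which shows why raw accuracy can look optimistic when one class dominates.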

Predicted probability bands

Biomed 09 - Team Members

Manna Berry: Development of Progression Model & Assistance with Backend
mber0347@uni.sydney.edu.au, Faculty of Engineering J12, The University of Sydney, NSW 2006
Lezhi Lin: Development of App Frontend & Presentation Slides
llin0935@uni.sydney.edu.au, School of Mathematics and Statistics F07, The University of Sydney, NSW 2006 Australia
Udit Samant: Development of Diagnosis Model & General App Backend
usam6049@uni.sydney.edu.au, School of Computer Science J12, The University of Sydney, NSW 2006 Australia
Hadi Shafat: Interdisciplinary Aspects Research & Assistance with Backend
hsha0153@uni.sydney.edu.au, School of Computer Science J12, The University of Sydney, NSW 2006 Australia
Minh Hieu Tran: Assistance with Initial Data Analysis
mtra0191@uni.sydney.edu.au, School of Computer Science J12, The University of Sydney, NSW 2006 Australia
Jillian Zhao: Development of Treatment Model & Assistance with Backend & Background Research
yzha0369@uni.sydney.edu.au, School of Computer Science J12, The University of Sydney, NSW 2006 Australia

Acknowledgment

We acknowledge the Gadigal of the Eora Nation, the Traditional Custodians of the land on which the University of Sydney stands, and pay our respects to Elders past and present.

This prototype is submitted in partial fulfilment of the assessment requirements for DATA3888 Data Science Capstone at The University of Sydney. Our work builds on that of the open-source maintainers across R, Bioconductor, and the modelling libraries used here, and on the DATA3888 teaching team's project structure, feedback, and course support.

We are extremely grateful to our supervisors, Dr. Andy Tran and Elyna Lin, for their guidance, thoughtful feedback, and steady support in workshops and consultations throughout the project.

We acknowledge the original data contributors and study participants behind the public GEO cohorts. Their shared expression and clinical metadata made the modelling, validation, and patient-level demonstrations possible.

We acknowledge the use of AI-assisted tools to support drafting, code iteration, interface refinement, and debugging. All AI-assisted outputs were reviewed, edited, and validated by the team, who remain responsible for the final analysis, design decisions, and implementation.

References

  1. Apple Inc. (n.d.). Swift Charts. Apple Developer Documentation. Retrieved May 23, 2026, from https://developer.apple.com/documentation/charts
  2. Banchereau, R., Hong, S., Cantarel, B., Baldwin, N., Baisch, J., Edens, M., Cepika, A.-M., Acs, P., Turner, J., Anguiano, E., Vinod, P., Kahn, S., Obermoser, G., Blankenship, D., Wakeland, E., Nassi, L., Gotte, A., Punaro, M., Liu, Y.-J., ... Pascual, V. (2016). Personalized immunomonitoring uncovers molecular networks that stratify lupus patients. Cell, 165(3), 551-565. https://doi.org/10.1016/j.cell.2016.03.008
  3. Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., Marshall, K. A., Phillippy, K. H., Sherman, P. M., Holko, M., Yefanov, A., Lee, H., Zhang, N., Robertson, C. L., Serova, N., Davis, S., & Soboleva, A. (2013). NCBI GEO: Archive for functional genomics data sets-update. Nucleic Acids Research, 41(D1), D991-D995. https://doi.org/10.1093/nar/gks1193
  4. Bedre, R. (2023). RNA-seq expression units: RPKM, FPKM, and TPM. https://www.reneshbedre.com/blog/expression_units.html
  5. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
  6. Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. (2010). The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition (pp. 3121-3124). IEEE. https://doi.org/10.1109/ICPR.2010.764
  7. Brown, G. R., Hem, V., Katz, K. S., Ovetsky, M., Wallin, C., Ermolaeva, O., Tolstoy, I., Tatusova, T., Pruitt, K. D., Maglott, D. R., & Murphy, T. D. (2015). Gene: A gene-centered information resource at NCBI. Nucleic Acids Research, 43(D1), D36-D42. https://doi.org/10.1093/nar/gku1055
  8. Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, Article 6. https://doi.org/10.1186/s12864-019-6413-7
  9. Chiche, L., Jourde-Chiche, N., Whalen, E., Presnell, S., Gersuk, V., Dang, K., Anguiano, E., Quinn, C., Burtey, S., Berland, Y., Kaplanski, G., Harle, J.-R., Pascual, V., & Chaussabel, D. (2014). Modular transcriptional repertoire analyses of adults with systemic lupus erythematosus reveal distinct type I and type II interferon signatures. Arthritis & Rheumatology, 66(6), 1583-1595. https://doi.org/10.1002/art.38628
  10. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. https://doi.org/10.1007/BF00994018
  11. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874. https://doi.org/10.1016/j.patrec.2005.10.010
  12. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232. https://doi.org/10.1214/aos/1013203451
  13. Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22. https://doi.org/10.18637/jss.v033.i01
  14. Harrison, P. W., Amode, M. R., Austine-Orimoloye, O., & others. (2024). Ensembl 2024. Nucleic Acids Research, 52(D1), D891-D899. https://doi.org/10.1093/nar/gkad1049
  15. Hosmer, D. W., Jr., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Wiley. https://doi.org/10.1002/9781118548387
  16. Hung, T., Pratt, G. A., Sundararaman, B., Townsend, M. J., Chaivorapol, C., Bhangale, T., Graham, R. R., Ortmann, W., Criswell, L. A., Yeo, G. W., & Behrens, T. W. (2015). The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression. Science, 350(6259), 455-459. https://doi.org/10.1126/science.aac7442
  17. Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-13. https://doi.org/10.18637/jss.v036.i11
  18. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22. https://journal.r-project.org/articles/RN-2002-022/
  19. López-Domínguez, R., Villatoro-García, J. A., Marañón, C., Goldman, D., Petri, M., Carmona-Sáez, P., Alarcón-Riquelme, M., & Toro-Domínguez, D. (2024). Immune and molecular landscape behind non-response to Mycophenolate Mofetil and Azathioprine in lupus nephritis therapy [Preprint]. Research Square. https://doi.org/10.21203/rs.3.rs-3783877/v1
  20. Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2), 442-451. https://doi.org/10.1016/0005-2795(75)90109-9
  21. Microbe Notes. (2023). RNA sequencing (RNA-Seq): Principle, steps, types, uses. https://microbenotes.com/rna-sequencing-principle-steps-types-uses/
  22. NIST (National Institute of Standards and Technology). (n.d.). Standardize. Dataplot reference manual. Retrieved May 24, 2026, from https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/standard.htm
  23. Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47. https://doi.org/10.1093/nar/gkv007
  24. Seal, R. L., Braschi, B., Gray, K. A., Jones, T. E. M., Tweedie, S., Haim-Vilmovsky, L., & Bruford, E. A. (2023). Genenames.org: The HGNC resources in 2023. Nucleic Acids Research, 51(D1), D1003-D1009. https://doi.org/10.1093/nar/gkac888
  25. Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1), Article 3. https://doi.org/10.2202/1544-6115.1027
  26. Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427-437. https://doi.org/10.1016/j.ipm.2009.03.002
  27. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
  28. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320. https://doi.org/10.1111/j.1467-9868.2005.00503.x