Scaling OLGA and Gliph2 for Compare workflow #69
Closed
Description
The Compare workflow does not scale well with increasing cohort size, particularly the OLGA pgen calculation for tcrsharing and the Gliph2 clustering. Currently:
- OLGA calculates generation probability serially
- Gliph2 clustering CPU, memory, and time usage increases exponentially with larger clusters
Possible solutions to implement:
OLGA:
- find means of vectorizing compute_pgen process
- chunk input CDR3 df and parallelize within Nextflow/python
- reformat so that pgen is calculated per sample (i.e., take the OLGA outputs from the sample workflow and feed them into the compare workflow)
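
A rough sketch of the chunk-and-parallelize option, in plain Python. It is an illustration under assumptions, not the workflow's code: `score_chunk_placeholder` is a hypothetical stand-in for the real compute_pgen step and returns dummy values so the example runs without OLGA installed, and `chunked`, `parallel_pgen`, `chunk_size`, and `n_workers` are names made up here. Each distinct CDR3 is scored only once, which also cuts work when sequences are shared across samples.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_chunk_placeholder(cdr3s: List[str]) -> List[float]:
    """Hypothetical stand-in for the real per-chunk pgen call.

    In the workflow this would load the OLGA model once per worker and
    score every sequence in the chunk; here it returns a dummy value.
    """
    return [1.0 / (1 + len(seq)) for seq in cdr3s]


def parallel_pgen(
    cdr3s: List[str],
    score_chunk: Callable[[List[str]], List[float]] = score_chunk_placeholder,
    chunk_size: int = 1000,
    n_workers: int = 1,
) -> Dict[str, float]:
    """Score each distinct CDR3 once, splitting the work into chunks."""
    unique = sorted(set(cdr3s))  # deduplicate before scoring
    chunks = list(chunked(unique, chunk_size))
    if n_workers <= 1:
        # Serial fallback, handy for debugging and small inputs.
        results = [score_chunk(chunk) for chunk in chunks]
    else:
        # score_chunk must be a picklable, module-level function.
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            results = list(pool.map(score_chunk, chunks))
    return {
        seq: score
        for chunk, scores in zip(chunks, results)
        for seq, score in zip(chunk, scores)
    }
```

The returned mapping can be joined back onto the original CDR3 table by sequence. The same chunking could instead be done at the Nextflow level, with each chunk handled by its own process.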
Gliph2:
- group input by CDR3 across samples to reduce size of input (potentially pivot long per sample afterwards)
- subset Gliph2 inputs by metadata or specific groups of samples so entire cohort isn't clustered at once, and Gliph2 subgroups can be clustered in parallel
- would enable more direct downstream
- need to assess biological impact of running Gliph2 on whole cohort vs subsets
- add cluster_min_size as a parameter and/or tinker with other Gliph2 args to reduce amount of clustering
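
The first two Gliph2 ideas could look roughly like the sketch below, in plain Python. Everything in it is an assumption for illustration: rows are modelled as `(sample_id, cdr3)` tuples, cluster assignments as a `cdr3 -> cluster id` mapping, and the function names are hypothetical. It does not call Gliph2 itself.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str]  # (sample_id, cdr3); an assumed input layout


def collapse_by_cdr3(rows: List[Row]) -> Dict[str, Set[str]]:
    """Group rows by CDR3 so each sequence is clustered only once."""
    samples_by_cdr3: Dict[str, Set[str]] = defaultdict(set)
    for sample, cdr3 in rows:
        samples_by_cdr3[cdr3].add(sample)
    return dict(samples_by_cdr3)


def expand_clusters(
    cluster_by_cdr3: Dict[str, int],
    samples_by_cdr3: Dict[str, Set[str]],
) -> List[Tuple[str, str, int]]:
    """Pivot cluster assignments back to one row per (sample, cdr3)."""
    long_rows = []
    for cdr3, cluster in cluster_by_cdr3.items():
        for sample in sorted(samples_by_cdr3.get(cdr3, ())):
            long_rows.append((sample, cdr3, cluster))
    return long_rows


def split_by_group(
    rows: List[Row], group_of_sample: Dict[str, str]
) -> Dict[str, List[Row]]:
    """Split rows into per-group subsets that can be clustered separately."""
    subsets: Dict[str, List[Row]] = defaultdict(list)
    for sample, cdr3 in rows:
        subsets[group_of_sample[sample]].append((sample, cdr3))
    return dict(subsets)
```

Each subset from `split_by_group` could be collapsed and passed to its own clustering job, then expanded back to long form afterwards. Whether clustering subsets gives biologically comparable results to clustering the whole cohort is the open question noted above.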