Scaling OLGA and Gliph2 for Compare workflow

The Compare workflow does not scale well with increasing cohort size. The two bottlenecks are the OLGA pgen (generation probability) calculation used for tcrsharing and the Gliph2 clustering step. Currently:
- OLGA calculates generation probabilities serially
- Gliph2 clustering CPU, memory, and runtime increase exponentially with larger clusters

Possible solutions to implement:
OLGA:
- find a way to vectorize the `compute_pgen` process
- chunk the input CDR3 DataFrame and parallelize within Nextflow or Python
- restructure so that pgen is calculated per sample (i.e., take the OLGA outputs from the sample workflow and feed them into the compare workflow)
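
A rough sketch of the chunk-and-parallelize option, to make the idea concrete. Everything named here is hypothetical: `placeholder_pgen` stands in for the real OLGA call (it is not OLGA's API), and `chunk_size` and `use_processes` are illustrative parameters. The sketch also deduplicates CDR3s first, so each unique sequence is scored once and the result can be joined back per sample afterwards.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def placeholder_pgen(cdr3):
    # ASSUMPTION: stand-in for the real OLGA pgen call, NOT a real model.
    # In practice each worker would load the OLGA model once and score
    # the sequences in its chunk with it.
    return 20.0 ** -len(cdr3)


def pgen_for_chunk(chunk):
    """Score one chunk; returns (cdr3, pgen) pairs."""
    return [(seq, placeholder_pgen(seq)) for seq in chunk]


def parallel_pgen(cdr3s, chunk_size=1000, use_processes=True, max_workers=None):
    """Compute pgen once per unique CDR3, spreading chunks over workers."""
    unique = sorted(set(cdr3s))  # identical CDR3s across samples are scored once
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    results = {}
    with pool_cls(max_workers=max_workers) as pool:
        for pairs in pool.map(pgen_for_chunk, chunked(unique, chunk_size)):
            results.update(pairs)
    return results
```

The same chunking could instead be done at the Nextflow level (one task per chunk), in which case only `pgen_for_chunk` would live in the Python script. Processes rather than threads are the sensible default for CPU-bound scoring; the thread option is only there for quick local checks.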

Gliph2:
- group input by CDR3 across samples to reduce input size (potentially pivoting back to long format per sample afterwards)
- subset Gliph2 inputs by metadata or specific groups of samples so the entire cohort isn't clustered at once and the Gliph2 subgroups can be clustered in parallel
  - would enable more direct downstream analysis
  - need to assess biological impact of running Gliph2 on whole cohort vs subsets
- add `cluster_min_size` as a parameter and/or tune other Gliph2 arguments to reduce the amount of clustering
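
A minimal sketch of the first two Gliph2 ideas (collapsing by CDR3 across samples, and splitting the cohort by a metadata group), assuming the input is a list of per-sample rows. The field names (`cdr3`, `v_gene`, `j_gene`, `sample`, `count`) and the `sample_to_group` mapping are hypothetical and would need to match the actual Gliph2 input columns and samplesheet.

```python
from collections import defaultdict


def collapse_clonotypes(rows):
    """Merge rows sharing (cdr3, v_gene, j_gene) across samples.

    Returns a dict keyed by clonotype with the summed count and the set of
    contributing samples, so the result can be expanded back to one row
    per sample after clustering.
    """
    merged = {}
    for row in rows:
        key = (row["cdr3"], row["v_gene"], row["j_gene"])
        entry = merged.setdefault(key, {"count": 0, "samples": set()})
        entry["count"] += row["count"]
        entry["samples"].add(row["sample"])
    return merged


def split_by_group(rows, sample_to_group):
    """Partition rows by a metadata group so each subset can be clustered separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[sample_to_group[row["sample"]]].append(row)
    return dict(groups)


rows = [
    {"cdr3": "CASSLGQETQYF", "v_gene": "TRBV7-2", "j_gene": "TRBJ2-5", "sample": "s1", "count": 3},
    {"cdr3": "CASSLGQETQYF", "v_gene": "TRBV7-2", "j_gene": "TRBJ2-5", "sample": "s2", "count": 2},
    {"cdr3": "CASSPDRGYEQYF", "v_gene": "TRBV5-1", "j_gene": "TRBJ2-7", "sample": "s2", "count": 1},
]
collapsed = collapse_clonotypes(rows)
subsets = split_by_group(rows, {"s1": "control", "s2": "treated"})
```

Here the three input rows collapse to two clonotypes, and the split produces one subset per metadata group. Whether clustering subsets separately gives results comparable to clustering the whole cohort is still the open question noted above.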

Scaling OLGA and Gliph2 for Compare workflow #69

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Scaling OLGA and Gliph2 for Compare workflow #69

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions