Conversation
Pull request overview
This PR adds multiple Quarto notebook templates that generate new interactive visualizations and tables for TCR sharing, QC, overlap/dynamics, phenotype analyses (bulk + single-cell), and GLIPH2 motif/network summaries.
Changes:
- Added new visualization-heavy Quarto templates for QC, sharing/publicity, overlap/dynamics, phenotype (bulk & sc), and GLIPH2.
- Added “Details” notebook wrappers that compose the new templates via `include`.
- Updated `.gitignore` to un-ignore selected `notebooks/*.qmd` templates.
Reviewed changes
Copilot reviewed 9 out of 12 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| notebooks/template_sharing.qmd | Adds sharing/publicity visualizations (stacked bar + pgen scatter + GIANA charts + UpSet + tables). |
| notebooks/template_qc.qmd | Adds an expanded QC report with many interactive plots and rarefaction/outlier analyses. |
| notebooks/template_pheno_sc.qmd | Adds single-cell phenotype composition plots (unweighted/weighted) and phenotype overlap UpSet views. |
| notebooks/template_pheno_sc_details.qmd | Adds phenotype-level Morisita heatmaps across sample-phenotype pairs per patient. |
| notebooks/template_pheno_bulk.qmd | Adds bulk phenotype inference visualizations using TCRpheno (unweighted + weighted). |
| notebooks/template_overlap.qmd | Adds clonal dynamics testing/visualizations, longitudinal heatmaps, and trajectory plots. |
| notebooks/template_gliph.qmd | Adds GLIPH2 network visualization and motif summaries/logos. |
| notebooks/template_details_part1.qmd | Adds a “Details” wrapper (part 1) with parameters and includes. |
| notebooks/template_details_part2.qmd | Adds a “Details” wrapper (part 2) composing overlap/sharing/pheno/GLIPH2 templates. |
| .gitignore | Updates ignore rules to track additional notebook templates under notebooks/. |
```python
VDJdb_dir = f"{project_dir}vdjdb/"
convergence_dir = f"{project_dir}convergence/"
```

`VDJdb_dir` and `convergence_dir` are missing a path separator (they currently expand to `<project_dir>vdjdb/` and `<project_dir>convergence/`). This will produce invalid paths and break downstream file discovery; these should be `f"{project_dir}/vdjdb/"` and `f"{project_dir}/convergence/"` (or use `os.path.join`).

Suggested change:

```python
VDJdb_dir = f"{project_dir}/vdjdb/"
convergence_dir = f"{project_dir}/convergence/"
```
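The `os.path.join` alternative mentioned above can be sketched as follows; the `project_dir` value shown is a hypothetical example, not taken from the notebook:

```python
import os

project_dir = "/data/project"  # hypothetical example value

# os.path.join inserts separators correctly whether or not project_dir
# carries a trailing slash; the final "" argument yields a trailing "/"
VDJdb_dir = os.path.join(project_dir, "vdjdb", "")
convergence_dir = os.path.join(project_dir, "convergence", "")
```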
```python
## Reading concatenated cdr3 file
concat_df = pd.read_csv(concat_csv, sep='\t')
concat_df = concat_df.merge(meta[['sample', subject_col, alias_col, 'timepoint', timepoint_order_col]], on='sample', how='left')
```

This merge hard-codes `'timepoint'` instead of using `timepoint_col`. If the notebook is parameterized to use a different timepoint column name, this will `KeyError` or silently use the wrong column; use `timepoint_col` consistently here.

Suggested change:

```python
concat_df = concat_df.merge(meta[['sample', subject_col, alias_col, timepoint_col, timepoint_order_col]], on='sample', how='left')
```
```python
    composition['subject_id'] = meta_row.iloc[0]['subject_id']
    composition['alias'] = meta_row.iloc[0]['alias']
    composition['sort_order'] = meta_row.iloc[0]['timepoint_order'] if 'timepoint_order' in meta_df.columns else 0
else:
    composition['subject_id'] = np.nan
    composition['alias'] = sample_id
```

Inside `create_phenotype_tabs`, metadata fields are hard-coded as `'subject_id'`, `'alias'`, and `'timepoint_order'` rather than using the parameter variables (`subject_col`, `alias_col`, `timepoint_order_col`). This breaks configurability and will fail when those parameter values differ; use the configured column names consistently.

Suggested change:

```python
    # Use configured metadata column names
    composition[subject_col] = meta_row.iloc[0][subject_col] if subject_col in meta_df.columns else np.nan
    composition[alias_col] = meta_row.iloc[0][alias_col] if alias_col in meta_df.columns else sample_id
    composition['sort_order'] = meta_row.iloc[0][timepoint_order_col] if timepoint_order_col in meta_df.columns else 0
else:
    composition[subject_col] = np.nan
    composition[alias_col] = sample_id
```
```python
# --- Metadata Lookup ---
meta_row = meta_df[meta_df['sample'] == sample_name]
if meta_row.empty:
    # Skip samples not in metadata, or define defaults
    continue

subject_id = meta_row.iloc[0]['subject_id']
alias = meta_row.iloc[0]['alias']
# Default to 0 if order column is missing
tp_order = meta_row.iloc[0]['timepoint_order'] if 'timepoint_order' in meta_df.columns else 0
```

Similarly, `create_weighted_phenotype_plot` hard-codes `'subject_id'`, `'alias'`, and `'timepoint_order'` for metadata lookup/output rather than using `subject_col` / `alias_col` / `timepoint_order_col`. This can raise `KeyError` or produce mismatched outputs when users change the parameter names.
```python
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
else:
    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
    timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)
```

This code checks for a variable named `timepoint_order` in `locals()`, but the notebook defines `timepoint_order_col` (a column name) and uses `timepoint_order_col` elsewhere. As written, `timepoint_order` will almost always be undefined, so the custom ordering path is never taken. Consider deriving the ordering from `timepoint_order_col` (e.g., from `meta`) instead of relying on an undefined mapping variable.

Suggested change:

```python
    # Use explicitly provided mapping if available
    timepoint_order_map = timepoint_order
elif 'timepoint_order_col' in locals() and timepoint_order_col in cdr3_df.columns:
    # Derive ordering from a dedicated timepoint order column, if provided
    ordered_timepoints = (
        cdr3_df[[timepoint_col, timepoint_order_col]]
        .drop_duplicates()
        .sort_values(by=timepoint_order_col)[timepoint_col]
        .tolist()
    )
    timepoint_order_map = {t: i for i, t in enumerate(ordered_timepoints)}
else:
    # Fallback: sort unique timepoints alphabetically/numerically
    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
    timepoint_order_map = {t: i for i, t in enumerate(ordered_timepoints)}
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_map)
```
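The metadata-derived ordering can be exercised in isolation with a minimal sketch; the `meta` frame and its `timepoint` / `timepoint_order` columns below are illustrative assumptions, not the notebook's actual data:

```python
import pandas as pd

def derive_timepoint_order(meta, timepoint_col='timepoint',
                           timepoint_order_col='timepoint_order'):
    """Build a {timepoint: rank} mapping from the metadata table."""
    ordered = (
        meta[[timepoint_col, timepoint_order_col]]
        .drop_duplicates()
        .sort_values(timepoint_order_col)[timepoint_col]
        .tolist()
    )
    return {t: i for i, t in enumerate(ordered)}

meta = pd.DataFrame({
    'timepoint': ['post', 'pre', 'mid'],
    'timepoint_order': [3, 1, 2],
})
order_map = derive_timepoint_order(meta)
# order_map == {'pre': 0, 'mid': 1, 'post': 2}
```

The resulting mapping can then feed `cdr3_df[timepoint_col].map(order_map)` regardless of how the parameter names are configured.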
```python
    Simulates rarefaction by expanding counts and shuffling.
    """
    # Flat array where each element is a clone ID, repeated by its count
    # Use integer IDs for memory efficiency instead of strings
    counts = sample_df['counts'].values
    n_reads = counts.sum()

    if n_reads == 0:
        return pd.DataFrame()

    # Generate clone IDs (0 to N-1)
    ids = np.arange(len(counts))

    # e.g., if clone 0 has 2 counts -> [0, 0]
    # WARNING: High memory usage for very deep sequencing
    flat_repertoire = np.repeat(ids, counts)

    # Shuffle to simulate random sampling
    np.random.shuffle(flat_repertoire)

    # Define depths to check (from 0 to total reads)
    depths = np.linspace(0, n_reads, num=steps, dtype=int)
    depths = np.unique(depths)[1:]  # Remove 0 and duplicates

    results = []
    for d in depths:
        subsample = flat_repertoire[:d]
        n_unique = len(np.unique(subsample))
```

Rarefaction uses `np.repeat(ids, counts)` to materialize a per-read array (`flat_repertoire`) of length `n_reads`. For deep libraries this is O(n_reads) memory and can OOM / severely slow rendering. Consider using a rarefaction algorithm that avoids full expansion (e.g., multinomial/without-replacement sampling on counts, or computing expected richness), and/or cap depth/steps or sample a subset for plotting.

Suggested change:

```python
    Simulates rarefaction by sampling on counts without materializing
    a per-read array, to avoid O(n_reads) memory usage.
    """
    # Work directly with the counts vector
    counts = sample_df['counts'].values
    n_reads = counts.sum()
    if n_reads == 0:
        return pd.DataFrame()

    # Clone frequency probabilities
    probs = counts / n_reads

    # Define depths to check (from 0 to total reads)
    depths = np.linspace(0, n_reads, num=steps, dtype=int)
    depths = np.unique(depths)[1:]  # Remove 0 and duplicates

    results = []
    for d in depths:
        # Multinomial sampling approximates random subsampling
        sample_counts = np.random.multinomial(d, probs)
        n_unique = np.count_nonzero(sample_counts)
```
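The "computing expected richness" option mentioned in the comment can be sketched deterministically: the classical rarefaction identity gives the exact expected number of unique clones in a without-replacement subsample, with no simulation at all. This is an alternative technique, not the notebook's current method:

```python
import math

def expected_richness(counts, depth):
    """Exact expected number of unique clones in a random subsample of
    `depth` reads drawn without replacement (hypergeometric identity)."""
    n_total = sum(counts)

    def log_comb(n, k):
        # log of C(n, k); -inf when the coefficient is zero
        if k < 0 or k > n:
            return float('-inf')
        return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

    denom = log_comb(n_total, depth)
    # P(clone i missing from the subsample) = C(N - n_i, d) / C(N, d)
    return sum(1.0 - math.exp(log_comb(n_total - c, depth) - denom)
               for c in counts)
```

Evaluating this over the same `depths` grid yields a smooth rarefaction curve in O(n_clones) memory and with no sampling noise.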
```python
def analyze_cdr3_lengths(df):
    """
    Expects df with columns: ['sample', 'CDR3b']
    """
```

`analyze_cdr3_lengths` mutates the input dataframe in place by adding a `cdr3_len` column. Since `run_qc_pipeline` passes `concat_df` directly, this introduces a hidden side effect that can impact later cells. Prefer working on a copy or computing the series without modifying the caller's dataframe.

Suggested change:

```python
    """
    Expects df with columns: ['sample', 'CDR3b']
    """
    # Work on a copy to avoid mutating the caller's dataframe
    df = df.copy()
```
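The copy-first pattern can be demonstrated end to end with a minimal sketch; the CDR3 strings and the single `cdr3_len` computation are illustrative, not the full QC logic:

```python
import pandas as pd

def analyze_cdr3_lengths(df):
    # Copy first so the caller's dataframe gains no new column
    df = df.copy()
    df['cdr3_len'] = df['CDR3b'].str.len()
    return df

concat_df = pd.DataFrame({'sample': ['s1', 's2'],
                          'CDR3b': ['CASSLG', 'CASR']})
lengths = analyze_cdr3_lengths(concat_df)
# concat_df is unchanged; the returned frame carries cdr3_len
```

Later cells that reuse `concat_df` then see exactly the columns they expect, independent of whether this QC step ran first.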
```python
def create_phenotype_stacked_bar(meta_df):

    # --- 1. Load Data ---
    file_pattern = os.path.join(pseudobulk_dir, "*_pseudobulk_phenotype.tsv")
    file_paths = glob.glob(file_pattern)
```

`create_phenotype_stacked_bar` is defined twice in the same notebook (the second definition overwrites the first). This makes execution order-dependent and hard to maintain; consider using a single function with parameters (e.g., `weight_by='clonotypes'|'cells'`) or distinct function names for the two plots.
> This plot provides a powerful, two-dimensional view of your T-cell repertoire, helping you move beyond simple clone lists to understand the functional landscape of the immune response. The central question it helps answer is: **Are the T-cell clones found across many different samples (widely shared) also the ones that are most active and expanded?**
>
> This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

Spelling: "axplanation" should be "explanation" in the narrative text.

Suggested change:

> This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed explanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.