
Add notebook visualizations #76

Open
dltamayo wants to merge 1 commit into dltamayo-dev from quarto-notebooks

Conversation

@dltamayo
Collaborator

No description provided.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds multiple Quarto notebook templates that generate new interactive visualizations and tables for TCR sharing, QC, overlap/dynamics, phenotype analyses (bulk + single-cell), and GLIPH2 motif/network summaries.

Changes:

  • Added new visualization-heavy Quarto templates for QC, sharing/publicity, overlap/dynamics, phenotype (bulk & sc), and GLIPH2.
  • Added “Details” notebook wrappers that compose the new templates via include.
  • Updated .gitignore to un-ignore selected notebooks/*.qmd templates.

Reviewed changes

Copilot reviewed 9 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
notebooks/template_sharing.qmd Adds sharing/publicity visualizations (stacked bar + pgen scatter + GIANA charts + UpSet + tables).
notebooks/template_qc.qmd Adds an expanded QC report with many interactive plots and rarefaction/outlier analyses.
notebooks/template_pheno_sc.qmd Adds single-cell phenotype composition plots (unweighted/weighted) and phenotype overlap UpSet views.
notebooks/template_pheno_sc_details.qmd Adds phenotype-level Morisita heatmaps across sample-phenotype pairs per patient.
notebooks/template_pheno_bulk.qmd Adds bulk phenotype inference visualizations using TCRpheno (unweighted + weighted).
notebooks/template_overlap.qmd Adds clonal dynamics testing/visualizations, longitudinal heatmaps, and trajectory plots.
notebooks/template_gliph.qmd Adds GLIPH2 network visualization and motif summaries/logos.
notebooks/template_details_part1.qmd Adds a “Details” wrapper (part 1) with parameters and includes.
notebooks/template_details_part2.qmd Adds a “Details” wrapper (part 2) composing overlap/sharing/pheno/GLIPH2 templates.
.gitignore Updates ignore rules to track additional notebook templates under notebooks/.


Comment on lines +27 to +28
VDJdb_dir = f"{project_dir}vdjdb/"
convergence_dir = f"{project_dir}convergence/"

Copilot AI Apr 1, 2026


VDJdb_dir and convergence_dir are missing a path separator (they currently expand to <project_dir>vdjdb/ and <project_dir>convergence/). This will produce invalid paths and break downstream file discovery; these should be f"{project_dir}/vdjdb/" and f"{project_dir}/convergence/" (or use os.path.join).

Suggested change
VDJdb_dir = f"{project_dir}vdjdb/"
convergence_dir = f"{project_dir}convergence/"
VDJdb_dir = f"{project_dir}/vdjdb/"
convergence_dir = f"{project_dir}/convergence/"

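As an aside to the suggestion above, a minimal sketch of separator-safe path construction (the `project_dir` value here is invented for illustration):

```python
import os
from pathlib import Path

project_dir = "/data/project"  # hypothetical base directory

# os.path.join inserts a separator only where one is needed
vdjdb_dir = os.path.join(project_dir, "vdjdb")
convergence_dir = os.path.join(project_dir, "convergence")

# pathlib's `/` operator is an equivalent, often clearer option
vdjdb_dir_p = Path(project_dir) / "vdjdb"

print(vdjdb_dir)  # /data/project/vdjdb
```

Either form avoids the missing-separator bug regardless of whether `project_dir` ends in a slash-free name.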

## Reading concatenated cdr3 file
concat_df = pd.read_csv(concat_csv, sep='\t')
concat_df = concat_df.merge(meta[['sample', subject_col, alias_col, 'timepoint', timepoint_order_col]], on='sample', how='left')

Copilot AI Apr 1, 2026


This merge hard-codes 'timepoint' instead of using timepoint_col. If the notebook is parameterized to use a different timepoint column name, this will KeyError or silently use the wrong column; use timepoint_col consistently here.

Suggested change
concat_df = concat_df.merge(meta[['sample', subject_col, alias_col, 'timepoint', timepoint_order_col]], on='sample', how='left')
concat_df = concat_df.merge(meta[['sample', subject_col, alias_col, timepoint_col, timepoint_order_col]], on='sample', how='left')

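A small self-contained sketch of the parameterized merge; the toy metadata and column values are invented, only the parameter names mirror the notebook's:

```python
import pandas as pd

# Column names mirror the notebook's parameters; values are invented
subject_col, alias_col = "subject_id", "alias"
timepoint_col, timepoint_order_col = "timepoint", "timepoint_order"

meta = pd.DataFrame({
    "sample": ["s1", "s2"],
    subject_col: ["p1", "p1"],
    alias_col: ["baseline", "week4"],
    timepoint_col: ["T0", "T1"],
    timepoint_order_col: [0, 1],
})
concat_df = pd.DataFrame({"sample": ["s1", "s2", "s2"],
                          "CDR3b": ["CASSA", "CASSB", "CASSC"]})

# Build the column list from the parameters so a renamed timepoint
# column can never KeyError
cols = ["sample", subject_col, alias_col, timepoint_col, timepoint_order_col]
concat_df = concat_df.merge(meta[cols], on="sample", how="left")
print(concat_df[timepoint_col].tolist())  # ['T0', 'T1', 'T1']
```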
Comment on lines +178 to +183
    composition['subject_id'] = meta_row.iloc[0]['subject_id']
    composition['alias'] = meta_row.iloc[0]['alias']
    composition['sort_order'] = meta_row.iloc[0]['timepoint_order'] if 'timepoint_order' in meta_df.columns else 0
else:
    composition['subject_id'] = np.nan
    composition['alias'] = sample_id

Copilot AI Apr 1, 2026


Inside create_phenotype_tabs, metadata fields are hard-coded as 'subject_id', 'alias', and 'timepoint_order' rather than using the parameter variables (subject_col, alias_col, timepoint_order_col). This breaks configurability and will fail when those parameter values differ; use the configured column names consistently.

Suggested change
    composition['subject_id'] = meta_row.iloc[0]['subject_id']
    composition['alias'] = meta_row.iloc[0]['alias']
    composition['sort_order'] = meta_row.iloc[0]['timepoint_order'] if 'timepoint_order' in meta_df.columns else 0
else:
    composition['subject_id'] = np.nan
    composition['alias'] = sample_id
    # Use configured metadata column names
    composition[subject_col] = meta_row.iloc[0][subject_col] if subject_col in meta_df.columns else np.nan
    composition[alias_col] = meta_row.iloc[0][alias_col] if alias_col in meta_df.columns else sample_id
    composition['sort_order'] = meta_row.iloc[0][timepoint_order_col] if timepoint_order_col in meta_df.columns else 0
else:
    composition[subject_col] = np.nan
    composition[alias_col] = sample_id

Comment on lines +345 to +355
# --- Metadata Lookup ---
meta_row = meta_df[meta_df['sample'] == sample_name]
if meta_row.empty:
    # Skip samples not in metadata, or define defaults
    continue

subject_id = meta_row.iloc[0]['subject_id']
alias = meta_row.iloc[0]['alias']
# Default to 0 if order column is missing
tp_order = meta_row.iloc[0]['timepoint_order'] if 'timepoint_order' in meta_df.columns else 0


Copilot AI Apr 1, 2026


Similarly, create_weighted_phenotype_plot hard-codes 'subject_id', 'alias', and 'timepoint_order' for metadata lookup/output rather than using subject_col / alias_col / timepoint_order_col. This can raise KeyError or produce mismatched outputs when users change the parameter names.

Comment on lines +752 to +757
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
else:
    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
    timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)


Copilot AI Apr 1, 2026


This code checks for a variable named timepoint_order in locals(), but the notebook defines timepoint_order_col (column name) and uses timepoint_order_col elsewhere. As written, timepoint_order will almost always be undefined, so the custom ordering path is never taken. Consider deriving the ordering from timepoint_order_col (e.g., from meta) instead of relying on an undefined mapping variable.

Suggested change
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
else:
    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
    timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)
    # Use explicitly provided mapping if available
    timepoint_order_map = timepoint_order
elif 'timepoint_order_col' in locals() and timepoint_order_col in cdr3_df.columns:
    # Derive ordering from a dedicated timepoint order column, if provided
    ordered_timepoints = (
        cdr3_df[[timepoint_col, timepoint_order_col]]
        .drop_duplicates()
        .sort_values(by=timepoint_order_col)[timepoint_col]
        .tolist()
    )
    timepoint_order_map = {t: i for i, t in enumerate(ordered_timepoints)}
else:
    # Fallback: sort unique timepoints alphabetically/numerically
    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
    timepoint_order_map = {t: i for i, t in enumerate(ordered_timepoints)}
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_map)

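The derivation suggested in this comment can be sketched on its own, assuming a metadata table with a dedicated order column (the parameter names mirror the notebook's; the toy values are invented):

```python
import pandas as pd

# Parameter names from the notebook; the metadata values are invented
timepoint_col, timepoint_order_col = "timepoint", "timepoint_order"

meta = pd.DataFrame({
    timepoint_col: ["week4", "baseline", "week12", "week4"],
    timepoint_order_col: [1, 0, 2, 1],
})

# Derive the ordering once from the metadata's order column instead of
# relying on an undefined `timepoint_order` mapping variable
ordered = (
    meta[[timepoint_col, timepoint_order_col]]
    .drop_duplicates()
    .sort_values(timepoint_order_col)[timepoint_col]
    .tolist()
)
timepoint_order_map = {t: i for i, t in enumerate(ordered)}
print(timepoint_order_map)  # {'baseline': 0, 'week4': 1, 'week12': 2}
```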
Comment on lines +1627 to +1654
    Simulates rarefaction by expanding counts and shuffling.
    """
    # Flat array where each element is a clone ID, repeated by its count
    # Use integer IDs for memory efficiency instead of strings
    counts = sample_df['counts'].values
    n_reads = counts.sum()

    if n_reads == 0:
        return pd.DataFrame()

    # Generate clone IDs (0 to N-1)
    ids = np.arange(len(counts))

    # e.g., if clone 0 has 2 counts -> [0, 0]
    # WARNING: High memory usage for very deep sequencing
    flat_repertoire = np.repeat(ids, counts)

    # Shuffle to simulate random sampling
    np.random.shuffle(flat_repertoire)

    # Define depths to check (from 0 to total reads)
    depths = np.linspace(0, n_reads, num=steps, dtype=int)
    depths = np.unique(depths)[1:]  # Remove 0 and duplicates

    results = []
    for d in depths:
        subsample = flat_repertoire[:d]
        n_unique = len(np.unique(subsample))

Copilot AI Apr 1, 2026


Rarefaction uses np.repeat(ids, counts) to materialize a per-read array (flat_repertoire) of length n_reads. For deep libraries this is O(n_reads) memory and can OOM / severely slow rendering. Consider using a rarefaction algorithm that avoids full expansion (e.g., multinomial/without-replacement sampling on counts, or computing expected richness), and/or cap depth/steps or sample subset for plotting.

Suggested change
    Simulates rarefaction by expanding counts and shuffling.
    """
    # Flat array where each element is a clone ID, repeated by its count
    # Use integer IDs for memory efficiency instead of strings
    counts = sample_df['counts'].values
    n_reads = counts.sum()
    if n_reads == 0:
        return pd.DataFrame()
    # Generate clone IDs (0 to N-1)
    ids = np.arange(len(counts))
    # e.g., if clone 0 has 2 counts -> [0, 0]
    # WARNING: High memory usage for very deep sequencing
    flat_repertoire = np.repeat(ids, counts)
    # Shuffle to simulate random sampling
    np.random.shuffle(flat_repertoire)
    # Define depths to check (from 0 to total reads)
    depths = np.linspace(0, n_reads, num=steps, dtype=int)
    depths = np.unique(depths)[1:]  # Remove 0 and duplicates
    results = []
    for d in depths:
        subsample = flat_repertoire[:d]
        n_unique = len(np.unique(subsample))
    Simulates rarefaction by sampling on counts without materializing
    a per-read array, to avoid O(n_reads) memory usage.
    """
    # Work directly with the counts vector
    counts = sample_df['counts'].values
    n_reads = counts.sum()
    if n_reads == 0:
        return pd.DataFrame()
    # Clone frequency probabilities
    probs = counts / n_reads
    # Define depths to check (from 0 to total reads)
    depths = np.linspace(0, n_reads, num=steps, dtype=int)
    depths = np.unique(depths)[1:]  # Remove 0 and duplicates
    results = []
    for d in depths:
        # Multinomial sampling approximates random subsampling
        sample_counts = np.random.multinomial(d, probs)
        n_unique = np.count_nonzero(sample_counts)

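The count-based approach described in this comment can be sketched as a self-contained function; `rarefaction_curve` and the toy counts are invented for illustration, not the notebook's API:

```python
import numpy as np

def rarefaction_curve(counts, steps=10, rng=None):
    """Rarefaction via multinomial sampling on the counts vector.

    Memory stays O(n_clones) instead of O(n_reads).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    counts = np.asarray(counts)
    n_reads = int(counts.sum())
    if n_reads == 0:
        return []
    probs = counts / n_reads
    # Depths from 0 to total reads; drop the 0 entry and duplicates
    depths = np.unique(np.linspace(0, n_reads, num=steps, dtype=int))[1:]
    results = []
    for d in depths:
        # A multinomial draw approximates random subsampling of reads
        sample_counts = rng.multinomial(d, probs)
        results.append((int(d), int(np.count_nonzero(sample_counts))))
    return results

# Toy repertoire: 3 clones with very uneven counts
curve = rarefaction_curve([1000, 10, 1], steps=5)
```

Multinomial sampling is with replacement, so it slightly overestimates richness at depths near the full library; for exact expected richness a hypergeometric formula can be used instead.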
def analyze_cdr3_lengths(df):
    """
    Expects df with columns: ['sample', 'CDR3b']
    """

Copilot AI Apr 1, 2026


analyze_cdr3_lengths mutates the input dataframe in-place by adding a cdr3_len column. Since run_qc_pipeline passes concat_df directly, this introduces a hidden side effect that can impact later cells. Prefer working on a copy or computing the series without modifying the caller’s dataframe.

Suggested change
    """
    """
    # Work on a copy to avoid mutating the caller's dataframe
    df = df.copy()

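A minimal sketch of the side-effect-free pattern, with invented toy data:

```python
import pandas as pd

def analyze_cdr3_lengths(df):
    """Side-effect-free sketch: compute lengths on a copy."""
    df = df.copy()  # the caller's dataframe is left untouched
    df["cdr3_len"] = df["CDR3b"].str.len()
    return df

# Invented toy input
concat_df = pd.DataFrame({"sample": ["s1", "s1"],
                          "CDR3b": ["CASSL", "CASSLGETQYF"]})
lengths = analyze_cdr3_lengths(concat_df)
print("cdr3_len" in concat_df.columns)  # False (input not mutated)
```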
Comment on lines +158 to +162
def create_phenotype_stacked_bar(meta_df):

    # --- 1. Load Data ---
    file_pattern = os.path.join(pseudobulk_dir, "*_pseudobulk_phenotype.tsv")
    file_paths = glob.glob(file_pattern)

Copilot AI Apr 1, 2026


create_phenotype_stacked_bar is defined twice in the same notebook (second definition overwrites the first). This makes execution order-dependent and hard to maintain; consider using a single function with parameters (e.g., weight_by='clonotypes'|'cells') or distinct function names for the two plots.

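One way to consolidate the two definitions, sketched with a hypothetical `weight_by` switch and invented column names:

```python
import pandas as pd

# Hypothetical consolidation: one function with a `weight_by` switch instead
# of two same-named definitions that silently overwrite each other
def phenotype_composition(df, weight_by="clonotypes"):
    if weight_by == "clonotypes":
        # Each unique clonotype counts once per phenotype
        return df.groupby("phenotype")["clonotype"].nunique()
    if weight_by == "cells":
        # Weight each phenotype by its total cell count
        return df.groupby("phenotype")["n_cells"].sum()
    raise ValueError(f"unknown weight_by: {weight_by!r}")

df = pd.DataFrame({
    "phenotype": ["CD8_Tem", "CD8_Tem", "Treg"],
    "clonotype": ["c1", "c2", "c3"],
    "n_cells": [10, 5, 2],
})
print(phenotype_composition(df, "cells").to_dict())  # {'CD8_Tem': 15, 'Treg': 2}
```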

This plot provides a powerful, two-dimensional view of your T-cell repertoire, helping you move beyond simple clone lists to understand the functional landscape of the immune response. The central question it helps answer is: **Are the T-cell clones found across many different samples (widely shared) also the ones that are most active and expanded?**

This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed explanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

Copilot AI Apr 1, 2026


Spelling: "axplanation" should be "explanation" in the narrative text.

Suggested change
This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.
This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed explanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

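The categorization described in the narrative above can be sketched in pandas; the toy frequencies are invented, and the bin edges are the 0.1% and 1% cutoffs from the text:

```python
import pandas as pd

# Toy per-sample clone frequencies (invented values)
df = pd.DataFrame({
    "clone":  ["c1", "c1", "c2", "c3"],
    "sample": ["s1", "s2", "s1", "s2"],
    "freq":   [0.02, 0.005, 0.0005, 0.002],
})

# Per clone: number of samples it appears in, and its maximum frequency
per_clone = df.groupby("clone").agg(
    sharing_level=("sample", "nunique"),
    max_freq=("freq", "max"),
)

# Bin the maximum frequency with the 0.1% / 1% cutoffs from the text
per_clone["category"] = pd.cut(
    per_clone["max_freq"],
    bins=[0, 0.001, 0.01, 1.0],
    labels=["Non-expanded", "Expanded", "Highly Expanded"],
)
print(per_clone["category"].tolist())  # ['Highly Expanded', 'Non-expanded', 'Expanded']
```

Counting clones per (sharing_level, category) pair then yields the stacked bar heights directly.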