Conversation
Pull request overview
This PR adds multiple Quarto notebook templates that generate new interactive visualizations and tables for TCR sharing, QC, overlap/dynamics, phenotype analyses (bulk + single-cell), and GLIPH2 motif/network summaries.
Changes:
- Added new visualization-heavy Quarto templates for QC, sharing/publicity, overlap/dynamics, phenotype (bulk & sc), and GLIPH2.
- Added “Details” notebook wrappers that compose the new templates via `include`.
- Updated `.gitignore` to un-ignore selected `notebooks/*.qmd` templates.
Reviewed changes
Copilot reviewed 9 out of 12 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| notebooks/template_sharing.qmd | Adds sharing/publicity visualizations (stacked bar + pgen scatter + GIANA charts + UpSet + tables). |
| notebooks/template_qc.qmd | Adds an expanded QC report with many interactive plots and rarefaction/outlier analyses. |
| notebooks/template_pheno_sc.qmd | Adds single-cell phenotype composition plots (unweighted/weighted) and phenotype overlap UpSet views. |
| notebooks/template_pheno_sc_details.qmd | Adds phenotype-level Morisita heatmaps across sample-phenotype pairs per patient. |
| notebooks/template_pheno_bulk.qmd | Adds bulk phenotype inference visualizations using TCRpheno (unweighted + weighted). |
| notebooks/template_overlap.qmd | Adds clonal dynamics testing/visualizations, longitudinal heatmaps, and trajectory plots. |
| notebooks/template_gliph.qmd | Adds GLIPH2 network visualization and motif summaries/logos. |
| notebooks/template_details_part1.qmd | Adds a “Details” wrapper (part 1) with parameters and includes. |
| notebooks/template_details_part2.qmd | Adds a “Details” wrapper (part 2) composing overlap/sharing/pheno/GLIPH2 templates. |
| .gitignore | Updates ignore rules to track additional notebook templates under notebooks/. |
```python
VDJdb_dir = f"{project_dir}vdjdb/"
convergence_dir = f"{project_dir}convergence/"
```

`VDJdb_dir` and `convergence_dir` are missing a path separator (they currently expand to `<project_dir>vdjdb/` and `<project_dir>convergence/`). This will produce invalid paths and break downstream file discovery; these should be `f"{project_dir}/vdjdb/"` and `f"{project_dir}/convergence/"` (or use `os.path.join`).

Suggested change:

```python
VDJdb_dir = f"{project_dir}/vdjdb/"
convergence_dir = f"{project_dir}/convergence/"
```
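The `os.path.join` alternative mentioned above can be sketched as follows; the `project_dir` value shown is a hypothetical example, not taken from the notebook:

```python
import os

project_dir = "/data/project"  # hypothetical example value

# os.path.join inserts separators correctly whether or not project_dir
# carries a trailing slash; the final "" argument yields a trailing "/"
VDJdb_dir = os.path.join(project_dir, "vdjdb", "")
convergence_dir = os.path.join(project_dir, "convergence", "")
```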
```python
## Reading concatenated cdr3 file
concat_df = pd.read_csv(concat_csv, sep='\t')
concat_df = concat_df.merge(meta[['sample', subject_col, alias_col, 'timepoint', timepoint_order_col]], on='sample', how='left')
```

This merge hard-codes `'timepoint'` instead of using `timepoint_col`. If the notebook is parameterized to use a different timepoint column name, this will `KeyError` or silently use the wrong column; use `timepoint_col` consistently here.

Suggested change:

```python
concat_df = concat_df.merge(meta[['sample', subject_col, alias_col, timepoint_col, timepoint_order_col]], on='sample', how='left')
```
```python
    composition['subject_id'] = meta_row.iloc[0]['subject_id']
    composition['alias'] = meta_row.iloc[0]['alias']
    composition['sort_order'] = meta_row.iloc[0]['timepoint_order'] if 'timepoint_order' in meta_df.columns else 0
else:
    composition['subject_id'] = np.nan
    composition['alias'] = sample_id
```

Inside `create_phenotype_tabs`, metadata fields are hard-coded as `'subject_id'`, `'alias'`, and `'timepoint_order'` rather than using the parameter variables (`subject_col`, `alias_col`, `timepoint_order_col`). This breaks configurability and will fail when those parameter values differ; use the configured column names consistently.

Suggested change:

```python
    # Use configured metadata column names
    composition[subject_col] = meta_row.iloc[0][subject_col] if subject_col in meta_df.columns else np.nan
    composition[alias_col] = meta_row.iloc[0][alias_col] if alias_col in meta_df.columns else sample_id
    composition['sort_order'] = meta_row.iloc[0][timepoint_order_col] if timepoint_order_col in meta_df.columns else 0
else:
    composition[subject_col] = np.nan
    composition[alias_col] = sample_id
```
```python
# --- Metadata Lookup ---
meta_row = meta_df[meta_df['sample'] == sample_name]
if meta_row.empty:
    # Skip samples not in metadata, or define defaults
    continue

subject_id = meta_row.iloc[0]['subject_id']
alias = meta_row.iloc[0]['alias']
# Default to 0 if order column is missing
tp_order = meta_row.iloc[0]['timepoint_order'] if 'timepoint_order' in meta_df.columns else 0
```

Similarly, `create_weighted_phenotype_plot` hard-codes `'subject_id'`, `'alias'`, and `'timepoint_order'` for metadata lookup/output rather than using `subject_col` / `alias_col` / `timepoint_order_col`. This can raise `KeyError` or produce mismatched outputs when users change the parameter names.
```python
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
else:
    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
    timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
    cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)
```

This code checks for a variable named `timepoint_order` in `locals()`, but the notebook defines `timepoint_order_col` (a column name) and uses `timepoint_order_col` elsewhere. As written, `timepoint_order` will almost always be undefined, so the custom ordering path is never taken. Consider deriving the ordering from `timepoint_order_col` (e.g., from `meta`) instead of relying on an undefined mapping variable.

Suggested change:

```python
    # Use explicitly provided mapping if available
    timepoint_order_map = timepoint_order
elif 'timepoint_order_col' in locals() and timepoint_order_col in cdr3_df.columns:
    # Derive ordering from a dedicated timepoint order column, if provided
    ordered_timepoints = (
        cdr3_df[[timepoint_col, timepoint_order_col]]
        .drop_duplicates()
        .sort_values(by=timepoint_order_col)[timepoint_col]
        .tolist()
    )
    timepoint_order_map = {t: i for i, t in enumerate(ordered_timepoints)}
else:
    # Fallback: sort unique timepoints alphabetically/numerically
    ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
    timepoint_order_map = {t: i for i, t in enumerate(ordered_timepoints)}
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_map)
```
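The metadata-derived ordering can be exercised in isolation with a minimal sketch; the `meta` frame and its `timepoint` / `timepoint_order` columns below are illustrative assumptions, not the notebook's actual data:

```python
import pandas as pd

def derive_timepoint_order(meta, timepoint_col='timepoint',
                           timepoint_order_col='timepoint_order'):
    """Build a {timepoint: rank} mapping from the metadata table."""
    ordered = (
        meta[[timepoint_col, timepoint_order_col]]
        .drop_duplicates()
        .sort_values(timepoint_order_col)[timepoint_col]
        .tolist()
    )
    return {t: i for i, t in enumerate(ordered)}

meta = pd.DataFrame({
    'timepoint': ['post', 'pre', 'mid'],
    'timepoint_order': [3, 1, 2],
})
order_map = derive_timepoint_order(meta)
# order_map == {'pre': 0, 'mid': 1, 'post': 2}
```

The resulting mapping can then feed `cdr3_df[timepoint_col].map(order_map)` regardless of how the parameter names are configured.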
```python
    Simulates rarefaction by expanding counts and shuffling.
    """
    # Flat array where each element is a clone ID, repeated by its count
    # Use integer IDs for memory efficiency instead of strings
    counts = sample_df['counts'].values
    n_reads = counts.sum()

    if n_reads == 0:
        return pd.DataFrame()

    # Generate clone IDs (0 to N-1)
    ids = np.arange(len(counts))

    # e.g., if clone 0 has 2 counts -> [0, 0]
    # WARNING: High memory usage for very deep sequencing
    flat_repertoire = np.repeat(ids, counts)

    # Shuffle to simulate random sampling
    np.random.shuffle(flat_repertoire)

    # Define depths to check (from 0 to total reads)
    depths = np.linspace(0, n_reads, num=steps, dtype=int)
    depths = np.unique(depths)[1:]  # Remove 0 and duplicates

    results = []
    for d in depths:
        subsample = flat_repertoire[:d]
        n_unique = len(np.unique(subsample))
```

Rarefaction uses `np.repeat(ids, counts)` to materialize a per-read array (`flat_repertoire`) of length `n_reads`. For deep libraries this is O(n_reads) memory and can OOM / severely slow rendering. Consider using a rarefaction algorithm that avoids full expansion (e.g., multinomial/without-replacement sampling on counts, or computing expected richness), and/or cap depth/steps or sample a subset for plotting.

Suggested change:

```python
    Simulates rarefaction by sampling on counts without materializing
    a per-read array, to avoid O(n_reads) memory usage.
    """
    # Work directly with the counts vector
    counts = sample_df['counts'].values
    n_reads = counts.sum()
    if n_reads == 0:
        return pd.DataFrame()

    # Clone frequency probabilities
    probs = counts / n_reads

    # Define depths to check (from 0 to total reads)
    depths = np.linspace(0, n_reads, num=steps, dtype=int)
    depths = np.unique(depths)[1:]  # Remove 0 and duplicates

    results = []
    for d in depths:
        # Multinomial sampling approximates random subsampling
        sample_counts = np.random.multinomial(d, probs)
        n_unique = np.count_nonzero(sample_counts)
```
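The "computing expected richness" option mentioned in the comment can be sketched deterministically: the classical rarefaction identity gives the exact expected number of unique clones in a without-replacement subsample, with no simulation at all. This is an alternative technique, not the notebook's current method:

```python
import math

def expected_richness(counts, depth):
    """Exact expected number of unique clones in a random subsample of
    `depth` reads drawn without replacement (hypergeometric identity)."""
    n_total = sum(counts)

    def log_comb(n, k):
        # log of C(n, k); -inf when the coefficient is zero
        if k < 0 or k > n:
            return float('-inf')
        return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

    denom = log_comb(n_total, depth)
    # P(clone i missing from the subsample) = C(N - n_i, d) / C(N, d)
    return sum(1.0 - math.exp(log_comb(n_total - c, depth) - denom)
               for c in counts)
```

Evaluating this over the same `depths` grid yields a smooth rarefaction curve in O(n_clones) memory and with no sampling noise.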
```python
def analyze_cdr3_lengths(df):
    """
    Expects df with columns: ['sample', 'CDR3b']
    """
```

`analyze_cdr3_lengths` mutates the input dataframe in place by adding a `cdr3_len` column. Since `run_qc_pipeline` passes `concat_df` directly, this introduces a hidden side effect that can impact later cells. Prefer working on a copy or computing the series without modifying the caller's dataframe.

Suggested change:

```python
    """
    Expects df with columns: ['sample', 'CDR3b']
    """
    # Work on a copy to avoid mutating the caller's dataframe
    df = df.copy()
```
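The copy-first pattern can be demonstrated end to end with a minimal sketch; the CDR3 strings and the single `cdr3_len` computation are illustrative, not the full QC logic:

```python
import pandas as pd

def analyze_cdr3_lengths(df):
    # Copy first so the caller's dataframe gains no new column
    df = df.copy()
    df['cdr3_len'] = df['CDR3b'].str.len()
    return df

concat_df = pd.DataFrame({'sample': ['s1', 's2'],
                          'CDR3b': ['CASSLG', 'CASR']})
lengths = analyze_cdr3_lengths(concat_df)
# concat_df is unchanged; the returned frame carries cdr3_len
```

Later cells that reuse `concat_df` then see exactly the columns they expect, independent of whether this QC step ran first.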
```python
def create_phenotype_stacked_bar(meta_df):

    # --- 1. Load Data ---
    file_pattern = os.path.join(pseudobulk_dir, "*_pseudobulk_phenotype.tsv")
    file_paths = glob.glob(file_pattern)
```

`create_phenotype_stacked_bar` is defined twice in the same notebook (the second definition overwrites the first). This makes execution order-dependent and hard to maintain; consider using a single function with parameters (e.g., `weight_by='clonotypes'|'cells'`) or distinct function names for the two plots.
> This plot provides a powerful, two-dimensional view of your T-cell repertoire, helping you move beyond simple clone lists to understand the functional landscape of the immune response. The central question it helps answer is: **Are the T-cell clones found across many different samples (widely shared) also the ones that are most active and expanded?**
>
> This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

Spelling: "axplanation" should be "explanation" in the narrative text.

Suggested change:

> This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed explanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.