NGS screen processing module

The GuideCounter class is a wrapper that runs the functions in this module for a CRISPR screen experiment.

This module contains a set of Python functions to process and analyze NGS files from CRISPR screens. The functions are divided into two groups, Cas9 and Cas12, according to the type of CRISPR-Cas system used for the screen.


Scripts to work with NGS data

This module provides functions to process FASTQ files from screens with single or dual guide libraries. In general, the algorithm is fairly simple:

  1. Read the FASTQ file and extract the proper sequences

  2. Count the exact number of occurrences for each unique sequence

  3. Map the counted sequences to the reference sequence library

  4. Return the counts of mapped or unmapped events as one or more DataFrames
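The four steps above can be sketched with the standard library alone. This is a minimal illustration of the count-then-map idea, not the module's implementation: the helper names `read_fastq_sequences` and `count_and_map`, the toy reads, and the dictionary-based library are all assumptions made for the example.

```python
from collections import Counter
from io import StringIO


def read_fastq_sequences(handle):
    """Yield the sequence line (the 2nd of every 4 lines) from a FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def count_and_map(handle, library):
    """Count every observed sequence first, then split the counts by library membership.

    `library` is a hypothetical dict mapping protospacer sequence -> guide name.
    """
    counts = Counter(read_fastq_sequences(handle))          # steps 1 and 2
    mapped = {library[s]: n for s, n in counts.items() if s in library}      # step 3
    unmapped = {s: n for s, n in counts.items() if s not in library}
    return mapped, unmapped                                  # step 4


# Toy FASTQ content held in memory instead of a file on disk.
fastq_text = (
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nTTTTCCCC\n+\nIIIIIIII\n"
    "@r4\nGGGGAAAA\n+\nIIIIIIII\n"
)
library = {"ACGTACGT": "sgRNA_1", "TTTTCCCC": "sgRNA_2"}

mapped, unmapped = count_and_map(StringIO(fastq_text), library)
print(mapped)    # counts keyed by guide name
print(unmapped)  # counts keyed by raw sequence
```

Because counting happens before mapping, sequences absent from the library are kept rather than discarded, which is what makes the unmapped output possible.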

For single-guide screens, the sequences are counted as a single protospacer from a single-end read file (R1). Then, these sequences are mapped to the reference library of protospacer sequences.

For dual-guide screens, the sequences are counted as pairs of protospacer A and B from paired-end read files (R1 and R2). Then, sequences are mapped to the reference library of protospacer A and B pairs.

In principle, the algorithm can detect any observed sequence because it counts first and maps afterwards. Recombination events can therefore be detected: in a dual-guide design, these are reads in which the observed protospacer A and B pair does not match any pair in the reference library. These events include:

  • Both protospacer A and B are present in the reference library but are paired differently

  • Only one of protospacer A and B is present in the reference library

  • Neither protospacer A nor B is present in the reference library
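The categories listed above can be told apart by checking each observed pair against the library pairs and against the sets of known A and B protospacers. The following is a hedged sketch of that logic only; the function name `classify_pair`, the category labels, and the toy sequences are hypothetical, and it assumes each protospacer is checked only against its own position (A against known A sequences, B against known B sequences).

```python
from collections import Counter


def classify_pair(a, b, library_pairs, known_a, known_b):
    """Return a label describing how an observed (A, B) pair relates to the library."""
    if (a, b) in library_pairs:
        return "mapped"
    if a in known_a and b in known_b:
        return "both_known_paired_differently"
    if a in known_a or b in known_b:
        return "one_known"
    return "none_known"


# Hypothetical reference library of (protospacer_A, protospacer_B) pairs.
library_pairs = {("AAAA", "CCCC"), ("GGGG", "TTTT")}
known_a = {a for a, _ in library_pairs}
known_b = {b for _, b in library_pairs}

# Hypothetical observed pairs, one per read pair.
observed = [
    ("AAAA", "CCCC"),  # exact library pair
    ("AAAA", "TTTT"),  # both known, but from different library pairs
    ("AAAA", "ACGT"),  # only A is known
    ("ACGT", "TGCA"),  # neither is known
    ("AAAA", "CCCC"),
]

summary = Counter(
    classify_pair(a, b, library_pairs, known_a, known_b) for a, b in observed
)
print(summary)
```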

class screenpro.ngs.GuideCounter(cas_type, library_type)

Bases: object

Class to count sequences from FASTQ files

build_counts_anndata(source='library', verbose=False)

Build AnnData object from count matrix

get_counts_matrix(fastq_dir, samples, get_recombinant=False, cas_type='cas9', protospacer_length='auto', trim_first_g=False, write=True, verbose=False)

Get count matrix for given samples

load_counts_matrix(counts_mat_path, **kwargs)

Load count matrix from file

load_library(library_path, sep='\t', index_col=0, protospacer_length=19, verbose=False, **args)

Load library file

Cas9 CRISPR-Cas system (single or dual sgRNA libraries)

screenpro.ngs.cas9.fastq_to_count_dual_guide(R1_fastq_file_path: str, R2_fastq_file_path: str, trim5p_pos1_start: Optional[int] = None, trim5p_pos1_length: Optional[int] = None, trim5p_pos2_start: Optional[int] = None, trim5p_pos2_length: Optional[int] = None, verbose: bool = False) → DataFrame

Count the occurrences of unique sequence pairs in paired-end FASTQ files and return them as a DataFrame. For example, in a dual-guide design, R1 carries protospacer_A and R2 carries protospacer_B.

Parameters
  • R1_fastq_file_path (str) – File path of the R1 FASTQ file.

  • R2_fastq_file_path (str) – File path of the R2 FASTQ file.

  • trim5p_pos1_start (int, optional) – Start position for trimming the 5’ end of the R1 sequences. Defaults to None.

  • trim5p_pos1_length (int, optional) – Length of the trimmed R1 sequences. Defaults to None.

  • trim5p_pos2_start (int, optional) – Start position for trimming the 5’ end of the R2 sequences. Defaults to None.

  • trim5p_pos2_length (int, optional) – Length of the trimmed R2 sequences. Defaults to None.

  • verbose (bool, optional) – Whether to print verbose output. Defaults to False.

Returns

DataFrame containing counts of unique sequences with columns ‘protospacer_A’, ‘protospacer_B’, and ‘count’.

Return type

pl.DataFrame

Raises

ValueError – If trim5p_pos1_start, trim5p_pos1_length, trim5p_pos2_start, and trim5p_pos2_length are not all provided together.
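To make the start/length trimming parameters and the all-or-none rule concrete, here is a small standalone sketch. It does not call the library. The helpers `trim` and `trim_pair` are hypothetical, the 1-based start position is an assumption (the indexing convention is not stated here), and returning untrimmed reads when all four values are None is likewise assumed from the None defaults.

```python
def trim(seq, start, length):
    """Return `length` bases of `seq` beginning at `start`.

    ASSUMPTION: `start` is 1-based; adjust if the real convention differs.
    """
    return seq[start - 1 : start - 1 + length]


def trim_pair(r1, r2, pos1_start=None, pos1_length=None,
              pos2_start=None, pos2_length=None):
    """Trim an R1/R2 pair, requiring all four trimming values or none of them."""
    params = (pos1_start, pos1_length, pos2_start, pos2_length)
    if all(p is None for p in params):
        # ASSUMPTION: with no trimming values, the reads are used as-is.
        return r1, r2
    if any(p is None for p in params):
        raise ValueError("Provide all four trimming parameters together, or none.")
    return trim(r1, pos1_start, pos1_length), trim(r2, pos2_start, pos2_length)


# Toy reads: keep 4 bases from position 2 of R1 and 5 bases from position 3 of R2.
protospacer_a, protospacer_b = trim_pair("GACGTACGTTT", "CCTTGGAACC", 2, 4, 3, 5)
print(protospacer_a, protospacer_b)
```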

screenpro.ngs.cas9.fastq_to_count_single_guide(fastq_file_path: str, trim5p_start: Optional[int] = None, trim5p_length: Optional[int] = None, verbose: bool = False) → DataFrame

Count the occurrences of unique sequences in a single-end FASTQ file and return them as a DataFrame. For example, in a single-guide design, R1 carries the protospacer.

Parameters
  • fastq_file_path (str) – The path to the FASTQ file.

  • trim5p_start (int, optional) – The starting position for trimming the 5’ end of the sequences. Defaults to None.

  • trim5p_length (int, optional) – The length of the trimmed sequences. Defaults to None.

  • verbose (bool, optional) – Whether to print verbose output. Defaults to False.

Returns

A DataFrame containing the unique sequences and their respective counts.

Return type

pl.DataFrame

screenpro.ngs.cas9.map_to_library_dual_guide(df_count, library, get_recombinant=False, return_type='all', verbose=False)

Map the counts of unique sequences to a library DataFrame containing dual-guide sgRNA sequences. Optionally, the function can capture recombinant events. The user can choose to return mapped reads, unmapped reads, recombinant events, or all of them.

Parameters
  • df_count (pandas.DataFrame) – The input DataFrame containing the counts.

  • library (pandas.DataFrame) – The library of sequences to map against.

  • get_recombinant (bool, optional) – Whether to calculate recombinant events. Defaults to False.

  • return_type (str, optional) – The type of reads to return. Can be ‘unmapped’, ‘mapped’, ‘recombinant’, or ‘all’. Defaults to ‘all’.

  • verbose (bool, optional) – Whether to print verbose output. Defaults to False.

Returns

The reads selected by the specified return_type.

Return type

pandas.DataFrame or dict

Raises
  • ValueError – If return_type is not one of ‘unmapped’, ‘mapped’, ‘recombinant’, or ‘all’.

  • ValueError – If get_recombinant is False and return_type is ‘recombinant’.
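The two error conditions above, together with the "DataFrame or dict" return type, suggest a selection step along the following lines. This is only a sketch of that behaviour under stated assumptions: `select_result` is a hypothetical helper, plain lists stand in for DataFrames, and returning a dict for 'all' is inferred from the documented return type rather than confirmed.

```python
VALID_RETURN_TYPES = ("unmapped", "mapped", "recombinant", "all")


def select_result(results, return_type="all", get_recombinant=False):
    """Pick one result table, or all of them, after validating the arguments."""
    if return_type not in VALID_RETURN_TYPES:
        raise ValueError(f"return_type must be one of {VALID_RETURN_TYPES}")
    if return_type == "recombinant" and not get_recombinant:
        raise ValueError("return_type='recombinant' requires get_recombinant=True")
    if return_type == "all":
        # ASSUMPTION: 'all' returns a dict keyed by result name.
        return dict(results)
    return results[return_type]


# Placeholder results; in practice these would be DataFrames.
results = {"mapped": ["pair_1"], "unmapped": ["pair_2"], "recombinant": ["pair_3"]}

only_mapped = select_result(results, return_type="mapped")
everything = select_result(results, return_type="all", get_recombinant=True)
print(only_mapped)
print(sorted(everything))
```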

screenpro.ngs.cas9.map_to_library_single_guide(df_count, library, return_type='all', verbose=False)

Map the counts of unique sequences to a library DataFrame containing sgRNA sequences. The user can choose to return mapped reads, unmapped reads, or both.

Parameters
  • df_count (pandas.DataFrame) – The input DataFrame containing counts.

  • library (pandas.DataFrame) – The library DataFrame to map to.

  • return_type (str, optional) – The type of result to return. Defaults to ‘all’.

  • verbose (bool, optional) – Whether to print verbose information. Defaults to False.

Returns

The mapped result based on the return_type parameter.

Return type

dict or pandas.DataFrame

Raises

ValueError – If the return_type parameter is invalid.

Cas12 CRISPR-Cas system (multiplexed crRNA libraries)