NGS screen processing module
The GuideCounter class is a wrapper that runs the module's functions for a
CRISPR screen experiment.
This module contains a set of Python functions to process and analyze
NGS files from CRISPR screens. Based on the type of CRISPR-Cas system
used for the screen, the functions are divided into two classes:
Cas9 and Cas12.
Scripts to work with NGS data
This module provides functions to process FASTQ files from screens with single or dual guide libraries. In general, the algorithm is fairly simple:
Read the FASTQ file and extract the proper sequences
Count the exact number of occurrences for each unique sequence
Map the counted sequences to the reference sequence library
Return the counted mapped or unmapped events as one or more DataFrames
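The four steps above can be sketched in plain Python. This is a minimal illustration of count-then-map, not the module's implementation: the helper names `count_fastq_sequences` and `map_counts`, the toy reads, and the dict-based library are all assumptions made for the example.

```python
from collections import Counter
from io import StringIO


def count_fastq_sequences(handle):
    """Count exact occurrences of each sequence in a FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        # A FASTQ record spans four lines; the sequence is the second one.
        if line_number % 4 == 1:
            counts[line.strip()] += 1
    return counts


def map_counts(counts, library):
    """Split counted sequences into mapped and unmapped groups.

    `library` is a hypothetical dict of {protospacer sequence: guide name}.
    """
    mapped = {library[seq]: n for seq, n in counts.items() if seq in library}
    unmapped = {seq: n for seq, n in counts.items() if seq not in library}
    return mapped, unmapped


# Toy single-end input: three reads, two of them identical.
fastq_text = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\nIIIIIIII\n"
    "@read3\nTTTTCCCC\n+\nIIIIIIII\n"
)
library = {"ACGTACGT": "guide_1", "GGGGAAAA": "guide_2"}

counts = count_fastq_sequences(StringIO(fastq_text))
mapped, unmapped = map_counts(counts, library)
```

Because counting happens before mapping, sequences absent from the library are kept in `unmapped` rather than discarded.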
For single-guide screens, the sequences are counted as single protospacers from a single-end read file (R1). Then, these sequences are mapped to the reference library of protospacer sequences.
For dual-guide screens, the sequences are counted as pairs of protospacer A and B from paired-end read files (R1 and R2). Then, sequences are mapped to the reference library of protospacer A and B pairs.
Because the algorithm counts first and maps afterwards, it can in principle detect any observed sequence, including recombination events. In a dual-guide design, these are read pairs whose protospacers A and B do not form a pair found in the reference library. Such events include:
Protospacers A and B are both present in the reference library but paired differently
Only one of protospacers A and B is present in the reference library
Neither protospacer A nor protospacer B is present in the reference library
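The categories above can be told apart with simple set lookups. The sketch below is illustrative only: the function `classify_pair` and its category labels are hypothetical, and it assumes that membership is checked per position (A against library A sequences, B against library B sequences).

```python
def classify_pair(protospacer_a, protospacer_b, library_pairs):
    """Classify one observed (A, B) pair against a set of reference pairs."""
    if (protospacer_a, protospacer_b) in library_pairs:
        return "mapped"

    # Assumption: each protospacer is looked up in its own position only.
    known_a = {a for a, _ in library_pairs}
    known_b = {b for _, b in library_pairs}
    a_known = protospacer_a in known_a
    b_known = protospacer_b in known_b

    if a_known and b_known:
        return "recombinant"  # both known, but paired differently
    if a_known or b_known:
        return "one_known"    # only one protospacer is in the library
    return "none_known"       # neither protospacer is in the library


library_pairs = {("AAAA", "CCCC"), ("GGGG", "TTTT")}

labels = [
    classify_pair("AAAA", "CCCC", library_pairs),
    classify_pair("AAAA", "TTTT", library_pairs),
    classify_pair("AAAA", "ACAC", library_pairs),
    classify_pair("ACAC", "GTGT", library_pairs),
]
```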
- class screenpro.ngs.GuideCounter(cas_type, library_type)[source]
Bases: object
Class to count sequences from FASTQ files
- build_counts_anndata(source='library', verbose=False)[source]
Build an AnnData object from the count matrix
Cas9 CRISPR-Cas system (single or dual sgRNA libraries)
- screenpro.ngs.cas9.fastq_to_count_dual_guide(R1_fastq_file_path: str, R2_fastq_file_path: str, trim5p_pos1_start: Optional[int] = None, trim5p_pos1_length: Optional[int] = None, trim5p_pos2_start: Optional[int] = None, trim5p_pos2_length: Optional[int] = None, verbose: bool = False) → DataFrame[source]
Count the occurrences of unique sequence pairs in paired-end FASTQ files and return a DataFrame of the counts. For example, in a dual-guide design, R1 carries protospacer_A and R2 carries protospacer_B.
- Parameters
R1_fastq_file_path (str) – File path of the R1 FASTQ file.
R2_fastq_file_path (str) – File path of the R2 FASTQ file.
trim5p_pos1_start (int, optional) – Start position for trimming the 5’ end of the R1 sequences. Defaults to None.
trim5p_pos1_length (int, optional) – Length of the trimmed R1 sequences. Defaults to None.
trim5p_pos2_start (int, optional) – Start position for trimming the 5’ end of the R2 sequences. Defaults to None.
trim5p_pos2_length (int, optional) – Length of the trimmed R2 sequences. Defaults to None.
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
- Returns
DataFrame containing counts of unique sequences with columns ‘protospacer_A’, ‘protospacer_B’, and ‘count’.
- Return type
pl.DataFrame
- Raises
ValueError – If trim5p_pos1_start, trim5p_pos1_length, trim5p_pos2_start, and trim5p_pos2_length are not provided concurrently.
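To make the trimming parameters concrete, here is a plain-Python sketch of paired counting with optional trimming. It is not the library's code. It assumes that start positions are 1-based, that the four trimming values must be given all together or not at all, and that R2 is used as read (no reverse complement). The names `trim` and `count_pairs` are hypothetical.

```python
from collections import Counter


def trim(seq, start, length):
    """Return `length` bases beginning at 1-based position `start` (assumed)."""
    return seq[start - 1 : start - 1 + length]


def count_pairs(r1_seqs, r2_seqs, pos1_start=None, pos1_length=None,
                pos2_start=None, pos2_length=None):
    """Count unique (R1, R2) sequence pairs, trimming both reads if requested."""
    provided = [value is not None
                for value in (pos1_start, pos1_length, pos2_start, pos2_length)]
    # Assumed rule: partial trimming settings are rejected.
    if any(provided) and not all(provided):
        raise ValueError("provide all four trimming values together, or none")

    counts = Counter()
    for r1, r2 in zip(r1_seqs, r2_seqs):
        if all(provided):
            r1 = trim(r1, pos1_start, pos1_length)
            r2 = trim(r2, pos2_start, pos2_length)
        counts[(r1, r2)] += 1
    return counts


r1_reads = ["GACGTAA", "GACGTAA", "GTTTTAA"]
r2_reads = ["CCGGAT", "CCGGAT", "CCGGAT"]

pair_counts = count_pairs(r1_reads, r2_reads,
                          pos1_start=2, pos1_length=4,
                          pos2_start=1, pos2_length=4)
```

Each key of `pair_counts` is a (protospacer_A, protospacer_B) tuple, which corresponds to one row of the three-column result described above.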
- screenpro.ngs.cas9.fastq_to_count_single_guide(fastq_file_path: str, trim5p_start: Optional[int] = None, trim5p_length: Optional[int] = None, verbose: bool = False) → DataFrame[source]
Count the occurrences of unique sequences in single-end FASTQ files and return a DataFrame of the counts. For example, in a single-guide design, R1 carries the protospacer.
- Parameters
fastq_file_path (str) – The path to the FASTQ file.
trim5p_start (int, optional) – The starting position for trimming the 5’ end of the sequences. Defaults to None.
trim5p_length (int, optional) – The length of the trimmed sequences. Defaults to None.
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
- Returns
A DataFrame containing the unique sequences and their respective counts.
- Return type
pl.DataFrame
- screenpro.ngs.cas9.map_to_library_dual_guide(df_count, library, get_recombinant=False, return_type='all', verbose=False)[source]
Map the counts of unique sequences to a library DataFrame containing dual-guide sgRNA sequences. Optionally, the function can capture recombinant events. The user can choose to return mapped reads, unmapped reads, recombinant events, or all of them.
- Parameters
df_count (pandas.DataFrame) – The input DataFrame containing the counts.
library (pandas.DataFrame) – The library of sequences to map against.
get_recombinant (bool, optional) – Whether to calculate recombinant events. Defaults to False.
return_type (str, optional) – The type of reads to return. Can be ‘unmapped’, ‘mapped’, ‘recombinant’, or ‘all’. Defaults to ‘all’.
verbose (bool, optional) – Whether to print verbose output. Defaults to False.
- Returns
The mapped reads based on the specified return_type.
- Return type
pandas.DataFrame or dict
- Raises
ValueError – If return_type is not one of ‘unmapped’, ‘mapped’, ‘recombinant’, or ‘all’.
ValueError – If get_recombinant is False and return_type is ‘recombinant’.
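The interaction between get_recombinant and return_type can be sketched as a small selection step. This is an assumption-laden illustration, not the library's code: the helper `select_result` is hypothetical, lists stand in for DataFrames, and it assumes that 'all' returns a dict keyed by category while any other value returns a single table.

```python
VALID_RETURN_TYPES = ("unmapped", "mapped", "recombinant", "all")


def select_result(results, get_recombinant=False, return_type="all"):
    """Pick what to return from a dict of per-category tables."""
    if return_type not in VALID_RETURN_TYPES:
        raise ValueError(f"return_type must be one of {VALID_RETURN_TYPES}")
    if return_type == "recombinant" and not get_recombinant:
        raise ValueError("return_type='recombinant' requires get_recombinant=True")

    if not get_recombinant:
        # Recombinant events were not computed, so leave them out.
        results = {key: value for key, value in results.items()
                   if key != "recombinant"}

    if return_type == "all":
        return results            # dict of every available category
    return results[return_type]   # a single table


# Lists stand in for the per-category DataFrames.
tables = {
    "mapped": [("AAAA", "CCCC", 10)],
    "unmapped": [("ACAC", "GTGT", 1)],
    "recombinant": [("AAAA", "TTTT", 2)],
}

everything = select_result(tables, get_recombinant=True, return_type="all")
only_mapped = select_result(tables, return_type="mapped")
```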
- screenpro.ngs.cas9.map_to_library_single_guide(df_count, library, return_type='all', verbose=False)[source]
Map the counts of unique sequences to a library DataFrame containing sgRNA sequences. The user can choose to return mapped reads, unmapped reads, or both.
- Parameters
df_count (pandas.DataFrame) – The input DataFrame containing counts.
library (pandas.DataFrame) – The library DataFrame to map to.
return_type (str, optional) – The type of result to return. Defaults to ‘all’.
verbose (bool, optional) – Whether to print verbose information. Defaults to False.
- Returns
The mapped result based on the return_type parameter.
- Return type
dict or pandas.DataFrame
- Raises
ValueError – If the return_type parameter is invalid.