VAPr package¶
Submodules¶
VAPr.annovar_output_parsing module¶
-
class
VAPr.annovar_output_parsing.
AnnovarAnnotatedVariant
[source]¶ Bases:
object
-
ALLELE_DEPTH_KEY
= 'AD'¶
-
FILTER_PASSING_READS_COUNT_KEY
= 'filter_passing_reads_count'¶
-
GENOTYPE_KEY
= 'genotype'¶
-
GENOTYPE_LIKELIHOODS_KEY
= 'genotype_likelihoods'¶
-
GENOTYPE_SUBCLASS_BY_CLASS_KEY
= 'genotype_subclass_by_class'¶
-
HGVS_ID_KEY
= 'hgvs_id'¶
-
SAMPLES_KEY
= 'samples'¶
-
SAMPLE_ID_KEY
= 'sample_id'¶
-
-
class
VAPr.annovar_output_parsing.
AnnovarTxtParser
[source]¶ Bases:
object
Class that processes an Annovar-created tab-delimited text file.
-
ALT_HEADER
= 'alt'¶
-
CHR_HEADER
= 'chr'¶
-
CYTOBAND_HEADER
= 'cytoband'¶
-
END_HEADER
= 'end'¶
-
ESP6500_ALL_HEADER
= 'esp6500siv2_all'¶
-
EXONICFUNC_KNOWNGENE_HEADER
= 'exonicfunc_knowngene'¶
-
FUNC_KNOWNGENE_HEADER
= 'func_knowngene'¶
-
GENEDETAIL_KNOWNGENE_HEADER
= 'genedetail_knowngene'¶
-
GENE_KNOWNGENE_HEADER
= 'gene_knowngene'¶
-
GENOMIC_SUPERDUPS_HEADER
= 'genomicsuperdups'¶
-
NCI60_HEADER
= 'nci60'¶
-
OTHERINFO_HEADER
= 'otherinfo'¶
-
RAW_CHR_MT_SUFFIX_VAL
= 'M'¶
-
RAW_CHR_MT_VAL
= 'chrM'¶
-
REF_HEADER
= 'ref'¶
-
SCORE_KEY
= 'Score'¶
-
STANDARDIZED_CHR_MT_SUFFIX_VAL
= 'MT'¶
-
STANDARDIZED_CHR_MT_VAL
= 'chrMT'¶
-
START_HEADER
= 'start'¶
-
TFBS_CONS_SITES_HEADER
= 'tfbsconssites'¶
-
THOU_G_2015_ALL_HEADER
= '1000g2015aug_all'¶
-
VAPr.annovar_running module¶
-
class
VAPr.annovar_running.
AnnovarWrapper
(annovar_install_path, genome_build_version, custom_annovar_dbs_to_use=None)[source]¶ Bases:
object
Wrapper around ANNOVAR download and annotation functions
-
hg_19_databases
= {'1000g2015aug': 'f', 'knownGene': 'g'}¶
-
hg_38_databases
= {'1000g2015aug': 'f', 'knownGene': 'g'}¶
-
VAPr.chunk_processing module¶
VAPr.filtering module¶
-
VAPr.filtering.
make_de_novo_variants_filter
(proband, ancestor1, ancestor2)[source]¶ Function that builds a filter for de novo variant analysis. It can be applied to multisample files or to data coming from a collection of files. In the former case, every sample contains the same variants, although the samples differ in their allele frequency and read values. A de novo variant is defined as a variant that occurs only in the specified sample (proband) and not in the other two (ancestor1, ancestor2). Occurrence is defined as having allele frequencies greater than [0, 0] ([REF, ALT]).
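The definition above can be illustrated with a small plain-Python predicate. This is only a sketch of the stated rule, not the MongoDB filter that VAPr builds: the document layout (a 'samples' list whose entries carry 'sample_id' and 'AD') is an assumption pieced together from the key constants listed on this page, and the helper names are hypothetical.

```python
SAMPLES_KEY = "samples"
SAMPLE_ID_KEY = "sample_id"
ALLELE_DEPTH_KEY = "AD"


def sample_has_variant(variant, sample_id):
    """Return True if the sample's [REF, ALT] values are greater than [0, 0]."""
    for sample in variant.get(SAMPLES_KEY, []):
        if sample.get(SAMPLE_ID_KEY) == sample_id:
            depths = sample.get(ALLELE_DEPTH_KEY) or [0, 0]
            # For non-negative values this is the same as depths > [0, 0].
            return any(depth > 0 for depth in depths)
    return False  # assumption: a sample that is not listed does not have the variant


def is_de_novo(variant, proband, ancestor1, ancestor2):
    """Variant occurs in the proband and in neither of the two ancestors."""
    return (sample_has_variant(variant, proband)
            and not sample_has_variant(variant, ancestor1)
            and not sample_has_variant(variant, ancestor2))


example_variant = {
    "hgvs_id": "chr1:g.100A>T",  # made-up identifier for illustration
    "samples": [
        {"sample_id": "child", "AD": [12, 9]},
        {"sample_id": "mother", "AD": [0, 0]},
        {"sample_id": "father", "AD": [0, 0]},
    ],
}
print(is_de_novo(example_variant, "child", "mother", "father"))
```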
-
VAPr.filtering.
make_deleterious_compound_heterozygous_variants_filter
(sample_ids_list=None)[source]¶
VAPr.validation module¶
This module exposes utility functions to validate user inputs.
By convention, validation functions in this module raise an appropriate Error if validation is unsuccessful. If it is successful, they return either nothing or the appropriately converted input value.
-
VAPr.validation.
convert_to_nonneg_int
(input_val, nullable=False)[source]¶ For non-null input_val, cast to a non-negative integer and return result; for null input_val, return None.
Parameters: - input_val (Any) – The value to attempt to convert to either a non-negative integer or a None (if nullable). The recognized null values are ‘.’, None, ‘’, and ‘NULL’
- nullable (Optional[bool]) – True if the input value may be null, false otherwise. Defaults to False.
Returns: None if nullable=True and the input is a null value. The appropriately cast non-negative integer if input is not null and the cast is successful.
Raises: ValueError
– if the input cannot be successfully converted to a non-negative integer or, if allowed, None
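The contract described above can be sketched as follows. This is a stand-alone re-implementation written for illustration under the documented behaviour, not the code of VAPr.validation; the function name with the _sketch suffix and the NULL_VALUES tuple are hypothetical.

```python
NULL_VALUES = (".", None, "", "NULL")  # the recognized null values listed above


def convert_to_nonneg_int_sketch(input_val, nullable=False):
    """Cast input_val to a non-negative int, or return None for allowed nulls."""
    if nullable and input_val in NULL_VALUES:
        return None
    try:
        result = int(input_val)
    except (TypeError, ValueError):
        raise ValueError(
            "{!r} cannot be converted to a non-negative integer".format(input_val)
        ) from None
    if result < 0:
        raise ValueError("{!r} is negative".format(input_val))
    return result


print(convert_to_nonneg_int_sketch("42"))
print(convert_to_nonneg_int_sketch(".", nullable=True))
```

Note that with nullable=False a null value such as '.' is not special-cased, so it falls through to the cast and raises ValueError.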
-
VAPr.validation.
convert_to_nullable
(input_val, cast_function)[source]¶ For non-null input_val, apply cast_function and return result if successful; for null input_val, return None.
Parameters: - input_val (Any) – The value to attempt to convert to either a None or the type specified by cast_function. The recognized null values are ‘.’, None, ‘’, and ‘NULL’
- cast_function (Callable[[Any], Any]) – A function to cast the input_val to some specified type; should raise an error if this cast fails.
Returns: None if input is the null value. An appropriately cast value if input is not null and the cast is successful.
Raises: Error
– whatever error is provided by cast_function if the cast fails.
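A minimal sketch of this behaviour, assuming only what the description above states: null values map to None, and everything else is handed to cast_function, whose errors propagate unchanged. The function name and NULL_VALUES are hypothetical and are not taken from the VAPr source.

```python
NULL_VALUES = (".", None, "", "NULL")  # the recognized null values listed above


def convert_to_nullable_sketch(input_val, cast_function):
    """Return None for null inputs; otherwise return cast_function(input_val)."""
    if input_val in NULL_VALUES:
        return None
    # Any exception raised by cast_function is deliberately not caught here.
    return cast_function(input_val)


print(convert_to_nullable_sketch("0.5", float))
print(convert_to_nullable_sketch("NULL", float))
```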
VAPr.vapr_core module¶
-
class
VAPr.vapr_core.
VaprAnnotator
(input_dir, output_dir, mongo_db_name, mongo_collection_name, annovar_install_path=None, design_file=None, build_ver=None, vcfs_gzipped=False)[source]¶ Bases:
object
Class in charge of gathering requirements, finding files, downloading databases required to run the annotation
Parameters: - input_dir (str) – Input directory to vcf files
- output_dir (str) – Output directory to annotated vcf files
- mongo_db_name (str) – Name of the database in which to store the collection of variants
- mongo_collection_name (str) – Name of the collection in which to store the annotated variants
- annovar_install_path (str) – Path to locally installed annovar scripts
- design_file (str) – path to csv design file
- build_ver (str) – genome build version against which annotation will be done. Either hg19 or hg38
- vcfs_gzipped (bool) – if the vcf files are gzipped, set to True
-
DEFAULT_GENOME_VERSION
= 'hg19'¶
-
HG19_VERSION
= 'hg19'¶
-
HG38_VERSION
= 'hg38'¶
-
SAMPLE_NAMES_KEY
= 'Sample_Names'¶
-
SUPPORTED_GENOME_BUILD_VERSIONS
= ['hg19', 'hg38']¶
-
annotate
(num_processes=4, chunk_size=2000, verbose_level=1, allow_adds=False)[source]¶ This is the main function of the package. It first runs Annovar and then kick-starts the full annotation workflow: it collects all the variant data from the Annovar annotations, combines it with data coming from MyVariant.info, and writes it to MongoDB, in the database and collection specified when the VaprAnnotator was created.
It returns a VaprDataset object, which can then be used for downstream filtering and analysis.
Parameters: - num_processes (int, optional) – number of parallel processes. Defaults to 4
- chunk_size (int, optional) – number of variants to be processed at once. Defaults to 2000
- verbose_level (int, optional) – higher verbosity will give more feedback; raise to 2 or 3 when debugging. Defaults to 1
- allow_adds (bool, optional) – whether to allow adding new variants to a pre-existing Mongo collection rather than overwriting it. Defaults to False
Returns: VaprDataset
Return type: VAPr.vapr_core.VaprDataset
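A typical call sequence might look like the sketch below. The constructor and method signatures are the ones documented on this page; every path and name passed in is a hypothetical placeholder, and actually running the function assumes that VAPr, a reachable MongoDB instance, and a local ANNOVAR installation are available. The import is placed inside the function so that the snippet can be loaded without VAPr installed.

```python
# Keyword arguments mirroring the documented defaults of annotate().
ANNOTATE_OPTIONS = {
    "num_processes": 4,
    "chunk_size": 2000,
    "verbose_level": 1,
    "allow_adds": False,
}


def run_full_annotation():
    """Annotate a directory of VCF files and return the resulting VaprDataset."""
    from VAPr.vapr_core import VaprAnnotator  # requires VAPr to be installed

    annotator = VaprAnnotator(
        input_dir="/path/to/vcf_files",           # hypothetical path
        output_dir="/path/to/annotated_output",   # hypothetical path
        mongo_db_name="VariantDatabase",          # hypothetical name
        mongo_collection_name="CohortVariants",   # hypothetical name
        annovar_install_path="/path/to/annovar",  # hypothetical path
        build_ver="hg19",
        vcfs_gzipped=False,
    )
    return annotator.annotate(**ANNOTATE_OPTIONS)
```

The returned VaprDataset can then be passed to the filtering and export methods documented further down this page, such as get_rare_deleterious_variants() or write_unfiltered_annotated_csv(output_fp).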
-
annotate_lite
(num_processes=8, chunk_size=2000, verbose_level=1, allow_adds=False)[source]¶ ‘Lite’ annotation: it queries MyVariant.info only, without generating annotations from Annovar. It requires only VAPr to be installed. The execution grabs the HGVS ids from the vcf files and queries the variant data from MyVariant.info.
Limitations of this mode include the inability to run native VAPr queries on the data.
It returns a VaprDataset object, which can then be used for downstream filtering and analysis.
Parameters: - num_processes (int, optional) – number of parallel processes. Defaults to 8
- chunk_size (int, optional) – number of variants to be processed at once. Defaults to 2000
- verbose_level (int, optional) – higher verbosity will give more feedback; raise to 2 or 3 when debugging. Defaults to 1
- allow_adds (bool, optional) – whether to allow adding new variants to a pre-existing Mongo collection rather than overwriting it. Defaults to False
Returns: VaprDataset
Return type: VAPr.vapr_core.VaprDataset
-
class
VAPr.vapr_core.
VaprDataset
(mongo_db_name, mongo_collection_name, merged_vcf_path=None)[source]¶ Bases:
object
-
full_name
¶ Full name of database and collection
Returns: Full name of database and collection
Return type: str
-
get_custom_filtered_variants
(filter_dictionary)[source]¶ See “Create your own filter” for more information on how to implement a custom filter
Parameters: filter_dictionary (dict) – MongoDB custom filter
Returns: list of variants
Return type: list
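As an illustration of what such a filter dictionary could look like, the sketch below builds a standard MongoDB query document. The field names are assumptions: they reuse the header and key constants listed on this page ('samples', 'sample_id', 'func_knowngene', '1000g2015aug_all') and presume that the stored variant documents use them as keys. The helper name and the 0.05 threshold are hypothetical.

```python
def make_rare_exonic_filter(sample_id, max_frequency=0.05):
    """Build a MongoDB query for rare exonic variants seen in one sample."""
    return {
        "$and": [
            # Dot notation matches any element of the 'samples' array.
            {"samples.sample_id": sample_id},
            {"func_knowngene": "exonic"},
            {"1000g2015aug_all": {"$lt": max_frequency}},
        ]
    }


custom_filter = make_rare_exonic_filter("sample_A")
print(custom_filter)
```

Given a VaprDataset instance named dataset, the result would then be passed as dataset.get_custom_filtered_variants(custom_filter).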
-
get_de_novo_variants
(proband, ancestor1, ancestor2)[source]¶ See “4. De novo Variants” for more information on how this is implemented
Returns: list of variants
Return type: list
-
get_deleterious_compound_heterozygous_variants
(sample_names_list=None)[source]¶ See “3. Deleterious Compound Heterozygous Variants” for more information on how this is implemented
Parameters: sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns: list of variants
Return type: list
-
get_distinct_sample_ids
()[source]¶ Return the distinct sample ids found in the collection
Returns: list of sample ids
Return type: list
-
get_known_disease_variants
(sample_names_list=None)[source]¶ See “2. Known Disease Variants” for more information on how this is implemented
Parameters: sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns: list of variants
Return type: list
-
get_rare_deleterious_variants
(sample_names_list=None)[source]¶ See “1. Rare Deleterious Variants” for more information on how this is implemented
Parameters: sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns: list of variants
Return type: list
-
get_variants_as_dataframe
(filtered_variants=None)[source]¶ Utility to get a dataframe from variants, either all of them or a filtered subset
Parameters: filtered_variants – a list of variants (Default value = None)
Returns: pandas.DataFrame
-
get_variants_for_sample
(sample_name)[source]¶ Return variants for a specific sample
Parameters: sample_name (str) – name of sample
Returns: list of variants
Return type: list
-
get_variants_for_samples
(specific_sample_names)[source]¶ Return variants from multiple samples
Parameters: specific_sample_names (list) – names of samples
Returns: list of variants
Return type: list
-
is_empty
¶ If there are no records in the collection, returns True
Returns: True if there are no records in the collection
Return type: bool
-
num_records
¶ Number of records in MongoDB collection
Returns: Number of records in MongoDB collection
Return type: int
-
write_filtered_annotated_csv
(filtered_variants, output_fp)[source]¶ Write a csv file containing the annotations for a list of filtered variants retrieved from MongoDB
Parameters: - filtered_variants – a list of variants
- output_fp (str) – Output file path
Returns: None
-
write_filtered_annotated_vcf
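(filtered_variants, vcf_output_path, info_out=True)[source]¶
Parameters: - filtered_variants – a list of variants
- vcf_output_path (str) – Output file path
- info_out (bool) – if True, extra annotation information will be written to the vcf file (Default value = True)
Returns: None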
-
write_unfiltered_annotated_csv
(output_fp)[source]¶ Full csv file containing annotations from both annovar and myvariant.info
Parameters: output_fp (str) – Output file path
Returns: None
-
write_unfiltered_annotated_csvs_per_sample
(output_dir)[source]¶
Parameters: output_dir – Output directory
Returns: None
-
write_unfiltered_annotated_vcf
(vcf_output_path, info_out=True)[source]¶ Full vcf file containing annotations for the variants stored in MongoDB
Parameters: - vcf_output_path (str) – Output file path
- info_out (bool) – if True, extra annotation information will be written to the vcf file (Default value = True)
Returns: None
-
VAPr.vcf_genotype_fields_parsing module¶
-
class
VAPr.vcf_genotype_fields_parsing.
Allele
(unfiltered_read_counts=None)[source]¶ Bases:
object
Store unfiltered read counts, if any, for a particular allele.
-
unfiltered_read_counts
¶ int or None – Number of unfiltered reads for this sample at this site, from the AD field.
-
-
class
VAPr.vcf_genotype_fields_parsing.
GenotypeLikelihood
(allele1_number, allele2_number, likelihood_neg_exponent)[source]¶ Bases:
object
Store parsed info from VCF genotype likelihood field for a single sample.
-
allele1_number
¶ int – The allele identifier for the left-hand allele inferred for this genotype likelihood.
-
allele2_number
¶ int – The allele identifier for the right-hand allele inferred for this genotype likelihood.
-
likelihood_neg_exponent
¶ float – The “normalized” Phred-scaled likelihood of the genotype represented by allele1 and allele2.
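To show how the three attributes relate to a raw PL value, the sketch below pairs each Phred-scaled likelihood with its allele numbers, using the diploid genotype ordering defined by the VCF specification (0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ...). It is an illustration only: the namedtuple and function names are hypothetical and this is not the VAPr parser.

```python
from collections import namedtuple

GenotypeLikelihoodSketch = namedtuple(
    "GenotypeLikelihoodSketch",
    ["allele1_number", "allele2_number", "likelihood_neg_exponent"],
)


def parse_pl_field(pl_string):
    """Pair each PL value with the (allele1, allele2) genotype it describes."""
    values = [float(value) for value in pl_string.split(",")]
    likelihoods = []
    allele2 = 0
    while len(likelihoods) < len(values):
        # For a fixed right-hand allele, the left-hand allele runs from 0 up to it.
        for allele1 in range(allele2 + 1):
            if len(likelihoods) == len(values):
                break
            likelihoods.append(
                GenotypeLikelihoodSketch(allele1, allele2, values[len(likelihoods)])
            )
        allele2 += 1
    return likelihoods


for likelihood in parse_pl_field("255,0,255"):
    print(likelihood)
```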
-
-
class
VAPr.vcf_genotype_fields_parsing.
VCFGenotypeInfo
(raw_string)[source]¶ Bases:
object
Store parsed info from VCF genotype fields for a single sample.
-
_raw_string
¶ str – The genotype fields values string from a VCF file (e.g., ‘0/1:173,141:282:99:255,0,255’).
-
genotype
¶ Optional[str] – The type of each of the sample’s two alleles, such as 0/0, 0/1, etc.
-
alleles
¶ List[Allele] – One Allele object for each allele detected for this variant (this can be across samples, so there can be more than 2 alleles).
-
genotype_likelihoods
¶ List[GenotypeLikelihood] – The GenotypeLikelihood object for each allele.
-
unprocessed_info
¶ Dict[str, Any] – Dictionary of field tag and value(s) for any fields not stored in dedicated attributes of VCFGenotypeInfo. Values are parsed to lists and/or floats if possible.
-
genotype_subclass_by_class
¶ Dict[str, str] – Genotype subclass (reference, alt, compound) keyed by genotype class (homozygous/heterozygous).
-
filter_passing_reads_count
¶ int or None – Filtered depth of coverage of this sample at this site from the DP field.
-
genotype_confidence
¶ str – Genotype quality (confidence) of this sample at this site, from the GQ field.
-
-
class
VAPr.vcf_genotype_fields_parsing.
VCFGenotypeParser
[source]¶ Bases:
object
Mine format string and genotype fields string to create a filled VCFGenotypeInfo object.
-
FILTERED_ALLELE_DEPTH_TAG
= 'DP'¶
-
GENOTYPE_QUALITY_TAG
= 'GQ'¶
-
GENOTYPE_TAG
= 'GT'¶
-
NORMALIZED_SCALED_LIKELIHOODS_TAG
= 'PL'¶
-
UNFILTERED_ALLELE_DEPTH_TAG
= 'AD'¶
-
static
is_valid_genotype_fields_string
(genotype_fields_string)[source]¶ Return True if the input has any real genotype fields content, False if it is just periods, zeroes, and delimiters.
Parameters: genotype_fields_string (str) – A VCF-style genotype fields string, such as 1/1:0,2:2:6:89,6,0 or ./.:.:.:.:.
Returns: True if the input has any real genotype fields content, False if it is just periods, zeroes, and delimiters.
Return type: bool
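One way to express this check is with a regular expression that matches strings made up only of periods, zeroes, and delimiters. The sketch below is an assumption-laden illustration, not the VAPr implementation: the delimiter set (colon, comma, slash, pipe) and the function name are hypothetical choices.

```python
import re

# Matches strings consisting solely of '.', '0', and the assumed delimiters.
_NO_CONTENT_PATTERN = re.compile(r"^[.0:,/|]*$")


def has_real_genotype_content(genotype_fields_string):
    """Return True unless the string is only periods, zeroes, and delimiters."""
    return _NO_CONTENT_PATTERN.match(genotype_fields_string) is None


print(has_real_genotype_content("1/1:0,2:2:6:89,6,0"))
print(has_real_genotype_content("./.:.:.:.:."))
```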
-
classmethod
parse
(format_key_string, format_value_string)[source]¶ Parse the input format string and genotype fields string into a filled VCFGenotypeInfo object.
Returns: A filled VCFGenotypeInfo for this sample at this site unless an error was encountered, in which case None is returned.
Return type: VCFGenotypeInfo or None
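The general idea of pairing the format string with the genotype fields string can be sketched as follows. This simplified stand-in returns a plain dictionary rather than a VCFGenotypeInfo, handles only the five tags listed for this class (GT, AD, DP, GQ, PL), and mirrors the documented convention of returning None on error. The function name and the dictionary keys are assumptions for illustration.

```python
def parse_genotype_fields(format_key_string, format_value_string):
    """Parse e.g. 'GT:AD:DP:GQ:PL' and '0/1:173,141:282:99:255,0,255' into a dict."""
    keys = format_key_string.split(":")
    values = format_value_string.split(":")
    if len(keys) != len(values):
        return None  # tags and values do not line up
    fields = dict(zip(keys, values))
    try:
        return {
            "genotype": fields["GT"],
            "AD": [int(count) for count in fields["AD"].split(",")],
            "filter_passing_reads_count": int(fields["DP"]),
            "genotype_confidence": fields["GQ"],
            "genotype_likelihoods": [float(pl) for pl in fields["PL"].split(",")],
        }
    except (KeyError, ValueError):
        return None  # a tag is missing or a value is not numeric


print(parse_genotype_fields("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255"))
```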
-