VAPr package¶

Submodules¶

VAPr.annovar_output_parsing module¶

class VAPr.annovar_output_parsing.AnnovarAnnotatedVariant[source]¶

Bases: object

ALLELE_DEPTH_KEY = 'AD'¶

FILTER_PASSING_READS_COUNT_KEY = 'filter_passing_reads_count'¶

GENOTYPE_KEY = 'genotype'¶

GENOTYPE_LIKELIHOODS_KEY = 'genotype_likelihoods'¶

GENOTYPE_SUBCLASS_BY_CLASS_KEY = 'genotype_subclass_by_class'¶

HGVS_ID_KEY = 'hgvs_id'¶

SAMPLES_KEY = 'samples'¶

SAMPLE_ID_KEY = 'sample_id'¶

classmethod make_per_variant_annotation_dict(fields_by_annovar_header, hgvs_id, format_string, genotype_field_strings_by_sample_name)[source]¶

class VAPr.annovar_output_parsing.AnnovarTxtParser[source]¶

Bases: object

Class that processes an Annovar-created tab-delimited text file.

ALT_HEADER = 'alt'¶

CHR_HEADER = 'chr'¶

CYTOBAND_HEADER = 'cytoband'¶

END_HEADER = 'end'¶

ESP6500_ALL_HEADER = 'esp6500siv2_all'¶

EXONICFUNC_KNOWNGENE_HEADER = 'exonicfunc_knowngene'¶

FUNC_KNOWNGENE_HEADER = 'func_knowngene'¶

GENEDETAIL_KNOWNGENE_HEADER = 'genedetail_knowngene'¶

GENE_KNOWNGENE_HEADER = 'gene_knowngene'¶

GENOMIC_SUPERDUPS_HEADER = 'genomicsuperdups'¶

NCI60_HEADER = 'nci60'¶

OTHERINFO_HEADER = 'otherinfo'¶

RAW_CHR_MT_SUFFIX_VAL = 'M'¶

RAW_CHR_MT_VAL = 'chrM'¶

REF_HEADER = 'ref'¶

SCORE_KEY = 'Score'¶

STANDARDIZED_CHR_MT_SUFFIX_VAL = 'MT'¶

STANDARDIZED_CHR_MT_VAL = 'chrMT'¶

START_HEADER = 'start'¶

TFBS_CONS_SITES_HEADER = 'tfbsconssites'¶

THOU_G_2015_ALL_HEADER = '1000g2015aug_all'¶

classmethod read_chunk_of_annotations_to_dicts_list(annovar_txt_file_like_obj, sample_names_list, chunk_index, chunk_size)[source]¶
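The chunk_index and chunk_size arguments suggest that each call reads one fixed-size slice of the Annovar text file. The following is a minimal sketch of that idea, not VAPr's implementation. It assumes that chunk i covers data rows i * chunk_size up to (i + 1) * chunk_size after a single header row, that headers are lower-cased to match constants such as 'chr', and that 'chrM' is rewritten to 'chrMT' as the constants above suggest. read_chunk and the sample file content are hypothetical.

```python
import csv
import io
from itertools import islice

RAW_CHR_MT_VAL = "chrM"            # raw value, per the constants above
STANDARDIZED_CHR_MT_VAL = "chrMT"  # standardized value, per the constants above


def read_chunk(file_like_obj, chunk_index, chunk_size):
    """Hypothetical helper: return one chunk of data rows as a list of dicts."""
    reader = csv.DictReader(file_like_obj, delimiter="\t")
    start = chunk_index * chunk_size
    rows = []
    for row in islice(reader, start, start + chunk_size):
        # Assumption: headers are lower-cased so they match 'chr', 'alt', etc.
        row = {key.lower(): value for key, value in row.items()}
        if row.get("chr") == RAW_CHR_MT_VAL:
            row["chr"] = STANDARDIZED_CHR_MT_VAL
        rows.append(row)
    return rows


# Made-up file content, for illustration only.
ANNOVAR_TEXT = (
    "Chr\tStart\tEnd\tRef\tAlt\n"
    "chr1\t100\t100\tA\tG\n"
    "chrM\t200\t200\tC\tT\n"
    "chr2\t300\t300\tG\tA\n"
)

first_chunk = read_chunk(io.StringIO(ANNOVAR_TEXT), chunk_index=0, chunk_size=2)
second_chunk = read_chunk(io.StringIO(ANNOVAR_TEXT), chunk_index=1, chunk_size=2)
```

Reading by index rather than by file offset keeps each chunk independent, which is convenient when chunks are handed to separate worker processes.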

VAPr.annovar_running module¶

class VAPr.annovar_running.AnnovarWrapper(annovar_install_path, genome_build_version, custom_annovar_dbs_to_use=None)[source]¶

Bases: object

Wrapper around ANNOVAR download and annotation functions

download_databases()[source]¶

hg_19_databases = {'1000g2015aug': 'f', 'knownGene': 'g'}¶

hg_38_databases = {'1000g2015aug': 'f', 'knownGene': 'g'}¶

run_annotation(single_vcf_path, output_basename, output_dir)[source]¶

VAPr.chunk_processing module¶

class VAPr.chunk_processing.AnnotationJobParamsIndices[source]¶

CHUNK_INDEX_INDEX = 0¶

CHUNK_SIZE_INDEX = 2¶

COLLECTION_NAME_INDEX = 4¶

DB_NAME_INDEX = 3¶

FILE_PATH_INDEX = 1¶

GENOME_BUILD_VERSION_INDEX = 5¶

SAMPLE_LIST_INDEX = 7¶

VERBOSE_LEVEL_INDEX = 6¶

classmethod get_num_possible_indices()[source]¶

VAPr.chunk_processing.collect_chunk_annotations_and_store(job_params_tuple)[source]¶
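The index constants above describe the position of each value inside job_params_tuple. As a hedged illustration, the sketch below mirrors those constants in a local class and builds such a tuple. make_job_params_tuple and all argument values are hypothetical, and packing the job into a single tuple is presumably done so that it can be passed to a worker as one argument.

```python
class AnnotationJobParamsIndices:
    """Local mirror of the documented index constants (illustration only)."""
    CHUNK_INDEX_INDEX = 0
    FILE_PATH_INDEX = 1
    CHUNK_SIZE_INDEX = 2
    DB_NAME_INDEX = 3
    COLLECTION_NAME_INDEX = 4
    GENOME_BUILD_VERSION_INDEX = 5
    VERBOSE_LEVEL_INDEX = 6
    SAMPLE_LIST_INDEX = 7


def make_job_params_tuple(chunk_index, file_path, chunk_size, db_name,
                          collection_name, genome_build_version,
                          verbose_level, sample_list):
    """Hypothetical helper: place each value at its documented position."""
    idx = AnnotationJobParamsIndices
    params = [None] * 8
    params[idx.CHUNK_INDEX_INDEX] = chunk_index
    params[idx.FILE_PATH_INDEX] = file_path
    params[idx.CHUNK_SIZE_INDEX] = chunk_size
    params[idx.DB_NAME_INDEX] = db_name
    params[idx.COLLECTION_NAME_INDEX] = collection_name
    params[idx.GENOME_BUILD_VERSION_INDEX] = genome_build_version
    params[idx.VERBOSE_LEVEL_INDEX] = verbose_level
    params[idx.SAMPLE_LIST_INDEX] = sample_list
    return tuple(params)


# All values below are made up for illustration.
job = make_job_params_tuple(
    chunk_index=0,
    file_path="example_annotated.txt",
    chunk_size=2000,
    db_name="example_db",
    collection_name="example_collection",
    genome_build_version="hg19",
    verbose_level=1,
    sample_list=["sampleA", "sampleB"],
)
```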

VAPr.filtering module¶

VAPr.filtering.get_any_of_sample_ids_filter(sample_names_list)[source]¶

VAPr.filtering.get_sample_id_filter(sample_name)[source]¶
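The two sample filters above return MongoDB filter dictionaries. The sketch below shows what such dictionaries might look like. It assumes that each variant document holds a list of per-sample sub-documents under 'samples', each with a 'sample_id' field, as the SAMPLES_KEY and SAMPLE_ID_KEY constants suggest. The actual document layout and the exact dictionaries VAPr returns are not stated here, so treat the function bodies as illustrative.

```python
SAMPLES_KEY = "samples"      # per AnnovarAnnotatedVariant.SAMPLES_KEY
SAMPLE_ID_KEY = "sample_id"  # per AnnovarAnnotatedVariant.SAMPLE_ID_KEY

# Assumed dotted path into the per-sample sub-documents.
SAMPLE_ID_PATH = f"{SAMPLES_KEY}.{SAMPLE_ID_KEY}"


def sample_id_filter(sample_name):
    """Match variants that have an entry for one sample."""
    return {SAMPLE_ID_PATH: sample_name}


def any_of_sample_ids_filter(sample_names_list):
    """Match variants that have an entry for at least one of the samples."""
    return {SAMPLE_ID_PATH: {"$in": list(sample_names_list)}}


single = sample_id_filter("sampleA")
several = any_of_sample_ids_filter(["sampleA", "sampleB"])
```

Dictionaries of this shape can be passed to any API that accepts a MongoDB query document.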

VAPr.filtering.make_de_novo_variants_filter(proband, ancestor1, ancestor2)[source]¶: Function for de novo variant analysis. Can be performed on multisample files or on data coming from a collection of files. In the former case, every sample contains the same variants, although they differ in their allele frequency and read values. A de novo variant is defined as a variant that occurs only in the specified sample (proband) and not in the other two (ancestor1, ancestor2). Occurrence is defined as having allele frequencies greater than [0, 0] ([REF, ALT]).
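The de novo definition can be expressed in plain Python, independent of MongoDB. This sketch is one interpretation of the rule, not the filter VAPr builds: it treats a variant as occurring in a sample when any of that sample's [REF, ALT] values is greater than zero. occurs, is_de_novo, and the sample names are hypothetical.

```python
def occurs(ref_alt_values):
    """Assumed reading of 'greater than [0, 0]': any value above zero."""
    return any(value > 0 for value in ref_alt_values)


def is_de_novo(values_by_sample, proband, ancestor1, ancestor2):
    """True if the variant occurs in the proband and in neither ancestor."""
    return (
        occurs(values_by_sample[proband])
        and not occurs(values_by_sample[ancestor1])
        and not occurs(values_by_sample[ancestor2])
    )


# Made-up [REF, ALT] values per sample for a single variant.
only_in_child = {"child": [12, 9], "mother": [0, 0], "father": [0, 0]}
also_in_mother = {"child": [12, 9], "mother": [3, 1], "father": [0, 0]}

result_only_in_child = is_de_novo(only_in_child, "child", "mother", "father")
result_also_in_mother = is_de_novo(also_in_mother, "child", "mother", "father")
```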

VAPr.filtering.make_deleterious_compound_heterozygous_variants_filter(sample_ids_list=None)[source]¶

VAPr.filtering.make_known_disease_variants_filter(sample_ids_list=None)[source]¶: Function for retrieving known disease variants by presence in ClinVar and COSMIC.

VAPr.filtering.make_rare_deleterious_variants_filter(sample_ids_list=None)[source]¶: Function for retrieving rare, deleterious variants.

VAPr.validation module¶

This module exposes utility functions to validate user inputs

By convention, validation functions in this module raise an appropriate Error if validation is unsuccessful. If it is successful, they return either nothing or the appropriately converted input value.

VAPr.validation.convert_to_nonneg_int(input_val, nullable=False)[source]¶

For non-null input_val, cast to a non-negative integer and return result; for null input_val, return None.

Parameters:

input_val (Any) – The value to attempt to convert to either a non-negative integer or a None (if nullable). The recognized null values are ‘.’, None, ‘’, and ‘NULL’.
nullable (Optional[bool]) – True if the input value may be null, false otherwise. Defaults to False.

Returns:	None if nullable=True and the input is a null value. The appropriately cast non-negative integer if input is not null and the cast is successful.
Raises:	`ValueError` – if the input cannot be successfully converted to a non-negative integer or, if allowed, None

VAPr.validation.convert_to_nullable(input_val, cast_function)[source]¶

For non-null input_val, apply cast_function and return result if successful; for null input_val, return None.

Parameters:

input_val (Any) – The value to attempt to convert to either a None or the type specified by cast_function. The recognized null values are ‘.’, None, ‘’, and ‘NULL’.
cast_function (Callable[[Any], Any]) – A function to cast the input_val to some specified type; should raise an error if this cast fails.

Returns:	None if input is the null value. An appropriately cast value if input is not null and the cast is successful.
Raises:	`Error` – whatever error is provided by cast_function if the cast fails.
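The contract of the two validation functions can be sketched in a few lines. This is a re-implementation written from the descriptions above, not the library's source. In particular, the conversion of a TypeError into a ValueError is an assumption made so that the sketch matches the documented Raises entry.

```python
NULL_VALUES = (".", None, "", "NULL")  # the recognized null values listed above


def convert_to_nullable(input_val, cast_function):
    """Return None for a null input, otherwise whatever cast_function returns."""
    if input_val in NULL_VALUES:
        return None
    return cast_function(input_val)  # any error from the cast propagates


def convert_to_nonneg_int(input_val, nullable=False):
    """Return a non-negative int, or None for a null input when nullable."""
    if nullable and input_val in NULL_VALUES:
        return None
    try:
        result = int(input_val)
    except (TypeError, ValueError):
        # Assumption: report every failed cast as a ValueError.
        raise ValueError(f"{input_val!r} is not a non-negative integer")
    if result < 0:
        raise ValueError(f"{input_val!r} is negative")
    return result


def raises_value_error(func, *args, **kwargs):
    """Small helper for the examples below."""
    try:
        func(*args, **kwargs)
    except ValueError:
        return True
    return False


depth = convert_to_nonneg_int("34")
missing_depth = convert_to_nonneg_int(".", nullable=True)
score = convert_to_nullable("2.5", float)
missing_score = convert_to_nullable("NULL", float)
```

Note that a null value is only accepted when nullable=True; with the default, ‘.’ is treated like any other string that cannot be cast.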

VAPr.vapr_core module¶

class VAPr.vapr_core.VaprAnnotator(input_dir, output_dir, mongo_db_name, mongo_collection_name, annovar_install_path=None, design_file=None, build_ver=None, vcfs_gzipped=False)[source]¶

Bases: object

Class in charge of gathering requirements, finding files, and downloading the databases required to run the annotation.

Parameters:

input_dir (str) – Input directory containing the vcf files
output_dir (str) – Output directory for the annotated vcf files
mongo_db_name (str) – Name of the database in which the collection of variants will be stored
mongo_collection_name (str) – Name of the collection in which the annotated variants will be stored
annovar_install_path (str) – Path to locally installed annovar scripts
design_file (str) – Path to csv design file
build_ver (str) – Genome build version against which annotation will be done. Either hg19 or hg38
vcfs_gzipped (bool) – If the vcf files are gzipped, set to True

DEFAULT_GENOME_VERSION = 'hg19'¶

HG19_VERSION = 'hg19'¶

HG38_VERSION = 'hg38'¶

SAMPLE_NAMES_KEY = 'Sample_Names'¶

SUPPORTED_GENOME_BUILD_VERSIONS = ['hg19', 'hg38']¶

annotate(num_processes=4, chunk_size=2000, verbose_level=1, allow_adds=False)[source]¶

This is the main function of the package. It will run Annovar beforehand, and will kick-start the full annotation functionality. Namely, it will collect all the variant data from Annovar annotations, combine it with data coming from MyVariant.info, and parse it to MongoDB, in the database and collection specified in project_data.

It returns a VaprDataset object, which can then be used for downstream filtering and analysis.

Parameters:

num_processes (int, optional) – Number of parallel processes. Defaults to 4
chunk_size (int, optional) – Number of variants to be processed at once. Defaults to 2000
verbose_level (int, optional) – Higher verbosity will give more feedback; raise to 2 or 3 when debugging. Defaults to 1
allow_adds (bool, optional) – Allow adding new variants to a pre-existing Mongo collection, or overwrite it (Default value = False)

Returns:	VaprDataset for downstream filtering and analysis
Return type:	VAPr.vapr_core.VaprDataset

annotate_lite(num_processes=8, chunk_size=2000, verbose_level=1, allow_adds=False)[source]¶

‘Lite’ Annotation: it will query MyVariant.info only, without generating annotations from Annovar. It requires only VAPr to be installed. The execution will grab the HGVS ids from the vcf files and query the variant data from MyVariant.info. The trade-off is the inability to run native VAPr queries on the data.

It returns a VaprDataset object, which can then be used for downstream filtering and analysis.

Parameters:

num_processes (int, optional) – Number of parallel processes. Defaults to 8
chunk_size (int, optional) – Number of variants to be processed at once. Defaults to 2000
verbose_level (int, optional) – Higher verbosity will give more feedback; raise to 2 or 3 when debugging. Defaults to 1
allow_adds (bool, optional) – Allow adding new variants to a pre-existing Mongo collection, or overwrite it (Default value = False)

Returns:	VaprDataset for downstream filtering and analysis
Return type:	VAPr.vapr_core.VaprDataset

download_annovar_databases()[source]¶

Download the databases that ANNOVAR requires in order to run.

class VAPr.vapr_core.VaprDataset(mongo_db_name, mongo_collection_name, merged_vcf_path=None)[source]¶

Bases: object

full_name¶

Full name of database and collection

Returns:	Full name of database and collection
Return type:	str

get_all_variants()[source]¶

Return all variants stored in the collection.

Returns:	list of variants
Return type:	list

get_custom_filtered_variants(filter_dictionary)[source]¶

See Create your own filter for more information on how to implement one

Parameters:	filter_dictionary (dict) – MongoDB custom filter
Returns:	list of variants
Return type:	list

get_de_novo_variants(proband, ancestor1, ancestor2)[source]¶

See 4. De novo Variants for more information on how this is implemented

Parameters:

proband (str) – proband variant
ancestor1 (str) – ancestor #1 variant
ancestor2 (str) – ancestor #2 variant

Returns:	list of variants
Return type:	list

get_deleterious_compound_heterozygous_variants(sample_names_list=None)[source]¶

See 3. Deleterious Compound Heterozygous Variants for more information on how this is implemented

Parameters:	sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns:	list of variants
Return type:	list

get_distinct_sample_ids()[source]¶

Return the distinct sample ids present in the collection.

Returns:	list of sample ids
Return type:	list

get_known_disease_variants(sample_names_list=None)[source]¶

See 2. Known Disease Variants for more information on how this is implemented

Parameters:	sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns:	list of variants
Return type:	list

get_rare_deleterious_variants(sample_names_list=None)[source]¶

See 1. Rare Deleterious Variants for more information on how this is implemented

Parameters:	sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns:	list of variants
Return type:	list

get_variants_as_dataframe(filtered_variants=None)[source]¶

Utility to get a dataframe from variants, either all of them or a filtered subset

Parameters:	filtered_variants – a list of variants (Default value = None)
Returns:	pandas.DataFrame

get_variants_for_sample(sample_name)[source]¶

Return variants for a specific sample

Parameters:	sample_name (str) – name of sample
Returns:	list of variants
Return type:	list

get_variants_for_samples(specific_sample_names)[source]¶

Return variants from multiple samples

Parameters:	specific_sample_names (list) – names of samples
Returns:	list of variants
Return type:	list

is_empty¶

If there are no records in the collection, returns True

Returns:	if there are no records in the collection, returns True
Return type:	bool

num_records¶

Number of records in MongoDB collection

Returns:	Number of records in MongoDB collection
Return type:	int

write_filtered_annotated_csv(filtered_variants, output_fp)[source]¶

Filtered csv file containing annotations from a list passed to it, coming from MongoDB

Parameters:

filtered_variants (list) – variants coming from MongoDB
output_fp (str) – Output file path

Returns:	None

write_filtered_annotated_vcf(filtered_variants, vcf_output_path, info_out=True)[source]¶

Parameters:

filtered_variants (list) – variants coming from MongoDB
vcf_output_path (str) – Output file path
info_out (bool) – if True, extra annotation information will be written to the vcf file (Default value = True)

Returns:	None

write_unfiltered_annotated_csv(output_fp)[source]¶

Full csv file containing annotations from both annovar and myvariant.info

Parameters:	output_fp (str) – Output file path
Returns:	None

write_unfiltered_annotated_csvs_per_sample(output_dir)[source]¶

Parameters:	output_dir – Output directory path
Returns:	None

write_unfiltered_annotated_vcf(vcf_output_path, info_out=True)[source]¶

Full vcf file containing the annotations stored in MongoDB

Parameters:

vcf_output_path (str) – Output file path
info_out (bool) – if True, extra annotation information will be written to the vcf file (Default value = True)

Returns:	None

VAPr.vcf_genotype_fields_parsing module¶

class VAPr.vcf_genotype_fields_parsing.Allele(unfiltered_read_counts=None)[source]¶

Bases: object

Store unfiltered read counts, if any, for a particular allele.

unfiltered_read_counts¶: int or None – Number of unfiltered reads for this sample at this site, from the AD field.

class VAPr.vcf_genotype_fields_parsing.GenotypeLikelihood(allele1_number, allele2_number, likelihood_neg_exponent)[source]¶

Bases: object

Store parsed info from VCF genotype likelihood field for a single sample.

allele1_number¶: int – The allele identifier for the left-hand allele inferred for this genotype likelihood.

allele2_number¶: int – The allele identifier for the right-hand allele inferred for this genotype likelihood.

likelihood_neg_exponent¶: float – The “normalized” Phred-scaled likelihood of the genotype represented by allele1 and allele2.
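The VCF specification orders the values of the PL field so that, for a diploid genotype j/k with j ≤ k, the value sits at index k(k+1)/2 + j. The sketch below uses that ordering to pair each PL value with its two allele numbers. The namedtuple is a local stand-in that borrows the attribute names documented above, and pair_likelihoods_with_alleles is a hypothetical helper, so this shows the ordering rule rather than VAPr's parsing code.

```python
from collections import namedtuple

# Local stand-in with the attribute names documented above.
GenotypeLikelihood = namedtuple(
    "GenotypeLikelihood",
    ["allele1_number", "allele2_number", "likelihood_neg_exponent"],
)


def pair_likelihoods_with_alleles(pl_string):
    """Pair each diploid PL value with its (allele1, allele2) numbers."""
    likelihoods = []
    allele1, allele2 = 0, 0
    for raw_value in pl_string.split(","):
        likelihoods.append(
            GenotypeLikelihood(allele1, allele2, float(raw_value))
        )
        # Walk the VCF order: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ...
        if allele1 == allele2:
            allele1, allele2 = 0, allele2 + 1
        else:
            allele1 += 1
    return likelihoods


biallelic = pair_likelihoods_with_alleles("255,0,255")
triallelic = pair_likelihoods_with_alleles("10,20,30,40,50,60")
```

For the string '255,0,255', the value 0 lands on the 0/1 pair, which marks the heterozygous genotype as the most likely one.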

class VAPr.vcf_genotype_fields_parsing.VCFGenotypeInfo(raw_string)[source]¶

Bases: object

Store parsed info from VCF genotype fields for a single sample.

_raw_string¶: str – The genotype fields values string from a VCF file (e.g., ‘0/1:173,141:282:99:255,0,255’).

genotype¶: Optional[str] – The type of each of the sample’s two alleles, such as 0/0, 0/1, etc.

alleles¶: List[Allele] – One Allele object for each allele detected for this variant (this can be across samples, so there can be more than 2 alleles).

genotype_likelihoods¶: List[GenotypeLikelihood] – The GenotypeLikelihood object for each allele.

unprocessed_info¶: Dict[str, Any] – Dictionary of field tag and value(s) for any fields not stored in dedicated attributes of VCFGenotypeInfo. Values are parsed to lists and/or floats if possible.

genotype_subclass_by_class¶: Dict[str, str] – Genotype subclass (reference, alt, compound) keyed by genotype class (homozygous/heterozygous).

filter_passing_reads_count¶: int or None – Filtered depth of coverage of this sample at this site from the DP field.

genotype_confidence¶: str – Genotype quality (confidence) of this sample at this site, from the GQ field.

class VAPr.vcf_genotype_fields_parsing.VCFGenotypeParser[source]¶

Bases: object

Mine format string and genotype fields string to create a filled VCFGenotypeInfo object.

FILTERED_ALLELE_DEPTH_TAG = 'DP'¶

GENOTYPE_QUALITY_TAG = 'GQ'¶

GENOTYPE_TAG = 'GT'¶

NORMALIZED_SCALED_LIKELIHOODS_TAG = 'PL'¶

UNFILTERED_ALLELE_DEPTH_TAG = 'AD'¶

static is_valid_genotype_fields_string(genotype_fields_string)[source]¶

Return true if input has any real genotype fields content, false if it is just periods, zeroes, and delimiters.

Parameters:	genotype_fields_string (str) – A VCF-style genotype fields string, such as 1/1:0,2:2:6:89,6,0 or ./.:.:.:.:.

Returns:	true if input has any real genotype fields content, false if it is just periods, zeroes, and delimiters.
Return type:	bool
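The validity check described above can be approximated with a single regular expression. This sketch assumes that the delimiters are '/', '|', ':' and ','. The character set VAPr actually uses is not stated here, and has_real_genotype_content is a hypothetical name.

```python
import re

# Strings made up only of periods, zeroes, and the assumed delimiters.
_NO_CONTENT_PATTERN = re.compile(r"[.0/|:,]*")


def has_real_genotype_content(genotype_fields_string):
    """True if the string contains anything besides '.', '0', and delimiters."""
    return _NO_CONTENT_PATTERN.fullmatch(genotype_fields_string) is None


called = has_real_genotype_content("1/1:0,2:2:6:89,6,0")
uncalled = has_real_genotype_content("./.:.:.:.:.")
all_zero = has_real_genotype_content("0/0:0,0:0:0:0,0,0")
```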

classmethod parse(format_key_string, format_value_string)[source]¶

Parse the input format string and genotype fields string into a filled VCFGenotypeInfo object.

Parameters:

format_key_string (str) – The VCF format string (e.g., ‘GT:AD:DP:GQ:PL’) for this sample at this site.
format_value_string (str) – The VCF genotype fields values string (e.g., ‘1/1:0,34:34:99:1187.2,101,0’) corresponding to the format_key_string for this sample at this site.

Returns:

A filled VCFGenotypeInfo for this sample at this site unless an error was encountered, in which case None is returned.

Return type:

VCFGenotypeInfo or None
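The first step of such a parse is to pair each tag in the format string with its value in the genotype fields string. The following is a minimal sketch of that pairing only, using the example strings from the parameter descriptions. split_genotype_fields is a hypothetical helper, and returning None on a length mismatch is a simplifying assumption, not a statement about how VAPr handles malformed input.

```python
def split_genotype_fields(format_key_string, format_value_string):
    """Pair each FORMAT tag with its raw value, or return None on mismatch."""
    keys = format_key_string.split(":")
    values = format_value_string.split(":")
    if len(keys) != len(values):
        return None  # simplifying assumption for this sketch
    return dict(zip(keys, values))


fields = split_genotype_fields("GT:AD:DP:GQ:PL", "1/1:0,34:34:99:1187.2,101,0")

# Multi-valued fields such as AD and PL are comma-separated.
allele_depths = [int(value) for value in fields["AD"].split(",")]
likelihoods = [float(value) for value in fields["PL"].split(",")]

mismatched = split_genotype_fields("GT:AD:DP", "1/1:0,34")
```

From here, each tag can be routed to a dedicated attribute (GT, AD, DP, GQ, PL) or kept as unprocessed information, mirroring the attributes listed for VCFGenotypeInfo.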

VAPr.vcf_merging module¶

VAPr.vcf_merging.bgzip_and_index_vcf(vcf_path)[source]¶: bgzip and index each vcf so it can be merged with bcftools.

VAPr.vcf_merging.merge_vcfs(input_dir, output_dir, project_name, raw_vcf_path_list=None, vcfs_gzipped=False)[source]¶: Merge vcf files into single multisample vcf, bgzip and index merged vcf file.
