VAPr package

Submodules

VAPr.annovar_output_parsing module

class VAPr.annovar_output_parsing.AnnovarAnnotatedVariant[source]

Bases: object

ALLELE_DEPTH_KEY = 'AD'
FILTER_PASSING_READS_COUNT_KEY = 'filter_passing_reads_count'
GENOTYPE_KEY = 'genotype'
GENOTYPE_LIKELIHOODS_KEY = 'genotype_likelihoods'
GENOTYPE_SUBCLASS_BY_CLASS_KEY = 'genotype_subclass_by_class'
HGVS_ID_KEY = 'hgvs_id'
SAMPLES_KEY = 'samples'
SAMPLE_ID_KEY = 'sample_id'
classmethod make_per_variant_annotation_dict(fields_by_annovar_header, hgvs_id, format_string, genotype_field_strings_by_sample_name)[source]
class VAPr.annovar_output_parsing.AnnovarTxtParser[source]

Bases: object

Class that processes an Annovar-created tab-delimited text file.

ALT_HEADER = 'alt'
CHR_HEADER = 'chr'
CYTOBAND_HEADER = 'cytoband'
END_HEADER = 'end'
ESP6500_ALL_HEADER = 'esp6500siv2_all'
EXONICFUNC_KNOWNGENE_HEADER = 'exonicfunc_knowngene'
FUNC_KNOWNGENE_HEADER = 'func_knowngene'
GENEDETAIL_KNOWNGENE_HEADER = 'genedetail_knowngene'
GENE_KNOWNGENE_HEADER = 'gene_knowngene'
GENOMIC_SUPERDUPS_HEADER = 'genomicsuperdups'
NCI60_HEADER = 'nci60'
OTHERINFO_HEADER = 'otherinfo'
RAW_CHR_MT_SUFFIX_VAL = 'M'
RAW_CHR_MT_VAL = 'chrM'
REF_HEADER = 'ref'
SCORE_KEY = 'Score'
STANDARDIZED_CHR_MT_SUFFIX_VAL = 'MT'
STANDARDIZED_CHR_MT_VAL = 'chrMT'
START_HEADER = 'start'
TFBS_CONS_SITES_HEADER = 'tfbsconssites'
THOU_G_2015_ALL_HEADER = '1000g2015aug_all'
classmethod read_chunk_of_annotations_to_dicts_list(annovar_txt_file_like_obj, sample_names_list, chunk_index, chunk_size)[source]
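The chunk_index and chunk_size parameters suggest that the file is consumed in fixed-size slices of rows. The following is a minimal sketch of that idea, not VAPr's implementation: read_chunk is a hypothetical helper, and the assumption that chunks are counted from zero in rows after the header line is not confirmed by this page.

```python
import csv
import io
from itertools import islice


def read_chunk(txt_file_like_obj, chunk_index, chunk_size):
    """Hypothetical helper: return one chunk of rows as dicts keyed by header."""
    reader = csv.DictReader(txt_file_like_obj, delimiter="\t")
    start = chunk_index * chunk_size  # assumption: chunks are zero-based
    return list(islice(reader, start, start + chunk_size))


# Toy tab-delimited input using a few of the headers listed above.
demo = io.StringIO(
    "chr\tstart\tend\tref\talt\n"
    "chr1\t100\t100\tA\tG\n"
    "chr1\t200\t200\tC\tT\n"
    "chrM\t300\t300\tG\tA\n"
)
second_chunk = read_chunk(demo, chunk_index=1, chunk_size=2)
```

With a chunk size of 2, the second chunk of this toy file holds only the final row.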

VAPr.annovar_running module

class VAPr.annovar_running.AnnovarWrapper(annovar_install_path, genome_build_version, custom_annovar_dbs_to_use=None)[source]

Bases: object

Wrapper around ANNOVAR download and annotation functions

download_databases()[source]
hg_19_databases = {'1000g2015aug': 'f', 'knownGene': 'g'}
hg_38_databases = {'1000g2015aug': 'f', 'knownGene': 'g'}
run_annotation(single_vcf_path, output_basename, output_dir)[source]

VAPr.chunk_processing module

class VAPr.chunk_processing.AnnotationJobParamsIndices[source]
CHUNK_INDEX_INDEX = 0
CHUNK_SIZE_INDEX = 2
COLLECTION_NAME_INDEX = 4
DB_NAME_INDEX = 3
FILE_PATH_INDEX = 1
GENOME_BUILD_VERSION_INDEX = 5
SAMPLE_LIST_INDEX = 7
VERBOSE_LEVEL_INDEX = 6
classmethod get_num_possible_indices()[source]
VAPr.chunk_processing.collect_chunk_annotations_and_store(job_params_tuple)[source]
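The index constants above describe where each job parameter sits inside job_params_tuple. A minimal sketch of building such a tuple follows; make_job_params_tuple is a hypothetical helper, the constants are copied from the listing above, and the value 8 for the number of indices is an assumption based on counting them.

```python
# Index constants copied from the AnnotationJobParamsIndices listing.
CHUNK_INDEX_INDEX = 0
FILE_PATH_INDEX = 1
CHUNK_SIZE_INDEX = 2
DB_NAME_INDEX = 3
COLLECTION_NAME_INDEX = 4
GENOME_BUILD_VERSION_INDEX = 5
VERBOSE_LEVEL_INDEX = 6
SAMPLE_LIST_INDEX = 7
NUM_INDICES = 8  # assumption: one slot per constant above


def make_job_params_tuple(chunk_index, file_path, chunk_size, db_name,
                          collection_name, genome_build_version,
                          verbose_level, sample_names_list):
    """Hypothetical helper: place each parameter at its documented index."""
    params = [None] * NUM_INDICES
    params[CHUNK_INDEX_INDEX] = chunk_index
    params[FILE_PATH_INDEX] = file_path
    params[CHUNK_SIZE_INDEX] = chunk_size
    params[DB_NAME_INDEX] = db_name
    params[COLLECTION_NAME_INDEX] = collection_name
    params[GENOME_BUILD_VERSION_INDEX] = genome_build_version
    params[VERBOSE_LEVEL_INDEX] = verbose_level
    params[SAMPLE_LIST_INDEX] = sample_names_list
    return tuple(params)


# All argument values here are made up for illustration.
job = make_job_params_tuple(0, "merged.txt", 2000, "my_db", "my_collection",
                            "hg19", 1, ["sample_a", "sample_b"])
```

Packing the parameters into one tuple is a common way to hand a single argument to each worker in a process pool.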

VAPr.filtering module

VAPr.filtering.get_any_of_sample_ids_filter(sample_names_list)[source]
VAPr.filtering.get_sample_id_filter(sample_name)[source]
VAPr.filtering.make_de_novo_variants_filter(proband, ancestor1, ancestor2)[source]

Function for de novo variant analysis. It can be performed on multisample files or on data coming from a collection of files. In the former case, every sample contains the same variants, although they differ in their allele frequency and read values. A de novo variant is defined as a variant that occurs only in the specified sample (proband) and not in the other two (ancestor1, ancestor2). Occurrence is defined as having allele frequencies greater than [0, 0] ([REF, ALT]).

VAPr.filtering.make_deleterious_compound_heterozygous_variants_filter(sample_ids_list=None)[source]
VAPr.filtering.make_known_disease_variants_filter(sample_ids_list=None)[source]

Function for retrieving known disease variants by presence in Clinvar and Cosmic.

VAPr.filtering.make_rare_deleterious_variants_filter(sample_ids_list=None)[source]

Function for retrieving rare, deleterious variants

VAPr.validation module

This module exposes utility functions to validate user inputs

By convention, validation functions in this module raise an appropriate Error if validation is unsuccessful. If it is successful, they return either nothing or the appropriately converted input value.

VAPr.validation.convert_to_nonneg_int(input_val, nullable=False)[source]

For non-null input_val, cast to a non-negative integer and return result; for null input_val, return None.

Parameters:
  • input_val (Any) – The value to attempt to convert to either a non-negative integer or a None (if nullable). The recognized null values are ‘.’, None, ‘’, and ‘NULL’
  • nullable (Optional[bool]) – True if the input value may be null, false otherwise. Defaults to False.
Returns:

None if nullable=True and the input is a null value. The appropriately cast non-negative integer if input is not null and the cast is successful.

Raises:

ValueError – if the input cannot be successfully converted to a non-negative integer or, if allowed, None
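The contract described above can be illustrated with a small stand-in. This is a sketch of the documented behavior, not VAPr's code: to_nonneg_int is a hypothetical name, and how fractional inputs are treated is not stated on this page (the sketch simply lets int() decide).

```python
NULL_VALUES = (".", None, "", "NULL")  # null values listed in the documentation


def to_nonneg_int(input_val, nullable=False):
    """Hypothetical stand-in for the documented conversion contract."""
    if nullable and input_val in NULL_VALUES:
        return None
    try:
        result = int(input_val)
    except (TypeError, ValueError):
        # Report every failed cast as ValueError, as the documentation states.
        raise ValueError("cannot convert {!r} to an integer".format(input_val))
    if result < 0:
        raise ValueError("{!r} is negative".format(input_val))
    return result
```

For example, to_nonneg_int("12") gives 12, to_nonneg_int(".", nullable=True) gives None, and to_nonneg_int("-3") raises ValueError.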

VAPr.validation.convert_to_nullable(input_val, cast_function)[source]

For non-null input_val, apply cast_function and return result if successful; for null input_val, return None.

Parameters:
  • input_val (Any) – The value to attempt to convert to either a None or the type specified by cast_function. The recognized null values are ‘.’, None, ‘’, and ‘NULL’
  • cast_function (Callable[[Any], Any]) – A function to cast the input_val to some specified type; should raise an error if this cast fails.
Returns:

None if input is the null value. An appropriately cast value if input is not null and the cast is successful.

Raises:

Error – whatever error is provided by cast_function if the cast fails.
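A minimal sketch of the behavior described above, assuming only what this page states; to_nullable is a hypothetical name and this is not VAPr's implementation.

```python
NULL_VALUES = (".", None, "", "NULL")  # null values listed in the documentation


def to_nullable(input_val, cast_function):
    """Hypothetical stand-in: None for null inputs, otherwise the cast value."""
    if input_val in NULL_VALUES:
        return None
    # Any error raised by cast_function propagates to the caller unchanged.
    return cast_function(input_val)


score = to_nullable("0.25", float)
missing = to_nullable("NULL", float)
```

Passing the cast as a function keeps the null handling in one place while letting each field choose its own target type.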

VAPr.vapr_core module

class VAPr.vapr_core.VaprAnnotator(input_dir, output_dir, mongo_db_name, mongo_collection_name, annovar_install_path=None, design_file=None, build_ver=None, vcfs_gzipped=False)[source]

Bases: object

Class in charge of gathering requirements, finding files, and downloading the databases required to run the annotation

Parameters:
  • input_dir (str) – Directory containing the input vcf files
  • output_dir (str) – Directory in which to write the annotated vcf files
  • mongo_db_name (str) – Name of the database in which to store the collection of variants
  • mongo_collection_name (str) – Name of the collection in which to store the annotated variants
  • annovar_install_path (str) – Path to the locally installed annovar scripts
  • design_file (str) – Path to the csv design file
  • build_ver (str) – Genome build version against which annotation is performed. Either hg19 or hg38
  • vcfs_gzipped (bool) – Set to True if the vcf files are gzipped

DEFAULT_GENOME_VERSION = 'hg19'
HG19_VERSION = 'hg19'
HG38_VERSION = 'hg38'
SAMPLE_NAMES_KEY = 'Sample_Names'
SUPPORTED_GENOME_BUILD_VERSIONS = ['hg19', 'hg38']
annotate(num_processes=4, chunk_size=2000, verbose_level=1, allow_adds=False)[source]

This is the main function of the package. It runs Annovar first and then starts the full annotation: it collects the variant data from the Annovar annotations, combines it with data from MyVariant.info, and stores the result in MongoDB, in the database and collection given when the VaprAnnotator was created.

It returns a VaprDataset object, which can then be used for downstream filtering and analysis.

Parameters:
  • num_processes (int, optional) – Number of parallel processes. Defaults to 4
  • chunk_size (int, optional) – Number of variants to be processed at once. Defaults to 2000
  • verbose_level (int, optional) – Higher verbosity gives more feedback; raise to 2 or 3 when debugging. Defaults to 1
  • allow_adds (bool, optional) – If True, add new variants to a pre-existing Mongo collection; if False, overwrite it. Defaults to False
Returns:

VaprDataset object

Return type:

VAPr.vapr_core.VaprDataset

annotate_lite(num_processes=8, chunk_size=2000, verbose_level=1, allow_adds=False)[source]

‘Lite’ annotation: queries MyVariant.info only, without generating annotations from Annovar. It requires only VAPr to be installed. It extracts the HGVS ids from the vcf files and queries MyVariant.info for the variant data.

Limitation: native VAPr queries cannot be run on the resulting data.

It returns a VaprDataset object, which can then be used for downstream filtering and analysis.

Parameters:
  • num_processes (int, optional) – Number of parallel processes. Defaults to 8
  • chunk_size (int, optional) – Number of variants to be processed at once. Defaults to 2000
  • verbose_level (int, optional) – Higher verbosity gives more feedback; raise to 2 or 3 when debugging. Defaults to 1
  • allow_adds (bool, optional) – If True, add new variants to a pre-existing Mongo collection; if False, overwrite it. Defaults to False
Returns:

VaprDataset object

Return type:

VAPr.vapr_core.VaprDataset

download_annovar_databases()[source]

Download the databases that ANNOVAR requires in order to run.

class VAPr.vapr_core.VaprDataset(mongo_db_name, mongo_collection_name, merged_vcf_path=None)[source]

Bases: object

full_name

Full name of database and collection

Returns:Full name of database and collection
Return type:str
get_all_variants()[source]

Return all variants in the collection

Returns:list of variants
Return type:list
get_custom_filtered_variants(filter_dictionary)[source]

See Create your own filter for more information on how to implement

Parameters:filter_dictionary (dict) – MongoDB custom filter
Returns:list of variants
Return type:list
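As an illustration of what a custom filter dictionary might look like, here is a sketch that matches variants seen in any of several samples. The key names come from the SAMPLES_KEY and SAMPLE_ID_KEY constants listed earlier on this page; the assumption that each stored variant nests its per-sample records under "samples" is not confirmed here, and make_any_of_samples_filter is a hypothetical helper. The "$in" operator is standard MongoDB query syntax.

```python
SAMPLES_KEY = "samples"      # from AnnovarAnnotatedVariant.SAMPLES_KEY
SAMPLE_ID_KEY = "sample_id"  # from AnnovarAnnotatedVariant.SAMPLE_ID_KEY


def make_any_of_samples_filter(sample_names_list):
    """Hypothetical filter: variants recorded for any of the named samples."""
    # Assumption: documents look like {"samples": [{"sample_id": ...}, ...]}.
    field_path = "{}.{}".format(SAMPLES_KEY, SAMPLE_ID_KEY)
    return {field_path: {"$in": list(sample_names_list)}}


custom_filter = make_any_of_samples_filter(["sample_a", "sample_b"])
```

A dictionary built this way could then be passed as filter_dictionary, provided the stored documents really have that shape.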
get_de_novo_variants(proband, ancestor1, ancestor2)[source]

See 4. De novo Variants for more information on how this is implemented

Parameters:
  • proband (str) – proband variant
  • ancestor1 (str) – ancestor #1 variant
  • ancestor2 (str) – ancestor #2 variant
Returns:

list of variants

Return type:

list

get_deleterious_compound_heterozygous_variants(sample_names_list=None)[source]

See 3. Deleterious Compound Heterozygous Variants for more information on how this is implemented

Parameters:sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns:list of variants
Return type:list
get_distinct_sample_ids()[source]

Return the distinct sample ids in the collection

Returns:list of sample ids
Return type:list
get_known_disease_variants(sample_names_list=None)[source]

See 2. Known Disease Variants for more information on how this is implemented

Parameters:sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns:list of variants
Return type:list
get_rare_deleterious_variants(sample_names_list=None)[source]

See 1. Rare Deleterious Variants for more information on how this is implemented

Parameters:sample_names_list (list, optional) – list of samples to draw variants from (Default value = None)
Returns:list of variants
Return type:list
get_variants_as_dataframe(filtered_variants=None)[source]

Utility to get a dataframe from variants, either all of them or a filtered subset

Parameters:filtered_variants – a list of variants (Default value = None)
Returns:pandas.DataFrame
get_variants_for_sample(sample_name)[source]

Return variants for a specific sample

Parameters:sample_name (str) – name of sample
Returns:list of variants
Return type:list
get_variants_for_samples(specific_sample_names)[source]

Return variants from multiple samples

Parameters:specific_sample_names (list) – names of samples
Returns:list of variants
Return type:list
is_empty

If there are no records in the collection, returns True

Returns:if there are no records in the collection, returns True
Return type:bool
num_records

Number of records in MongoDB collection

Returns:Number of records in MongoDB collection
Return type:int
write_filtered_annotated_csv(filtered_variants, output_fp)[source]

Write a csv file containing the annotations for a list of filtered variants retrieved from MongoDB

Parameters:
  • filtered_variants (list) – variants coming from MongoDB
  • output_fp (str) – Output file path
Returns:

None

write_filtered_annotated_vcf(filtered_variants, vcf_output_path, info_out=True)[source]
Parameters:
  • filtered_variants (list) – variants coming from MongoDB
  • vcf_output_path (str) – Output file path
  • info_out (bool) – if True, extra annotation information will be written to the vcf file (Default value = True)
Returns:

None

write_unfiltered_annotated_csv(output_fp)[source]

Write a full csv file containing annotations from both annovar and myvariant.info

Parameters:output_fp (str) – Output file path
Returns:None
write_unfiltered_annotated_csvs_per_sample(output_dir)[source]
Parameters:output_dir – Output directory
Returns:None
write_unfiltered_annotated_vcf(vcf_output_path, info_out=True)[source]

Write a full vcf file containing annotations for all variants in the MongoDB collection

Parameters:
  • vcf_output_path (str) – Output file path
  • info_out (bool) – if True, extra annotation information will be written to the vcf file (Default value = True)
Returns:

None

VAPr.vcf_genotype_fields_parsing module

class VAPr.vcf_genotype_fields_parsing.Allele(unfiltered_read_counts=None)[source]

Bases: object

Store unfiltered read counts, if any, for a particular allele.

unfiltered_read_counts

int or None – Number of unfiltered reads for this sample at this site, from the AD field.

class VAPr.vcf_genotype_fields_parsing.GenotypeLikelihood(allele1_number, allele2_number, likelihood_neg_exponent)[source]

Bases: object

Store parsed info from VCF genotype likelihood field for a single sample.

allele1_number

int – The allele identifier for the left-hand allele inferred for this genotype likelihood.

allele2_number

int – The allele identifier for the right-hand allele inferred for this genotype likelihood.

likelihood_neg_exponent

float – The “normalized” Phred-scaled likelihood of the genotype represented by allele1 and allele2.
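To make the three attributes concrete: the VCF specification orders the PL values of a diploid sample so that genotype j/k (with j <= k) sits at index k*(k+1)/2 + j. The sketch below pairs each value with its allele numbers under that ordering. It is an illustration only; label_likelihoods is a hypothetical name, and whether VAPr derives allele numbers this way is an assumption.

```python
from itertools import count


def label_likelihoods(pl_values):
    """Pair each PL value with (allele1_number, allele2_number)."""
    values = iter(pl_values)
    labeled = []
    # VCF ordering: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ...
    for allele2 in count():
        for allele1 in range(allele2 + 1):
            try:
                labeled.append((allele1, allele2, next(values)))
            except StopIteration:
                return labeled


# PL values from the example genotype string '0/1:173,141:282:99:255,0,255'.
labeled = label_likelihoods([255.0, 0.0, 255.0])
```

Here the value 0.0 lands on genotype 0/1, which agrees with the called genotype in that example string, since the most likely genotype has the lowest PL.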

class VAPr.vcf_genotype_fields_parsing.VCFGenotypeInfo(raw_string)[source]

Bases: object

Store parsed info from VCF genotype fields for a single sample.

_raw_string

str – The genotype fields values string from a VCF file (e.g., ‘0/1:173,141:282:99:255,0,255’).

genotype

Optional[str] – The type of each of the sample’s two alleles, such as 0/0, 0/1, etc.

alleles

List[Allele] – One Allele object for each allele detected for this variant (this can be across samples, so there can be more than 2 alleles).

genotype_likelihoods

List[GenotypeLikelihood] – The GenotypeLikelihood object for each allele.

unprocessed_info

Dict[str, Any] – Dictionary of field tag and value(s) for any fields not stored in dedicated attributes of VCFGenotypeInfo. Values are parsed to lists and/or floats if possible.

genotype_subclass_by_class

Dict[str, str] – Genotype subclass (reference, alt, compound) keyed by genotype class (homozygous/heterozygous).

filter_passing_reads_count

int or None – Filtered depth of coverage of this sample at this site from the DP field.

genotype_confidence

str – Genotype quality (confidence) of this sample at this site, from the GQ field.

class VAPr.vcf_genotype_fields_parsing.VCFGenotypeParser[source]

Bases: object

Mine format string and genotype fields string to create a filled VCFGenotypeInfo object.

FILTERED_ALLELE_DEPTH_TAG = 'DP'
GENOTYPE_QUALITY_TAG = 'GQ'
GENOTYPE_TAG = 'GT'
NORMALIZED_SCALED_LIKELIHOODS_TAG = 'PL'
UNFILTERED_ALLELE_DEPTH_TAG = 'AD'
static is_valid_genotype_fields_string(genotype_fields_string)[source]

Return true if input has any real genotype fields content, false if is just periods, zeroes, and delimiters.

Parameters:genotype_fields_string (str) – A VCF-style genotype fields string, such as 1/1:0,2:2:6:89,6,0 or ./.:.:.:.:.
Returns:true if input has any real genotype fields content, false if it is just periods, zeroes, and delimiters.
Return type:bool
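One way to express this check is a regular expression that matches strings made only of periods, zeroes, and delimiters. This is a sketch rather than VAPr's implementation: has_genotype_content is a hypothetical name, and the delimiter set (':', ',', '/', '|') is an assumption based on common VCF usage.

```python
import re

# Assumption: the delimiters are ':', ',', '/', and '|'.
_NO_CONTENT = re.compile(r"[.0:,/|]*")


def has_genotype_content(genotype_fields_string):
    """True unless the string holds only periods, zeroes, and delimiters."""
    return _NO_CONTENT.fullmatch(genotype_fields_string) is None


with_content = has_genotype_content("1/1:0,2:2:6:89,6,0")
without_content = has_genotype_content("./.:.:.:.:.")
```

The first example string contains non-zero digits and so counts as real content; the second consists only of periods and delimiters.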
classmethod parse(format_key_string, format_value_string)[source]

Parse the input format string and genotype fields string into a filled VCFGenotypeInfo object.

Parameters:
  • format_key_string (str) – The VCF format string (e.g., ‘GT:AD:DP:GQ:PL’) for this sample at this site.
  • format_value_string (str) – The VCF genotype fields values string (e.g., ‘1/1:0,34:34:99:1187.2,101,0’) corresponding to the format_key_string for this sample at this site.
Returns:

A filled VCFGenotypeInfo for this sample at this site unless an error was encountered, in which case None is returned.

Return type:

VCFGenotypeInfo or None
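The pairing of format keys with genotype values can be sketched as follows. This illustrates only the colon-and-comma splitting that the two example strings imply; split_genotype_fields is a hypothetical helper that returns a plain dict rather than a VCFGenotypeInfo, and returning None on a length mismatch is an assumption about what counts as an error.

```python
def split_genotype_fields(format_key_string, format_value_string):
    """Hypothetical helper: map each format tag to its raw value(s)."""
    keys = format_key_string.split(":")
    values = format_value_string.split(":")
    if len(keys) != len(values):
        return None  # assumption: mismatched lengths are treated as an error
    fields = {}
    for key, value in zip(keys, values):
        # Comma-separated values (e.g. AD, PL) become lists of strings.
        fields[key] = value.split(",") if "," in value else value
    return fields


fields = split_genotype_fields("GT:AD:DP:GQ:PL", "1/1:0,34:34:99:1187.2,101,0")
```

In this example the GT tag maps to '1/1', while AD and PL map to lists because their values contain commas.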

VAPr.vcf_merging module

VAPr.vcf_merging.bgzip_and_index_vcf(vcf_path)[source]

bgzip and index each vcf so it can be merged with bcftools.

VAPr.vcf_merging.merge_vcfs(input_dir, output_dir, project_name, raw_vcf_path_list=None, vcfs_gzipped=False)[source]

Merge vcf files into single multisample vcf, bgzip and index merged vcf file.

Module contents