Germline SNP and indel variant calling was performed following the Genome Analysis Toolkit (GATK, v4.1.0.0) best-practice recommendations. Raw reads were mapped to the UCSC human reference genome hg38 using the Burrows-Wheeler Aligner (BWA-MEM, v0.7.17). Optical and PCR duplicates were marked and reads were sorted using Picard (v4.1.0.0). Base quality score recalibration was performed with the GATK BaseRecalibrator, resulting in a final BAM file for each sample. The reference data used for base quality score recalibration were dbSNP138, the Mills and 1000 Genomes gold standard indels, and 1000 Genomes phase 1, provided in the GATK Resource Bundle (last modified 8/).

After data pre-processing, variant calling was done with HaplotypeCaller (v4.1.0.0) in ERC GVCF mode to generate an intermediate gVCF file for each sample. The gVCF files were then consolidated using GenomicsDBImport to create a single file for joint calling. Joint calling was performed on the entire cohort of 147 samples using the GATK4 GenotypeGVCFs tool to create one multi-sample VCF file.
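The three steps above can be sketched as a small helper that assembles the corresponding GATK4 command lines. This is an illustrative sketch only: the file names, interval list and workspace path are hypothetical, and any additional options used in the actual pipeline are not shown.

```python
from typing import Dict, List


def joint_calling_commands(
    sample_bams: Dict[str, str],
    reference: str,
    intervals: str,
    workspace: str,
    output_vcf: str,
) -> List[List[str]]:
    """Build (but do not run) GATK4 commands for per-sample gVCF calling,
    gVCF consolidation and joint genotyping. All paths are placeholders."""
    commands: List[List[str]] = []
    gvcfs: List[str] = []

    # Step 1: one HaplotypeCaller run per sample in GVCF mode.
    for sample, bam in sample_bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append(
            ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
             "-O", gvcf, "-ERC", "GVCF"]
        )

    # Step 2: consolidate all gVCFs into one GenomicsDB workspace.
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace, "-L", intervals]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    commands.append(import_cmd)

    # Step 3: joint genotyping across the whole cohort.
    commands.append(
        ["gatk", "GenotypeGVCFs", "-R", reference,
         "-V", f"gendb://{workspace}", "-O", output_vcf]
    )
    return commands
```

Each returned list could be passed to `subprocess.run` on a system where GATK is installed; building the commands separately from executing them keeps the sketch easy to inspect.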

Because the targeted exome sequencing data in this study do not support Variant Quality Score Recalibration (VQSR), we used hard filtering instead. We applied the hard-filter thresholds recommended by GATK to maximize the number of true positives and minimize the number of false-positive variants. Following standard GATK recommendations, the metrics assessed during quality control and used for filtering were FS, SOR, ReadPosRankSum, MQRankSum, QD, DP and MQ for SNVs, and FS, SOR, ReadPosRankSum, MQRankSum, QD and DP for indels.
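As an illustration of how such hard filters operate, the sketch below flags a variant when any annotation crosses its threshold. The thresholds are the generic values given in the GATK hard-filtering documentation and are an assumption here, since the exact values used in this study are not listed above. DP (and MQRankSum for indels) is omitted because no generic threshold is assumed for it.

```python
import operator
from typing import Dict, List

# Assumed generic GATK hard-filter thresholds (not confirmed for this study).
# Each rule reads: the variant fails if annotation <op> threshold is true.
SNV_RULES = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
    ("SOR", operator.gt, 3.0),
]
INDEL_RULES = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 200.0),
    ("ReadPosRankSum", operator.lt, -20.0),
    ("SOR", operator.gt, 10.0),
]


def failed_filters(annotations: Dict[str, float], is_snv: bool) -> List[str]:
    """Return the names of the annotations that fail their threshold.

    Annotations missing from the record are skipped rather than failed.
    """
    rules = SNV_RULES if is_snv else INDEL_RULES
    return [
        name
        for name, fails, threshold in rules
        if name in annotations and fails(annotations[name], threshold)
    ]
```

A variant passes when the returned list is empty. For example, an SNV with `FS = 100.0` would be flagged, whereas an indel with the same value would not under these assumed thresholds.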

In addition, the GATK variant calling pipeline was validated on a reference sample (HG001, Genome in a Bottle), yielding a recall of 96.9% and a precision of 99.4%. All steps were run on the Seven Bridges Cancer Genomics Cloud platform.
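For reference, recall and precision follow from the counts of true-positive, false-positive and false-negative calls relative to the truth set. The sketch below shows the arithmetic; the counts in the usage note are hypothetical and are not the counts from this validation.

```python
from typing import Tuple


def recall_precision(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (recall, precision) from benchmark counts.

    recall    = TP / (TP + FN): fraction of truth-set variants recovered
    precision = TP / (TP + FP): fraction of called variants that are true
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("recall/precision undefined without calls or truth variants")
    return tp / (tp + fn), tp / (tp + fp)
```

With hypothetical counts of 90 true positives, 10 false positives and 30 false negatives, the function returns a recall of 0.75 and a precision of 0.9.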

Quality control and annotation

To assess the quality of the obtained set of variants, we calculated per-sample metrics with BCFtools v1.9, such as the total number of variants and the mean transition-to-transversion ratio (Ti/Tv); average coverage per site was calculated for each BAM file with SAMtools v1.3 [65]. We calculated the number of singletons and the ratio of heterozygous to non-reference homozygous sites (Het/Hom) in order to filter out low-quality samples. Samples with a deviating Het/Hom ratio were removed using PLINK v1.9 [66]. We marked the sites with depth (DP)
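The two ratios used as per-sample quality metrics can be written out as follows. This is a minimal sketch of the definitions, not of the BCFtools or PLINK implementations; the input representations (REF/ALT pairs and allele-index tuples) are assumptions made for illustration.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ti/Tv for an iterable of (REF, ALT) single-nucleotide variants."""
    ti = tv = 0
    for ref, alt in snvs:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("Ti/Tv undefined without transversions")
    return ti / tv


def het_hom_ratio(genotypes: Iterable[Tuple[int, int]]) -> float:
    """Het/Hom for diploid genotypes given as allele-index pairs, e.g. (0, 1).

    Heterozygous: the two alleles differ.
    Non-reference homozygous: identical alleles that are not the reference (0).
    """
    het = hom_alt = 0
    for a, b in genotypes:
        if a != b:
            het += 1
        elif a > 0:
            hom_alt += 1
    if hom_alt == 0:
        raise ValueError("Het/Hom undefined without non-reference homozygous sites")
    return het / hom_alt
```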

We used the Ensembl Variant Effect Predictor (VEP, ensembl-vep 90.5) for functional annotation of the final set of variants. The databases used in VEP were 1kGP Phase3, COSMIC v81, ClinVar 201706, NHLBI ESP V2-SSA137, HGMD-Public 20164, dbSNP150, GENCODE v27, gnomAD v2.1 and the Regulatory Build. VEP provides scores and pathogenicity predictions from the Sorting Intolerant From Tolerant (SIFT, v5.2.2) and PolyPhen-2 (v2.2.2) tools. For each transcript in the final dataset, we obtained the coding consequence prediction and the scores from SIFT and PolyPhen-2. A canonical transcript was assigned to each gene, based on VEP.

Serbian sample sex structure

We analyzed the number of reads mapped to the sex chromosomes in each sample's BAM file, using CNVkit to generate the target and antitarget BED files.

Distribution of variants

To investigate the allele frequency distribution in the Serbian population sample, we categorized variants into four groups based on minor allele frequency (MAF): MAF ≤ 1%, 1–2%, 2–5% and ≥ 5%. We separately categorized singletons (AC = 1) and private doubletons (AC = 2), where a variant occurs in only one individual and in the homozygous state.
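The categorization above can be expressed as a short function. This is a sketch under stated assumptions: the handling of values that fall exactly on the 2% boundary is not specified in the text, so the bin edges chosen here (upper edge inclusive at 1% and 2%, 5% assigned to the top bin) are an assumption, as are the function and label names.

```python
def classify_variant(ac: int, an: int, n_hom_alt: int = 0) -> str:
    """Assign a variant to a frequency category.

    ac        : alternate allele count in the cohort
    an        : total number of called alleles (2 x samples for autosomes)
    n_hom_alt : number of individuals homozygous for the alternate allele
    """
    if ac <= 0 or an <= 0 or ac > an:
        raise ValueError("expected 0 < ac <= an")

    # Singletons and private doubletons are reported separately.
    if ac == 1:
        return "singleton"
    if ac == 2 and n_hom_alt == 1:
        return "private doubleton"

    af = ac / an
    maf = min(af, 1.0 - af)  # frequency of the less common allele

    if maf <= 0.01:
        return "MAF <= 1%"
    if maf <= 0.02:
        return "MAF 1-2%"
    if maf < 0.05:
        return "MAF 2-5%"
    return "MAF >= 5%"
```

For a cohort of 147 samples (294 alleles at an autosomal site), an allele count of 3 gives a MAF of about 1.02% and falls into the 1–2% group.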

We classified variants into four functional impact groups based on Ensembl. The HIGH (loss-of-function) group includes splice donor variants, splice acceptor variants, stop gained, frameshift variants, stop lost and start lost. The MODERATE group includes inframe insertions, inframe deletions and missense variants. The LOW group includes splice region variants, synonymous variants, and start and stop retained variants. The MODIFIER group includes coding sequence variants, 5' UTR and 3' UTR variants, non-coding transcript exon variants, intron variants, NMD transcript variants, non-coding transcript variants, upstream gene variants, downstream gene variants and intergenic variants.
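The grouping above amounts to a lookup from consequence term to impact class. The sketch below encodes it using the Sequence Ontology term names that VEP reports. The helper that picks the most severe impact for a variant with several `&`-joined consequences is an illustrative assumption and is not described in the text.

```python
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]  # most to least severe

IMPACT_GROUPS = {
    "HIGH": [
        "splice_donor_variant", "splice_acceptor_variant", "stop_gained",
        "frameshift_variant", "stop_lost", "start_lost",
    ],
    "MODERATE": ["inframe_insertion", "inframe_deletion", "missense_variant"],
    "LOW": [
        "splice_region_variant", "synonymous_variant",
        "start_retained_variant", "stop_retained_variant",
    ],
    "MODIFIER": [
        "coding_sequence_variant", "5_prime_UTR_variant", "3_prime_UTR_variant",
        "non_coding_transcript_exon_variant", "intron_variant",
        "NMD_transcript_variant", "non_coding_transcript_variant",
        "upstream_gene_variant", "downstream_gene_variant", "intergenic_variant",
    ],
}

# Invert the grouping into a term -> impact lookup table.
IMPACT_BY_CONSEQUENCE = {
    term: impact for impact, terms in IMPACT_GROUPS.items() for term in terms
}


def most_severe_impact(consequences: str) -> str:
    """Return the most severe impact among '&'-joined consequence terms.

    Raises KeyError for a term that is not in the table above.
    """
    impacts = [IMPACT_BY_CONSEQUENCE[term] for term in consequences.split("&")]
    return min(impacts, key=IMPACT_ORDER.index)
```

For example, `most_severe_impact("missense_variant&splice_region_variant")` returns `"MODERATE"`, since a missense variant outranks a splice region variant in this scheme.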

