Example Data Table
This table shows how genotype patterns affect the segregating site decision.
| CHROM | POS | REF | ALT | Sample Genotypes | Observed Alleles | Decision |
|---|---|---|---|---|---|---|
| chr1 | 1050 | A | G | 0/0, 0/1, 1/1 | A and G | Segregating |
| chr1 | 1088 | C | T | 0/0, 0/0, 0/0 | C only | Monomorphic |
| chr2 | 2240 | G | A,C | 0/1, 1/2, ./. | G, A, C | Segregating |
Small VCF Example
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3
chr1 1050 . A G 99 PASS AC=3;AN=6 GT 0/0 0/1 1/1
chr1 1088 . C T 80 PASS AC=0;AN=6 GT 0/0 0/0 0/0
chr2 2240 . G A,C 70 PASS AC=2,1;AN=4 GT 0/1 1/2 ./.
Formula Used
Segregating sites, S: Count each eligible VCF position where two or more alleles are observed among called genotypes.
Observed allele rule: A site is segregating when the called allele count contains at least two alleles with count greater than zero.
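The observed allele rule can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function names `allele_counts` and `is_segregating` are hypothetical, and the genotypes are the three rows from the example table.

```python
from collections import Counter

def allele_counts(genotypes):
    """Count allele indices across called genotypes; '.' alleles are skipped."""
    counts = Counter()
    for gt in genotypes:
        # Accept both unphased ('/') and phased ('|') separators.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[int(allele)] += 1
    return counts

def is_segregating(genotypes):
    """True when at least two distinct alleles have a count above zero."""
    return len(allele_counts(genotypes)) >= 2

# Genotypes copied from the example table.
sites = {
    ("chr1", 1050): ["0/0", "0/1", "1/1"],
    ("chr1", 1088): ["0/0", "0/0", "0/0"],
    ("chr2", 2240): ["0/1", "1/2", "./."],
}
S = sum(is_segregating(gts) for gts in sites.values())
print(S)  # 2
```

Only alleles that appear in a called genotype enter the counter, so an ALT allele that is listed but never observed does not make a site segregating.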
Missing percent: Missing percent = missing samples / total samples × 100.
Minor allele count: MAC is the smallest nonzero allele count at a variable site.
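As a rough sketch of the two per-site statistics above, the functions below compute missing percent and MAC. The names are hypothetical. One assumption is made that the page does not state: a sample counts as missing only when every allele in its genotype is uncalled, so a half-call such as `./1` is treated as called.

```python
def missing_percent(genotypes):
    """Percent of samples whose genotype has no called allele."""
    if not genotypes:
        return 0.0
    missing = sum(
        1
        for gt in genotypes
        # Assumption: a sample is missing only if all of its alleles are '.'.
        if all(a == "." for a in gt.replace("|", "/").split("/"))
    )
    return missing / len(genotypes) * 100

def minor_allele_count(counts):
    """Smallest nonzero allele count, or 0 when fewer than two alleles are observed."""
    nonzero = [c for c in counts.values() if c > 0]
    return min(nonzero) if len(nonzero) >= 2 else 0

# chr2:2240 from the example table: genotypes 0/1, 1/2, ./.
print(round(missing_percent(["0/1", "1/2", "./."]), 2))  # 33.33
print(minor_allele_count({0: 1, 1: 2, 2: 1}))            # 1
```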
S per kb: S per kb = S / callable base pairs × 1000.
Watterson theta: theta = S / a1, where a1 = sum of 1 / i for i from 1 to n - 1. Here n is the number of sampled allele copies (chromosomes). The calculator estimates n from the average number of called alleles across segregating sites.
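The two normalized values can be worked through with the small VCF example. This is a sketch under stated assumptions: the function names are hypothetical, the callable length of 5,000 bp is made up for illustration, and rounding the average called allele count to the nearest integer is an assumption, since the page does not say how the calculator turns the average into n.

```python
def s_per_kb(s, callable_bp):
    """Segregating sites per 1,000 callable base pairs."""
    if callable_bp <= 0:
        raise ValueError("callable_bp must be positive")
    return s / callable_bp * 1000

def harmonic_a1(n):
    """a1 = 1/1 + 1/2 + ... + 1/(n - 1)."""
    return sum(1 / i for i in range(1, n))

def watterson_theta(s, n):
    """theta = S / a1 for n sampled allele copies."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return s / harmonic_a1(n)

# The two segregating sites in the example have AN = 6 and AN = 4 called alleles.
called_alleles = [6, 4]
n = round(sum(called_alleles) / len(called_alleles))  # assumption: round to nearest integer
S = 2
callable_bp = 5000  # hypothetical callable length

print(n)                                   # 5
print(f"{s_per_kb(S, callable_bp):.2f}")   # 0.40
print(f"{watterson_theta(S, n):.2f}")      # 0.96
```

With n = 5, a1 = 1 + 1/2 + 1/3 + 1/4 = 25/12, so theta = 2 / (25/12) = 0.96 for the whole region. Dividing by the callable length would give a per-site value.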
How to Use This Calculator
- Paste VCF text or upload a VCF file.
- Select whether to count all sites, SNVs, or indel and complex sites.
- Choose biallelic, multiallelic, or all ALT structures.
- Set missingness, called sample, ALT count, and MAC filters.
- Add callable sequence length when normalized rates are needed.
- Press the calculate button and review the result above the form.
- Use CSV or PDF export for reports and records.
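The site-type and ALT-structure choices in the steps above amount to classifying each record by its REF and ALT columns. The sketch below is one plausible way to do that, not the calculator's confirmed logic: `classify_site` and its labels are hypothetical, and treating symbolic alleles such as `<DEL>` or `*` as indel or complex is an assumption.

```python
BASES = {"A", "C", "G", "T", "N"}

def classify_site(ref, alt_field):
    """Return (site_type, alt_structure) for one VCF record.

    site_type is 'snv', 'indel_or_complex', or 'invariant' (no ALT listed).
    alt_structure is 'biallelic', 'multiallelic', or 'no_alt'.
    """
    alts = [a for a in alt_field.split(",") if a not in ("", ".")]
    if not alts:
        return ("invariant", "no_alt")
    structure = "biallelic" if len(alts) == 1 else "multiallelic"
    # An SNV needs a single-base REF and single-base ALT alleles only.
    # Assumption: symbolic alleles ('<DEL>', '*') fall into the other class.
    is_snv = ref.upper() in BASES and all(a.upper() in BASES for a in alts)
    return ("snv" if is_snv else "indel_or_complex", structure)

print(classify_site("A", "G"))    # ('snv', 'biallelic')
print(classify_site("G", "A,C"))  # ('snv', 'multiallelic')
print(classify_site("AT", "A"))   # ('indel_or_complex', 'biallelic')
```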
Understanding Segregating Sites in VCF Files
What the Count Means
A segregating site is a genomic position that shows variation in a sample set. The site must have at least two observed alleles. One allele can be the reference allele. Another can be an alternate allele. The key point is variation among called genotypes.
Why VCF Filtering Matters
Raw VCF files often include low quality calls. Some records fail filters. Some samples have missing genotypes. Some sites may have alternate alleles listed but no real variation in the selected samples. A useful count should therefore apply clear rules before reporting S.
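To make the idea of clear rules concrete, the sketch below applies a small set of eligibility checks to one site. Everything here is illustrative: `SiteFilters`, `is_eligible`, and the default thresholds are hypothetical, and whether the calculator checks the FILTER column this way is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SiteFilters:
    require_pass: bool = True          # assumption: keep only PASS or '.' records
    max_missing_percent: float = 20.0  # hypothetical default
    min_called_samples: int = 2        # hypothetical default
    min_mac: int = 1                   # MAC of 0 means the site is monomorphic

def is_eligible(filter_field, called_samples, total_samples, mac, rules):
    """Return True when a site passes every rule in `rules`."""
    if rules.require_pass and filter_field not in ("PASS", "."):
        return False
    if called_samples < rules.min_called_samples:
        return False
    missing_pct = (
        (total_samples - called_samples) / total_samples * 100
        if total_samples
        else 0.0
    )
    if missing_pct > rules.max_missing_percent:
        return False
    return mac >= rules.min_mac

strict = SiteFilters(max_missing_percent=20.0)
relaxed = SiteFilters(max_missing_percent=50.0)

# chr2:2240 from the example: PASS, 2 of 3 samples called, MAC = 1.
print(is_eligible("PASS", 2, 3, 1, strict))   # False (33.3% missing)
print(is_eligible("PASS", 2, 3, 1, relaxed))  # True
```

The same site is dropped under the strict setting and kept under the relaxed one, which is why S is only comparable between runs that share filter settings.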
Using Genotypes and INFO Fields
This calculator first reads genotype fields when sample columns are present. It checks GT values such as 0/0, 0/1, 1/1, and 1/2. It counts reference and alternate allele observations. If genotype calls are unavailable, it can use AC and AN values from the INFO column. This helps with summary VCF files.
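When only the INFO column is available, allele counts can be recovered from AC and AN. The sketch below shows one way to do it, with hypothetical function names. It relies on the VCF definitions of AC (one count per ALT allele) and AN (total called alleles), so the reference count is AN minus the sum of AC.

```python
def counts_from_info(info):
    """Return [ref_count, alt1_count, ...] from an INFO string, or None if unusable."""
    # Flags without '=' (for example 'DB') are ignored.
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if "AC" not in fields or "AN" not in fields:
        return None
    try:
        alt_counts = [int(x) for x in fields["AC"].split(",")]
        total = int(fields["AN"])
    except ValueError:
        return None
    ref_count = total - sum(alt_counts)
    if ref_count < 0:
        return None  # inconsistent AC/AN values
    return [ref_count] + alt_counts

def is_segregating_from_info(info):
    """True when at least two alleles have a nonzero count in AC/AN."""
    counts = counts_from_info(info)
    return counts is not None and sum(1 for c in counts if c > 0) >= 2

# INFO values from the small VCF example.
print(counts_from_info("AC=3;AN=6"))          # [3, 3]
print(is_segregating_from_info("AC=0;AN=6"))  # False
print(counts_from_info("AC=2,1;AN=4"))        # [1, 2, 1]
```

The counts match those obtained from the genotype columns of the same three records, which is a useful consistency check when both sources are present.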
Interpreting the Result
The main output is S, the number of eligible variable positions. The table also shows called samples, missing percentage, alternate allele count, minor allele count, and allele summaries. These values help explain why a site was counted or treated as monomorphic.
Population Genetics Use
Segregating sites are common in diversity studies. They help compare populations, regions, genes, and filtering strategies. A higher S value can suggest more variation, but S also grows with sample size and callable sequence length, and it is sensitive to data quality. Normalized values, such as S per kb, put regions of different lengths on the same scale.
Practical Advice
Use strict filters for final analysis. Use relaxed filters during exploration. Keep the same settings when comparing multiple files. Record the filter choices with every result. This makes your counts easier to audit, reproduce, and explain later.
FAQs
1. What is a segregating site?
A segregating site is a position where at least two alleles are observed among the called samples. It represents variation in the analyzed group.
2. Does the calculator count monomorphic ALT records?
No, not when genotypes show only one observed allele. A listed ALT allele is not enough if all called samples share the same allele.
3. Can multiallelic sites be counted?
Yes. Select all ALT structures or multiallelic only. The calculator counts observed reference and alternate alleles from genotypes or INFO counts.
4. What does MAC mean?
MAC means minor allele count. It is the smallest nonzero allele count at a variable site after filters are applied.
5. Why use a missing percentage filter?
High missingness leaves fewer called alleles at a site, which makes its allele counts less reliable. A maximum missing percentage filter removes sites with too many uncalled sample genotypes.
6. What happens without sample columns?
The calculator tries to use AC and AN values from the INFO field. If those are missing, it uses ALT presence as a limited fallback.
7. What is S per kb?
S per kb normalizes segregating sites by callable sequence length. It equals S divided by callable base pairs, multiplied by 1000.
8. Is Watterson theta exact here?
It is an estimate based on segregating sites and average called alleles. For formal studies, use consistent ploidy and callable-site definitions.