Segregating Sites VCF Calculator

Upload or paste VCF data to count segregating sites. Adjust filters, review per-site allele summaries, and export the results as tables for population genetics work.

Calculator Input

You may paste VCF text or upload a file. Uploaded content is used first.
Callable sequence length is optional. It is used for S per kb and theta per bp.

Example Data Table

This table shows how genotype patterns affect the segregating site decision.

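| CHROM | POS | REF | ALT | Sample Genotypes | Observed Alleles | Decision |
| --- | --- | --- | --- | --- | --- | --- |
| chr1 | 1050 | A | G | 0/0, 0/1, 1/1 | A and G | Segregating |
| chr1 | 1088 | C | T | 0/0, 0/0, 0/0 | C only | Monomorphic |
| chr2 | 2240 | G | A,C | 0/1, 1/2, ./. | G, A, C | Segregating |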

Small VCF Example

##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1	S2	S3
chr1	1050	.	A	G	99	PASS	AC=3;AN=6	GT	0/0	0/1	1/1
chr1	1088	.	C	T	80	PASS	AC=0;AN=6	GT	0/0	0/0	0/0
chr2	2240	.	G	A,C	70	PASS	AC=2,1;AN=4	GT	0/1	1/2	./.
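
As a rough illustration of how records like these can be read, the sketch below splits each data line on tabs and pulls out the GT subfield for every sample. The function name `parse_records` and the tuple layout it returns are illustrative choices, not the calculator's actual code.

```python
def parse_records(vcf_text):
    """Yield (chrom, pos, ref, alts, genotypes) for each VCF data line."""
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alts = [] if alt == "." else alt.split(",")
        genotypes = []
        if len(fields) > 9:
            # FORMAT lists the per-sample keys; locate GT among them.
            keys = fields[8].split(":")
            if "GT" in keys:
                gt_index = keys.index("GT")
                for sample in fields[9:]:
                    parts = sample.split(":")
                    genotypes.append(parts[gt_index] if gt_index < len(parts) else ".")
        yield chrom, pos, ref, alts, genotypes


# The small example from this page, written with explicit tab escapes.
VCF_TEXT = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "chr1\t1050\t.\tA\tG\t99\tPASS\tAC=3;AN=6\tGT\t0/0\t0/1\t1/1",
    "chr1\t1088\t.\tC\tT\t80\tPASS\tAC=0;AN=6\tGT\t0/0\t0/0\t0/0",
    "chr2\t2240\t.\tG\tA,C\t70\tPASS\tAC=2,1;AN=4\tGT\t0/1\t1/2\t./.",
])

records = list(parse_records(VCF_TEXT))
for record in records:
    print(record)
```

Records without sample columns come back with an empty genotype list, which leaves room for an INFO-based fallback.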

Formula Used

Segregating sites, S: Count each eligible VCF position where two or more alleles are observed among called genotypes.

Observed allele rule: A site is segregating when at least two alleles have a count greater than zero among the called genotypes.
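
The rule can be sketched in a few lines, assuming genotypes arrive as GT strings such as `0/1` or `1|2` and that `.` marks an uncalled allele. The helper names `count_alleles` and `is_segregating` are hypothetical and are used here only for illustration.

```python
import re
from collections import Counter


def count_alleles(genotypes):
    """Count allele indices over called genotypes; '.' alleles are skipped."""
    counts = Counter()
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele != ".":
                counts[int(allele)] += 1
    return counts


def is_segregating(genotypes):
    """True when at least two distinct alleles have a nonzero count."""
    counts = count_alleles(genotypes)
    return sum(1 for c in counts.values() if c > 0) >= 2


examples = [
    ["0/0", "0/1", "1/1"],  # chr1:1050
    ["0/0", "0/0", "0/0"],  # chr1:1088
    ["0/1", "1/2", "./."],  # chr2:2240
]
for gts in examples:
    print(gts, dict(count_alleles(gts)), is_segregating(gts))
```

For the three example rows this reports segregating, monomorphic, and segregating, matching the decisions in the example data table.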

Missing percent: Missing percent = missing samples / total samples × 100.
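
A minimal sketch of this calculation follows. It assumes a sample counts as missing only when none of its alleles is called (for example `./.`); how the calculator treats half-called genotypes such as `0/.` is not stated on this page, so that choice is an assumption, as is the function name.

```python
import re


def missing_percent(genotypes):
    """Percent of samples whose genotype has no called allele."""
    if not genotypes:
        raise ValueError("at least one sample is required")
    missing = sum(
        1
        for gt in genotypes
        if all(allele == "." for allele in re.split(r"[/|]", gt))
    )
    return missing / len(genotypes) * 100


print(missing_percent(["0/1", "1/2", "./."]))  # one of three samples is missing
```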

Minor allele count: MAC is the smallest nonzero allele count at a variable site.
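
As a small sketch, MAC can be taken from a mapping of allele index to count. The function name and the choice to return `None` for a non-variable site are assumptions made for this example.

```python
def minor_allele_count(allele_counts):
    """Smallest nonzero allele count, or None when the site is not variable."""
    nonzero = [count for count in allele_counts.values() if count > 0]
    if len(nonzero) < 2:
        return None  # fewer than two observed alleles: no minor allele
    return min(nonzero)


print(minor_allele_count({0: 3, 1: 3}))        # chr1:1050
print(minor_allele_count({0: 1, 1: 2, 2: 1}))  # chr2:2240
print(minor_allele_count({0: 6, 1: 0}))        # chr1:1088, monomorphic
```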

S per kb: S per kb = (S / callable base pairs) × 1000.

Watterson theta: theta = S / a1, where a1 = sum of 1 / i for i from 1 to n - 1. The calculator estimates n from average called alleles across segregating sites.
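
The two normalized quantities above can be worked through for the small VCF example. This sketch assumes n is the average called allele count rounded to the nearest integer, that theta per bp is theta divided by callable base pairs, and a made-up callable length of 5000 bp. None of these details are confirmed by the page, and the function names are illustrative.

```python
def s_per_kb(s, callable_bp):
    """Segregating sites per 1000 callable base pairs."""
    return s / callable_bp * 1000


def harmonic_a1(n):
    """a1 = sum of 1 / i for i from 1 to n - 1."""
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(s, called_alleles_per_site):
    """theta = S / a1, with n estimated from the average called alleles."""
    if not called_alleles_per_site:
        raise ValueError("no segregating sites to estimate n from")
    # Assumption: round the average to the nearest integer to obtain n.
    n = round(sum(called_alleles_per_site) / len(called_alleles_per_site))
    if n < 2:
        raise ValueError("n must be at least 2")
    return s / harmonic_a1(n)


S = 2                 # chr1:1050 and chr2:2240 are segregating
called = [6, 4]       # AN at those two sites
callable_bp = 5000    # hypothetical callable length

theta = watterson_theta(S, called)
print("S per kb:", s_per_kb(S, callable_bp))
print("theta:", theta)
print("theta per bp:", theta / callable_bp)
```

With n = 5, a1 = 1 + 1/2 + 1/3 + 1/4 = 25/12, so theta = 2 / (25/12) = 0.96 for this example.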

How to Use This Calculator

  1. Paste VCF text or upload a VCF file.
  2. Select whether to count all sites, SNVs, or indel and complex sites.
  3. Choose biallelic, multiallelic, or all ALT structures.
  4. Set missingness, called sample, ALT count, and MAC filters.
  5. Add callable sequence length when normalized rates are needed.
  6. Press the calculate button and review the result above the form.
  7. Use CSV or PDF export for reports and records.
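
Steps 2 and 3 amount to classifying each record by its REF and ALT strings. The sketch below shows one plausible way to do that. The category labels and the rule that an SNV has a single-base REF and only single-base ALT alleles are assumptions, since the page does not spell out the calculator's exact definitions.

```python
BASES = {"A", "C", "G", "T", "N"}


def classify_site(ref, alts):
    """Return (site_type, alt_structure) for one record."""
    is_snv = ref.upper() in BASES and all(alt.upper() in BASES for alt in alts)
    site_type = "snv" if is_snv else "indel_or_complex"
    alt_structure = "biallelic" if len(alts) == 1 else "multiallelic"
    return site_type, alt_structure


def passes_selection(ref, alts, want_type="all", want_structure="all"):
    """Apply the site-type and ALT-structure choices to one record."""
    site_type, alt_structure = classify_site(ref, alts)
    type_ok = want_type in ("all", site_type)
    structure_ok = want_structure in ("all", alt_structure)
    return type_ok and structure_ok


print(classify_site("A", ["G"]))        # chr1:1050
print(classify_site("G", ["A", "C"]))   # chr2:2240
print(classify_site("AT", ["A"]))       # a deletion, not in the example file
```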

Understanding Segregating Sites in VCF Files

What the Count Means

A segregating site is a genomic position that shows variation in a sample set. The site must have at least two observed alleles. One allele can be the reference allele. Another can be an alternate allele. The key point is variation among called genotypes.

Why VCF Filtering Matters

Raw VCF files often include low quality calls. Some records fail filters. Some samples have missing genotypes. Some sites may have alternate alleles listed but no real variation in the selected samples. A useful count should therefore apply clear rules before reporting S.

Using Genotypes and INFO Fields

This calculator first reads genotype fields when sample columns are present. It checks GT values such as 0/0, 0/1, 1/1, and 1/2. It counts reference and alternate allele observations. If genotype calls are unavailable, it can use AC and AN values from the INFO column. This helps with summary VCF files.
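
The INFO fallback can be sketched as follows, assuming AC lists one count per ALT allele and AN is the total number of called alleles, so the reference count is AN minus the sum of AC. The function name `counts_from_info` is hypothetical.

```python
def counts_from_info(info):
    """Build {allele_index: count} from AC and AN, or None if either is absent."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if "AC" not in fields or "AN" not in fields:
        return None
    alt_counts = [int(value) for value in fields["AC"].split(",")]
    total = int(fields["AN"])
    counts = {0: total - sum(alt_counts)}  # index 0 is the reference allele
    for index, count in enumerate(alt_counts, start=1):
        counts[index] = count
    return counts


for info in ("AC=3;AN=6", "AC=0;AN=6", "AC=2,1;AN=4", "DP=40"):
    print(info, counts_from_info(info))
```

For the example file, the counts derived this way agree with the counts obtained from the genotype columns.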

Interpreting the Result

The main output is S, the number of eligible variable positions. The table also shows called samples, missing percentage, alternate allele count, minor allele count, and allele summaries. These values help explain why a site was counted or treated as monomorphic.

Population Genetics Use

Segregating sites are common in diversity studies. They help compare populations, regions, genes, and filtering strategies. A higher S value can suggest more variation, but it also depends on sample size, callable sequence length, and data quality. Normalized values, such as S per kb, make comparisons more balanced.

Practical Advice

Use strict filters for final analysis. Use relaxed filters during exploration. Keep the same settings when comparing multiple files. Record the filter choices with every result. This makes your counts easier to audit, reproduce, and explain later.
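
One way to keep filter choices attached to every result is to store them as a small structured record. The field names below are hypothetical; they loosely mirror the options listed in the usage steps and are not the calculator's export format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FilterSettings:
    """Hypothetical container for the filter choices behind one S value."""
    site_type: str = "all"            # "all", "snv", or "indel_or_complex"
    alt_structure: str = "all"        # "all", "biallelic", or "multiallelic"
    max_missing_percent: float = 100.0
    min_called_samples: int = 1
    min_alt_count: int = 0
    min_mac: int = 0


def result_record(s, settings):
    """Serialize S together with the settings that produced it."""
    return json.dumps({"S": s, "filters": asdict(settings)}, sort_keys=True)


strict = FilterSettings(
    site_type="snv",
    alt_structure="biallelic",
    max_missing_percent=10.0,
    min_called_samples=3,
    min_mac=1,
)
print(result_record(1, strict))
```

Saving this record next to each exported table makes it straightforward to confirm that two counts were produced with the same settings.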

FAQs

1. What is a segregating site?

A segregating site is a position where at least two alleles are observed among the called samples. It represents variation in the analyzed group.

2. Does the calculator count monomorphic ALT records?

No, not when genotypes show only one observed allele. A listed ALT allele is not enough if all called samples share the same allele.

3. Can multiallelic sites be counted?

Yes. Select all ALT structures or multiallelic only. The calculator counts observed reference and alternate alleles from genotypes or INFO counts.

4. What does MAC mean?

MAC means minor allele count. It is the smallest nonzero allele count at a variable site after filters are applied.

5. Why use a missing percentage filter?

High missingness makes allele counts less reliable. A maximum missing percentage filter removes sites with too many uncalled sample genotypes.

6. What happens without sample columns?

The calculator tries to use AC and AN values from the INFO field. If those are missing, it uses ALT presence as a limited fallback.

7. What is S per kb?

S per kb normalizes segregating sites by callable sequence length. It equals S divided by callable base pairs, multiplied by 1000.

8. Is Watterson theta exact here?

It is an estimate based on segregating sites and average called alleles. For formal studies, use consistent ploidy and callable-site definitions.

Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.