| URL Variant | Why It Duplicates | Recommended Fix |
|---|---|---|
| /product/shoes?utm_source=ad | Tracking parameters change the URL but not the content. | Strip params, canonicalize, and redirect where possible. |
| /blog/post and /blog/post/ | Trailing slashes can split indexation. | Pick one format, redirect the other. |
| /page#print | Print views often mirror the main page. | Noindex print, or canonical back to main. |
- Exact duplicate: SHA‑1 fingerprint of normalized page text (optionally including title, meta description, and canonical).
- Near duplicate: Similarity = 1 − (HammingDistance(SimHashA, SimHashB) ÷ 64).
- Paste two or more URLs, one per line.
- Decide whether to strip query parameters and fragments.
- Set a near‑duplicate threshold, then submit.
- Review exact and near duplicate groups above the form.
- Export CSV or PDF for audits and action lists.
Why duplicates reduce crawl efficiency
Search engines may crawl every unique URL they discover. When one template generates many variants, the crawl queue grows quickly and important pages may be visited less often. This calculator requests each page, follows up to six redirects, and records the final URL to expose consolidation opportunities. The default scan settings use a 12‑second timeout and a 1.5 MB page cap. Use batches of 20–50 URLs to keep scans fast and results easy to compare.
Exact matching with content fingerprints
For exact duplicates, the calculator extracts visible text, removes scripts and styles, collapses whitespace, and lowercases the result. It then creates a SHA‑1 fingerprint from the normalized text. Pages sharing the same fingerprint are grouped together, and the table shows status code, word count, and a fingerprint. You can include the title, meta description, and canonical URL inside the fingerprint to detect templated pages with matching metadata.
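The normalization and fingerprint steps above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the names `normalize_text` and `fingerprint`, the use of `html.parser` for text extraction, and the newline used to join optional fields are all assumptions.

```python
import hashlib
import re
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def normalize_text(html):
    """Extract text, collapse whitespace, and lowercase the result."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(html, extra_fields=()):
    """SHA-1 of the normalized text plus optional title/meta/canonical strings."""
    parts = [normalize_text(html), *extra_fields]
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()
```

Two pages that differ only in scripts, styles, spacing, or letter case produce the same hex digest here, while passing different titles in `extra_fields` separates them.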
Near matching using a SimHash score
Near duplicates are detected with a 64‑bit SimHash built from token frequencies, which is resilient to small edits. Similarity is computed as 1 minus the Hamming distance divided by 64. If a pair clusters when its score meets or exceeds the threshold, 0.86 lets pages differ by up to eight of the 64 bits (nine differing bits score about 0.859, just under the cutoff), while the stricter 0.90 allows at most six. The calculator computes an average similarity per cluster for prioritization.
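A rough sketch of the SimHash score follows. The source does not say how tokens are split or hashed, so the `\w+` tokenizer and the MD5-derived 64‑bit token hash are assumptions, as are the function names. `max_differing_bits` assumes a pair clusters when its similarity is at least the threshold.

```python
import hashlib
import math
import re
from collections import Counter


def simhash64(text):
    """64-bit SimHash weighted by token frequency (tokenizer and hash assumed)."""
    weights = [0] * 64
    counts = Counter(re.findall(r"\w+", text.lower()))
    for token, count in counts.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Each token votes +count or -count on every bit position.
            weights[bit] += count if (h >> bit) & 1 else -count
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def similarity(hash_a, hash_b):
    """1 minus (Hamming distance divided by 64)."""
    return 1 - bin(hash_a ^ hash_b).count("1") / 64


def max_differing_bits(threshold):
    """Largest Hamming distance d for which 1 - d/64 >= threshold."""
    return math.floor((1 - threshold) * 64 + 1e-9)
```

With these definitions, `max_differing_bits(0.86)` returns 8 and `max_differing_bits(0.90)` returns 6, which is one way to translate a threshold into a concrete bit budget before tuning it.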
URL normalization prevents false splits
Duplicate content frequently comes from query parameters, tracking codes, and anchors. Enable “Strip query parameters” to collapse UTM variants into one normalized URL, and keep fragments only when anchors change meaningful content. Watch trailing slashes, mixed casing, and redirected paths, because these can split indexation signals across multiple locations. Compare the detected canonical tag against the final URL to confirm whether the preferred version is declared consistently.
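As an illustration of the normalization described above, the sketch below collapses query strings, fragments, and trailing slashes with the standard library. The function name and its flags are hypothetical, and trimming the trailing slash is an assumption about one reasonable policy rather than documented calculator behavior. Path casing is left alone because paths can be case-sensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, strip_query=True, strip_fragment=True,
                  strip_trailing_slash=True):
    """Return a comparison key for a URL (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if strip_trailing_slash and len(path) > 1:
        # Keep the root "/" but drop trailing slashes on deeper paths.
        path = path.rstrip("/") or "/"
    query = "" if strip_query else parts.query
    fragment = "" if strip_fragment else parts.fragment
    # Scheme and host are case-insensitive, so lowercase them.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, query, fragment)
    )
```

Passing `strip_query=False` keeps parameters such as pagination or filters that change meaningful content.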
Prioritize fixes with exportable evidence
Start with exact groups, then review near groups with high average similarity. Use word count as a confidence check; the default minimum is 150 words, which avoids clustering thin utility pages. Decide the primary URL, update internal links, and apply redirects or canonical tags for alternates. Export CSV for sorting, filtering, and task assignment. Export PDF for audits, approvals, and documentation snapshots that support SEO change requests.
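The grouping step can be approximated outside the tool, for example on an exported CSV loaded into dictionaries. The record keys (`url`, `fingerprint`, `word_count`) and the function name are assumptions made for this sketch. Only the 150‑word default comes from the text above.

```python
from collections import defaultdict


def exact_groups(pages, min_words=150):
    """Group URLs by fingerprint, skipping pages under the word minimum."""
    groups = defaultdict(list)
    for page in pages:
        if page["word_count"] >= min_words:
            groups[page["fingerprint"]].append(page["url"])
    # Keep only real duplicate groups, largest first for prioritization.
    duplicates = [urls for urls in groups.values() if len(urls) > 1]
    return sorted(duplicates, key=len, reverse=True)
```

Sorting the largest groups first is one possible ordering; average similarity could serve the same purpose for near‑duplicate clusters.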
1. What is an exact duplicate in this report?
Two pages are exact duplicates when their normalized text produces the same fingerprint. Normalization removes scripts and styles, collapses whitespace, and lowercases content. If you enable title, meta, or canonical inclusion, those fields also influence matching.
2. What near-duplicate threshold should I start with?
Start around 0.86 for broad clustering, then raise to 0.90 to reduce noise. Higher thresholds group only pages that differ by tiny edits, while lower thresholds also group pages that merely share a template. Review a few pairs, then tune the threshold to match your site patterns.
3. Why are some pages excluded from duplicate groups?
Grouping skips pages below the minimum word count to avoid misleading matches on thin pages. Raise or lower the minimum depending on your content. Pages with status errors, blocked requests, or very little text may still appear in the full scan table.
4. Does the calculator follow redirects and canonical tags?
It follows up to six redirects and records the final destination URL. It also extracts the canonical tag when present, so you can compare declared preference against the resolved page. Redirects and canonicals should align to one preferred version.
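The canonical check can be sketched as follows, assuming the page HTML and the resolved final URL are already in hand. The helper names are hypothetical, and resolving a relative `href` against the final URL is an assumption about how the comparison should work.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.href is None:
            attributes = dict(attrs)
            rels = (attributes.get("rel") or "").lower().split()
            if "canonical" in rels:
                self.href = attributes.get("href")


def canonical_of(html, final_url):
    """Return the absolute canonical URL, or None when no tag is present."""
    finder = _CanonicalFinder()
    finder.feed(html)
    finder.close()
    if finder.href is None:
        return None
    return urljoin(final_url, finder.href)


def canonical_matches(html, final_url):
    """True when the declared canonical equals the resolved final URL."""
    return canonical_of(html, final_url) == final_url
```

A `False` result flags pages where the redirect target and the declared canonical disagree and should be reviewed.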
5. How should I handle tracking parameters like UTM tags?
Enable query stripping to normalize parameterized URLs into one address. Keep parameters only when they change meaningful content, such as pagination or filters. After identifying duplicates, add canonical tags and redirects to consolidate signals and simplify crawling.
6. Is this safe for large lists of URLs?
Use reasonable batches so requests complete reliably. The page cap and timeout protect resources, but very large lists can still take time. Split lists by folder, template, or section, export results, and merge them in your workflow.