Example data table
The table below shows an illustrative audit snapshot, scored with the same approach the checker uses.
| Page or asset | Words | Highest similarity | Typical cause | Recommended action |
|---|---|---|---|---|
| City landing page A | 620 | 88% | Template sections repeated sitewide | Rewrite unique intro, add local FAQs |
| City landing page B | 605 | 86% | Near-identical service blocks | Consolidate blocks, add distinct proof points |
| Product page variant | 410 | 72% | Shared spec lists and boilerplate | Keep specs, rewrite benefits and use cases |
| Blog draft rewrite | 980 | 54% | Same outline, partial sentence reuse | Add new examples, restructure sections |
Formula used
Jaccard similarity (n-grams)
Content is converted into phrase shingles (n-grams). For two items with n-gram sets $A$ and $B$, similarity is:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Higher n reduces false matches from short common phrases.
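As a rough illustration of the Jaccard calculation, here is a minimal Python sketch. It assumes word-level shingles and simple whitespace tokenization; the function names `shingles` and `jaccard` are hypothetical and are not taken from the checker itself.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0  # assumption: texts too short to shingle score zero
    return len(sa & sb) / len(sa | sb)

page_a = "we offer fast plumbing repair in austin"
page_b = "we offer fast plumbing repair in dallas"
score = jaccard(page_a, page_b, n=3)  # 4 shared shingles out of 6 distinct
```

Here the two pages differ by one word, yet only four of the six distinct 3-word shingles are shared, so the score is about 0.67 rather than close to 1.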
Cosine similarity (term frequency)
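Each item becomes a word-frequency vector. For two items with vectors $\mathbf{a}$ and $\mathbf{b}$, similarity is:
$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$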
Useful for broad topical overlap, even with paraphrasing.
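The cosine calculation can be sketched in a few lines of Python. This is an assumption-laden example: it uses raw word counts with whitespace tokenization, and the function name `cosine` is hypothetical rather than the checker's own.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product only needs the words that occur in both texts.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text matches nothing
    return dot / (norm_a * norm_b)

reordered = cosine("fast plumbing repair", "repair plumbing fast")
unrelated = cosine("fast plumbing repair", "fresh sourdough bread")
```

Because word order is ignored, the reordered pair scores as identical, which is why this metric picks up topical reuse that phrase shingles can miss.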
How to use this calculator
- Paste two or more content items into the input boxes.
- Set the threshold to define what counts as “duplicate”.
- Pick Jaccard for phrase overlap, or Cosine for topic overlap.
- Use normalization options to match your publishing workflow.
- Submit to view the report above the form.
- Download CSV or PDF for sharing and tracking changes.
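For teams that want to track results outside the tool, a flagged-pairs report can be written with Python's standard `csv` module. The column names and the `report_csv` helper below are hypothetical; the checker's actual CSV layout is not described on this page.

```python
import csv
import io

def report_csv(flagged, labels):
    """Render flagged pairs as CSV text.

    flagged: list of (index_a, index_b, score) with score in the range 0..1
    labels:  human-readable name for each compared item
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["item_a", "item_b", "similarity_pct"])  # hypothetical columns
    for i, j, score in flagged:
        writer.writerow([labels[i], labels[j], f"{score * 100:.1f}"])
    return buf.getvalue()

text = report_csv([(0, 1, 0.88)], ["City landing page A", "City landing page B"])
```

Saving one such file per revision cycle makes it easy to compare scores before and after a rewrite.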
Why duplicate content is a measurable risk
Search systems cluster similar pages to avoid repeating results. When multiple drafts share the same phrases, signals like links and engagement can split across versions. This calculator quantifies overlap so you can decide whether to merge, rewrite, or canonicalize. Measuring duplication before publishing reduces index bloat, improves crawl efficiency, and protects topical focus. It also helps teams spot boilerplate that quietly spreads across categories and locations.
How similarity scoring mirrors real audits
The tool normalizes text by stripping tags, removing punctuation, and filtering short words or stopwords. It then compares each item to every other item and produces a matrix of pairwise scores. Use Jaccard n‑grams to detect copied phrasing, and Cosine term frequency to catch heavy topical reuse. Together, they provide a practical audit view. A configurable threshold turns raw scores into clear pass or fail decisions for workflows.
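The pipeline described above (normalize, compare every pair, apply a threshold) might look roughly like the following sketch. The stopword list, the minimum word length, the regex-based tag stripping, and all function names are assumptions made for illustration.

```python
import re
from itertools import combinations

# Illustrative stopword list; the checker's real list is not documented here.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for"}

def normalize(text, min_len=3):
    """Strip tags and punctuation, lowercase, drop short words and stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [w for w in text.split() if len(w) >= min_len and w not in STOPWORDS]

def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def score_matrix(items, threshold=0.3, n=2):
    """Return (matrix, flagged); flagged holds pairs at or above threshold."""
    grams = [ngram_set(normalize(t), n) for t in items]
    size = len(items)
    matrix = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    flagged = []
    for i, j in combinations(range(size), 2):
        union = grams[i] | grams[j]
        score = len(grams[i] & grams[j]) / len(union) if union else 0.0
        matrix[i][j] = matrix[j][i] = score  # the matrix is symmetric
        if score >= threshold:
            flagged.append((i, j, score))
    return matrix, flagged

items = [
    "<p>Fast plumbing repair in Austin, Texas.</p>",
    "<p>Fast plumbing repair in Dallas, Texas.</p>",
    "Our bakery sells fresh sourdough bread daily.",
]
matrix, flagged = score_matrix(items, threshold=0.3)
```

With these inputs only the two plumbing pages are flagged; the bakery text shares no 2-word shingles with either and scores zero.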
Interpreting the matrix and flagged pairs
High scores in one row indicate a page that closely resembles several others. Start with the flagged pairs list to see which combinations exceed your threshold. Review the shared phrase hints to locate repeated blocks such as intros, service lists, or templated paragraphs. If only a few sections overlap, rewrite those segments and recheck. Watch for navigation text, disclaimers, and repeated calls to action that inflate similarity without adding value.
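One way to produce shared phrase hints for a flagged pair is to list the n-grams that appear in both texts, in reading order. This is a sketch only: the `shared_phrases` helper and the default of four words per phrase are assumptions, not the tool's documented behavior.

```python
def shared_phrases(a, b, n=4):
    """List n-word phrases from a that also occur in b, in order of appearance."""
    def grams(text):
        words = text.lower().split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    in_b = set(grams(b))
    seen, hints = set(), []
    for phrase in grams(a):
        if phrase in in_b and phrase not in seen:
            seen.add(phrase)  # report each repeated phrase once
            hints.append(phrase)
    return hints

page_a = "call us today for a free quote on roof repair"
page_b = "call us today for a free estimate on gutter cleaning"
hints = shared_phrases(page_a, page_b)
```

Here the hints all come from the opening call to action, which points to the templated block rather than the body copy as the source of overlap.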
Turning results into action plans
For near‑identical pages, consolidate content and redirect weaker versions, or keep one canonical destination. For location templates, keep consistent structural elements but make the lead paragraph, proof points, and FAQs unique. For product variants, retain specifications while rewriting benefits, usage scenarios, and comparison language. Track changes by exporting reports after each revision cycle. If duplication is intentional for compliance, isolate it in short reusable blocks and expand unique supporting copy.
Operational best practices for ongoing checks
Run the checker during content briefs, before publishing, and after major template updates. Maintain a standard threshold for your site so results stay comparable over time. Increase n‑gram size when your niche uses many common phrases. Save CSV reports for teams, and use the PDF summary for approvals or stakeholder reviews. Pair results with internal URL mapping so you can prioritize high‑traffic pages first and document decisions consistently. Schedule quarterly spot checks for new templates and campaign landing pages as well.
FAQs
What does the duplication percentage mean?
It estimates how many unique phrase shingles in one item also appear in any other item you provided, based on the selected n‑gram setting.
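Read literally, that definition can be sketched as follows. The helper names and the whitespace tokenization are assumptions; the checker's exact computation may differ.

```python
def shingles(text, n=3):
    """Set of word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def duplication_pct(items, index, n=3):
    """Percent of one item's unique shingles that appear in any other item."""
    own = shingles(items[index], n)
    if not own:
        return 0.0  # assumption: items too short to shingle report 0%
    others = set()
    for k, text in enumerate(items):
        if k != index:
            others |= shingles(text, n)
    return 100.0 * len(own & others) / len(own)

items = [
    "we offer fast plumbing repair in austin",
    "we offer fast plumbing repair in dallas",
    "fresh bread baked daily",
]
pct_first = duplication_pct(items, 0)  # 4 of 5 shingles reused elsewhere
pct_last = duplication_pct(items, 2)   # nothing reused
```

Unlike a pairwise score, this figure is per item: it pools every other input into one comparison set.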
Which metric should I choose for SEO reviews?
Use Jaccard for detecting copied phrasing and templated blocks. Use Cosine when you want to understand topical similarity across rewrites and outlines.
How many items can I compare at once?
This page supports up to six inputs per run. For larger audits, compare in batches and keep consistent settings for reliable tracking.
Do I need to paste full HTML pages?
No. You can paste plain text, rendered page copy, or HTML. Enable “Strip HTML tags” if you paste markup-heavy content.
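If you prefer to clean markup yourself before pasting, Python's standard `html.parser` module can extract the visible text. The sketch below is an assumption about what tag stripping should do (it also drops `script` and `style` content); it is not the checker's own implementation, and `TextExtractor` and `strip_html` are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(markup):
    """Return the visible text of markup with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

clean = strip_html(
    "<div><h1>Roof repair</h1><script>var x = 1;</script>"
    "<p>Call us &amp; save.</p></div>"
)
```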
Why does changing n‑gram size change the score?
Smaller n‑grams match more common phrases and raise similarity. Larger n‑grams require longer shared wording and reduce false positives.
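A small experiment makes the effect visible. The sketch below scores the same pair of sentences at several n-gram sizes, assuming word-level shingles; the `jaccard` helper is hypothetical.

```python
def jaccard(a, b, n):
    """Jaccard similarity over word n-grams of size n."""
    def grams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    sa, sb = grams(a), grams(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

page_a = "our team is here to help you today"
page_b = "our team is ready to help you now"

# Score the same pair with increasingly long shingles.
scores = {n: round(jaccard(page_a, page_b, n), 2) for n in (1, 2, 3, 4)}
# scores == {1: 0.6, 2: 0.4, 3: 0.2, 4: 0.0}
```

The two sentences share many individual words but no run of four, so the score falls from 0.6 at n=1 to 0.0 at n=4.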
Is this a web crawler or a local checker?
It is a local checker. It compares only the text you paste, which is ideal for drafts, templates, and controlled audits before publishing.