Example data table
These sample pairs illustrate how duplication scores may be interpreted.
| Sample A (snippet) | Sample B (snippet) | Duplication | Interpretation |
|---|---|---|---|
| Short landing copy with unique value points. | Different phrasing, same intent, minimal overlap. | 12–25% | Mostly unique; safe for indexing. |
| Product description reused with small edits. | Same structure, minor synonyms, repeated phrases. | 45–70% | Moderate to high risk; rewrite key sections. |
| Boilerplate paragraph copied across many pages. | Same paragraph, same order, identical phrases. | 85–100% | Near-duplicate; consolidate or canonicalize. |
Formula used
How to use this calculator
- Paste the original content in Text A and the comparison content in Text B.
- Choose cleaning options to match your real publishing pipeline.
- Select a scoring method; keep “Combined” for general audits.
- Adjust n‑gram size if you want stricter phrase matching.
- Click “Detect duplication” and review the highlighted overlap.
- Download CSV or PDF to document your rewrite decisions.
Why duplication weakens search signals
Duplicate paragraphs can trigger query cannibalization. Indexing systems may pick one version and ignore the rest, wasting crawl budget. This calculator classifies risk by score: under 25% is low, 25–60% is moderate, 60–85% is high, and 85%+ is near‑duplicate. Use these bands to decide with confidence whether you need light edits, a full rewrite, or consolidation.
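The risk bands above can be sketched as a small lookup. This is a minimal illustration, not the calculator's own code: the function name `risk_band` is hypothetical, and assigning the shared boundary values (25, 60, 85) to the higher band is an assumption.

```python
def risk_band(score: float) -> str:
    """Map a duplication score (0-100) to the risk bands described above.

    Assumption: a score that sits exactly on a boundary (25, 60, 85)
    falls into the higher band.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 25:
        return "low"
    if score < 60:
        return "moderate"
    if score < 85:
        return "high"
    return "near-duplicate"
```

For example, a score of 70 maps to "high", which matches the FAQ guidance to treat 60%+ as a rewrite candidate.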
How this score relates to crawl decisions
Overlap often comes from faceted filters, printer views, language variants, and repeated “feature” blocks. If two category pages land around 70%, they usually share headings, bullets, and internal links in the same order. Review the “Top overlapping phrases” table, which lists up to 12 shared n‑grams with counts for A and B. Smaller n‑grams (3–4) expose template reuse; larger n‑grams (6–8) flag copied sentences.
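To make the phrase table concrete, here is a rough sketch of how shared n‑grams with per‑text counts could be collected. The helper names and the ordering rule (most repeated first, then alphabetical) are assumptions; only the limit of 12 rows comes from the description above.

```python
import re
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count word n-grams in lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def top_overlapping_phrases(text_a: str, text_b: str, n: int = 3, limit: int = 12):
    """Return up to `limit` rows of (phrase, count in A, count in B)."""
    a, b = ngram_counts(text_a, n), ngram_counts(text_b, n)
    rows = [(phrase, a[phrase], b[phrase]) for phrase in a.keys() & b.keys()]
    # Assumed ordering: phrases repeated most in both texts first, then A-Z.
    rows.sort(key=lambda row: (-min(row[1], row[2]), row[0]))
    return rows[:limit]
```

With n set to 3 or 4, rows that appear on many page pairs usually point at template blocks rather than copied body text.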
Metric selection for different content types
Cosine TF‑IDF is strong for topical similarity and holds up when some words are swapped for synonyms, because the remaining distinctive terms still match. Word‑level Jaccard is stricter and surfaces shared vocabulary; n‑gram Jaccard is best for copied phrases. Sequence match and edit similarity capture near‑verbatim blocks such as policy text, product specs, or boilerplate introductions. Cleaning options matter: stripping HTML removes tags, ignoring case merges case variants, and punctuation removal stabilizes tokens. Minimum word length and stopword removal reduce noise in short, generic text.
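As an illustration of how three of these metrics differ, the sketch below implements word‑level Jaccard, a sequence match based on Python's `difflib`, and a cosine over TF‑IDF weights. It uses textbook definitions and is not the calculator's implementation: the tokenizer, the smoothed IDF variant, and all function names are assumptions. Edit similarity and the "Combined" weighting are left out because the page does not define them.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (assumed cleaning)."""
    return re.findall(r"\w+", text.lower())

def word_jaccard(a: str, b: str) -> float:
    """Shared vocabulary divided by total vocabulary."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def sequence_match(a: str, b: str) -> float:
    """Order-sensitive similarity over word tokens."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()

def cosine_tfidf(a: str, b: str) -> float:
    """Cosine of TF-IDF vectors built from just these two texts."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))

    def idf(term: str) -> float:
        # Smoothed IDF (an assumption) so terms present in both texts
        # keep a non-zero weight: ln((1 + 2) / (1 + df)) + 1.
        df = (term in ta) + (term in tb)
        return math.log(3 / (1 + df)) + 1

    vocab = set(ta) | set(tb)
    va = {t: ta[t] * idf(t) for t in vocab}
    vb = {t: tb[t] * idf(t) for t in vocab}
    dot = sum(va[t] * vb[t] for t in vocab)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Word‑level Jaccard ignores word order and repetition, while the sequence match rewards long runs in the same order, which is why the two can disagree on reshuffled boilerplate.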
Scaling checks across a site
For a template audit, sample 20–50 URLs per layout and compare one “reference” page to the rest. Track results with CSV exports and the saved history panel. Inputs are capped at 200,000 characters per text. The tool stores the latest 10 comparisons in the session and can copy a JSON payload for dashboards or tickets.
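Outside the tool, the same reference‑versus‑rest audit can be approximated with a short script. This is a sketch under stated assumptions: `audit_against_reference` and the CSV column names are hypothetical, the score is a plain `difflib` sequence ratio rather than the calculator's combined score, and truncating at the 200,000‑character cap is a guess at how over‑long input might be handled.

```python
import csv
import io
from difflib import SequenceMatcher

MAX_CHARS = 200_000  # per-text input cap mentioned above

def audit_against_reference(reference: str, pages: dict[str, str]) -> str:
    """Compare one reference text to each page and return the results as CSV.

    `pages` maps a URL to its extracted text. Texts longer than MAX_CHARS
    are truncated (an assumption, not documented tool behaviour).
    """
    ref_words = reference[:MAX_CHARS].split()
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "score_percent"])
    for url, text in pages.items():
        ratio = SequenceMatcher(None, ref_words, text[:MAX_CHARS].split()).ratio()
        writer.writerow([url, f"{ratio * 100:.1f}"])
    return out.getvalue()
```

Sorting the resulting rows by score makes it easy to see which layouts cluster near the reference page.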
Practical remediation targets
Keep high‑value sections unique: introductions, benefits, FAQs, and internal linking context. If duplication is unavoidable, consolidate similar pages, apply canonical URLs, or redirect thin variants to a primary page. For pagination and filters, consider noindex rules or parameter handling to reduce duplicate clusters. As a practical target, push combined scores below 40% for similar topics, and below 25% when pages should be fully distinct. Save PDF reports to document before‑and‑after gains.
FAQs
What score should I treat as duplicated content?
For most audits, treat 60%+ as a rewrite candidate and 85%+ as near‑duplicate. Always review the overlapping phrases and page intent before acting, because templates and navigation text can inflate similarity.
When should I use n‑gram Jaccard?
Use it when you suspect copying of sentences or paragraphs. Set n to 3–5 for repeated phrases and 6–8 for stricter, near‑verbatim detection. Pair it with the phrase table for context.
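The effect of n can be seen with a small sketch of n‑gram Jaccard over word tokens. The function names and the tokenizer are assumptions for illustration, not the calculator's code.

```python
import re

def ngram_set(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of word n-grams in lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams divided by all distinct n-grams in either text."""
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

For two 11‑word sentences that differ by a single word in the middle, this gives 0.5 at n = 3 but only about 0.09 at n = 6, because one changed word breaks every longer n‑gram that spans it.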
Why does removing punctuation change the result?
Punctuation affects where tokens split, so it can hide matches. Removing it normalizes variants like “SEO‑friendly” versus “SEO friendly”, which makes comparisons more stable. If punctuation carries meaning in your niche, test both settings and compare trends.
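A quick sketch shows why the setting matters. The `tokens` helper is hypothetical, and replacing punctuation with a space (rather than deleting it) is an assumption about how the cleaning step behaves.

```python
import re

def tokens(text: str, strip_punctuation: bool) -> list[str]:
    """Lowercase and split on whitespace, optionally removing punctuation."""
    text = text.lower()
    if strip_punctuation:
        # Assumption: punctuation becomes a space, so "seo-friendly"
        # splits into "seo" and "friendly".
        text = re.sub(r"[^\w\s]", " ", text)
    return text.split()
```

With punctuation kept, "SEO-friendly pages" yields the token "seo-friendly", which never matches "seo" or "friendly" in the other text; with punctuation removed, both variants produce the same three tokens.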
Can I compare HTML pages directly?
Yes. Paste the raw HTML and enable “Strip HTML” to analyze visible text only. If you keep HTML tags, similarity may reflect shared markup rather than content.
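For readers who pre‑clean pages before pasting, visible text can be extracted with Python's standard `html.parser` module. This is a minimal sketch, not what “Strip HTML” does internally: the class and function names are hypothetical, and skipping `script` and `style` content is an assumption about what counts as visible.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html: str) -> str:
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Comparing the stripped text rather than the raw markup keeps shared wrappers, class names, and inline scripts from inflating the score.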
Should I remove stopwords?
Stopword removal helps when texts are short or heavily templated, because common words dominate. For long articles, keeping stopwords can preserve phrasing signals. Try both and choose the setting that aligns with your editorial goals.
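The effect on short texts is easy to demonstrate. In this sketch the stopword list is a tiny illustrative set, not the calculator's list, and the helper names are hypothetical.

```python
# Illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}

def content_words(text: str, remove_stopwords: bool = True, min_length: int = 1) -> list[str]:
    """Lowercase, split on whitespace, then apply the two noise filters."""
    return [
        word
        for word in text.lower().split()
        if len(word) >= min_length and not (remove_stopwords and word in STOPWORDS)
    ]

def jaccard(words_a: list[str], words_b: list[str]) -> float:
    """Shared vocabulary divided by total vocabulary."""
    sa, sb = set(words_a), set(words_b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

For "the best shoes for the trail" versus "the best hats for the beach", word‑level Jaccard is about 0.43 with stopwords kept and 0.2 with them removed, since "the" and "for" account for most of the apparent overlap.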
How do I export results for reporting?
After running a comparison, use Download CSV for spreadsheet tracking or Download PDF for shareable reports. You can also copy JSON to paste into tickets or dashboards. Run again after edits to document improvements.