| Sample | Graphemes | Code points | Bytes (UTF-8) |
|---|---|---|---|
| Hello | 5 | 5 | 5 |
| café | 4 | 4 | 5 |
| 👩‍💻 | 1 | 3–5* | 11–17* |
| a b c | 5 | 5 | 5 |
- Selected length = grapheme length, code-point length, or byte length (based on your mode).
- Byte length = strlen(text).
- Code-point length = mb_strlen(text, encoding).
- Grapheme length = grapheme_strlen(text) when available; otherwise code-point length.
- Lines = number of \n plus one (after normalizing Windows newlines).
- Whitespace = count of Unicode whitespace matches (\s).
- Words = either Unicode word-like tokens or whitespace splits (your choice).
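The formulas above name PHP functions (strlen, mb_strlen, grapheme_strlen). As a rough cross-check in another language, the sketch below mirrors the byte, code-point, line, whitespace, and word formulas in Python. The function names are illustrative assumptions, not the calculator's own code, and Python's \s and \w may not match the calculator's regex engine on every character. Grapheme length is omitted here because Python's standard library has no grapheme segmentation.

```python
import re


def byte_length(text: str) -> int:
    # Number of bytes in the UTF-8 encoding of the text.
    return len(text.encode("utf-8"))


def code_point_length(text: str) -> int:
    # len() on a Python str counts Unicode code points.
    return len(text)


def line_count(text: str) -> int:
    # Normalize Windows newlines (\r\n) to \n, then count \n plus one.
    normalized = text.replace("\r\n", "\n")
    return normalized.count("\n") + 1


def whitespace_count(text: str) -> int:
    # For str patterns, \s matches Unicode whitespace in Python 3.
    return len(re.findall(r"\s", text))


def word_count(text: str, unicode_tokens: bool = True) -> int:
    if unicode_tokens:
        # Runs of Unicode letters, digits, and underscores.
        return len(re.findall(r"\w+", text))
    # Plain whitespace splitting; empty strings are not produced.
    return len(text.split())


sample = "caf\u00e9"  # "café" in composed (NFC) form
print(byte_length(sample), code_point_length(sample))  # 5 4
```

The two word modes can disagree: "don't stop" yields 3 tokens with the `\w+` pattern (the apostrophe splits "don" and "t") but 2 with whitespace splitting, which is why matching your application's tokenizer matters.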
- Paste your text, code, or log output into the textarea.
- Select a Length mode: graphemes for human-visible characters, code points for Unicode processing, bytes for storage.
- Optionally change trimming and normalization to compare raw vs cleaned input.
- Click Submit to show results above the form.
- Use Download CSV or Download PDF to share a snapshot.
Why length metrics differ
In software validation, a “character” can mean a byte, a Unicode code point, or a user-perceived grapheme. This calculator reports all three so you can match API rules precisely. For example, “café” in composed form is 4 code points but 5 UTF-8 bytes, while a single emoji sequence may be 1 grapheme yet several code points and more than 10 bytes. These distinctions matter whenever you enforce limits consistently across APIs, databases, and user interfaces.
Using graphemes for UI limits
Front‑end limits usually target what users see. Grapheme length approximates visible symbols, including combined emojis and flags. When the grapheme extension is available, the tool counts extended grapheme clusters; otherwise it falls back to code points. Use this mode for tweet‑like limits, form inputs, and UX copy checks across languages.
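To illustrate why grapheme counts can be lower than code-point counts, here is a deliberately simplified Python sketch. It is not the extended grapheme cluster algorithm from Unicode UAX #29 that grapheme_strlen relies on: it only merges combining marks, variation selectors, emoji skin-tone modifiers, zero-width-joiner sequences, and regional-indicator pairs (flags). It ignores other rules such as Hangul syllables and \r\n pairs. All names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner


def _is_extender(ch: str) -> bool:
    # Code points that attach to the preceding cluster in this approximation.
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks
        or ch == ZWJ
        or 0xFE00 <= cp <= 0xFE0F      # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF    # emoji skin-tone modifiers
    )


def _is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def approx_grapheme_length(text: str) -> int:
    count = 0
    prev = ""
    ri_run = 0  # consecutive regional indicators seen so far
    for ch in text:
        if _is_regional_indicator(ch):
            ri_run += 1
            joins = ri_run % 2 == 0  # the second of a pair joins the first
        else:
            ri_run = 0
            joins = _is_extender(ch) or prev == ZWJ
        if not joins or prev == "":
            count += 1  # this code point starts a new cluster
        prev = ch
    return count


technologist = "\U0001F469\u200D\U0001F4BB"  # woman + ZWJ + laptop
print(len(technologist), approx_grapheme_length(technologist))  # 3 1
```

For production limits, a full UAX #29 implementation is the safer choice; this sketch only shows the kind of merging that makes one visible symbol out of several code points.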
Code points for Unicode processing
Many back‑end libraries operate on Unicode scalar values. Code‑point length (mb_strlen) is useful for normalization comparisons, indexing, and substring safety. With normalization enabled (NFC, NFD, NFKC, NFKD), the same text can change code‑point count because composed and decomposed forms store accents differently, affecting storage and equality checks in pipelines.
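The effect of normalization on code-point counts can be checked with Python's standard unicodedata module. This is a small sketch rather than the calculator's implementation, and the helper names are assumptions made for illustration.

```python
import unicodedata


def code_points_by_form(text: str) -> dict[str, int]:
    # Code-point count of the text under each Unicode normalization form.
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: len(unicodedata.normalize(form, text)) for form in forms}


def equal_after_nfc(a: str, b: str) -> bool:
    # Compare two strings after bringing both to composed form.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

print(code_points_by_form(composed))
# {'NFC': 4, 'NFD': 5, 'NFKC': 4, 'NFKD': 5}
print(composed == decomposed, equal_after_nfc(composed, decomposed))
# False True
```

The compatibility forms can change counts in the other direction too: the ligature U+FB01 (ﬁ) stays 1 code point under NFC and NFD but becomes 2 under NFKC and NFKD.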
Bytes for storage and transport
Databases, message queues, and network payloads care about bytes. UTF-8 uses 1 byte for ASCII and 2 to 4 bytes for every other code point, so an emoji sequence made of several code points often exceeds 10 bytes. The byte length shown by strlen helps you size VARCHAR limits, estimate log volume, and enforce payload caps such as 64 KB request bodies. Knowing it also reduces surprises when truncation happens mid-character.
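Truncating at a byte limit without cutting a character in half can be sketched as follows. The function name is hypothetical and this is one possible approach, not something the calculator is known to do: it slices the UTF-8 bytes and lets the decoder discard a trailing partial sequence.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    # Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.
    if max_bytes <= 0:
        return ""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end inside a multi-byte sequence; errors="ignore"
    # drops those incomplete trailing bytes instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("caf\u00e9", 4))  # caf
```

Note that this keeps code points whole but can still split a grapheme: cutting the 11-byte woman-technologist sequence at 10 bytes leaves the woman emoji followed by a dangling zero-width joiner.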
Operational counters beyond length
Words, lines, whitespace, and category counts support practical engineering tasks. Word totals help documentation gates, line counts help linting and commit policy, and whitespace counts highlight formatting inflation in JSON or CSV. Letter, digit, punctuation, and symbol splits are useful for password policy audits, tokenization tests, and input sanitation rules.
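A category split like the one described above can be approximated with Unicode general categories. The mapping below is an assumption for illustration (for example, it counts only decimal digits, category Nd, as digits); the calculator's exact rules are not stated on this page.

```python
import unicodedata
from collections import Counter


def category_counts(text: str) -> Counter:
    # Tally characters into letters, digits, punctuation, symbols, and other.
    counts = Counter()
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):
            counts["letters"] += 1
        elif cat == "Nd":
            counts["digits"] += 1
        elif cat.startswith("P"):
            counts["punctuation"] += 1
        elif cat.startswith("S"):
            counts["symbols"] += 1
        else:
            counts["other"] += 1  # whitespace, control characters, marks, etc.
    return counts


print(dict(category_counts("Pa$$w0rd!")))
# {'letters': 5, 'symbols': 2, 'digits': 1, 'punctuation': 1}
```

A password policy audit could then assert, for instance, that the digits and symbols counts are both at least 1.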
Exportable evidence for reviews
Teams often need repeatable evidence in QA and security reviews. The CSV export produces a compact metrics table for spreadsheets, while the PDF export creates a printable snapshot for tickets and sign‑offs. The top‑character frequency list reveals skew, such as repeated commas, tabs, or unusual symbols, speeding up debugging and regression comparison over time.
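If you want to reproduce a similar snapshot in your own test scripts, a metrics table and a top-character list can be built with Python's standard library. The column names and function names here are hypothetical; the layout of the calculator's actual CSV export is not documented on this page.

```python
import csv
import io
from collections import Counter


def top_characters(text: str, n: int = 5) -> list:
    # Most frequent characters (by code point) with their counts.
    return Counter(text).most_common(n)


def metrics_csv(metrics: dict) -> str:
    # Serialize a name -> value mapping as a two-column CSV string.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["metric", "value"])
    for name, value in metrics.items():
        writer.writerow([name, value])
    return buffer.getvalue()


print(top_characters("a,b,,c,,,", 1))  # [(',', 6)]
print(metrics_csv({"bytes": 5, "code_points": 4}))
```

Storing the CSV string alongside a test run gives a diffable record for regression comparison.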
What length mode should I use for database limits?
Use byte length when limits are enforced in bytes, such as storage quotas or payload caps. If your DB limit is defined in characters, compare code points and bytes to avoid truncating multibyte UTF-8 input.
Why does an emoji count as more than one character sometimes?
Many emojis are sequences of several code points, joined by an invisible zero-width joiner (U+200D) or followed by modifiers such as skin tones. Users see one symbol (a grapheme), but the stored text contains multiple code points and several UTF-8 bytes per code point.
Does normalization change the visible text?
Usually the rendered text looks the same, but the internal representation can change. NFC may combine base letters and accents, while NFD may split them, affecting code-point and byte counts.
How are words counted in this calculator?
You can choose Unicode word-like tokens for language-aware counting, or whitespace splitting for quick estimates. For strict specs, match your application’s tokenizer and compare results with both modes.
Can I use this for API request validation?
Yes. Pick the metric your API enforces, run trimming and normalization if your pipeline does, then export CSV or PDF to document the exact counts used during testing and review.
Why do I see “grapheme unavailable” in settings?
That means the grapheme functions are not enabled on your server. The calculator will still work, but grapheme length will fall back to code-point length, which can differ for combined emojis.