JavaScript Strings and Encodings: Characters, Bytes, and Unicode
JavaScript strings look simple, but they sit on top of Unicode and character encodings. If you have ever seen unexpected string lengths, broken emoji, or encoding errors when sending text over the network, this topic explains why.
Quick answer: JavaScript strings are sequences of UTF-16 code units, not raw bytes and not always one-character-per-unit. Use Unicode-aware APIs like TextEncoder, TextDecoder, and string iteration helpers when you need correct encoding or character handling.
Difficulty: Beginner to Intermediate
You'll understand this better if you know: basic JavaScript strings, arrays, and that computers store text as numbers behind the scenes.
1. What Are JavaScript Strings and Encodings?
In JavaScript, a string is text stored as a sequence of UTF-16 code units. An encoding is the rule that turns text into bytes and bytes back into text. JavaScript handles the string value in memory, while encodings matter when text crosses boundaries such as files, URLs, HTTP requests, or browser storage.
- A string is a text value such as "hello" or "café".
- UTF-16 is the representation JavaScript strings expose: length, indexing, and charCodeAt all work on UTF-16 code units.
- UTF-8 is the most common encoding for files, web pages, and network data.
- Some visible characters, especially emoji and other characters outside the Basic Multilingual Plane, take two code units (a surrogate pair).
That difference is why "🙂".length is 2 in JavaScript, even though the emoji looks like a single character.
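To make the code-unit idea concrete, here is a minimal sketch that prints the UTF-16 code units and the single code point behind that emoji. The helper name codeUnitsOf is made up for this illustration; charCodeAt and codePointAt are standard string methods.

```javascript
// The emoji is written as an escape so the example survives copy and paste.
const smiley = "\u{1F642}"; // same string as "🙂"

// Hypothetical helper: list each UTF-16 code unit as 4-digit hex.
function codeUnitsOf(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

console.log(smiley.length);       // 2
console.log(codeUnitsOf(smiley)); // ["D83D", "DE42"] (a surrogate pair)
console.log(smiley.codePointAt(0).toString(16).toUpperCase()); // "1F642"
```

The two code units only make sense together: neither half of the pair is a character on its own.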
2. Why JavaScript Strings and Encodings Matter
Most beginner code works fine with plain English text, but real applications quickly encounter accents, emoji, multilingual names, and data that must be encoded for transport. If you ignore encodings, you can display corrupted text, count characters incorrectly, or generate invalid URLs.
This topic matters whenever you:
- send text to APIs or servers
- read files or binary data as text
- build form submissions and URLs
- count characters for validation or limits
- process user-generated content in many languages
When text stays entirely inside a JavaScript program, you usually do not need to think about byte encodings. Once text leaves JavaScript, encoding becomes important.
3. Basic Syntax or Core Idea
String literals
JavaScript lets you create strings with single quotes, double quotes, or backticks. The encoding details are not visible in the syntax, but they affect how the string behaves.
const greeting = "Hello";
const city = 'São Paulo';
const emoji = `🙂`;
These values are strings, but they may contain different underlying code units depending on the characters used.
String length versus user-perceived characters
The length property counts UTF-16 code units, not visible letters or grapheme clusters.
const plain = "cat";
const accented = "café";
const smile = "🙂";
console.log(plain.length); // 3
console.log(accented.length); // 4
console.log(smile.length); // 2
The emoji looks like one character, but JavaScript stores it using two UTF-16 code units.
Encoding and decoding text
Use TextEncoder to turn a string into UTF-8 bytes, and TextDecoder to convert bytes back into text.
const text = "café 🙂";
const encoder = new TextEncoder();
const bytes = encoder.encode(text);
const decoder = new TextDecoder();
const roundTrip = decoder.decode(bytes);
console.log(bytes); // Uint8Array
console.log(roundTrip); // "café 🙂"
This is the standard pattern for working with bytes and text in modern JavaScript.
4. Step-by-Step Examples
Example 1: Checking length for basic ASCII text
For plain ASCII text, string length matches the number of visible characters.
const username = "devdocs";
console.log(username.length); // 7
This works as expected because every character fits in one UTF-16 code unit.
Example 2: Seeing the difference with emoji
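Emoji outside the Basic Multilingual Plane use surrogate pairs in UTF-16, and some emoji combine several code points, which makes length surprising.
const flag = "🇫🇷";
console.log(flag.length); // 4
console.log([...flag].length); // 2
Spread syntax iterates code points rather than code units, which is closer to what developers often expect than raw length. The flag still counts as 2 here because it is built from two regional indicator code points, even though it displays as one symbol.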
Example 3: Encoding a string for transport
When you need bytes, encode the string explicitly instead of assuming a string is already bytes.
const message = "Hello, 世界";
const bytes = new TextEncoder().encode(message);
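console.log(bytes.length); // 13
UTF-8 uses one byte for each ASCII character and two to four bytes for each non-ASCII character. Here the 7 ASCII characters take 7 bytes and the two Chinese characters take 3 bytes each.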
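To see how UTF-8 byte counts vary by character, the following sketch encodes each code point separately. The function name utf8BytesPerCodePoint is an assumption made for this example, and the sample string is written with escapes so its exact contents are unambiguous.

```javascript
// Hypothetical helper: pair each code point with its UTF-8 byte count.
function utf8BytesPerCodePoint(str) {
  const encoder = new TextEncoder();
  // Spreading a string yields one entry per code point.
  return [...str].map((ch) => [ch, encoder.encode(ch).length]);
}

// "A", "é" (U+00E9), "世" (U+4E16), "🙂" (U+1F642)
const sampleText = "A\u00E9\u4E16\u{1F642}";
console.log(utf8BytesPerCodePoint(sampleText));
// [["A", 1], ["é", 2], ["世", 3], ["🙂", 4]]
```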
Example 4: Decoding bytes into readable text
If you receive bytes from an API or file, decode them with the correct encoding.
const utf8Bytes = new Uint8Array([99, 97, 102, 195, 169]);
const text = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(text); // "café"
Using the wrong encoding here would produce corrupted output.
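Bytes from a network or file often arrive in chunks, and a chunk boundary can fall in the middle of a multi-byte character. A minimal sketch of handling that with the stream option of TextDecoder.decode follows. The function name decodeChunks and the way the bytes are split are assumptions for illustration.

```javascript
// Hypothetical helper: decode UTF-8 chunks without breaking characters.
function decodeChunks(chunks) {
  const decoder = new TextDecoder("utf-8");
  let text = "";
  for (const chunk of chunks) {
    // stream: true keeps an incomplete trailing sequence buffered
    // until the next call supplies the remaining bytes.
    text += decoder.decode(chunk, { stream: true });
  }
  // A final call without arguments flushes anything still buffered.
  text += decoder.decode();
  return text;
}

// "café" split in the middle of the two-byte "é" (195, 169).
const firstChunk = new Uint8Array([99, 97, 102, 195]);
const secondChunk = new Uint8Array([169]);
console.log(decodeChunks([firstChunk, secondChunk])); // "café"
```

Decoding each chunk independently, without stream: true, would turn the split character into replacement characters.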
5. Practical Use Cases
- Validating a username limit based on visible characters, not raw code units.
- Sending form data or JSON to an API as UTF-8 encoded text.
- Reading a file upload and converting bytes to text safely.
- Building URL query strings with percent-encoding.
- Displaying multilingual names, addresses, and search results correctly.
- Normalizing user input before comparison, such as composed versus decomposed accents.
These use cases all involve text crossing an encoding boundary, or being counted and compared in ways where Unicode details matter.
6. Common Mistakes
Mistake 1: Treating length as visible character count
Developers often use length for validation or slicing, then get unexpected results for emoji or other non-ASCII text.
Problem: This code counts UTF-16 code units, so a single emoji can make the string look longer than expected.
const nickname = "🙂🙂";
if (nickname.length > 2) {
  console.log("Too long");
}
Fix: Count code points with spread syntax or, for user-facing text, use a grapheme-aware approach when needed.
const nickname = "🙂🙂";
if ([...nickname].length > 2) {
  console.log("Too long");
}
The corrected version better matches what users think of as characters, though some complex emoji sequences may still need deeper Unicode handling.
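For the grapheme-aware case, one option is Intl.Segmenter, which is available in current browsers and recent Node.js versions but not in older runtimes. The sketch below is one possible approach, not the only one. The helper name countGraphemes and the default locale "en" are assumptions for this example.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// Family emoji: three emoji joined by zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(family.length);          // 8 UTF-16 code units
console.log([...family].length);     // 5 code points
console.log(countGraphemes(family)); // 1 in runtimes with current Unicode data
```

The three numbers answer three different questions, so a length limit should state which one it means.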
Mistake 2: Decoding bytes with the wrong text encoding
If the bytes were created as UTF-8 but decoded as something else, the result can look garbled.
Problem: The wrong decoder produces corrupted text because the byte sequence is interpreted using the wrong character map.
const bytes = new Uint8Array([99, 97, 102, 195, 169]);
const text = new TextDecoder("utf-16le").decode(bytes);
console.log(text);
Fix: Use the same encoding that was used to create the bytes, most commonly UTF-8.
const bytes = new Uint8Array([99, 97, 102, 195, 169]);
const text = new TextDecoder("utf-8").decode(bytes);
console.log(text); // "café"
The corrected version works because UTF-8 correctly maps the byte sequence back to the original text.
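By default, TextDecoder silently replaces invalid byte sequences with U+FFFD, which can hide a wrong-encoding bug. If you would rather detect the problem, the fatal option makes decode throw instead. The following is a minimal sketch; the helper name decodeStrictUtf8 and its return shape are assumptions for this example.

```javascript
// Hypothetical helper: decode UTF-8 and report invalid input instead of
// silently producing replacement characters.
function decodeStrictUtf8(bytes) {
  const decoder = new TextDecoder("utf-8", { fatal: true });
  try {
    return { ok: true, text: decoder.decode(bytes) };
  } catch (err) {
    // With fatal: true, malformed input raises a TypeError.
    return { ok: false, text: null };
  }
}

const validBytes = new Uint8Array([99, 97, 102, 195, 169]); // "café"
const truncatedBytes = new Uint8Array([99, 97, 102, 195]);  // "é" cut in half

console.log(decodeStrictUtf8(validBytes));     // { ok: true, text: "café" }
console.log(decodeStrictUtf8(truncatedBytes)); // { ok: false, text: null }
```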
Mistake 3: Assuming plain strings are safe to place in URLs
Text with spaces, slashes, or special characters must be encoded before being added to a query string or path segment.
Problem: Unescaped text can break the URL or change its meaning, especially when it contains ?, &, or spaces.
const search = "café au lait & tea";
const url = "/search?q=" + search;
console.log(url);
Fix: Use encodeURIComponent for query values and path segments that need percent-encoding.
const search = "café au lait & tea";
const url = "/search?q=" + encodeURIComponent(search);
console.log(url); // "/search?q=caf%C3%A9%20au%20lait%20%26%20tea"
The fixed version keeps the URL valid and preserves the original text safely.
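When a URL has several query parameters, URLSearchParams can do the encoding for you. This is a sketch of an alternative to manual concatenation. The helper name buildSearchUrl and the page parameter are invented for illustration.

```javascript
// Hypothetical helper: build a path plus an encoded query string.
function buildSearchUrl(path, params) {
  const query = new URLSearchParams(params).toString();
  return query ? `${path}?${query}` : path;
}

const searchText = "caf\u00E9 au lait & tea"; // "café au lait & tea"
const searchUrl = buildSearchUrl("/search", { q: searchText, page: "2" });
console.log(searchUrl); // "/search?q=caf%C3%A9+au+lait+%26+tea&page=2"

// Parsing the query string restores the original text.
const parsedQuery = new URLSearchParams(searchUrl.split("?")[1]);
console.log(parsedQuery.get("q") === searchText); // true
```

Note that URLSearchParams uses form encoding, so spaces become + rather than %20. Both forms decode to a space when the query string is parsed.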
Mistake 4: Comparing visually identical strings without normalization
Some letters can be represented in more than one Unicode form, such as a single precomposed character or a base letter plus combining mark.
Problem: Two strings can look identical but still compare as different values because their underlying code units differ.
const a = "caf\u00E9"; // precomposed é (U+00E9)
const b = "cafe\u0301"; // e followed by a combining acute accent (U+0301)
console.log(a === b); // false
Fix: Normalize both strings before comparing them.
const a = "caf\u00E9"; // precomposed é (U+00E9)
const b = "cafe\u0301"; // e followed by a combining acute accent (U+0301)
console.log(a.normalize("NFC") === b.normalize("NFC")); // true
Normalization helps when user input may come from different keyboards or systems.
7. Best Practices
Use encoding APIs at boundaries
Keep text as strings while it is inside JavaScript, and convert to bytes only when you cross a boundary such as network, file, or binary APIs.
const body = "Hello, world";
const encoded = new TextEncoder().encode(body);
This avoids guessing and makes the text format explicit.
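One place this matters is size limits: a server or storage layer usually limits bytes, not code units. As a rough sketch, the check below measures the encoded size at the boundary. The function name fitsInBytes and the 5-byte limit are hypothetical.

```javascript
// Hypothetical helper: does the text fit in maxBytes once UTF-8 encoded?
function fitsInBytes(text, maxBytes) {
  return new TextEncoder().encode(text).length <= maxBytes;
}

console.log(fitsInBytes("hello", 5));      // true: 5 ASCII characters, 5 bytes
console.log(fitsInBytes("h\u00E9llo", 5)); // false: "é" takes 2 bytes, 6 in total
```

Both strings have a length of 5, so a check based on length alone would accept the second one.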
Normalize text before equality checks
When comparing user-entered text, normalize it first if your product needs accent-sensitive but representation-insensitive matching.
function sameText(left, right) {
  return left.normalize("NFC") === right.normalize("NFC");
}
This reduces false mismatches caused by alternate Unicode forms.
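The same idea applies to collections, for example when removing duplicates from user-entered tags. The following is a small sketch under that assumption; dedupeNormalized is a made-up name, and it returns the NFC form of each value.

```javascript
// Hypothetical helper: remove duplicates that differ only in Unicode form.
function dedupeNormalized(values) {
  // A Set compares strings by code units, so normalize first.
  return [...new Set(values.map((value) => value.normalize("NFC")))];
}

const tags = ["caf\u00E9", "cafe\u0301", "tea"]; // two spellings of "café"
console.log(new Set(tags).size);     // 3: the raw strings all differ
console.log(dedupeNormalized(tags)); // ["café", "tea"]
```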
Be careful when slicing text
String slicing by index can split surrogate pairs or complex emoji sequences. Prefer code-point-aware iteration for display logic.
const name = "A🙂B";
for (const char of name) {
  console.log(char);
}
Iteration with for...of walks the string one code point at a time, so surrogate pairs stay intact, unlike indexing by raw position.
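To show what splitting a surrogate pair looks like, this sketch compares slice with a code-point-based truncation. The helper name truncateCodePoints is an assumption, and it still works on code points, so multi-code-point emoji sequences can be cut apart.

```javascript
// Hypothetical helper: keep at most maxCodePoints code points.
function truncateCodePoints(str, maxCodePoints) {
  return [...str].slice(0, maxCodePoints).join("");
}

const label = "A\u{1F642}B"; // "A🙂B"

// slice counts code units, so it cuts the emoji in half and leaves
// a lone high surrogate at the end of the result.
console.log(label.slice(0, 2) === "A\uD83D"); // true

// Truncating by code point keeps the emoji whole.
console.log(truncateCodePoints(label, 2) === "A\u{1F642}"); // true
```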
8. Limitations and Edge Cases
- length counts UTF-16 code units, so it does not always match the number of user-visible characters.
- Some emojis are made from multiple code points joined together, so even code-point counting can still be misleading for display purposes.
- Different Unicode normalization forms can make visually identical strings compare unequal.
- TextEncoder and TextDecoder are for text encoding, not for arbitrary binary-to-binary conversion.
- encodeURIComponent is for individual URL components. Applied to a full URL or path, it also escapes the :, /, ?, and & characters that give the URL its structure.
- Browsers and Node.js both support modern text APIs, but very old environments may require polyfills or alternatives.
If a string looks broken only after transport, storage, or comparison, the problem is often not the string itself but the encoding or normalization step around it.
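The encodeURIComponent limitation is easiest to see next to encodeURI, the built-in intended for whole URLs. A short comparison follows; the path and query values are invented for illustration.

```javascript
// A path with a space plus a query string (made-up example values).
const target = "/docs/a b?q=x&y";

// encodeURI keeps URL structure characters such as / ? = &.
console.log(encodeURI(target)); // "/docs/a%20b?q=x&y"

// encodeURIComponent escapes them, which suits a single value
// but breaks a full URL.
console.log(encodeURIComponent(target)); // "%2Fdocs%2Fa%20b%3Fq%3Dx%26y"
```

Because encodeURI leaves & and = alone, it is not a substitute for encoding individual query values.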
9. Practical Mini Project
Let’s build a small text inspector that reports the string length, code-point count, normalized form, and UTF-8 byte length. This is useful for debugging why a piece of text behaves unexpectedly.
function inspectText(input) {
  const codeUnits = input.length;
  const codePoints = [...input].length;
  const normalized = input.normalize("NFC");
  const bytes = new TextEncoder().encode(input);
  return {
    codeUnits,
    codePoints,
    normalized,
    utf8Bytes: bytes.length
  };
}
const sample = "café 🙂";
console.log(inspectText(sample));
This mini project shows the same text from several angles: size in UTF-16 code units, code-point count, NFC-normalized form, and UTF-8 byte size. It is a practical way to debug encoding-related surprises.
10. Key Points
- JavaScript strings are sequences of UTF-16 code units.
- length does not always equal visible character count.
- TextEncoder turns strings into UTF-8 bytes, and TextDecoder turns bytes back into strings.
- Unicode normalization matters when comparing text from different sources.
- URL components should be percent-encoded before use in a query string or path.
11. Practice Exercise
- Create a function named reportStringInfo that accepts a string.
- Return an object with the raw length, code-point count, and UTF-8 byte length.
- Also return a normalized version of the string using NFC.
- Test it with plain ASCII, accented text, and at least one emoji.
Expected output: an object with numeric counts and a normalized string value.
Hint: Use spread syntax for code points and TextEncoder for bytes.
function reportStringInfo(text) {
  return {
    length: text.length,
    codePoints: [...text].length,
    utf8Bytes: new TextEncoder().encode(text).length,
    normalized: text.normalize("NFC")
  };
}
console.log(reportStringInfo("hello"));
console.log(reportStringInfo("café"));
console.log(reportStringInfo("🙂"));
12. Final Summary
JavaScript strings are not raw byte arrays. They are UTF-16 code-unit sequences, which is why basic text often behaves as expected while emoji, accents, and multilingual content reveal edge cases. Once you understand that difference, string length, iteration, and comparison all make more sense.
For real-world applications, the main rule is simple: keep text as strings inside JavaScript, and use explicit encoders, decoders, and normalization when text must be counted, compared, sent over the network, or converted into bytes. That habit prevents many of the subtle bugs that show up in production.
If you want to go deeper, the next useful topics are Unicode normalization, URL encoding, and working with binary data using ArrayBuffer and Uint8Array.