JavaScript Strings and Encodings: Characters, Bytes, and Unicode

JavaScript strings look simple, but they sit on top of Unicode and character encodings. If you have ever seen unexpected string lengths, broken emoji, or encoding errors when sending text over the network, this topic explains why.

Quick answer: JavaScript strings are sequences of UTF-16 code units. They are not raw bytes, and one visible character does not always map to one code unit. Use TextEncoder and TextDecoder when you need bytes, and code-point-aware iteration such as for...of or spread syntax when you need to walk through characters.

Difficulty: Beginner to Intermediate

You'll understand this better if you know: basic JavaScript strings, arrays, and that computers store text as numbers behind the scenes.

1. What Are JavaScript Strings and Encodings?

In JavaScript, a string is text stored as a sequence of UTF-16 code units. An encoding is the rule that turns text into bytes and bytes back into text. JavaScript handles the string value in memory, while encodings matter when text crosses boundaries such as files, URLs, HTTP requests, or browser storage.

That difference is why "🙂".length is 2 rather than 1 in JavaScript, even though the emoji looks like a single character.

2. Why JavaScript Strings and Encodings Matter

Most beginner code works fine with plain English text, but real applications quickly encounter accents, emoji, multilingual names, and data that must be encoded for transport. If you ignore encodings, you can display corrupted text, count characters incorrectly, or generate invalid URLs.

This topic matters whenever you display, count, compare, or transmit text that may contain more than plain ASCII.

When text only stays inside a JavaScript program and never leaves it, you usually do not need to think about byte encodings. Once text leaves JavaScript, encoding becomes important.

3. Basic Syntax or Core Idea

String literals

JavaScript lets you create strings with single quotes, double quotes, or backticks. The encoding details are not visible in the syntax, but they affect how the string behaves.

const greeting = "Hello";
const city = 'São Paulo';
const emoji = `🙂`;

All three values are strings. What differs is the number of UTF-16 code units behind each visible character: the letters in "Hello" take one code unit each, while the 🙂 emoji takes two.

String length versus user-perceived characters

The length property counts UTF-16 code units, not visible letters or grapheme clusters.

const plain = "cat";
const accented = "café";
const smile = "🙂";

console.log(plain.length);      // 3
console.log(accented.length);  // 4
console.log(smile.length);     // 2

The emoji looks like one character, but JavaScript stores it using two UTF-16 code units.
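To make the two code units visible, the sketch below lists each UTF-16 code unit of a string in hex. The helper name codeUnitsOf is made up for this illustration, and the emoji is written as an escape so the exact code point is unambiguous.

```javascript
// List the UTF-16 code units of a string as 4-digit hex values.
// `codeUnitsOf` is a hypothetical helper name, not a built-in.
function codeUnitsOf(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

console.log(codeUnitsOf("cat"));       // ["0063", "0061", "0074"]
console.log(codeUnitsOf("\u{1F642}")); // ["D83D", "DE42"], a surrogate pair

// codePointAt reads the pair back as the single code point U+1F642.
console.log("\u{1F642}".codePointAt(0).toString(16)); // "1f642"
```

The two values D83D and DE42 are the high and low surrogates that together stand for the one code point U+1F642.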

Encoding and decoding text

Use TextEncoder to turn a string into UTF-8 bytes, and TextDecoder to convert bytes back into text.

const text = "café 🙂";

const encoder = new TextEncoder();
const bytes = encoder.encode(text);

const decoder = new TextDecoder();
const roundTrip = decoder.decode(bytes);

console.log(bytes);       // Uint8Array
console.log(roundTrip);  // "café 🙂"

This is the standard pattern for working with bytes and text in modern JavaScript.
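If you want to see what the encoder actually produced, a small hex dump helps. This is a minimal sketch; toHexBytes is a hypothetical helper name, and the sample text is written with escapes so that é is known to be the precomposed form.

```javascript
// Show the UTF-8 bytes of a string as two-digit hex values.
// `toHexBytes` is a hypothetical helper name, not a built-in.
function toHexBytes(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// "caf\u00E9 \u{1F642}" is "café 🙂" with a precomposed é.
console.log(toHexBytes("caf\u00E9 \u{1F642}"));
// "63 61 66 c3 a9 20 f0 9f 99 82"
```

The ASCII letters and the space take one byte each, é takes two (c3 a9), and the emoji takes four (f0 9f 99 82).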

4. Step-by-Step Examples

Example 1: Checking length for basic ASCII text

For simple ASCII text, string length usually matches the number of visible characters.

const username = "devdocs";

console.log(username.length); // 7

This works as expected because every character fits in one UTF-16 code unit.

Example 2: Seeing the difference with emoji

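Most emoji lie outside the Basic Multilingual Plane, so each one is stored as a surrogate pair of two UTF-16 code units. A flag emoji goes further and combines two such code points, which makes length even more surprising.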

const flag = "🇫🇷";

console.log(flag.length); // 4
console.log([...flag].length); // 2

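Spread syntax iterates by code point, so it reports 2 here: the flag is built from two regional indicator symbols. That is closer to what developers often expect than raw length, but it is still not the single symbol a user sees.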

Example 3: Encoding a string for transport

When you need bytes, encode the string explicitly instead of assuming a string is already bytes.

const message = "Hello, 世界";
const bytes = new TextEncoder().encode(message);

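console.log(bytes.length); // 13

UTF-8 uses one byte for every ASCII character and two to four bytes for everything else. Here the seven ASCII characters take 7 bytes, and the two Chinese characters take 3 bytes each.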
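The byte widths can be checked one code point at a time. The following is a sketch only; utf8Widths is a hypothetical helper name, and the sample characters were picked to cover the one-, two-, three-, and four-byte cases.

```javascript
// Return the UTF-8 byte length of each code point in a string.
// `utf8Widths` is a hypothetical helper name, not a built-in.
function utf8Widths(text) {
  const encoder = new TextEncoder();
  // Array.from iterates a string by code point, not by code unit.
  return Array.from(text, (ch) => encoder.encode(ch).length);
}

// "A", "é" (U+00E9), "世" (U+4E16), "🙂" (U+1F642)
console.log(utf8Widths("A\u00E9\u4E16\u{1F642}")); // [1, 2, 3, 4]
```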

Example 4: Decoding bytes into readable text

If you receive bytes from an API or file, decode them with the correct encoding.

const utf8Bytes = new Uint8Array([99, 97, 102, 195, 169]);
const text = new TextDecoder("utf-8").decode(utf8Bytes);

console.log(text); // "café"

Using the wrong encoding here would produce corrupted output.
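By default, TextDecoder replaces malformed input with the replacement character U+FFFD instead of reporting an error. If you would rather detect bad input, the fatal option makes the decoder throw. The wrapper below is a minimal sketch; decodeStrictUtf8 is a hypothetical helper name, and returning null on failure is just one possible design.

```javascript
// Decode bytes as UTF-8 and return null instead of silently inserting U+FFFD.
// `decodeStrictUtf8` is a hypothetical helper name, not a built-in.
function decodeStrictUtf8(bytes) {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (error) {
    return null; // with fatal: true the decoder throws a TypeError on malformed input
  }
}

const completeBytes = new Uint8Array([99, 97, 102, 195, 169]); // "café"
const truncatedBytes = new Uint8Array([99, 97, 102, 195]);     // second byte of é is missing

console.log(decodeStrictUtf8(completeBytes));           // "café"
console.log(decodeStrictUtf8(truncatedBytes));          // null
console.log(new TextDecoder().decode(truncatedBytes));  // "caf\uFFFD"
```

Strict decoding catches truncated or malformed UTF-8, but it cannot tell you that valid bytes were meant to be read in a different encoding.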

5. Practical Use Cases

These use cases all involve text leaving the simplest possible “plain string” world and entering a real encoding boundary.

6. Common Mistakes

Mistake 1: Treating length as visible character count

Developers often use length for validation or slicing, then get unexpected results for emoji or other non-ASCII text.

Problem: length counts UTF-16 code units, so these two emoji report a length of 4 and the two-character limit rejects them.

const nickname = "🙂🙂";

if (nickname.length > 2) {
  console.log("Too long");
}

Fix: Count code points with spread syntax or, for user-facing text, use a grapheme-aware approach when needed.

const nickname = "🙂🙂";

if ([...nickname].length > 2) {
  console.log("Too long");
}

The corrected version better matches what users think of as characters, though some complex emoji sequences may still need deeper Unicode handling.
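For the cases where code points are still not enough, one grapheme-aware option is Intl.Segmenter. The sketch below assumes a runtime that supports it (current browsers and recent Node.js releases do); countGraphemes is a hypothetical helper name.

```javascript
// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
// `countGraphemes` is a hypothetical helper name, not a built-in.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

// The French flag: two regional indicator code points, one visible symbol.
const frenchFlag = "\u{1F1EB}\u{1F1F7}";

console.log(frenchFlag.length);          // 4 code units
console.log([...frenchFlag].length);     // 2 code points
console.log(countGraphemes(frenchFlag)); // 1 grapheme cluster
```

The same helper counts "e" followed by a combining accent as one character, which spread syntax would count as two.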

Mistake 2: Decoding bytes with the wrong text encoding

If the bytes were created as UTF-8 but decoded as something else, the result can look garbled.

Problem: The wrong decoder produces corrupted text because the byte sequence is interpreted using the wrong character map.

const bytes = new Uint8Array([99, 97, 102, 195, 169]);
const text = new TextDecoder("utf-16le").decode(bytes);

console.log(text); // garbled text, not "café"

Fix: Use the same encoding that was used to create the bytes, most commonly UTF-8.

const bytes = new Uint8Array([99, 97, 102, 195, 169]);
const text = new TextDecoder("utf-8").decode(bytes);

console.log(text); // "café"

The corrected version works because UTF-8 correctly maps the byte sequence back to the original text.

Mistake 3: Assuming plain strings are safe to place in URLs

Text with spaces, slashes, or special characters must be encoded before being added to a query string or path segment.

Problem: Unescaped text can break the URL or change its meaning, especially when it contains ?, &, or spaces.

const search = "café au lait & tea";
const url = "/search?q=" + search;

console.log(url);

Fix: Use encodeURIComponent for query values and path segments that need percent-encoding.

const search = "café au lait & tea";
const url = "/search?q=" + encodeURIComponent(search);

console.log(url); // "/search?q=caf%C3%A9%20au%20lait%20%26%20tea"

The fixed version keeps the URL valid and preserves the original text safely.
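For query strings specifically, URLSearchParams is an alternative that handles the escaping for you. This is a sketch under the assumption that the endpoint is /search with a single q parameter, as in the example above; buildSearchUrl is a hypothetical helper name.

```javascript
// Build a search URL and let URLSearchParams handle the escaping.
// `buildSearchUrl` is a hypothetical helper name; "/search" and "q" come from the example above.
function buildSearchUrl(query) {
  const params = new URLSearchParams({ q: query });
  return "/search?" + params.toString();
}

const searchUrl = buildSearchUrl("caf\u00E9 au lait & tea");
console.log(searchUrl); // "/search?q=caf%C3%A9+au+lait+%26+tea"

// Parsing the query string again returns the original text.
const parsed = new URLSearchParams(searchUrl.split("?")[1]);
console.log(parsed.get("q") === "caf\u00E9 au lait & tea"); // true
```

Note that URLSearchParams writes spaces as + rather than %20. Both forms are valid in a query string, but encodeURIComponent remains the safer choice for path segments.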

Mistake 4: Comparing visually identical strings without normalization

Some letters can be represented in more than one Unicode form, such as a single precomposed character or a base letter plus combining mark.

Problem: Two strings can look identical but still compare as different values because their underlying code units differ.

const a = "caf\u00E9";  // é as one precomposed code point
const b = "cafe\u0301"; // e followed by a combining acute accent

console.log(a === b); // false

Fix: Normalize both strings before comparing them.

const a = "caf\u00E9";  // é as one precomposed code point
const b = "cafe\u0301"; // e followed by a combining acute accent

console.log(a.normalize("NFC") === b.normalize("NFC")); // true

Normalization helps when user input may come from different keyboards or systems.
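The effect of normalization is easiest to see in the lengths. The sketch below writes both forms with escapes so the difference is visible in the source; the variable names are chosen for this illustration only.

```javascript
// Two spellings of "café" that render identically.
const composed = "caf\u00E9";    // é as one code point (NFC form)
const decomposed = "cafe\u0301"; // e + combining acute accent (NFD form)

console.log(composed.length);   // 4
console.log(decomposed.length); // 5

// NFC composes, NFD decomposes; each converts one form into the other.
console.log(decomposed.normalize("NFC") === composed); // true
console.log(composed.normalize("NFD") === decomposed); // true
```

Either form works for comparison as long as both sides use the same one. NFC is the more common choice for stored text.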

7. Best Practices

Use encoding APIs at boundaries

Keep text as strings while it is inside JavaScript, and convert to bytes only when you cross a boundary such as network, file, or binary APIs.

const body = "Hello, world";
const encoded = new TextEncoder().encode(body);

This avoids guessing and makes the text format explicit.

Normalize text before equality checks

When comparing user-entered text, normalize it first if your product needs accent-sensitive but representation-insensitive matching.

function sameText(left, right) {
  return left.normalize("NFC") === right.normalize("NFC");
}

This reduces false mismatches caused by alternate Unicode forms.

Be careful when slicing text

String slicing by index can split surrogate pairs or complex emoji sequences. Prefer code-point-aware iteration for display logic.

const name = "A🙂B";

for (const char of name) {
  console.log(char);
}

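Iteration with for...of goes code point by code point, so the emoji arrives as one value instead of two separate surrogates. Emoji sequences built from several code points are still split.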
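The same idea applies to truncation. The following sketch compares slice with a code-point-based cut; truncateByCodePoints is a hypothetical helper name, and it still does not protect multi-code-point sequences such as flags.

```javascript
// Keep at most `maxCodePoints` code points without cutting a surrogate pair in half.
// `truncateByCodePoints` is a hypothetical helper name, not a built-in.
function truncateByCodePoints(text, maxCodePoints) {
  return Array.from(text).slice(0, maxCodePoints).join("");
}

const label = "A\u{1F642}B"; // "A🙂B"

// slice works on code units and leaves a lone high surrogate behind.
console.log(label.slice(0, 2) === "A\uD83D"); // true

// Cutting by code point keeps the emoji whole.
console.log(truncateByCodePoints(label, 2) === "A\u{1F642}"); // true
```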

8. Limitations and Edge Cases

If a string looks broken only after transport, storage, or comparison, the problem is often not the string itself but the encoding or normalization step around it.
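One concrete case is a string that contains half of a surrogate pair, for example after a careless slice. Such a string is a legal JavaScript value, but it cannot be represented in UTF-8, so TextEncoder substitutes U+FFFD. The check below is a sketch; survivesUtf8RoundTrip is a hypothetical helper name.

```javascript
// Report whether a string comes back unchanged after encoding to UTF-8 and decoding again.
// `survivesUtf8RoundTrip` is a hypothetical helper name, not a built-in.
function survivesUtf8RoundTrip(text) {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder().decode(bytes) === text;
}

console.log(survivesUtf8RoundTrip("A\u{1F642}B")); // true: well-formed string
console.log(survivesUtf8RoundTrip("A\uD83D"));     // false: the lone surrogate becomes U+FFFD
```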

9. Practical Mini Project

Let’s build a small text inspector that reports the string length, code-point count, normalized form, and UTF-8 byte length. This is useful for debugging why a piece of text behaves unexpectedly.

function inspectText(input) {
  const codeUnits = input.length;
  const codePoints = [...input].length;
  const normalized = input.normalize("NFC");
  const bytes = new TextEncoder().encode(input);

  return {
    codeUnits,
    codePoints,
    normalized,
    utf8Bytes: bytes.length
  };
}

const sample = "café 🙂";
console.log(inspectText(sample));

This mini project shows the same text from several angles: size in UTF-16 code units, code-point count, NFC-normalized form, and UTF-8 byte size. For "café 🙂" with a precomposed é, the counts are 7 code units, 6 code points, and 10 UTF-8 bytes. It is a practical way to debug encoding-related surprises.

10. Key Points

11. Practice Exercise

Expected output: an object with numeric counts and a normalized string value.

Hint: Use spread syntax for code points and TextEncoder for bytes.

function reportStringInfo(text) {
  return {
    length: text.length,
    codePoints: [...text].length,
    utf8Bytes: new TextEncoder().encode(text).length,
    normalized: text.normalize("NFC")
  };
}

console.log(reportStringInfo("hello"));
console.log(reportStringInfo("café"));
console.log(reportStringInfo("🙂"));

12. Final Summary

JavaScript strings are not raw byte arrays. They are UTF-16 code-unit sequences, which is why basic text often behaves as expected while emoji, accents, and multilingual content reveal edge cases. Once you understand that difference, string length, iteration, and comparison all make more sense.

For real-world applications, the main rule is simple: keep text as strings inside JavaScript, and use explicit encoders, decoders, and normalization when text must be counted, compared, sent over the network, or converted into bytes. That habit prevents many of the subtle bugs that show up in production.

If you want to go deeper, the next useful topics are Unicode normalization, URL encoding, and working with binary data using ArrayBuffer and Uint8Array.