What are UTF-8 string literals in C# 11?

UTF-8 string literals, introduced in C# 11, let you embed UTF-8 encoded text directly in source code. A regular `string` stores its characters as UTF-16 (the standard .NET encoding); a literal with the `u8` suffix is instead encoded by the compiler into UTF-8 bytes and exposed as a `ReadOnlySpan<byte>`. This is particularly useful when interacting with systems or APIs that expect UTF-8 encoded data.

Basic Usage

This snippet demonstrates the simplest form of a UTF-8 string literal. The `u8` suffix appended to the string literal tells the compiler to encode the text as UTF-8. In the released version of C# 11 the literal's type is `ReadOnlySpan<byte>` (early previews used `byte[]`), so `utf8Bytes` is a read-only view over the UTF-8 encoded bytes of "Hello, World!". The `System.Text.Encoding.UTF8.GetString()` method then decodes those bytes back into a standard C# string for display.

using System;
// The u8 suffix produces a ReadOnlySpan<byte>, not a byte[] or a string.
ReadOnlySpan<byte> utf8Bytes = "Hello, World!"u8;
Console.WriteLine(System.Text.Encoding.UTF8.GetString(utf8Bytes));

Concepts Behind the Snippet

Prior to C# 11, you would typically create UTF-8 bytes from a string by calling `System.Text.Encoding.UTF8.GetBytes()` at runtime. The `u8` suffix offers a more direct syntax: the compiler encodes the text at compile time and embeds the bytes in the assembly, so no encoding work or array allocation happens when the literal is evaluated. Keep in mind that the literal's type is `ReadOnlySpan<byte>`, not `string` or `byte[]`; call `.ToArray()` when you need an array. A UTF-8 literal is also not a compile-time constant, so it cannot be used as a `const` value or as a default parameter value.
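To make the comparison concrete, here is a minimal sketch that encodes the same text both ways and checks that the bytes match. It assumes a C# 11 or later compiler targeting a modern .NET runtime; the variable names are illustrative only.

```csharp
using System;
using System.Text;

// Runtime encoding: the string is transcoded from UTF-16 each time this line runs.
byte[] runtimeBytes = Encoding.UTF8.GetBytes("Hello, World!");

// Compile-time encoding: the compiler emits the UTF-8 bytes into the assembly.
ReadOnlySpan<byte> literalBytes = "Hello, World!"u8;

// Both approaches yield the same 13 bytes.
Console.WriteLine(literalBytes.SequenceEqual(runtimeBytes)); // True
Console.WriteLine(literalBytes.Length);                      // 13
```

Whether the saved encoding step matters in practice depends on how often the code runs, so treat this as an illustration of equivalence rather than a benchmark.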

Real-Life Use Case

A common scenario is exchanging data with systems that require UTF-8, such as web APIs that send and receive JSON payloads or message brokers like Kafka. Another is writing to files or network streams where UTF-8 is the desired encoding. For example, a microservice that sends fixed protocol tokens or headers to another service can declare them as UTF-8 literals and skip the runtime encoding step. For data that is only known at runtime, use a runtime encoder such as `JsonSerializer.SerializeToUtf8Bytes()`.

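// Sending a UTF-8 encoded JSON payload
using System;
using System.Text.Json;

var data = new { Message = "Hello from C#" };

// Serialize an object straight to UTF-8 bytes at runtime.
byte[] jsonData = JsonSerializer.SerializeToUtf8Bytes(data);

// Alternatively, when the payload is fixed, embed it as a UTF-8 literal.
// No serializer call is needed because the text is already JSON.
ReadOnlySpan<byte> fixedPayload = "{\"Message\":\"Hello from C#\"}"u8;

// Send jsonData (or fixedPayload) to your API endpoint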
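The stream scenario mentioned above can be sketched as follows. This is an illustrative example, not a prescribed pattern: a `MemoryStream` stands in for a real file or network stream, and the payload text is made up for the demonstration.

```csharp
using System;
using System.IO;
using System.Text;

// A MemoryStream stands in for a file or network stream in this sketch.
using var stream = new MemoryStream();

// Stream.Write has an overload that accepts ReadOnlySpan<byte>,
// so the literal can be written without an intermediate array.
stream.Write("{\"Message\":\"Hello from C#\"}"u8);
stream.Write("\n"u8);

// Decode what was written to confirm the bytes are valid UTF-8.
string written = Encoding.UTF8.GetString(stream.ToArray());
Console.Write(written);
```

Because the literal is already a span of bytes, nothing is encoded or allocated for the payload itself before it reaches the stream.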

Best Practices

  • Use UTF-8 string literals when you know you need UTF-8 encoded data and want to avoid runtime encoding.
  • Be mindful that the result is a `ReadOnlySpan<byte>`, not a `string` or `byte[]`. Use `System.Text.Encoding.UTF8.GetString()` to decode it to a `string`, or `.ToArray()` to copy it into a `byte[]`, when necessary.
  • Consider the performance implications. While often faster, measure to confirm if the optimization is significant in your specific scenario.
  • Ensure the destination system or API truly expects UTF-8. Using the wrong encoding will lead to data corruption or errors.

Interview Tip

When asked about new features in C#, mention UTF-8 string literals as a convenient way to work with UTF-8 encoded data directly. Explain how they can improve performance by encoding at compile time and reduce boilerplate code. Be prepared to discuss scenarios where they would be particularly useful, such as interacting with web APIs or handling data streams.

When to use them

Use UTF-8 string literals primarily when:

  • You are working with APIs or systems that require UTF-8 encoding.
  • You want to optimize performance by encoding strings as UTF-8 at compile time.
  • You prefer a more concise syntax for producing UTF-8 bytes from string literals.

Memory Footprint

UTF-8 can use less memory than UTF-16, especially for text that is primarily ASCII: an ASCII character takes 1 byte in UTF-8 but 2 bytes in UTF-16. Other characters take 2 to 4 bytes in UTF-8, compared with 2 or 4 bytes in UTF-16, so text dominated by characters that need 3 bytes in UTF-8 (most CJK characters, for example) is larger in UTF-8 than in UTF-16. If your application embeds many mostly-ASCII literals that are only ever consumed as UTF-8, declaring them with `u8` can modestly reduce their size and avoids holding both a UTF-16 string and its UTF-8 copy.
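The size difference is easy to check with `Encoding.GetByteCount()`. The sketch below uses three sample strings chosen for illustration; non-ASCII characters are written as `\u` escapes so the result does not depend on how the source file is saved.

```csharp
using System;
using System.Text;

string ascii = "Hello";                  // 5 ASCII characters
string accented = "h\u00E9llo";          // contains U+00E9 (e with acute accent)
string cjk = "\u65E5\u672C\u8A9E";       // three CJK characters

// Encoding.Unicode is UTF-16 (little-endian) in .NET.
Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));       // 5
Console.WriteLine(Encoding.Unicode.GetByteCount(ascii));    // 10
Console.WriteLine(Encoding.UTF8.GetByteCount(accented));    // 6
Console.WriteLine(Encoding.Unicode.GetByteCount(accented)); // 10
Console.WriteLine(Encoding.UTF8.GetByteCount(cjk));         // 9
Console.WriteLine(Encoding.Unicode.GetByteCount(cjk));      // 6

// A u8 literal has the same length as the UTF-8 byte count of its text.
Console.WriteLine("h\u00E9llo"u8.Length);                   // 6
```

The last pair of counts shows the case where UTF-8 is the larger encoding, which is why the saving only applies to mostly-ASCII text.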

Alternatives

The primary alternative to UTF-8 string literals is calling `System.Text.Encoding.UTF8.GetBytes()` to convert a regular C# string to a UTF-8 byte array at runtime. A less common alternative is writing the UTF-8 byte sequence by hand as a byte array initializer, which is error-prone and hard to read. UTF-8 string literals are generally preferred for constant text because of their simplicity and potential performance benefits.

// Alternative: Using Encoding.UTF8.GetBytes()
string myString = "Hello, World!";
byte[] utf8Bytes = System.Text.Encoding.UTF8.GetBytes(myString);

Pros

  • Conciseness: Provides a more compact and readable syntax for creating UTF-8 byte data.
  • Potential Performance: Encodes the string at compile time, potentially avoiding runtime encoding costs.
  • Direct Representation: Represents the string directly as UTF-8 bytes, which can be more efficient in some scenarios.

Cons

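  • Result is a `ReadOnlySpan<byte>`: Requires decoding to a `string` or copying to a `byte[]` for many operations, and as a `ref struct` it cannot be stored in a class field or kept alive across an `await`.
  • Limited Applicability: Only useful when you specifically need UTF-8 encoded data, and only for text known at compile time.
  • Not a universal replacement: It is not meant to replace regular strings, only to provide a more convenient way to embed UTF-8 encoded bytes in code.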

FAQ

  • How do I convert the bytes of a UTF-8 string literal back to a regular C# string?

    Use the `System.Text.Encoding.UTF8.GetString()` method, which has an overload that accepts a `ReadOnlySpan<byte>`. For example:

    ReadOnlySpan<byte> utf8Bytes = "Hello, World!"u8;
    string regularString = System.Text.Encoding.UTF8.GetString(utf8Bytes);
  • Can I use escape sequences in UTF-8 string literals?

    Yes, you can use escape sequences like `\n` (newline), `\t` (tab), and `\uXXXX` (Unicode character) within UTF-8 string literals, just like in regular string literals. The compiler will interpret these escape sequences and encode the resulting characters in UTF-8.
  • Are UTF-8 string literals automatically null-terminated?

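    The span you get from a UTF-8 string literal does not include a null terminator: its `Length` covers only the encoded bytes of the text. According to the C# 11 feature specification, the compiler does place a zero byte immediately after the data in memory, outside the span's length, to help some interop scenarios. That extra byte is not part of any copy you make with `ToArray()`, so if you need a null-terminated buffer for a C API or another system, include the terminator explicitly, for example by ending the literal with `\0` or by appending a zero byte to the copied array.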
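The escape-sequence and null-terminator answers above can be checked with a short sketch. The literals here are arbitrary examples, and ending the literal with `\0` is just one possible way to get a terminator that is counted in the length.

```csharp
using System;

// Escape sequences are processed before encoding: \t becomes one byte (0x09)
// and \u00E9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.
ReadOnlySpan<byte> escaped = "a\t\u00E9"u8;
Console.WriteLine(Convert.ToHexString(escaped)); // 6109C3A9

// The span's Length covers only the encoded text, with no terminator.
ReadOnlySpan<byte> name = "abc"u8;
Console.WriteLine(name.Length); // 3

// An explicit \0 in the literal adds a terminator that is part of the span.
ReadOnlySpan<byte> terminated = "abc\0"u8;
Console.WriteLine(terminated.Length); // 4
Console.WriteLine(terminated[^1]);    // 0
```

When passing such a buffer to native code, the usual span pinning and lifetime rules still apply; this sketch only shows what the bytes contain.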