Leetcode: Improving String Performance in Swift
Get the most out of your Swift strings when using Leetcode.
Join the DZone community and get the full member experience.
Join For FreeProblems with strings are one of the most common problems with Leetcode. By solving a task on Leetcode, you can also compare your solution with other solutions in terms of performance and memory usage. I have experience solving issues with strings, and my task execution time is often better than 100% of other users' solutions. Let's see how we can speed up solving a difficulty with strings.
Strings in Swift
In many programming languages, a string is an array of characters with direct access to the elements of the string/array. In this case, you can access a substring or character instantly in O(1) time. However, Swift is built differently. Let's take a more in-depth look at what lies behind the facade.
The String structure implements the BidirectionalCollection protocol, making it possible to access the index in O(1) time. To find out this index, you need to calculate it separately, and as a result, it takes O(n) time to access any element. To understand why it's working in this way, we need to go just more little deeper – into encoding.
The encoding defines how each Unicode character is stored in memory. There are more than 130,000 characters; each must be somehow distinguished from the others and stored in memory. The primary storage unit in computers is a byte (8 bits). It can store no more than 256 different values, so one byte is not enough to keep text in Unicode encoding (remember, we have many more possible characters).
Unicode characters have several basic encode UTF-8, UTF-16, and UTF-32. All of them have their characteristics, but we are interested in UTF-8 because Swift uses it.
UTF-8
For each character in UTF-8, from 1 to 4 bytes of memory is allocated. The number of bytes determines by the character number in the Unicode table. The larger the number in the Unicode table, the more bytes required to encode the character, up to a maximum of 4 bytes. For example, the Latin letter 'a' is encoded in the bit value 00101101 or in decimal 56, which requires one byte to represent in memory. And, the Cyrillic letter 'б' is encoded into the bit value 0110110101010111010, for which 2 bytes are already allocated. The most common world characters are included in the first 128 Unicode symbols — primarily the Latin alphabet, Arabic numerals, and basic punctuation.
Therefore, all these characters fit into one byte. But there is one problem here. Swift internally uses UTF-8, and it doesn't know in advance which characters will be in the string. Swift considers all lines to be UTF-8, and each character in the string can be between 1 and 4 bytes.
Let's look at an example. Imagine we have the string "Hi Павел" ("Hi Pavel" in Latin). The first two characters are Latin letters, a space, and 5 Cyrillic characters.
It is easy to refer to the first character:
- at the very beginning, we know the place in advance in memory where it is located
- we read it
- based on the bitmask, we can determine that this is a character from the first 128 Unicode characters
Now we know that the first character occupies one byte, so we must be moving on one byte and reading the next. Let's go further. How to access the second symbol?
But the first character could be any other and occupy not only one byte of memory, but 2, 3, or 4. Swift didn't know in advance that the first character is a letter of the Latin alphabet. So, we need to read it first and determine how many bytes it occupies. To access the next character, we should move by the corresponding number of bytes.
Further, to refer to the third character, we must determine how much the second character occupies in memory. As soon as we get to the Cyrillic characters, each has already taken up 2 bytes in memory. So to access the following characters, Swift will shift by 2 bytes.
Again, there is no way to determine in advance how many bytes you need to move to access a character. You need to go through all the characters and choose how many bytes each of them occupies. Indexes do all this work in Swift.
As we know, Swift has a particular set of functions for calculating indexes working with strings, for example, index(after:), index(before:), distance, etc. (these functions inherit from the Collection and BidirectionalCollection protocols).
These functions do all the hard work of calculating the position in the string. It turns out that to refer to a character in the middle of a line, we must first calculate its index, sequentially calculating the index of the previous characters. We will only then find out exactly where the desired symbol is stored in memory.
Leetcode Features
As we have seen, these calculations are quite complex and expensive regarding performance. In Leetcode, the most favorite part of the task condition is input data restrictions. The standard limitation for strings is the characters used, restricted to Latin alphabet letters, numbers, and punctuation marks.
Based on this additional information, we can optimize the performance of our algorithms. We have no way to tell Swift that it's UTF-8 with single-byte characters, but we can trick Swift a bit and convert strings to an array of bytes (Array<UInt8>). As we know, arrays in Swift implement the RandomAccessCollection protocol, which provides access to an array element in O(1) time. We don't need to precalculate the index to access an array element. We can access any item, for example, str[4], and Swift won't need to calculate what memory shift to make. Swift can immediately move four elements and get the desired byte instantly.
Once you get the input, you need to convert it from a string to a Character array or even a byte array. And then treat them as an array with direct access to the elements. That is how to speed up most Leetcode solutions for string-related tasks. It will speed up your decision significantly.
Array conversion options:
let str = "Hello world"
Array<UInt8>(str.utf8) // array of bytes
Array(str) // array of Characters
If during work with strings, to access any element, was required something like this:
let startSubstringIndex = greeting.index(greeting.startIndex, offsetBy: 2)
greeting[startSubstringIndex..<greeting.index(startSubstringIndex, offsetBy: 3)]
... this construction, besides the complexity of the structure itself, is also slow in contrast to this construction, which is both more concise and faster:
// let charArray = Array(greeting)
charArray[2..<5]
You can use this little hack to solve problems on Leetcode and buy some time from the checking system.
Opinions expressed by DZone contributors are their own.
Comments