After reading a fantastic article by Ian Whitney, it came to my attention that there is some confusion regarding the “length” of a string in Rust. According to the documentation, std::string::String.len() returns the number of bytes that are in the given string. On a technical level, there is nothing confusing about this definition. However, it is widely accepted by other languages (like Java and Ruby) that the “length” of a string is the number of characters within the string.
The problem with this difference in definition is brought to light in a playpen by respeccing, which shows that Rust’s std::string::String.len() function produces counter-intuitive results. Two strings with the same character count return different “lengths” because they contain a different number of bytes.
The solution to this is instead to use a String’s character iterator and count the number of elements, as std::string::String.chars().count() does.
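A minimal sketch of the difference, using only the standard library. The helper name byte_and_char_len and the sample words are illustrative assumptions, not taken from the playpen mentioned above.

```rust
/// Returns (length in bytes, number of Unicode code points) for `s`.
/// `byte_and_char_len` is a hypothetical helper for this example.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Both words have four characters on screen, but the precomposed
    // "é" (U+00E9) takes two bytes in UTF-8.
    let plain = "cafe";
    let accented = "caf\u{e9}";

    for word in [plain, accented].iter() {
        let (bytes, chars) = byte_and_char_len(word);
        println!("{}: {} bytes, {} chars", word, bytes, chars);
    }
}
```

Note that chars() counts code points, which is still not quite the same thing as the characters a reader sees.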
This sounds like a footgun… I assume it’s too late to change len() to produce a length in characters, and have some other function to produce a length in bytes? Given a decent high-level String representation and a desire for code to transparently work with multi-byte characters, the need to get the length in bytes will be a lot less used than the need to get the length in characters…
Rust is a systems programming language; the programmer is expected to understand that bytes and characters are not analogous and that encoded characters are variable length since forever. Anyone confused by this needs to go back to ruby or java or whatever.
Yes, indeed. The question is not whether you should have a call for one, or a call for the other – you should obviously have a call for both. The question is what should len() do, as the “default” and “obvious” function to call when getting a string length. My assertion is that it should count the length in what I said was “characters”, but another commentator has pointed out should probably actually be “grapheme clusters”, because that’s more common. Rust has memory management and you don’t manually allocate memory or terminate strings, so it’s not so common to want to know the length in bytes of a string.
The answer for what len() should do is whatever len() is documented to do. One could argue that a better name for a function that returns the number of bytes of storage is size(). This would at least be in keeping with the conventions of traditional systems programming languages. In any case, since there is an inherent ambiguity involved here I sincerely do not believe there is a “right” answer, so the hunt for the “default” and “obvious” is futile. We’re left to expect the competent programmer to consult the documentation until the programmer has inculcated the correct answer.
If a string is not a byte array (and it’s not; it’s a sequence of UTF characters), then yes, giving back the number of bytes for its length is “wrong”. If that’s what it gives back on purpose, then it’s broken by design.
If I have an array of 5 ints and ask for the length, should I expect the result to be 5 or 20? Because by the way that str behaves, I should expect that array to return 20. But it won’t, because that’s stupid.
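For reference, here is a small sketch of what the standard library reports in the two cases being compared. The helper element_count_and_size is a hypothetical name for this example, and it assumes the ints are i32.

```rust
use std::mem::size_of_val;

/// Returns (number of elements, size in bytes) of a slice of i32.
/// Hypothetical helper, only here to put the two numbers side by side.
fn element_count_and_size(xs: &[i32]) -> (usize, usize) {
    (xs.len(), size_of_val(xs))
}

fn main() {
    let numbers = [1, 2, 3, 4, 5];
    let (count, bytes) = element_count_and_size(&numbers);
    // len() counts elements; size_of_val() counts bytes of storage.
    println!("array: len = {}, bytes = {}", count, bytes);

    // For str, len() and the storage size are the same number, because
    // the unit that len() counts is the byte.
    let word = "Ame\u{301}lie";
    println!("str: len = {}, bytes = {}", word.len(), size_of_val(word));
}
```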
Of course the number of bytes a structure takes up is useful to know. And as it happens, string has an as_bytes() function explicitly for getting the string as a pure byte array. If you need to know the number of bytes, you can get .as_bytes().len(). If you need the number of chars, you can use .chars().count(). And if you do just .len(), you ought to be getting the number of grapheme clusters.
(Also, difference in usage between len() and count() seems like another design problem.)
Example:
word: Amélie (using the combining diacritical)
bytes: 8
chars: 7
characters: 6
If you do a [0..4] slice on that string, you’ll cause a panic because the end index lands partway through the two bytes of the accent char. Yet according to the documentation, “String slices are always valid UTF-8.” But you can crash the program because the slice indexes are based off of bytes instead of characters.
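A hedged sketch of the example above, together with one way to avoid the panic: str::get returns None instead of panicking when an index is not on a character boundary. The helper name safe_prefix is an assumption made for this illustration.

```rust
/// Returns the first `end` bytes of `s` as a string slice, or None if
/// `end` is out of range or falls inside a multi-byte code point.
/// `safe_prefix` is a hypothetical helper, not a standard method.
fn safe_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    // "Amélie" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let word = "Ame\u{301}lie";
    println!("bytes: {}", word.len());
    println!("chars: {}", word.chars().count());

    // The accent occupies bytes 3 and 4, so 4 is not a valid boundary.
    // `&word[0..4]` would panic here; `get` reports the problem instead.
    println!("boundary at 4? {}", word.is_char_boundary(4));
    println!("prefix to 4: {:?}", safe_prefix(word, 4));
    println!("prefix to 5: {:?}", safe_prefix(word, 5));
}
```

Counting the six visible characters needs grapheme segmentation, which this sketch leaves out because the standard library does not provide it.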
Defaulting len() to be the number of bytes (and having that carry into other areas of the program) seems a sure way to introduce bugs and frustration, and in a way that knowing the documentation does not fix. And the functions that are defined on str hint at the fact that whoever wrote them up kind of realized that, and introduced a ton of functions to bandaid their way around the problem, rather than fix the class behavior in a reasonable manner.
Just an extra note: After digging into github issues, it looks like there might be better handling for this issue than expected (eg: a graphemes() function; a width() function for a full character count; etc), but it isn’t shown in the general documentation.
This is still misleading. See http://is.gd/5HJDvF .
As the String documentation (https://doc.rust-lang.org/std/string/struct.String.html#utf-8) notes, a “char” is a Unicode codepoint, not a “character”.
Usually when you want the “length” of a string you need to know how much storage it takes, so you should count bytes. Other times you want to know how many visual “characters” it has, so you should count grapheme clusters. Unfortunately, String has no method for the latter.
I’m not sure why you would want to count codepoints.
Also see http://developers.linecorp.com/blog/?p=3473 .
Hi. I’m @respeccing
Sorry for the 404 on github but they purged my account 1 day after I changed my full name to something like: abandoned account, similarly to what you see here: https://notabug.org/respeccing
That was unexpected! I wanted the comments context to remain!
Normally you’d see ‘ghost’ as username if I were to delete the account myself, but nope, it must’ve gotten purged instead: https://github.com/IanWhitney/designisrefactoring/pull/19
The PR links for 192,193 and 195 are also 404-ing from here: https://github.com/rust-lang/rust-playpen/commits/master
“Issue title is private” when hovering on them must be some bug.
Anyway, just wanted to clear things up as to why 404-ing!
Cheers!
My bad guys! I contacted github support and their bot flagged my account for spam and that is why it got hidden!
I should’ve contacted them earlier, but my negative mentality prevented me from doing that and instead I chose to believe the worst!
Account looks reinstated now! And I should work on my mentality, will do;-)
The way Rust does it is the sanest way to deal with Unicode. Basically, the following options exist:
Strings are byte arrays encoded in utf-8, indexed by byte offsets (len() must be, primarily, consistent with indexing). That’s what Rust does and what is usually used in C++.
Strings are encoded in utf-8, indexed by character offsets. Down that road lies madness. I’ve been there. Indexing and len() with O(n) complexity is not a footgun, it is a foot missile launcher. It also defeats the purpose of indexing.
Strings are encoded in utf-32, indexed by characters. That is what Python (and I think Ruby too) does. It is very slightly more convenient, but rather wasteful, both in storage and in the fact that it always has to be recoded on the way in/out, because utf-32 is not used as a storage or transfer format. That makes it a poor match for Rust.
Strings are encoded in utf-16, indexed by words, and a blind eye is turned to the existence of surrogate pairs. I believe this is still what Java and C# do. It appears to be more convenient and less wasteful, but in fact it has all the downsides of the byte-indexed utf-8 solution, only more insidious: with utf-8 you are pretty much guaranteed to hit the difference between code words and characters quickly, while in the utf-16 case you can remain oblivious to it for a very long time, but it’s still going to bite you one day.
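To make the utf-16 pitfall concrete, here is a small sketch that measures one string in all three units from Rust. The helper utf16_units is a hypothetical name, and the emoji is just a convenient code point outside the Basic Multilingual Plane.

```rust
/// Number of UTF-16 code units needed to encode `s`. This is the unit
/// that a utf-16 indexed string type would count.
/// `utf16_units` is a hypothetical helper for this illustration.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    // U+00E9 fits in one UTF-16 code unit, so the mismatch stays hidden.
    let accented = "caf\u{e9}";
    // U+1F600 needs a surrogate pair, i.e. two UTF-16 code units.
    let emoji = "\u{1F600}";

    for s in [accented, emoji].iter() {
        println!(
            "{}: {} utf-8 bytes, {} utf-16 units, {} code points",
            s,
            s.len(),
            utf16_units(s),
            s.chars().count()
        );
    }
}
```

With utf-8 the accented word already disagrees with its character count, while with utf-16 it takes a much rarer character before the numbers diverge.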
And then it needs to be said that char is a rather useless unit of information anyway. There can easily be strings that should be considered equal for all reasonable purposes, yet have a different number of chars, because one uses the composed form and the other the decomposed form. So while the utf-32 case does allow indexing strings by random integers, it still rarely makes any sense.
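A minimal sketch of the composed versus decomposed case, using only the standard library. The helper code_points is a hypothetical name. Treating the two strings as equal would require Unicode normalization, which std does not offer, so it is left out here.

```rust
/// Counts Unicode code points (Rust `char`s) in `s`.
/// `code_points` is a hypothetical helper for this illustration.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let composed = "\u{e9}"; // "é" as one precomposed code point
    let decomposed = "e\u{301}"; // 'e' plus a combining acute accent

    // Both render as the same character, yet every std measure differs.
    println!("composed:   {} chars, {} bytes", code_points(composed), composed.len());
    println!("decomposed: {} chars, {} bytes", code_points(decomposed), decomposed.len());

    // Plain comparison works on the encoded bytes, so they are unequal.
    println!("equal? {}", composed == decomposed);
}
```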
So given the option and Rust’s stated goal of efficiency, it made the only reasonable choice.
And given the next-to-previous paragraph, I disagree with the article. You almost always do want len(), because the most common uses of it are to check whether a string is empty (for which you absolutely don’t want anything with O(n) complexity) and to calculate indices from a substring you just parsed out. You don’t want to think about it as a number of “characters”, because that concept does not even have a clear meaning for many scripts in the first place.
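The two uses mentioned above can be sketched as follows. The function split_key_value and the key=value input format are assumptions made for this example. The point is that find() returns a byte offset, which is exactly the unit that slicing expects.

```rust
/// Splits a "key=value" line at the first '='.
/// Hypothetical helper: `find` yields a byte offset that always sits on
/// a character boundary, so the slices below cannot panic.
fn split_key_value(line: &str) -> Option<(&str, &str)> {
    let eq = line.find('=')?;
    let value_start = eq + '='.len_utf8();
    Some((&line[..eq], &line[value_start..]))
}

fn main() {
    // Non-ASCII text on both sides of the separator.
    let line = "na\u{ef}ve=caf\u{e9}";

    // The emptiness check is O(1) because the byte length is stored.
    if !line.is_empty() {
        println!("{:?}", split_key_value(line));
    }
}
```

No character counting is involved at any point: the offsets come from the string itself and go straight back into it.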