1 unstable release
new 0.2.0 | Nov 21, 2024
#665 in Data structures
58 downloads per month
34KB
201 lines
rust-string-manipulation-utf8
A Rust library with string manipulation functions using character indexing (UTF-8)
Library name: string_manipulation_utf8
An implementation of string manipulation functions using character indexing instead of bytes. It uses UTF-8 encoded strings as implemented in Rust.
This library also has common string functions like indexof, substr and substring that exist in other programming languages.
The functions can be called as free functions or as methods on the 'str' type (string slice) and the 'String' type.
Library functions:
- indexof : get the character position of one string within another
- substr : get a substring of a string using start index and length (signed values)
- substru : get a substring of a string using start index and length (unsigned values)
- substr_end : get a substring from start index till the end of the string
- substring : get a substring of a string using start and end index (not included)
- str_remove : Remove a substring from a string
- str_concat! : macro to concatenate multiple strings
Standard Rust functions:
Functions independent of character and byte indexing in Rust.
- replace : replaces all matches of a pattern with another string
- replacen : replaces first N matches of a pattern with another string
- strip_prefix : returns a string slice with the prefix removed
- contains : check if a string contains another string
- starts_with : check if a string starts with another string
- ends_with : check if a string ends with another string
- is_empty : check if a String has a length of zero
The Rust standard library does not segment text into Unicode grapheme clusters, where multiple code points (for example a letter followed by combining diacritical marks) are required to form one character.
Example:
e + combining acute = e + ´ = \u{0065}\u{0301} = é (two code points with 3 bytes, hex. 65 CC 81)
Versus the character é = \u{00E9} with one code point for 2 bytes, hex. C3 A9
This library uses the Rust standard library and hence will count such combined characters as multiple characters.
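The difference between the two forms of é can be checked with the standard library alone. The following is a minimal sketch; the helper name char_and_byte_len is made up for this illustration and is not part of the library.

```rust
/// Hypothetical helper: returns (number of code points, number of bytes).
fn char_and_byte_len(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    let decomposed = "e\u{0301}"; // e + combining acute accent
    let precomposed = "\u{00E9}"; // single code point é

    println!("{:?}", char_and_byte_len(decomposed)); // (2, 3)
    println!("{:?}", char_and_byte_len(precomposed)); // (1, 2)

    // Both usually render as é, yet Rust compares them as different strings.
    println!("{}", decomposed == precomposed); // false
}
```

Any character index computed on the decomposed form is therefore one higher, after the é, than on the precomposed form.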
See section 'Using byte positioning' for examples with native byte indexing.
Simple benchmarking code was used to find the faster algorithms; see the GitHub repository rust-string-manip-benchmark.
To compile and run the example code in examples/main.rs:
cargo run --example main
To compile and run the tests in tests/tests.rs:
cargo test
Using character positioning
indexof
Get the character position of one string within another. The search starts at character 'start_index'. Returns None if not found. Index of the first character is 0.
Syntax:
str.indexof(searchstring: &str, start_index: usize) -> Option<usize>
string.indexof(searchstring: &str, start_index: usize) -> Option<usize>
indexof(s: &str, searchstring: &str, start_index: usize) -> Option<usize>
Example:
Return the character index of "test" in the given string. Start searching at the beginning of the string. Result position is 0 because "test" starts at the beginning of the string.
use string_manipulation_utf8::CharString; // String and str methods
use string_manipulation_utf8::indexof; // str function
fn main() {
let s1: &str = "test éèçà 123 test";
let s2: String = s1.to_owned();
match s1.indexof("test", 0) { // Result: Some(0)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
match s2.indexof("test", 0) { // Result: Some(0)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
match indexof(s1, "test", 0) { // Result: Some(0)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
match indexof(&s2, "test", 0) { // Result: Some(0)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
}
Return the character index of "test" in the given string. Start searching from character index 6. The result is position 14.
use string_manipulation_utf8::indexof;
use string_manipulation_utf8::CharString; // String and str methods.
fn main() {
let s1: &str = "test éèçà 123 test";
let s2: String = s1.to_owned();
match s1.indexof("test", 6) { // Result: Some(14)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
match s2.indexof("test", 6) { // Result: Some(14)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
match indexof(s1, "test", 6) { // Result: Some(14)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
match indexof(&s2, "test", 6) { // Result: Some(14)
Some(pos) => println!("Found at position: {}", pos),
None => println!("Not found"),
}
}
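For readers curious how a character-based search can be built on the byte-based str::find, here is one possible sketch using only the standard library. The function name char_indexof is hypothetical, and this is not necessarily how the library implements indexof.

```rust
/// Hypothetical helper: character index of `needle` in `s`, searching from
/// character index `start_index`. Not the crate's actual code.
fn char_indexof(s: &str, needle: &str, start_index: usize) -> Option<usize> {
    // Byte offset at which the character with index `start_index` begins.
    // Returns None when `start_index` is past the last character.
    let start_byte = s.char_indices().nth(start_index).map(|(byte, _)| byte)?;
    // `find` works on bytes, so search the tail and get a byte position.
    let found_byte = start_byte + s[start_byte..].find(needle)?;
    // Convert the byte position back to a character index.
    Some(s[..found_byte].chars().count())
}

fn main() {
    let s = "test éèçà 123 test";
    println!("{:?}", char_indexof(s, "test", 0)); // Some(0)
    println!("{:?}", char_indexof(s, "test", 6)); // Some(14)
    println!("{:?}", s.find("123")); // Some(14): byte position
    println!("{:?}", char_indexof(s, "123", 0)); // Some(10): character position
}
```

The last two lines show why the distinction matters: the four accented characters take two bytes each, so the byte position of "123" is 4 higher than its character position.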
substr
Get a substring of a string, beginning at character index 'start_index' and take 'length' characters.
Negative numbers count backwards:
'start_index' from the end of the string.
'length' from 'start_index'.
If 'start_index' exceeds the string boundary limits, return an empty string. (Similar to C++ std::string::substr() and C# String.Substring.)
'length' can be isize::MAX or isize::MIN to get the substring until the positive or negative string boundary without the need to calculate the length. (Alternatively, see substr_end in this library.)
Index of the first character is 0.
If 'start_index' and 'length' are both positive, substru is a little faster; it wraps string.chars().skip(start_index).take(length).collect(). See substru and the section 'Standard Rust methods' for examples.
Syntax:
str.substr(start_index: isize, length: isize) -> String
string.substr(start_index: isize, length: isize) -> String
substr(s: &str, start_index: isize, length: isize) -> String
Example:
use string_manipulation_utf8::CharString; // String and str methods
fn main() {
assert_eq!("0123456789".substr(2, 3), "234");
assert_eq!("0123456789".substr(-5, 3), "567");
assert_eq!("0123456789".substr(-5, -3), "345"); // Negative length counts backwards
assert_eq!("0123456789".substr(5, -3), "345"); // Negative length counts backwards
assert_eq!("0123456789".substr(2, 0), ""); // Take nothing
assert_eq!("0123456789".substr(0, 0), ""); // Take nothing
assert_eq!("0123456789".substr(-4, 0), ""); // Take nothing
assert_eq!("0123456789".substr(isize::MAX, 1), "");
assert_eq!("0123456789".substr(isize::MAX, isize::MIN), "");
assert_eq!("0123456789".substr(isize::MIN, isize::MAX), "");
}
Example:
use string_manipulation_utf8::substr;
use string_manipulation_utf8::CharString; // String and str methods
fn main() {
let s1: &str = "test éèçà 123 test";
let s2: String = s1.to_owned();
println!("substr str: {}", s1.substr(10, 3)); // Result: "123"
println!("substr String: {}", s2.substr(10, 3)); // Result: "123"
println!("substr function: {}", substr(s1, 10, 3)); // Result: "123"
}
Remark:
To get a substring from 'start_index' until the end of the string:
substr(string, start_index, isize::MAX)
substr_end(string, start_index)
substr(string, start_index, string.chars().count() as isize - start_index)
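The signed-index rules can be made concrete with a small standard-library sketch. The function substr_sketch below is hypothetical and only reproduces the behaviour shown in the examples of this section; in particular, the assumption that a negative 'length' includes the character at 'start_index' is inferred from substr(5, -3) returning "345". The library may treat other edge cases differently.

```rust
/// Hypothetical re-implementation of the documented `substr` rules.
fn substr_sketch(s: &str, start_index: isize, length: isize) -> String {
    let total = s.chars().count() as isize;
    // A negative start counts backwards from the end of the string.
    let start = if start_index < 0 { total + start_index } else { start_index };
    // Out-of-range start or zero length: nothing to take.
    if start < 0 || start >= total || length == 0 {
        return String::new();
    }
    let (begin, end) = if length > 0 {
        // Forward: clamp the end to the string length.
        (start, start.saturating_add(length).min(total))
    } else {
        // Backward (assumption): take characters ending at `start`, inclusive.
        ((start + 1).saturating_add(length).max(0), start + 1)
    };
    s.chars()
        .skip(begin as usize)
        .take((end - begin) as usize)
        .collect()
}

fn main() {
    println!("{}", substr_sketch("0123456789", 2, 3)); // 234
    println!("{}", substr_sketch("0123456789", -5, 3)); // 567
    println!("{}", substr_sketch("0123456789", 5, -3)); // 345
    println!("{}", substr_sketch("0123456789", 2, isize::MAX)); // 23456789
}
```

The saturating additions keep isize::MAX and isize::MIN usable as "until the boundary" values without overflow.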
substru
Same as substr, but only accepts unsigned values for 'start_index' and 'length'.
For positive numbers this is faster than using substr.
It wraps the expression: s.chars().skip(start_index).take(length).collect::<String>()
Syntax:
str.substru(start_index: usize, length: usize) -> String
string.substru(start_index: usize, length: usize) -> String
substru(s: &str, start_index: usize, length: usize) -> String
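This section has no example of its own, so here is a sketch built from the standard-library expression s.chars().skip(start_index).take(length).collect::<String>(). The name substru_sketch is hypothetical; the library's substru is assumed, not verified, to behave the same way for out-of-range values.

```rust
/// Hypothetical stand-in for `substru`, using only the chars iterator.
fn substru_sketch(s: &str, start_index: usize, length: usize) -> String {
    s.chars().skip(start_index).take(length).collect::<String>()
}

fn main() {
    let s = "test éèçà 123 test";
    println!("{}", substru_sketch(s, 5, 4)); // éèçà
    println!("{}", substru_sketch(s, 10, 3)); // 123
    // skip() and take() stop at the end of the iterator, so out-of-range
    // values give a shorter or empty result instead of a panic.
    println!("{:?}", substru_sketch(s, 14, 100)); // "test"
    println!("{:?}", substru_sketch(s, 50, 3)); // ""
}
```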
substring
Get a substring of a string beginning at character index 'start_index' up to and excluding the character index 'end_index'.
Equivalent of JavaScript substring with 2 parameters.
If 'start_index' is equal to 'end_index', substring() returns an empty string.
If 'start_index' is greater than 'end_index', swap 'start_index' and 'end_index'.
Any argument value that is less than 0 is treated as if it were 0.
Any argument value that is greater than string length is treated as if it were string length.
Index of the first character is 0.
Syntax:
str.substring(start_index: isize, end_index: isize) -> String
string.substring(start_index: isize, end_index: isize) -> String
substring(s: &str, start_index: isize, end_index: isize) -> String
Example:
use string_manipulation_utf8::CharString; // String and str methods
use string_manipulation_utf8::substring; // str function
fn main() {
println!("{}", substring("0123456789", 2, 3)); // Result: 2
println!("{}", substring("0123456789", 2, 9)); // Result: 2345678
println!("{}", substring("0123456789", 2, 10)); // Result: 23456789
println!("{}", substring("0123456789", 2, 11)); // Result: 23456789
println!("{}", substring("0123456789", -2, 3)); // Result: 012
println!("{}", substring("0123456789", -2, 50)); // Result: 0123456789
println!("{}", substring("0123456789", 9, 2)); // Result: 2345678
let str: &str = "test éèçà 123 test";
let string: String = str.to_owned();
println!("{}", str.substring(10, 14)); // Result: "123 " (the space at index 13 is included)
println!("{}", string.substring(10, 14)); // Result: "123 "
println!("{}", substring(str, 10, 14)); // Result: "123 "
}
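The clamping and swapping rules listed for substring can be sketched with the standard library as follows. The function substring_sketch is a hypothetical illustration of those rules, not the library's code.

```rust
/// Hypothetical re-implementation of the documented `substring` rules.
fn substring_sketch(s: &str, start_index: isize, end_index: isize) -> String {
    let total = s.chars().count() as isize;
    // Values below 0 become 0; values above the length become the length.
    let a = start_index.clamp(0, total);
    let b = end_index.clamp(0, total);
    // Swap the arguments when start is greater than end.
    let (from, to) = if a <= b { (a, b) } else { (b, a) };
    s.chars()
        .skip(from as usize)
        .take((to - from) as usize)
        .collect()
}

fn main() {
    println!("{}", substring_sketch("0123456789", 2, 3)); // 2
    println!("{}", substring_sketch("0123456789", -2, 3)); // 012
    println!("{}", substring_sketch("0123456789", 9, 2)); // 2345678
    println!("{}", substring_sketch("0123456789", 4, 4).is_empty()); // true
}
```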
substr_end
Get a substring from character index 'start_index' till end of the string.
'start_index' can be negative to count backwards from the end of the string.
If start_index exceeds the string boundary limits, return an empty string.
(Similar to C++ std::string::substr() and C# String.Substring.)
Index of the first character is 0.
Because Rust has no default values for function parameters, substr_end()
takes the place of a substr(string, start_index) or string.substr(start_index) call without a length.
Same result with: substr(string, start_index, isize::MAX)
Syntax:
substr_end(s: &str, start_index: isize) -> String
string.substr_end(start_index: isize) -> String
str.substr_end(start_index: isize) -> String
assert_eq!("0123456789".substr_end(2), "23456789");
assert_eq!("0123456789".substr_end(0), "0123456789");
assert_eq!("0123456789".substr_end(9), "9");
assert_eq!("0123456789".substr_end(10), "");
assert_eq!("0123456789".substr_end(-3), "789");
Example:
use string_manipulation_utf8::substr_end; // str function
use string_manipulation_utf8::CharString; // String and str methods
fn main() {
let s1: &str = "test éèçà 123 test";
let s2: String = s1.to_owned();
println!("substr_end str: {}", s1.substr_end(10)); // Result: "123 test"
println!("substr_end String: {}", s2.substr_end(10)); // Result: "123 test"
println!("substr_end function: {}", substr_end(s1, 10)); // Result: "123 test"
println!("substr_end function: {}", substr_end(&s2, 10)); // Result: "123 test"
}
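As a rough illustration of the rules above, the following standard-library sketch reproduces the assertions shown in this section. The function substr_end_sketch is hypothetical and may differ from the library on cases the section does not document.

```rust
/// Hypothetical re-implementation of the documented `substr_end` rules.
fn substr_end_sketch(s: &str, start_index: isize) -> String {
    let total = s.chars().count() as isize;
    // A negative start counts backwards from the end of the string.
    let start = if start_index < 0 { total + start_index } else { start_index };
    // Outside the string boundaries: return an empty string.
    if start < 0 || start >= total {
        return String::new();
    }
    s.chars().skip(start as usize).collect()
}

fn main() {
    println!("{}", substr_end_sketch("0123456789", 2)); // 23456789
    println!("{}", substr_end_sketch("0123456789", -3)); // 789
    println!("{:?}", substr_end_sketch("0123456789", 10)); // ""
    println!("{}", substr_end_sketch("test éèçà 123 test", 10)); // 123 test
}
```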
str_remove
Remove a substring from a string, beginning at character index 'start_index' and spanning 'length' characters.
Index of the first character is 0.
Syntax:
str.str_remove(start_index: usize, length: usize) -> String
string.str_remove(start_index: usize, length: usize) -> String
str_remove(s: &str, start_index: usize, length: usize) -> String
Examples:
use string_manipulation_utf8::str_remove; // str function
use string_manipulation_utf8::CharString; // String and str methods
fn main() {
let s1: &str = "test éèçà 123 test";
let s2: String = s1.to_owned();
println!("str_remove str: {}", s1.str_remove(10, 4)); // Result: "test éèçà test"
println!("str_remove String: {}", s2.str_remove(10, 4)); // Result: "test éèçà test"
println!("str_remove function: {}", str_remove(s1, 10, 4)); // Result: "test éèçà test"
}
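A character-based removal can also be expressed with the chars iterator alone: keep the characters before 'start_index' and chain the characters after the removed span. The sketch below uses a hypothetical function name, str_remove_sketch, and is not taken from the library's source.

```rust
/// Hypothetical stand-in for `str_remove`, using only the chars iterator.
fn str_remove_sketch(s: &str, start_index: usize, length: usize) -> String {
    // Characters before the removed span.
    let head = s.chars().take(start_index);
    // Characters after the removed span; saturating_add avoids overflow
    // when `length` is very large.
    let tail = s.chars().skip(start_index.saturating_add(length));
    head.chain(tail).collect()
}

fn main() {
    let s = "test éèçà 123 test";
    println!("{}", str_remove_sketch(s, 10, 4)); // test éèçà test
    println!("{}", str_remove_sketch("0123456789", 2, 3)); // 0156789
    println!("{}", str_remove_sketch("0123456789", 8, usize::MAX)); // 01234567
}
```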
str_concat
Macro to concatenate multiple strings.
All strings are borrowed.
First allocates the needed capacity, then adds the strings.
Syntax:
str_concat!(&str1, &str2, ...)
Examples:
use string_manipulation_utf8::str_concat;
fn main() {
println!(
"{}",
str_concat!("test", " ", "123 ", "éèçà ", "123 ", "test home")
); // Result: "test 123 éèçà 123 test home"
let s1: String = "string1".to_owned();
let s2: String = "string2".to_owned();
let s3: String = "string3".to_owned();
let result: String = str_concat!(&s1, &s2, &s3);
println!("{result}"); // Result: "string1string2string3"
let s2: &str = "string2"; // Adding a string slice
let result: String = str_concat!(&s1, s2, &s3);
println!("{result}"); // Result: "string1string2string3"
}
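The "allocate first, then add" idea can be shown with a plain function. This is a sketch under the assumption that the macro does something comparable; concat_with_capacity is a hypothetical name, and it takes a slice instead of a variable number of arguments.

```rust
/// Hypothetical function showing pre-allocation before concatenation.
fn concat_with_capacity(parts: &[&str]) -> String {
    // Sum the byte lengths so that the buffer is allocated only once.
    let capacity: usize = parts.iter().map(|p| p.len()).sum();
    let mut result = String::with_capacity(capacity);
    for part in parts {
        result.push_str(part);
    }
    result
}

fn main() {
    let s1: String = "string1".to_owned();
    let s2: &str = "string2";
    let joined = concat_with_capacity(&[s1.as_str(), s2, "string3"]);
    println!("{joined}"); // string1string2string3
    // s1 was only borrowed and is still usable.
    println!("{s1}"); // string1
}
```

Pre-allocating avoids the repeated reallocation that can happen when push_str grows a String piece by piece.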
Alternatives with Rust statements.
The Rust 'std::concat!' macro only works with literals. Ex. concat!("test", 10, 'b', true)
Using the std::format! macro:
format!("{}{}{}", s1, s2, s3)
When adding strings with the + operator, the first string is moved (ownership is transferred) and each following string is borrowed.
s1.clone() + &s2 + &s3
s1.to_owned() + &s2 + &s3
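The ownership difference between format! and the + operator can be seen in a short standard-library example. The two function names are made up for this illustration.

```rust
/// Joins with format!, which only borrows its arguments.
fn join_with_format(a: &str, b: &str, c: &str) -> String {
    format!("{}{}{}", a, b, c)
}

/// Joins with `+`, which takes ownership of the first operand.
fn join_with_plus(first: String, b: &str, c: &str) -> String {
    first + b + c
}

fn main() {
    let s1 = String::from("string1");
    let s2 = String::from("string2");
    let s3 = String::from("string3");

    let a = join_with_format(&s1, &s2, &s3); // s1 is still usable
    let b = join_with_plus(s1.clone(), &s2, &s3); // the clone is moved, s1 stays
    let c = join_with_plus(s1, &s2, &s3); // s1 itself is moved here

    assert_eq!(a, b);
    assert_eq!(b, c);
    println!("{c}"); // string1string2string3
}
```

After the last call, s1 can no longer be used, while s2 and s3 remain available because they were only borrowed.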
Standard Rust methods
Standard Rust methods independent of character or byte indexing.
- replace : Replaces all matches of a pattern with another string.
- replacen : Replaces first N matches of a pattern with another string.
- strip_prefix : Returns a string slice with the prefix removed if the search string is found at the beginning of the string.
- strip_suffix : Returns a string slice with the suffix removed if the search string is found at the end of the string.
- contains : Check if the given pattern matches a sub-slice of this string slice.
- starts_with : Check if the given pattern matches a prefix of this string slice.
- ends_with : Check if the given pattern matches a suffix of this string slice.
- is_empty : Check if this String has a length of zero.
- chars() : Getting a substring with the chars iterator.
Examples:
fn main() {
let s1: &str = "test éèçà 123 test";
let s2: String = s1.to_owned();
println!("{}", s1.replace("test", "new")); // Result: "new éèçà 123 new"
println!("{}", s2.replacen("test", "new", 1)); // Result: "new éèçà 123 test"
match s1.strip_prefix("test ") {
// Result: Some("éèçà 123 test")
Some(s) => println!("Found: {}", s),
None => println!("Not found"),
};
let result = match s2.strip_prefix("test ") {
// Result: Some("éèçà 123 test")
Some(s) => s,
None => &s2,
};
println!("{result}");
assert_eq!(s1.contains("123"), true);
assert_eq!(s2.contains("123"), true);
assert_eq!(s1.starts_with("test"), true);
assert_eq!(s2.ends_with("test"), true);
assert_eq!(s1.is_empty(), false);
assert_eq!(s2.is_empty(), false);
}
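The examples above do not cover strip_suffix, which mirrors strip_prefix. A short sketch follows; the helper without_suffix is a hypothetical convenience wrapper, not a standard or library function.

```rust
/// Hypothetical helper: removes `suffix` if present, otherwise returns `s` unchanged.
fn without_suffix<'a>(s: &'a str, suffix: &str) -> &'a str {
    s.strip_suffix(suffix).unwrap_or(s)
}

fn main() {
    let s1: &str = "test éèçà 123 test";
    println!("{:?}", s1.strip_suffix(" test")); // Some("test éèçà 123")
    println!("{:?}", s1.strip_suffix("abc")); // None
    println!("{}", without_suffix(s1, " test")); // test éèçà 123
    println!("{}", without_suffix(s1, "abc")); // test éèçà 123 test
}
```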
Getting a substring with the Rust chars() method, which returns an iterator over the characters of the string. skip(), take() and count() consume the chars iterator.
fn main() {
let str: &str = "test éèçà 123 test";
let string: String = str.to_owned();
let start_index: usize = 5;
let length: usize = 4;
// All 4 results return éèçà
// With type annotation
let _s1: String = string.chars().skip(start_index).take(length).collect();
let _s2: String = str.chars().skip(start_index).take(length).collect();
// Without type annotation
let _s3 = string.chars().skip(start_index).take(length).collect::<String>();
let _s4 = str.chars().skip(start_index).take(length).collect::<String>();
let _total1 = str.chars().count(); // Length in characters (=18). Consumes the chars iterator.
let _total2 = string.chars().count(); // Length in characters (=18). Consumes the chars iterator.
}
Using byte positioning
Get a substring using byte positions with standard Rust methods.
Using a string slice:
use string_manipulation_utf8::str_concat;
fn main() {
let s1: &str = "test éèçà 123 test";
let s2: &str = "éèçà ";
let s2_pos_o: Option<usize> = s1.find(s2);
if s2_pos_o.is_some() {
let s2_len: usize = s2.len(); // Length in bytes
let s2_pos: usize = s2_pos_o.unwrap(); // Position of s2 in s1 in bytes
let s2_pos_end: usize = s2_pos + s2_len; // Byte position just past the end of s2 in s1
// Remove s2 from s1. Result: test 123 test
println!("{}", s1[..s2_pos].to_owned() + &s1[s2_pos_end..]);
// Same using the macro str_concat! from this library. Result: test 123 test
println!("{}", str_concat!(&s1[..s2_pos], &s1[s2_pos_end..]));
// Get characters from s1 after s2
println!("{}", &s1[s2_pos_end..]); // Result: 123 test
}
}
Using a string:
use string_manipulation_utf8::str_concat;
fn main() {
let s1: String = "test éèçà 123 test".to_owned();
let s2: &str = "éèçà ";
let s2_pos_o: Option<usize> = s1.find(s2);
if s2_pos_o.is_some() {
let s2_len: usize = s2.len(); // Length in bytes
let s2_pos: usize = s2_pos_o.unwrap(); // Position in bytes
let s2_pos_end: usize = s2_pos + s2_len;
// Remove s2 from s1. Result: test 123 test
println!("{}", s1[..s2_pos].to_owned() + &s1[s2_pos_end..]);
// Same using the macro str_concat! from this library. Result: test 123 test
println!("{}", str_concat!(&s1[..s2_pos], &s1[s2_pos_end..]));
// Get characters from s1 after s2 inside s1
println!("{}", &s1[s2_pos_end..]); // Result: 123 test
}
}
Shorter version:
use string_manipulation_utf8::str_concat;
fn main() {
// let s1: String = "test éèçà 123 test".to_owned(); // Also works with a string
let s1: &str = "test éèçà 123 test";
let s2: &str = "éèçà ";
let s2_pos_o: Option<usize> = s1.find(s2);
if s2_pos_o.is_some() {
println!("{}", str_concat!(&s1[..s2_pos_o.unwrap()], &s1[s2_pos_o.unwrap() + s2.len()..]));
// Result: test 123 test
}
}
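Byte slicing only works when both positions fall on character boundaries; slicing in the middle of a multi-byte character panics. The sketch below shows how the standard library lets you check this, and how a character index can be converted to a byte position. The helper char_to_byte_index is hypothetical.

```rust
/// Hypothetical helper: byte position at which the character with index
/// `char_index` begins, or None if the index is past the last character.
fn char_to_byte_index(s: &str, char_index: usize) -> Option<usize> {
    s.char_indices().nth(char_index).map(|(byte, _)| byte)
}

fn main() {
    let s = "test éèçà 123 test";

    // é starts at byte 5 and occupies bytes 5 and 6.
    println!("{}", s.is_char_boundary(5)); // true
    println!("{}", s.is_char_boundary(6)); // false

    // get() returns None instead of panicking on an invalid boundary.
    println!("{:?}", s.get(0..6)); // None
    println!("{:?}", s.get(0..7)); // Some("test é")

    // Character index 10 ('1') is byte position 14.
    if let Some(byte) = char_to_byte_index(s, 10) {
        println!("{}", &s[byte..byte + 3]); // 123
    }
}
```

Positions returned by find always lie on character boundaries, which is why the examples above can slice with them directly.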