Python-based string processing

One what is a string

1. In Python, strings are enclosed in single or double quotation marks.
2. Strings can also be enclosed in triple quotes: three single quotes or three double quotes on each side. A triple-quoted string can span multiple lines.
3. In Python 3, there are two string types. The first is str, which is a unicode string. The second is bytes, which is a sequence of raw bytes, such as an encoded string. bytes is the form suitable for saving to disk or transmitting over the network.
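
The three points above can be tried out in a few lines. This is only a small sketch; the variable names and sample text are made up for illustration.

```python
# A triple-quoted string may span several lines; the line break is kept.
poem = """first line
second line"""
print(poem.count("\n"))  # number of embedded line breaks

# str holds unicode text, bytes holds raw encoded data.
text = 'hello'
data = text.encode('utf-8')
print(type(text))
print(type(data))
```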

Two string concatenation

1. Use the plus sign

 a = 'hello'
 b = 'world'
 c = a + b
 print(c)

2. Use string formatting

 a = '___'
 b = 'abc'
 c = "%s%s%s" % (a,b,a)

Three string formatting

String formatting means pulling out the parts of a string that change often, marking them with placeholders, and filling them in later. There are two ways to format a string: the % operator and the format method.

1. Use the % form

course = 'python'
school = 'zhiliao'
intro = "I love %s,I study in %s" % (course,school)

The above fills in the placeholders using the % form. Different placeholders should be used for different data types:

  • String: use %s.
  • Integer: use %d.
  • Floating point: use %f. To specify the number of digits after the decimal point, use %.nf, where n is the number of digits: %.1f for one digit, %.2f for two, and so on.
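
To make the placeholders concrete, here is a short sketch that uses all three. The values of name, age and price are invented for the example.

```python
name = 'zhiliao'
age = 18
price = 3.14159

# %s for a string, %d for an integer
line = "%s is %d years old" % (name, age)

# %.nf keeps n digits after the decimal point
one_digit = "%.1f" % price
two_digits = "%.2f" % price

print(line)
print(one_digit)
print(two_digits)
```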

2. Use the format method

  • Use positional placeholders. The example code is as follows:
greet = "I love {},I study in {}".format('python','zhiliao')
  • Use keyword placeholders. The example code is as follows:
greet = "I love {arg1},I study in {arg2}".format(arg1='python',arg2='zhiliao')

Four string subscript

Subscript operation: a string behaves like a container, so it can be indexed in the same way as a list or tuple.
Example code:

    username = 'zhiliao'
    print(username[0])
    print(username[1])
    print(username[2])
    print(username[-1]) # negative indexes count from the end

Five slice operation

  • Start position: the slice includes the element at the start position. Negative numbers count from the end, and the last element is -1.
  • End position: the slice stops just before the end position, so the element at the end position itself is excluded. Negative numbers count from the end, and the last element is -1.
  • Step size: the distance between consecutive values. If it is not set, it defaults to 1. A positive step goes left to right and a negative step goes right to left.
  • Reverse order: to reverse a string, start from the end. The start position is -1, the step is -1 to move backwards, and the end position is left blank to take all the values.
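
The slicing rules are easier to follow with a concrete string. The sketch below reuses the username value from the subscript section; the other variable names are only illustrative.

```python
username = 'zhiliao'

head = username[0:3]       # positions 0, 1, 2; position 3 is excluded
tail = username[-3:]       # from the third-last character to the end
middle = username[1:-1]    # drops the first and the last character
every_other = username[::2]  # step of 2, left to right
backwards = username[::-1]   # step of -1 reverses the string

print(head, tail, middle, every_other, backwards)
```

Leaving the start and end blank with a step of -1 has the same effect as the reverse order described above, because Python then starts from the last element automatically.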

Six common string operations

1. find method

Returns the subscript position of the first occurrence of the substring. If -1 is returned, the substring was not found. rfind searches from right to left.

text = 'hello zhiliao'
position = text.find("zhiliao")
# Compare with -1: a match at the very start returns 0.
if position != -1:
	print('zhiliao is in text')
else:
	print('zhiliao is not in text')
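
One detail worth checking is the return value when the match sits at the very start of the string: find returns 0 there, so a comparison with -1 is the safer way to ask whether the substring was found. A small sketch, with sample text chosen only for illustration:

```python
text = 'python and python'

first = text.find('python')    # leftmost match, at the very start
last = text.rfind('python')    # rightmost match
missing = text.find('java')    # not present

print(first, last, missing)

# When only a yes/no answer is needed, the in operator reads more directly.
print('python' in text)
```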

2. index method

Very similar to find, but when the substring cannot be found it raises a ValueError instead of returning -1. rindex searches from the right.

text = 'hello zhiliao'
position = text.index("python") # raises ValueError because "python" is not in text
print(position)
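
Because index raises an exception for a missing substring, code that calls it usually needs a try block. The following is one possible way to handle it; falling back to -1 is just a choice made for this sketch.

```python
text = 'hello zhiliao'

try:
    position = text.index('python')
except ValueError:
    # index could not find the substring
    position = -1

print(position)
```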

3. len function

Gets the length of the string, that is, the number of characters it contains.

text = 'hello zhiliao'
length = len(text)
print(length)

4. count method

Used to get the number of times the substring appears in the original string.

text = 'hello python python'
count = text.count('zhiliao')
print(count)

5. The replace method will not change the value of the original string

Creates a new string in which occurrences of one substring are replaced by another. The optional third argument limits how many occurrences are replaced.

text = 'hello python python'
new_text = text.replace("python",'zhiliao',1)
print(text)
print(new_text)

6. split method

Splits the string on the given separator and returns a list.

text = 'hello python zhiliao'
words = text.split(" ")
for word in words:
	print(word)

7. startswith method

Determines whether a string starts with the given substring.

text = 'hello python'
if text.startswith("h"):
	print("text starts with h")
else:
	print('text does not start with h')

8. endswith method

Determines whether a string ends with the given substring.

text = 'hello python'
if text.endswith("python"):
	print("True")
else:
	print('False')

9. The lower method will not change the original string

Returns a copy of the string with all letters converted to lowercase.

text ='I am zhiliao'
new_text = text.lower()
print(text)
print(new_text)

10. The upper method will not change the value of the original string

Returns a copy of the string with all letters converted to uppercase.

text ='I am zhiliao'
new_text = text.upper()
print(text)
print(new_text)

11. strip method

Removes whitespace from both ends of the string.

text = '   python    '
new_text = text.strip()
print(text)
print(new_text)

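12. lstrip method

Removes whitespace from the left of the string. rstrip does the same on the right of the string.

text = '   python    '
new_text = text.lstrip()
right_text = text.rstrip()
print(text)
print(new_text)
print(right_text)

13. partition method

Splits the string at the first occurrence of str into a three-element tuple (string_pre_str, str, string_post_str). If str is not contained in the string, string_pre_str == string and the other two elements are empty strings.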

text = 'hello python zhiliao'
result = text.partition("python")
print(result)
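
The tuple that partition returns can be unpacked directly. The sketch below also covers the case where the separator is missing; the names before, sep and after are chosen only for this example.

```python
text = 'hello python zhiliao'

# Separator found: the text before it, the separator, the text after it
before, sep, after = text.partition('python')
print(repr(before), repr(sep), repr(after))

# Separator missing: the whole string comes first, then two empty strings
missing = text.partition('java')
print(missing)
```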

14. isalnum method

Returns True if the string has at least one character and all characters are letters or numbers, otherwise returns False.

text = 'zhiliao123.'
result = text.isalnum()
print(result)

15. isalpha method

Returns True if the string has at least one character and all characters are letters, otherwise returns False.

text = 'hello12'
result = text.isalpha()
print(result)

16. isdigit method

Returns True if the string has at least one character and all characters are digits, otherwise returns False.

text = '123sbc'
result = text.isdigit()
print(result)

17. isspace method

Returns True if the string has at least one character and all characters are whitespace, otherwise returns False.

text = "  "
print(text.isspace())
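
Since isalnum, isalpha, isdigit and isspace are easy to mix up, it can help to run all four on the same inputs. The sample strings below are invented for this comparison.

```python
samples = ['zhiliao', '123', 'zhiliao123', 'zhiliao123.', '  ', '']

results = {}
for s in samples:
    # One tuple per sample: (isalnum, isalpha, isdigit, isspace)
    results[s] = (s.isalnum(), s.isalpha(), s.isdigit(), s.isspace())
    print(repr(s), results[s])
```

Note that the empty string gives False for all four methods, because each of them requires at least one character.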

Seven escape characters

Escape character       Description
\ (at end of line)     Line continuation character
\n                     Line feed
\'                     Single quotation mark
\"                     Double quotation mark
\t                     Tab
\\                     Backslash

Eight raw strings

A raw string does not process any escape sequences in the string. What you write is what the string contains, so as to achieve a WYSIWYG effect.
Syntax: r'xxx'.

text = 'hello \
world'

print(text)


text = 'hello \nworld'
print(text)

text = "apple\"s\tprice is $9"
print(text)

text = '\\'
print(text)

# raw string: the backslash is kept as a literal character
text = r'abc\ncde'
print(text)
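
Comparing the lengths of the two forms shows what the r prefix changes. This is a small sketch using the same abc\ncde text; the variable names are illustrative.

```python
escaped = 'abc\ncde'   # \n is one character, a line feed
raw = r'abc\ncde'      # backslash and n are two separate characters

print(len(escaped))
print(len(raw))

# The raw string equals a normal string with the backslash escaped.
same = (raw == 'abc\\ncde')
print(same)
```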

Nine string encoding and decoding

In Python 3, every string literal is of unicode type (str) by default. Unicode is a universal character set that can represent any character. However, unicode strings exist only in memory; they cannot be written to disk or sent over a network directly. To store or transmit text, you must first convert it to bytes. Therefore, when writing code, we sometimes need to convert between unicode and bytes. The conversion functions are as follows:

  • encode('utf-8'): encode unicode into bytes, using UTF-8 as the encoding.
  • decode('utf-8'): decode bytes into unicode, using UTF-8 as the encoding.
  • utf-8 is one encoding; there are others, such as gbk and ascii.
text = 'hello world'
# str
# unicode
# unicode -> bytes: encode
text_bytes = text.encode("utf-8")
print(text_bytes)
print(type(text_bytes))

# bytes -> unicode: decode
text_bytes = b'hello world'
text = text_bytes.decode("utf-8")
print(text)
print(type(text))


from hashlib import md5
text = 'hello world'
result = md5(text.encode("utf-8")).hexdigest()
print(result)

# Writing text to a file encodes it, so name the encoding explicitly.
with open("abc.txt",'w',encoding="utf-8") as fp:
	fp.write("hello world")
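
The same text can turn into different bytes depending on the encoding, and a file written in text mode contains exactly those bytes. The sketch below assumes a mixed Chinese and English sample string and writes to a temporary directory instead of the working directory; both choices are only for illustration.

```python
import os
import tempfile

text = '你好 world'

# The same 8 characters take a different number of bytes per encoding.
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')
print(len(text), len(utf8_bytes), len(gbk_bytes))

# Write as text, then read back as raw bytes to see what was stored.
path = os.path.join(tempfile.mkdtemp(), 'abc.txt')
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(text)
with open(path, 'rb') as fp:
    raw = fp.read()

print(raw == utf8_bytes)
print(raw.decode('utf-8') == text)
```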

Ten Unicode strings

1. What is a unicode string

For historical reasons, in Python 2 the default string encoding is ASCII (Python was released earlier than unicode). An ASCII character is stored in one byte, that is, 8 bits. One byte can represent at most 2^8, that is, 256 different characters (ASCII itself defines only 128), which is far from enough for the world's languages. Taking Chinese alone, there are more than 6000 commonly used characters. Therefore, to meet the encoding needs of the languages of all countries, the Unicode Consortium proposed unicode. Originally, unicode used 2 bytes to store each character (UCS-2), which can represent 2^16, that is, 65536 characters, but this still cannot cover all the languages in the world. Therefore, a 4-byte form (UCS-4) was added later, which can contain all the characters in the world.

2. Differences between Unicode and other codes

  • unicode is a character set, equivalent to a dictionary. Every character or punctuation mark in the world corresponds to a number. To display a character on a computer, you use its number in the unicode character set.
  • utf-8, gbk, latin-1 and ascii are concrete encodings. In a fixed-width form of unicode such as UCS-2, every character is stored in two bytes, but an English letter such as a needs only one byte. Storing it in two bytes wastes disk space or network traffic, so fixed-width unicode is not well suited to storage. utf-8 is a variable-length encoding of unicode. It uses 8 bits, that is, one byte, for ASCII characters, and two to four bytes for characters that do not fit in one. Therefore, utf-8 saves space and can still represent all the characters in the world.
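
The difference between a character's number and its stored bytes can be checked in Python 3. In this sketch, ord gives the unicode number, and utf-16-le is used as a stand-in for the two-byte form described above; that substitution and the two sample characters are assumptions made for the example.

```python
# 'a' is U+0061 and '中' is U+4E2D; the samples are arbitrary.
samples = ['a', '中']

sizes = {}
for ch in samples:
    code_point = ord(ch)                      # number in the unicode character set
    utf8_len = len(ch.encode('utf-8'))        # variable length
    utf16_len = len(ch.encode('utf-16-le'))   # two bytes for both samples
    sizes[ch] = (code_point, utf8_len, utf16_len)
    print(hex(code_point), utf8_len, utf16_len)
```

The letter takes one byte in utf-8 but two in the two-byte form, which is the saving the text describes for English content.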

3. How to define unicode strings in Python 2

Add a u before the string, such as u'China'.

4. What problems can Unicode solve

Working with unicode strings avoids the garbled text and encoding errors that occur when byte strings in different encodings are mixed.

5. How to decode other forms of encoded strings into unicode strings

# Python 2: a plain str holds bytes, so it has a decode method
greet = 'Hello'
greet_unicode = greet.decode('utf-8')

6. How to encode unicode into other encoded strings

greet = u'Hello'
greet_utf8 = greet.encode('utf-8')

7. What is the function of sys.setdefaultencoding

# Python 2 only: Python 3 has no built-in reload and no sys.setdefaultencoding
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

This sets the default encoding used when strings are converted implicitly.
For example, if you use the unicode function to convert a str string into a unicode string, ascii encoding is used by default. After running the code above, utf-8 is used instead.

8. # coding: utf-8

What # coding: utf-8 is used for: it sets the encoding that the Python interpreter uses when reading this source file. Python 2 uses ascii by default, so you need to declare the file's encoding if it contains non-ASCII characters. Python 3 uses utf-8 by default, so Chinese is supported without the declaration.

Tags: Python

Posted by splat78423 on Sat, 14 May 2022 17:40:57 +0300