Data processing with Python's pandas module

1. Data import and export

(1) Data import and export of csv file

import pandas
# Import 1.csv data into the data variable
data = pandas.read_csv(
	# file path
	'D:/1.csv',
	# the python engine avoids errors when the path contains Chinese characters
	engine='python',
	# set encoding format
	encoding='utf8'
)

# Data output
# define data frame
data1 = pandas.DataFrame(
	data={
		'name': ['a', 'b'],
		'age': [12, 23]
	}
)
# Call the to_csv function to save the data to the data_csv file
data1.to_csv(
	'D:/data.csv',
	# This setting does not output the index
	index=False
)
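The round trip above can be sketched end to end. A temporary file stands in for the 'D:/' paths, which belong to the original example:

```python
import os
import tempfile

import pandas

# Build a small frame, write it to CSV without the index, and read it back
df = pandas.DataFrame({'name': ['a', 'b'], 'age': [12, 23]})
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
df.to_csv(path, index=False, encoding='utf8')
restored = pandas.read_csv(path, encoding='utf8')
print(restored.equals(df))  # True: the round trip preserves the data
```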

(2) Data import and export of txt and excel files

The read_table function is used here; only the main parameters are shown, and those interested can look up the full list in the documentation.

import pandas
# txt file data import
data = pandas.read_table(
	'D:/1.txt',
	engine='python',
	# set the list of column names
	names=['name', 'age'],
	# set comma as the delimiter
	sep=',',
	encoding='utf8'
)

# excel file data import
data1 = pandas.read_excel(
	'D:/1.xlsx',
	# the sheet name to read; defaults to the first sheet
	sheet_name='data'
)

# Data export (to_excel) is similar to the csv case and is not covered in detail here
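Since to_csv also writes plain delimiter-separated text, a txt round trip can be sketched as follows; the temporary path here stands in for 'D:/1.txt':

```python
import os
import tempfile

import pandas

# Write a tab-separated text file, then read it back with read_table
df = pandas.DataFrame({'name': ['a', 'b'], 'age': [12, 23]})
path = os.path.join(tempfile.mkdtemp(), '1.txt')
df.to_csv(path, sep='\t', index=False)
data = pandas.read_table(path, sep='\t')
print(data.shape)  # (2, 2)
```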

2. Data cleaning

(1) Data sorting

Here data is a data frame defined as above; use the sort_values function to sort.

sortData = data.sort_values(
	# columns to sort by
	by=['age', 'gender'],
	# ascending (True) or descending (False), per column
	ascending=[True, False]
)
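A runnable sketch of the call above, on a hypothetical frame with 'age' and 'gender' columns:

```python
import pandas

# Three rows; two share age 23 so the secondary sort on gender matters
data = pandas.DataFrame({
    'age': [23, 12, 23],
    'gender': ['f', 'm', 'm'],
})
sortData = data.sort_values(
    by=['age', 'gender'],
    ascending=[True, False],  # age ascending, gender descending
)
print(sortData.age.tolist())     # [12, 23, 23]
print(sortData.gender.tolist())  # ['m', 'm', 'f']
```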

(2) Removing duplicate data

Here data is a data frame defined as above. Use the duplicated function to find duplicate rows, then the drop_duplicates function to delete them.

# find duplicate rows
dIndex = data.duplicated()
# by default a row is marked True only when all of its columns repeat an earlier row
print(dIndex)
# find duplicates in a specific column only
d1Index = data.duplicated(['gender'])
# output duplicate data
data[data.duplicated()]

# Remove duplicate rows
# drop_duplicates deletes duplicates directly, comparing all columns by default
cData = data.drop_duplicates()
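A small sketch of both calls, on a hypothetical frame where the third row repeats the first:

```python
import pandas

data = pandas.DataFrame({
    'name': ['a', 'b', 'a'],
    'gender': ['m', 'f', 'm'],
})
# Row 2 repeats row 0 across all columns, so only it is flagged
print(data.duplicated().tolist())             # [False, False, True]
cData = data.drop_duplicates()
print(len(cData))                             # 2
# Deduplicating on 'gender' alone also leaves two rows here
print(len(data.drop_duplicates(['gender'])))  # 2
```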

(3) Missing and blank data processing

Here data is a data frame defined as above. Use the fillna function to fill in missing values, or the dropna function to drop them directly.
The str.strip function removes whitespace at the beginning and end of a string; its to_strip parameter can be set to strip other characters instead.

# Use the mean of Consumption (the mean() function) to fill in missing values
data['Consumption'] = data.Consumption.fillna(data.Consumption.mean())

# Delete missing values directly
cData = data.dropna()

# Remove spaces before and after a string
data['name'].str.strip()
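The three operations above, sketched on a hypothetical frame with a missing Consumption value and padded names:

```python
import pandas

data = pandas.DataFrame({
    'name': [' a ', 'b'],
    'Consumption': [100.0, None],
})
# fillna with the column mean leaves no gaps
filled = data.Consumption.fillna(data.Consumption.mean())
print(filled.tolist())                     # [100.0, 100.0]
# dropna removes the row with the missing value
print(len(data.dropna()))                  # 1
# str.strip removes the surrounding spaces
print(data['name'].str.strip().tolist())   # ['a', 'b']
```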

3. Data conversion

(1) Converting numeric values to strings

Here data is a data frame defined as above.

# Convert the phone number column to strings
# (bracket notation is needed because the column name contains a space)
data['phone number'] = data['phone number'].astype(str)
# Convert the strings back to numeric values
data['Numeric phone number'] = data['phone number'].astype(float)
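A runnable sketch with hypothetical phone-number data:

```python
import pandas

data = pandas.DataFrame({'phone number': [13800138000, 13900139000]})
# numeric -> string: the dtype becomes object
data['phone number'] = data['phone number'].astype(str)
print(data['phone number'].dtype)           # object
# string -> numeric: the dtype becomes float64
data['Numeric phone number'] = data['phone number'].astype(float)
print(data['Numeric phone number'].dtype)   # float64
```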

(2) Converting strings to datetimes

Here data is a data frame defined as above.

# Convert the Registration time column to time-typed data
# and store it in the 'time' column of the data frame
data['time'] = pandas.to_datetime(
	data['Registration time'],
	format='%Y/%m/%d'
)

# Time formatting: the result above also carries hours, but we only need year and month
data['years'] = data.time.dt.strftime('%Y-%m')
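Both steps, sketched on hypothetical registration dates in the %Y/%m/%d form used above:

```python
import pandas

data = pandas.DataFrame({'Registration time': ['2020/01/15', '2020/02/20']})
# Parse the strings into datetimes, then format back to year-month strings
data['time'] = pandas.to_datetime(data['Registration time'], format='%Y/%m/%d')
data['years'] = data.time.dt.strftime('%Y-%m')
print(data['years'].tolist())  # ['2020-01', '2020-02']
```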

4. Data extraction

Here data is a data frame defined as above.

(1) Field splitting

# 1. Split by location
# Convert to string format first, then take out the first 3 digits
data['tel']=data['tel'].astype(str)
data['tel'].str.slice(0,3)

# 2. Split by delimiter
# the first argument is the separator, n the number of splits (1 gives 2 columns),
# and expand=True returns the result as a data frame
data['name'].str.split(' ', n=1, expand=True)

# 3. Extract by time attribute
# Convert to time data type first
data['time'] = pandas.to_datetime(
	data['Registration time'],
	format='%Y%m%d %H:%M:%S'
)
# Extract various attributes of time and add them to the corresponding column
data['year'] = data['time'].dt.year
data['month'] = data['time'].dt.month
data['week'] = data['time'].dt.weekday
data['day'] = data['time'].dt.day
data['hour'] = data['time'].dt.hour
data['minute'] = data['time'].dt.minute
data['second'] = data['time'].dt.second
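The three splitting techniques, sketched together on a one-row hypothetical frame:

```python
import pandas

data = pandas.DataFrame({
    'tel': [13800138000],
    'name': ['John Smith'],
    'Registration time': ['20200811 10:30:45'],
})
# 1. by position: first three digits of the phone number
data['tel'] = data['tel'].astype(str)
print(data['tel'].str.slice(0, 3).tolist())   # ['138']
# 2. by delimiter: one split gives two columns
parts = data['name'].str.split(' ', n=1, expand=True)
print(parts.iloc[0].tolist())                 # ['John', 'Smith']
# 3. by time attribute
data['time'] = pandas.to_datetime(data['Registration time'],
                                  format='%Y%m%d %H:%M:%S')
print(data['time'].dt.year.tolist())          # [2020]
print(data['time'].dt.hour.tolist())          # [10]
```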

(2) Record extraction

# 1. Extract records based on keywords
# The first parameter is the pattern to match; na controls how missing values are
# treated and should be False when filtering
fData = data[data.title.str.contains('radio', na=False)]

# 2. Null value extraction
# Extract records whose title is null
fData = data[data.title.isnull()]

# 3. Value range extraction
# One-condition comparison operation
data[data.comments>10000]
# Range extraction; inclusive='both' includes the boundaries (older pandas accepted True here)
data[data.comments.between(1000, 10000, inclusive='both')]
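A sketch of both comparisons on hypothetical comment counts (inclusive='both' needs pandas 1.3 or later):

```python
import pandas

data = pandas.DataFrame({'comments': [500, 1000, 5000, 20000]})
# single-condition comparison
print(data[data.comments > 10000].comments.tolist())   # [20000]
# closed range: both boundaries are kept
inRange = data[data.comments.between(1000, 10000, inclusive='both')]
print(inRange.comments.tolist())                       # [1000, 5000]
```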

# 4. Time range extraction
data['ptime'] = pandas.to_datetime(
	data.ptime,
	format='%Y-%m-%d'
)
from datetime import datetime
# define time point
dt1=datetime(
	year=2019,
	month=1,
	day=1
)
dt2=datetime(
	year=2020,
	month=8,
	day=11
)
# Get data between dt1 and dt2
fData = data[(data.ptime>=dt1)&(data.ptime<=dt2)]

# Combined condition extraction
# ~ for negation
data[~data.title.str.contains('radio',na=False)]
data[
	(data.comments >=1000)&(data.comments <=10000)
	& data.title.str.contains('radio',na=False)
]
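The combined-condition pattern, sketched on a hypothetical frame with a missing title:

```python
import pandas

data = pandas.DataFrame({
    'title': ['FM radio', 'speaker', None],
    'comments': [5000, 2000, 8000],
})
# na=False makes the missing title count as a non-match instead of NaN
hits = data[
    (data.comments >= 1000) & (data.comments <= 10000)
    & data.title.str.contains('radio', na=False)
]
print(hits.title.tolist())                                      # ['FM radio']
# ~ negates the keyword match
print(len(data[~data.title.str.contains('radio', na=False)]))   # 2
```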

(3) Random sampling

# sample by number
data.sample(n=3)
# sampling by percentage
data.sample(frac=0.2)
# sampling with replacement
data.sample(n=3, replace=True)
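A sketch of the three sampling modes on a hypothetical ten-row frame:

```python
import pandas

data = pandas.DataFrame({'id': range(10)})
# by count
print(len(data.sample(n=3)))                 # 3
# by fraction: 20% of 10 rows is 2 rows
print(len(data.sample(frac=0.2)))            # 2
# with replacement, n may even exceed the number of rows
print(len(data.sample(n=15, replace=True)))  # 15
```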

5. Data merging

(1) Record merge and field merge

Here data1, data2 and data3 are the data frames to be merged.

# record merge
data= pandas.concat([data1, data2, data3])

# You can also specify columns to merge
data = pandas.concat(
	[
		data1[['id', 'comments']],
		data2[['comments', 'title']],
		data3[['id','title']]
	]
)

# Field merge
# convert all columns to strings first
data = data.astype(str)
# Merge fields
tel = data['band']+data['area']+data['num']
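Both merges, sketched on hypothetical phone-number fragments:

```python
import pandas

data1 = pandas.DataFrame({'band': ['138'], 'area': ['0013'], 'num': ['8000']})
data2 = pandas.DataFrame({'band': ['139'], 'area': ['0013'], 'num': ['9000']})
# record merge stacks the rows
data = pandas.concat([data1, data2])
print(len(data))    # 2
# field merge concatenates string columns element-wise
tel = data['band'] + data['area'] + data['num']
print(tel.tolist())  # ['13800138000', '13900139000']
```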

(2) Field matching

Use pandas.merge(left, right, left_on, right_on, how='inner') for this; the parameters are as follows:

parameter	description
left	the data frame on the left
right	the data frame on the right
left_on	the column name of the left data frame used in the join
right_on	the column name of the right data frame used in the join
how='inner'	join type; the default is inner, the alternatives are left, right and outer

# inner join
itemPrices = pandas.merge(
	data1,
	data2,
	left_on='id',
	right_on='id',
	how='inner'
)
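An inner join keeps only the ids present in both frames, as this sketch with hypothetical data1 and data2 shows:

```python
import pandas

data1 = pandas.DataFrame({'id': [1, 2, 3], 'comments': [10, 20, 30]})
data2 = pandas.DataFrame({'id': [2, 3, 4], 'price': [9.9, 19.9, 29.9]})
# only ids 2 and 3 appear in both frames
itemPrices = pandas.merge(data1, data2,
                          left_on='id', right_on='id', how='inner')
print(itemPrices.id.tolist())  # [2, 3]
```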

6. Data calculation

Simple numerical calculations, including four arithmetic operations

# time calculation

data['time'] = pandas.to_datetime(
	data['Registration time'],
	format='%Y%m%d'
)
from datetime import datetime
data['Registration days'] = datetime.now() - data['time']
# Only keep the whole number of days
data['Registration days'] = data['Registration days'].dt.days
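The time calculation, sketched end to end on hypothetical registration dates:

```python
from datetime import datetime

import pandas

data = pandas.DataFrame({'Registration time': ['20200101', '20200811']})
data['time'] = pandas.to_datetime(data['Registration time'], format='%Y%m%d')
# subtracting datetimes yields timedeltas; .dt.days keeps only whole days
data['Registration days'] = datetime.now() - data['time']
data['Registration days'] = data['Registration days'].dt.days
print(data['Registration days'].dtype)  # int64
```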

Data normalization
The 0-1 (min-max) standard is x* = (x - min) / (max - min)

# round(): the first argument is the data to round, the second the number of decimal places to keep
data['standardization of consumption']= round(
	(data.Consumption-data.Consumption.min())/
	(data.Consumption.max() - data.Consumption.min())
	,2
)
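The min-max formula above, sketched on hypothetical consumption values:

```python
import pandas

data = pandas.DataFrame({'Consumption': [100.0, 150.0, 200.0]})
# x* = (x - min) / (max - min), rounded to two decimal places
data['standardized'] = round(
    (data.Consumption - data.Consumption.min())
    / (data.Consumption.max() - data.Consumption.min()),
    2,
)
print(data['standardized'].tolist())  # [0.0, 0.5, 1.0]
```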

Data grouping (binning). Use the pandas.cut function; its parameters are as follows:

parameter	description
x	the column of the data frame (a Series) to bin
bins	a list specifying the group boundaries
right	whether the right edge of each interval is closed; the default is closed
labels	custom labels for the groups; optional

bins = [0, 20, 40, 60, 80, 100, 120]
cLabels = ['0 to 20','20 to 40','40 to 60','60 to 80','80 to 100','100 to 120']
data['cut'] = pandas.cut(
	data.cost, bins,
	right=False,
	labels =cLabels
)
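The binning above, sketched on hypothetical cost values:

```python
import pandas

data = pandas.DataFrame({'cost': [5, 25, 45, 100]})
bins = [0, 20, 40, 60, 80, 100, 120]
cLabels = ['0 to 20', '20 to 40', '40 to 60', '60 to 80',
           '80 to 100', '100 to 120']
# right=False makes the intervals left-closed: [0, 20), [20, 40), ...
data['cut'] = pandas.cut(data.cost, bins, right=False, labels=cLabels)
print(data['cut'].tolist())  # ['0 to 20', '20 to 40', '40 to 60', '100 to 120']
```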

Tags: Python Data Analysis pandas

Posted by EvilWalrus on Mon, 23 May 2022 07:22:19 +0300