1. Data import and export
(1) Data import and export of csv file
```python
import pandas

# Import 1.csv into the data variable
data = pandas.read_csv(
    # file path
    'D:/1.csv',
    # use the python engine so that Chinese characters in the path do not raise an error
    engine='python',
    # set the encoding format
    encoding='utf8'
)

# Data export
# define a data frame
data1 = pandas.DataFrame(
    data={
        'name': ['a', 'b'],
        'age': [12, 23]
    }
)

# call to_csv to save the data frame to a csv file
data1.to_csv(
    'D:/data.csv',
    # do not write the index column
    index=False
)
```
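The snippet above reads from a local path that may not exist on your machine; a minimal self-contained sketch using an in-memory csv (the column names and values are hypothetical) shows the same call:

```python
import pandas
from io import StringIO

# an in-memory csv stands in for D:/1.csv
csv_text = StringIO('name,age\na,12\nb,23\n')
data = pandas.read_csv(csv_text)
```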
(2) Data import and export of txt and excel files
The read_table function is used here; only the most common parameters are shown. See the pandas documentation for the full list.
```python
import pandas

# txt file data import
data = pandas.read_table(
    'D:/1.txt',
    engine='python',
    # list of column names
    names=['name', 'age'],
    # comma as the delimiter
    sep=',',
    encoding='utf8'
)

# excel file data import
data1 = pandas.read_excel(
    'D:/1.xlsx',
    # sheet name to read; defaults to the first sheet
    sheet_name='data'
)

# Data export is similar to the csv case and is not repeated here
```
2. Data cleaning
(1) Data sorting
Where data is a data frame, defined as above; use the sort_values function to sort.
```python
sortData = data.sort_values(
    # columns to sort by
    by=['age', 'gender'],
    # ascending (True) or descending (False), per column
    ascending=[True, False]
)
```
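Since the imported file isn't available here, a minimal self-contained sketch with a hypothetical in-memory frame demonstrates the sort:

```python
import pandas

# hypothetical sample frame standing in for the imported csv
data = pandas.DataFrame({
    'name': ['a', 'b', 'c'],
    'age': [23, 12, 12],
    'gender': ['f', 'm', 'f'],
})

# sort by age ascending, then gender descending
sortData = data.sort_values(
    by=['age', 'gender'],
    ascending=[True, False],
)
```

Among the two age-12 rows, 'm' sorts before 'f' because gender is descending.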
(2) Removing duplicate data
Where data is a data frame, defined as above; use the duplicated function to find duplicate rows, then drop_duplicates() to delete them.
```python
# find duplicate rows; by default a row is marked True
# when all of its columns repeat an earlier row
dIndex = data.duplicated()

# find duplicate values in a specific column
d1Index = data.duplicated(['gender'])

# output the duplicate rows
data[data.duplicated()]

# remove duplicate rows; by default all columns are compared
cData = data.drop_duplicates()
```
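A minimal runnable sketch with a hypothetical frame containing one fully duplicated row:

```python
import pandas

# hypothetical frame: the third row repeats the first
data = pandas.DataFrame({
    'name': ['a', 'b', 'a'],
    'gender': ['f', 'm', 'f'],
})

# True only for rows that repeat an earlier row across all columns
dIndex = data.duplicated()

# drop the duplicates
cData = data.drop_duplicates()
```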
(3) Missing and blank data handling
Where data is a data frame, defined as above; use fillna() to fill in missing values, or dropna() to drop rows with missing values.
The str.strip() function removes the spaces at the beginning and end of a string; you can also pass it the set of characters to strip (the to_strip parameter) to remove characters other than whitespace.
```python
# fill missing values with the column mean (mean() function)
data['Consumption'] = data['Consumption'].fillna(data['Consumption'].mean())

# drop rows with missing values
cData = data.dropna()

# remove leading and trailing spaces from a string column
data['name'] = data['name'].str.strip()
```
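A self-contained sketch with hypothetical Consumption and name columns, showing both the mean fill and the strip:

```python
import pandas

# hypothetical frame with a missing Consumption value and padded names
data = pandas.DataFrame({
    'name': [' a ', 'b', 'c'],
    'Consumption': [10.0, None, 30.0],
})

# fill the gap with the column mean: (10 + 30) / 2 = 20
data['Consumption'] = data['Consumption'].fillna(data['Consumption'].mean())

# strip surrounding spaces
data['name'] = data['name'].str.strip()
```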
3. Data conversion
(1) Conversion of numeric values to characters
Where data is a data frame, defined as above:
```python
# convert the phone number column to string
data['phone number'] = data['phone number'].astype(str)
# convert the string back to a numeric value
data['Numeric phone number'] = data['phone number'].astype(float)
```
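A runnable sketch with a hypothetical phone-number column:

```python
import pandas

# hypothetical frame with a numeric phone-number column
data = pandas.DataFrame({'phone number': [13800001111, 13900002222]})

# number -> string
data['phone number'] = data['phone number'].astype(str)
# string -> number
data['Numeric phone number'] = data['phone number'].astype(float)
```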
(2) Converting strings to datetimes
Where data is a data frame, defined as above:
```python
# parse the registration-time column into datetime values
# and store them in the 'time' column of the data frame
data['time'] = pandas.to_datetime(
    data['Registration time'],
    format='%Y/%m/%d'
)

# time formatting: the above keeps the full timestamp,
# but we only need the year and month
data['years'] = data['time'].dt.strftime('%Y-%m')
```
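A self-contained sketch with hypothetical registration dates in the same %Y/%m/%d format:

```python
import pandas

# hypothetical registration dates
data = pandas.DataFrame({'Registration time': ['2020/01/15', '2020/03/02']})

data['time'] = pandas.to_datetime(
    data['Registration time'],
    format='%Y/%m/%d',
)
# keep only the year and month
data['years'] = data['time'].dt.strftime('%Y-%m')
```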
4. Data extraction
Where data is a data frame, defined as above:
(1) Field splitting
```python
# 1. Split by position
# convert to string format first, then take the first 3 characters
data['tel'] = data['tel'].astype(str)
data['tel'].str.slice(0, 3)

# 2. Split by delimiter
# the first argument is the separator; n=1 splits into 2 columns;
# expand=True returns the result as a data frame
data['name'].str.split(' ', n=1, expand=True)

# 3. Extract by time attribute
# convert to a datetime type first
data['time'] = pandas.to_datetime(
    data['Registration time'],
    format='%Y%m%d %H:%M:%S'
)
# extract the individual time components into their own columns
data['year'] = data['time'].dt.year
data['month'] = data['time'].dt.month
data['week'] = data['time'].dt.weekday
data['day'] = data['time'].dt.day
data['hour'] = data['time'].dt.hour
data['minute'] = data['time'].dt.minute
data['second'] = data['time'].dt.second
```
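A runnable sketch of the position and delimiter splits, using hypothetical tel and name columns:

```python
import pandas

# hypothetical tel and name columns
data = pandas.DataFrame({
    'tel': [18812345678, 13998765432],
    'name': ['John Smith', 'Jane Doe'],
})

# first 3 digits of the (stringified) number
data['tel'] = data['tel'].astype(str)
prefix = data['tel'].str.slice(0, 3)

# split on the space into two columns
parts = data['name'].str.split(' ', n=1, expand=True)
```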
(2) Record extraction
```python
# 1. Extract records by keyword
# the first argument is the keyword to match; the na parameter sets how
# missing values are treated -- when filtering, set it to False
fData = data[data.title.str.contains('radio', na=False)]

# 2. Null-value extraction
# extract records whose title is null
fData = data[data.title.isnull()]

# 3. Value-range extraction
# single-condition comparison
data[data.comments > 10000]
# range extraction; inclusive='both' includes the boundaries
data[data.comments.between(1000, 10000, inclusive='both')]

# 4. Time-range extraction
data['ptime'] = pandas.to_datetime(
    data.ptime,
    format='%Y-%m-%d'
)
from datetime import datetime
# define the time points
dt1 = datetime(year=2019, month=1, day=1)
dt2 = datetime(year=2020, month=8, day=11)
# keep rows between dt1 and dt2
fData = data[(data.ptime >= dt1) & (data.ptime <= dt2)]

# Combined-condition extraction
# ~ negates a condition
data[~data.title.str.contains('radio', na=False)]
data[
    (data.comments >= 1000) & (data.comments <= 10000)
    & data.title.str.contains('radio', na=False)
]
```
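A self-contained sketch of the keyword, range, and combined filters, using hypothetical title and comments columns:

```python
import pandas

# hypothetical product titles and comment counts
data = pandas.DataFrame({
    'title': ['radio set', 'tv set', None, 'old radio'],
    'comments': [500, 5000, 20000, 8000],
})

# rows whose title contains the keyword (missing titles excluded)
byKeyword = data[data.title.str.contains('radio', na=False)]

# rows within a comment range, boundaries included
byRange = data[data.comments.between(1000, 10000, inclusive='both')]

# combined: in range AND mentions the keyword
combined = data[
    (data.comments >= 1000) & (data.comments <= 10000)
    & data.title.str.contains('radio', na=False)
]
```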
(3) Random sampling
```python
# sample a fixed number of rows
data.sample(n=3)
# sample a fraction of the rows
data.sample(frac=0.2)
# sample with replacement
data.sample(n=3, replace=True)
```
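A runnable sketch on a hypothetical 10-row frame; random_state (not in the original) pins the randomness for reproducibility:

```python
import pandas

data = pandas.DataFrame({'id': range(10)})

# fixed number of rows
s1 = data.sample(n=3, random_state=0)
# fraction of the rows: 20% of 10 rows is 2 rows
s2 = data.sample(frac=0.2, random_state=0)
# with replacement, more rows than the frame holds is allowed
s3 = data.sample(n=12, replace=True, random_state=0)
```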
5. Data merging
(1) Record merge and field merge
Where data1, data2, and data3 are the data frames to be merged:
```python
# record merge (stack the rows)
data = pandas.concat([data1, data2, data3])

# you can also select which columns to merge
data = pandas.concat([
    data1[['id', 'comments']],
    data2[['comments', 'title']],
    data3[['id', 'title']]
])

# Field merge
# convert the columns to strings first
data = data.astype(str)
# concatenate the fields
tel = data['band'] + data['area'] + data['num']
```
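A self-contained sketch of both merges, with hypothetical columns:

```python
import pandas

data1 = pandas.DataFrame({'id': [1], 'comments': [10]})
data2 = pandas.DataFrame({'id': [2], 'comments': [20]})

# record merge: stack the rows of the two frames
merged = pandas.concat([data1, data2])

# field merge: string columns concatenate element-wise
parts = pandas.DataFrame({'band': ['188'], 'area': ['010'], 'num': ['1234']})
tel = parts['band'] + parts['area'] + parts['num']
```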
(2) Field matching
Use pandas.merge(left, right, left_on, right_on, how='inner'); the parameters are as follows:
parameter | description |
---|---|
left | data frame on the left |
right | data frame on the right |
left_on | column name(s) of the left data frame used in the join |
right_on | column name(s) of the right data frame used in the join |
how='inner' | join type; defaults to inner, also left, right, outer |
```python
# inner join
itemPrices = pandas.merge(
    data1,
    data2,
    left_on='id',
    right_on='id',
    how='inner'
)
```
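A runnable sketch with two hypothetical frames sharing an id column; the inner join keeps only ids present in both:

```python
import pandas

data1 = pandas.DataFrame({'id': [1, 2, 3], 'title': ['a', 'b', 'c']})
data2 = pandas.DataFrame({'id': [2, 3, 4], 'price': [20, 30, 40]})

# inner join on id: only ids 2 and 3 appear in both frames
itemPrices = pandas.merge(
    data1, data2,
    left_on='id', right_on='id',
    how='inner',
)
```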
6. Data calculation
Simple numeric calculations cover the four arithmetic operations on columns; date arithmetic is a common case.
```python
# time calculation
data['time'] = pandas.to_datetime(
    data['Registration time'],
    format='%Y%m%d'
)

from datetime import datetime
# elapsed time since registration, as a timedelta
data['Registration days'] = datetime.now() - data['time']
# keep only the number of days
data['Registration days'] = data['Registration days'].dt.days
```
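A deterministic sketch of the same calculation; a fixed reference date (hypothetical) stands in for datetime.now():

```python
import pandas
from datetime import datetime

data = pandas.DataFrame({'Registration time': ['20200101', '20200110']})
data['time'] = pandas.to_datetime(data['Registration time'], format='%Y%m%d')

# fixed "today" instead of datetime.now(), so the result is reproducible
ref = datetime(2020, 1, 11)
data['Registration days'] = (ref - data['time']).dt.days
```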
Data normalization
The 0-1 (min-max) formula is x* = (x - min) / (max - min).
```python
# round(): the first argument is the value to round,
# the second is the number of decimal places to keep
data['standardized consumption'] = round(
    (data['Consumption'] - data['Consumption'].min())
    / (data['Consumption'].max() - data['Consumption'].min()),
    2
)
```
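A runnable sketch on a hypothetical Consumption column; with min 0 and max 100, the values map to 0.0, 0.25, and 1.0:

```python
import pandas

data = pandas.DataFrame({'Consumption': [0.0, 25.0, 100.0]})

# min-max normalization, rounded to 2 decimal places
data['standardized consumption'] = round(
    (data['Consumption'] - data['Consumption'].min())
    / (data['Consumption'].max() - data['Consumption'].min()),
    2,
)
```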
Data binning is done with the pandas.cut function; its parameters are as follows:
parameter | description |
---|---|
x | the column of the data frame (a Series) to bin |
bins | list of bin boundaries |
right | whether the right boundary of each bin is closed; defaults to closed |
labels | custom labels for the bins; optional |
```python
bins = [0, 20, 40, 60, 80, 100, 120]
cLabels = ['0 to 20', '20 to 40', '40 to 60',
           '60 to 80', '80 to 100', '100 to 120']
data['cut'] = pandas.cut(
    data.cost,
    bins,
    right=False,
    labels=cLabels
)
```
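A self-contained sketch with a hypothetical cost column; with right=False each bin is closed on the left and open on the right, so 20 falls into '20 to 40':

```python
import pandas

data = pandas.DataFrame({'cost': [5, 20, 99, 119]})

bins = [0, 20, 40, 60, 80, 100, 120]
cLabels = ['0 to 20', '20 to 40', '40 to 60',
           '60 to 80', '80 to 100', '100 to 120']

# right=False: intervals are [left, right)
data['cut'] = pandas.cut(data.cost, bins, right=False, labels=cLabels)
```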