Python massive data generation and processing
Reference: https://blog.csdn.net/quicktest/article/details/7453189
Overview
Generate 100 million pieces of data
The code is as follows:
# Generate 100 million random IPs under 10.197.0.0/16
import random

def generateRandom(rangeFrom, rangeTo):
    return random.randint(rangeFrom, rangeTo)

def generageMassiveIPAddr(fileLocation, numberOfLines):
    # Build all lines in memory first, then write them in one call
    IP = []
    file_handler = open(fileLocation, 'a+')
    for i in range(numberOfLines):
        IP.append('10.197.' + str(generateRandom(0, 255)) + '.' +
                  str(generateRandom(0, 255)) + '\n')
    file_handler.writelines(IP)
    file_handler.close()

if __name__ == '__main__':
    from time import ctime
    print(ctime())
    for i in range(10):  # 10 batches of 10 million lines each
        print(' ' + str(i) + ": " + ctime())
        generageMassiveIPAddr('d:\\massiveIP.txt', 10000000)
    print(ctime())
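Note that the file is opened in append mode ('a+'), so repeated runs keep adding lines to the same d:\massiveIP.txt; delete the file first if you want to start from a clean dataset.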
The program output is as follows:
Thu Dec 30 13:01:34 2021
 0: Thu Dec 30 13:01:34 2021
 1: Thu Dec 30 13:02:12 2021
 2: Thu Dec 30 13:02:50 2021
 3: Thu Dec 30 13:03:28 2021
 4: Thu Dec 30 13:04:07 2021
 5: Thu Dec 30 13:04:45 2021
 6: Thu Dec 30 13:05:25 2021
 7: Thu Dec 30 13:06:07 2021
 8: Thu Dec 30 13:06:46 2021
 9: Thu Dec 30 13:07:25 2021
Thu Dec 30 13:08:04 2021
Each batch of 10 million records takes about 40 s, and the full 100 million records take 6 min 30 s (390 s) in total. The resulting file is about 1.4 GB.
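A quick way to sanity-check this, assuming the path above, is to divide the on-disk size by the line count:

import os

size = os.path.getsize('d:\\massiveIP.txt')
print(size)                # total bytes on disk
print(size / 100_000_000)  # average bytes per line (about 15 on Windows,
                           # where text mode writes '\n' as '\r\n')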

Direct read test
Load the data
The code is as follows:
import pandas as pd
from time import ctime

print(ctime())
df = pd.read_csv("d:\\massiveIP.txt", header=None, names=["IP"])
print(ctime())
It took 29 s, and the output is as follows:
Thu Dec 30 13:20:24 2021
Thu Dec 30 13:20:53 2021
Check how much memory the DataFrame occupies:
df.info()
The output is as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 1 columns):
 #   Column  Dtype
---  ------  -----
 0   IP      object
dtypes: object(1)
memory usage: 762.9+ MB
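Note the '+' in '762.9+ MB': for object columns, info() by default counts only the 8-byte pointers (100,000,000 × 8 bytes ≈ 762.9 MB), not the strings they point to. A quick way to measure the real footprint, assuming df is still loaded:

df.memory_usage(deep=True)  # inspects the string payloads too; expect several GB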
Determine the most repeated IPs
To count how many times each value is repeated, you can use value_counts():
value_counts() is a quick way to see which distinct values appear in a column and how many times each one occurs. It is a Series method, so on a DataFrame you generally select a column (or row) first and call it on that.
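A minimal illustration on a toy Series:

import pandas as pd

s = pd.Series(["10.0.0.1", "10.0.0.2", "10.0.0.1"])
print(s.value_counts())
# 10.0.0.1    2
# 10.0.0.2    1
# dtype: int64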
%%time
df1 = df["IP"].value_counts()
df1
The output is:
Wall time: 31.6 s
10.197.87.47      1678
10.197.38.53      1677
10.197.42.238     1676
10.197.28.183     1676
10.197.63.208     1674
                  ...
10.197.30.195     1381
10.197.91.33      1379
10.197.7.231      1376
10.197.11.136     1366
10.197.241.199    1358
Name: IP, Length: 65536, dtype: int64
This takes 31.6 s. Note the Length: 65536, which matches the 256 × 256 possible combinations of the last two octets, so every possible address was generated at least once.
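Since value_counts() sorts in descending order, the most repeated IP is simply the first entry; a small sketch, assuming df1 from above:

print(df1.index[0], df1.iloc[0])  # most frequent IP and its count
# equivalently: df1.idxmax(), df1.max()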
Generate 500 million pieces of data
Since generating 100 million records was no strain, let's now generate 500 million. Change the main block to:
if __name__ == '__main__':
    from time import ctime
    print(ctime())
    for i in range(50):  # originally 10, now revised to 50
        print(' ' + str(i) + ": " + ctime())
        generageMassiveIPAddr('d:\\massiveIP.txt', 10000000)
    print(ctime())
It took 27 min 35.8 s. The resulting file is 7.04 GB (about 7,559,142,441 bytes): 500 million records at roughly 15.12 bytes each (each line carries an extra byte because Python's text mode writes '\n' as CRLF on Windows). Because the file is opened in append mode, the 100 million rows from the first run are still in it, which is consistent with the log below stopping after batch 39: 100 million existing rows plus 40 new batches of 10 million gives the 500 million total.
The output is:
Thu Dec 30 15:04:51 2021
 0: Thu Dec 30 15:04:51 2021
 1: Thu Dec 30 15:05:32 2021
 2: Thu Dec 30 15:06:12 2021
 3: Thu Dec 30 15:06:51 2021
 4: Thu Dec 30 15:07:29 2021
 5: Thu Dec 30 15:08:08 2021
 6: Thu Dec 30 15:08:48 2021
 7: Thu Dec 30 15:09:30 2021
 8: Thu Dec 30 15:10:11 2021
 9: Thu Dec 30 15:10:51 2021
 10: Thu Dec 30 15:11:30 2021
 11: Thu Dec 30 15:12:10 2021
 12: Thu Dec 30 15:12:54 2021
 13: Thu Dec 30 15:13:42 2021
 14: Thu Dec 30 15:14:23 2021
 15: Thu Dec 30 15:15:05 2021
 16: Thu Dec 30 15:15:44 2021
 17: Thu Dec 30 15:16:25 2021
 18: Thu Dec 30 15:17:05 2021
 19: Thu Dec 30 15:17:45 2021
 20: Thu Dec 30 15:18:23 2021
 21: Thu Dec 30 15:19:03 2021
 22: Thu Dec 30 15:19:47 2021
 23: Thu Dec 30 15:20:34 2021
 ...
 36: Thu Dec 30 15:29:28 2021
 37: Thu Dec 30 15:30:12 2021
 38: Thu Dec 30 15:30:58 2021
 39: Thu Dec 30 15:31:46 2021
Thu Dec 30 15:32:27 2021
Direct read test
Load the data
The code is as follows:
import pandas as pd
from time import ctime

print(ctime())
df = pd.read_csv("d:\\massiveIP.txt", header=None, names=["IP"])
print(ctime())
Open Resource Monitor and you can watch the memory usage: the memory occupied by VS Code climbs rapidly. I kept closing QQ, DingTalk, and the browsers I wasn't using, but to no avail...
After 2 min 49.5 s, the read fails with:
MemoryError: Unable to allocate 3.73 GiB for an array with shape (500000000,) and data type object
The 3.73 GiB in the message is only the array of object pointers (500,000,000 × 8 bytes ≈ 3.73 GiB); the strings themselves would need far more on top of that. Even after the failure the interpreter session does not hand the memory back on its own, so you can release it manually:
import gc

# Take a snapshot of the variable names first; mutating locals()
# while iterating over it directly is not safe
a = []
for x in list(locals().keys()):
    print(x)
    a.append(x)

# At the top level locals() is globals(), so the names can be deleted:
# for name in a:
#     del globals()[name]
gc.collect()
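In the common case where the large object was actually created and you are simply done with it, it is usually enough to drop the reference explicitly and trigger a collection; a minimal sketch, assuming df exists:

import gc

del df        # remove the last reference to the large DataFrame
gc.collect()  # force a garbage-collection pass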
If you want to adjust the available memory size, you can refer to: https://blog.csdn.net/com_fang_bean/article/details/106862052
But here, rather than adjusting the memory size, I load the data by other means.
Load data in chunks
Code:
import pandas as pd
from tqdm import tqdm

f = open('d:\\massiveIP.txt')
reader = pd.read_csv(f, sep=',', header=None, names=["IP"], iterator=True)

chunkSize = 100000000  # rows per chunk
chunks = []
for i in tqdm(range(10)):
    try:
        chunk = reader.get_chunk(chunkSize)
        # Reduce each chunk to its value counts right away and drop
        # the raw rows, so only the small summaries stay in memory
        df1 = chunk["IP"].value_counts()
        chunks.append(df1)
        del chunk
    except StopIteration:
        print("Iteration is stopped.")
This takes 6 min 3.6 s.
The output is:
100%|██████████| 10/10 [06:03<00:00, 36.33s/it]
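An equivalent and slightly more idiomatic pattern is to pass chunksize= to read_csv and iterate over the reader, which stops by itself at end of file instead of relying on a guessed number of get_chunk() calls; a sketch under the same assumptions as above:

import pandas as pd

chunks = []
reader = pd.read_csv('d:\\massiveIP.txt', header=None, names=["IP"],
                     chunksize=100000000)
for chunk in reader:  # iteration ends automatically at EOF
    chunks.append(chunk["IP"].value_counts())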
Combine the statistics for each chunk
Code:
re_chunks = []
for i in chunks:
    # reset_index() turns each value_counts Series into a DataFrame
    # with the IP in a column named "index" and the count in "IP"
    df1 = i.reset_index()
    re_chunks.append(df1)
df22 = pd.concat(re_chunks, ignore_index=True)
df22
The output is:
                 index    IP
0         10.197.87.47  1678
1         10.197.38.53  1677
2        10.197.42.238  1676
3        10.197.28.183  1676
4        10.197.63.208  1674
...                ...   ...
327675  10.197.215.196  1380
327676   10.197.18.130  1379
327677  10.197.251.175  1371
327678    10.197.57.85  1368
327679   10.197.115.87  1358
df22 has 327,680 rows: the 65,536 distinct IPs from each of the 5 chunks. Grouping by IP address, summing the per-chunk counts, and sorting gives the total count for each IP:
df22.groupby(by=["index"]).agg({"IP": "sum"}).reset_index().sort_values(by=["IP"], ascending=False)
The output is as follows:
                index    IP
32949   10.197.213.31  7982
48006   10.197.37.219  7972
63967     10.197.93.7  7961
40524  10.197.240.167  7946
45524     10.197.28.6  7945
...               ...   ...
54610   10.197.60.172  7302
8240   10.197.127.141  7293
59005   10.197.76.210  7292
38627   10.197.233.73  7286
11341  10.197.138.168  7282
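The same query can be written more compactly on the grouped Series; for example, the ten most frequent IPs (a sketch, assuming df22 from above):

df22.groupby("index")["IP"].sum().nlargest(10)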
Check if the result is correct
df22["IP"].sum()
The output is as follows:
500000000
This matches the original record count, indicating nothing went wrong along the way. With that, the pandas-based processing of this massive dataset is complete.