Generating and processing massive data in Python


Reference: https://blog.csdn.net/quicktest/article/details/7453189

Overview

Generate 100 million records

The code is as follows:

# Generate 100 million IP addresses
import random


def generateRandom(rangeFrom, rangeTo):
    return random.randint(rangeFrom, rangeTo)


def generateMassiveIPAddr(fileLocation, numberOfLines):
    # Build one batch of lines in memory, then append them to the file.
    IP = []
    file_handler = open(fileLocation, 'a+')
    for i in range(numberOfLines):
        IP.append('10.197.' + str(generateRandom(0, 255)) + '.'
                  + str(generateRandom(0, 255)) + '\n')
    file_handler.writelines(IP)
    file_handler.close()


if __name__ == '__main__':
    from time import ctime
    print(ctime())
    for i in range(10):       # 10 batches of 10 million lines each
        print('  ' + str(i) + ": " + ctime())
        generateMassiveIPAddr('d:\\massiveIP.txt', 10000000)
    print(ctime())

The program output is as follows:

Thu Dec 30 13:01:34 2021
  0: Thu Dec 30 13:01:34 2021
  1: Thu Dec 30 13:02:12 2021
  2: Thu Dec 30 13:02:50 2021
  3: Thu Dec 30 13:03:28 2021
  4: Thu Dec 30 13:04:07 2021
  5: Thu Dec 30 13:04:45 2021
  6: Thu Dec 30 13:05:25 2021
  7: Thu Dec 30 13:06:07 2021
  8: Thu Dec 30 13:06:46 2021
  9: Thu Dec 30 13:07:25 2021
Thu Dec 30 13:08:04 2021

As the timestamps show, each batch of 10 million records takes about 40 s, so the full 100 million records take about 6 min 30 s (390 s) in total. The resulting file is about 1.4 GB.
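
As a rough cross-check of the 1.4 GB figure, here is my own back-of-the-envelope arithmetic (not from the original post). It assumes the file is written in Windows text mode, so every '\n' is stored as '\r\n':

# Illustrative size estimate, not part of the generation script.
# A random octet averages ~2.57 characters (10 one-digit, 90 two-digit
# and 156 three-digit values out of 256), and text mode on Windows
# writes '\r\n' (2 bytes) for each '\n'.
avg_octet = (10 * 1 + 90 * 2 + 156 * 3) / 256        # ~2.57 characters
avg_line = len("10.197.") + 2 * avg_octet + 1 + 2    # ~15.1 bytes per line
print(avg_line * 100_000_000 / 2**30)                # ~1.41 GiB, i.e. ~1.4 GB
# Scaling to 500 million lines gives ~7.05 GiB, close to the 7.04 GB seen later.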

Direct read test

Load the data

The code is as follows:

import pandas as pd
from time import ctime
print(ctime())
df = pd.read_csv("d:\\massiveIP.txt",header=None,names=["IP"])
print(ctime())

It took 29s and the output is as follows:

Thu Dec 30 13:20:24 2021
Thu Dec 30 13:20:53 2021
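
As a side note, here is a hypothetical timing variant (not in the original) that reports the elapsed seconds directly instead of comparing two ctime() strings by hand:

import time
import pandas as pd

# Measure the read with a monotonic timer and print the elapsed seconds.
start = time.perf_counter()
df = pd.read_csv("d:\\massiveIP.txt", header=None, names=["IP"])
print(f"read_csv took {time.perf_counter() - start:.1f} s")   # ~29 s here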

Check how much memory the DataFrame occupies:

df.info()

The output is as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   IP      object
dtypes: object(1)
memory usage: 762.9+ MB
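
The "+" means pandas only counted the 8-byte object pointers in the column (100 million × 8 B ≈ 762.9 MiB); the Python string objects holding the IPs are not included. To measure the full footprint you can ask for a deep report, as sketched below (the exact number depends on the Python build):

# Deep measurement counts the string objects as well as the pointers,
# so the reported size is several GB rather than ~763 MB.
df.info(memory_usage="deep")
# Per-column byte counts:
print(df.memory_usage(deep=True))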

Find the most frequently repeated IPs

The number of duplicates for each value can be determined with value_counts().

value_counts() is a quick way to see which distinct values appear in a column and how many times each one occurs. It is a Series method, so when working with a DataFrame you normally apply it to a specific column.
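
A minimal toy illustration of what it returns (my own example, not from the article's data):

import pandas as pd

# value_counts() returns the count of each distinct value, sorted
# in descending order of frequency.
s = pd.Series(["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"])
print(s.value_counts())
# 10.0.0.1    3
# 10.0.0.2    1
# 10.0.0.3    1

Applied to the IP column of the full DataFrame: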

%%time
df1 = df["IP"].value_counts()
df1

output:

Wall time: 31.6 s
10.197.87.47      1678
10.197.38.53      1677
10.197.42.238     1676
10.197.28.183     1676
10.197.63.208     1674
                  ... 
10.197.30.195     1381
10.197.91.33      1379
10.197.7.231      1376
10.197.11.136     1366
10.197.241.199    1358
Name: IP, Length: 65536, dtype: int64

This takes 31.6 s.
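
The Length: 65536 is expected, since only the last two octets are random (256 × 256 combinations). A possible follow-up optimization, my own suggestion rather than something done in the original post, is to store the column as a categorical, which keeps a single copy of each distinct string plus a compact integer code per row:

# Optional memory optimization (not in the original post): with only
# 65,536 distinct values, a categorical column stores each IP string
# once plus a small integer code per row.
df["IP"] = df["IP"].astype("category")
df.info(memory_usage="deep")
# value_counts() works the same way on a categorical column:
df1 = df["IP"].value_counts()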

Generate 500 million records

Since generating 100 million records posed no problem, 500 million records are generated next:

if __name__ == '__main__':
    from time import ctime
    print(ctime())
    for i in range(50):   # originally 10, now 50 batches of 10 million lines
        print('  ' + str(i) + ": " + ctime())
        generateMassiveIPAddr('d:\\massiveIP.txt', 10000000)
    print(ctime())

This took 27 min 35.8 s. The resulting file is 7.04 GB (about 7,559,142,441 bytes) for 500 million records, i.e. roughly 15.12 bytes per record.

The output is:

Thu Dec 30 15:04:51 2021
  0: Thu Dec 30 15:04:51 2021
  1: Thu Dec 30 15:05:32 2021
  2: Thu Dec 30 15:06:12 2021
  3: Thu Dec 30 15:06:51 2021
  4: Thu Dec 30 15:07:29 2021
  5: Thu Dec 30 15:08:08 2021
  6: Thu Dec 30 15:08:48 2021
  7: Thu Dec 30 15:09:30 2021
  8: Thu Dec 30 15:10:11 2021
  9: Thu Dec 30 15:10:51 2021
  10: Thu Dec 30 15:11:30 2021
  11: Thu Dec 30 15:12:10 2021
  12: Thu Dec 30 15:12:54 2021
  13: Thu Dec 30 15:13:42 2021
  14: Thu Dec 30 15:14:23 2021
  15: Thu Dec 30 15:15:05 2021
  16: Thu Dec 30 15:15:44 2021
  17: Thu Dec 30 15:16:25 2021
  18: Thu Dec 30 15:17:05 2021
  19: Thu Dec 30 15:17:45 2021
  20: Thu Dec 30 15:18:23 2021
  21: Thu Dec 30 15:19:03 2021
  22: Thu Dec 30 15:19:47 2021
  23: Thu Dec 30 15:20:34 2021
  36: Thu Dec 30 15:29:28 2021
  37: Thu Dec 30 15:30:12 2021
  38: Thu Dec 30 15:30:58 2021
  39: Thu Dec 30 15:31:46 2021
Thu Dec 30 15:32:27 2021

Direct read test

Load the data

The code is as follows:

import pandas as pd
from time import ctime
print(ctime())
df = pd.read_csv("d:\\massiveIP.txt",header=None,names=["IP"])
print(ctime())

Opening the resource monitor shows the memory occupied by VS Code climbing rapidly. Even after closing QQ, DingTalk and the browsers I was not using, the result is the same.

After 2 min 49.5 s the read fails with:

MemoryError: Unable to allocate 3.73 GiB for an array with shape (500000000,) and data type object
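
The 3.73 GiB in the message corresponds to the array of object pointers for the 500 million rows, as this illustrative calculation (not from the original post) shows:

# The failed allocation is the object-pointer array: 8 bytes per row
# on a 64-bit Python build.
rows = 500_000_000
print(rows * 8 / 2**30)   # -> 3.725..., the "3.73 GiB" pandas asked for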

Even after the error, the interpreter session does not release the memory. You can release it manually:

import gc

# Take a snapshot of the variable names first; the namespace dict must not
# be modified while it is being iterated over.
a = list(locals().keys())
for name in a:
    print(name)

# Uncomment to actually drop the references that hold the large objects:
# for name in a:
#     del locals()[name]
gc.collect()

If you want to adjust the available memory size, you can refer to: https://blog.csdn.net/com_fang_bean/article/details/106862052

Instead of adjusting the memory size, however, I load the data a different way.

Load data in chunks

Code:

import pandas as pd
from tqdm import tqdm

# Read the file through an iterator, 100 million rows at a time, and keep
# only each chunk's value_counts() result in memory.
f = open('d:\\massiveIP.txt')
reader = pd.read_csv(f, sep=',', header=None, names=["IP"], iterator=True)
chunkSize = 100000000
chunks = []

for i in tqdm(range(10)):
    try:
        chunk = reader.get_chunk(chunkSize)
        df1 = chunk["IP"].value_counts()
        chunks.append(df1)
        del chunk             # drop the raw chunk before reading the next one
    except StopIteration:
        print("Iteration is stopped.")

This takes 6 min 3.6 s.

The output is:

100%|██████████| 10/10 [06:03<00:00, 36.33s/it]
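
For reference, pandas can also hand back the chunks directly when read_csv is given a chunksize argument, which makes the reader iterable and avoids the manual get_chunk / StopIteration handling. This is an equivalent sketch, not the code that was timed above:

import pandas as pd
from tqdm import tqdm

# With chunksize=..., read_csv returns an iterator of DataFrames.
reader = pd.read_csv("d:\\massiveIP.txt", header=None, names=["IP"],
                     chunksize=100_000_000)
chunks = []
for chunk in tqdm(reader):
    # Keep only the per-chunk statistics, not the raw rows.
    chunks.append(chunk["IP"].value_counts())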

Combine the statistics from each chunk

Code:

re_chunks = []
for i in chunks:
    df1 = i.reset_index()   # turn each value_counts Series into a two-column DataFrame
    re_chunks.append(df1)

df22 = pd.concat(re_chunks, ignore_index=True)
df22

The output is:

	index	IP
0	10.197.87.47	1678
1	10.197.38.53	1677
2	10.197.42.238	1676
3	10.197.28.183	1676
4	10.197.63.208	1674
...	...	...
327675	10.197.215.196	1380
327676	10.197.18.130	1379
327677	10.197.251.175	1371
327678	10.197.57.85	1368
327679	10.197.115.87	1358

Group by IP and sum the per-chunk counts, then sort in descending order to get the total count for each IP:

df22.groupby(by=["index"]).agg({"IP":"sum"}).reset_index().sort_values(by=["IP"],ascending=False)

The output is as follows:

index	IP
32949	10.197.213.31	7982
48006	10.197.37.219	7972
63967	10.197.93.7	7961
40524	10.197.240.167	7946
45524	10.197.28.6	7945
...	...	...
54610	10.197.60.172	7302
8240	10.197.127.141	7293
59005	10.197.76.210	7292
38627	10.197.233.73	7286
11341	10.197.138.168	7282
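
A slightly more compact alternative, my own variant rather than the method used above, is to skip the reset_index() step and sum the per-chunk value_counts Series directly on their shared index:

# Concatenate the per-chunk Series as-is, sum counts that share the same
# IP (the index), and sort in descending order.
totals = pd.concat(chunks).groupby(level=0).sum().sort_values(ascending=False)
print(totals.head())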

Check if the result is correct

df22["IP"].sum()

The output is as follows:

500000000

This matches the original record count, so nothing was lost along the way. With that, the pandas-based processing of this massive data set is complete.
