Hi everyone! I'm a little panda ❤
Why do we need an IP proxy?
When collecting data in batches, if the requests come in too fast,
the website may block your IP:
<Your network cannot access this site>
An IP proxy lets you switch to another IP and keep requesting the data
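Before diving in, here is a minimal sketch of what a proxy looks like to the requests library: a dictionary mapping scheme to proxy address. The 127.0.0.1:8888 address below is a placeholder for illustration, not a real working proxy.

import requests

# proxies: a dict mapping scheme to proxy address; the address here is a placeholder
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'https://127.0.0.1:8888',
}
try:
    # the request is routed through the proxy instead of your own IP
    response = requests.get('https://www.baidu.com/', proxies=proxies, timeout=1)
    print(response.status_code)
except requests.exceptions.RequestException:
    # a dead proxy (like this placeholder) fails instead of exposing your IP
    print('proxy is not usable')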
1. Packet capture: analyze the data source
1. Clarify the requirements:
- Identify the site to collect from and what data we want to collect
- Get IP proxies and check whether each proxy is usable
dit = { 'http': 'http://' + IP + ':' + PORT }
2. Analyze which page request returns the two pieces of data we need, the IP and the port number:
- Press F12 (or right-click > Inspect), switch to the Network tab, and refresh the page
- Find where the data comes from --> search the captured requests by a keyword from <the data we want>
Requesting https://www.kuaidaili.com/free/ returns a response containing the data we want, the IP and port; a quick check of this is shown below
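To confirm the IP and port really sit in that page's static HTML (rather than being loaded by a separate API), a minimal sketch: fetch the page and search the raw text for a keyword. The 'data-title="IP"' marker is taken from the table markup parsed later; the page structure may have changed since this was written.

import requests

url = 'https://www.kuaidaili.com/free/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# True means the proxy table is in the static HTML we just downloaded
print('data-title="IP"' in response.text)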
2. Code implementation steps
- Send a request: simulate a browser and request the url address
- Parse the data: extract the content we want
- Detect the proxies: request a website through each IP proxy to see if it works
- Save the data: write the IP proxies that pass the detection to a local file
Code implementation
Send a request
Simulate and disguise as a browser --> headers, a request-header dictionary

import requests

# request link
url = 'https://****/free/inha/1/'
headers = {
    # User-Agent: the user agent, the basic identity information of the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
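A quick sanity check, assuming the request above went through, is to print the status code before moving on to parsing:

# 200 means the server accepted the disguised request
print(response.status_code)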
Parse the data
Three parsing methods
re: extracts data directly from string data
- re.findall('what data to match', 'where to match from') finds the content we want: from where, match what data. Here we match from response.text, and (.*?) is the data we want - the () marks the data you want to keep, while .*? is a matching rule that matches any characters (except the newline \n)

import re

IP_list = re.findall('<td data-title="IP">(.*?)</td>', response.text)
PORT_list = re.findall('<td data-title="PORT">(.*?)</td>', response.text)
css: extracts data based on tag attributes
#list tbody tr td:nth-child(1) locates the tag element
td:nth-child(1) means the first td tag; td:nth-child(1)::text means extract the text inside the first td tag
getall() means get all matches

import parsel

# css (and xpath below) need a Selector object; the parsel library is assumed here
selector = parsel.Selector(response.text)
IP_list = selector.css('#list tbody tr td:nth-child(1)::text').getall()
PORT_list = selector.css('#list tbody tr td:nth-child(2)::text').getall()
xpath: extracts data based on tag nodes
IP_list = selector.xpath('//*[@id="list"]//tbody/tr/td[1]/text()').getall()
PORT_list = selector.xpath('//*[@id="list"]//tbody/tr/td[2]/text()').getall()
Use a for loop to traverse the extracted data, taking the elements out of the two lists one pair at a time:
for IP, PORT in zip(IP_list, PORT_list):
    dit = {
        'http': 'http://' + IP + ':' + PORT,
        'https': 'https://' + IP + ':' + PORT,
    }
    print(dit)
Detect whether each IP proxy is usable: request a website while carrying the IP proxy
use_list = []    # defined before the loop; collects the proxies that pass the check

# the following goes inside the for loop above, after dit is built
try:
    # send the request while carrying the IP proxy
    response_1 = requests.get(url='https://www.baidu.com/', headers=headers, proxies=dit, timeout=1)
    # response_1.status_code gets the status code
    if response_1.status_code == 200:
        print(dit, 'This proxy is really good')
        use_list.append(dit)
except:
    print(dit, 'He tui~ useless')
Save the data

with open('proxy https.txt', mode='w', encoding='utf-8') as f:
    f.write('\n'.join([str(i) for i in use_list]))
Since this was just a demonstration, only 45 proxies were collected in total, and none of them turned out to be usable
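If you do end up with usable proxies, here is a minimal sketch of how the saved file could be read back, assuming the write-out above: each line is the str() of a proxies dict, so ast.literal_eval can rebuild it.

import ast

# each saved line is the str() of a proxies dict, e.g. {'http': 'http://...', 'https': 'https://...'}
with open('proxy https.txt', mode='r', encoding='utf-8') as f:
    proxy_list = [ast.literal_eval(line) for line in f if line.strip()]
print(proxy_list)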