Scraping free proxy IPs with Python: why are so few of them usable?

Hi everyone! I'm Little Panda ❤

Why an IP proxy is needed:

When collecting data in batches, the requests go out too fast and the website may block your IP, showing a page like <Your network cannot access this site>. Switching the proxy to another IP lets you keep sending requests and collecting data.


1. Capture packets and analyze the data source

1. Clarify the requirements:

  • Identify the target site and what data to collect
  • Get an IP proxy and check whether it is usable
dit = {
    'http': 'http://' + IP + ':' + port
}
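For reference, the proxies mapping that requests expects is a scheme-to-URL dictionary; the IP and port below are placeholder values, not real proxies:

```python
# Placeholder values for illustration only
IP, port = '110.243.21.15', '9999'

# requests expects a {scheme: proxy-url} mapping
dit = {
    'http': 'http://' + IP + ':' + port,
}
print(dit)  # {'http': 'http://110.243.21.15:9999'}
```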

2. Analyze where the two pieces of data, IP and port, come from — which URL can be requested for them?

  • Press F12 (or right-click --> Inspect), open the Network tab, and refresh the page
  • Find where the data lives --> search the captured responses for a keyword from <the data we want>

Requesting https://www.kuaidaili.com/free/ returns a response containing the data we want: the IPs and ports.

2. Code implementation steps

  1. Send a request: simulate a browser and request the URL address
  2. Parse the data: extract the content we want
  3. Detect the proxies: test whether each IP proxy actually works
  4. Save the data: write the proxies that pass the check to a local file
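The four steps above can be sketched as a small pipeline. This is a stdlib-only sketch (urllib stands in for requests here, and the function names are my own, not from the post):

```python
import re
import urllib.request


def fetch_page(url: str) -> str:
    """Step 1: request the page while presenting a browser User-Agent."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode('utf-8')


def parse_proxies(html: str) -> list:
    """Step 2: pull IPs and ports out of the HTML table cells."""
    ips = re.findall('<td data-title="IP">(.*?)</td>', html)
    ports = re.findall('<td data-title="PORT">(.*?)</td>', html)
    return [{'http': 'http://' + ip + ':' + port} for ip, port in zip(ips, ports)]


def is_alive(proxy: dict, test_url: str = 'https://www.baidu.com/') -> bool:
    """Step 3: a proxy is usable only if a test request through it succeeds."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    try:
        return opener.open(test_url, timeout=1).status == 200
    except OSError:
        return False


def save(proxies: list, path: str) -> None:
    """Step 4: write the working proxies to a local file, one per line."""
    with open(path, mode='w', encoding='utf-8') as f:
        f.write('\n'.join(str(p) for p in proxies))
```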

Code implementation

Send the request

Disguise the request as a browser --> headers, a dictionary of request headers

import requests

# request link
url = 'https://****/free/inha/1/'
headers = {
    # User-Agent carries the basic identity information of the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}
response = requests.get(url=url, headers=headers)

Parse the data

Three parsing methods

re: extracts data directly from the string

  • re.findall('what data to match', 'where to match') finds the data we want: match (.*?) inside response.text, where the (.*?) group is the data we want
  • () marks the data to capture; .*? lazily matches any characters (except the newline \n)
import re

IP_list = re.findall('<td data-title="IP">(.*?)</td>', response.text)
PORT_list = re.findall('<td data-title="PORT">(.*?)</td>', response.text)
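Here is the same regex run on a self-contained, fabricated HTML fragment (the sample rows are made up for illustration):

```python
import re

# A fabricated fragment mimicking the table rows on the free-proxy page
sample_html = '''
<tr>
    <td data-title="IP">110.243.21.15</td>
    <td data-title="PORT">9999</td>
</tr>
<tr>
    <td data-title="IP">182.34.102.50</td>
    <td data-title="PORT">8888</td>
</tr>
'''

# () captures the data we want; .*? matches lazily, stopping at the first </td>
IP_list = re.findall('<td data-title="IP">(.*?)</td>', sample_html)
PORT_list = re.findall('<td data-title="PORT">(.*?)</td>', sample_html)

print(IP_list)    # ['110.243.21.15', '182.34.102.50']
print(PORT_list)  # ['9999', '8888']
```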

css: extract data based on tag attributes

#list tbody tr td:nth-child(1) locates the element:
td:nth-child(1) means the first td tag; td:nth-child(1)::text means extract the text inside that first td tag;
getall() means get all matches

import parsel  # assuming the parsel library (pip install parsel)

selector = parsel.Selector(text=response.text)
IP_list = selector.css('#list tbody tr td:nth-child(1)::text').getall()
PORT_list = selector.css('#list tbody tr td:nth-child(2)::text').getall()

xpath: extract data based on tag nodes

IP_list = selector.xpath('//*[@id="list"]//tbody/tr/td[1]/text()').getall()
PORT_list = selector.xpath('//*[@id="list"]//tbody/tr/td[2]/text()').getall()
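If neither parsel nor Scrapy's selectors are on hand, the same column extraction can be done with the standard library's html.parser. This is a sketch on a fabricated fragment, not the post's original approach:

```python
from html.parser import HTMLParser


class ProxyTableParser(HTMLParser):
    """Collects the text of <td data-title="IP"> and <td data-title="PORT"> cells."""

    def __init__(self):
        super().__init__()
        self.current = None          # which column we are inside, if any
        self.ips, self.ports = [], []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.current = dict(attrs).get('data-title')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.current = None

    def handle_data(self, data):
        if self.current == 'IP':
            self.ips.append(data.strip())
        elif self.current == 'PORT':
            self.ports.append(data.strip())


# Fabricated sample row mimicking the proxy table
parser = ProxyTableParser()
parser.feed('<tr><td data-title="IP">1.2.3.4</td>'
            '<td data-title="PORT">8080</td></tr>')
print(parser.ips, parser.ports)  # ['1.2.3.4'] ['8080']
```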

A for loop walks the two lists in parallel, pairing each extracted IP with its port:

for IP, PORT in zip(IP_list, PORT_list):
    dit = {
        'http': 'http://' + IP + ':' + PORT,
        'https': 'https://' + IP + ':' + PORT,
    }
    print(dit)

Detect whether the IP proxy works: send a request to a website through the proxy

try:
    # Send the request through the IP proxy (use_list is an empty list created before the loop)
    response_1 = requests.get(url='https://www.baidu.com/', headers=headers, proxies=dit, timeout=1)
    # response_1.status_code holds the HTTP status code
    if response_1.status_code == 200:
        print(dit, 'this proxy is really good')
        use_list.append(dit)
except requests.RequestException:
    print(dit, 'pfft~ useless')
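The check can be wrapped as a reusable function. The name check_proxy and the test URL default are my own choices, not from the post, and the demo proxy below is a deliberately dead placeholder:

```python
import requests  # third-party: pip install requests


def check_proxy(proxy: dict, test_url: str = 'https://www.baidu.com/',
                timeout: float = 1.0) -> bool:
    """Return True if a request through `proxy` succeeds with HTTP 200."""
    try:
        response = requests.get(test_url, proxies=proxy, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        # Connection refused, timeout, malformed proxy, ... -> unusable
        return False


if __name__ == '__main__':
    # Nothing listens on port 1, so this proxy is unusable by construction
    bad = {'http': 'http://127.0.0.1:1', 'https': 'http://127.0.0.1:1'}
    print(check_proxy(bad, timeout=0.5))  # False
```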

Save

with open('proxy_https.txt', mode='w', encoding='utf-8') as f:
    f.write('\n'.join([str(i) for i in use_list]))
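Since each saved line is the str() of a dict, the file can be read back with ast.literal_eval. A round-trip sketch using a temporary file (the sample proxies are placeholders):

```python
import ast
import tempfile

# Placeholder proxies for illustration
use_list = [
    {'http': 'http://110.243.21.15:9999'},
    {'http': 'http://182.34.102.50:8888'},
]

with tempfile.NamedTemporaryFile('w+', suffix='.txt', encoding='utf-8') as f:
    # Write one dict literal per line, then rewind and read them back
    f.write('\n'.join(str(i) for i in use_list))
    f.seek(0)
    # ast.literal_eval safely parses each line back into a dict
    loaded = [ast.literal_eval(line) for line in f.read().splitlines() if line]

print(loaded == use_list)  # True
```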

Since this was just a demonstration, only 45 proxies were collected in total, and none of them turned out to be usable.



Tags: Python TCP/IP programming language

Posted by cavolks on Fri, 18 Nov 2022 21:22:57 +0300