
Daily Scraping Practice

Source: https://blog.csdn.net/2301_77869606/article/details/143771872

1. Scraping sites that detect WebDriver automation

Some sites detect Selenium's webdriver flag as an anti-scraping measure, so a few tweaks are needed to keep the page from spotting the automation. The boilerplate below can be dropped in as-is:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# Configure ChromeOptions to avoid detection
options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--incognito")  # use incognito mode
# options.add_argument('--headless')  # headless mode; keep commented out while developing/debugging

# Create the WebDriver instance
driver = webdriver.Chrome(options=options)
# Overwrite navigator.webdriver before any page script runs
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"""
})
wait = WebDriverWait(driver, 10)
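
As a quick sanity check (my own addition, not part of the original snippet), you can ask the browser what navigator.webdriver reports after loading a page; with the CDP script installed it should come back as None:

# Sanity check (not in the original post): with the CDP script installed,
# navigator.webdriver evaluates to undefined, which Selenium returns as None.
driver.get('https://antispider1.scrape.center/')
print(driver.execute_script("return navigator.webdriver"))  # expect: None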

Putting the anti-detection setup to work against a real site:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.support.ui import WebDriverWait

# Configure ChromeOptions to avoid detection
options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("--incognito")  # use incognito mode
# options.add_argument('--headless')  # headless mode; keep commented out while developing/debugging

# Create the WebDriver instance
driver = webdriver.Chrome(options=options)
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"""
})
wait = WebDriverWait(driver, 10)

driver.get('https://antispider1.scrape.center/')
time.sleep(5)

all_data_dict = {}
for i in range(10):
    titles = driver.find_elements(By.XPATH, '//*[@id="index"]/div[1]/div[1]/div/div/div/div[2]/a/h2')
    grades = driver.find_elements(By.XPATH, '//*[@id="index"]/div[1]/div[1]/div/div/div/div[3]/p[1]')
    data_dict = {}
    for title, grade in zip(titles, grades):
        data_dict[title.text] = grade.text
    all_data_dict.update(data_dict)
    print(f"Page {i + 1}: {data_dict}")
    try:
        # Click the next-page button
        driver.find_element(By.XPATH, '//*[@id="index"]/div[2]/div/div/div/button[2]/i').click()
        time.sleep(3)
    except Exception as e:
        print(f"Error clicking next page: {e}")

# Print the data from all pages
print('-' * 80)
print("All data:", all_data_dict)
time.sleep(3)
driver.quit()
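
One thing worth noting: the script creates wait but never uses it. The fixed time.sleep calls can be replaced with explicit waits, which are both faster and more robust. A minimal sketch of my own, reusing the same XPaths from the script above:

from selenium.webdriver.support import expected_conditions as EC

# Wait until the title elements are present instead of sleeping a fixed 5 seconds
wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//*[@id="index"]/div[1]/div[1]/div/div/div/div[2]/a/h2')))

# Wait until the next-page button is clickable before clicking it
next_btn = wait.until(EC.element_to_be_clickable(
    (By.XPATH, '//*[@id="index"]/div[2]/div/div/div/button[2]/i')))
next_btn.click()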

2. Multi-page scraping of Boss Zhipin

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('https://www.zhipin.com/?ka=header-home-logo')
time.sleep(3)

# Type the query into the search box and submit it with Enter
searching = driver.find_element(By.XPATH, '//*[@id="wrap"]/div[3]/div/div[1]/div[1]/form/div[2]/p/input')
searching.send_keys('大数据爬虫')
time.sleep(3)
searching.send_keys(Keys.ENTER)
time.sleep(10)

for i in range(3):
    print(f'Scraping page {i + 1}......')
    titles = driver.find_elements(By.XPATH, '//span[@class="job-name"]')
    prices = driver.find_elements(By.XPATH, '//span[@class="salary"]')
    for title, price in zip(titles, prices):
        print(f'Job title: {title.text}')
        print(f'Salary: {price.text}')
    # Click the next-page arrow and give the page time to load
    driver.find_element(By.XPATH, '//i[@class="ui-icon-arrow-right"]').click()
    time.sleep(3)

time.sleep(10)
driver.quit()
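
If you want to keep the results instead of only printing them, the paging loop above can collect (title, salary) pairs and dump them with the standard csv module. A minimal sketch of my own (the jobs.csv filename is just an example); it would replace the printing loop, before driver.quit():

import csv

rows = []
for i in range(3):
    titles = driver.find_elements(By.XPATH, '//span[@class="job-name"]')
    prices = driver.find_elements(By.XPATH, '//span[@class="salary"]')
    # Collect (title, salary) pairs instead of printing them
    rows.extend((t.text, p.text) for t, p in zip(titles, prices))
    driver.find_element(By.XPATH, '//i[@class="ui-icon-arrow-right"]').click()
    time.sleep(3)

# utf-8-sig keeps the Chinese text readable if the file is opened in Excel
with open('jobs.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['job_title', 'salary'])
    writer.writerows(rows)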
