A Technical Walkthrough of a JD.com Product Scraper: Automated Data Collection with Selenium

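I. Code Overview

This code implements an automated scraper for JD.com product data. Its core features are password-free login via saved cookies, handling of dynamically loaded pages, multi-page data collection, and storage to Excel. It is built on the Python ecosystem and relies mainly on the following stack: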

Component	Role
Selenium	Browser automation
lxml	HTML parsing
pandas	Data storage and Excel export
Edge WebDriver	Browser driver

II. Core Modules

1. Cookie management

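import json
import os
import time

# web, jd_domain and jd_login_url are module-level globals defined elsewhere
def is_exists_cookies():
    cookie_file = './data/jd_cookies.txt'
    if os.path.exists(cookie_file):
        # Load the local cookies; the domain must be open before add_cookie()
        web.get(jd_domain)
        with open(cookie_file, 'r') as file:
            cookies = json.load(file)
        for cookie in cookies:
            web.add_cookie(cookie)
    else:
        # First login: save the cookies afterwards
        web.get(jd_login_url)
        time.sleep(30)  # time window for logging in manually
        dictcookies = web.get_cookies()
        jsoncookies = json.dumps(dictcookies)
        with open(cookie_file, 'w') as f:
            f.write(jsoncookies)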

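Highlights:

os.path.exists checks for a local cookie file, so later runs skip the login step
add_cookie() injects each saved cookie into the browser session (the site's domain has to be loaded first)
The login credentials are persisted as JSON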
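Saved cookies eventually expire, and the function above would inject stale ones without noticing. The following is a minimal sketch of the same save/load round trip that skips cookies whose `expiry` timestamp has passed. The names `save_cookies`, `load_cookies`, and `FakeDriver` are hypothetical; `FakeDriver` only imitates the `get_cookies()` and `add_cookie()` methods of a Selenium driver so the example runs without a browser.

```python
import json
import os
import tempfile
import time


def save_cookies(driver, path):
    """Dump the driver's cookies to a JSON file, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path, now=None):
    """Inject cookies that have not expired; return how many were added."""
    now = time.time() if now is None else now
    with open(path, "r", encoding="utf-8") as f:
        cookies = json.load(f)
    added = 0
    for cookie in cookies:
        expiry = cookie.get("expiry")  # epoch seconds; absent for session cookies
        if expiry is not None and expiry <= now:
            continue  # stale cookie, skip it
        driver.add_cookie(cookie)
        added += 1
    return added


class FakeDriver:
    """Hypothetical stand-in exposing only get_cookies/add_cookie."""

    def __init__(self, cookies=None):
        self._cookies = list(cookies or [])

    def get_cookies(self):
        return list(self._cookies)

    def add_cookie(self, cookie):
        self._cookies.append(cookie)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "jd_cookies.txt")
    source = FakeDriver([
        {"name": "session", "value": "abc", "expiry": 2000},
        {"name": "old", "value": "xyz", "expiry": 500},
        {"name": "pref", "value": "1"},
    ])
    save_cookies(source, path)
    target = FakeDriver()
    added = load_cookies(target, path, now=1000)
    loaded_names = [c["name"] for c in target.get_cookies()]
```

If `load_cookies` returns 0, every saved cookie has expired and the script could fall back to the manual login branch.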

2. Controlling dynamic page loading

def slide(web):
    height = 0
    new_height = web.execute_script("return document.body.scrollHeight")
    while height < new_height:
        for i in range(height, new_height, 400):
            web.execute_script(f'window.scrollTo(0, {i})')
            time.sleep(0.5)
        height = new_height
        new_height = web.execute_script(...)

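How it works:

A JavaScript snippet reads the total page height
The page is scrolled in steps of 400 pixels to imitate a person browsing
The loop re-reads the height after each pass and stops once it no longer grows, i.e. the bottom has been reached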
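The steps above can be sketched as a function that runs without a browser. This is an illustration under assumptions, not the author's code: it assumes the elided call re-reads `document.body.scrollHeight`, it adds a hypothetical `max_rounds` guard against pages that keep growing forever, and `FakePage` is a made-up stand-in that reports a scripted sequence of heights.

```python
import time


def scroll_to_bottom(driver, step=400, pause=0.5, max_rounds=20):
    """Scroll in fixed steps until the page height stops growing."""
    height = 0
    new_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while height < new_height and rounds < max_rounds:
        for y in range(height, new_height, step):
            driver.execute_script(f"window.scrollTo(0, {y})")
            time.sleep(pause)
        height = new_height
        # Lazy-loaded content may have extended the page; measure again.
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
    return new_height


class FakePage:
    """Hypothetical driver stand-in that reports scripted page heights."""

    def __init__(self, heights):
        self._heights = list(heights)
        self.scrolled = []

    def execute_script(self, script):
        if script.startswith("return"):
            # Hand out the next scripted height; repeat the last one afterwards.
            if len(self._heights) > 1:
                return self._heights.pop(0)
            return self._heights[0]
        self.scrolled.append(script)


page = FakePage([1000, 1800, 1800])
final_height = scroll_to_bottom(page, pause=0)
```

Here the fake page grows from 1000 to 1800 pixels after the first pass, so a second pass is needed before the loop ends.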

3. Parsing product data

from lxml import etree

def get_product(web):
    et = etree.HTML(web.page_source)
    obj_list = et.xpath('//div[@class="gl-i-wrap"]')
    for item in obj_list:
        title = ''.join(item.xpath('./div[@class="p-name"]//text()')).strip()
        price = item.xpath('./div[@class="p-price"]//i/text()')[0]
        shop = item.xpath('./div[@class="p-shop"]//a/text()')[0]
        sales = item.xpath('./div[@class="p-commit"]//text()')[0]
        img = item.xpath('./div[@class="p-img"]//img/@src')[0]

XPath strategy:

Product list container: //div[@class="gl-i-wrap"]
Price field: ./div[@class="p-price"]//i/text()
Sales figure: ./div[@class="p-commit"]//text()
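To show the pattern of selecting a container first and then using relative paths inside it, here is a small self-contained sketch. It deliberately swaps lxml for the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()`), so element text is read through `.text` and `itertext()` instead. The markup in `SAMPLE` is hypothetical and merely reuses the class names from the article; real JD pages are far larger and not guaranteed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that reuses the class names shown above.
SAMPLE = """
<div id="J_goodsList">
  <div class="gl-i-wrap">
    <div class="p-name"><a><em>Phone A</em> 256GB</a></div>
    <div class="p-price"><strong><em>CNY</em><i>3999.00</i></strong></div>
  </div>
  <div class="gl-i-wrap">
    <div class="p-name"><a><em>Phone B</em> 512GB</a></div>
    <div class="p-price"><strong><em>CNY</em><i>4599.00</i></strong></div>
  </div>
</div>
"""


def parse_items(markup):
    """Return (title, price) pairs, one per product container."""
    root = ET.fromstring(markup)
    rows = []
    # Step 1: select every product container anywhere in the document.
    for item in root.findall(".//div[@class='gl-i-wrap']"):
        # Step 2: use paths relative to the container for each field.
        name_node = item.find("./div[@class='p-name']")
        title = "".join(name_node.itertext()).strip()
        price = item.find("./div[@class='p-price']//i").text
        rows.append((title, price))
    return rows


rows = parse_items(SAMPLE)
```

Keeping the field paths relative to each container means a missing field affects only that one product rather than shifting values between rows.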

4. Multi-page crawling

from selenium.webdriver.common.by import By

def get_more(web, page):
    for i in range(page):
        button = web.find_element(By.XPATH, '//*[@id="J_bottomPage"]//a[9]')
        web.execute_script("arguments[0].click();", button)
        time.sleep(5)
        get_product(web)

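Pagination mechanism:

The pager button is located by position (the 9th a tag is "next page"), which is fragile if the pager layout changes
The click is performed through execute_script
A fixed 5-second wait gives the next page time to load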

5. Data storage

# Column names: title, price, shop, sales, image URL
data = {
    "标题": titless,
    "价格": prices,
    "店铺": shop_names,
    "销量": saleses,
    "图片": urls,
}
pd.DataFrame(data).to_excel('./data/手机销售.xlsx', index=False)

pandas notes:

A dict of lists converts directly into a DataFrame
index=False drops the default index column
Chinese column names are supported
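One caveat with the dict-to-DataFrame conversion: pandas raises a ValueError when the lists have different lengths, which happens as soon as one field is missing for a single product. A small pre-check makes the mismatch visible before export. The helper name `check_columns` and the sample data are hypothetical.

```python
def check_columns(data):
    """Return the common row count, or raise ValueError if the lists differ in length."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


good = {"title": ["Phone A", "Phone B"], "price": ["3999.00", "4599.00"]}
row_count = check_columns(good)

bad = {"title": ["Phone A", "Phone B"], "price": ["3999.00"]}
try:
    check_columns(bad)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Calling such a check right before `pd.DataFrame(data)` would report which column is short instead of a generic length error.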

III. Suggested Improvements

1. Harden against anti-scraping measures

# Suggested additional options
op.add_argument("--disable-blink-features=AutomationControlled")
op.add_argument(f"user-agent={random.choice(USER_AGENTS)}")  # random User-Agent
op.add_argument("--proxy-server=http://127.0.0.1:10809")     # proxy IP
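The options above can be collected by a small function so that the User-Agent pool and proxy are easy to swap. This is only a sketch: `build_browser_args` and the entries in `USER_AGENTS` are hypothetical placeholders, and the returned strings would still have to be passed one by one to `op.add_argument(...)`.

```python
import random

# Hypothetical pool; replace with real, current User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
]


def build_browser_args(user_agents, proxy=None, rng=random):
    """Return the command-line switches as a list of strings."""
    args = ["--disable-blink-features=AutomationControlled"]
    # Explicit "--" form of the user-agent switch.
    args.append(f"--user-agent={rng.choice(user_agents)}")
    if proxy:
        args.append(f"--proxy-server={proxy}")
    return args


browser_args = build_browser_args(
    USER_AGENTS,
    proxy="http://127.0.0.1:10809",
    rng=random.Random(0),  # seeded only to make the demo repeatable
)
```

Usage would look like `for arg in browser_args: op.add_argument(arg)`, assuming `op` is the options object from the original script.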

2. Improve page waiting

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Replace time.sleep with an explicit wait
wait = WebDriverWait(web, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "J_goodsList")))
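The advantage of an explicit wait over a fixed sleep is that it returns as soon as the condition holds and fails loudly when it never does. The sketch below imitates that polling behaviour with the standard library only; it is not Selenium's implementation, and `wait_until` and `goods_list_ready` are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


calls = {"n": 0}


def goods_list_ready():
    """Pretend the product list appears on the third check."""
    calls["n"] += 1
    return "J_goodsList" if calls["n"] >= 3 else None


found = wait_until(goods_list_ready, timeout=2.0, poll=0.01)
```

With a fixed `time.sleep(5)` the script always pays the full five seconds; with polling it continues after the third check in this demo.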

3. Add exception handling

try:
    price = item.xpath('./div[@class="p-price"]//i/text()')[0]
except IndexError:
    price = "暂无报价"  # "no price available"
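Repeating the same try/except for every field gets verbose, so the pattern can be folded into one helper. This is a sketch with a hypothetical name, `first_or`; unlike plain `[0]`, it also skips whitespace-only strings, which `//text()` queries often return before the actual value.

```python
def first_or(results, default):
    """Return the first non-blank string in results (stripped), else default."""
    for value in results:
        text = value.strip()
        if text:
            return text
    return default


price = first_or([], "N/A")                      # field missing entirely
sales = first_or(["\n   ", "2000+ reviews"], "N/A")  # leading whitespace node
```

In `get_product` it could wrap each query, for example `first_or(item.xpath(...), "N/A")`, so every column always receives exactly one value per product.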

4. Make the code configurable

# Add a configuration file, config.py
KEYWORDS = "vivo x100s"
MAX_PAGE = 10
SAVE_PATH = "./data/"
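As one possible way to consume these settings, the values can be grouped into a small immutable object that also derives the output file name from the search keyword. The class `CrawlConfig` and its `output_file` method are hypothetical; only the three default values come from the configuration above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlConfig:
    keywords: str = "vivo x100s"
    max_page: int = 10
    save_path: str = "./data/"

    def output_file(self):
        """Build the Excel path from the save folder and the keyword."""
        safe_name = self.keywords.strip().replace(" ", "_")
        return os.path.join(self.save_path, f"{safe_name}.xlsx")


config = CrawlConfig()
output_path = config.output_file()
other = CrawlConfig(keywords="laptop", max_page=3)
```

Deriving the file name from the keyword avoids overwriting an earlier export when the search term changes.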

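IV. Potential Problems and Solutions

Symptom	Cause	Solution
Product list loads incompletely	Scrolling is too fast	Reduce the `slide()` step to 200 pixels
Next-page button cannot be located	The page's DOM structure changed	Switch to a CLASS_NAME locator
Data contains empty values	Product fields are missing	Add try-except to catch the exception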
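Garbled Chinese text in Excel	Encoding problem; .xlsx files written by `to_excel` are Unicode, so this typically affects CSV exports opened in Excel	For CSV output use `to_csv(..., encoding='utf-8-sig')`; `to_excel` no longer accepts an `encoding` argument since pandas 2.0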