🕷️ Scraper Framework Notes: XPath Syntax and Network-Interception Usage Compared
Frameworks covered: Selenium · DrissionPage · Playwright · Scrapy · feapder
Part 1: XPath syntax for text and attribute matching
1.1 XPath entry points in each framework
| Framework | Call | Return type |
| --- | --- | --- |
| Selenium | driver.find_element(By.XPATH, "...") / find_elements(...) | WebElement / list |
| DrissionPage | page.ele("xpath:...") / page.eles("xpath:...") | ChromiumElement / list |
| Playwright | page.locator("xpath=...") / page.query_selector("xpath=...") | Locator / ElementHandle |
| Scrapy | response.xpath("...") | SelectorList |
| feapder | selector.xpath("...") (parsel-based, same as Scrapy) | SelectorList |

Key point: the XPath expressions in this note are fully portable across all five frameworks; only the call entry point changes.

Full invocation examples per framework:

# ── Selenium ──────────────────────────────────────────
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")
# single element
el = driver.find_element(By.XPATH, '//div[@class="title"]')
# multiple elements
els = driver.find_elements(By.XPATH, '//ul/li')
# ── DrissionPage ──────────────────────────────────────
from DrissionPage import ChromiumPage
page = ChromiumPage()
page.get("https://example.com")
# single element
el = page.ele('xpath://div[@class="title"]')
# multiple elements
els = page.eles('xpath://ul/li')
# ── Playwright (sync) ────────────────────────────────
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    # locator (recommended; lazy, resolved on use)
    el = page.locator('xpath=//div[@class="title"]')
    # query_selector (immediate lookup; returns ElementHandle)
    el = page.query_selector('xpath=//div[@class="title"]')
    # multiple elements
    els = page.query_selector_all('xpath=//ul/li')
# ── Scrapy (inside a parse callback) ──────────────────
def parse(self, response):
    # single
    el = response.xpath('//div[@class="title"]')
    # multiple (already a list; iterate directly)
    for li in response.xpath('//ul/li'):
        print(li.xpath('text()').get())
# ── feapder (inside a parse callback) ─────────────────
def parse(self, request, response):
    # response.xpath works exactly like Scrapy's
    el = response.xpath('//div[@class="title"]')
    for li in response.xpath('//ul/li'):
        print(li.xpath('text()').extract_first())

1.2 Locating by text
① Exact text match
# XPath expression (identical in every framework)
# //div[text()="target text"]
# ── Selenium ──
el = driver.find_element(By.XPATH, '//button[text()="Login"]')
el.click()
# ── DrissionPage ──
el = page.ele('xpath://button[text()="Login"]')
el.click()
# ── Playwright ──
el = page.locator('xpath=//button[text()="Login"]')
el.click()
# ── Scrapy / feapder ──
# exact-text matching (usually used as a filter, rarely on its own)
node = response.xpath('//button[text()="Login"]')

② Fuzzy text match (contains)
# XPath expression (identical in every framework)
# //div[contains(text(), "keyword")]
# ── Selenium ──
els = driver.find_elements(By.XPATH, '//p[contains(text(), "Notice")]')
for el in els:
    print(el.text)
# ── DrissionPage ──
els = page.eles('xpath://p[contains(text(), "Notice")]')
for el in els:
    print(el.text)
# ── Playwright ──
els = page.query_selector_all('xpath=//p[contains(text(), "Notice")]')
for el in els:
    print(el.inner_text())
# ── Scrapy / feapder ──
nodes = response.xpath('//p[contains(text(), "Notice")]')
for node in nodes:
    print(node.xpath('text()').get())

③ Extracting text content
| Framework | Syntax | Notes |
| --- | --- | --- |
| Selenium | element.text | property; read directly |
| DrissionPage | element.text | property; read directly |
| Playwright | element.inner_text() | method; needs await in async code |
| Scrapy | selector.xpath('//div/text()').get() | append /text() to the XPath |
| feapder | selector.xpath('//div/text()').extract_first() | same as Scrapy; extract_first() ≡ .get() |

# Suppose the page contains <div class="price">¥99.00</div>
# ── Selenium ──
el = driver.find_element(By.XPATH, '//div[@class="price"]')
print(el.text)  # → ¥99.00
# ── DrissionPage ──
el = page.ele('xpath://div[@class="price"]')
print(el.text)  # → ¥99.00
# ── Playwright (sync) ──
el = page.query_selector('xpath=//div[@class="price"]')
print(el.inner_text())  # → ¥99.00
# ── Playwright (async) ──
el = await page.query_selector('xpath=//div[@class="price"]')
print(await el.inner_text())  # → ¥99.00
# ── Scrapy / feapder ──
# .get() returns the first match, .getall() the full list
price = response.xpath('//div[@class="price"]/text()').get()
print(price)  # → ¥99.00
# extract several texts (e.g. every title on a list page)
titles = response.xpath('//h2[@class="title"]/text()').getall()
print(titles)  # → ['Title 1', 'Title 2', ...]
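The first-match vs. all-matches distinction can be tried offline with nothing but the standard library. A hedged sketch: ElementTree is not parsel and speaks only a small XPath subset (no text() nodes), but its find()/findall() pair mirrors .get()/.getall() as "first hit" vs "every hit":

```python
import xml.etree.ElementTree as ET

# Stand-in markup for a list page (sample data, not a real site).
doc = ET.fromstring(
    '<body><h2 class="title">Title 1</h2><h2 class="title">Title 2</h2></body>'
)

# find() ≈ .get(): first match (or None); findall() ≈ .getall(): every match.
first = doc.find('.//h2[@class="title"]').text
every = [h2.text for h2 in doc.findall('.//h2[@class="title"]')]
print(first)  # → Title 1
print(every)  # → ['Title 1', 'Title 2']
```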
# feapder legacy spelling (equivalent)
price = response.xpath('//div[@class="price"]/text()').extract_first()
titles = response.xpath('//h2/text()').extract()

Summary: Scrapy/feapder select /text() nodes and need .get() or .getall() to turn them into Python strings; Selenium/DrissionPage expose a .text property; Playwright uses the .inner_text() method (await it in async code).

1.3 Locating by tag attribute
① Exact attribute-value match
# XPath expressions (identical in every framework)
# //input[@name="username"]
# //a[@href="https://example.com"]
# ── Selenium ──
input_el = driver.find_element(By.XPATH, '//input[@name="username"]')
input_el.send_keys("admin")
link = driver.find_element(By.XPATH, '//a[@target="_blank"]')
# ── DrissionPage ──
input_el = page.ele('xpath://input[@name="username"]')
input_el.input("admin")
# ── Playwright ──
input_el = page.locator('xpath=//input[@name="username"]')
input_el.fill("admin")
# ── Scrapy / feapder ──
link = response.xpath('//a[@target="_blank"]')

② Fuzzy attribute-value match
# The XPath below is identical in every framework; only the call site differs
# contains: class contains a given token (most common)
# //div[contains(@class, "btn")]
# starts-with: href begins with https
# //a[starts-with(@href, "https")]
# not: element lacks an attribute
# //div[not(@disabled)]
# ── Selenium ──
# every li whose class contains "active"
els = driver.find_elements(By.XPATH, '//li[contains(@class, "active")]')
# ── DrissionPage ──
els = page.eles('xpath://li[contains(@class, "active")]')
# ── Playwright ──
els = page.query_selector_all('xpath=//li[contains(@class, "active")]')
# ── Scrapy / feapder ──
links = response.xpath('//a[starts-with(@href, "https")]')
for link in links:
    print(link.xpath('@href').get())  # extract the attribute value

③ Extracting attribute values
| Framework | Syntax | Example |
| --- | --- | --- |
| Selenium | element.get_attribute("href") | el.get_attribute("class") |
| DrissionPage | element.attr("href") | el.attr("data-id") |
| Playwright | element.get_attribute("href") | await el.get_attribute("href") |
| Scrapy | selector.xpath('//@href').get() | append @attr to the XPath |
| feapder | selector.xpath('//@href').extract_first() | same as Scrapy |

# Suppose the page contains <a class="item" href="/detail/123" data-id="123">Item name</a>
# ── Selenium ──
el = driver.find_element(By.XPATH, '//a[@class="item"]')
print(el.get_attribute("href"))     # → /detail/123
print(el.get_attribute("data-id"))  # → 123
print(el.get_attribute("class"))    # → item
# ── DrissionPage ──
el = page.ele('xpath://a[@class="item"]')
print(el.attr("href"))     # → /detail/123
print(el.attr("data-id"))  # → 123
# ── Playwright (sync) ──
el = page.query_selector('xpath=//a[@class="item"]')
print(el.get_attribute("href"))     # → /detail/123
print(el.get_attribute("data-id"))  # → 123
# ── Scrapy / feapder ──
# option 1: extract directly at the end of the XPath
href = response.xpath('//a[@class="item"]/@href').get()
print(href)  # → /detail/123
# option 2: select the element first, then extract
el = response.xpath('//a[@class="item"]')
print(el.xpath('@href').get())     # → /detail/123
print(el.xpath('@data-id').get())  # → 123
# batch-extract the href of every link
all_hrefs = response.xpath('//a/@href').getall()
print(all_hrefs)  # → ['/detail/1', '/detail/2', ...]

Summary: Scrapy/feapder extract values directly with /@attr; Selenium and Playwright use .get_attribute(); DrissionPage uses .attr().

1.4 Combined predicates and axes
# ── Multiple conditions, AND ──
# the input with type="text" and name="user"
# //input[@type="text" and @name="user"]
# Selenium
el = driver.find_element(By.XPATH, '//input[@type="text" and @name="user"]')
# DrissionPage
el = page.ele('xpath://input[@type="text" and @name="user"]')
# Scrapy/feapder
el = response.xpath('//input[@type="text" and @name="user"]')
# ── Multiple conditions, OR ──
# buttons whose class is "btn-primary" or "btn-success"
# //button[@class="btn-primary" or @class="btn-success"]
els = driver.find_elements(By.XPATH, '//button[@class="btn-primary" or @class="btn-success"]')
# ── Parent axis: reach a parent via its child ──
# the parent div of the span with class="price"
# //span[@class="price"]/parent::div
el = page.ele('xpath://span[@class="price"]/parent::div')
# ── Sibling axis: adjacent nodes at the same level ──
# every sibling li after the li with class="active"
# //li[@class="active"]/following-sibling::li
els = response.xpath('//li[@class="active"]/following-sibling::li')
# preceding siblings
# //li[@class="active"]/preceding-sibling::li
# ── Position index ──
# XPath indices start at 1 (careful: not 0!)
# Scrapy/feapder: index inside the XPath
second_li = response.xpath('//ul/li[2]/text()').get()
last_li = response.xpath('//ul/li[last()]/text()').get()
# every li after position 2
rest = response.xpath('//ul/li[position() > 2]/text()').getall()
# Selenium: index inside the XPath
el = driver.find_element(By.XPATH, '//ul/li[1]')  # first
# DrissionPage: XPath index also works, or plain Python indexing
el = page.ele('xpath://ul/li[1]')
els = page.eles('xpath://ul/li')
print(els[0].text)  # Python list index (0-based)
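The 1-based indexing rules above can be checked without a browser. A minimal sketch using only Python's standard library: ElementTree implements a small XPath subset (tag steps, attribute tests, integer positions, last()) that covers exactly these cases:

```python
import xml.etree.ElementTree as ET

# Three-item list standing in for a rendered page (sample data).
ul = ET.fromstring('<ul><li class="active">one</li><li>two</li><li>three</li></ul>')

# XPath positions start at 1, so li[2] is the SECOND item.
print(ul.find('li[2]').text)       # → two
print(ul.find('li[last()]').text)  # → three

# Attribute predicate, same shape as in the full engines above.
print(ul.find('li[@class="active"]').text)  # → one
```

contains(), position() comparisons, text() nodes and the parent/sibling axes are beyond ElementTree; for those you need a full XPath 1.0 engine (lxml/parsel, or the browser frameworks themselves).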
# ── Putting it together: scraping a goods list ──
# <ul class="goods-list">
#   <li class="item">
#     <span class="name">Item A</span>
#     <span class="price">¥99</span>
#     <a href="/detail/1">Details</a>
#   </li>
# </ul>
# Scrapy/feapder
for item in response.xpath('//ul[@class="goods-list"]/li[@class="item"]'):
    name = item.xpath('span[@class="name"]/text()').get()
    price = item.xpath('span[@class="price"]/text()').get()
    href = item.xpath('a/@href').get()
    print(name, price, href)
# DrissionPage
for item in page.eles('xpath://ul[@class="goods-list"]/li[@class="item"]'):
    name = item.ele('xpath:span[@class="name"]').text
    price = item.ele('xpath:span[@class="price"]').text
    href = item.ele('xpath:a').attr('href')
    print(name, price, href)
# Selenium
for item in driver.find_elements(By.XPATH, '//ul[@class="goods-list"]/li[@class="item"]'):
    name = item.find_element(By.XPATH, './/span[@class="name"]').text
    price = item.find_element(By.XPATH, './/span[@class="price"]').text
    href = item.find_element(By.XPATH, './/a').get_attribute('href')
    print(name, price, href)

1.5 Syntax-difference quick table
| Feature | Scrapy/feapder | Selenium/DrissionPage/Playwright |
| --- | --- | --- |
| Get text | xpath('//p/text()').get() | element.text / inner_text() |
| Get attribute | xpath('//@href').get() | get_attribute("href") / attr("href") |
| Multiple results | .getall() / .extract() | find_elements(...) |
| Result type | SelectorList (call .get()/.getall() to materialize) | plain Python strings/lists |

Part 2: Network interception (monitoring API requests)
Applicable frameworks: DrissionPage · Playwright · feapder

2.1 DrissionPage request monitoring

DrissionPage captures API responses through its Listener mechanism, built on the CDP protocol.
from DrissionPage import ChromiumPage
page = ChromiumPage()
# ① Start listening, keyed on a URL substring
page.listen.start("api/data")  # matches requests whose URL contains it
# ② Visit the target page
page.get("https://example.com")
# ③ Block until a matching packet arrives (default timeout 10 s)
res = page.listen.wait()
# ④ Extract the contents
print(res.response.body)    # response body (JSON parsed automatically)
print(res.request.headers)  # request headers
print(res.url)              # full URL
# ⑤ Capture multiple packets
res = page.listen.wait(count=3)  # wait for 3 matching packets
# ⑥ Stop listening
page.listen.stop()

Continuous monitoring (loop scenarios):

page.listen.start("api/list")
page.get("https://example.com/list")
for _ in range(5):  # page through 5 times
    packet = page.listen.wait()
    data = packet.response.body  # the JSON, directly
    print(data)
    page.ele("@class=next-btn").click()

Common packet attributes:

| Attribute | Meaning |
| --- | --- |
| res.url | request URL |
| res.method | request method, GET/POST |
| res.request.body | request body |
| res.request.headers | request headers |
| res.response.body | response body (JSON parsed automatically) |
| res.response.headers | response headers |
| res.response.status | status code |

2.2 Playwright request monitoring

Playwright offers two approaches: event listeners and route interception.
① Response event listener (read-only)
from playwright.sync_api import sync_playwright

def handle_response(response):
    if "api/data" in response.url:
        print(response.url)
        print(response.json())  # parses JSON automatically

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # attach the response listener
    page.on("response", handle_response)
    page.goto("https://example.com")
    page.wait_for_timeout(3000)
    browser.close()

② Request event listener (inspecting outgoing requests)
def handle_request(request):
    if "api" in request.url:
        print(request.method)
        print(request.headers)
        print(request.post_data)  # POST body

page.on("request", handle_request)

③ route interception (can modify or block requests)
# intercept and inspect/modify responses
def handle_route(route):
    if "api/data" in route.request.url:
        # replay the original request and take its response
        response = route.fetch()
        body = response.json()
        print(body)
        route.fulfill(response=response)  # serve it through
    else:
        route.continue_()  # let all other requests pass

page.route("**/*", handle_route)

# or mock the response outright
page.route("**/api/data", lambda route: route.fulfill(
    status=200,
    content_type="application/json",
    body='{"mock": true}'
))

④ Waiting for a specific request (synchronous wait)
# wait for the API response triggered by an action
with page.expect_response("**/api/data") as resp_info:
    page.click("#load-btn")
response = resp_info.value
print(response.json())

⑤ Async version
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        async def handle_response(response):
            if "api" in response.url:
                data = await response.json()
                print(data)

        page.on("response", handle_response)
        await page.goto("https://example.com")
        await asyncio.sleep(3)
        await browser.close()

asyncio.run(main())

2.3 feapder request monitoring

feapder's browser mode is built on Selenium and does not natively support network interception; two common workarounds follow.
① feapder's built-in BrowserSpider + Chrome DevTools (recommended)

BrowserSpider can obtain network requests via CDP (requires seleniumwire, or driving CDP by hand):

import feapder
class MySpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://example.com")

    def parse(self, request, response):
        # response is a requests-based response object
        # parse the API response directly (when the API is reachable as-is)
        data = response.json
        print(data)

② feapder + seleniumwire to intercept browser requests
from seleniumwire import webdriver
import feapder

# supply a custom driver in feapder's BrowserSpider
class MySpider(feapder.BrowserSpider):
    def start_requests(self):
        yield feapder.Request("https://example.com/page")

    def parse(self, request, response):
        # seleniumwire records traffic; read it via self.driver.requests
        for req in self.driver.requests:
            if "api/data" in req.url:
                print(req.url)
                print(req.response.body)  # response body (bytes)

③ feapder hitting the API directly (most common)

feapder is a request-level framework at heart: when the API is directly reachable, scrape the API itself rather than monitoring a browser.
import feapder

class ApiSpider(feapder.AirSpider):
    def start_requests(self):
        # request the API directly
        yield feapder.Request(
            "https://example.com/api/data?page=1",
            headers={"Authorization": "Bearer xxx"},
            callback=self.parse_data
        )

    def parse_data(self, request, response):
        data = response.json  # JSON parsed automatically
        for item in data["list"]:
            print(item)
        # pagination
        if data["hasNext"]:
            # note: str.replace only matches the literal "page=1";
            # rebuild the query string for a general crawler
            next_page = request.url.replace("page=1", f"page={data['page']+1}")
            yield feapder.Request(next_page, callback=self.parse_data)

2.4 Interception compared across the three frameworks
| Dimension | DrissionPage | Playwright | feapder |
| --- | --- | --- | --- |
| Mechanism | CDP Listener | event callbacks / route | seleniumwire / direct requests |
| Difficulty | ⭐⭐ (easy) | ⭐⭐⭐ (flexible) | ⭐⭐⭐⭐ (extra setup) |
| Sync/async | sync (wait blocks) | both supported | sync |
| Can modify requests | ❌ read-only | ✅ route can intercept and modify | ✅ (seleniumwire) |
| Can mock responses | ❌ | ✅ route.fulfill() | ❌ |
| Auto JSON parsing | ✅ .body | ✅ .json() | ✅ response.json |
| Paginated monitoring | ✅ natural fit for loops | ✅ expect_response | ⚠️ manual loop |
| Best suited for | API capture on dynamically rendered pages | tests needing interception/mocking | direct API scraping |

2.5 Which approach to use
Dynamically loaded data (Ajax/XHR) to monitor?
├── Only need to read responses → DrissionPage listen (least effort)
├── Need to modify requests or mock data → Playwright route
├── API is directly reachable → feapder AirSpider (most efficient)
└── Already on feapder + need a browser → feapder + seleniumwire

Part 3: Quick reference
Common XPath templates
# exact text match
//tag[text()="xxx"]
# fuzzy text match
//tag[contains(text(), "xxx")]
# exact attribute match
//tag[@attr="val"]
# fuzzy attribute match
//tag[contains(@attr, "val")]
# extract text (Scrapy/feapder)
//tag/text()
# extract attribute (Scrapy/feapder)
//@href
# multiple conditions
//tag[@a="1" and @b="2"]
# the n-th match
(//tag)[n]
# parent node
//tag/parent::*
# next sibling
//tag/following-sibling::*[1]

DrissionPage monitoring template
page.listen.start("keyword")
page.get("url")
packet = page.listen.wait()
data = packet.response.body

Playwright monitoring template
# simple listener
page.on("response", lambda r: print(r.json()) if "api" in r.url else None)
# wait for a specific API call
with page.expect_response("**/api/**") as r:
    page.click("#btn")
print(r.value.json())

📝 Compiled: March 2026
📦 Framework versions referenced: Selenium 4.x · DrissionPage 4.x · Playwright 1.4x · Scrapy 2.x · feapder 1.x