🕷️ Crawler-Framework Notes: XPath Syntax and API Interception Compared

Frameworks covered: Selenium · DrissionPage · Playwright · Scrapy · feapder

1. Comparing XPath text and attribute syntax

1.1 How each framework invokes XPath

| Framework | Call | Return type |
|---|---|---|
| Selenium | driver.find_element(By.XPATH, "...") / find_elements(...) | WebElement / List |
| DrissionPage | page.ele("xpath:...") / page.eles("xpath:...") | ChromiumElement / List |
| Playwright | page.locator("xpath=...") / page.query_selector("xpath=...") | Locator / ElementHandle |
| Scrapy | response.xpath("...") | SelectorList |
| feapder | selector.xpath("...") (parsel-based, same as Scrapy) | SelectorList |

Complete invocation examples:

# ── Selenium ──────────────────────────────────────────
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")

# single element
el = driver.find_element(By.XPATH, '//div[@class="title"]')
# multiple elements
els = driver.find_elements(By.XPATH, '//ul/li')


# ── DrissionPage ──────────────────────────────────────
from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get("https://example.com")

# single element
el = page.ele('xpath://div[@class="title"]')
# multiple elements
els = page.eles('xpath://ul/li')


# ── Playwright (sync) ─────────────────────────────────
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    # locator (recommended; lazy lookup)
    el = page.locator('xpath=//div[@class="title"]')
    # query_selector (immediate lookup, returns ElementHandle)
    el = page.query_selector('xpath=//div[@class="title"]')
    # multiple elements
    els = page.query_selector_all('xpath=//ul/li')


# ── Scrapy (inside the parse callback) ────────────────
def parse(self, response):
    # single
    el = response.xpath('//div[@class="title"]')
    # multiple (already list-like; iterate directly)
    for li in response.xpath('//ul/li'):
        print(li.xpath('text()').get())


# ── feapder (inside the parse callback) ────────────────
def parse(self, request, response):
    # response.xpath works exactly like Scrapy's
    el = response.xpath('//div[@class="title"]')
    for li in response.xpath('//ul/li'):
        print(li.xpath('text()').extract_first())

1.2 Locating by text

① Exact text match
# XPath expression (same in every framework)
# //div[text()="target text"]

# ── Selenium ──
el = driver.find_element(By.XPATH, '//button[text()="登录"]')
el.click()

# ── DrissionPage ──
el = page.ele('xpath://button[text()="登录"]')
el.click()

# ── Playwright ──
el = page.locator('xpath=//button[text()="登录"]')
el.click()

# ── Scrapy / feapder ──
# Exact-text locating (usually a filter; rarely used on its own)
node = response.xpath('//button[text()="登录"]')
② Fuzzy text match (contains)
# XPath expression (same in every framework)
# //div[contains(text(), "keyword")]

# ── Selenium ──
els = driver.find_elements(By.XPATH, '//p[contains(text(), "公告")]')
for el in els:
    print(el.text)

# ── DrissionPage ──
els = page.eles('xpath://p[contains(text(), "公告")]')
for el in els:
    print(el.text)

# ── Playwright ──
els = page.query_selector_all('xpath=//p[contains(text(), "公告")]')
for el in els:
    print(el.inner_text())

# ── Scrapy / feapder ──
nodes = response.xpath('//p[contains(text(), "公告")]')
for node in nodes:
    print(node.xpath('text()').get())
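This is checkable offline with lxml, the XPath engine underneath parsel. One subtlety worth knowing: contains(text(), ...) tests the element's first text node, while contains(., ...) tests the element's full string value; they differ once child tags split the text. A sketch with made-up HTML:

```python
from lxml import html

doc = html.fromstring('''
<div>
  <p>最新公告:系统升级</p>
  <p>其他内容</p>
</div>
''')

# contains(text(), ...) matches on the first text node of each <p>
print(doc.xpath('//p[contains(text(), "公告")]/text()'))
# contains(., ...) matches on the whole string value instead
print(doc.xpath('//p[contains(., "公告")]/text()'))
```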
③ Extracting text content

| Framework | Syntax | Notes |
|---|---|---|
| Selenium | element.text | property, read directly |
| DrissionPage | element.text | property, read directly |
| Playwright | element.inner_text() | method; await it in async code |
| Scrapy | selector.xpath('//div/text()').get() | append /text() to the XPath |
| feapder | selector.xpath('//div/text()').extract_first() | same as Scrapy; extract_first() is equivalent to .get() |

# Suppose the page contains <div class="price">¥99.00</div>

# ── Selenium ──
el = driver.find_element(By.XPATH, '//div[@class="price"]')
print(el.text)                        # → ¥99.00

# ── DrissionPage ──
el = page.ele('xpath://div[@class="price"]')
print(el.text)                        # → ¥99.00

# ── Playwright (sync) ──
el = page.query_selector('xpath=//div[@class="price"]')
print(el.inner_text())                # → ¥99.00

# ── Playwright (async) ──
el = await page.query_selector('xpath=//div[@class="price"]')
print(await el.inner_text())          # → ¥99.00

# ── Scrapy / feapder ──
# .get() returns the first match, .getall() the full list
price = response.xpath('//div[@class="price"]/text()').get()
print(price)                          # → ¥99.00

# Extract multiple texts (e.g. every title on a listing page)
titles = response.xpath('//h2[@class="title"]/text()').getall()
print(titles)                         # → ['标题1', '标题2', ...]

# feapder's older spelling (equivalent)
price = response.xpath('//div[@class="price"]/text()').extract_first()
titles = response.xpath('//h2/text()').extract()

⚠️ Key differences

  • Scrapy / feapder extract text via a /text() node and need .get() / .getall() to turn it into Python strings
  • Selenium / DrissionPage read the .text property directly
  • Playwright uses the .inner_text() method (await it in async contexts)
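The Scrapy/feapder pattern above can be tried offline with lxml, the XPath engine underneath parsel: its xpath() returns a plain Python list, so Scrapy's .get() corresponds to taking the first element and .getall() to the whole list. A minimal sketch (the HTML snippet is made up):

```python
from lxml import html

doc = html.fromstring('''
<div>
  <h2 class="title">标题1</h2>
  <h2 class="title">标题2</h2>
  <div class="price">¥99.00</div>
</div>
''')

# a trailing /text() yields a list of strings;
# Scrapy's .getall() is that list, .get() its first element
titles = doc.xpath('//h2[@class="title"]/text()')
print(titles)       # ['标题1', '标题2']
price = doc.xpath('//div[@class="price"]/text()')[0]
print(price)        # ¥99.00
```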

1.3 Locating by tag attributes

① Exact attribute match
# XPath expression (same in every framework)
# //input[@name="username"]
# //a[@href="https://example.com"]

# ── Selenium ──
input_el = driver.find_element(By.XPATH, '//input[@name="username"]')
input_el.send_keys("admin")

link = driver.find_element(By.XPATH, '//a[@target="_blank"]')

# ── DrissionPage ──
input_el = page.ele('xpath://input[@name="username"]')
input_el.input("admin")

# ── Playwright ──
input_el = page.locator('xpath=//input[@name="username"]')
input_el.fill("admin")

# ── Scrapy / feapder ──
link = response.xpath('//a[@target="_blank"]')
② Fuzzy attribute match
# The XPath expressions below work in every framework; only the call site differs

# contains: the class attribute contains a given value (most common)
# //div[contains(@class, "btn")]

# starts-with: href starts with https
# //a[starts-with(@href, "https")]

# not: lacks a given attribute
# //div[not(@disabled)]

# ── Selenium example ──
# Find every li whose class contains "active"
els = driver.find_elements(By.XPATH, '//li[contains(@class, "active")]')

# ── DrissionPage example ──
els = page.eles('xpath://li[contains(@class, "active")]')

# ── Playwright example ──
els = page.query_selector_all('xpath=//li[contains(@class, "active")]')

# ── Scrapy / feapder example ──
links = response.xpath('//a[starts-with(@href, "https")]')
for link in links:
    print(link.xpath('@href').get())   # extract the attribute value
③ Extracting attribute values

| Framework | Syntax | Example |
|---|---|---|
| Selenium | element.get_attribute("href") | el.get_attribute("class") |
| DrissionPage | element.attr("href") | el.attr("data-id") |
| Playwright | element.get_attribute("href") | await el.get_attribute("href") |
| Scrapy | selector.xpath('//@href').get() | append @attr to the XPath |
| feapder | selector.xpath('//@href').extract_first() | same as Scrapy |

# Suppose the page contains <a class="item" href="/detail/123" data-id="123">商品名</a>

# ── Selenium ──
el = driver.find_element(By.XPATH, '//a[@class="item"]')
print(el.get_attribute("href"))       # → absolute URL, e.g. https://example.com/detail/123
print(el.get_attribute("data-id"))    # → 123
print(el.get_attribute("class"))      # → item

# ── DrissionPage ──
el = page.ele('xpath://a[@class="item"]')
print(el.attr("href"))                # → /detail/123
print(el.attr("data-id"))             # → 123

# ── Playwright (sync) ──
el = page.query_selector('xpath=//a[@class="item"]')
print(el.get_attribute("href"))       # → /detail/123
print(el.get_attribute("data-id"))    # → 123

# ── Scrapy / feapder ──
# Option 1: extract directly at the end of the XPath
href = response.xpath('//a[@class="item"]/@href').get()
print(href)                           # → /detail/123

# Option 2: select the element first, then extract
el = response.xpath('//a[@class="item"]')
print(el.xpath('@href').get())        # → /detail/123
print(el.xpath('@data-id').get())     # → 123

# Extract the href of every link in one go
all_hrefs = response.xpath('//a/@href').getall()
print(all_hrefs)                      # → ['/detail/1', '/detail/2', ...]

⚠️ Key differences

  • Scrapy / feapder extract attribute values directly with /@attr at the end of the XPath
  • Selenium / Playwright fetch them through the element method .get_attribute()
  • DrissionPage uses the .attr() method
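The /@attr pattern is likewise checkable offline with lxml (same engine as parsel); the anchor tag is the sample one assumed above:

```python
from lxml import html

doc = html.fromstring(
    '<div><a class="item" href="/detail/123" data-id="123">商品名</a></div>'
)

# /@attr at the end of the XPath yields attribute strings directly
hrefs = doc.xpath('//a[@class="item"]/@href')
print(hrefs)                 # ['/detail/123']
# element.get("attr") is the node-level way to read the raw attribute
a = doc.xpath('//a[@class="item"]')[0]
print(a.get("data-id"))      # 123
```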

1.4 Combined conditions and axes

The XPath expressions below work unchanged in every framework; just swap in the matching entry-point call:

# ── Multiple conditions: AND ──
# Find the input with type="text" and name="user"
# //input[@type="text" and @name="user"]

# Selenium
el = driver.find_element(By.XPATH, '//input[@type="text" and @name="user"]')
# DrissionPage
el = page.ele('xpath://input[@type="text" and @name="user"]')
# Scrapy/feapder
el = response.xpath('//input[@type="text" and @name="user"]')


# ── Multiple conditions: OR ──
# Find buttons whose class is "btn-primary" or "btn-success"
# //button[@class="btn-primary" or @class="btn-success"]

els = driver.find_elements(By.XPATH, '//button[@class="btn-primary" or @class="btn-success"]')


# ── Parent axis: reach a parent through its child ──
# Find the parent div of the span with class="price"
# //span[@class="price"]/parent::div

el = page.ele('xpath://span[@class="price"]/parent::div')


# ── Sibling axis: adjacent nodes at the same level ──
# Find every sibling li after the li with class="active"
# //li[@class="active"]/following-sibling::li

els = response.xpath('//li[@class="active"]/following-sibling::li')

# Preceding siblings
# //li[@class="active"]/preceding-sibling::li


# ── Positional indexing ──
# XPath indices start at 1 (careful: not 0!)

# Scrapy/feapder: index inside the XPath
second_li = response.xpath('//ul/li[2]/text()').get()
last_li   = response.xpath('//ul/li[last()]/text()').get()
# every li after position 2
rest      = response.xpath('//ul/li[position() > 2]/text()').getall()

# Selenium: index inside the XPath
el = driver.find_element(By.XPATH, '//ul/li[1]')   # the first one

# DrissionPage: XPath indexing also works, or index the list in Python
el  = page.ele('xpath://ul/li[1]')
els = page.eles('xpath://ul/li')
print(els[0].text)   # Python indexing (0-based)


# ── Full example: scraping a product list ──
# <ul class="goods-list">
#   <li class="item">
#     <span class="name">商品A</span>
#     <span class="price">¥99</span>
#     <a href="/detail/1">详情</a>
#   </li>
# </ul>

# Scrapy/feapder
for item in response.xpath('//ul[@class="goods-list"]/li[@class="item"]'):
    name  = item.xpath('span[@class="name"]/text()').get()
    price = item.xpath('span[@class="price"]/text()').get()
    href  = item.xpath('a/@href').get()
    print(name, price, href)

# DrissionPage
for item in page.eles('xpath://ul[@class="goods-list"]/li[@class="item"]'):
    name  = item.ele('xpath:span[@class="name"]').text
    price = item.ele('xpath:span[@class="price"]').text
    href  = item.ele('xpath:a').attr('href')
    print(name, price, href)

# Selenium
for item in driver.find_elements(By.XPATH, '//ul[@class="goods-list"]/li[@class="item"]'):
    name  = item.find_element(By.XPATH, './/span[@class="name"]').text
    price = item.find_element(By.XPATH, './/span[@class="price"]').text
    href  = item.find_element(By.XPATH, './/a').get_attribute('href')
    print(name, price, href)
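The Scrapy-style loop can be executed as-is against the sample HTML using lxml, the engine parsel drives underneath; the relative XPath (no leading //) keeps each lookup scoped to the current li:

```python
from lxml import html

doc = html.fromstring('''
<ul class="goods-list">
  <li class="item">
    <span class="name">商品A</span>
    <span class="price">¥99</span>
    <a href="/detail/1">详情</a>
  </li>
</ul>
''')

for item in doc.xpath('//ul[@class="goods-list"]/li[@class="item"]'):
    # relative XPath: no leading //, so the search stays inside this <li>
    name  = item.xpath('span[@class="name"]/text()')[0]
    price = item.xpath('span[@class="price"]/text()')[0]
    href  = item.xpath('a/@href')[0]
    print(name, price, href)   # 商品A ¥99 /detail/1
```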

1.5 Syntax-difference cheat sheet

| Feature | Scrapy/feapder | Selenium/DrissionPage/Playwright |
|---|---|---|
| Get text | xpath('//p/text()').get() | element.text / inner_text() |
| Get attribute | xpath('//@href').get() | get_attribute("href") / attr("href") |
| Multiple results | .getall() / .extract() | find_elements(...) |
| Result type | SelectorList → needs .get()/.getall() | plain Python strings/lists |

2. API listening (network request interception) compared

Applicable frameworks: DrissionPage · Playwright · feapder

2.1 DrissionPage request listening

DrissionPage captures API responses through its Listener mechanism, built on the CDP protocol.

from DrissionPage import ChromiumPage

page = ChromiumPage()

# ① Start listening; give a target-URL keyword
page.listen.start("api/data")  # matches requests whose URL contains the keyword

# ② Visit the target page
page.get("https://example.com")

# ③ Wait for and fetch the captured packet (blocking; default timeout 10 s)
res = page.listen.wait()

# ④ Pull out the response content
print(res.response.body)   # response body (JSON parsed automatically)
print(res.request.headers) # request headers
print(res.url)             # full URL

# ⑤ Capture several packets
res = page.listen.wait(count=3)  # wait for 3 matching packets

# ⑥ Stop listening
page.listen.stop()

Continuous listening (loop scenarios):

page.listen.start("api/list")
page.get("https://example.com/list")

for _ in range(5):               # page through 5 pages
    packet = page.listen.wait()
    data = packet.response.body  # JSON straight away
    print(data)
    page.ele("@class=next-btn").click()

Common attributes:

| Attribute | Meaning |
|---|---|
| res.url | request URL |
| res.method | request method GET/POST |
| res.request.body | request body |
| res.request.headers | request headers |
| res.response.body | response body (JSON parsed automatically) |
| res.response.headers | response headers |
| res.response.status | status code |

2.2 Playwright request listening

Playwright offers two mechanisms: event listeners and route interception.

① Response event listening (read-only)
from playwright.sync_api import sync_playwright

def handle_response(response):
    if "api/data" in response.url:
        print(response.url)
        print(response.json())  # parses JSON automatically

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    
    # attach the response listener
    page.on("response", handle_response)
    
    page.goto("https://example.com")
    page.wait_for_timeout(3000)
    browser.close()
② Request event listening (inspect request info)
def handle_request(request):
    if "api" in request.url:
        print(request.method)
        print(request.headers)
        print(request.post_data)  # POST request body

page.on("request", handle_request)
③ Route interception (can modify/block requests)
# Intercept, inspect, and pass through the response
def handle_route(route):
    if "api/data" in route.request.url:
        # carry out the original request and grab its response
        response = route.fetch()
        body = response.json()
        print(body)
        route.fulfill(response=response)  # pass it through
    else:
        route.continue_()  # let all other requests through

page.route("**/*", handle_route)

# You can also mock the response outright
page.route("**/api/data", lambda route: route.fulfill(
    status=200,
    content_type="application/json",
    body='{"mock": true}'
))
④ Waiting for a specific request (synchronous wait)
# Wait for a given API response before continuing
with page.expect_response("**/api/data") as resp_info:
    page.click("#load-btn")
response = resp_info.value
print(response.json())
⑤ Async version
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        
        async def handle_response(response):
            if "api" in response.url:
                data = await response.json()
                print(data)
        
        page.on("response", handle_response)
        await page.goto("https://example.com")
        await asyncio.sleep(3)
        await browser.close()

asyncio.run(main())

2.3 feapder request listening

feapder itself is a request-level framework; its browser rendering rides on Selenium, which has no built-in network interception. Common approaches:

① feapder's built-in browser spider + Chrome DevTools (recommended)

feapder's BrowserSpider can reach network traffic via CDP (paired with seleniumwire or manual CDP calls):

import feapder

class MySpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://example.com")

    def parse(self, request, response):
        # response is a requests-based response object
        # parse an API response directly (when the API itself is requested)
        data = response.json
        print(data)
② feapder + seleniumwire to intercept browser traffic
from seleniumwire import webdriver
import feapder

# customize the driver inside feapder's BrowserSpider
class MySpider(feapder.BrowserSpider):
    
    def start_requests(self):
        yield feapder.Request("https://example.com/page")
    
    def parse(self, request, response):
        # seleniumwire records browser traffic on self.driver.requests
        for req in self.driver.requests:
            if "api/data" in req.url:
                print(req.url)
                print(req.response.body)  # response body (bytes)
③ feapder scraping the API directly (most common)

feapder is a request-level framework at its core; when the API is directly reachable, scrape the API itself rather than listening:

import re
import feapder

class ApiSpider(feapder.AirSpider):
    def start_requests(self):
        # request the API directly
        yield feapder.Request(
            "https://example.com/api/data?page=1",
            headers={"Authorization": "Bearer xxx"},
            callback=self.parse_data
        )

    def parse_data(self, request, response):
        data = response.json  # parses JSON automatically
        for item in data["list"]:
            print(item)

        # Pagination: rewrite whatever page number the current URL carries.
        # (url.replace("page=1", ...) would only work on the very first request.)
        if data["hasNext"]:
            next_page = re.sub(r"page=\d+", f"page={data['page'] + 1}", request.url)
            yield feapder.Request(next_page, callback=self.parse_data)
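Incrementing a page query parameter via plain string replacement is easy to get wrong; a query-string-aware sketch using only the stdlib (the helper name next_page_url is ours, not a feapder API) handles any current page:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def next_page_url(url: str) -> str:
    """Return the same URL with its 'page' query parameter incremented by 1."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["page"] = str(int(query.get("page", "1")) + 1)
    return urlunsplit(parts._replace(query=urlencode(query)))

print(next_page_url("https://example.com/api/data?page=3&size=20"))
# https://example.com/api/data?page=4&size=20
```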

2.4 Interception compared across the three frameworks

| Dimension | DrissionPage | Playwright | feapder |
|---|---|---|---|
| Mechanism | CDP Listener | event callbacks / route | seleniumwire / direct requests |
| Difficulty | ⭐⭐ (simple) | ⭐⭐⭐ (flexible) | ⭐⭐⭐⭐ (extra setup) |
| Sync/async | sync (wait blocks) | both sync and async | sync |
| Can modify requests | ❌ read-only | ✅ route can intercept and modify | ✅ (seleniumwire) |
| Can mock responses | ❌ | ✅ route.fulfill() | ❌ |
| Auto JSON parsing | ✅ .body auto-parsed | .json() | response.json |
| Pagination listening | ✅ naturally loopable | ✅ expect_response | ⚠️ manual looping |
| Best for | scraping APIs behind dynamically rendered pages | tests needing interception/mocking | direct API scraping |

2.5 Choosing by scenario

Need to listen for dynamically loaded data (Ajax/XHR)?
    ├── Only need to read response data        → DrissionPage listen (least hassle)
    ├── Need to modify requests or mock data   → Playwright route
    ├── The API can be requested directly      → feapder AirSpider (most efficient)
    └── Already on feapder + need a browser    → feapder + seleniumwire

3. Quick reference

Common XPath expression templates

# Exact text match
//tag[text()="xxx"]

# Fuzzy text match
//tag[contains(text(), "xxx")]

# Exact attribute match
//tag[@attr="val"]

# Fuzzy attribute match
//tag[contains(@attr, "val")]

# Extract text (Scrapy/feapder)
//tag/text()

# Extract attribute (Scrapy/feapder)
//@href

# Multiple conditions
//tag[@a="1" and @b="2"]

# The n-th match
(//tag)[n]

# Parent node
//tag/parent::*

# Next sibling
//tag/following-sibling::*[1]
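Most of these templates can be sanity-checked offline with lxml (the engine behind parsel); a quick run against a made-up fragment:

```python
from lxml import html

doc = html.fromstring('''
<div>
  <ul>
    <li class="item">A</li>
    <li class="item active">B</li>
    <li class="item">C</li>
  </ul>
  <a href="https://example.com/1">s</a>
  <a href="http://example.com/2">p</a>
</div>
''')

print(doc.xpath('//li[contains(@class, "active")]/text()'))  # ['B']
print(doc.xpath('//a[starts-with(@href, "https")]/@href'))   # ['https://example.com/1']
print(doc.xpath('(//li)[2]/text()'))                         # ['B']  (1-based!)
print(doc.xpath('//li[contains(@class, "active")]/following-sibling::li/text()'))  # ['C']
print(doc.xpath('//li/parent::*')[0].tag)                    # ul
```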

DrissionPage listening template

page.listen.start("keyword")
page.get("url")
packet = page.listen.wait()
data = packet.response.body

Playwright listening template

# Simple listener
page.on("response", lambda r: print(r.json()) if "api" in r.url else None)

# Wait for a specific API call
with page.expect_response("**/api/**") as r:
    page.click("#btn")
print(r.value.json())

📝 Compiled: March 2026
📦 Framework versions referenced: Selenium 4.x · DrissionPage 4.x · Playwright 1.4x · Scrapy 2.x · feapder 1.x
