博智编程

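Scrapy is typically used to crawl static pages. It does not execute JavaScript, so data that a page loads dynamically via AJAX is missing from the responses it downloads. You can, however, integrate a tool such as Splash or Selenium with Scrapy to render those pages first.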
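To make the limitation concrete, here is a minimal sketch using only the Python standard library. The page source, the `/api/movies` endpoint and the `renderMovies` function are made up for illustration; the point is only that the HTML a plain HTTP client receives contains the script, not the data the script would later insert.

```python
from html.parser import HTMLParser

# Hypothetical page source as a plain HTTP client would receive it: the list
# is empty and would only be filled in by JavaScript running in a browser.
RAW_HTML = """
<html><body>
  <ul id="movies"></ul>
  <script>
    fetch('/api/movies').then(r => r.json()).then(renderMovies);
  </script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts the <li> elements in the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_list_items(html):
    parser = ListItemCounter()
    parser.feed(html)
    return parser.count


# No JavaScript has run, so no list items exist in the raw source.
print(count_list_items(RAW_HTML))
```

A rendering tool closes this gap by running the script in a real browser engine and handing back the resulting HTML.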

Method 1: Loading pages with Splash

First, make sure Splash is installed and running. The simplest way is to start it with Docker:

docker run -p 8050:8050 scrapinghub/splash

Next, install the scrapy-splash extension in your Scrapy project:

pip install scrapy-splash

Add the following configuration to your project's settings.py file:

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
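The scrapy-splash README lists middleware settings alongside the three values above; without them, requests are not routed through Splash. The fragment below is a sketch of those additions for settings.py, with the priorities as given in the README. Check the README of the version you installed in case they have changed.

```python
# Downloader middlewares that send SplashRequest objects to the Splash server.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```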

Then use SplashRequest in your Scrapy spider:

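import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://douban.com']

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' is how many seconds Splash waits after loading the page
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # Process the page content here, including the AJAX data.
        # response.text holds the HTML as rendered by Splash, including
        # content that JavaScript loaded within the wait time.
        pass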
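Before debugging a spider, it can help to confirm that Splash itself renders the page. Splash exposes a render.html HTTP endpoint that accepts `url` and `wait` as query parameters. The helper below is a small sketch that builds such a URL, which you can then open in a browser or fetch with curl. The function name `splash_render_url` is made up for this example, and the base address assumes the `SPLASH_URL` value from settings.py.

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"  # same value as in settings.py


def splash_render_url(page_url, wait=0.5, splash_url=SPLASH_URL):
    """Build a GET URL for Splash's render.html endpoint."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


print(splash_render_url("https://douban.com"))
```

If the URL returns the rendered HTML, the Splash side is working and any remaining problem is likely in the Scrapy configuration.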

Method 2: Loading pages in Chrome with Selenium

First, make sure Selenium is installed:

pip install selenium

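Then download ChromeDriver, making sure its version matches your Chrome browser. Download page: https://chromedriver.storage.googleapis.com/index.html

If you cannot find a matching version there (that index only lists drivers for older Chrome releases, up to version 114), download one from the Chrome for Testing page: https://googlechromelabs.github.io/chrome-for-testing/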

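from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # headless mode: no browser window, faster crawling
chrome_options.add_argument('--disable-gpu')  # disable GPU acceleration

# Path to the ChromeDriver executable
chrome_driver_path = '/path/to/chromedriver'

# Start Chrome. Selenium 4 takes the driver path through a Service object;
# the older executable_path argument has been removed.
driver = webdriver.Chrome(service=Service(chrome_driver_path), options=chrome_options)

# Open the page
url = 'https://douban.com'
driver.get(url)

# Get the page content after the browser has rendered it
page_source = driver.page_source

# Use page_source here to process the page, e.g. parse the HTML or extract data

# Close the browser
driver.quit()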
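Once you have the rendered source, extracting data works the same as with any HTML string. As a minimal sketch, the snippet below collects link targets using only the standard library. The `sample_source` string is a stand-in for `driver.page_source`, and the `LinkExtractor` class is written for this example. In a real project you would more likely pass the source to Scrapy selectors or another HTML parser.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(page_source):
    parser = LinkExtractor()
    parser.feed(page_source)
    return parser.links


# Stand-in for driver.page_source (hypothetical content).
sample_source = (
    '<html><body>'
    '<a href="/movie/1">One</a><a href="/movie/2">Two</a><a>no href</a>'
    '</body></html>'
)
print(extract_links(sample_source))
```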

Do not reproduce without permission: 学编程 » Solving Scrapy's inability to scrape dynamically loaded page data
