AI Crawler Framework Crawl4AI Demystified (Part 2): How to Crawl Dynamic Web Pages

Source: WeChat article
1. Introduction


With front-end frameworks such as React and Vue more popular than ever, modern websites rely heavily on JavaScript to load their content. Crawling these dynamic sites is a real challenge for traditional crawler frameworks, but Crawl4AI handles it with a few simple configuration settings.

2. A Hands-On Example


Crawl4AI supports two ways of crawling dynamic page content. The first is to make the crawler wait until a specified selector has been rendered on the page before extracting anything; the selector can be written in either XPath or CSS form. The key is the wait_for property of the CrawlerRunConfig object. One pitfall here: wait_for only takes effect when wait_for_timeout is set as well. When you are not sure how long the target site needs to finish rendering, it is usually safest to give wait_for_timeout a generously large value. The full code is as follows:
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMExtractionStrategy, LLMConfig
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from dotenv import load_dotenv

load_dotenv()

async def scraper(url: str, instruction: str) -> list:
    # Browser configuration
    browser_config = BrowserConfig(
        headless=True,
        browser_mode='dedicated',
        user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36',
    )
    llm_config = LLMConfig(provider="deepseek/deepseek-chat",
                           api_token="ENTER_YOUR_API_KEY")
    strategy = LLMExtractionStrategy(
        llm_config=llm_config,
        instruction=instruction,
        verbose=True
    )
    # Crawler configuration
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.DISABLED,
        wait_until="domcontentloaded",  # wait until the DOM of the page has been loaded
        page_timeout=1800000000,        # in milliseconds
        wait_for=".contentDiv .tableList .title_txt",  # selector that must render before extraction
        wait_for_timeout=150000000,     # in milliseconds; must be set together with wait_for
        css_selector="#newsList",       # only this part of the page is kept for extraction
        extraction_strategy=strategy
    )
    cs = AsyncPlaywrightCrawlerStrategy(browser_config=browser_config)
    # Run the AI-powered crawler
    async with AsyncWebCrawler(crawler_strategy=cs) as crawler:
        result = await crawler.arun(
            config=crawler_config,
            url=url,
        )
    # print(f"Parsed Markdown data:\n{result.markdown}")
    # print(f"Parsed extracted data:\n{result.extracted_content}")
    results = []
    for x in json.loads(result.extracted_content):
        print(x["content"])
        results.append(x["content"])
    return results

if __name__ == "__main__":
    results = asyncio.run(scraper(
        "http://credit.jt.jiangxi.gov.cn/list_sgs.shtml?id=4bb7bd0997804feeb2d4d1e7065bd026",
        "From the crawled page content, extract only the names of the administrative parties (行政相对人); do not include any other information, especially not URLs."
    ))
    print(results)
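A side note on the wait_for syntax: besides a bare CSS selector as above, Crawl4AI also documents an explicit "css:" prefix and a "js:" prefix that waits until a JavaScript predicate returns true. The sketch below shows both forms; the selectors and the 30-second timeout are illustrative placeholders, not values taken from the example above.

from crawl4ai import CrawlerRunConfig

# Wait for a specific element to appear in the DOM (bare selectors are treated as CSS)
css_wait = CrawlerRunConfig(
    wait_for="css:.contentDiv .tableList .title_txt",
    wait_for_timeout=30000,  # give up waiting after 30 seconds
)

# Wait until a JavaScript expression evaluates to true, e.g. the list has rendered rows
js_wait = CrawlerRunConfig(
    wait_for="js:() => document.querySelectorAll('#newsList a').length > 0",
    wait_for_timeout=30000,
)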

The second approach is to let Crawl4AI execute JavaScript on the page to trigger the dynamic loading itself. This is done through the js_code property of the CrawlerRunConfig object, as in the following code:
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy  # For example

async def crawl_dynamic_page():
    print("\n--- Crawling Dynamic Page with JS Interaction ---")

    # Example schema for CSS extraction (adapt to the target site)
    schema = {
        "items": {
            "selector": "div.product-item",
            "type": "list",
            "fields": {"title": "h2", "price": ".price"}
        }
    }
    css_extractor = JsonCssExtractionStrategy(schema)

    # JavaScript to execute on the page (e.g., click a 'Load More' button)
    # Note: Selector needs to match the target website
    js_to_run = """
    (async () => {
        const loadMoreButton = document.querySelector('button#load-more');
        if (loadMoreButton) {
            console.log('Clicking load more button...');
            loadMoreButton.click();
            // Wait a bit for content to potentially load after click
            await new Promise(resolve => setTimeout(resolve, 2000));
            console.log('Waited after click.');
        } else {
            console.log('Load more button not found.');
        }
    })();
    """

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_to_run],    # List of JS snippets to execute
        wait_for_timeout=3000,  # Wait 3 seconds after initial load AND after JS execution
        # wait_for="css:div.newly-loaded-content",  # Or wait for a specific element
        extraction_strategy=css_extractor,  # Extract data after JS runs
    )

    # Ensure JS is enabled in BrowserConfig (it is by default)
    browser_conf = BrowserConfig(headless=True, java_script_enabled=True)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="URL_OF_DYNAMIC_PAGE_HERE",  # Replace with actual URL
            config=run_conf
        )

        if result and result.success:
            print("Dynamic page crawl successful!")
            print(f"Fit Markdown Length: {len(result.markdown.fit_markdown)}")
            if result.extracted_content:
                try:
                    extracted_data = json.loads(result.extracted_content)
                    print(f"Extracted Content Preview: {json.dumps(extracted_data, indent=2)[:500]}...")
                except json.JSONDecodeError:
                    print(f"Extracted Content (non-JSON): {result.extracted_content[:500]}...")
        else:
            print(f"Crawl Failed: {result.error_message}")

if __name__ == "__main__":
    # Replace with an actual URL that loads content dynamically for testing
    # asyncio.run(crawl_dynamic_page())
    print("Please replace 'URL_OF_DYNAMIC_PAGE_HERE' and uncomment the line above to run the dynamic example.")
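When a single js_code click is not enough, for instance when a "Load More" list has to be paged through several times, Crawl4AI's session mechanism lets you reuse the same browser tab across multiple arun() calls. Below is a minimal sketch under stated assumptions: the URL and the button#load-more selector are hypothetical placeholders, and it relies on the documented session_id and js_only parameters, where js_only=True runs the JS in the already-open tab instead of navigating again.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_with_session():
    async with AsyncWebCrawler() as crawler:
        # First call: navigate normally; session_id keeps the tab open afterwards
        first = await crawler.arun(
            url="URL_OF_DYNAMIC_PAGE_HERE",  # hypothetical placeholder
            config=CrawlerRunConfig(session_id="demo_session", cache_mode=CacheMode.BYPASS),
        )
        print(f"Initial HTML length: {len(first.html)}")

        # Second call: no new navigation, just click 'Load More' in the same tab
        more = await crawler.arun(
            url="URL_OF_DYNAMIC_PAGE_HERE",
            config=CrawlerRunConfig(
                session_id="demo_session",
                js_only=True,  # execute JS in the existing tab instead of reloading
                js_code=["document.querySelector('button#load-more')?.click();"],
                wait_for_timeout=3000,  # give the new content time to render
                cache_mode=CacheMode.BYPASS,
            ),
        )
        print(f"HTML length after 'Load More': {len(more.html)}")

        # Release the tab once done
        await crawler.crawler_strategy.kill_session("demo_session")

if __name__ == "__main__":
    asyncio.run(crawl_with_session())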
That concludes this look at crawling dynamic web pages with Crawl4AI. If you found it useful, stay tuned to this series; more practical tips will be shared from time to time~