Demystifying the AI Crawler Framework Crawl4AI, Part 2: How to Crawl Dynamic Web Pages
Author: WeChat article

1. Overview
With front-end frameworks such as React and Vue increasingly popular, modern websites rely heavily on JavaScript to load their content. Crawling such dynamic sites is a challenge for traditional crawler frameworks, but Crawl4AI handles it with a few simple configuration options.
2. Hands-on Examples
Crawl4AI offers two ways to crawl dynamic page content. The first is to make the crawler wait until a specified selector has rendered before extracting; the selector can be written in either XPath or CSS form. The key is the `wait_for` property of the `CrawlerRunConfig` object. One pitfall: `wait_for` must be set together with the `wait_for_timeout` property to take effect. When you are unsure how long the target site needs to render, it is best to give `wait_for_timeout` a generous value. The code is as follows:
import asyncio
import json

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    LLMExtractionStrategy,
    LLMConfig,
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from dotenv import load_dotenv

load_dotenv()

async def scraper(url: str, instruction: str) -> list:
    # Browser configuration
    browser_config = BrowserConfig(
        headless=True,
        browser_mode='dedicated',
        user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36',
    )
    llm_config = LLMConfig(provider="deepseek/deepseek-chat", api_token="YOUR_API_KEY_HERE")
    strategy = LLMExtractionStrategy(
        llm_config=llm_config,
        instruction=instruction,
        verbose=True
    )
    # Crawler configuration
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.DISABLED,
        wait_until="domcontentloaded",  # wait until the DOM of the page has been loaded
        page_timeout=1800000000,
        wait_for=".contentDiv .tableList .title_txt",
        wait_for_timeout=150000000,
        css_selector="#newsList",
        extraction_strategy=strategy
    )
    cs = AsyncPlaywrightCrawlerStrategy(browser_config=browser_config)

    # Run the AI-powered crawler
    async with AsyncWebCrawler(crawler_strategy=cs) as crawler:
        result = await crawler.arun(
            config=crawler_config,
            url=url,
        )
        # print(f"Parsed Markdown data:\n{result.markdown}")
        # print(f"Extracted content:\n{result.extracted_content}")
        re = []
        for x in json.loads(result.extracted_content):
            print(x["content"])
            re.append(x["content"])
    return re

if __name__ == "__main__":
    re = asyncio.run(scraper(
        "http://credit.jt.jiangxi.gov.cn/list_sgs.shtml?id=4bb7bd0997804feeb2d4d1e7065bd026",
        "From the crawled page content, extract only the names of the administrative counterparties; do not include any other information, especially not URLs"
    ))
    print(re)
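As a side note, `LLMExtractionStrategy` returns `extracted_content` as a JSON string encoding a list of blocks, which is why the loop above calls `json.loads` first. That parsing step can be sketched and exercised in isolation; the sample payload below is hypothetical, since the exact fields depend on the LLM's output:

```python
import json

# Hypothetical stand-in for result.extracted_content: a JSON array of
# block objects, each carrying a "content" field. Field names other than
# "content" are illustrative only.
sample_extracted = json.dumps([
    {"index": 0, "content": "Example Construction Co., Ltd."},
    {"index": 1, "content": "Example Transport Group"},
])

def collect_contents(extracted: str) -> list:
    """Mirror the loop in scraper(): keep only each block's 'content' field."""
    return [block["content"] for block in json.loads(extracted)]

print(collect_contents(sample_extracted))
# -> ['Example Construction Co., Ltd.', 'Example Transport Group']
```

Testing this step offline is useful because a malformed LLM response will raise `json.JSONDecodeError` here, not inside the crawler.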
The second approach is to have Crawl4AI execute JavaScript on the page to trigger the dynamic loading itself, by setting the `js_code` property of the `CrawlerRunConfig` object. The code is as follows:
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy  # For example

async def crawl_dynamic_page():
    print("\n--- Crawling Dynamic Page with JS Interaction ---")

    # Example schema for CSS extraction (adapt to the target site)
    schema = {
        "items": {
            "selector": "div.product-item",
            "type": "list",
            "fields": {
                "title": "h2",
                "price": ".price"
            }
        }
    }
    css_extractor = JsonCssExtractionStrategy(schema)

    # JavaScript to execute on the page (e.g., click a 'Load More' button)
    # Note: Selector needs to match the target website
    js_to_run = """
    (async () => {
        const loadMoreButton = document.querySelector('button#load-more');
        if (loadMoreButton) {
            console.log('Clicking load more button...');
            loadMoreButton.click();
            // Wait a bit for content to potentially load after click
            await new Promise(resolve => setTimeout(resolve, 2000));
            console.log('Waited after click.');
        } else {
            console.log('Load more button not found.');
        }
    })();
    """

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_to_run],  # List of JS snippets to execute
        wait_for_timeout=3000,  # Wait 3 seconds after initial load AND after JS execution
        # wait_for="div.newly-loaded-content",  # Or wait for a specific element
        extraction_strategy=css_extractor,  # Extract data after JS runs
        output_formats=['markdown', 'extracted_content']
    )

    # Ensure JS is enabled in BrowserConfig (it is by default)
    browser_conf = BrowserConfig(headless=True, java_script_enabled=True)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="URL_OF_DYNAMIC_PAGE_HERE",  # Replace with actual URL
            config=run_conf
        )

        if result and result.success:
            print("Dynamic page crawl successful!")
            print(f"Fit Markdown Length: {len(result.markdown.fit_markdown)}")
            if result.extracted_content:
                try:
                    extracted_data = json.loads(result.extracted_content)
                    print(f"Extracted Content Preview: {json.dumps(extracted_data, indent=2)[:500]}...")
                except json.JSONDecodeError:
                    print(f"Extracted Content (non-JSON): {result.extracted_content[:500]}...")
        else:
            print(f"Crawl Failed: {result.error_message}")

if __name__ == "__main__":
    # Replace with an actual URL that loads content dynamically for testing
    # asyncio.run(crawl_dynamic_page())
    print("Please replace 'URL_OF_DYNAMIC_PAGE_HERE' and uncomment the line above to run the dynamic example.")
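Under the hood, both approaches reduce to the same idea: poll the page until a condition holds (an element has appeared, content has loaded) or a timeout expires. A minimal pure-Python sketch of that polling pattern, not Crawl4AI's actual implementation, looks like this:

```python
import asyncio

async def wait_for_condition(predicate, timeout_s: float, interval_s: float = 0.05) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Conceptually what a wait-for selector does in the browser: keep checking
    whether the target element has rendered, and give up after the timeout.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while loop.time() < deadline:
        if predicate():
            return True
        await asyncio.sleep(interval_s)
    return False

async def demo() -> bool:
    state = {"rendered": False}

    async def render_later():
        await asyncio.sleep(0.2)  # simulate slow JavaScript rendering
        state["rendered"] = True

    task = asyncio.ensure_future(render_later())
    ok = await wait_for_condition(lambda: state["rendered"], timeout_s=2.0)
    await task
    return ok

print(asyncio.run(demo()))  # -> True
```

This also makes the pitfall above intuitive: if the timeout is shorter than the site's rendering time, the wait gives up and the crawler sees an incomplete page.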
That wraps up our look at crawling dynamic pages with Crawl4AI. Interested readers can keep following this series; more practical tips will be shared from time to time.