Demystifying the AI Crawler Framework Crawl4AI, Part 2: How to Crawl Dynamic Web Pages
Author: WeChat article

1. Overview
With front-end frameworks such as React and Vue increasingly popular, modern websites rely heavily on JavaScript to load their content. Crawling such dynamic sites is a challenge for traditional crawler frameworks, but Crawl4AI handles it with a few simple configuration options.
2. Hands-on Examples
Crawl4AI offers two ways to crawl dynamic page content. The first is to make the crawler wait until a specified selector has rendered before extracting; the selector can be written in either XPath or CSS form. The key is the `wait_for` property of the `CrawlerRunConfig` object. One pitfall: `wait_for` must be set together with the `wait_for_timeout` property to take effect. When you are unsure how long the target site needs to render, it is best to give `wait_for_timeout` a generous value. The code is as follows:
import asyncio
import json

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    LLMExtractionStrategy,
    LLMConfig,
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from dotenv import load_dotenv

load_dotenv()

async def scraper(url: str, instruction: str) -> list:
    # Browser configuration
    browser_config = BrowserConfig(
        headless=True,
        browser_mode='dedicated',
        user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36',
    )
    llm_config = LLMConfig(provider="deepseek/deepseek-chat", api_token="YOUR_API_KEY_HERE")
    strategy = LLMExtractionStrategy(
        llm_config=llm_config,
        instruction=instruction,
        verbose=True
    )
    # Crawler configuration
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.DISABLED,
        wait_until="domcontentloaded",  # wait until the DOM of the page has been loaded
        page_timeout=1800000000,
        wait_for=".contentDiv .tableList .title_txt",
        wait_for_timeout=150000000,
        css_selector="#newsList",
        extraction_strategy=strategy
    )
    cs = AsyncPlaywrightCrawlerStrategy(browser_config=browser_config)

    # Run the AI-powered crawler
    async with AsyncWebCrawler(crawler_strategy=cs) as crawler:
        result = await crawler.arun(
            config=crawler_config,
            url=url,
        )
        # print(f"Parsed Markdown data:\n{result.markdown}")
        # print(f"Extracted content:\n{result.extracted_content}")
        re = []
        for x in json.loads(result.extracted_content):
            print(x["content"])
            re.append(x["content"])
    return re

if __name__ == "__main__":
    re = asyncio.run(scraper(
        "http://credit.jt.jiangxi.gov.cn/list_sgs.shtml?id=4bb7bd0997804feeb2d4d1e7065bd026",
        "From the crawled page content, extract only the names of the administrative counterparties; do not include any other information, especially not URLs"
    ))
    print(re)
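As a side note, `LLMExtractionStrategy` returns `extracted_content` as a JSON string encoding a list of blocks, which is why the loop above calls `json.loads` first. That parsing step can be sketched and exercised in isolation; the sample payload below is hypothetical, since the exact fields depend on the LLM's output:

```python
import json

# Hypothetical stand-in for result.extracted_content: a JSON array of
# block objects, each carrying a "content" field. Field names other than
# "content" are illustrative only.
sample_extracted = json.dumps([
    {"index": 0, "content": "Example Construction Co., Ltd."},
    {"index": 1, "content": "Example Transport Group"},
])

def collect_contents(extracted: str) -> list:
    """Mirror the loop in scraper(): keep only each block's 'content' field."""
    return [block["content"] for block in json.loads(extracted)]

print(collect_contents(sample_extracted))
# -> ['Example Construction Co., Ltd.', 'Example Transport Group']
```

Testing this step offline is useful because a malformed LLM response will raise `json.JSONDecodeError` here, not inside the crawler.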
The second approach is to have Crawl4AI execute JavaScript on the page to trigger the dynamic loading itself, by setting the `js_code` property of the `CrawlerRunConfig` object. The code is as follows:
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy  # For example

async def crawl_dynamic_page():
    print("\n--- Crawling Dynamic Page with JS Interaction ---")

    # Example schema for CSS extraction (adapt to the target site)
    schema = {
        "items": {
            "selector": "div.product-item",
            "type": "list",
            "fields": {
                "title": "h2",
                "price": ".price"
            }
        }
    }
    css_extractor = JsonCssExtractionStrategy(schema)

    # JavaScript to execute on the page (e.g., click a 'Load More' button)
    # Note: Selector needs to match the target website
    js_to_run = """
    (async () => {
        const loadMoreButton = document.querySelector('button#load-more');
        if (loadMoreButton) {
            console.log('Clicking load more button...');
            loadMoreButton.click();
            // Wait a bit for content to potentially load after click
            await new Promise(resolve => setTimeout(resolve, 2000));
            console.log('Waited after click.');
        } else {
            console.log('Load more button not found.');
        }
    })();
    """

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_to_run],  # List of JS snippets to execute
        wait_for_timeout=3000,  # Wait 3 seconds after initial load AND after JS execution
        # wait_for="div.newly-loaded-content",  # Or wait for a specific element
        extraction_strategy=css_extractor,  # Extract data after JS runs
        output_formats=['markdown', 'extracted_content']
    )

    # Ensure JS is enabled in BrowserConfig (it is by default)
    browser_conf = BrowserConfig(headless=True, java_script_enabled=True)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="URL_OF_DYNAMIC_PAGE_HERE",  # Replace with actual URL
            config=run_conf
        )

        if result and result.success:
            print("Dynamic page crawl successful!")
            print(f"Fit Markdown Length: {len(result.markdown.fit_markdown)}")
            if result.extracted_content:
                try:
                    extracted_data = json.loads(result.extracted_content)
                    print(f"Extracted Content Preview: {json.dumps(extracted_data, indent=2)[:500]}...")
                except json.JSONDecodeError:
                    print(f"Extracted Content (non-JSON): {result.extracted_content[:500]}...")
        else:
            print(f"Crawl Failed: {result.error_message}")

if __name__ == "__main__":
    # Replace with an actual URL that loads content dynamically for testing
    # asyncio.run(crawl_dynamic_page())
    print("Please replace 'URL_OF_DYNAMIC_PAGE_HERE' and uncomment the line above to run the dynamic example.")
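Under the hood, both approaches reduce to the same idea: poll the page until a condition holds (an element has appeared, content has loaded) or a timeout expires. A minimal pure-Python sketch of that polling pattern, not Crawl4AI's actual implementation, looks like this:

```python
import asyncio

async def wait_for_condition(predicate, timeout_s: float, interval_s: float = 0.05) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Conceptually what a wait-for selector does in the browser: keep checking
    whether the target element has rendered, and give up after the timeout.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while loop.time() < deadline:
        if predicate():
            return True
        await asyncio.sleep(interval_s)
    return False

async def demo() -> bool:
    state = {"rendered": False}

    async def render_later():
        await asyncio.sleep(0.2)  # simulate slow JavaScript rendering
        state["rendered"] = True

    task = asyncio.ensure_future(render_later())
    ok = await wait_for_condition(lambda: state["rendered"], timeout_s=2.0)
    await task
    return ok

print(asyncio.run(demo()))  # -> True
```

This also makes the pitfall above intuitive: if the timeout is shorter than the site's rendering time, the wait gives up and the crawler sees an incomplete page.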
That wraps up our look at crawling dynamic pages with Crawl4AI. Interested readers can keep following this series; more practical tips will be shared from time to time.