x-crawl Introduction
x-crawl is a flexible Node.js AI-assisted crawler library designed to make crawler work more efficient, intelligent, and convenient. It combines the power of traditional crawling with advanced AI assistance to tackle the challenges of extracting data from dynamic and complex websites.
x-crawl Features
🤖 AI Assistance
The AI assistance feature is one of the standout aspects of x-crawl. It leverages powerful AI models, currently based on OpenAI, to simplify many tedious operations. This feature allows the crawler to understand and parse the semantic information of web pages, ensuring more accurate data extraction even when class names or structures change.
🖋️ Flexible Writing
x-crawl offers a single crawling API that is suitable for multiple configurations, each with its own advantages. This flexibility allows developers to tailor the crawler to their specific needs, whether it's crawling dynamic pages, static pages, interface data, or file data.
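For instance, the same crawlPage API accepts a plain URL string, an array of targets, or a detailed configuration object. A minimal sketch (the URLs are placeholders):

import { createCrawl } from 'x-crawl';

const crawlApp = createCrawl();

// 1. A single URL string
crawlApp.crawlPage('https://example.com');

// 2. An array of targets
crawlApp.crawlPage(['https://example.com/a', 'https://example.com/b']);

// 3. A detailed configuration object
crawlApp.crawlPage({
  targets: ['https://example.com/a', 'https://example.com/b'],
  maxRetry: 2
});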
⚙️ Multiple Uses
With support for crawling dynamic pages, static pages, interface data, and file data, x-crawl is versatile enough to handle a wide range of tasks within a single library.
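Each kind of data maps onto a dedicated method of the same application instance. A minimal sketch (the URLs are placeholders):

import { createCrawl } from 'x-crawl';

const crawlApp = createCrawl();

// Dynamic or static pages
crawlApp.crawlPage('https://example.com');

// Interface (API) data
crawlApp.crawlData('https://example.com/api/list');

// File resources such as images
crawlApp.crawlFile({
  targets: ['https://example.com/logo.png'],
  storeDirs: './upload'
});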
⚒️ Control Page
The ability to control dynamic pages is a key feature of x-crawl. It supports automated operations, keyboard input, and event operations, which are essential for interacting with modern web pages that rely heavily on JavaScript.
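Because crawlPage exposes the underlying Puppeteer page and browser instances (as in the full example further below), standard page operations such as typing and clicking are available. A hedged sketch, with hypothetical selectors:

import { createCrawl } from 'x-crawl';

const crawlApp = createCrawl();

crawlApp.crawlPage('https://example.com/login').then(async (res) => {
  const { page, browser } = res.data;

  // Keyboard input into a form field
  await page.type('#username', 'demo');

  // Event operation: click the submit button
  await page.click('#submit');

  await browser.close();
});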
👀 Device Fingerprinting
To avoid fingerprint recognition and tracking, x-crawl offers both zero-configuration random fingerprints and fully custom fingerprint configuration. This makes it much harder for target sites to identify and track the crawler as a single device.
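A minimal sketch of the zero-configuration route, assuming the enableRandomFingerprint option described in the x-crawl documentation (per-target custom fingerprint fields also exist but are omitted here):

import { createCrawl } from 'x-crawl';

// Randomize the device fingerprint for each crawl target
const crawlApp = createCrawl({ enableRandomFingerprint: true });

crawlApp.crawlPage('https://example.com');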
🔥 Asynchronous/Synchronous
x-crawl supports both asynchronous and synchronous crawling modes without the need to switch crawling APIs. This feature allows for more efficient use of resources and better handling of concurrent tasks.
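Switching between the two is a single configuration property rather than a different API. A sketch assuming the mode option of createCrawl:

import { createCrawl } from 'x-crawl';

// 'async' (the default) crawls targets concurrently;
// 'sync' crawls them one after another.
const crawlApp = createCrawl({ mode: 'sync' });

crawlApp.crawlPage(['https://example.com/a', 'https://example.com/b']);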
⏱️ Interval Crawling
With options for no interval, fixed interval, and random interval, x-crawl can be fine-tuned to handle high concurrency scenarios effectively.
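The interval is set through the intervalTime option: a number gives a fixed interval, and a max/min object gives a random interval in that range (the same form used in the full example further below):

import { createCrawl } from 'x-crawl';

// Fixed interval: wait 1000 ms between targets
const fixedApp = createCrawl({ intervalTime: 1000 });

// Random interval: wait between 1000 and 2000 ms between targets
const randomApp = createCrawl({ intervalTime: { max: 2000, min: 1000 } });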
🔄 Failed Retry
To avoid crawling failures due to temporary problems, x-crawl allows customization of the number of retries. This ensures that the crawler can overcome transient issues and complete its tasks successfully.
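The retry count is the maxRetry option, settable on the whole application or overridden per call:

import { createCrawl } from 'x-crawl';

// Retry each failed target up to 3 times before giving up
const crawlApp = createCrawl({ maxRetry: 3 });

// Override the retry count for a single call
crawlApp.crawlPage({ targets: ['https://example.com'], maxRetry: 5 });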
➡️ Rotation Proxy
x-crawl includes automatic proxy rotation tied to failed retries, a customizable error count, and configurable HTTP status codes. This feature is crucial for keeping the chance of being blocked by target websites low.
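A sketch of proxy rotation, assuming the proxy option with urls, switchByErrorCount, and switchByHttpStatus fields from the x-crawl documentation (the proxy URLs are placeholders):

import { createCrawl } from 'x-crawl';

const crawlApp = createCrawl();

crawlApp.crawlPage({
  targets: ['https://example.com'],
  maxRetry: 10,
  proxy: {
    urls: ['http://localhost:14892', 'http://localhost:28932'],
    switchByErrorCount: 3, // switch proxy after 3 errors on the current one
    switchByHttpStatus: [401, 403] // switch proxy on these response codes
  }
});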
🚀 Priority Queue
Based on the priority of a single crawl target, x-crawl can prioritize certain tasks over others, ensuring that critical data is crawled first.
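Priority is expressed per target in the detailed target form; targets with a larger priority value are crawled first:

import { createCrawl } from 'x-crawl';

const crawlApp = createCrawl();

crawlApp.crawlPage([
  { url: 'https://example.com/important', priority: 10 },
  { url: 'https://example.com/normal', priority: 1 }
]);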
🧾 Crawl Information
x-crawl provides controllable crawl logging, printed as colored output in the terminal, which makes it easier for developers to monitor and debug the crawling process.
🦾 TypeScript
x-crawl is implemented in TypeScript and ships its own type definitions, with complete typing through generics. This ensures better type safety and a better developer experience.
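For example, crawlData accepts a generic parameter describing the expected shape of the interface data (the ListItem interface here is hypothetical):

import { createCrawl } from 'x-crawl';

// Hypothetical shape of the interface data being crawled
interface ListItem {
  id: number;
  name: string;
}

const crawlApp = createCrawl();

// The generic parameter types the crawl result's payload
crawlApp.crawlData<ListItem[]>('https://example.com/api/list').then((res) => {
  console.log(res.data);
});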
x-crawl AI-assisted Crawler
The combination of crawler and AI technology addresses the challenge of website updates that often change class names or structures. Traditional crawlers may fail in such scenarios because they rely on fixed elements to locate and extract data. x-crawl, however, uses AI to understand and parse the semantic information of web pages, allowing it to extract the required data more efficiently and intelligently.
Example Usage
Here's an example of how x-crawl can be used to obtain images of high-rated vacation rentals:
import { createCrawl, createCrawlOpenAI } from 'x-crawl';

// Create a crawler application
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
});

// Create an AI application
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
});

// crawlPage is used to crawl pages
crawlApp.crawlPage('https://www.airbnb.cn/s/select_homes').then(async (res) => {
  const { page, browser } = res.data;

  // Wait for the element to appear on the page and get the HTML
  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]';
  await page.waitForSelector(targetSelector);
  const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML);

  // Let AI obtain image links and remove duplicates
  const srcResult = await crawlOpenAIApp.parseElements(
    highlyHTML,
    `Get the image link, don't source it inside, and de-duplicate it`
  );

  browser.close();

  // crawlFile is used to crawl file resources
  crawlApp.crawlFile({
    targets: srcResult.elements.map((item) => item.src),
    storeDirs: './upload'
  });
});
Tips
x-crawl can even send the entire HTML of a page to the AI for assistance, although this consumes more tokens. The AI can operate on the HTML content directly, so even if the website updates and changes its class names or structure, x-crawl can still crawl the data normally: it no longer relies on fixed class names or structures, but instead lets the AI understand and parse the semantic information of the web page.
HTML Example
Below is an example of the HTML that the AI needs to process:
<div data-pageslot="true" class="c1yo0219 dir dir-ltr">
  <div class="c121p4jg dir dir-ltr" data-reactroot="">
    <div aria-describedby="carousel-description" aria-labelledby="carousel-label" class="s7q4c1d rd7fm2t dir dir-ltr" role="group">
      <div class="hztl681 dir dir-ltr" id="carousel-label">
        <div class="htgr43m dir dir-ltr">
          <div class="dir dir-ltr">
            <h2 class="h1436ahv dir dir-ltr">
              Highly rated vacation homes in Wichita
            </h2>
            <p class="bqwnmiz swd4c9o dir dir-ltr">
              Guests consistently rate these listings highly for location, cleanliness, and more.
            </p>
          </div>
        </div>
      </div>
      <div class="dbldy2s dir dir-ltr" id="carousel-description">
        Showing 4 of 12 items
      </div>
      <!-- More HTML content -->
    </div>
  </div>
</div>
This example demonstrates the kind of HTML structure that x-crawl can handle, with the AI assisting in extracting the necessary data.