Getting started collecting data with Spider
Data Curation
Collecting data with Spider can be fast and rewarding if done with some simple preliminary steps. Use the dashboard to collect data seamlessly across the internet with scheduled updates. You have two main ways of collecting data using Spider. The first and simplest is to use the UI available for scraping. The alternative is to use the API to programmatically access the system and perform actions.
Crawling (Website)
- Register or log in to your account using email or GitHub.
- Purchase credits to kickstart crawls, with pay-as-you-go billing after credits deplete.
- Configure crawl settings to fit the workflows you need.
- Navigate to the dashboard and enter a website URL, or ask a question to get a URL that should be crawled.
- Crawl the website and export/download the data as needed.
Crawling (API)
- Register or log in to your account using email or GitHub.
- Purchase credits to kickstart crawls, with pay-as-you-go billing after credits deplete.
- Configure crawl settings to fit the workflows you need.
- Navigate to API keys and create a new secret key.
- Go to the API docs page to see how the API works and perform crawls with code examples.
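If you go the API route, a crawl boils down to an authenticated HTTP request using your secret key. The sketch below is a minimal Python example assuming a JSON REST endpoint; the endpoint path, header format, and parameter names are illustrative, so confirm the exact values on the API docs page.

```python
import os
import requests

# Minimal crawl request sketch. The endpoint, headers, and parameter names
# are assumptions for illustration -- verify them against the API docs.
API_URL = "https://api.spider.cloud/crawl"  # assumed endpoint
headers = {
    "Authorization": f"Bearer {os.environ['SPIDER_API_KEY']}",  # secret key from the dashboard
    "Content-Type": "application/json",
}
payload = {
    "url": "https://example.com",  # the website to crawl
    "limit": 5,                    # assumed: maximum number of pages to crawl
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
for page in response.json():       # assumed: the API returns a list of page objects
    print(page.get("url"))
```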
Crawl Configuration
Configuring your account for how you would like to crawl can help save costs and improve the effectiveness of the content you collect. Some of the configurations include Premium Proxies, Headless Browser Rendering, Webhooks, and Budgeting.
Proxies
Using proxies with our system is straightforward. Simply toggle the setting on if you want all requests to use a proxy, increasing the chance of not being blocked.
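In the API, the same proxy toggle maps to a request parameter sent with the same POST call as in the earlier sketch. The field name below is an assumption for illustration; the actual name is listed in the API docs.

```python
# "premium_proxies" is an assumed parameter name -- check the API docs for the exact field.
payload = {
    "url": "https://example.com",
    "premium_proxies": True,  # route every request through a premium proxy
}
```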
Headless Browser
If you need pages that require JavaScript to be executed, the headless browser config is for you. Enabling it will run all requests through a real Chrome browser to render JavaScript-dependent pages.
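As with proxies, headless rendering can also be requested per call through the API. The parameter below is an assumption for illustration; the real field may differ, so check the API docs.

```python
# Assumed parameter for JavaScript rendering; sent with the same POST call as earlier.
payload = {
    "url": "https://example.com",
    "request": "chrome",  # assumed: render pages in a headless Chrome browser
}
```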
Crawl Budget Limits
One of the key things you may need to do before starting a crawl is setting up crawl budgets. Crawl budgets let you determine how many pages you are going to crawl for a website. Setting a budget will save you costs when dealing with large websites where you only want certain data points. The example below shows adding an asterisk (*) to match all routes with a limit of 50 pages maximum. These settings can be overridden by the website configuration or by parameters if you are using the API.
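When using the API, the budget can be expressed as a parameter on the crawl request. The shape below (a mapping of route patterns to page limits) follows the asterisk example above, but the field names are illustrative, so confirm them against the API docs or your dashboard settings.

```python
# Assumed budget shape: "*" matches every route, capped at 50 pages total.
payload = {
    "url": "https://example.com",
    "budget": {"*": 50},  # crawl at most 50 pages across all routes
}
```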
Crawling and Scraping Websites
Collecting data can be done in many ways and for many reasons. Leveraging our state-of-the-art technology allows you to create fast workloads that can process content from multiple locations. At the time of writing, we have started to focus on our data processing API instead of the dashboard. The API has much more flexibility than the UI for performing advanced workloads like batching, formatting, and so on.
Transforming Data
The API has more features for gathering content in different formats and transforming the HTML as needed. You can transform the content from HTML to Markdown and feed it to an LLM to better handle the learning aspect. The API is a first-class citizen of the application; the UI will gain the features provided by the API as the need arises.
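As an example, a hedged sketch of requesting Markdown instead of raw HTML might look like the following; the "return_format" field and endpoint are assumptions, so check the API docs for the exact names.

```python
import os
import requests

# Assumed endpoint and field names; "return_format" asks the API to convert
# the crawled HTML to Markdown before returning it.
response = requests.post(
    "https://api.spider.cloud/crawl",  # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['SPIDER_API_KEY']}"},
    json={
        "url": "https://example.com",
        "limit": 5,
        "return_format": "markdown",  # assumed: HTML -> Markdown conversion
    },
)
response.raise_for_status()
for page in response.json():
    print(page.get("content", "")[:200])  # Markdown ready to feed to an LLM
```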
Leveraging Open Source
One of the reasons Spider is the ultimate data-curation service for scraping is the power of open source. The core of the engine is completely available on GitHub under the MIT license to show what is in store. We are constantly working on crawler features, including performance, with plans to maintain the project for the long run.
Subscription and Spider Credits
The platform allows purchasing credits that give you the ability to crawl at any time. When you purchase credits, a crawl subscription is created that allows you to continue using the platform when your credits deplete. The limits provided correlate with the amount of credits purchased: for example, $5 in credits gives you about a $40 spending limit, $10 in credits gives $80, and so on (roughly eight times the purchased amount).
For pay-as-you-go crawling, you need to be approved first or maintain a credit balance.
Your highest credit purchase directly determines how much spending is allowed on the platform. You can view your usage and credits on the usage limits page.