Add production-ready scraper with README
26
README.md
Normal file
@@ -0,0 +1,26 @@
# GISP Registry Scraper (API-Direct)

This project exposes an HTTP API for looking up entries on the Russian industry portal GISP (gisp.gov.ru). Instead of driving the UI with brittle browser automation (Selenium), it talks directly to the portal's internal REST API.

## Features

- **Fast & Reliable**: Skips browser rendering and DOM scraping entirely.
- **Filtering**: Queries by registry number without the UI's default filter constraints (such as the date range).
- **Lightweight**: No Selenium Grid or heavy headless browsers required.

## API Usage

- **Endpoint**: `/scrape/{registry_number}`
- **Example**: `GET /scrape/10084557`

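As a rough illustration of calling this endpoint, here is a minimal client sketch. It assumes the app is reachable at `http://localhost:8000` and that the endpoint returns JSON; the response schema is not documented in this README, so the code simply returns whatever JSON it receives. The helper names `build_scrape_url` and `fetch_registry_entry` are illustrative, and only the standard library is used so the snippet needs no extra dependencies.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: the service runs locally on port 8000.
DEFAULT_BASE_URL = "http://localhost:8000"


def build_scrape_url(registry_number: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Build the URL for GET /scrape/{registry_number}."""
    # Percent-encode the path segment so unexpected characters cannot alter the path.
    segment = quote(str(registry_number), safe="")
    return f"{base_url.rstrip('/')}/scrape/{segment}"


def fetch_registry_entry(
    registry_number: str,
    base_url: str = DEFAULT_BASE_URL,
    timeout: float = 30.0,
):
    """Call the scraper and return the decoded JSON body.

    Assumption: the endpoint responds with JSON; its schema is not specified here.
    """
    url = build_scrape_url(registry_number, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the app running, `fetch_registry_entry("10084557")` would issue the `GET /scrape/10084557` request from the example above.
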
## Setup

1. **Environment**: Ensure you have Python 3.12+.
2. **Install Dependencies**:
   ```bash
   pip install fastapi uvicorn httpx
   ```
3. **Run the App**:
   ```bash
   uvicorn app.main:app --host 0.0.0.0 --port 8000
   ```

## Why API-Direct?

The GISP portal's UI is built on a complex DevExtreme grid that is prone to race conditions under automation and applies default date filters. Targeting the `/pub/prod/b/` endpoint directly removes the need for containerized browser nodes and significantly reduces scraping latency.
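
To make the idea concrete, the sketch below shows what a direct query might look like. Only the host `gisp.gov.ru` and the `/pub/prod/b/` path come from this README. The `https` scheme, the HTTP method, the DevExtreme-style load options (`filter`, `skip`, `take`), and the field name `product_reg_number` are assumptions for illustration and would need to be checked against the portal's actual traffic.

```python
import json
import urllib.request

# Host and path are from the README; the https scheme is an assumption.
GISP_ENDPOINT = "https://gisp.gov.ru/pub/prod/b/"

# Hypothetical field name; the real column name must be read from the portal's requests.
REGISTRY_FIELD = "product_reg_number"


def build_load_options(registry_number: str, take: int = 10) -> dict:
    """Build DevExtreme-style load options that filter on a single field.

    No date-range condition is included: when querying directly,
    the UI's default date filter is simply never sent.
    """
    return {
        "filter": [REGISTRY_FIELD, "=", str(registry_number)],
        "skip": 0,
        "take": take,
    }


def build_request(registry_number: str) -> urllib.request.Request:
    """Prepare (but do not send) a JSON request. POST is an assumption."""
    body = json.dumps(build_load_options(registry_number)).encode("utf-8")
    return urllib.request.Request(
        GISP_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the prepared request would be a single `urllib.request.urlopen(build_request("10084557"), timeout=30)` call. The project lists `httpx` among its dependencies, so the actual implementation presumably uses that client; the standard library is used here only to keep the sketch dependency-free.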