# Casino Affiliate Crawler

A headless-browser crawler that scrapes casino affiliate ranking pages, stores the extracted data in PostgreSQL, and provides a React back-office dashboard for viewing results.

## Architecture

```
crawler/                        # Backend (Node.js / Express)
├── src/
│   ├── app.js                  # Express server entry point
│   ├── setup-db.js             # Database initialisation script
│   ├── db.js                   # PostgreSQL pool config
│   ├── middleware/auth.js      # JWT authentication middleware
│   ├── routes/
│   │   ├── auth.js             # Login, register, profile endpoints
│   │   └── crawler.js          # Crawl data & trigger endpoints
│   └── services/
│       ├── crawler.js          # Puppeteer crawl + DOM extraction
│       └── scheduler.js        # Periodic crawl job (every hour)
├── screenshots/                # Full-page screenshots per crawl
└── package.json

casino-dashboard/               # Frontend (React / Vite)
├── src/
│   ├── api.js                  # Axios client + auth helpers
│   ├── App.jsx                 # Router + AuthProvider wrapper
│   └── components/
│       ├── Login.jsx           # Sign-in form with JWT
│       ├── Dashboard.jsx       # Crawl history list + run button
│       ├── CrawlDetail.jsx     # Casino table, screenshot viewer
│       └── Sidebar.jsx         # Navigation shell
└── package.json
```

## Prerequisites

- **Node.js** 18+
- **Google Chrome** installed on the system
- **PostgreSQL** reachable at `192.168.21.197:5432` with user `postgres`

## Quick Start

### 1. Install dependencies

```bash
# Backend (run from the repository root)
cd crawler
npm install

# Frontend (sibling directory of crawler/)
cd ../casino-dashboard
npm install
```

### 2. Initialise the database

```bash
# The setup script lives in the backend project
cd ../crawler
node src/setup-db.js
```

This creates the `casino_crawler` database and its tables (`crawls`, `casinos`, `users`), and seeds a default admin user:

| Username | Password |
|----------|----------|
| `admin` | `admin123` |

### 3. Start both servers

```bash
# Terminal 1 – Backend
cd crawler
npm start

# Terminal 2 – Frontend
cd casino-dashboard
npm run dev
```

- **Backend API**: http://localhost:3001
- **Frontend Dashboard**: http://localhost:5173
- The first crawl runs automatically about 5 s after the backend starts, then every hour.

## How It Works

### Crawler (`src/services/crawler.js`)

The crawler uses Puppeteer with `puppeteer-extra-plugin-stealth` to get past CloudFront bot detection. Each run:

1. Navigates to the target affiliate ranking page
2. Waits for network idle, plus a 5 s buffer for lazy-loaded content
3. Takes a full-page screenshot and stores it in `screenshots/`
4. Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
5. Inserts the records into PostgreSQL
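
Steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `crawlPage`, `launchStealthBrowser`, the `extract` callback, and the 60 s navigation timeout are assumptions, and the database insert from step 5 is left out. The browser launcher is injectable so the flow can be exercised without Chrome.

```javascript
const path = require('node:path');

// Loads Puppeteer lazily so this file can be required without the packages installed.
function launchStealthBrowser() {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin());
  return puppeteer.launch({ headless: true });
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: crawl one page and return the extracted rows.
async function crawlPage(url, screenshotFile, options = {}) {
  const {
    launch = launchStealthBrowser, // assumption: swap in a fake browser for tests
    settleMs = 5000,               // extra buffer for lazy-loaded content
    extract = () => [],            // runs inside the page; placeholder extractor
  } = options;

  const browser = await launch();
  try {
    const page = await browser.newPage();
    // Steps 1-2: navigate, wait for the network to go idle, then wait a little longer.
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
    await sleep(settleMs);
    // Step 3: full-page screenshot into screenshots/.
    await page.screenshot({
      path: path.join('screenshots', screenshotFile),
      fullPage: true,
    });
    // Step 4: run the site-specific extractor in the page context.
    const casinos = await page.evaluate(extract);
    return { url, screenshotFile, casinos };
  } finally {
    // Always release the browser, even when navigation or extraction throws.
    await browser.close();
  }
}
```

Closing the browser in `finally` keeps a failed crawl from leaving a Chrome process behind, which matters for a job that repeats every hour.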

Two targeted extractors are implemented:

| Site | Selector Strategy |
|------|------------------|
| **top10onlineslots.co.uk** | Finds divs containing "Get Bonus" text + logo `<img>`, pulls bonus from child spans |
| **ubet.co.uk** | Targets `.mainProduct.row-index-N` cards, reads `wss-vendorName-*` for name and `coupon-container` for the offer |

A generic fallback covers any future affiliate site.
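
The choice between the two targeted extractors and the generic fallback could be made by hostname, along these lines. The function name `pickExtractor` and the extractor keys are hypothetical; only the two hostnames come from the table above.

```javascript
// Maps a known hostname to the key of its extractor (keys are illustrative).
const EXTRACTOR_BY_HOST = new Map([
  ['top10onlineslots.co.uk', 'top10onlineslots'],
  ['ubet.co.uk', 'ubet'],
]);

// Returns the extractor key for a page URL, falling back to 'generic'.
function pickExtractor(pageUrl) {
  // URL parsing lowercases the hostname; strip a leading "www." before lookup.
  const host = new URL(pageUrl).hostname.replace(/^www\./, '');
  return EXTRACTOR_BY_HOST.get(host) ?? 'generic';
}
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives when a site name appears in a query string or path.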

### Scheduled Runs

Every hour the scheduler triggers crawls for all configured sites (see `src/services/scheduler.js`). A crawl can also be triggered manually via the button in the dashboard or a POST to `/api/crawler/run-all`.
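
A scheduler with this behaviour (a first run shortly after startup, then one run per hour) might look like the sketch below. `startScheduler`, its options, and the overlap guard are assumptions for illustration; the real `scheduler.js` may be built differently.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical scheduler: calls runAll once after initialDelayMs, then every intervalMs.
function startScheduler(runAll, options = {}) {
  const {
    initialDelayMs = 5000,
    intervalMs = HOUR_MS,
    // Timer functions are injectable so the schedule can be tested without waiting.
    timers = { setTimeout, setInterval, clearTimeout, clearInterval },
  } = options;

  let running = false;
  const tick = async () => {
    if (running) return; // skip this tick if the previous crawl is still in progress
    running = true;
    try {
      await runAll();
    } catch (err) {
      console.error('Scheduled crawl failed:', err.message);
    } finally {
      running = false;
    }
  };

  const first = timers.setTimeout(tick, initialDelayMs);
  const repeat = timers.setInterval(tick, intervalMs);

  // Returns a function that cancels both timers.
  return function stop() {
    timers.clearTimeout(first);
    timers.clearInterval(repeat);
  };
}
```

The `running` flag is a design choice of this sketch: a slow crawl that overruns the hour is not started a second time in parallel.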

## Database Schema

### `crawls`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| url | TEXT | Crawled page URL |
| site_name | VARCHAR(255) | Human-readable site label |
| crawled_at | TIMESTAMP | When the crawl ran |
| status | VARCHAR(50) | `completed` or `failed: ...` |
| screenshot_path | TEXT | Filename in `screenshots/` |

### `casinos`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| crawl_id | INT FK → crawls.id | Which crawl this casino belongs to |
| position | INT | Rank on the page |
| casino_name | VARCHAR(255) | Casino brand name |
| url | TEXT | Affiliate redirect URL |
| bonus_offer | TEXT | Welcome bonus / free spins text |
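
To illustrate how extracted rows map onto these columns, the sketch below builds one parameterised multi-row `INSERT` for the `casinos` table. The helper name `buildCasinoInsert` and the shape of the input objects (`position`, `name`, `url`, `bonusOffer`) are assumptions; the column names are taken from the table above.

```javascript
// Builds a parameterised INSERT for all casinos found in one crawl.
// Returns null when there is nothing to insert.
function buildCasinoInsert(crawlId, casinos) {
  if (casinos.length === 0) return null;

  const columns = ['crawl_id', 'position', 'casino_name', 'url', 'bonus_offer'];
  const values = [];

  const rows = casinos.map((casino, rowIndex) => {
    values.push(crawlId, casino.position, casino.name, casino.url, casino.bonusOffer);
    // PostgreSQL placeholders are 1-based: $1..$5 for row 0, $6..$10 for row 1, ...
    const base = rowIndex * columns.length;
    const placeholders = columns.map((_, i) => `$${base + i + 1}`);
    return `(${placeholders.join(', ')})`;
  });

  return {
    text: `INSERT INTO casinos (${columns.join(', ')}) VALUES ${rows.join(', ')}`,
    values,
  };
}
```

With node-postgres, an object of this shape can be passed straight to `pool.query(query)`. Using placeholders rather than string concatenation keeps scraped text such as bonus offers from being interpreted as SQL.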

### `users`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| username | VARCHAR(100) UNIQUE | Login name |
| password_hash | VARCHAR(255) | bcrypt hash |
| role | VARCHAR(50) | Currently always `admin` |
| created_at | TIMESTAMP | Account creation time |

## API Endpoints

All authenticated endpoints require an `Authorization: Bearer <token>` header.

### Auth

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/auth/login` | Login, returns JWT + user object |
| POST | `/api/auth/register` | Create new admin user |
| GET | `/api/auth/me` | Current user profile |

### Crawler

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/crawler/all` | All crawls with nested casino arrays |
| GET | `/api/crawler/:id` | Single crawl detail + screenshot path |
| POST | `/api/crawler/run-all` | Trigger immediate crawl of all sites |
| POST | `/api/crawler/run` | Crawl a single custom URL (body: `{url, siteName}`) |
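
A small client for the auth and crawler endpoints could be sketched as below, using the `fetch` built into Node.js 18+. The login request body (`{ username, password }`) and the response field names `token` and `user` are assumptions, since only "JWT + user object" is documented; `createApiClient` itself is a hypothetical helper.

```javascript
// Hypothetical API client; fetchImpl is injectable so it can be tested offline.
function createApiClient(baseUrl, fetchImpl = fetch) {
  let token = null;

  async function request(method, path, body) {
    const headers = { 'Content-Type': 'application/json' };
    if (token) headers.Authorization = `Bearer ${token}`;

    const res = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers,
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) {
      throw new Error(`${method} ${path} failed with HTTP ${res.status}`);
    }
    return res.json();
  }

  return {
    async login(username, password) {
      const data = await request('POST', '/api/auth/login', { username, password });
      token = data.token; // assumption: the JWT is returned in a "token" field
      return data.user;   // assumption: the user object is returned in "user"
    },
    listCrawls: () => request('GET', '/api/crawler/all'),
    runAll: () => request('POST', '/api/crawler/run-all'),
    runOne: (url, siteName) => request('POST', '/api/crawler/run', { url, siteName }),
  };
}
```

Typical use would be `const api = createApiClient('http://localhost:3001'); await api.login('admin', 'admin123'); await api.runAll();`, after which every request carries the bearer token.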

### Health

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/health` | DB connectivity check |

## Adding New Sites

1. Add the site's config object to the `sites[]` array in `src/services/scheduler.js`.
2. Write a new extractor method in `src/services/crawler.js` and add a URL-based dispatch for it in `extractCasinoData()`.
3. Restart the backend.
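
For step 1, a site entry might be validated before it is added, as in the sketch below. The entry shape `{ url, siteName }` is an assumption borrowed from the body of `POST /api/crawler/run`, and `addSite` is a hypothetical helper rather than something the project is known to contain.

```javascript
// Returns a new sites array with the validated entry appended.
function addSite(sites, site) {
  const { url, siteName } = site;

  if (typeof siteName !== 'string' || siteName.trim() === '') {
    throw new TypeError('siteName must be a non-empty string');
  }

  const parsed = new URL(url); // throws a TypeError for malformed URLs
  if (!['http:', 'https:'].includes(parsed.protocol)) {
    throw new TypeError(`Unsupported protocol: ${parsed.protocol}`);
  }

  // Compare on the normalised URL so the same page is not crawled twice per run.
  if (sites.some((existing) => existing.url === parsed.href)) {
    throw new Error(`Site already configured: ${parsed.href}`);
  }

  return [...sites, { url: parsed.href, siteName: siteName.trim() }];
}
```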

## Screenshots

Full-page screenshots are saved as PNG files in `screenshots/` and served statically at `/screenshots/<filename>`. Each crawl writes one file named `<siteName>_<timestamp>.png`. The dashboard viewer loads them through the Vite proxy, which forwards the request to the Express static route.
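
The naming scheme can be illustrated with two small helpers. The timestamp format (epoch milliseconds) and the sanitising of the site name are assumptions; the README only fixes the `<siteName>_<timestamp>.png` pattern and the `/screenshots/` URL prefix.

```javascript
// Builds "<siteName>_<timestamp>.png"; unsafe filename characters become "-".
function screenshotFilename(siteName, date = new Date()) {
  const safeName = siteName.replace(/[^a-zA-Z0-9.-]+/g, '-');
  return `${safeName}_${date.getTime()}.png`; // assumption: epoch milliseconds
}

// Builds the public URL under which the static route serves a screenshot.
function screenshotUrl(filename) {
  return `/screenshots/${encodeURIComponent(filename)}`;
}
```

Sanitising the name keeps spaces or slashes in a site label from producing an invalid path inside `screenshots/`.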

## Production Build

```bash
cd casino-dashboard
npm run build # outputs to dist/
```

The `dist/` folder can be served by any static server, or placed behind Nginx as a reverse proxy alongside the Express API on port 3001. Set `VITE_API_URL=https://yourdomain.com/api` in the environment when running the build (Vite inlines `VITE_*` variables at build time) so the frontend talks to the correct backend.