Casino Affiliate Crawler
Headless browser crawler that scrapes casino affiliate ranking pages, stores extracted data in PostgreSQL, and provides a React backoffice dashboard for viewing results.
Architecture
crawler/ # Backend (Node.js / Express)
├── src/
│ ├── app.js # Express server entry point
│ ├── setup-db.js # Database initialisation script
│ ├── db.js # PostgreSQL pool config
│ ├── middleware/auth.js # JWT authentication middleware
│ ├── routes/
│ │ ├── auth.js # Login, register, profile endpoints
│ │ └── crawler.js # Crawl data & trigger endpoints
│ └── services/
│ ├── crawler.js # Puppeteer crawl + DOM extraction
│ └── scheduler.js # Periodic crawl job (every hour)
├── screenshots/ # Full-page screenshots per crawl
└── package.json
casino-dashboard/ # Frontend (React / Vite)
├── src/
│ ├── api.js # Axios client + auth helpers
│ ├── App.jsx # Router + AuthProvider wrapper
│ └── components/
│ ├── Login.jsx # Sign-in form with JWT
│ ├── Dashboard.jsx # Crawl history list + run button
│ ├── CrawlDetail.jsx # Casino table, screenshot viewer
│ └── Sidebar.jsx # Navigation shell
└── package.json
Prerequisites
- Node.js 18+
- Google Chrome installed on the system
- PostgreSQL reachable at 192.168.21.197:5432 with user postgres
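For reference, a minimal sketch of what the pool configuration in src/db.js could look like. The host, port, user and database name come from this README; the function name buildPoolConfig and the use of the standard PG* environment variables as overrides are assumptions, not the project's confirmed code.

```javascript
// Sketch only: defaults mirror this README, overrides are an assumption.
function buildPoolConfig(env = process.env) {
  return {
    host: env.PGHOST || '192.168.21.197',
    port: Number(env.PGPORT || 5432),
    user: env.PGUSER || 'postgres',
    password: env.PGPASSWORD, // not documented in this README; supply your own
    database: env.PGDATABASE || 'casino_crawler',
  };
}

// With the `pg` package installed, the config would be used like this:
//   const { Pool } = require('pg');
//   const pool = new Pool(buildPoolConfig());
```

Reading the values from the environment makes it possible to point the backend at a different PostgreSQL host without editing the source.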
Quick Start
1. Install dependencies
# Backend
cd crawler
npm install
# Frontend
cd casino-dashboard
npm install
2. Initialise the database
cd ../
node src/setup-db.js
This creates the casino_crawler database and tables (crawls, casinos, users). A default admin user is seeded:
| Username | Password |
|---|---|
| admin | admin123 |
3. Start both servers
# Terminal 1 – Backend
cd crawler
npm start
# Terminal 2 – Frontend
cd casino-dashboard
npm run dev
- Backend API: http://localhost:3001
- Frontend Dashboard: http://localhost:5173
- First crawl runs automatically ~5 s after backend starts, then every hour.
How It Works
Crawler (src/services/crawler.js)
Uses Puppeteer + puppeteer-extra-plugin-stealth to bypass CloudFront bot detection. Each run:
- Navigates to the target affiliate ranking page
- Waits for network idle + 5 s buffer for lazy-loaded content
- Takes a full-page screenshot stored in screenshots/
- Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
- Inserts records into PostgreSQL
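The steps above can be sketched roughly as follows. This is an illustration, not the actual contents of src/services/crawler.js: the function name crawlOnce, the extract and save callbacks, the networkidle2 wait condition and the 60 s timeout are all assumptions. The Puppeteer calls themselves (newPage, goto, screenshot, close) are standard.

```javascript
const path = require('node:path');

const LAZY_LOAD_BUFFER_MS = 5000; // the 5 s buffer mentioned above
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `browser` is a Puppeteer Browser; `extract` and `save` are hypothetical
// callbacks for the DOM extraction and the PostgreSQL insert.
async function crawlOnce(browser, site, options) {
  const { extract, save, screenshotDir = 'screenshots' } = options;
  const bufferMs = options.bufferMs ?? LAZY_LOAD_BUFFER_MS;
  const page = await browser.newPage();
  try {
    // 1. Navigate and wait for the network to go quiet (wait mode assumed).
    await page.goto(site.url, { waitUntil: 'networkidle2', timeout: 60000 });
    // 2. Extra buffer for lazy-loaded content.
    await sleep(bufferMs);
    // 3. Full-page screenshot named <siteName>_<timestamp>.png.
    const screenshotPath = `${site.siteName}_${Date.now()}.png`;
    await page.screenshot({
      path: path.join(screenshotDir, screenshotPath),
      fullPage: true,
    });
    // 4. Site-specific extraction, then 5. persistence.
    const casinos = await extract(page);
    await save({ ...site, screenshotPath, casinos });
    return { screenshotPath, casinos };
  } finally {
    await page.close(); // always release the tab, even on failure
  }
}

// Launching with the stealth plugin (requires the packages to be installed):
//   const puppeteer = require('puppeteer-extra');
//   const StealthPlugin = require('puppeteer-extra-plugin-stealth');
//   puppeteer.use(StealthPlugin());
//   const browser = await puppeteer.launch({ headless: true });
```

Passing the browser in, rather than launching it inside the function, lets one browser instance be reused across all configured sites in a run.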
Two targeted extractors are implemented:
| Site | Selector Strategy |
|---|---|
| top10onlineslots.co.uk | Finds divs containing "Get Bonus" text + logo <img>, pulls bonus from child spans |
| ubet.co.uk | Targets .mainProduct.row-index-N cards, reads wss-vendorName-* for name and coupon-container for the offer |
A generic fallback covers any future affiliate site.
Scheduled Runs
Every hour the scheduler triggers crawls for all configured sites (see src/services/scheduler.js). A crawl can also be triggered manually, either with the run button in the dashboard or with a POST to /api/crawler/run-all.
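A minimal sketch of how such a scheduler could be wired with plain timers, assuming a first run about 5 s after startup and an hourly repeat as described in this README. The names startScheduler and crawlSite, the overlap guard, and the injectable timers argument are illustrative assumptions; the real scheduler.js may use a different mechanism.

```javascript
const STARTUP_DELAY_MS = 5 * 1000;
const HOUR_MS = 60 * 60 * 1000;

// `crawlSite(site)` is a hypothetical async function that crawls one site.
// `timers` defaults to the global timer functions and can be swapped in tests.
function startScheduler(sites, crawlSite, timers = globalThis) {
  let running = false;

  async function runAll() {
    if (running) return []; // skip if the previous run is still in progress
    running = true;
    const results = [];
    try {
      for (const site of sites) {
        try {
          await crawlSite(site);
          results.push({ siteName: site.siteName, status: 'completed' });
        } catch (err) {
          // One failing site should not stop the remaining sites.
          results.push({ siteName: site.siteName, status: `failed: ${err.message}` });
        }
      }
    } finally {
      running = false;
    }
    return results;
  }

  const first = timers.setTimeout(runAll, STARTUP_DELAY_MS);
  const every = timers.setInterval(runAll, HOUR_MS);

  return {
    runAll, // can also back a manual trigger such as POST /api/crawler/run-all
    stop() {
      timers.clearTimeout(first);
      timers.clearInterval(every);
    },
  };
}
```

Exposing runAll means the manual trigger and the hourly job share one code path, so both get the same error handling.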
Database Schema
crawls
| Column | Type | Description |
|---|---|---|
| id | SERIAL PK | Auto-increment |
| url | TEXT | Crawled page URL |
| site_name | VARCHAR(255) | Human-readable site label |
| crawled_at | TIMESTAMP | When the crawl ran |
| status | VARCHAR(50) | completed or failed: ... |
| screenshot_path | TEXT | Filename in screenshots/ |
casinos
| Column | Type | Description |
|---|---|---|
| id | SERIAL PK | Auto-increment |
| crawl_id | INT FK → crawls.id | Which crawl this casino belongs to |
| position | INT | Rank on the page |
| casino_name | VARCHAR(255) | Casino brand name |
| url | TEXT | Affiliate redirect URL |
| bonus_offer | TEXT | Welcome bonus / free spins text |
users
| Column | Type | Description |
|---|---|---|
| id | SERIAL PK | Auto-increment |
| username | VARCHAR(100) UNIQUE | Login name |
| password_hash | VARCHAR(255) | bcrypt hash |
| role | VARCHAR(50) | Currently always admin |
| created_at | TIMESTAMP | Account creation time |
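The three tables above translate into DDL along these lines. This is a sketch of what src/setup-db.js might execute, reconstructed only from the column tables in this README: the IF NOT EXISTS clauses, the constant name SCHEMA_SQL and the helper createTables are assumptions, and any defaults or NOT NULL constraints in the real script are not shown here.

```javascript
// Column names and types follow the schema tables in this README.
const SCHEMA_SQL = `
CREATE TABLE IF NOT EXISTS crawls (
  id              SERIAL PRIMARY KEY,
  url             TEXT,
  site_name       VARCHAR(255),
  crawled_at      TIMESTAMP,
  status          VARCHAR(50),
  screenshot_path TEXT
);

CREATE TABLE IF NOT EXISTS casinos (
  id          SERIAL PRIMARY KEY,
  crawl_id    INT REFERENCES crawls(id),
  position    INT,
  casino_name VARCHAR(255),
  url         TEXT,
  bonus_offer TEXT
);

CREATE TABLE IF NOT EXISTS users (
  id            SERIAL PRIMARY KEY,
  username      VARCHAR(100) UNIQUE,
  password_hash VARCHAR(255),
  role          VARCHAR(50),
  created_at    TIMESTAMP
);
`;

// `db` is anything exposing query(sql), for example a pg Pool.
async function createTables(db) {
  await db.query(SCHEMA_SQL);
}
```

The crawls table is created first because casinos.crawl_id references it.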
API Endpoints
All authenticated endpoints require an Authorization: Bearer <token> header, where <token> is the JWT returned by the login endpoint.
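As an illustration of how that header can be checked on the server, here is a hedged sketch of an Express-style middleware. It is not the actual src/middleware/auth.js: the names bearerToken and requireAuth, the injected verifyToken function and the error messages are assumptions.

```javascript
// Pulls the token out of an "Authorization: Bearer <token>" header value.
function bearerToken(headerValue) {
  if (typeof headerValue !== 'string') return null;
  const match = /^Bearer\s+(\S+)$/i.exec(headerValue.trim());
  return match ? match[1] : null;
}

// `verifyToken(token)` must return the decoded payload or throw.
function requireAuth(verifyToken) {
  return function authMiddleware(req, res, next) {
    const token = bearerToken(req.headers.authorization);
    if (!token) {
      return res.status(401).json({ error: 'Missing bearer token' });
    }
    try {
      req.user = verifyToken(token); // make the payload available to routes
      return next();
    } catch (err) {
      return res.status(401).json({ error: 'Invalid or expired token' });
    }
  };
}

// With the `jsonwebtoken` package, verifyToken could be (secret name assumed):
//   const jwt = require('jsonwebtoken');
//   const auth = requireAuth((t) => jwt.verify(t, process.env.JWT_SECRET));
//   router.get('/all', auth, handler);
```

Injecting the verifier keeps the middleware independent of the JWT library and easy to exercise with a stub.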
Auth
| Method | Path | Description |
|---|---|---|
| POST | /api/auth/login | Login, returns JWT + user object |
| POST | /api/auth/register | Create new admin user |
| GET | /api/auth/me | Current user profile |
Crawler
| Method | Path | Description |
|---|---|---|
| GET | /api/crawler/all | All crawls with nested casino arrays |
| GET | /api/crawler/:id | Single crawl detail + screenshot path |
| POST | /api/crawler/run-all | Trigger immediate crawl of all sites |
| POST | /api/crawler/run | Crawl a single custom URL (body: {url, siteName}) |
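To show how the single-URL endpoint might be called from a script, here is a small client sketch using the fetch API built into Node.js 18+. The helper names are hypothetical, and two things are assumed rather than documented: that this route requires the bearer token, and that it responds with JSON.

```javascript
// Builds the request for POST /api/crawler/run with body {url, siteName}.
function buildRunRequest(backendBase, token, targetUrl, siteName) {
  return {
    url: `${backendBase.replace(/\/+$/, '')}/api/crawler/run`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token}`, // assumed to be required here
      },
      body: JSON.stringify({ url: targetUrl, siteName }),
    },
  };
}

// Sends the request; `fetchImpl` is injectable so it can be stubbed.
async function triggerCrawl(backendBase, token, targetUrl, siteName, fetchImpl = fetch) {
  const { url, options } = buildRunRequest(backendBase, token, targetUrl, siteName);
  const res = await fetchImpl(url, options);
  if (!res.ok) {
    throw new Error(`Crawl request failed with HTTP ${res.status}`);
  }
  return res.json(); // response shape is not documented in this README
}

// Example (token obtained from POST /api/auth/login):
//   await triggerCrawl('http://localhost:3001', token,
//                      'https://example.com/rankings', 'example');
```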
Health
| Method | Path | Description |
|---|---|---|
| GET | /api/health | DB connectivity check |
Adding New Sites
- Add the site config object to src/services/scheduler.js under sites[].
- Write a new extractor method in src/services/crawler.js and add a URL-based dispatch in extractCasinoData().
- Restart the backend.
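The URL-based dispatch mentioned above could look something like this. It is a sketch under assumptions: only extractCasinoData() and the two hostnames are taken from this README, while the extractor function names, the Map-based lookup and the placeholder bodies are invented for illustration.

```javascript
// Placeholder extractors; real ones would query the DOM of the Puppeteer page.
const extractTop10OnlineSlots = async (page) => [];
const extractUbet = async (page) => [];
const extractGeneric = async (page) => []; // fallback for unknown sites

// Hostname -> extractor. Supporting a new site means adding one entry here.
const extractorsByHost = new Map([
  ['top10onlineslots.co.uk', extractTop10OnlineSlots],
  ['ubet.co.uk', extractUbet],
]);

// Picks the extractor for a page URL, ignoring a leading "www.".
function pickExtractor(pageUrl) {
  const host = new URL(pageUrl).hostname.replace(/^www\./, '');
  return extractorsByHost.get(host) || extractGeneric;
}

async function extractCasinoData(page, pageUrl) {
  return pickExtractor(pageUrl)(page);
}
```

Keying the dispatch on the hostname rather than the full URL means an extractor keeps working if the ranking page moves to a different path on the same site.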
Screenshots
Full-page screenshots are saved as PNGs in screenshots/ and served statically at /screenshots/<filename>. Each crawl writes one file named <siteName>_<timestamp>.png. The dashboard viewer loads them through the Vite proxy → Express static route.
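A short sketch of the naming and URL scheme described above. The README does not specify the timestamp format or how site names are sanitised, so the epoch-millisecond timestamp and the character replacement below are assumptions, as are the helper names.

```javascript
// Builds "<siteName>_<timestamp>.png"; timestamp format is assumed (epoch ms).
function screenshotFilename(siteName, now = new Date()) {
  // Replace anything unsafe for a filename with "-" (sanitising is assumed).
  const safeName = String(siteName).trim().replace(/[^A-Za-z0-9.-]+/g, '-');
  return `${safeName}_${now.getTime()}.png`;
}

// Public path under which Express would serve the file.
function screenshotUrl(filename) {
  return `/screenshots/${encodeURIComponent(filename)}`;
}

// Serving the directory statically from src/app.js could look like:
//   const path = require('node:path');
//   app.use('/screenshots',
//           express.static(path.join(__dirname, '..', 'screenshots')));
```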
Production Build
cd casino-dashboard
npm run build # outputs to dist/
The dist/ folder can be served by any static server or reverse-proxied behind Nginx alongside the Express API on port 3001. Vite inlines VITE_* variables at build time, so set VITE_API_URL=https://yourdomain.com/api in the environment before running npm run build so that the frontend talks to the correct backend.
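On the frontend side, the base URL could be resolved along these lines. This is a sketch, not the actual src/api.js: the helper name resolveApiBase and the /api fallback (which assumes the Vite dev proxy forwards /api to the backend) are assumptions.

```javascript
// Returns the API base URL without a trailing slash.
// Falls back to "/api" when VITE_API_URL is unset (fallback is assumed).
function resolveApiBase(env = {}) {
  const configured = (env.VITE_API_URL || '').trim();
  return (configured || '/api').replace(/\/+$/, '');
}

// Inside the Vite app this would be called with the build-time env object:
//   const baseURL = resolveApiBase(import.meta.env);
//   const client = axios.create({ baseURL });
```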