# Casino Affiliate Crawler
Headless browser crawler that scrapes casino affiliate ranking pages, stores extracted data in PostgreSQL, and provides a React backoffice dashboard for viewing results.
## Architecture
```
crawler/ # Backend (Node.js / Express)
├── src/
│ ├── app.js # Express server entry point
│ ├── setup-db.js # Database initialisation script
│ ├── db.js # PostgreSQL pool config
│ ├── middleware/auth.js # JWT authentication middleware
│ ├── routes/
│ │ ├── auth.js # Login, register, profile endpoints
│ │ └── crawler.js # Crawl data & trigger endpoints
│ └── services/
│ ├── crawler.js # Puppeteer crawl + DOM extraction
│ └── scheduler.js # Periodic crawl job (every hour)
├── screenshots/ # Full-page screenshots per crawl
└── package.json
casino-dashboard/ # Frontend (React / Vite)
├── src/
│ ├── api.js # Axios client + auth helpers
│ ├── App.jsx # Router + AuthProvider wrapper
│ └── components/
│ ├── Login.jsx # Sign-in form with JWT
│ ├── Dashboard.jsx # Crawl history list + run button
│ ├── CrawlDetail.jsx # Casino table, screenshot viewer
│ └── Sidebar.jsx # Navigation shell
└── package.json
```
## Prerequisites
- **Node.js** 18+
- **Google Chrome** installed on the system
- **PostgreSQL** reachable at `192.168.21.197:5432` with user `postgres`
## Quick Start
### 1. Install dependencies
```bash
# Backend
cd crawler
npm install
# Frontend
cd ../casino-dashboard
npm install
```
### 2. Initialise the database
```bash
cd ../crawler
node src/setup-db.js
```
This creates the `casino_crawler` database and tables (`crawls`, `casinos`, `users`). A default admin user is seeded:
| Username | Password |
|----------|----------|
| `admin` | `admin123` |
### 3. Start both servers
```bash
# Terminal 1 – Backend
cd crawler
npm start
# Terminal 2 – Frontend
cd casino-dashboard
npm run dev
```
- **Backend API**: http://localhost:3001
- **Frontend Dashboard**: http://localhost:5173
- The first crawl runs automatically ~5 s after the backend starts, then every hour.
## How It Works
### Crawler (`src/services/crawler.js`)
Uses Puppeteer + `puppeteer-extra-plugin-stealth` to bypass CloudFront bot detection. Each run:
1. Navigates to the target affiliate ranking page
2. Waits for network idle + 5 s buffer for lazy-loaded content
3. Takes a full-page screenshot stored in `screenshots/`
4. Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
5. Inserts records into PostgreSQL
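
Steps 1 to 4 above can be sketched roughly as follows. This is a minimal illustration rather than the actual contents of `src/services/crawler.js`: the function name `crawlPage`, the `networkidle2` wait condition, the result shape, and the screenshot file name are all assumptions made for the example. The browser is passed in so the function works with any Puppeteer-compatible object.

```javascript
const path = require('node:path');

// Pause helper used for the post-load buffer.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Crawl one page with an already-launched Puppeteer browser.
 * `extract` is executed inside the page and should return an array of
 * casino objects (hypothetical shape, not taken from the project).
 */
async function crawlPage(browser, { url, siteName }, extract, opts = {}) {
  const { screenshotDir = 'screenshots', bufferMs = 5000 } = opts;
  const page = await browser.newPage();
  try {
    // Steps 1-2: navigate, wait for the network to settle, then buffer.
    await page.goto(url, { waitUntil: 'networkidle2' });
    await sleep(bufferMs);

    // Step 3: full-page screenshot. This file name is illustrative only
    // and is not the project's naming scheme.
    const fileName = `crawl-${Date.now()}.png`;
    await page.screenshot({
      path: path.join(screenshotDir, fileName),
      fullPage: true,
    });

    // Step 4: run the site-specific extractor in the page context.
    const casinos = await page.evaluate(extract);
    return { url, siteName, screenshotPath: fileName, casinos };
  } finally {
    // Always release the tab, even if navigation or extraction throws.
    await page.close();
  }
}
```

With the stealth plugin, the browser would typically be created via `require('puppeteer-extra')`, `puppeteer.use(require('puppeteer-extra-plugin-stealth')())`, and `puppeteer.launch()`, then handed to `crawlPage`.
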
Two targeted extractors are implemented:
| Site | Selector Strategy |
|------|------------------|
| **top10onlineslots.co.uk** | Finds divs containing "Get Bonus" text + logo `<img>`, pulls bonus from child spans |
| **ubet.co.uk** | Targets `.mainProduct.row-index-N` cards, reads `wss-vendorName-*` for name and `coupon-container` for the offer |
A generic fallback covers any future affiliate site.
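
A hostname-based dispatch with a generic fallback could look like the sketch below. The extractor names (`extractTop10OnlineSlots`, `extractUbet`, `extractGeneric`) and the helper `pickExtractor` are hypothetical and only illustrate the idea; the real method names in `src/services/crawler.js` may differ.

```javascript
// Hypothetical mapping from hostname to extractor method name.
const EXTRACTORS = new Map([
  ['top10onlineslots.co.uk', 'extractTop10OnlineSlots'],
  ['ubet.co.uk', 'extractUbet'],
]);

function pickExtractor(pageUrl) {
  // Strip a leading "www." so both host variants hit the same entry.
  const host = new URL(pageUrl).hostname.replace(/^www\./, '');
  // Unknown sites fall through to the generic extractor.
  return EXTRACTORS.get(host) ?? 'extractGeneric';
}
```

Keeping the mapping in one table means a new site only needs one extra entry plus its extractor.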
### Scheduled Runs
Every hour the scheduler triggers crawls for all configured sites (see `src/services/scheduler.js`). A crawl can also be triggered manually via the run button in the dashboard or a POST to `/api/crawler/run-all`.
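
The timing described above (a first run shortly after startup, then hourly) can be approximated with plain Node.js timers. The sketch below is an assumption about how such a scheduler might be structured, not a copy of `src/services/scheduler.js`; `startScheduler` and its options are hypothetical names, and the overlap guard is an addition for the example.

```javascript
const HOUR_MS = 60 * 60 * 1000;

function startScheduler(runAll, { initialDelayMs = 5000, intervalMs = HOUR_MS } = {}) {
  let running = false;

  const tick = async () => {
    // Skip this tick if the previous crawl is still in progress.
    if (running) return;
    running = true;
    try {
      await runAll();
    } catch (err) {
      // A failed crawl should not stop future runs.
      console.error('Scheduled crawl failed:', err.message);
    } finally {
      running = false;
    }
  };

  const first = setTimeout(tick, initialDelayMs);
  const repeat = setInterval(tick, intervalMs);

  // Return a function that cancels both timers.
  return () => {
    clearTimeout(first);
    clearInterval(repeat);
  };
}
```

Calling the returned function stops the schedule, which is useful for graceful shutdown.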
## Database Schema
### `crawls`
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| url | TEXT | Crawled page URL |
| site_name | VARCHAR(255) | Human-readable site label |
| crawled_at | TIMESTAMP | When the crawl ran |
| status | VARCHAR(50) | `completed` or `failed: ...` |
| screenshot_path | TEXT | Filename in `screenshots/` |
### `casinos`
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| crawl_id | INT FK → crawls.id | Which crawl this casino belongs to |
| position | INT | Rank on the page |
| casino_name | VARCHAR(255) | Casino brand name |
| url | TEXT | Affiliate redirect URL |
| bonus_offer | TEXT | Welcome bonus / free spins text |
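
To show how the `crawls` and `casinos` tables relate, here is a hedged sketch of persisting one crawl and its rows. It assumes a node-postgres style client exposing `query(text, values)`; the function name `saveCrawl`, the camelCase input fields, and the use of a transaction are choices made for this example and are not confirmed by the project.

```javascript
async function saveCrawl(client, crawl, casinos) {
  // Wrap both inserts in a transaction so a crawl never has partial rows.
  await client.query('BEGIN');
  try {
    const { rows } = await client.query(
      `INSERT INTO crawls (url, site_name, crawled_at, status, screenshot_path)
       VALUES ($1, $2, NOW(), $3, $4)
       RETURNING id`,
      [crawl.url, crawl.siteName, crawl.status, crawl.screenshotPath]
    );
    const crawlId = rows[0].id;

    // One row per casino, linked back to the crawl via crawl_id.
    for (const c of casinos) {
      await client.query(
        `INSERT INTO casinos (crawl_id, position, casino_name, url, bonus_offer)
         VALUES ($1, $2, $3, $4, $5)`,
        [crawlId, c.position, c.casinoName, c.url, c.bonusOffer]
      );
    }

    await client.query('COMMIT');
    return crawlId;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

With a `pg` pool, the client would come from `pool.connect()` and be released with `client.release()` afterwards.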
### `users`
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| username | VARCHAR(100) UNIQUE | Login name |
| password_hash | VARCHAR(255) | bcrypt hash |
| role | VARCHAR(50) | Currently always `admin` |
| created_at | TIMESTAMP | Account creation time |
## API Endpoints
All authenticated endpoints require an `Authorization: Bearer <token>` header.
### Auth
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/auth/login` | Login, returns JWT + user object |
| POST | `/api/auth/register` | Create new admin user |
| GET | `/api/auth/me` | Current user profile |
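
A client-side sketch of the login flow follows. The request body fields (`username`, `password`) and the response shape (`{ token, user }`) are assumptions based on the endpoint descriptions above; the helper names are hypothetical. It relies on the global `fetch` available in Node.js 18+, with an optional `fetchImpl` parameter so another implementation can be substituted.

```javascript
// Build the header expected by authenticated endpoints.
function authHeaders(token) {
  return { Authorization: `Bearer ${token}` };
}

async function login(baseUrl, username, password, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/auth/login`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username, password }),
  });
  if (!res.ok) throw new Error(`Login failed with HTTP ${res.status}`);
  // Assumed response shape: { token, user }.
  return res.json();
}

async function getProfile(baseUrl, token, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/auth/me`, {
    headers: authHeaders(token),
  });
  if (!res.ok) throw new Error(`Profile request failed with HTTP ${res.status}`);
  return res.json();
}
```

For example, `const { token } = await login('http://localhost:3001', 'admin', 'admin123')` followed by `await getProfile('http://localhost:3001', token)`.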
### Crawler
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/crawler/all` | All crawls with nested casino arrays |
| GET | `/api/crawler/:id` | Single crawl detail + screenshot path |
| POST | `/api/crawler/run-all` | Trigger immediate crawl of all sites |
| POST | `/api/crawler/run` | Crawl a single custom URL (body: `{url, siteName}`) |
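
Triggering a crawl of a single custom URL might look like the sketch below. It assumes the endpoint requires a bearer token and replies with JSON, neither of which is stated above; `runCrawl` is a hypothetical helper name. Only the `{url, siteName}` body comes from the endpoint table.

```javascript
async function runCrawl(baseUrl, token, { url, siteName }, fetchImpl = fetch) {
  // Fail early on malformed URLs: the URL constructor throws a TypeError.
  new URL(url);

  const res = await fetchImpl(`${baseUrl}/api/crawler/run`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ url, siteName }),
  });
  if (!res.ok) throw new Error(`Crawl request failed with HTTP ${res.status}`);
  // Assumes the API answers with a JSON body.
  return res.json();
}
```

Validating the URL on the client avoids spending a browser launch on input that cannot be navigated to.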
### Health
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/health` | DB connectivity check |
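
A DB connectivity check is commonly implemented by running a trivial query. The handler below is a sketch under that assumption; `makeHealthHandler`, the `SELECT 1` probe, the 503 status, and the response bodies are illustrative and not taken from the project. It expects a node-postgres style `pool` and an Express-style `res` object.

```javascript
function makeHealthHandler(pool) {
  return async (req, res) => {
    try {
      // A cheap round trip that succeeds only if the database is reachable.
      await pool.query('SELECT 1');
      res.json({ status: 'ok' });
    } catch (err) {
      // 503 signals that the service is up but a dependency is unavailable.
      res.status(503).json({ status: 'error', message: err.message });
    }
  };
}
```

In an Express app this would be mounted with `app.get('/api/health', makeHealthHandler(pool))`.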
## Adding New Sites
1. Add the site config object to `src/services/scheduler.js` under `sites[]`.
2. Write a new extractor method in `src/services/crawler.js` and add a URL-based dispatch in `extractCasinoData()`.
3. Restart the backend.
## Screenshots
Full-page screenshots are saved as PNGs in `screenshots/` and served statically at `/screenshots/`. Each crawl writes one file named `_.png`. The dashboard viewer loads them through the Vite proxy → Express static route.
## Production Build
```bash
cd casino-dashboard
npm run build # outputs to dist/
```
The `dist/` folder can be served by any static server, for example Nginx, which can also reverse-proxy the Express API on port 3001. Set `VITE_API_URL=https://yourdomain.com/api` in the environment when running `npm run build` so the frontend talks to the correct backend; Vite inlines `VITE_*` variables at build time.