# Casino Affiliate Crawler
Headless browser crawler that scrapes casino affiliate ranking pages, stores extracted data in PostgreSQL, and provides a React backoffice dashboard for viewing results.
## Architecture
```
crawler/                        # Backend (Node.js / Express)
├── src/
│   ├── app.js                  # Express server entry point
│   ├── setup-db.js             # Database initialisation script
│   ├── db.js                   # PostgreSQL pool config
│   ├── middleware/auth.js      # JWT authentication middleware
│   ├── routes/
│   │   ├── auth.js             # Login, register, profile endpoints
│   │   └── crawler.js          # Crawl data & trigger endpoints
│   └── services/
│       ├── crawler.js          # Puppeteer crawl + DOM extraction
│       └── scheduler.js        # Periodic crawl job (every hour)
├── screenshots/                # Full-page screenshots per crawl
└── package.json
casino-dashboard/               # Frontend (React / Vite)
├── src/
│   ├── api.js                  # Axios client + auth helpers
│   ├── App.jsx                 # Router + AuthProvider wrapper
│   └── components/
│       ├── Login.jsx           # Sign-in form with JWT
│       ├── Dashboard.jsx       # Crawl history list + run button
│       ├── CrawlDetail.jsx     # Casino table, screenshot viewer
│       └── Sidebar.jsx         # Navigation shell
└── package.json
```
## Prerequisites
- **Node.js** 18+
- **Google Chrome** installed on the system
- **PostgreSQL** reachable at `192.168.21.197:5432` with user `postgres`
## Quick Start
### 1. Install dependencies
```bash
# Backend
cd crawler
npm install
# Frontend
cd ../casino-dashboard
npm install
```
### 2. Initialise the database
```bash
cd ../crawler
node src/setup-db.js
```
This creates the `casino_crawler` database and tables (`crawls`, `casinos`, `users`). A default admin user is seeded:
| Username | Password |
|----------|----------|
| `admin` | `admin123` |
### 3. Start both servers
```bash
# Terminal 1: backend
cd crawler
npm start
# Terminal 2: frontend
cd casino-dashboard
npm run dev
```
- **Backend API**: http://localhost:3001
- **Frontend Dashboard**: http://localhost:5173
- The first crawl runs automatically about 5 s after the backend starts, then every hour.
## How It Works
### Crawler (`src/services/crawler.js`)
Uses Puppeteer + `puppeteer-extra-plugin-stealth` to bypass CloudFront bot detection. Each run:
1. Navigates to the target affiliate ranking page
2. Waits for network idle + 5 s buffer for lazy-loaded content
3. Takes a full-page screenshot stored in `screenshots/`
4. Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
5. Inserts records into PostgreSQL
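
The five steps above can be sketched roughly as follows. This is an illustration rather than the code in `src/services/crawler.js`: the function name `crawlOnce`, the `options` object, the `networkidle2` wait condition, and the `Date.now()` timestamp are all assumptions, and the database insert is left to the caller. The function only relies on standard Puppeteer page methods (`goto`, `screenshot`, `evaluate`, `close`).

```javascript
const path = require('node:path');

// Pause helper used for the post-load buffer.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Runs one crawl against an already-launched, Puppeteer-compatible browser.
 * `extract` runs inside the page and is expected to return rows shaped like
 * { position, casino_name, url, bonus_offer }.
 */
async function crawlOnce(browser, site, extract, options = {}) {
  const { screenshotDir = 'screenshots', bufferMs = 5000 } = options;
  const page = await browser.newPage();
  try {
    // Steps 1-2: load the page, wait for the network to settle, then pause.
    await page.goto(site.url, { waitUntil: 'networkidle2' });
    await sleep(bufferMs);

    // Step 3: full-page screenshot (file name format is an assumption).
    const fileName = `${site.siteName}_${Date.now()}.png`;
    await page.screenshot({ path: path.join(screenshotDir, fileName), fullPage: true });

    // Step 4: run the site-specific extractor in the page context.
    const casinos = await page.evaluate(extract);

    // Step 5 (inserting into PostgreSQL) is left to the caller.
    return { url: site.url, siteName: site.siteName, screenshot: fileName, status: 'completed', casinos };
  } finally {
    await page.close();
  }
}

// Possible wiring with the stealth plugin (not executed here):
//   const puppeteer = require('puppeteer-extra');
//   puppeteer.use(require('puppeteer-extra-plugin-stealth')());
//   const browser = await puppeteer.launch({ headless: true });
```

Passing the browser in, instead of launching it inside the function, keeps the sketch testable with a stub and lets one browser instance be reused across sites.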
Two targeted extractors are implemented:
| Site | Selector Strategy |
|------|------------------|
| **top10onlineslots.co.uk** | Finds divs containing "Get Bonus" text + logo `<img>`, pulls bonus from child spans |
| **ubet.co.uk** | Targets `.mainProduct.row-index-N` cards, reads `wss-vendorName-*` for name and `coupon-container` for the offer |
A generic fallback covers any future affiliate site.
### Scheduled Runs
Every hour the scheduler triggers crawls for all configured sites (see `src/services/scheduler.js`). A crawl can also be triggered manually with the run button in the dashboard or with a POST to `/api/crawler/run-all`.
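
One way to get the "first run after about 5 s, then hourly" behaviour is a plain timer pair. The sketch below is an assumption about how such a scheduler could look, not a copy of `scheduler.js` (which may use a cron library instead). `startScheduler`, `runAll`, and the injectable `timers` object are hypothetical names.

```javascript
const HOUR_MS = 60 * 60 * 1000;

/**
 * Schedules `runAll` once after `initialDelayMs` and then every `intervalMs`.
 * Errors are logged instead of thrown so one failed crawl does not stop
 * later runs. Returns a function that cancels both timers.
 */
function startScheduler(runAll, options = {}) {
  const {
    initialDelayMs = 5000,
    intervalMs = HOUR_MS,
    timers = { setTimeout, setInterval, clearTimeout, clearInterval },
  } = options;

  const safeRun = () =>
    Promise.resolve()
      .then(runAll)
      .catch((err) => console.error('Scheduled crawl failed:', err));

  const first = timers.setTimeout(safeRun, initialDelayMs);
  const repeat = timers.setInterval(safeRun, intervalMs);

  return function stop() {
    timers.clearTimeout(first);
    timers.clearInterval(repeat);
  };
}
```

The `timers` parameter exists only so the schedule can be checked without waiting an hour; in normal use it can be omitted.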
## Database Schema
### `crawls`
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| url | TEXT | Crawled page URL |
| site_name | VARCHAR(255) | Human-readable site label |
| crawled_at | TIMESTAMP | When the crawl ran |
| status | VARCHAR(50) | `completed` or `failed: ...` |
| screenshot_path | TEXT | Filename in `screenshots/` |
### `casinos`
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| crawl_id | INT FK → crawls.id | Which crawl this casino belongs to |
| position | INT | Rank on the page |
| casino_name | VARCHAR(255) | Casino brand name |
| url | TEXT | Affiliate redirect URL |
| bonus_offer | TEXT | Welcome bonus / free spins text |
### `users`
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| username | VARCHAR(100) UNIQUE | Login name |
| password_hash | VARCHAR(255) | bcrypt hash |
| role | VARCHAR(50) | Currently always `admin` |
| created_at | TIMESTAMP | Account creation time |
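
For reference, the three tables above translate into DDL along these lines. This is a sketch derived only from the column tables, not the contents of `setup-db.js`: the `IF NOT EXISTS` clauses and the array name `schemaStatements` are assumptions, and no defaults or `NOT NULL` constraints are shown because the tables do not list any.

```javascript
// DDL statements reconstructed from the schema tables, in dependency order
// (casinos references crawls, so crawls must be created first).
const schemaStatements = [
  `CREATE TABLE IF NOT EXISTS crawls (
     id SERIAL PRIMARY KEY,
     url TEXT,
     site_name VARCHAR(255),
     crawled_at TIMESTAMP,
     status VARCHAR(50),
     screenshot_path TEXT
   )`,
  `CREATE TABLE IF NOT EXISTS casinos (
     id SERIAL PRIMARY KEY,
     crawl_id INT REFERENCES crawls(id),
     position INT,
     casino_name VARCHAR(255),
     url TEXT,
     bonus_offer TEXT
   )`,
  `CREATE TABLE IF NOT EXISTS users (
     id SERIAL PRIMARY KEY,
     username VARCHAR(100) UNIQUE,
     password_hash VARCHAR(255),
     role VARCHAR(50),
     created_at TIMESTAMP
   )`,
];

// With a node-postgres pool, the statements could be applied in order:
//   for (const sql of schemaStatements) await pool.query(sql);
```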
## API Endpoints
All authenticated endpoints require an `Authorization: Bearer <token>` header.
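
On the server side, checking that header usually comes down to parsing the token and handing it to a verifier. The following is a minimal sketch of that idea, not the code in `middleware/auth.js`. `parseBearerToken` and `requireAuth` are hypothetical names, the error messages are invented, and `verify` stands in for whatever JWT check the project uses (for example `jwt.verify(token, secret)` from the `jsonwebtoken` package, which is an assumption).

```javascript
// Returns the token from an "Authorization: Bearer <token>" header value,
// or null when the header is missing or uses another scheme.
function parseBearerToken(headerValue) {
  if (typeof headerValue !== 'string') return null;
  const match = /^Bearer\s+(\S+)$/i.exec(headerValue.trim());
  return match ? match[1] : null;
}

// Builds an Express-style middleware. `verify(token)` must return the decoded
// user payload or throw when the token is invalid or expired.
function requireAuth(verify) {
  return function authMiddleware(req, res, next) {
    const token = parseBearerToken(req.headers.authorization);
    if (!token) {
      return res.status(401).json({ error: 'Missing bearer token' });
    }
    try {
      req.user = verify(token);
      return next();
    } catch {
      return res.status(401).json({ error: 'Invalid or expired token' });
    }
  };
}
```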
### Auth
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/auth/login` | Login, returns JWT + user object |
| POST | `/api/auth/register` | Create new admin user |
| GET | `/api/auth/me` | Current user profile |
### Crawler
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/crawler/all` | All crawls with nested casino arrays |
| GET | `/api/crawler/:id` | Single crawl detail + screenshot path |
| POST | `/api/crawler/run-all` | Trigger immediate crawl of all sites |
| POST | `/api/crawler/run` | Crawl a single custom URL (body: `{url, siteName}`) |
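
A small client helper shows how the two trigger endpoints might be called from a script. It is a sketch under a few assumptions: that these routes require the bearer token, that responses are JSON, and that Node 18+ provides a global `fetch`. The names `apiRequest`, `runAllCrawls`, and `runSingleCrawl` are hypothetical.

```javascript
// Sends a JSON request to the API and returns the parsed JSON response.
// `fetchImpl` can be swapped for a stub; it defaults to the global fetch.
async function apiRequest(baseUrl, path, options = {}) {
  const { method = 'GET', token, body, fetchImpl = fetch } = options;
  const headers = { 'Content-Type': 'application/json' };
  if (token) headers.Authorization = `Bearer ${token}`;

  const response = await fetchImpl(`${baseUrl}${path}`, {
    method,
    headers,
    body: body === undefined ? undefined : JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`${method} ${path} failed with HTTP ${response.status}`);
  }
  return response.json();
}

// Trigger an immediate crawl of all configured sites.
const runAllCrawls = (baseUrl, token, options) =>
  apiRequest(baseUrl, '/api/crawler/run-all', { ...options, method: 'POST', token });

// Crawl one custom URL; the body shape {url, siteName} comes from the table above.
const runSingleCrawl = (baseUrl, token, url, siteName, options) =>
  apiRequest(baseUrl, '/api/crawler/run', {
    ...options,
    method: 'POST',
    token,
    body: { url, siteName },
  });
```

For example, `runSingleCrawl('http://localhost:3001', token, 'https://example.com/ranking', 'example')` would post a single-site crawl request to a locally running backend.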
### Health
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/health` | DB connectivity check |
## Adding New Sites
1. Add the site config object to `src/services/scheduler.js` under `sites[]`.
2. Write a new extractor method in `src/services/crawler.js` and add a URL-based dispatch in `extractCasinoData()`.
3. Restart the backend.
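
The URL-based dispatch in step 2 could be organised as a hostname lookup with a generic fallback. This is only a sketch of the pattern: `registerExtractor`, `pickExtractor`, and `genericExtractor` are hypothetical names, and the real `extractCasinoData()` may branch differently.

```javascript
// Registry of site-specific extractors, keyed by hostname without "www.".
const extractorsByHost = new Map();

function registerExtractor(hostname, extractor) {
  extractorsByHost.set(hostname, extractor);
}

// Placeholder for the generic fallback; a real one would scan the DOM
// heuristically. It returns no rows here.
function genericExtractor() {
  return [];
}

// Picks the extractor for a page URL, falling back to the generic one
// when no site-specific extractor is registered for that host.
function pickExtractor(pageUrl) {
  const host = new URL(pageUrl).hostname.replace(/^www\./, '');
  return extractorsByHost.get(host) || genericExtractor;
}
```

With this layout, supporting a new site means one `registerExtractor('newsite.example', extractNewSite)` call plus the extractor function itself.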
## Screenshots
Full-page screenshots are saved as PNGs in `screenshots/` and served statically at `/screenshots/<filename>`. Each crawl writes one file named `<siteName>_<timestamp>.png`. The dashboard viewer loads them through the Vite proxy → Express static route.
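
The naming scheme and the static URL can be illustrated with two small helpers. The exact timestamp format is not stated above, so the millisecond epoch used here is an assumption, as are the character sanitisation and the helper names `screenshotFileName` and `screenshotUrl`.

```javascript
// Builds "<siteName>_<timestamp>.png". Characters that are awkward in file
// names are replaced with underscores (an assumption, not documented behaviour).
function screenshotFileName(siteName, date = new Date()) {
  const safeName = siteName.replace(/[^a-zA-Z0-9.-]+/g, '_');
  return `${safeName}_${date.getTime()}.png`;
}

// Maps a stored file name to the path served by the static route.
function screenshotUrl(fileName) {
  return `/screenshots/${encodeURIComponent(fileName)}`;
}

// On the Express side, a static route of this kind is typically mounted with:
//   app.use('/screenshots', express.static(path.join(__dirname, '..', 'screenshots')));
```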
## Production Build
```bash
cd casino-dashboard
npm run build # outputs to dist/
```
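The `dist/` folder can be served by any static server, or placed behind Nginx as a reverse proxy alongside the Express API on port 3001. Set `VITE_API_URL=https://yourdomain.com/api` in the environment when running the build so the frontend talks to the correct backend; Vite inlines `VITE_*` variables at build time, so changing the value afterwards requires a rebuild.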