# Casino Affiliate Crawler

A headless browser crawler that scrapes casino affiliate ranking pages, stores the extracted data in PostgreSQL, and provides a React backoffice dashboard for viewing the results.

## Architecture

```
crawler/                       # Backend (Node.js / Express)
├── src/
│   ├── app.js                 # Express server entry point
│   ├── setup-db.js            # Database initialisation script
│   ├── db.js                  # PostgreSQL pool config
│   ├── middleware/auth.js     # JWT authentication middleware
│   ├── routes/
│   │   ├── auth.js            # Login, register, profile endpoints
│   │   └── crawler.js         # Crawl data & trigger endpoints
│   └── services/
│       ├── crawler.js         # Puppeteer crawl + DOM extraction
│       └── scheduler.js       # Periodic crawl job (every hour)
├── screenshots/               # Full-page screenshots per crawl
└── package.json

casino-dashboard/              # Frontend (React / Vite)
├── src/
│   ├── api.js                 # Axios client + auth helpers
│   ├── App.jsx                # Router + AuthProvider wrapper
│   └── components/
│       ├── Login.jsx          # Sign-in form with JWT
│       ├── Dashboard.jsx      # Crawl history list + run button
│       ├── CrawlDetail.jsx    # Casino table, screenshot viewer
│       └── Sidebar.jsx        # Navigation shell
└── package.json
```

## Prerequisites

- **Node.js** 18+
- **Google Chrome** installed on the system
- **PostgreSQL** reachable at `192.168.21.197:5432` with user `postgres`

## Quick Start

### 1. Install dependencies

```bash
# Backend
cd crawler
npm install

# Frontend
cd ../casino-dashboard
npm install
```

### 2. Initialise the database

```bash
cd ../crawler
node src/setup-db.js
```

This creates the `casino_crawler` database and its tables (`crawls`, `casinos`, `users`). A default admin user is seeded:

| Username | Password |
|----------|----------|
| `admin` | `admin123` |

### 3. Start both servers

Run each server in its own terminal, starting from the repository root:

```bash
# Terminal 1 – Backend
cd crawler
npm start

# Terminal 2 – Frontend
cd casino-dashboard
npm run dev
```

- **Backend API**: http://localhost:3001
- **Frontend Dashboard**: http://localhost:5173
- The first crawl runs automatically about 5 s after the backend starts, then every hour.

## How It Works

### Crawler (`src/services/crawler.js`)

Uses Puppeteer + `puppeteer-extra-plugin-stealth` to bypass CloudFront bot detection. Each run:

1. Navigates to the target affiliate ranking page
2. Waits for network idle + 5 s buffer for lazy-loaded content
3. Takes a full-page screenshot stored in `screenshots/`
4. Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
5. Inserts records into PostgreSQL

Two targeted extractors are implemented:

| Site | Selector Strategy |
|------|------------------|
| **top10onlineslots.co.uk** | Finds divs containing "Get Bonus" text + logo `<img>`, pulls bonus from child spans |
| **ubet.co.uk** | Targets `.mainProduct.row-index-N` cards, reads `wss-vendorName-*` for name and `coupon-container` for the offer |

A generic fallback covers any future affiliate site.
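
The five steps of a crawl run can be sketched roughly as follows. This is a minimal illustration, not the code in `src/services/crawler.js`: the function name `crawlPage`, its option names, and the `networkidle2` wait condition are assumptions. The browser is passed in so that the caller can supply a Puppeteer instance with the stealth plugin already applied.

```javascript
const path = require('node:path');

// Hypothetical helper, not the project's actual implementation.
// `browser` is a Puppeteer Browser; `extract` is a function evaluated inside
// the page that returns an array of casino rows.
async function crawlPage(browser, { url, siteName, screenshotDir, extract, settleMs = 5000 }) {
  const page = await browser.newPage();
  try {
    // Steps 1-2: navigate, wait for the network to go quiet, then give
    // lazy-loaded content a few extra seconds.
    await page.goto(url, { waitUntil: 'networkidle2' });
    await new Promise((resolve) => setTimeout(resolve, settleMs));

    // Step 3: full-page screenshot named <siteName>_<timestamp>.png.
    const screenshot = `${siteName}_${Date.now()}.png`;
    await page.screenshot({ path: path.join(screenshotDir, screenshot), fullPage: true });

    // Step 4: run the site-specific extractor in the page context.
    const casinos = await page.evaluate(extract);

    // Step 5 is left to the caller, which persists the result in PostgreSQL.
    return { url, siteName, screenshot, casinos };
  } finally {
    // Always release the tab, even when navigation or extraction fails.
    await page.close();
  }
}
```

Closing the page in `finally` keeps a failed crawl from leaking browser tabs, which matters for a job that runs every hour.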

### Scheduled Runs

Every hour the scheduler triggers crawls for all configured sites (see `src/services/scheduler.js`). A crawl can also be triggered manually via the button in the dashboard or with a POST to `/api/crawler/run-all`.
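
A timer-based version of this schedule might look like the sketch below. It is an assumption about the approach, not a copy of `src/services/scheduler.js` (which could equally use a cron library); `startScheduler` and its option names are hypothetical.

```javascript
// Hypothetical sketch: one delayed first run, then a fixed hourly interval.
function startScheduler(runAll, { initialDelayMs = 5_000, intervalMs = 60 * 60 * 1_000 } = {}) {
  // Log failures instead of throwing so one bad crawl does not stop the schedule.
  const safeRun = () =>
    Promise.resolve()
      .then(() => runAll())
      .catch((err) => console.error('Scheduled crawl failed:', err));

  const first = setTimeout(safeRun, initialDelayMs); // first crawl shortly after startup
  const repeat = setInterval(safeRun, intervalMs);   // then once per interval

  // The returned function cancels both timers (useful for tests and shutdown).
  return function stop() {
    clearTimeout(first);
    clearInterval(repeat);
  };
}
```

The backend would call this once after the server starts listening, passing whatever function crawls all configured sites.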

## Database Schema

### `crawls`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| url | TEXT | Crawled page URL |
| site_name | VARCHAR(255) | Human-readable site label |
| crawled_at | TIMESTAMP | When the crawl ran |
| status | VARCHAR(50) | `completed` or `failed: ...` |
| screenshot_path | TEXT | Filename in `screenshots/` |

### `casinos`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| crawl_id | INT FK → crawls.id | Which crawl this casino belongs to |
| position | INT | Rank on the page |
| casino_name | VARCHAR(255) | Casino brand name |
| url | TEXT | Affiliate redirect URL |
| bonus_offer | TEXT | Welcome bonus / free spins text |
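
To show how extracted rows map onto the `casinos` columns, here is a small sketch that builds one parameterised multi-row `INSERT`. The helper name `buildCasinoInsert` and the camelCase input fields (`casinoName`, `bonusOffer`) are assumptions for illustration; the project may insert rows one at a time instead.

```javascript
// Hypothetical helper: builds a single parameterised INSERT for the `casinos` table.
function buildCasinoInsert(crawlId, casinos) {
  if (casinos.length === 0) return null; // nothing to insert

  const columns = ['crawl_id', 'position', 'casino_name', 'url', 'bonus_offer'];
  const values = [];

  const rows = casinos.map((casino, rowIndex) => {
    // A missing bonus is stored as NULL rather than as the string "undefined".
    values.push(crawlId, casino.position, casino.casinoName, casino.url, casino.bonusOffer ?? null);
    // Placeholders are numbered continuously across rows: $1..$5, $6..$10, ...
    const base = rowIndex * columns.length;
    return `(${columns.map((_, i) => `$${base + i + 1}`).join(', ')})`;
  });

  return {
    text: `INSERT INTO casinos (${columns.join(', ')}) VALUES ${rows.join(', ')}`,
    values,
  };
}
```

With node-postgres, the result could be executed as `pool.query(query.text, query.values)`. Using placeholders keeps scraped text, which is untrusted input, out of the SQL string.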

### `users`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| username | VARCHAR(100) UNIQUE | Login name |
| password_hash | VARCHAR(255) | bcrypt hash |
| role | VARCHAR(50) | Currently always `admin` |
| created_at | TIMESTAMP | Account creation time |

## API Endpoints

All authenticated endpoints require an `Authorization: Bearer <token>` header.

### Auth

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/auth/login` | Login, returns JWT + user object |
| POST | `/api/auth/register` | Create new admin user |
| GET | `/api/auth/me` | Current user profile |

### Crawler

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/crawler/all` | All crawls with nested casino arrays |
| GET | `/api/crawler/:id` | Single crawl detail + screenshot path |
| POST | `/api/crawler/run-all` | Trigger immediate crawl of all sites |
| POST | `/api/crawler/run` | Crawl a single custom URL (body: `{url, siteName}`) |
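
The sketch below shows one way a script could call these endpoints with the `fetch` built into Node.js 18+. Several details are assumptions not stated above: the login body is taken to be `{ username, password }`, the JWT is taken to arrive in a `token` field, and the crawler routes are assumed to require authentication.

```javascript
const API_BASE = 'http://localhost:3001/api';

// Headers for authenticated JSON requests.
function authHeaders(token) {
  return { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` };
}

// Assumed request/response shape: { username, password } in, { token, user } out.
async function login(username, password) {
  const res = await fetch(`${API_BASE}/auth/login`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username, password }),
  });
  if (!res.ok) throw new Error(`Login failed with HTTP ${res.status}`);
  const { token } = await res.json();
  return token;
}

// GET /api/crawler/all: every crawl with its nested casino array.
async function listCrawls(token) {
  const res = await fetch(`${API_BASE}/crawler/all`, { headers: authHeaders(token) });
  if (!res.ok) throw new Error(`Listing crawls failed with HTTP ${res.status}`);
  return res.json();
}

// POST /api/crawler/run: crawl a single custom URL.
async function runCustomCrawl(token, url, siteName) {
  const res = await fetch(`${API_BASE}/crawler/run`, {
    method: 'POST',
    headers: authHeaders(token),
    body: JSON.stringify({ url, siteName }),
  });
  if (!res.ok) throw new Error(`Crawl request failed with HTTP ${res.status}`);
  return res.json();
}
```

A typical session would log in once, keep the token in memory, and pass it to the other helpers.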

### Health

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/health` | DB connectivity check |

## Adding New Sites

1. Add the site config object to `src/services/scheduler.js` under `sites[]`.
2. Write a new extractor method in `src/services/crawler.js` and add a URL-based dispatch in `extractCasinoData()`.
3. Restart the backend.
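
The URL-based dispatch in step 2 could be organised as a lookup table, as in this sketch. The table `EXTRACTORS`, the function `pickExtractor`, and the extractor names are hypothetical; the actual `extractCasinoData()` may use a plain `if`/`else` chain.

```javascript
// Hypothetical registry: one entry per supported site.
const EXTRACTORS = [
  { host: 'top10onlineslots.co.uk', name: 'top10OnlineSlots' },
  { host: 'ubet.co.uk', name: 'ubet' },
  // A new site gets one more entry here.
];

// Returns the extractor name for a page URL, or 'generic' when no site matches.
function pickExtractor(pageUrl) {
  const hostname = new URL(pageUrl).hostname.replace(/^www\./, '');
  const match = EXTRACTORS.find(
    // Match the exact host or any subdomain of it, but not look-alike domains.
    ({ host }) => hostname === host || hostname.endsWith(`.${host}`)
  );
  return match ? match.name : 'generic';
}
```

Matching on the parsed hostname rather than on a substring of the URL avoids accidental hits when a site name appears in a path or query string.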

## Screenshots

Full-page screenshots are saved as PNGs in `screenshots/` and served statically at `/screenshots/<filename>`. Each crawl writes one file named `<siteName>_<timestamp>.png`. The dashboard viewer loads them through the Vite proxy → Express static route.
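
As an illustration of the naming scheme, the helpers below build a file name and the matching public path. The timestamp format (milliseconds since the epoch) and the sanitising of `siteName` are assumptions; the project may format either differently.

```javascript
// Hypothetical helper: <siteName>_<timestamp>.png
function screenshotFileName(siteName, timestamp = Date.now()) {
  // Replace characters that are awkward in file names or URLs.
  // This sanitising step is an assumption, not documented behaviour.
  const safeName = siteName.trim().replace(/[^A-Za-z0-9.-]+/g, '-');
  return `${safeName}_${timestamp}.png`;
}

// Path under which Express serves the file statically.
function screenshotUrl(fileName) {
  return `/screenshots/${encodeURIComponent(fileName)}`;
}
```

The value stored in `crawls.screenshot_path` is the bare file name, so the dashboard only needs to prefix it with `/screenshots/`.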

## Production Build

```bash
cd casino-dashboard
npm run build   # outputs to dist/
```

The `dist/` folder can be served by any static server, or by Nginx alongside a reverse proxy to the Express API on port 3001. Set `VITE_API_URL=https://yourdomain.com/api` in the environment when running the build so the frontend talks to the correct backend; Vite inlines `VITE_*` variables at build time.
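
A frontend API client might resolve its base URL along these lines. This is a sketch: `apiBaseUrl` is a hypothetical helper, and the `/api` fallback assumes the relative path is proxied to Express in development. In Vite code the argument would be `import.meta.env`.

```javascript
// `env` is a parameter so the helper can run outside Vite; pass import.meta.env in the app.
function apiBaseUrl(env) {
  const configured = (env.VITE_API_URL || '').trim();
  // Fall back to a relative path that a dev proxy or Nginx can forward to Express.
  // Trailing slashes are removed so callers can append "/crawler/all" safely.
  return (configured || '/api').replace(/\/+$/, '');
}
```

The result could then be passed as `baseURL` when creating the Axios client in `src/api.js`.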