
2026-06-26 14:30:45 +02:00


Casino Affiliate Crawler

Headless browser crawler that scrapes casino affiliate ranking pages, stores extracted data in PostgreSQL, and provides a React backoffice dashboard for viewing results.

Architecture

crawler/                          # Backend (Node.js / Express)
├── src/
│   ├── app.js                     # Express server entry point
│   ├── setup-db.js                # Database initialisation script
│   ├── db.js                      # PostgreSQL pool config
│   ├── middleware/auth.js         # JWT authentication middleware
│   ├── routes/
│   │   ├── auth.js               # Login, register, profile endpoints
│   │   └── crawler.js            # Crawl data & trigger endpoints
│   └── services/
│       ├── crawler.js             # Puppeteer crawl + DOM extraction
│       └── scheduler.js           # Periodic crawl job (every hour)
├── screenshots/                   # Full-page screenshots per crawl
└── package.json

casino-dashboard/                 # Frontend (React / Vite)
├── src/
│   ├── api.js                     # Axios client + auth helpers
│   ├── App.jsx                    # Router + AuthProvider wrapper
│   └── components/
│       ├── Login.jsx              # Sign-in form with JWT
│       ├── Dashboard.jsx          # Crawl history list + run button
│       ├── CrawlDetail.jsx        # Casino table, screenshot viewer
│       └── Sidebar.jsx            # Navigation shell
└── package.json

Prerequisites

Node.js 18+
Google Chrome installed on the system
PostgreSQL reachable at 192.168.21.197:5432 with user postgres

Quick Start

1. Install dependencies

# Backend
cd crawler
npm install

# Frontend
cd ../casino-dashboard
npm install

2. Initialise the database

cd ../crawler
node src/setup-db.js

This creates the casino_crawler database and tables (crawls, casinos, users). A default admin user is seeded:

Username	Password
`admin`	`admin123`

3. Start both servers

# Terminal 1 – Backend
cd crawler
npm start

# Terminal 2 – Frontend
cd casino-dashboard
npm run dev

Backend API: http://localhost:3001
Frontend Dashboard: http://localhost:5173
The first crawl runs automatically about 5 s after the backend starts, then once every hour.

How It Works

Crawler (`src/services/crawler.js`)

Uses Puppeteer + puppeteer-extra-plugin-stealth to bypass CloudFront bot detection. Each run:

1. Navigates to the target affiliate ranking page
2. Waits for network idle + 5 s buffer for lazy-loaded content
3. Takes a full-page screenshot stored in screenshots/
4. Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
5. Inserts records into PostgreSQL
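
The run described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/services/crawler.js: the function name crawlPage, its options object, and the choice of the networkidle2 wait condition are assumptions. It takes an already-launched Puppeteer browser and leaves the database insert to the caller.

```javascript
const path = require("node:path");

// Hypothetical helper: runs one crawl against an already-launched Puppeteer
// browser. `extract` is a function evaluated in the page context that returns
// an array of casino rows. `bufferMs` is the extra wait for lazy-loaded content.
async function crawlPage(browser, { url, siteName, screenshotDir, extract, bufferMs = 5000 }) {
  const page = await browser.newPage();
  try {
    // Navigate and wait for the network to go idle ("networkidle2" is assumed),
    // then wait a fixed buffer for lazy-loaded content.
    await page.goto(url, { waitUntil: "networkidle2" });
    await new Promise((resolve) => setTimeout(resolve, bufferMs));

    // Full-page screenshot, named <siteName>_<timestamp>.png.
    const file = `${siteName}_${Date.now()}.png`;
    await page.screenshot({ path: path.join(screenshotDir, file), fullPage: true });

    // Site-specific DOM extraction runs inside the browser page.
    const casinos = await page.evaluate(extract);

    // Persisting to PostgreSQL is left to the caller.
    return { url, siteName, screenshotPath: file, casinos };
  } finally {
    await page.close();
  }
}
```

A browser with the stealth plugin enabled would typically be created with require("puppeteer-extra"), a call to puppeteer.use(require("puppeteer-extra-plugin-stealth")()), and then puppeteer.launch(). Passing the browser in keeps the helper easy to exercise with a stub.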

Two targeted extractors are implemented:

Site	Selector Strategy
top10onlineslots.co.uk	Finds divs containing "Get Bonus" text + logo `<img>`, pulls bonus from child spans
ubet.co.uk	Targets `.mainProduct.row-index-N` cards, reads `wss-vendorName-*` for name and `coupon-container` for the offer

A generic fallback covers any future affiliate site.

Scheduled Runs

Every hour the scheduler triggers a crawl of every configured site (see src/services/scheduler.js). A crawl can also be triggered manually with the run button in the dashboard or with a POST request to /api/crawler/run-all.
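
As an illustration of this timing (a first run shortly after startup, then hourly), here is a minimal sketch using plain Node.js timers. The name startScheduler, its options, and the guard against overlapping runs are assumptions; the actual scheduler.js may be built differently.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical scheduler: `runAll` is any async function that crawls every
// configured site. Returns a function that cancels the pending timers.
function startScheduler(runAll, { initialDelayMs = 5000, intervalMs = HOUR_MS } = {}) {
  let running = false;

  const tick = async () => {
    if (running) return; // skip this tick if the previous run has not finished
    running = true;
    try {
      await runAll();
    } catch (err) {
      // A failed crawl should not stop future scheduled runs.
      console.error("Scheduled crawl failed:", err);
    } finally {
      running = false;
    }
  };

  const first = setTimeout(tick, initialDelayMs); // first crawl soon after startup
  const repeat = setInterval(tick, intervalMs);   // then once per interval

  return function stop() {
    clearTimeout(first);
    clearInterval(repeat);
  };
}
```

Calling the returned stop function clears both timers, which is useful for a clean shutdown.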

Database Schema

`crawls`

Column	Type	Description
id	SERIAL PK	Auto-increment
url	TEXT	Crawled page URL
site_name	VARCHAR(255)	Human-readable site label
crawled_at	TIMESTAMP	When the crawl ran
status	VARCHAR(50)	`completed` or `failed: ...`
screenshot_path	TEXT	Filename in `screenshots/`

`casinos`

Column	Type	Description
id	SERIAL PK	Auto-increment
crawl_id	INT FK → crawls.id	Which crawl this casino belongs to
position	INT	Rank on the page
casino_name	VARCHAR(255)	Casino brand name
url	TEXT	Affiliate redirect URL
bonus_offer	TEXT	Welcome bonus / free spins text

`users`

Column	Type	Description
id	SERIAL PK	Auto-increment
username	VARCHAR(100) UNIQUE	Login name
password_hash	VARCHAR(255)	bcrypt hash
role	VARCHAR(50)	Currently always `admin`
created_at	TIMESTAMP	Account creation time
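
For reference, the three tables above translate to roughly the following DDL. This is a sketch reconstructed from the column listings only: the use of IF NOT EXISTS, the createTables helper, and the absence of NOT NULL or default clauses are assumptions, and src/setup-db.js may differ.

```javascript
// DDL reconstructed from the column tables. Constraints not listed in the
// README (NOT NULL, defaults, ON DELETE behaviour) are deliberately omitted.
const SCHEMA_STATEMENTS = [
  `CREATE TABLE IF NOT EXISTS crawls (
     id SERIAL PRIMARY KEY,
     url TEXT,
     site_name VARCHAR(255),
     crawled_at TIMESTAMP,
     status VARCHAR(50),
     screenshot_path TEXT
   )`,
  `CREATE TABLE IF NOT EXISTS casinos (
     id SERIAL PRIMARY KEY,
     crawl_id INT REFERENCES crawls(id),
     position INT,
     casino_name VARCHAR(255),
     url TEXT,
     bonus_offer TEXT
   )`,
  `CREATE TABLE IF NOT EXISTS users (
     id SERIAL PRIMARY KEY,
     username VARCHAR(100) UNIQUE,
     password_hash VARCHAR(255),
     role VARCHAR(50),
     created_at TIMESTAMP
   )`,
];

// Hypothetical helper: `client` is anything with a pg-style `query(text)`
// method that returns a promise, for example a pg Pool or Client.
async function createTables(client) {
  // crawls must exist before casinos because of the foreign key.
  for (const sql of SCHEMA_STATEMENTS) {
    await client.query(sql);
  }
}
```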

API Endpoints

All authenticated endpoints require an Authorization: Bearer <token> header.
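
A minimal sketch of how a server might pull the token out of that header before verifying it. The helper name extractBearerToken is hypothetical and is not taken from src/middleware/auth.js; JWT verification itself is not shown.

```javascript
// Returns the token from an "Authorization: Bearer <token>" header value,
// or null when the header is missing or malformed.
function extractBearerToken(headerValue) {
  if (typeof headerValue !== "string") return null;
  const match = /^Bearer\s+(\S+)$/i.exec(headerValue.trim());
  return match ? match[1] : null;
}
```

In an Express middleware this would typically be called as extractBearerToken(req.headers.authorization), responding with 401 when the result is null.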

Auth

Method	Path	Description
POST	`/api/auth/login`	Login, returns JWT + user object
POST	`/api/auth/register`	Create new admin user
GET	`/api/auth/me`	Current user profile

Crawler

Method	Path	Description
GET	`/api/crawler/all`	All crawls with nested casino arrays
GET	`/api/crawler/:id`	Single crawl detail + screenshot path
POST	`/api/crawler/run-all`	Trigger immediate crawl of all sites
POST	`/api/crawler/run`	Crawl a single custom URL (body: `{url, siteName}`)

Health

Method	Path	Description
GET	`/api/health`	DB connectivity check
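
The endpoints above could be called from a script with a small client like the one below. It is a sketch under assumptions: the login request body fields (username, password) and the token field in the login response are guesses, since the tables only state that login returns a JWT and a user object. It relies on the global fetch available in Node.js 18+.

```javascript
// Hypothetical API client. `fetchImpl` defaults to the global fetch (Node 18+)
// and can be replaced with a stub.
function createApiClient(baseUrl, fetchImpl = fetch) {
  let token = null;

  async function request(method, path, body) {
    const headers = { "Content-Type": "application/json" };
    if (token) headers.Authorization = `Bearer ${token}`;
    const res = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers,
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) {
      throw new Error(`${method} ${path} failed with status ${res.status}`);
    }
    return res.json();
  }

  return {
    async login(username, password) {
      const data = await request("POST", "/api/auth/login", { username, password });
      token = data.token; // assumed field name for the JWT
      return data;
    },
    listCrawls: () => request("GET", "/api/crawler/all"),
    getCrawl: (id) => request("GET", `/api/crawler/${id}`),
    runAll: () => request("POST", "/api/crawler/run-all"),
    runOne: (url, siteName) => request("POST", "/api/crawler/run", { url, siteName }),
    health: () => request("GET", "/api/health"),
  };
}
```

Typical use would be to create a client for http://localhost:3001, await login(...), then await runAll() or listCrawls().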

Adding New Sites

1. Add the site config object to src/services/scheduler.js under sites[].
2. Write a new extractor method in src/services/crawler.js and add a URL-based dispatch in extractCasinoData().
3. Restart the backend.
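
One way to picture the URL-based dispatch is a lookup from hostname to extractor, falling back to the generic one. The sketch below is illustrative only: pickExtractor and the extractor names (extractTop10OnlineSlots, extractUbet, extractGeneric) are hypothetical and may not match the method names in crawler.js.

```javascript
// Hypothetical mapping from site domain to extractor method name.
// Adding a new site means adding one entry here plus the extractor itself.
const extractors = {
  "top10onlineslots.co.uk": "extractTop10OnlineSlots",
  "ubet.co.uk": "extractUbet",
};

// Picks an extractor name for a page URL, matching the domain itself or any
// subdomain of it, and falling back to the generic extractor.
function pickExtractor(pageUrl) {
  const host = new URL(pageUrl).hostname.replace(/^www\./, "");
  for (const [domain, name] of Object.entries(extractors)) {
    if (host === domain || host.endsWith(`.${domain}`)) return name;
  }
  return "extractGeneric";
}
```

Matching on the parsed hostname rather than a substring of the URL avoids false positives when a domain name happens to appear in a path or query string.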

Screenshots

Full-page screenshots are saved as PNGs in screenshots/ and served statically at /screenshots/<filename>. Each crawl writes one file named <siteName>_<timestamp>.png. The dashboard viewer loads them through the Vite proxy → Express static route.
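
The naming scheme and the public path can be expressed as two small helpers. This is a sketch: the function names are hypothetical, the timestamp is assumed to be epoch milliseconds, and replacing unsafe characters in the site name is an added precaution rather than documented behaviour.

```javascript
// Builds <siteName>_<timestamp>.png. Characters other than letters, digits,
// dots, and hyphens are replaced with "_" (an assumption, for filesystem safety).
function screenshotFilename(siteName, timestamp = Date.now()) {
  const safeName = String(siteName).replace(/[^a-zA-Z0-9.-]+/g, "_");
  return `${safeName}_${timestamp}.png`;
}

// Maps a stored filename (the screenshot_path column) to its static URL path.
function screenshotUrl(filename) {
  return `/screenshots/${encodeURIComponent(filename)}`;
}
```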

Production Build

cd casino-dashboard
npm run build   # outputs to dist/

The dist/ folder can be served by any static server or reverse-proxied behind Nginx alongside the Express API on port 3001. Set VITE_API_URL=https://yourdomain.com/api in the environment before running the build so the frontend talks to the correct backend. Vite inlines VITE_-prefixed variables at build time, so changing the value later requires a rebuild.
