2026-06-26 14:30:45 +02:00

Casino Affiliate Crawler

Headless browser crawler that scrapes casino affiliate ranking pages, stores extracted data in PostgreSQL, and provides a React backoffice dashboard for viewing results.

Architecture

crawler/                          # Backend (Node.js / Express)
├── src/
│   ├── app.js                     # Express server entry point
│   ├── setup-db.js                # Database initialisation script
│   ├── db.js                      # PostgreSQL pool config
│   ├── middleware/auth.js         # JWT authentication middleware
│   ├── routes/
│   │   ├── auth.js               # Login, register, profile endpoints
│   │   └── crawler.js            # Crawl data & trigger endpoints
│   └── services/
│       ├── crawler.js             # Puppeteer crawl + DOM extraction
│       └── scheduler.js           # Periodic crawl job (every hour)
├── screenshots/                   # Full-page screenshots per crawl
└── package.json

casino-dashboard/                 # Frontend (React / Vite)
├── src/
│   ├── api.js                     # Axios client + auth helpers
│   ├── App.jsx                    # Router + AuthProvider wrapper
│   └── components/
│       ├── Login.jsx              # Sign-in form with JWT
│       ├── Dashboard.jsx          # Crawl history list + run button
│       ├── CrawlDetail.jsx        # Casino table, screenshot viewer
│       └── Sidebar.jsx            # Navigation shell
└── package.json

Prerequisites

  • Node.js 18+
  • Google Chrome installed on the system
  • PostgreSQL reachable at 192.168.21.197:5432 with user postgres

Quick Start

1. Install dependencies

# Backend
cd crawler
npm install

# Frontend (casino-dashboard/ is a sibling of crawler/)
cd ../casino-dashboard
npm install

2. Initialise the database

cd ../crawler
node src/setup-db.js

This creates the casino_crawler database and tables (crawls, casinos, users). A default admin user is seeded:

| Username | Password |
|----------|----------|
| admin    | admin123 |

3. Start both servers

# Terminal 1: backend
cd crawler
npm start

# Terminal 2: frontend
cd casino-dashboard
npm run dev

How It Works

Crawler (src/services/crawler.js)

Uses Puppeteer + puppeteer-extra-plugin-stealth to bypass CloudFront bot detection. Each run:

  1. Navigates to the target affiliate ranking page
  2. Waits for network idle + 5 s buffer for lazy-loaded content
  3. Takes a full-page screenshot stored in screenshots/
  4. Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
  5. Inserts records into PostgreSQL
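
The five steps above can be sketched roughly as follows. This is an illustration, not the project's actual code: `crawlSite`, the `extract` callback, the `settleMs` option, the `networkidle2` wait condition, and the exact INSERT statements are assumptions. The browser and database client are passed in so the flow can be exercised with stand-ins.

```javascript
// Sketch of one crawl run (hypothetical helper, not the real service code).
// `browser` is a Puppeteer Browser; `db` is a pg-style client exposing
// query(text, params); `extract` is a function evaluated inside the page.
async function crawlSite(browser, db, site, { extract, settleMs = 5000 } = {}) {
  const page = await browser.newPage();
  try {
    // Steps 1-2: navigate, wait for network idle, then pause for lazy content
    await page.goto(site.url, { waitUntil: 'networkidle2' });
    await new Promise((resolve) => setTimeout(resolve, settleMs));

    // Step 3: full-page screenshot (file name format is an assumption)
    const screenshot = `${site.siteName}_${Date.now()}.png`;
    await page.screenshot({ path: `screenshots/${screenshot}`, fullPage: true });

    // Step 4: run the site-specific extractor in the page context
    const casinos = await page.evaluate(extract);

    // Step 5: store the crawl row, then one row per casino
    const { rows } = await db.query(
      'INSERT INTO crawls (url, site_name, crawled_at, status, screenshot_path) ' +
        'VALUES ($1, $2, NOW(), $3, $4) RETURNING id',
      [site.url, site.siteName, 'completed', screenshot]
    );
    const crawlId = rows[0].id;
    for (const c of casinos) {
      await db.query(
        'INSERT INTO casinos (crawl_id, position, casino_name, url, bonus_offer) ' +
          'VALUES ($1, $2, $3, $4, $5)',
        [crawlId, c.position, c.name, c.url, c.bonus]
      );
    }
    return { crawlId, count: casinos.length };
  } finally {
    // Always release the tab, even when a step throws
    await page.close();
  }
}
```

Closing the page in `finally` keeps a failed crawl from leaking browser tabs across hourly runs.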

Two targeted extractors are implemented:

| Site | Selector Strategy |
|------|-------------------|
| top10onlineslots.co.uk | Finds divs containing "Get Bonus" text + logo <img>, pulls bonus from child spans |
| ubet.co.uk | Targets .mainProduct.row-index-N cards, reads wss-vendorName-* for name and coupon-container for the offer |

A generic fallback extractor handles any site that has no dedicated extractor.

Scheduled Runs

Every hour the scheduler triggers a crawl of every configured site (see src/services/scheduler.js). A crawl can also be triggered manually, either with the run button in the dashboard or with a POST to /api/crawler/run-all.
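
A minimal sketch of such an hourly loop, assuming a plain `setInterval` timer (the real scheduler.js may use a cron library instead). `runAll`, `startScheduler`, and the choice to crawl sites sequentially and continue after a failure are illustrative assumptions.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Crawl every site in order. A failing site is recorded and skipped so that
// it cannot block the remaining sites (assumed behaviour).
async function runAll(sites, crawl) {
  const results = [];
  for (const site of sites) {
    try {
      results.push({ site: site.siteName, ok: true, value: await crawl(site) });
    } catch (err) {
      results.push({ site: site.siteName, ok: false, error: err.message });
    }
  }
  return results;
}

// Start the periodic job and return a function that stops it.
function startScheduler(sites, crawl, intervalMs = HOUR_MS) {
  const timer = setInterval(() => {
    runAll(sites, crawl).catch((err) => console.error('scheduled crawl failed', err));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A manual trigger such as POST /api/crawler/run-all could call the same `runAll` function, so scheduled and manual runs share one code path.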

Database Schema

crawls

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| url | TEXT | Crawled page URL |
| site_name | VARCHAR(255) | Human-readable site label |
| crawled_at | TIMESTAMP | When the crawl ran |
| status | VARCHAR(50) | completed or failed: ... |
| screenshot_path | TEXT | Filename in screenshots/ |

casinos

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| crawl_id | INT FK → crawls.id | Which crawl this casino belongs to |
| position | INT | Rank on the page |
| casino_name | VARCHAR(255) | Casino brand name |
| url | TEXT | Affiliate redirect URL |
| bonus_offer | TEXT | Welcome bonus / free spins text |

users

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| username | VARCHAR(100) UNIQUE | Login name |
| password_hash | VARCHAR(255) | bcrypt hash |
| role | VARCHAR(50) | Currently always admin |
| created_at | TIMESTAMP | Account creation time |
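
For reference, the three tables above translate to roughly the following DDL. This is a reconstruction from the column listings only, not a copy of setup-db.js: `TABLES`, `createTables`, and the use of `IF NOT EXISTS` are assumptions, and constraints the listings do not mention (NOT NULL, defaults, ON DELETE behaviour) are left out.

```javascript
// DDL derived from the documented columns. Order matters: casinos
// references crawls, so crawls must be created first.
const TABLES = [
  `CREATE TABLE IF NOT EXISTS crawls (
     id SERIAL PRIMARY KEY,
     url TEXT,
     site_name VARCHAR(255),
     crawled_at TIMESTAMP,
     status VARCHAR(50),
     screenshot_path TEXT
   )`,
  `CREATE TABLE IF NOT EXISTS casinos (
     id SERIAL PRIMARY KEY,
     crawl_id INT REFERENCES crawls(id),
     position INT,
     casino_name VARCHAR(255),
     url TEXT,
     bonus_offer TEXT
   )`,
  `CREATE TABLE IF NOT EXISTS users (
     id SERIAL PRIMARY KEY,
     username VARCHAR(100) UNIQUE,
     password_hash VARCHAR(255),
     role VARCHAR(50),
     created_at TIMESTAMP
   )`,
];

// `client` is any pg-style object exposing query(text).
async function createTables(client) {
  for (const ddl of TABLES) {
    await client.query(ddl);
  }
}
```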

API Endpoints

All authenticated endpoints require an Authorization: Bearer <token> header.
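
As an illustration of how a middleware like middleware/auth.js might enforce this header, here is a small Express-style sketch. `bearerToken`, `requireAuth`, and the error messages are hypothetical. Token verification is injected as `verify` so the sketch does not assume a particular JWT library; with the jsonwebtoken package it could be `(t) => jwt.verify(t, secret)`.

```javascript
// Pull the token out of an "Authorization: Bearer <token>" header value.
// Returns null when the header is missing or uses another scheme.
function bearerToken(header) {
  if (typeof header !== 'string') return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? match[1] : null;
}

// Middleware factory: `verify(token)` must return the decoded user or throw.
function requireAuth(verify) {
  return (req, res, next) => {
    const token = bearerToken(req.headers.authorization);
    if (!token) {
      return res.status(401).json({ error: 'Missing bearer token' });
    }
    let user;
    try {
      user = verify(token);
    } catch {
      return res.status(401).json({ error: 'Invalid or expired token' });
    }
    req.user = user; // made available to downstream route handlers
    next();
  };
}
```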

Auth

| Method | Path | Description |
|--------|------|-------------|
| POST | /api/auth/login | Login, returns JWT + user object |
| POST | /api/auth/register | Create new admin user |
| GET | /api/auth/me | Current user profile |

Crawler

| Method | Path | Description |
|--------|------|-------------|
| GET | /api/crawler/all | All crawls with nested casino arrays |
| GET | /api/crawler/:id | Single crawl detail + screenshot path |
| POST | /api/crawler/run-all | Trigger immediate crawl of all sites |
| POST | /api/crawler/run | Crawl a single custom URL (body: {url, siteName}) |
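
A client-side sketch of logging in and then crawling a single custom URL. Several details are assumptions rather than documented behaviour: the base URL (port 3001 is the Express port mentioned under Production Build), the `username`/`password` field names in the login body, and the `token` field in the login response. `buildRequest` and `loginAndCrawl` are hypothetical helpers; `fetch` is the global available in Node.js 18+.

```javascript
const BASE_URL = 'http://localhost:3001'; // assumed local backend address

// Build the URL and fetch options for one API call.
function buildRequest(method, path, { token, body } = {}) {
  const headers = {};
  if (token) headers.Authorization = `Bearer ${token}`;
  if (body !== undefined) headers['Content-Type'] = 'application/json';
  return {
    url: `${BASE_URL}${path}`,
    options: {
      method,
      headers,
      body: body === undefined ? undefined : JSON.stringify(body),
    },
  };
}

// Log in, then POST /api/crawler/run with the returned token.
// `target` has the documented shape { url, siteName }.
async function loginAndCrawl(username, password, target, fetchImpl = fetch) {
  const login = buildRequest('POST', '/api/auth/login', { body: { username, password } });
  const loginRes = await fetchImpl(login.url, login.options);
  const { token } = await loginRes.json(); // response field name is assumed

  const run = buildRequest('POST', '/api/crawler/run', { token, body: target });
  const runRes = await fetchImpl(run.url, run.options);
  return runRes.json();
}
```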

Health

| Method | Path | Description |
|--------|------|-------------|
| GET | /api/health | DB connectivity check |

Adding New Sites

  1. Add the site config object to src/services/scheduler.js under sites[].
  2. Write a new extractor method in src/services/crawler.js and add a URL-based dispatch in extractCasinoData().
  3. Restart the backend.
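
The URL-based dispatch from step 2 could look something like the sketch below. The extractor names (`extractTop10OnlineSlots`, `extractUbet`, `extractGeneric`) and the `pickExtractor` helper are hypothetical; the actual dispatch inside extractCasinoData() may be structured differently.

```javascript
// Map of hostname -> extractor method name (names are illustrative only).
const EXTRACTORS = new Map([
  ['top10onlineslots.co.uk', 'extractTop10OnlineSlots'],
  ['ubet.co.uk', 'extractUbet'],
]);

// Choose an extractor from the page URL, falling back to the generic one.
function pickExtractor(pageUrl) {
  const host = new URL(pageUrl).hostname.replace(/^www\./, '');
  return EXTRACTORS.get(host) ?? 'extractGeneric';
}

// Registering a new site then takes a single line:
// EXTRACTORS.set('example.com', 'extractExample');
```

Keying on the hostname rather than the full URL means a ranking page can move to a different path without breaking the dispatch.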

Screenshots

Full-page screenshots are saved as PNGs in screenshots/ and served statically at /screenshots/<filename>. Each crawl writes one file named <siteName>_<timestamp>.png. The dashboard viewer loads them through the Vite proxy → Express static route.
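
A sketch of how the file name and its public URL could be derived. The timestamp format is not specified here, so epoch milliseconds are assumed; the sanitising of the site name and the helper names `screenshotFilename` and `screenshotPaths` are also assumptions.

```javascript
const path = require('node:path');

// Build "<siteName>_<timestamp>.png". Characters outside letters, digits,
// dots and hyphens are replaced so the name is safe on disk and in URLs.
function screenshotFilename(siteName, when = new Date()) {
  const safe = siteName.replace(/[^a-z0-9.-]+/gi, '-');
  return `${safe}_${when.getTime()}.png`;
}

// Where the file is written and where the static route exposes it.
function screenshotPaths(siteName, when = new Date()) {
  const file = screenshotFilename(siteName, when);
  return {
    file,
    disk: path.join('screenshots', file),
    url: `/screenshots/${file}`,
  };
}
```

Storing only `file` in crawls.screenshot_path, as the schema describes, lets both the disk path and the URL be rebuilt from it.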

Production Build

cd casino-dashboard
npm run build   # outputs to dist/

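The dist/ folder can be served by any static server or reverse-proxied behind Nginx alongside the Express API on port 3001. Set VITE_API_URL=https://yourdomain.com/api in the environment before running npm run build so the frontend talks to the correct backend. Vite inlines VITE_-prefixed variables at build time, so changing the value later requires a rebuild.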