# Casino Affiliate Crawler

Headless browser crawler that scrapes casino affiliate ranking pages, stores extracted data in PostgreSQL, and provides a React backoffice dashboard for viewing results.

## Architecture

```
crawler/                      # Backend (Node.js / Express)
├── src/
│   ├── app.js                # Express server entry point
│   ├── setup-db.js           # Database initialisation script
│   ├── db.js                 # PostgreSQL pool config
│   ├── middleware/auth.js    # JWT authentication middleware
│   ├── routes/
│   │   ├── auth.js           # Login, register, profile endpoints
│   │   └── crawler.js        # Crawl data & trigger endpoints
│   └── services/
│       ├── crawler.js        # Puppeteer crawl + DOM extraction
│       └── scheduler.js      # Periodic crawl job (every hour)
├── screenshots/              # Full-page screenshots per crawl
└── package.json

casino-dashboard/             # Frontend (React / Vite)
├── src/
│   ├── api.js                # Axios client + auth helpers
│   ├── App.jsx               # Router + AuthProvider wrapper
│   └── components/
│       ├── Login.jsx         # Sign-in form with JWT
│       ├── Dashboard.jsx     # Crawl history list + run button
│       ├── CrawlDetail.jsx   # Casino table, screenshot viewer
│       └── Sidebar.jsx       # Navigation shell
└── package.json
```

## Prerequisites

- **Node.js** 18+
- **Google Chrome** installed on the system
- **PostgreSQL** reachable at `192.168.21.197:5432` with user `postgres`

## Quick Start

### 1. Install dependencies

```bash
# Backend
cd crawler
npm install

# Frontend
cd casino-dashboard
npm install
```

### 2. Initialise the database

```bash
cd ../
node src/setup-db.js
```

This creates the `casino_crawler` database and tables (`crawls`, `casinos`, `users`). A default admin user is seeded:

| Username | Password |
|----------|----------|
| `admin` | `admin123` |

### 3. Start both servers

```bash
# Terminal 1 – Backend
cd crawler
npm start

# Terminal 2 – Frontend
cd casino-dashboard
npm run dev
```

- **Backend API**: http://localhost:3001
- **Frontend Dashboard**: http://localhost:5173
- First crawl runs automatically ~5 s after backend starts, then every hour.

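
To confirm the backend is reachable after starting it, a small check against the `/api/health` endpoint can help. The following is a minimal sketch, not code from the repository: `checkHealth` is a hypothetical helper, and it only assumes that a healthy backend answers with a 2xx status (the response body is not inspected). It relies on the global `fetch` available in Node.js 18+.

```javascript
// Hypothetical helper: not part of the repository.
// Resolves to true when GET <baseUrl>/api/health answers with a 2xx status.
// `fetchImpl` is injectable so the function can be exercised without a server.
async function checkHealth(baseUrl = 'http://localhost:3001', fetchImpl = fetch) {
  try {
    const res = await fetchImpl(`${baseUrl}/api/health`);
    return res.ok;
  } catch (err) {
    // Connection refused, DNS failure, etc.: treat as "not healthy".
    return false;
  }
}
```

Calling `checkHealth().then((ok) => console.log(ok ? 'backend up' : 'backend down'))` from a Node script would report whether the API on port 3001 responds.
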
## How It Works

### Crawler (`src/services/crawler.js`)

Uses Puppeteer + `puppeteer-extra-plugin-stealth` to bypass CloudFront bot detection. Each run:

1. Navigates to the target affiliate ranking page
2. Waits for network idle + 5 s buffer for lazy-loaded content
3. Takes a full-page screenshot stored in `screenshots/`
4. Extracts casino name, position, bonus offer, and affiliate link via site-specific DOM strategies
5. Inserts records into PostgreSQL

Two targeted extractors are implemented:

| Site | Selector Strategy |
|------|------------------|
| **top10onlineslots.co.uk** | Finds divs containing "Get Bonus" text and a logo image, pulls bonus from child spans |
| **ubet.co.uk** | Targets `.mainProduct.row-index-N` cards, reads `wss-vendorName-*` for name and `coupon-container` for the offer |

A generic fallback covers any future affiliate site.

### Scheduled Runs

Every hour the scheduler triggers crawls for all configured sites (see `src/services/scheduler.js`). A crawl can also be triggered manually via the button in the dashboard or a POST to `/api/crawler/run-all`.

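
The timing described above (a first crawl shortly after startup, then one per hour) could be expressed with plain Node.js timers. This is a sketch under assumptions, not the contents of `src/services/scheduler.js`: `startScheduler` and `runAll` are hypothetical names, and the real scheduler may use a different mechanism such as a cron library.

```javascript
// Hypothetical sketch; the real logic lives in src/services/scheduler.js.
const STARTUP_DELAY_MS = 5 * 1000;   // ~5 s after the backend starts
const INTERVAL_MS = 60 * 60 * 1000;  // then every hour

// `runAll` is an assumed function that crawls every configured site.
// `timers` is injectable so the schedule can be checked without waiting.
function startScheduler(runAll, timers = { setTimeout, setInterval }) {
  const safeRun = () => {
    // A failed crawl is logged and must not stop later runs.
    Promise.resolve()
      .then(runAll)
      .catch((err) => console.error('Scheduled crawl failed:', err));
  };
  const first = timers.setTimeout(safeRun, STARTUP_DELAY_MS);
  const repeat = timers.setInterval(safeRun, INTERVAL_MS);
  // Returning the handles lets the caller cancel the timers on shutdown.
  return { first, repeat };
}
```

Wrapping `runAll` in a promise chain means both synchronous throws and rejected promises end up in the same `catch`, so one bad crawl does not take down the hourly loop.
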
## Database Schema

### `crawls`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| url | TEXT | Crawled page URL |
| site_name | VARCHAR(255) | Human-readable site label |
| crawled_at | TIMESTAMP | When the crawl ran |
| status | VARCHAR(50) | `completed` or `failed: ...` |
| screenshot_path | TEXT | Filename in `screenshots/` |

### `casinos`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| crawl_id | INT FK → crawls.id | Which crawl this casino belongs to |
| position | INT | Rank on the page |
| casino_name | VARCHAR(255) | Casino brand name |
| url | TEXT | Affiliate redirect URL |
| bonus_offer | TEXT | Welcome bonus / free spins text |

### `users`

| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL PK | Auto-increment |
| username | VARCHAR(100) UNIQUE | Login name |
| password_hash | VARCHAR(255) | bcrypt hash |
| role | VARCHAR(50) | Currently always `admin` |
| created_at | TIMESTAMP | Account creation time |

## API Endpoints

All authenticated endpoints require an `Authorization: Bearer <token>` header, where the token is the JWT returned by the login endpoint.

### Auth

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/auth/login` | Login, returns JWT + user object |
| POST | `/api/auth/register` | Create new admin user |
| GET | `/api/auth/me` | Current user profile |

### Crawler

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/crawler/all` | All crawls with nested casino arrays |
| GET | `/api/crawler/:id` | Single crawl detail + screenshot path |
| POST | `/api/crawler/run-all` | Trigger immediate crawl of all sites |
| POST | `/api/crawler/run` | Crawl a single custom URL (body: `{url, siteName}`) |

### Health

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/health` | DB connectivity check |

## Adding New Sites

1. Add the site config object to `src/services/scheduler.js` under `sites[]`.
2. Write a new extractor method in `src/services/crawler.js` and add a URL-based dispatch in `extractCasinoData()`.
3. Restart the backend.

## Screenshots

Full-page screenshots are saved as PNGs in `screenshots/` and served statically at `/screenshots/`. Each crawl writes one file named `_.png`. The dashboard viewer loads them through the Vite proxy → Express static route.

## Production Build

```bash
cd casino-dashboard
npm run build    # outputs to dist/
```

The `dist/` folder can be served by any static server or reverse-proxied behind Nginx alongside the Express API on port 3001. Set `VITE_API_URL=https://yourdomain.com/api` as an environment variable so the frontend talks to the correct backend.
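
As an illustration of how the frontend might consume `VITE_API_URL`, here is a minimal sketch. `resolveApiBase` is a hypothetical helper, not code from `casino-dashboard/src/api.js`, and the `/api` fallback is an assumption that only holds when a dev proxy or reverse proxy forwards `/api` to the Express server.

```javascript
// Hypothetical helper for casino-dashboard/src/api.js.
// `env` is expected to be Vite's import.meta.env (or a plain object in tests).
function resolveApiBase(env = {}) {
  const configured = (env.VITE_API_URL || '').trim();
  // Assumed fallback: a relative /api path handled by a proxy.
  const base = configured || '/api';
  // Strip trailing slashes so `${base}/crawler/all` never contains "//".
  return base.replace(/\/+$/, '');
}
```

In the dashboard this could be wired up as `axios.create({ baseURL: resolveApiBase(import.meta.env) })`. Vite inlines `VITE_`-prefixed variables at build time, so the variable has to be set when `npm run build` runs, not when the built files are served.
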