Building a Headless BI Usage Scraper with Playwright

by Marc

Our BI tool does not expose a usage API. So I built a headless Playwright scraper on Cloud Run that logs into the BI platform, scrapes dashboard usage metrics, loads them into BigQuery, and sends weekly Slack summaries. Here is how I handled the hardest part: session management.

Why Scrape a BI Tool?

We use a BI platform for all our dashboards. It has a usage monitoring page that shows who viewed which dashboards and when — exactly the data I needed to track dashboard adoption and identify unused dashboards for cleanup. But the platform does not expose this data via API. No endpoint, no webhook, no export button.

I needed this data in BigQuery so I could join it with our other metadata (dashboard ownership, data freshness, cost per query) and build an internal “data product health” dashboard. The only way to get it was to scrape it.

Architecture Overview

The system has three components:

  1. Cloud Run service — runs Playwright headless Chromium, scrapes the usage page, loads to BigQuery, optionally sends a Slack summary
  2. Cloud Scheduler — triggers the scraper daily (data collection) and weekly (Slack report)
  3. Session refresh script — a local script that opens a real browser for manual login, then uploads the cookies to GCS for the Cloud Run service to use

The Session Management Problem

The BI platform uses SSO with our identity provider. The login flow involves redirects, CSRF tokens, and MFA — there is no way to automate this with stored credentials. I needed a way to:

  • Log in manually once via a real browser
  • Capture the authenticated session (cookies)
  • Reuse those cookies in headless Playwright on Cloud Run
  • Detect when the session expires and alert me to refresh it

Capturing Cookies Locally

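Playwright browser contexts can export and import their storage. Calling context.storage_state() returns the context's cookies and localStorage as a JSON-serializable dict, and passing that dict (or a path to a saved file) as storage_state to browser.new_context() restores them. I wrote a local refresh script that: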

  1. Opens a visible Chromium browser with Playwright
  2. Navigates to the BI platform login page
  3. Waits for me to complete the SSO/MFA flow manually
  4. Saves the browser context’s storage state (cookies + localStorage) to a JSON file
  5. Uploads that JSON to a GCS bucket

#!/usr/bin/env python3
"""refresh-session.py — Open a browser, log in manually, save cookies to GCS."""

import json
from playwright.sync_api import sync_playwright
from google.cloud import storage

BI_TOOL_URL = "https://bi-tool.example.com"
GCS_BUCKET = "bi-usage-scraper-sessions"
GCS_BLOB = "browser_state.json"
LOCAL_STATE = "/tmp/browser_state.json"


def refresh_session():
    with sync_playwright() as p:
        # Launch a VISIBLE browser — user needs to interact with it
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()

        page = context.new_page()
        page.goto(BI_TOOL_URL)

        print("\n=== Log in to the BI tool in the browser window ===")
        print("Press ENTER here when you are logged in and see the dashboard...")
        input()

        # Verify we are actually logged in
        if "login" in page.url.lower():
            print("ERROR: Still on login page. Try again.")
            browser.close()
            return

        # Save the full browser state (cookies + localStorage)
        state = context.storage_state()
        with open(LOCAL_STATE, "w") as f:
            json.dump(state, f)

        print(f"Saved browser state: {len(state.get('cookies', []))} cookies")

        browser.close()

    # Upload to GCS
    client = storage.Client()
    bucket = client.bucket(GCS_BUCKET)
    blob = bucket.blob(GCS_BLOB)
    blob.upload_from_filename(LOCAL_STATE)
    print(f"Uploaded to gs://{GCS_BUCKET}/{GCS_BLOB}")


if __name__ == "__main__":
    refresh_session()

Loading Cookies on Cloud Run

On the Cloud Run side, the scraper downloads the cookie state from GCS at startup and uses it to create an authenticated Playwright context:

import json
from playwright.sync_api import sync_playwright
from google.cloud import storage

GCS_BUCKET = "bi-usage-scraper-sessions"
GCS_BLOB = "browser_state.json"


def load_browser_state() -> dict:
    """Download browser state from GCS and parse it as JSON."""
    client = storage.Client()
    blob = client.bucket(GCS_BUCKET).blob(GCS_BLOB)
    # Read the blob straight into memory so no temp file is left on disk
    return json.loads(blob.download_as_text())


def create_authenticated_context(playwright):
    """Create a Playwright browser context with stored cookies."""
    state = load_browser_state()

    browser = playwright.chromium.launch(
        headless=True,
        args=[
            "--no-sandbox",
            "--disable-dev-shm-usage",  # Required for Cloud Run
            "--disable-gpu",
        ],
    )

    # Create context WITH the stored state — cookies are loaded automatically
    context = browser.new_context(
        storage_state=state,
        viewport={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
    )

    return browser, context

The Scraping Logic

The BI tool’s usage monitoring page renders a table with columns for dashboard name, viewer email, view count, and last viewed date. The scraper navigates to this page and extracts the table data:

from datetime import datetime, timezone


def scrape_usage(context) -> list[dict]:
    """Scrape the usage monitoring page and return structured data."""
    page = context.new_page()

    # Navigate to the usage monitoring page
    page.goto("https://bi-tool.example.com/admin/usage-monitoring")

    # Check for a login redirect (session expired) BEFORE waiting for the table.
    # The table never renders on the login page, so waiting first would end in
    # a generic timeout instead of a SessionExpiredError.
    if "login" in page.url.lower():
        raise SessionExpiredError("Browser session has expired; run refresh-session.sh")

    # Wait for the usage table to render (it loads asynchronously)
    page.wait_for_selector("table.usage-table", timeout=30000)

    # Some BI tools paginate their usage tables, so click "load more" until exhausted
    while True:
        load_more = page.query_selector("button.load-more")
        if not load_more or not load_more.is_visible():
            break
        load_more.click()
        page.wait_for_timeout(2000)  # Wait for new rows to render

    # One timezone-aware UTC timestamp shared by every row of this run
    scraped_at = datetime.now(timezone.utc).isoformat()

    # Extract table rows
    rows = page.query_selector_all("table.usage-table tbody tr")
    results = []

    for row in rows:
        cells = row.query_selector_all("td")
        if len(cells) >= 4:
            results.append({
                "dashboard_name": cells[0].inner_text().strip(),
                "viewer_email": cells[1].inner_text().strip(),
                "view_count": int(cells[2].inner_text().strip()),
                "last_viewed": cells[3].inner_text().strip(),
                "scraped_at": scraped_at,
            })

    page.close()
    return results
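
One way to make the extraction step easier to test is to pull the cell parsing out of the Playwright code so it works on plain strings. The sketch below is not part of the scraper above: parse_view_count and parse_usage_row are hypothetical helpers, and the assumption that view counts may contain thousands separators such as "1,234" is mine, since the post does not say how the BI tool formats them.

```python
from typing import Optional


def parse_view_count(text: str) -> int:
    """Convert a cell such as '1,234' or ' 42 ' to an int; empty cells become 0."""
    cleaned = text.strip().replace(",", "")  # assumes comma thousands separators
    return int(cleaned) if cleaned else 0


def parse_usage_row(cells: list[str], scraped_at: str) -> Optional[dict]:
    """Turn the text of one table row into a record, or None if the row is too short."""
    if len(cells) < 4:
        return None
    return {
        "dashboard_name": cells[0].strip(),
        "viewer_email": cells[1].strip(),
        "view_count": parse_view_count(cells[2]),
        "last_viewed": cells[3].strip(),
        "scraped_at": scraped_at,
    }
```

Inside the scraping loop, this would be called with [cell.inner_text() for cell in cells], which keeps the browser-specific code down to selecting elements.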

Loading to BigQuery

The scraped data goes into a BigQuery table with a simple append pattern. Each scrape run adds rows with a scraped_at timestamp, so we can track usage trends over time:

from google.cloud import bigquery

BQ_TABLE = "my-project.bi_usage.dashboard_views"


def load_to_bigquery(rows: list[dict]):
    """Load scraped usage data to BigQuery."""
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("dashboard_name", "STRING"),
            bigquery.SchemaField("viewer_email", "STRING"),
            bigquery.SchemaField("view_count", "INTEGER"),
            bigquery.SchemaField("last_viewed", "STRING"),
            bigquery.SchemaField("scraped_at", "TIMESTAMP"),
        ],
        write_disposition="WRITE_APPEND",
    )

    job = client.load_table_from_json(rows, BQ_TABLE, job_config=job_config)
    job.result()  # Wait for completion
    print(f"Loaded {len(rows)} rows to {BQ_TABLE}")
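
To illustrate the kind of question these snapshots can answer, here is a rough sketch that finds dashboards nobody has viewed recently. In practice this would more likely be a SQL query over the BigQuery table; the Python version just shows the logic on the scraped rows. find_stale_dashboards is a hypothetical helper, and it assumes last_viewed is an ISO date (YYYY-MM-DD), which may not match what the BI tool actually renders.

```python
from datetime import date, timedelta


def find_stale_dashboards(rows: list[dict], today: date, max_age_days: int = 60) -> list[str]:
    """Return dashboards whose most recent view is older than max_age_days."""
    latest: dict[str, date] = {}
    for row in rows:
        viewed = date.fromisoformat(row["last_viewed"])  # assumes YYYY-MM-DD
        name = row["dashboard_name"]
        # Keep only the newest view date seen for each dashboard
        if name not in latest or viewed > latest[name]:
            latest[name] = viewed
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, viewed in latest.items() if viewed < cutoff)
```

Dashboards that never appear in the usage table at all would not show up here, so a complete check would also need the list of existing dashboards from another source.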

Session Expiry Detection and Alerting

Session cookies typically expire after 7-30 days depending on the identity provider. When the session expires, the scraper gets redirected to the login page. I detect this and send a Slack alert:

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."


class SessionExpiredError(Exception):
    pass


def send_slack_alert(message: str):
    """Send an alert to Slack."""
    # A timeout keeps a slow Slack endpoint from hanging the scrape request
    requests.post(
        SLACK_WEBHOOK_URL,
        json={
            "text": f":warning: *BI Usage Scraper*: {message}\n"
                    f"Run `./refresh-session.sh` to fix."
        },
        timeout=10,
    )


def run_scrape():
    """Main entry point — scrape, load, alert on failure."""
    try:
        with sync_playwright() as p:
            browser, context = create_authenticated_context(p)
            rows = scrape_usage(context)
            load_to_bigquery(rows)
            browser.close()
            return {"success": True, "rows": len(rows)}
    except SessionExpiredError as e:
        send_slack_alert(str(e))
        return {"success": False, "error": "session_expired"}
    except Exception as e:
        send_slack_alert(f"Unexpected error: {e}")
        raise
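
Besides reacting to a login redirect, it may be possible to warn a few days ahead by inspecting the stored state itself. In Playwright's storage state, each cookie carries an expires field in Unix seconds, with -1 for session cookies. The sketch below uses that to find the soonest expiry for a domain. earliest_cookie_expiry and session_expires_soon are hypothetical helpers, and the three-day warning window is an arbitrary choice. Treat the result as a hint only: the server can end a session earlier than the cookie says, and the soonest-expiring cookie is not necessarily the one that carries the session.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def earliest_cookie_expiry(state: dict, domain_suffix: str) -> Optional[datetime]:
    """Return the soonest expiry among persistent cookies for the given domain."""
    expiries = [
        cookie["expires"]
        for cookie in state.get("cookies", [])
        # Skip other domains and session cookies (expires == -1)
        if cookie.get("domain", "").endswith(domain_suffix) and cookie.get("expires", -1) > 0
    ]
    if not expiries:
        return None
    return datetime.fromtimestamp(min(expiries), tz=timezone.utc)


def session_expires_soon(
    state: dict,
    domain_suffix: str,
    now: datetime,
    warn_within: timedelta = timedelta(days=3),
) -> bool:
    """True if a persistent cookie for the domain expires within warn_within."""
    expiry = earliest_cookie_expiry(state, domain_suffix)
    return expiry is not None and expiry - now <= warn_within
```

The scraper could call session_expires_soon(load_browser_state(), "bi-tool.example.com", datetime.now(timezone.utc)) at the start of a run and send a Slack reminder when it returns True. Cookies set on a parent domain would need a broader suffix.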

Cloud Run Configuration

Running Playwright on Cloud Run requires some specific configuration. Chromium is memory-hungry, even in headless mode:

# Dockerfile
FROM python:3.12-slim

# Install Chromium dependencies
RUN apt-get update && apt-get install -y \
    libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 \
    libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
    libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 \
    libcairo2 libasound2 libatspi2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install chromium

COPY . .

CMD ["gunicorn", "main:app", "-k", "uvicorn.workers.UvicornWorker", \
     "--bind", "0.0.0.0:8080", "--timeout", "120"]

# Deploy with enough memory for Chromium
gcloud run deploy bi-usage-scraper \
  --project=my-gcp-project \
  --region=europe-west1 \
  --image=europe-west1-docker.pkg.dev/my-gcp-project/containers/bi-usage-scraper \
  --memory=1Gi \
  --cpu=1 \
  --timeout=120 \
  --min-instances=0 \
  --max-instances=1 \
  --no-allow-unauthenticated

Key settings:

  • 1 GiB memory: Chromium needs at least 512MB; 1 GiB gives comfortable headroom
  • 120-second timeout: the page can take 20-30 seconds to fully render with all usage data
  • max-instances=1: we never need more than one concurrent scrape, and this prevents accidental parallel runs
  • no-allow-unauthenticated: the scraper is only triggered by Cloud Scheduler with IAM auth

Cloud Scheduler Setup

Two scheduler jobs: a daily scrape for data collection (no Slack notification), and a weekly one that also sends a Slack summary:

# Daily scrape (weekdays 11:00 Amsterdam time, after the BI tool's own usage data refreshes)
gcloud scheduler jobs create http bi-usage-daily \
  --project=my-gcp-project \
  --location=europe-west1 \
  --schedule="0 11 * * 1-5" \
  --time-zone="Europe/Amsterdam" \
  --uri="https://bi-usage-scraper-xxxxx.run.app/trigger/scrape" \
  --http-method=POST \
  --oidc-service-account-email="[email protected]"

# Weekly Slack summary (Monday 11:05 Amsterdam time)
gcloud scheduler jobs create http bi-usage-weekly \
  --project=my-gcp-project \
  --location=europe-west1 \
  --schedule="5 11 * * 1" \
  --time-zone="Europe/Amsterdam" \
  --uri="https://bi-usage-scraper-xxxxx.run.app/trigger/scrape?slack=true" \
  --http-method=POST \
  --oidc-service-account-email="[email protected]"
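
The post does not show how the ?slack=true run turns scraped rows into a message, so the following is only a guess at the shape. It is a minimal sketch of a summary builder that totals views per dashboard and lists the top few. build_weekly_summary and the message format are hypothetical, and whether view_count is cumulative or per period depends on the BI tool.

```python
from collections import Counter


def build_weekly_summary(rows: list[dict], top_n: int = 5) -> str:
    """Format a short Slack message listing the most-viewed dashboards."""
    views: Counter = Counter()
    viewers: dict[str, set] = {}
    for row in rows:
        name = row["dashboard_name"]
        views[name] += row["view_count"]
        viewers.setdefault(name, set()).add(row["viewer_email"])

    lines = [f"*BI usage summary*: {len(views)} dashboards, {sum(views.values())} views"]
    # most_common sorts by total views, highest first
    for name, count in views.most_common(top_n):
        lines.append(f"• {name}: {count} views by {len(viewers[name])} viewer(s)")
    return "\n".join(lines)
```

The returned string could be posted to the same Slack webhook that the alerting code uses.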

What I Learned

  • Playwright’s storage_state is the right abstraction for session persistence. It captures cookies AND localStorage in a single JSON blob. This is much more reliable than manually extracting and replaying individual cookies.
  • Store browser state in GCS, not baked into the Docker image. Session cookies expire. If you bake them into the image, you need to rebuild and redeploy every time they expire. With GCS, you just run the refresh script and the next scrape picks up the new cookies automatically.
  • Cloud Run + Playwright works, but needs memory. Chromium is hungry. Do not try to run it on a 256MB container. 1 GiB is the sweet spot for single-page scraping.
  • Always detect session expiry explicitly. Do not let the scraper silently fail. Check the URL after navigation — if you ended up on a login page, the session is dead. Alert immediately so you can refresh within hours, not days.
  • The manual login step is a feature, not a bug. Yes, it would be nice to fully automate the login. But SSO with MFA exists for a reason, and a manual login once every few weeks is a small price for not storing long-lived credentials or trying to automate MFA bypass.

The scraper has been running for about two months now. Session cookies last approximately 3 weeks before expiring, so I run the refresh script twice a month. The data it collects has already helped us identify 4 dashboards that nobody had viewed in over 60 days — all of which we subsequently deprecated and removed.

Categories: Python, Tools & Automation
