Our BI tool does not expose a usage API. So I built a headless Playwright scraper on Cloud Run that logs into the BI platform, scrapes dashboard usage metrics, loads them into BigQuery, and sends weekly Slack summaries. Here is how I handled the hardest part: session management.
Why Scrape a BI Tool?
We use a BI platform for all our dashboards. It has a usage monitoring page that shows who viewed which dashboards and when — exactly the data I needed to track dashboard adoption and identify unused dashboards for cleanup. But the platform does not expose this data via API. No endpoint, no webhook, no export button.
I needed this data in BigQuery so I could join it with our other metadata (dashboard ownership, data freshness, cost per query) and build an internal “data product health” dashboard. The only way to get it was to scrape it.
Architecture Overview
The system has three components:
- Cloud Run service — runs Playwright headless Chromium, scrapes the usage page, loads to BigQuery, optionally sends a Slack summary
- Cloud Scheduler — triggers the scraper daily (data collection) and weekly (Slack report)
- Session refresh script — a local script that opens a real browser for manual login, then uploads the cookies to GCS for the Cloud Run service to use
The Session Management Problem
The BI platform uses SSO with our identity provider. The login flow involves redirects, CSRF tokens, and MFA — there is no way to automate this with stored credentials. I needed a way to:
- Log in manually once via a real browser
- Capture the authenticated session (cookies)
- Reuse those cookies in headless Playwright on Cloud Run
- Detect when the session expires and alert me to refresh it
Capturing Cookies Locally
Playwright models a browser session as a browser context. A context's storage state (cookies plus localStorage) can be exported with context.storage_state() and restored later by passing storage_state to browser.new_context(). I wrote a local refresh script that:
- Opens a visible Chromium browser with Playwright
- Navigates to the BI platform login page
- Waits for me to complete the SSO/MFA flow manually
- Saves the browser context’s storage state (cookies + localStorage) to a JSON file
- Uploads that JSON to a GCS bucket
#!/usr/bin/env python3
"""refresh-session.py: open a browser, log in manually, save cookies to GCS."""
import json

from playwright.sync_api import sync_playwright
from google.cloud import storage

BI_TOOL_URL = "https://bi-tool.example.com"
GCS_BUCKET = "bi-usage-scraper-sessions"
GCS_BLOB = "browser_state.json"
LOCAL_STATE = "/tmp/browser_state.json"


def refresh_session():
    with sync_playwright() as p:
        # Launch a VISIBLE browser because the user needs to interact with it
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(BI_TOOL_URL)

        print("\n=== Log in to the BI tool in the browser window ===")
        print("Press ENTER here when you are logged in and see the dashboard...")
        input()

        # Verify we are actually logged in
        if "login" in page.url.lower():
            print("ERROR: Still on login page. Try again.")
            browser.close()
            return

        # Save the full browser state (cookies + localStorage)
        state = context.storage_state()
        with open(LOCAL_STATE, "w") as f:
            json.dump(state, f)
        print(f"Saved browser state: {len(state.get('cookies', []))} cookies")
        browser.close()

    # Upload to GCS
    client = storage.Client()
    bucket = client.bucket(GCS_BUCKET)
    blob = bucket.blob(GCS_BLOB)
    blob.upload_from_filename(LOCAL_STATE)
    print(f"Uploaded to gs://{GCS_BUCKET}/{GCS_BLOB}")


if __name__ == "__main__":
    refresh_session()
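Before uploading, it can be worth checking that the captured state really contains a cookie for the BI tool's host, since an SSO flow may leave only identity-provider cookies behind if the final redirect did not complete. The following is a minimal sketch, not part of the original script: `has_cookies_for_host` is a hypothetical helper, and it assumes the usual Playwright storage-state layout, a dict with a `cookies` list whose entries carry a `domain` field.

```python
from urllib.parse import urlparse


def has_cookies_for_host(state: dict, base_url: str) -> bool:
    """Return True if the storage state holds at least one cookie for base_url's host.

    Hypothetical helper: assumes `state` follows Playwright's storage-state
    layout, i.e. {"cookies": [{"domain": ...}, ...], "origins": [...]}.
    """
    host = urlparse(base_url).hostname or ""
    for cookie in state.get("cookies", []):
        # Cookie domains may carry a leading dot (".example.com")
        domain = cookie.get("domain", "").lstrip(".")
        if domain and (host == domain or host.endswith("." + domain)):
            return True
    return False
```

In the refresh script this could run right after `context.storage_state()`, aborting the upload when it returns False.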
Loading Cookies on Cloud Run
On the Cloud Run side, the scraper downloads the cookie state from GCS at startup and uses it to create an authenticated Playwright context:
import json

from playwright.sync_api import sync_playwright
from google.cloud import storage

GCS_BUCKET = "bi-usage-scraper-sessions"
GCS_BLOB = "browser_state.json"


def load_browser_state() -> dict:
    """Download browser state from GCS."""
    client = storage.Client()
    bucket = client.bucket(GCS_BUCKET)
    blob = bucket.blob(GCS_BLOB)
    # Read the JSON straight into memory; no temp file is left behind
    return json.loads(blob.download_as_text())


def create_authenticated_context(playwright):
    """Create a Playwright browser context with stored cookies."""
    state = load_browser_state()
    browser = playwright.chromium.launch(
        headless=True,
        args=[
            "--no-sandbox",
            "--disable-dev-shm-usage",  # Required for Cloud Run
            "--disable-gpu",
        ],
    )
    # Create context WITH the stored state; cookies are loaded automatically
    context = browser.new_context(
        storage_state=state,
        viewport={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
    )
    return browser, context
The Scraping Logic
The BI tool’s usage monitoring page renders a table with columns for dashboard name, viewer email, view count, and last viewed date. The scraper navigates to this page and extracts the table data:
from datetime import datetime, timezone


def scrape_usage(context) -> list[dict]:
    """Scrape the usage monitoring page and return structured data."""
    page = context.new_page()

    # Navigate to the usage monitoring page
    page.goto("https://bi-tool.example.com/admin/usage-monitoring")

    # Check for a login redirect (session expired) BEFORE waiting for the table;
    # otherwise the wait below times out and the expiry is never reported.
    # SessionExpiredError is defined in the alerting section.
    if "login" in page.url.lower():
        raise SessionExpiredError("Browser session has expired — run refresh-session.sh")

    # Wait for the usage table to render (it loads asynchronously)
    page.wait_for_selector("table.usage-table", timeout=30000)

    # Some BI tools paginate their usage tables: click "load more" until exhausted
    while True:
        load_more = page.query_selector("button.load-more")
        if not load_more or not load_more.is_visible():
            break
        load_more.click()
        page.wait_for_timeout(2000)  # Wait for new rows to render

    # Extract table rows
    rows = page.query_selector_all("table.usage-table tbody tr")
    results = []
    for row in rows:
        cells = row.query_selector_all("td")
        if len(cells) >= 4:
            results.append({
                "dashboard_name": cells[0].inner_text().strip(),
                "viewer_email": cells[1].inner_text().strip(),
                "view_count": int(cells[2].inner_text().strip()),
                "last_viewed": cells[3].inner_text().strip(),
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                "scraped_at": datetime.now(timezone.utc).isoformat(),
            })

    page.close()
    return results
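The per-row parsing in the loop above can be pulled out into a pure function, which makes it testable without a browser and lets a malformed row be skipped instead of crashing the whole run. This is a sketch under assumptions: `parse_usage_row` is a hypothetical name, and the handling of thousands separators such as "1,234" is a guess about how the BI tool might format counts.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_usage_row(cells: list[str], scraped_at: datetime) -> Optional[dict]:
    """Turn the cell texts of one table row into a record, or None if malformed.

    Hypothetical helper: expects the four columns described above
    (dashboard name, viewer email, view count, last viewed date).
    """
    if len(cells) < 4:
        return None
    # Assumption: counts may contain thousands separators such as "1,234"
    raw_count = cells[2].strip().replace(",", "")
    if not raw_count.isdecimal():
        return None
    return {
        "dashboard_name": cells[0].strip(),
        "viewer_email": cells[1].strip(),
        "view_count": int(raw_count),
        "last_viewed": cells[3].strip(),
        "scraped_at": scraped_at.isoformat(),
    }
```

Inside the scraper, each row's `inner_text()` values would be collected into a list and passed in together with a single `datetime.now(timezone.utc)` taken once per run, so every row of a scrape shares the same timestamp.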
Loading to BigQuery
The scraped data goes into a BigQuery table with a simple append pattern. Each scrape run adds rows with a scraped_at timestamp, so we can track usage trends over time:
from google.cloud import bigquery

BQ_TABLE = "my-project.bi_usage.dashboard_views"


def load_to_bigquery(rows: list[dict]):
    """Load scraped usage data to BigQuery."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("dashboard_name", "STRING"),
            bigquery.SchemaField("viewer_email", "STRING"),
            bigquery.SchemaField("view_count", "INTEGER"),
            bigquery.SchemaField("last_viewed", "STRING"),
            bigquery.SchemaField("scraped_at", "TIMESTAMP"),
        ],
        write_disposition="WRITE_APPEND",
    )
    job = client.load_table_from_json(rows, BQ_TABLE, job_config=job_config)
    job.result()  # Wait for completion
    print(f"Loaded {len(rows)} rows to {BQ_TABLE}")
Session Expiry Detection and Alerting
Session cookies typically expire after 7-30 days depending on the identity provider. When the session expires, the scraper gets redirected to the login page. I detect this and send a Slack alert:
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."


class SessionExpiredError(Exception):
    pass


def send_slack_alert(message: str):
    """Send an alert to Slack."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={
            "text": f":warning: *BI Usage Scraper*: {message}\n"
                    f"Run `./refresh-session.sh` to fix."
        },
        timeout=10,  # Do not let a slow webhook hang the request
    )


def run_scrape():
    """Main entry point: scrape, load, alert on failure."""
    try:
        with sync_playwright() as p:
            browser, context = create_authenticated_context(p)
            rows = scrape_usage(context)
            load_to_bigquery(rows)
            browser.close()
            return {"success": True, "rows": len(rows)}
    except SessionExpiredError as e:
        send_slack_alert(str(e))
        return {"success": False, "error": "session_expired"}
    except Exception as e:
        send_slack_alert(f"Unexpected error: {e}")
        raise
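Reacting to the login redirect catches expiry only after a scrape has already failed. As a complement, the stored state itself can hint at when that will happen: in Playwright's storage state each cookie has an `expires` field holding a Unix timestamp in seconds, with -1 for session cookies. The sketch below is an illustration only; `days_until_first_expiry` is a hypothetical helper, and the earliest-expiring cookie is not necessarily the one the BI tool uses for its session. The server can also invalidate a session earlier than any cookie says.

```python
import time
from typing import Optional


def days_until_first_expiry(state: dict, now: Optional[float] = None) -> Optional[float]:
    """Days until the earliest persistent cookie expires, or None if there is none.

    Hypothetical helper: `state` is a Playwright storage-state dict; `now` is a
    Unix timestamp in seconds and defaults to the current time.
    """
    now = time.time() if now is None else now
    expiries = [
        cookie["expires"]
        for cookie in state.get("cookies", [])
        if cookie.get("expires", -1) > 0  # -1 marks a session cookie
    ]
    if not expiries:
        return None
    return (min(expiries) - now) / 86400
```

A scrape run could call this on the state it just downloaded and post a Slack reminder when the result drops below a few days.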
Cloud Run Configuration
Running Playwright on Cloud Run requires some specific configuration. Chromium is memory-hungry even in headless mode, so both the image and the deployment need some care:
# Dockerfile
FROM python:3.12-slim

# Install Chromium dependencies
RUN apt-get update && apt-get install -y \
    libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 \
    libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
    libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 \
    libcairo2 libasound2 libatspi2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install chromium

COPY . .
CMD ["gunicorn", "main:app", "-k", "uvicorn.workers.UvicornWorker", \
     "--bind", "0.0.0.0:8080", "--timeout", "120"]

# Deploy with enough memory for Chromium
gcloud run deploy bi-usage-scraper \
  --project=my-gcp-project \
  --region=europe-west1 \
  --image=europe-west1-docker.pkg.dev/my-gcp-project/containers/bi-usage-scraper \
  --memory=1Gi \
  --cpu=1 \
  --timeout=120 \
  --min-instances=0 \
  --max-instances=1 \
  --no-allow-unauthenticated
Key settings:
- 1 GiB memory: Chromium needs at least 512 MiB; 1 GiB gives comfortable headroom
- 120-second timeout: the page can take 20-30 seconds to fully render with all usage data
- max-instances=1: we never need more than one concurrent scrape, and this prevents accidental parallel runs
- no-allow-unauthenticated: the scraper is only triggered by Cloud Scheduler with IAM auth
Cloud Scheduler Setup
Two scheduler jobs: a daily scrape for data collection (no Slack notification), and a weekly one that also sends a Slack summary:
# Daily scrape (weekdays 11:00 Amsterdam time, after the BI tool's own usage data refreshes)
gcloud scheduler jobs create http bi-usage-daily \
  --project=my-gcp-project \
  --location=europe-west1 \
  --schedule="0 11 * * 1-5" \
  --time-zone="Europe/Amsterdam" \
  --uri="https://bi-usage-scraper-xxxxx.run.app/trigger/scrape" \
  --http-method=POST \
  --oidc-service-account-email="[email protected]"

# Weekly Slack summary (Monday 11:05 Amsterdam time)
gcloud scheduler jobs create http bi-usage-weekly \
  --project=my-gcp-project \
  --location=europe-west1 \
  --schedule="5 11 * * 1" \
  --time-zone="Europe/Amsterdam" \
  --uri="https://bi-usage-scraper-xxxxx.run.app/trigger/scrape?slack=true" \
  --http-method=POST \
  --oidc-service-account-email="[email protected]"
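The article does not show how the weekly summary triggered by `?slack=true` is built, so the following is only one possible shape for it. `build_weekly_summary` and its `top_n` parameter are hypothetical, and the sketch assumes it receives the rows of a single scrape in the format produced by the scraper, with `view_count` being additive across viewers of the same dashboard.

```python
from collections import defaultdict


def build_weekly_summary(rows: list[dict], top_n: int = 5) -> str:
    """Format a short Slack message listing the most-viewed dashboards.

    Hypothetical helper: `rows` are dicts with dashboard_name, viewer_email
    and view_count keys, all taken from one scrape run.
    """
    views = defaultdict(int)
    viewers = defaultdict(set)
    for row in rows:
        views[row["dashboard_name"]] += row["view_count"]
        viewers[row["dashboard_name"]].add(row["viewer_email"])

    # Sort by total views (descending), then by name for a stable order
    ranked = sorted(views.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

    lines = [f"*BI usage summary*: {len(views)} dashboards, {sum(views.values())} views"]
    for name, count in ranked:
        lines.append(f"- {name}: {count} views, {len(viewers[name])} viewers")
    return "\n".join(lines)
```

The resulting string can be posted to the same Slack webhook as the alerts, as the `text` field of the JSON payload.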
What I Learned
- Playwright’s storage_state is the right abstraction for session persistence. It captures cookies AND localStorage in a single JSON blob. This is much more reliable than manually extracting and replaying individual cookies.
- Store browser state in GCS, not baked into the Docker image. Session cookies expire. If you bake them into the image, you need to rebuild and redeploy every time they expire. With GCS, you just run the refresh script and the next scrape picks up the new cookies automatically.
- Cloud Run + Playwright works, but needs memory. Chromium is hungry. Do not try to run it on a 256 MiB container. 1 GiB is the sweet spot for single-page scraping.
- Always detect session expiry explicitly. Do not let the scraper silently fail. Check the URL after navigation — if you ended up on a login page, the session is dead. Alert immediately so you can refresh within hours, not days.
- The manual login step is a feature, not a bug. Yes, it would be nice to fully automate the login. But SSO with MFA exists for a reason, and a manual login once every few weeks is a small price for not storing long-lived credentials or trying to automate MFA bypass.
The scraper has been running for about two months now. Session cookies last approximately 3 weeks before expiring, so I run the refresh script twice a month. The data it collects has already helped us identify 4 dashboards that nobody had viewed in over 60 days — all of which we subsequently deprecated and removed.
