What We’re Building
You have a website. You want to know if search engines can actually read it. Instead of paying for yet another SaaS tool, let’s build our own SEO auditor in Python. Under 200 lines, it’ll check a URL for the most impactful SEO issues and spit out a score.
We’ll use httpx for fetching pages (with HTTP/2 support), BeautifulSoup for parsing HTML, and plain Python for everything else. No frameworks, no databases, just a script you can run from your terminal.
Setup
Install the three packages we need:
pip install httpx beautifulsoup4 lxml
httpx is a modern HTTP client that supports HTTP/2 and async. lxml is the fast HTML parser that BeautifulSoup uses under the hood. You could use the built-in html.parser, but lxml handles malformed HTML better, and malformed HTML is most of the web.
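To see that tolerance in practice, here's a quick sketch feeding deliberately broken HTML (unclosed tags, missing end tags) to BeautifulSoup with the lxml parser. lxml repairs the tag soup into a usable tree:

```python
from bs4 import BeautifulSoup

# Malformed HTML: unclosed <b>, implicit </p>, missing </head>, </body>, </html>
broken = "<html><head><title>Hi</title><body><p>one<p>two<b>bold"

soup = BeautifulSoup(broken, "lxml")
print(soup.title.get_text())    # the <title> survives intact: Hi
print(len(soup.find_all("p")))  # both paragraphs are recovered: 2
```

The auto-closing of `<p>` tags here mirrors what browsers do, which is exactly the behavior you want when auditing real-world pages.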
Fetching a Page
First, let’s grab a page and handle the common failure modes:
import httpx
from bs4 import BeautifulSoup

def fetch_page(url: str) -> tuple[str, str]:
    """Fetch a URL and return (html, final_url) after redirects."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (compatible; SEOAuditor/1.0; "
            "+https://github.com/yourname/seo-auditor)"
        ),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    with httpx.Client(http2=True, follow_redirects=True, timeout=15) as client:
        resp = client.get(url, headers=headers)
        resp.raise_for_status()
        return resp.text, str(resp.url)
A few things worth noting. We set a descriptive User-Agent so site owners know what's hitting them. We enable HTTP/2 so the client negotiates it wherever the server supports it, which is what real browsers do. And follow_redirects=True handles the typical HTTP-to-HTTPS and www-to-non-www chains.
Checking HTML Foundations
The basics matter most. A missing <title> tag or absent meta description is like leaving your shop sign blank. Here’s how to check them:
import json
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    passed: bool
    score: float  # 0.0 to 1.0
    weight: int
    details: str
def check_html_foundations(soup: BeautifulSoup, url: str) -> list[Check]:
    checks = []

    # Title tag
    title_tag = soup.find("title")
    title_text = title_tag.get_text(strip=True) if title_tag else ""
    checks.append(Check(
        name="title_present",
        passed=bool(title_text),
        score=1.0 if title_text else 0.0,
        weight=3,
        details=f"Title: '{title_text}'" if title_text else "No <title> tag found",
    ))

    # Title length (under 60 chars is ideal)
    if title_text:
        length = len(title_text)
        score = 1.0 if length <= 60 else max(0.0, 1.0 - (length - 60) / 30)
        checks.append(Check(
            name="title_length",
            passed=length <= 60,
            score=round(score, 2),
            weight=2,
            details=f"{length} characters (aim for under 60)",
        ))

    # Meta description
    meta_desc = soup.find("meta", attrs={"name": "description"})
    desc_text = meta_desc["content"].strip() if meta_desc and meta_desc.get("content") else ""
    checks.append(Check(
        name="meta_description",
        passed=bool(desc_text),
        score=1.0 if desc_text else 0.0,
        weight=3,
        details=f"Found ({len(desc_text)} chars)" if desc_text else "Missing",
    ))

    # Canonical URL (use .get in case the link tag has no href)
    canonical = soup.find("link", rel="canonical")
    checks.append(Check(
        name="canonical",
        passed=canonical is not None,
        score=1.0 if canonical else 0.0,
        weight=3,
        details=canonical.get("href", "No href") if canonical else "No canonical link",
    ))

    # Viewport meta
    viewport = soup.find("meta", attrs={"name": "viewport"})
    checks.append(Check(
        name="viewport",
        passed=viewport is not None,
        score=1.0 if viewport else 0.0,
        weight=2,
        details="Present" if viewport else "Missing (bad for mobile)",
    ))

    return checks
Each check gets a weight reflecting how much it matters. A missing title (weight 3) hurts more than a missing favicon (weight 1). The score field allows partial credit: a title that's 65 characters isn't great, but it's not as bad as having no title at all.
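To make the partial-credit idea concrete, here's the arithmetic on two checks (the same Check dataclass as above, redefined so the snippet runs on its own):

```python
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    passed: bool
    score: float
    weight: int
    details: str

checks = [
    Check("title_present", True, 1.0, 3, "found"),
    Check("title_length", False, 0.83, 2, "65 chars"),  # 65 chars -> 1.0 - 5/30
]
earned = sum(c.score * c.weight for c in checks)  # 1.0*3 + 0.83*2 = 4.66
total = sum(c.weight for c in checks)             # 3 + 2 = 5
print(round(earned / total * 100, 1))             # 93.2
```

The slightly-too-long title costs a few points instead of the full weight of the check, which matches how search engines actually treat it: a soft penalty, not a hard failure.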
Checking Open Graph Tags
When someone shares your link on Twitter, LinkedIn, or WhatsApp, Open Graph tags control what shows up in the preview card. Missing them means your link shows up as a bare URL with no image. That kills click-through rates.
def check_open_graph(soup: BeautifulSoup) -> list[Check]:
    checks = []
    og_tags = {
        "og:title": 2,
        "og:description": 2,
        "og:image": 3,
        "og:url": 1,
        "og:type": 1,
    }
    for tag_name, weight in og_tags.items():
        meta = soup.find("meta", property=tag_name)
        content = meta["content"].strip() if meta and meta.get("content") else ""
        checks.append(Check(
            name=tag_name.replace(":", "_"),
            passed=bool(content),
            score=1.0 if content else 0.0,
            weight=weight,
            details=content[:80] if content else f"Missing {tag_name}",
        ))

    # Check og:image is not SVG (social platforms don't render SVGs)
    og_img = soup.find("meta", property="og:image")
    if og_img and og_img.get("content", "").strip():
        img_url = og_img["content"].strip().lower()
        is_raster = not img_url.endswith(".svg")
        checks.append(Check(
            name="og_image_format",
            passed=is_raster,
            score=1.0 if is_raster else 0.0,
            weight=2,
            details="Raster image (good)" if is_raster else "SVG detected (won't render on social)",
        ))
    return checks
The og:image format check is one people miss. SVG images look great on your site but Twitter and Facebook won't render them. You need a raster format like PNG or JPEG, ideally 1200x630 pixels.
Parsing JSON-LD Structured Data
Search engines use JSON-LD to understand what your page is about. It's how you get those rich results in Google: star ratings, FAQ dropdowns, product prices. Here's how to extract and validate it:
def check_structured_data(soup: BeautifulSoup) -> list[Check]:
    checks = []

    # Extract all JSON-LD blocks
    scripts = soup.find_all("script", type="application/ld+json")
    schemas = []
    for script in scripts:
        try:
            data = json.loads(script.string)
            if isinstance(data, list):
                schemas.extend(data)
            else:
                schemas.append(data)
        except (json.JSONDecodeError, TypeError):
            continue

    checks.append(Check(
        name="has_json_ld",
        passed=len(schemas) > 0,
        score=1.0 if schemas else 0.0,
        weight=3,
        details=f"{len(schemas)} JSON-LD block(s) found",
    ))

    # Collect every @type declared across the blocks
    types = set()
    for s in schemas:
        if not isinstance(s, dict):  # skip stray non-object entries
            continue
        t = s.get("@type", "")
        if isinstance(t, list):
            types.update(t)
        else:
            types.add(t)

    for schema_type, weight in [("WebSite", 2), ("Organization", 2)]:
        found = schema_type in types
        checks.append(Check(
            name=f"{schema_type.lower()}_schema",
            passed=found,
            score=1.0 if found else 0.0,
            weight=weight,
            details=f"{schema_type} schema present" if found else f"No {schema_type} schema",
        ))
    return checks
We parse every <script type="application/ld+json"> block and look for the types that matter most: WebSite tells Google this is a website (obvious, but important for sitelinks), and Organization feeds the knowledge panel.
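Here's what a minimal passing page looks like, parsed the same way the checker does (the schema content is illustrative):

```python
import json
from bs4 import BeautifulSoup

html = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebSite",
 "name": "Example", "url": "https://example.com/"}
</script>
"""
soup = BeautifulSoup(html, "lxml")
blocks = soup.find_all("script", type="application/ld+json")
data = json.loads(blocks[0].string)
print(data["@type"])  # WebSite
```

One wrinkle to be aware of: some sites nest everything under a top-level "@graph" key instead of emitting separate blocks, so a production checker would want to unwrap that too.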
Checking robots.txt
A missing robots.txt isn't the end of the world, but it's a signal that nobody's thought about crawling. And if you have one, it should point to your sitemap:
import re
from urllib.parse import urlparse

def check_robots_txt(base_url: str) -> list[Check]:
    checks = []

    # Derive the base (scheme + host)
    parsed = urlparse(base_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        with httpx.Client(timeout=10) as client:
            resp = client.get(robots_url)
        exists = resp.status_code == 200
        content = resp.text if exists else ""
    except httpx.RequestError:
        exists = False
        content = ""

    checks.append(Check(
        name="robots_txt",
        passed=exists,
        score=1.0 if exists else 0.0,
        weight=3,
        details="Accessible" if exists else f"Not found at {robots_url}",
    ))

    if exists:
        has_sitemap = bool(re.search(r"^Sitemap:\s*https?://", content, re.MULTILINE | re.IGNORECASE))
        checks.append(Check(
            name="sitemap_in_robots",
            passed=has_sitemap,
            score=1.0 if has_sitemap else 0.0,
            weight=2,
            details="Sitemap directive found" if has_sitemap else "No Sitemap directive",
        ))
    return checks
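To sanity-check the Sitemap pattern, run it against a typical robots.txt. The MULTILINE flag makes ^ match at each line start, since the Sitemap directive usually appears after the User-agent rules:

```python
import re

robots = """User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

pattern = re.compile(r"^Sitemap:\s*https?://", re.MULTILINE | re.IGNORECASE)
print(bool(pattern.search(robots)))           # True
print(bool(pattern.search("User-agent: *")))  # False
```

The IGNORECASE flag matters too: the directive is conventionally spelled "Sitemap:", but "sitemap:" is equally valid and common in the wild.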
Building the Scoring System
Each check has a weight and a score (0.0 to 1.0). The overall score is a weighted average, mapped to a letter grade:
def calculate_score(all_checks: list[Check]) -> tuple[float, str]:
    total_weight = sum(c.weight for c in all_checks)
    earned = sum(c.score * c.weight for c in all_checks)
    if total_weight == 0:
        return 0.0, "F"
    score = round((earned / total_weight) * 100, 1)
    if score >= 95:
        grade = "A+"
    elif score >= 90:
        grade = "A"
    elif score >= 80:
        grade = "B"
    elif score >= 70:
        grade = "C"
    elif score >= 60:
        grade = "D"
    else:
        grade = "F"
    return score, grade
Weighted scoring is better than a simple pass/fail count. A site that nails all the high-weight checks (title, meta description, OG image) but misses a favicon still gets a solid B. A site missing the basics gets an F regardless of how many minor boxes it ticks.
Putting It All Together
Now wire everything into a CLI you can run against any URL:
import sys

def audit(url: str) -> dict:
    html, final_url = fetch_page(url)
    soup = BeautifulSoup(html, "lxml")
    all_checks = []
    all_checks.extend(check_html_foundations(soup, final_url))
    all_checks.extend(check_open_graph(soup))
    all_checks.extend(check_structured_data(soup))
    all_checks.extend(check_robots_txt(final_url))
    score, grade = calculate_score(all_checks)
    passed = [c for c in all_checks if c.passed]
    failed = [c for c in all_checks if not c.passed]
    return {
        "url": url,
        "final_url": final_url,
        "score": score,
        "grade": grade,
        "passed": len(passed),
        "failed": len(failed),
        "checks": all_checks,
    }

def print_report(result: dict):
    print(f"\n SEO Audit: {result['url']}")
    if result["url"] != result["final_url"]:
        print(f" Redirected to: {result['final_url']}")
    print(f" Score: {result['score']}/100 ({result['grade']})")
    print(f" Passed: {result['passed']} | Failed: {result['failed']}")
    print()
    for check in result["checks"]:
        icon = "PASS" if check.passed else "FAIL"
        print(f" [{icon}] {check.name} (w={check.weight}): {check.details}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python seo_auditor.py <url>")
        sys.exit(1)
    result = audit(sys.argv[1])
    print_report(result)
Run it:
python seo_auditor.py https://example.com
You'll get output like:
SEO Audit: https://example.com
Score: 42.3/100 (F)
Passed: 6 | Failed: 12
[PASS] title_present (w=3): Title: 'Example Domain'
[PASS] title_length (w=2): 14 characters (aim for under 60)
[FAIL] meta_description (w=3): Missing
[FAIL] canonical (w=3): No canonical link
[PASS] viewport (w=2): Present
[FAIL] og_title (w=2): Missing og:title
...
Where to Go From Here
This is roughly 150 lines and covers the checks that matter most. There's plenty you can add:
- Image alt text audit: loop through all <img> tags and flag missing alt attributes
- Heading hierarchy: check that H1 through H6 don't skip levels (H1 followed by H3 is bad)
- Lazy loading: verify below-fold images have loading="lazy"
- llms.txt: the new /llms.txt file that helps AI models understand your site
- PDF report generation: use fpdf2 to turn the results into a branded report
- Subpage checks: scan /login and /signup for noindex tags
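As a taste of what an extension looks like, here's a sketch of the heading-hierarchy check from the list above (the function name and return shape are mine, not part of the script):

```python
from bs4 import BeautifulSoup

def heading_level_skips(soup: BeautifulSoup) -> list[str]:
    """Describe each place the heading level jumps by more than one."""
    # Heading levels in document order: <h2> -> 2, etc.
    levels = [int(tag.name[1]) for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])]
    skips = []
    for prev, cur in zip(levels, levels[1:]):
        if cur - prev > 1:  # e.g. an <h1> followed directly by an <h3>
            skips.append(f"h{prev} -> h{cur}")
    return skips

html = "<h1>Title</h1><h3>Oops</h3><h4>Fine</h4>"
print(heading_level_skips(BeautifulSoup(html, "lxml")))  # ['h1 -> h3']
```

Moving back up the hierarchy (h4 back to h2) is fine; only downward jumps that skip a level are flagged.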
If you don't want to build all of that yourself, SEOFlash does exactly this. It runs 50+ checks across seven categories, gives you a score with a letter grade, and generates a downloadable PDF report. It's free to use, no account needed.
But the point of this tutorial is that SEO auditing isn't magic. It's HTML parsing with opinions about what good HTML looks like. And Python makes it surprisingly simple.