What We’re Building
You have a website. You want to know if search engines can actually read it. Instead of paying for yet another SaaS tool, let’s build our own SEO auditor in Python. Under 200 lines, it’ll check a URL for the most impactful SEO issues and spit out a score.
We’ll use httpx for fetching pages (with HTTP/2 support), BeautifulSoup for parsing HTML, and plain Python for everything else. No frameworks, no databases, just a script you can run from your terminal.
Setup
Install the three packages we need:
pip install httpx beautifulsoup4 lxml
httpx is a modern HTTP client that supports HTTP/2 and async. lxml is the fast HTML parser that BeautifulSoup uses under the hood. You could use the built-in html.parser, but lxml handles malformed HTML better, and malformed HTML is most of the web.
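To see that tolerance in practice, here's a quick sketch feeding deliberately broken HTML (unclosed tags, missing end tags) to BeautifulSoup with the lxml parser. lxml repairs the tag soup into a usable tree:

```python
from bs4 import BeautifulSoup

# Malformed HTML: unclosed <b>, implicit </p>, missing </head>, </body>, </html>
broken = "<html><head><title>Hi</title><body><p>one<p>two<b>bold"

soup = BeautifulSoup(broken, "lxml")
print(soup.title.get_text())    # the <title> survives intact: Hi
print(len(soup.find_all("p")))  # both paragraphs are recovered: 2
```

The auto-closing of `<p>` tags here mirrors what browsers do, which is exactly the behavior you want when auditing real-world pages.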
Fetching a Page
First, let’s grab a page and handle the common failure modes:
import httpx
from bs4 import BeautifulSoup

def fetch_page(url: str) -> tuple[str, str]:
    """Fetch a URL and return (html, final_url) after redirects."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (compatible; SEOAuditor/1.0; "
            "+https://github.com/yourname/seo-auditor)"
        ),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    with httpx.Client(http2=True, follow_redirects=True, timeout=15) as client:
        resp = client.get(url, headers=headers)
        resp.raise_for_status()
        return resp.text, str(resp.url)
A few things worth noting. We set a descriptive User-Agent so site owners know what's hitting them. We enable HTTP/2 so the client negotiates it wherever the server supports it, which is what real browsers do. And follow_redirects=True handles the typical HTTP-to-HTTPS and www-to-non-www chains.
Checking HTML Foundations
The basics matter most. A missing <title> tag or absent meta description is like leaving your shop sign blank. Here’s how to check them:
import json
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    passed: bool
    score: float  # 0.0 to 1.0
    weight: int
    details: str
def check_html_foundations(soup: BeautifulSoup, url: str) -> list[Check]:
    checks = []

    # Title tag
    title_tag = soup.find("title")
    title_text = title_tag.get_text(strip=True) if title_tag else ""
    checks.append(Check(
        name="title_present",
        passed=bool(title_text),
        score=1.0 if title_text else 0.0,
        weight=3,
        details=f"Title: '{title_text}'" if title_text else "No <title> tag found",
    ))

    # Title length (under 60 chars is ideal)
    if title_text:
        length = len(title_text)
        score = 1.0 if length <= 60 else max(0.0, 1.0 - (length - 60) / 30)
        checks.append(Check(
            name="title_length",
            passed=length <= 60,
            score=round(score, 2),
            weight=2,
            details=f"{length} characters (aim for under 60)",
        ))

    # Meta description
    meta_desc = soup.find("meta", attrs={"name": "description"})
    desc_text = meta_desc["content"].strip() if meta_desc and meta_desc.get("content") else ""
    checks.append(Check(
        name="meta_description",
        passed=bool(desc_text),
        score=1.0 if desc_text else 0.0,
        weight=3,
        details=f"Found ({len(desc_text)} chars)" if desc_text else "Missing",
    ))

    # Canonical URL (use .get in case the link tag has no href)
    canonical = soup.find("link", rel="canonical")
    checks.append(Check(
        name="canonical",
        passed=canonical is not None,
        score=1.0 if canonical else 0.0,
        weight=3,
        details=canonical.get("href", "No href") if canonical else "No canonical link",
    ))

    # Viewport meta
    viewport = soup.find("meta", attrs={"name": "viewport"})
    checks.append(Check(
        name="viewport",
        passed=viewport is not None,
        score=1.0 if viewport else 0.0,
        weight=2,
        details="Present" if viewport else "Missing (bad for mobile)",
    ))

    return checks
Each check gets a weight reflecting how much it matters. A missing title (weight 3) hurts more than a missing favicon (weight 1). The score field allows partial credit: a title that's 65 characters isn't great, but it's not as bad as having no title at all.
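To make the partial-credit idea concrete, here's the arithmetic on two checks (the same Check dataclass as above, redefined so the snippet runs on its own):

```python
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    passed: bool
    score: float
    weight: int
    details: str

checks = [
    Check("title_present", True, 1.0, 3, "found"),
    Check("title_length", False, 0.83, 2, "65 chars"),  # 65 chars -> 1.0 - 5/30
]
earned = sum(c.score * c.weight for c in checks)  # 1.0*3 + 0.83*2 = 4.66
total = sum(c.weight for c in checks)             # 3 + 2 = 5
print(round(earned / total * 100, 1))             # 93.2
```

The slightly-too-long title costs a few points instead of the full weight of the check, which matches how search engines actually treat it: a soft penalty, not a hard failure.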
Checking Open Graph Tags
When someone shares your link on Twitter, LinkedIn, or WhatsApp, Open Graph tags control what shows up in the preview card. Missing them means your link shows up as a bare URL with no image. That kills click-through rates.
def check_open_graph(soup: BeautifulSoup) -> list[Check]:
    checks = []
    og_tags = {
        "og:title": 2,
        "og:description": 2,
        "og:image": 3,
        "og:url": 1,
        "og:type": 1,
    }
    for tag_name, weight in og_tags.items():
        meta = soup.find("meta", property=tag_name)
        content = meta["content"].strip() if meta and meta.get("content") else ""
        checks.append(Check(
            name=tag_name.replace(":", "_"),
            passed=bool(content),
            score=1.0 if content else 0.0,
            weight=weight,
            details=content[:80] if content else f"Missing {tag_name}",
        ))

    # Check og:image is not SVG (social platforms don't render SVGs)
    og_img = soup.find("meta", property="og:image")
    if og_img and og_img.get("content", "").strip():
        img_url = og_img["content"].strip().lower()
        is_raster = not img_url.endswith(".svg")
        checks.append(Check(
            name="og_image_format",
            passed=is_raster,
            score=1.0 if is_raster else 0.0,
            weight=2,
            details="Raster image (good)" if is_raster else "SVG detected (won't render on social)",
        ))
    return checks
The og:image format check is one people miss. SVG images look great on your site but Twitter and Facebook won't render them. You need a raster format like PNG or JPEG, ideally 1200x630 pixels.
Parsing JSON-LD Structured Data
Search engines use JSON-LD to understand what your page is about. It's how you get those rich results in Google: star ratings, FAQ dropdowns, product prices. Here's how to extract and validate it:
def check_structured_data(soup: BeautifulSoup) -> list[Check]:
    checks = []

    # Extract all JSON-LD blocks
    scripts = soup.find_all("script", type="application/ld+json")
    schemas = []
    for script in scripts:
        try:
            data = json.loads(script.string)
            if isinstance(data, list):
                schemas.extend(data)
            else:
                schemas.append(data)
        except (json.JSONDecodeError, TypeError):
            continue

    checks.append(Check(
        name="has_json_ld",
        passed=len(schemas) > 0,
        score=1.0 if schemas else 0.0,
        weight=3,
        details=f"{len(schemas)} JSON-LD block(s) found",
    ))

    # Collect every @type declared across the blocks
    types = set()
    for s in schemas:
        if not isinstance(s, dict):  # skip stray non-object entries
            continue
        t = s.get("@type", "")
        if isinstance(t, list):
            types.update(t)
        else:
            types.add(t)

    for schema_type, weight in [("WebSite", 2), ("Organization", 2)]:
        found = schema_type in types
        checks.append(Check(
            name=f"{schema_type.lower()}_schema",
            passed=found,
            score=1.0 if found else 0.0,
            weight=weight,
            details=f"{schema_type} schema present" if found else f"No {schema_type} schema",
        ))
    return checks
We parse every <script type="application/ld+json"> block and look for the types that matter most: WebSite tells Google this is a website (obvious, but important for sitelinks), and Organization feeds the knowledge panel.
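Here's what a minimal passing page looks like, parsed the same way the checker does (the schema content is illustrative):

```python
import json
from bs4 import BeautifulSoup

html = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebSite",
 "name": "Example", "url": "https://example.com/"}
</script>
"""
soup = BeautifulSoup(html, "lxml")
blocks = soup.find_all("script", type="application/ld+json")
data = json.loads(blocks[0].string)
print(data["@type"])  # WebSite
```

One wrinkle to be aware of: some sites nest everything under a top-level "@graph" key instead of emitting separate blocks, so a production checker would want to unwrap that too.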
Checking robots.txt
A missing robots.txt isn't the end of the world, but it's a signal that nobody's thought about crawling. And if you have one, it should point to your sitemap:
import re
from urllib.parse import urlparse

def check_robots_txt(base_url: str) -> list[Check]:
    checks = []

    # Derive the base (scheme + host)
    parsed = urlparse(base_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        with httpx.Client(timeout=10) as client:
            resp = client.get(robots_url)
        exists = resp.status_code == 200
        content = resp.text if exists else ""
    except httpx.RequestError:
        exists = False
        content = ""

    checks.append(Check(
        name="robots_txt",
        passed=exists,
        score=1.0 if exists else 0.0,
        weight=3,
        details="Accessible" if exists else f"Not found at {robots_url}",
    ))

    if exists:
        has_sitemap = bool(re.search(r"^Sitemap:\s*https?://", content, re.MULTILINE | re.IGNORECASE))
        checks.append(Check(
            name="sitemap_in_robots",
            passed=has_sitemap,
            score=1.0 if has_sitemap else 0.0,
            weight=2,
            details="Sitemap directive found" if has_sitemap else "No Sitemap directive",
        ))
    return checks
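To sanity-check the Sitemap pattern, run it against a typical robots.txt. The MULTILINE flag makes ^ match at each line start, since the Sitemap directive usually appears after the User-agent rules:

```python
import re

robots = """User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

pattern = re.compile(r"^Sitemap:\s*https?://", re.MULTILINE | re.IGNORECASE)
print(bool(pattern.search(robots)))           # True
print(bool(pattern.search("User-agent: *")))  # False
```

The IGNORECASE flag matters too: the directive is conventionally spelled "Sitemap:", but "sitemap:" is equally valid and common in the wild.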
Building the Scoring System
Each check has a weight and a score (0.0 to 1.0). The overall score is a weighted average, mapped to a letter grade:
def calculate_score(all_checks: list[Check]) -> tuple[float, str]:
    total_weight = sum(c.weight for c in all_checks)
    earned = sum(c.score * c.weight for c in all_checks)
    if total_weight == 0:
        return 0.0, "F"
    score = round((earned / total_weight) * 100, 1)
    if score >= 95:
        grade = "A+"
    elif score >= 90:
        grade = "A"
    elif score >= 80:
        grade = "B"
    elif score >= 70:
        grade = "C"
    elif score >= 60:
        grade = "D"
    else:
        grade = "F"
    return score, grade
Weighted scoring is better than a simple pass/fail count. A site that nails all the high-weight checks (title, meta description, OG image) but misses a favicon still gets a solid B. A site missing the basics gets an F regardless of how many minor boxes it ticks.
Putting It All Together
Now wire everything into a CLI you can run against any URL:
import sys

def audit(url: str) -> dict:
    html, final_url = fetch_page(url)
    soup = BeautifulSoup(html, "lxml")
    all_checks = []
    all_checks.extend(check_html_foundations(soup, final_url))
    all_checks.extend(check_open_graph(soup))
    all_checks.extend(check_structured_data(soup))
    all_checks.extend(check_robots_txt(final_url))
    score, grade = calculate_score(all_checks)
    passed = [c for c in all_checks if c.passed]
    failed = [c for c in all_checks if not c.passed]
    return {
        "url": url,
        "final_url": final_url,
        "score": score,
        "grade": grade,
        "passed": len(passed),
        "failed": len(failed),
        "checks": all_checks,
    }

def print_report(result: dict):
    print(f"\n SEO Audit: {result['url']}")
    if result["url"] != result["final_url"]:
        print(f" Redirected to: {result['final_url']}")
    print(f" Score: {result['score']}/100 ({result['grade']})")
    print(f" Passed: {result['passed']} | Failed: {result['failed']}")
    print()
    for check in result["checks"]:
        icon = "PASS" if check.passed else "FAIL"
        print(f" [{icon}] {check.name} (w={check.weight}): {check.details}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python seo_auditor.py <url>")
        sys.exit(1)
    result = audit(sys.argv[1])
    print_report(result)
Run it:
python seo_auditor.py https://example.com
You'll get output like:
SEO Audit: https://example.com
Score: 42.3/100 (F)
Passed: 6 | Failed: 12
[PASS] title_present (w=3): Title: 'Example Domain'
[PASS] title_length (w=2): 14 characters (aim for under 60)
[FAIL] meta_description (w=3): Missing
[FAIL] canonical (w=3): No canonical link
[PASS] viewport (w=2): Present
[FAIL] og_title (w=2): Missing og:title
...
Where to Go From Here
This is roughly 150 lines and covers the checks that matter most. There's plenty you can add:
- Image alt text audit: loop through all <img> tags and flag missing alt attributes
- Heading hierarchy: check that H1 through H6 don't skip levels (H1 followed by H3 is bad)
- Lazy loading: verify below-fold images have loading="lazy"
- llms.txt: the new /llms.txt file that helps AI models understand your site
- PDF report generation: use fpdf2 to turn the results into a branded report
- Subpage checks: scan /login and /signup for noindex tags
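As a taste of what an extension looks like, here's a sketch of the heading-hierarchy check from the list above (the function name and return shape are mine, not part of the script):

```python
from bs4 import BeautifulSoup

def heading_level_skips(soup: BeautifulSoup) -> list[str]:
    """Describe each place the heading level jumps by more than one."""
    # Heading levels in document order: <h2> -> 2, etc.
    levels = [int(tag.name[1]) for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])]
    skips = []
    for prev, cur in zip(levels, levels[1:]):
        if cur - prev > 1:  # e.g. an <h1> followed directly by an <h3>
            skips.append(f"h{prev} -> h{cur}")
    return skips

html = "<h1>Title</h1><h3>Oops</h3><h4>Fine</h4>"
print(heading_level_skips(BeautifulSoup(html, "lxml")))  # ['h1 -> h3']
```

Moving back up the hierarchy (h4 back to h2) is fine; only downward jumps that skip a level are flagged.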
If you don't want to build all of that yourself, SEOFlash does exactly this. It runs 50+ checks across seven categories, gives you a score with a letter grade, and generates a downloadable PDF report. It's free to use, no account needed.
But the point of this tutorial is that SEO auditing isn't magic. It's HTML parsing with opinions about what good HTML looks like. And Python makes it surprisingly simple.