The most reliable way to understand a website's SEO health is to crawl it like a bot does. Commercial tools handle this well — but their price tags can be steep for small teams. What if you built your own?
In this guide, you will learn:
- How to send HTTP requests and interpret status codes with Python
- How to extract SEO data from HTML using BeautifulSoup
- How to analyze meta tags (robots, canonical, Open Graph, hreflang)
- How to detect broken links and measure response times
- How to generate a complete site map and export it as sitemap.xml
- How to save results in CSV and JSON formats
- How to build an ethical crawler that respects robots.txt
Ready? Open your terminal and follow along.
Requirements and Setup
Three libraries are all you need. Run this in your terminal:
pip install requests beautifulsoup4 lxml
For a proper requirements.txt:
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
Why lxml? Python's built-in HTML parser (html.parser) works for most pages. But lxml handles malformed HTML more gracefully and typically parses large pages 2-5x faster. In production, the difference is noticeable.
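If you are not sure lxml will be installed in every environment the crawler runs in, a small fallback keeps things working either way. This is my own convention, not something BeautifulSoup requires:

```python
# Pick the fastest available parser, falling back to the stdlib one.
# Pass the result as BeautifulSoup's second argument, e.g.
# BeautifulSoup(html, PARSER).
try:
    import lxml  # noqa: F401  # only probing availability
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

print(PARSER)
```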
Step 1: Fetching Pages with HTTP Requests
Every crawler starts with HTTP requests. You send a GET request to a URL and inspect the response.
import requests
from typing import Optional
def fetch_page(url: str, timeout: int = 10) -> Optional[requests.Response]:
"""Send a GET request to the given URL.
Args:
url: The page address to fetch.
timeout: Connection timeout in seconds.
Returns:
Response object on success, None on failure.
"""
headers = {
"User-Agent": "MiniSEOCrawler/1.0 (+https://yoursite.com/bot)"
}
try:
response = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
return response
except requests.RequestException as e:
print(f"[ERROR] {url}: {e}")
return None
This function does three things: sends a custom User-Agent header, enforces a timeout, and catches errors gracefully.
Why does User-Agent matter? Most web servers inspect the User-Agent header. If you leave it blank or send the library default python-requests/2.31.0, servers may reject your request (403) or show a CAPTCHA. Specify your bot name and a contact page; this is ethical crawler behavior.
HTTP Status Codes Reference
The status code returned for each page reveals its SEO health:
| Code | Meaning | SEO Impact |
|---|---|---|
| 200 | Success | Page is accessible and indexable |
| 301 | Permanent redirect | Passes link equity, target URL gets indexed |
| 302 | Temporary redirect | Source URL may stay indexed; Google has said long-lived 302s pass link signals like 301s |
| 404 | Not found | Wastes crawl budget, hurts user experience |
| 410 | Permanently removed | Google removes from index — clearer signal than 404 |
| 500 | Server error | Repeated 500s halt indexing |
| 503 | Temporarily unavailable | Maintenance page, acceptable if short-lived |
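The table above folds naturally into a small helper for summary reports. The bucket names here are my own shorthand, not a standard:

```python
def seo_status_category(code: int) -> str:
    """Map an HTTP status code to a rough SEO health bucket."""
    if code == 200:
        return "ok"
    if code in (301, 308):
        return "permanent-redirect"
    if code in (302, 303, 307):
        return "temporary-redirect"
    if code in (404, 410):
        return "gone"
    if 500 <= code < 600:
        return "server-error"
    return "other"

print(seo_status_category(301))  # permanent-redirect
print(seo_status_category(503))  # server-error
```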
Step 2: Extracting SEO Data from HTML
Once you have the page content, parse the HTML with BeautifulSoup. The critical SEO elements: title, meta description, heading tags, and links.
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Dict, List
from urllib.parse import urljoin, urlparse
@dataclass
class SEOData:
"""Holds SEO data for a single page."""
url: str
status_code: int
title: str = ""
meta_description: str = ""
h1_tags: List[str] = field(default_factory=list)
h2_tags: List[str] = field(default_factory=list)
internal_links: List[str] = field(default_factory=list)
external_links: List[str] = field(default_factory=list)
images_without_alt: int = 0
word_count: int = 0
def extract_seo_data(url: str, response: requests.Response, base_domain: str) -> SEOData:
"""Extract SEO data from an HTTP response.
Args:
url: The page URL.
response: requests.Response object.
base_domain: The site's domain (e.g., example.com).
Returns:
SEOData object containing the page's SEO information.
"""
soup = BeautifulSoup(response.text, "lxml")
data = SEOData(url=url, status_code=response.status_code)
# Title tag
title_tag = soup.find("title")
if title_tag:
data.title = title_tag.get_text(strip=True)
# Meta description
meta_desc = soup.find("meta", attrs={"name": "description"})
if meta_desc:
data.meta_description = meta_desc.get("content", "")
# Heading tags
data.h1_tags = [h.get_text(strip=True) for h in soup.find_all("h1")]
data.h2_tags = [h.get_text(strip=True) for h in soup.find_all("h2")]
    # Classify links: resolve relative URLs and compare hostnames.
    # A substring check like `base_domain in href` would wrongly match
    # addresses such as example.com.evil.net, so compare netloc exactly.
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        host = urlparse(urljoin(url, href)).netloc
        if host == base_domain:
            data.internal_links.append(href)
        elif href.startswith("http"):
            data.external_links.append(href)
# Images missing alt text
images = soup.find_all("img")
data.images_without_alt = sum(
1 for img in images if not img.get("alt", "").strip()
)
# Word count
body = soup.find("body")
if body:
data.word_count = len(body.get_text(separator=" ", strip=True).split())
return data
This function extracts 8 different SEO signals from a single page. The SEOData dataclass structures the data for easy export to CSV or JSON later.
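Classifying links correctly hinges on resolving each href against the current page and comparing hostnames rather than substrings. The standard library handles both steps:

```python
from urllib.parse import urljoin, urlparse

base = "https://example.com/blog/post-1"
for href in ["/about", "page-2", "https://other.com/x", "#top"]:
    full = urljoin(base, href).split("#")[0]  # resolve, then drop the fragment
    internal = urlparse(full).netloc == "example.com"
    print(full, "internal" if internal else "external")
```

Note that a relative link like page-2 resolves against the current directory, so the same href means different URLs on different pages.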
Step 3: Analyzing Meta Tags
Beyond title and description, other meta tags are critical for SEO. The robots tag controls indexing, canonical resolves duplicate content, and Open Graph determines how pages appear on social media.
from typing import Any
def extract_meta_tags(soup: BeautifulSoup) -> Dict[str, Any]:
"""Extract SEO-related meta tags from a page.
Args:
soup: BeautifulSoup object.
Returns:
Dictionary mapping meta tag names to their values.
"""
meta = {}
# robots tag
robots_tag = soup.find("meta", attrs={"name": "robots"})
if robots_tag:
meta["robots"] = robots_tag.get("content", "")
# canonical URL
canonical = soup.find("link", attrs={"rel": "canonical"})
if canonical:
meta["canonical"] = canonical.get("href", "")
# Open Graph tags
og_tags = soup.find_all("meta", attrs={"property": lambda x: x and x.startswith("og:")})
for tag in og_tags:
meta[tag["property"]] = tag.get("content", "")
# hreflang tags
hreflangs = soup.find_all("link", attrs={"rel": "alternate", "hreflang": True})
meta["hreflang"] = [
{"lang": tag["hreflang"], "href": tag.get("href", "")}
for tag in hreflangs
]
return meta
Meta Tags Reference Table
| Tag | Example | Purpose |
|---|---|---|
| robots | noindex, nofollow | Controls whether search engines index the page |
| canonical | <link rel="canonical" href="..."> | Resolves duplicate content by specifying the preferred URL |
| og:title | <meta property="og:title" content="..."> | Controls the title shown in social media shares |
| og:description | <meta property="og:description" content="..."> | Sets the description for social media shares |
| og:image | <meta property="og:image" content="..."> | Determines the image displayed in shares |
| hreflang | <link rel="alternate" hreflang="en" href="..."> | Enables language and region targeting for multilingual sites |
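The same Open Graph extraction can be sketched without BeautifulSoup, using the standard library's html.parser. This is a stripped-down illustration of the idea, not a replacement for the function above:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects Open Graph meta tags from a page."""
    def __init__(self) -> None:
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        prop = d.get("property") or ""
        if prop.startswith("og:"):
            self.og[prop] = d.get("content") or ""

collector = MetaCollector()
collector.feed('<head><meta property="og:title" content="Hello"></head>')
print(collector.og)  # {'og:title': 'Hello'}
```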
Step 4: Detecting Broken Links
Broken links damage user experience and waste crawl budget. This function checks all links on a page and measures response times:
import time
from dataclasses import dataclass
@dataclass
class LinkCheckResult:
"""Holds the result of a link check."""
url: str
status_code: int
response_time: float
redirect_chain: List[str]
is_broken: bool
def check_links(urls: List[str], timeout: int = 10) -> List[LinkCheckResult]:
"""Check the accessibility of a list of URLs.
Args:
urls: List of URLs to check.
timeout: Timeout per request in seconds.
Returns:
List of LinkCheckResult for each URL.
"""
results = []
headers = {
"User-Agent": "MiniSEOCrawler/1.0 (+https://yoursite.com/bot)"
}
for url in urls:
start = time.time()
try:
resp = requests.get(
url, headers=headers, timeout=timeout, allow_redirects=True
)
elapsed = time.time() - start
# Extract redirect chain
chain = [r.url for r in resp.history] if resp.history else []
results.append(LinkCheckResult(
url=url,
status_code=resp.status_code,
response_time=round(elapsed, 2),
redirect_chain=chain,
is_broken=resp.status_code >= 400,
))
except requests.RequestException:
elapsed = time.time() - start
results.append(LinkCheckResult(
url=url,
status_code=0,
response_time=round(elapsed, 2),
redirect_chain=[],
is_broken=True,
))
# Pause between requests to avoid overwhelming the server
time.sleep(0.5)
return results
Why add rate limiting? Sending hundreds of requests per second overloads the server. The result: your IP gets banned, the server slows down, or it crashes entirely. Adding time.sleep(0.5) between requests is both ethical and practical. For larger sites, increase this to 1-2 seconds.
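A fixed 0.5-second delay is easy for servers to fingerprint. Adding a little random jitter, a common courtesy tweak rather than anything requests provides, spreads the load more naturally:

```python
import random
import time

def polite_sleep(base: float = 0.5, jitter: float = 0.25) -> float:
    """Sleep for base plus a random extra of up to jitter seconds.

    Returns the delay actually used, which is handy for logging.
    """
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Between requests (tiny values here just for the demo):
waited = polite_sleep(0.05, 0.05)
print(f"waited {waited:.3f}s")
```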
Step 5: Building the Site Map
Time to put all the pieces together. This crawler starts from a seed URL, follows all internal links, and generates a sitemap.xml:
from urllib.parse import urljoin, urlparse
from collections import deque
import xml.etree.ElementTree as ET
from datetime import datetime
def crawl_site(
start_url: str,
max_pages: int = 100,
delay: float = 0.5,
) -> List[SEOData]:
"""Crawl a site starting from the given URL.
Args:
start_url: The seed URL to begin crawling.
max_pages: Maximum number of pages to crawl.
delay: Delay between requests in seconds.
Returns:
List of SEOData for each crawled page.
"""
parsed_start = urlparse(start_url)
base_domain = parsed_start.netloc
visited: set = set()
queue: deque = deque([start_url])
results: List[SEOData] = []
print(f"[START] Crawling {start_url} (max {max_pages} pages)")
while queue and len(visited) < max_pages:
current_url = queue.popleft()
# Normalize
current_url = current_url.split("#")[0] # Strip fragment
if current_url in visited:
continue
response = fetch_page(current_url)
if response is None:
continue
visited.add(current_url)
content_type = response.headers.get("Content-Type", "")
# Only process HTML pages
if "text/html" not in content_type:
continue
seo_data = extract_seo_data(current_url, response, base_domain)
results.append(seo_data)
print(f" [{len(visited)}/{max_pages}] {response.status_code} — {current_url}")
# Add internal links to queue
for link in seo_data.internal_links:
full_url = urljoin(current_url, link)
full_url = full_url.split("#")[0]
parsed = urlparse(full_url)
if parsed.netloc == base_domain and full_url not in visited:
queue.append(full_url)
time.sleep(delay)
print(f"[DONE] Crawled {len(results)} pages.")
return results
def generate_sitemap_xml(pages: List[SEOData], output_path: str = "sitemap.xml") -> None:
"""Generate a sitemap.xml from crawled pages.
Args:
pages: List of SEOData objects.
output_path: Output file path.
"""
urlset = ET.Element("urlset")
urlset.set("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9")
    written = 0
    for page in pages:
        if page.status_code != 200:
            continue
        url_el = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url_el, "loc")
        loc.text = page.url
        lastmod = ET.SubElement(url_el, "lastmod")
        lastmod.text = datetime.now().strftime("%Y-%m-%d")
        written += 1
    tree = ET.ElementTree(urlset)
    ET.indent(tree, space=" ")
    tree.write(output_path, encoding="unicode", xml_declaration=True)
    print(f"[SITEMAP] Generated {output_path} ({written} URLs)")
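To see the XML shape this produces without crawling anything, the same ElementTree calls work on hand-made data:

```python
import xml.etree.ElementTree as ET

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc_text in ["https://example.com/", "https://example.com/about"]:
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = loc_text

ET.indent(urlset, space="  ")  # pretty-print; added in Python 3.9
print(ET.tostring(urlset, encoding="unicode"))
```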
Why should you check robots.txt? Most websites have a robots.txt file at their root. This file specifies which crawlers can access which pages. Ignoring a Disallow: /admin/ directive and crawling /admin/ violates the site owner's rules. In production, use Python's urllib.robotparser module to programmatically respect robots.txt.
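urllib.robotparser is straightforward to use. In production you would point it at the live file with set_url() and read(); here it parses a sample robots.txt inline so the example stays offline:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Production: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines())

print(rp.can_fetch("MiniSEOCrawler/1.0", "https://example.com/admin/login"))  # False
print(rp.can_fetch("MiniSEOCrawler/1.0", "https://example.com/blog/post"))    # True
print(rp.crawl_delay("MiniSEOCrawler/1.0"))  # 2
```

Calling can_fetch() right before each request in crawl_site, and honoring crawl_delay() when it is set, is all it takes to make the crawler compliant.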
Bonus: Exporting Results to CSV and JSON
To analyze your crawl data, export it:
import csv
import json
def export_to_csv(pages: List[SEOData], output_path: str = "seo_report.csv") -> None:
"""Save SEO data in CSV format.
Args:
pages: List of SEOData objects.
output_path: Output file path.
"""
fieldnames = [
"url", "status_code", "title", "meta_description",
"h1_count", "h2_count", "internal_links", "external_links",
"images_without_alt", "word_count",
]
with open(output_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
for page in pages:
writer.writerow({
"url": page.url,
"status_code": page.status_code,
"title": page.title,
"meta_description": page.meta_description,
"h1_count": len(page.h1_tags),
"h2_count": len(page.h2_tags),
"internal_links": len(page.internal_links),
"external_links": len(page.external_links),
"images_without_alt": page.images_without_alt,
"word_count": page.word_count,
})
print(f"[CSV] Saved {output_path} ({len(pages)} rows)")
def export_to_json(pages: List[SEOData], output_path: str = "seo_report.json") -> None:
"""Save SEO data in JSON format.
Args:
pages: List of SEOData objects.
output_path: Output file path.
"""
from dataclasses import asdict
data = [asdict(page) for page in pages]
with open(output_path, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
print(f"[JSON] Saved {output_path} ({len(pages)} records)")
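The asdict call is what makes the JSON export a one-liner: it recursively converts a dataclass, list fields included, into plain dicts. A self-contained round trip with a trimmed stand-in for SEOData:

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class PageStub:  # trimmed stand-in for SEOData
    url: str
    status_code: int
    h1_tags: List[str] = field(default_factory=list)

rows = [PageStub("https://example.com/", 200, ["Welcome"])]
payload = json.dumps([asdict(p) for p in rows], ensure_ascii=False, indent=2)

restored = json.loads(payload)
print(restored[0]["h1_tags"])  # ['Welcome']
```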
Complete Working Script
The main function that ties everything together:
"""
MiniSEOCrawler — A simple SEO site crawler.
Usage:
python crawler.py https://example.com --max-pages 50
"""
import argparse
def main() -> None:
"""Launch the crawler with CLI arguments."""
parser = argparse.ArgumentParser(description="MiniSEOCrawler — SEO site crawler")
parser.add_argument("url", help="Seed URL to start crawling")
parser.add_argument("--max-pages", type=int, default=50, help="Max pages to crawl (default: 50)")
parser.add_argument("--delay", type=float, default=0.5, help="Delay between requests in seconds")
parser.add_argument("--output", default="seo_report", help="Output file name (without extension)")
args = parser.parse_args()
# Crawl the site
pages = crawl_site(args.url, max_pages=args.max_pages, delay=args.delay)
if not pages:
print("No pages were crawled.")
return
# Check for broken links
all_links = set()
for page in pages:
for link in page.internal_links:
full = urljoin(page.url, link)
all_links.add(full)
    to_check = sorted(all_links)[:200]  # cap the check to keep runtime bounded
    print(f"\n[LINK CHECK] Checking {len(to_check)} of {len(all_links)} unique internal links...")
    broken = [r for r in check_links(to_check) if r.is_broken]
if broken:
print(f"\n Found {len(broken)} broken links:")
for b in broken:
print(f" {b.status_code} — {b.url}")
else:
print(" No broken links found.")
# Export results
export_to_csv(pages, f"{args.output}.csv")
export_to_json(pages, f"{args.output}.json")
generate_sitemap_xml(pages, f"{args.output}_sitemap.xml")
# Summary report
print(f"\n{'='*50}")
print(f" CRAWL SUMMARY")
print(f"{'='*50}")
print(f" Pages crawled: {len(pages)}")
print(f" 200 OK: {sum(1 for p in pages if p.status_code == 200)}")
print(f" Redirects (3xx): {sum(1 for p in pages if 300 <= p.status_code < 400)}")
print(f" Errors (4xx/5xx): {sum(1 for p in pages if p.status_code >= 400)}")
print(f" Images missing alt: {sum(p.images_without_alt for p in pages)}")
print(f" Broken links: {len(broken)}")
print(f"{'='*50}")
if __name__ == "__main__":
main()
Run it:
python crawler.py https://yoursite.com --max-pages 100 --delay 1
Next Steps
This crawler covers the fundamentals. To take it to production, consider these improvements:
- Multi-threading: Use concurrent.futures.ThreadPoolExecutor for parallel requests. With a handful of workers, a 50-page crawl can drop from roughly 25 seconds to 5.
- JavaScript rendering: Sites built with React or Next.js load content via JavaScript, which requests cannot see. Integrate Playwright or Selenium for browser-based crawling.
- Scheduled crawling (cron): Set up weekly automated crawls. Compare results against previous runs to catch new broken links or indexing issues early.
- robots.txt compliance: Use Python's urllib.robotparser module to programmatically respect crawling rules.
- Database storage: For large sites, swap CSV for SQLite or PostgreSQL. Time-series analysis lets you track SEO trends over weeks and months.
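The multi-threading point can be sketched with the standard library alone. The slow_check function below is a stand-in for a real network round trip; with a real fetch, keep the worker count modest so the rate-limiting advice above still holds:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_check(url: str) -> str:
    time.sleep(0.1)  # stand-in for a network round trip
    return f"200 {url}"

urls = [f"https://example.com/page-{i}" for i in range(10)]

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(slow_check, urls))
elapsed = time.time() - start

# 10 sequential checks would take about 1 s; 5 workers finish in about 0.2 s.
print(f"{len(results)} checks in {elapsed:.2f}s")
```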
For a backend project, check out Build Your Own REST API.
For professional SEO auditing and technical analysis, explore our services.
Frequently Asked Questions
Which Python version is required?
Python 3.9 or later. The dataclass, typing, and f-string features used in this guide run on 3.8, but xml.etree.ElementTree.indent, used to pretty-print the sitemap, was added in 3.9. If you want to use match-case, you need Python 3.10+.
Can this crawler be used in production?
For prototypes and small sites (up to 500 pages), yes. For large-scale production use, you need to address: error tolerance (retry mechanisms), database support, robots.txt compliance, JavaScript rendering, and distributed crawling. Frameworks like Scrapy or Crawlee provide these features out of the box.
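One of the "error tolerance" items above, retries with exponential backoff, fits in a few lines. This wrapper is my own sketch, not part of requests; in fetch_page you would wrap the requests.get call with it:

```python
import time

def with_retries(fn, attempts: int = 3, backoff: float = 0.5):
    """Call fn(); on exception, wait backoff * 2**i seconds and retry.

    Re-raises the last exception once attempts are exhausted.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff * 2 ** i)

# Demo: a function that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, attempts=3, backoff=0.01))  # ok
```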
Is it legal to crawl a website?
Crawling your own site is completely legal. When crawling other sites, follow three rules: (1) respect robots.txt, (2) do not overload the server (implement rate limiting), and (3) do not collect personal data. Under GDPR in the EU and similar regulations elsewhere, scraping pages containing personal data carries additional legal obligations.
