The most reliable way to understand a website's SEO health is to crawl it like a bot does. Commercial tools handle this well — but their price tags can be steep for small teams. What if you built your own?
In this guide, you will learn:
- How to send HTTP requests and interpret status codes with Python
- How to extract SEO data from HTML using BeautifulSoup
- How to analyze meta tags (robots, canonical, Open Graph, hreflang)
- How to detect broken links and measure response times
- How to generate a complete site map and export it as sitemap.xml
- How to save results in CSV and JSON formats
- How to build an ethical crawler that respects robots.txt
Ready? Open your terminal and follow along.
Requirements and Setup
Three libraries are all you need. Run this in your terminal:
pip install requests beautifulsoup4 lxml
For a proper requirements.txt:
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
Why lxml? Python's built-in HTML parser (html.parser) works for most pages. But lxml handles malformed HTML more gracefully and typically parses large pages 2-5x faster. In production, the difference is noticeable.
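If you are not sure lxml will be installed in every environment the crawler runs in, a small fallback keeps things working either way. This is my own convention, not something BeautifulSoup requires:

```python
# Pick the fastest available parser, falling back to the stdlib one.
# Pass the result as BeautifulSoup's second argument, e.g.
# BeautifulSoup(html, PARSER).
try:
    import lxml  # noqa: F401  # only probing availability
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

print(PARSER)
```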
Step 1: Fetching Pages with HTTP Requests
Every crawler starts with HTTP requests. You send a GET request to a URL and inspect the response.
import requests
from typing import Optional
def fetch_page(url: str, timeout: int = 10) -> Optional[requests.Response]:
"""Send a GET request to the given URL.
Args:
url: The page address to fetch.
timeout: Connection timeout in seconds.
Returns:
Response object on success, None on failure.
"""
headers = {
"User-Agent": "MiniSEOCrawler/1.0 (+https://yoursite.com/bot)"
}
try:
response = requests.get(url, headers=headers, timeout=timeout, allow_redirects=True)
return response
except requests.RequestException as e:
print(f"[ERROR] {url}: {e}")
return None
This function does three things: sends a custom User-Agent header, enforces a timeout, and catches errors gracefully.
Why does User-Agent matter? Most web servers inspect the User-Agent header. If you leave it blank or send the library default python-requests/2.31.0, servers may reject your request (403) or show a CAPTCHA. Specify your bot name and a contact page; this is ethical crawler behavior.
HTTP Status Codes Reference
The status code returned for each page reveals its SEO health:
| Code | Meaning | SEO Impact |
|---|---|---|
| 200 | Success | Page is accessible and indexable |
| 301 | Permanent redirect | Passes link equity, target URL gets indexed |
| 302 | Temporary redirect | Source URL may stay indexed; Google has said long-lived 302s pass link signals like 301s |
| 404 | Not found | Wastes crawl budget, hurts user experience |
| 410 | Permanently removed | Google removes from index — clearer signal than 404 |
| 500 | Server error | Repeated 500s halt indexing |
| 503 | Temporarily unavailable | Maintenance page, acceptable if short-lived |
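The table above folds naturally into a small helper for summary reports. The bucket names here are my own shorthand, not a standard:

```python
def seo_status_category(code: int) -> str:
    """Map an HTTP status code to a rough SEO health bucket."""
    if code == 200:
        return "ok"
    if code in (301, 308):
        return "permanent-redirect"
    if code in (302, 303, 307):
        return "temporary-redirect"
    if code in (404, 410):
        return "gone"
    if 500 <= code < 600:
        return "server-error"
    return "other"

print(seo_status_category(301))  # permanent-redirect
print(seo_status_category(503))  # server-error
```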
Step 2: Extracting SEO Data from HTML
Once you have the page content, parse the HTML with BeautifulSoup. The critical SEO elements: title, meta description, heading tags, and links.
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Dict, List
from urllib.parse import urljoin, urlparse
@dataclass
class SEOData:
"""Holds SEO data for a single page."""
url: str
status_code: int
title: str = ""
meta_description: str = ""
h1_tags: List[str] = field(default_factory=list)
h2_tags: List[str] = field(default_factory=list)
internal_links: List[str] = field(default_factory=list)
external_links: List[str] = field(default_factory=list)
images_without_alt: int = 0
word_count: int = 0
def extract_seo_data(url: str, response: requests.Response, base_domain: str) -> SEOData:
"""Extract SEO data from an HTTP response.
Args:
url: The page URL.
response: requests.Response object.
base_domain: The site's domain (e.g., example.com).
Returns:
SEOData object containing the page's SEO information.
"""
soup = BeautifulSoup(response.text, "lxml")
data = SEOData(url=url, status_code=response.status_code)
# Title tag
title_tag = soup.find("title")
if title_tag:
data.title = title_tag.get_text(strip=True)
# Meta description
meta_desc = soup.find("meta", attrs={"name": "description"})
if meta_desc:
data.meta_description = meta_desc.get("content", "")
# Heading tags
data.h1_tags = [h.get_text(strip=True) for h in soup.find_all("h1")]
data.h2_tags = [h.get_text(strip=True) for h in soup.find_all("h2")]
    # Classify links: resolve relative URLs and compare hostnames.
    # A substring check like `base_domain in href` would wrongly match
    # addresses such as example.com.evil.net, so compare netloc exactly.
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        host = urlparse(urljoin(url, href)).netloc
        if host == base_domain:
            data.internal_links.append(href)
        elif href.startswith("http"):
            data.external_links.append(href)
# Images missing alt text
images = soup.find_all("img")
data.images_without_alt = sum(
1 for img in images if not img.get("alt", "").strip()
)
# Word count
body = soup.find("body")
if body:
data.word_count = len(body.get_text(separator=" ", strip=True).split())
return data
This function extracts 8 different SEO signals from a single page. The SEOData dataclass structures the data for easy export to CSV or JSON later.
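Classifying links correctly hinges on resolving each href against the current page and comparing hostnames rather than substrings. The standard library handles both steps:

```python
from urllib.parse import urljoin, urlparse

base = "https://example.com/blog/post-1"
for href in ["/about", "page-2", "https://other.com/x", "#top"]:
    full = urljoin(base, href).split("#")[0]  # resolve, then drop the fragment
    internal = urlparse(full).netloc == "example.com"
    print(full, "internal" if internal else "external")
```

Note that a relative link like page-2 resolves against the current directory, so the same href means different URLs on different pages.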
Step 3: Analyzing Meta Tags
Beyond title and description, other meta tags are critical for SEO. The robots tag controls indexing, canonical resolves duplicate content, and Open Graph determines how pages appear on social media.
from typing import Any
def extract_meta_tags(soup: BeautifulSoup) -> Dict[str, Any]:
"""Extract SEO-related meta tags from a page.
Args:
soup: BeautifulSoup object.
Returns:
Dictionary mapping meta tag names to their values.
"""
meta = {}
# robots tag
robots_tag = soup.find("meta", attrs={"name": "robots"})
if robots_tag:
meta["robots"] = robots_tag.get("content", "")
# canonical URL
canonical = soup.find("link", attrs={"rel": "canonical"})
if canonical:
meta["canonical"] = canonical.get("href", "")
# Open Graph tags
og_tags = soup.find_all("meta", attrs={"property": lambda x: x and x.startswith("og:")})
for tag in og_tags:
meta[tag["property"]] = tag.get("content", "")
# hreflang tags
hreflangs = soup.find_all("link", attrs={"rel": "alternate", "hreflang": True})
meta["hreflang"] = [
{"lang": tag["hreflang"], "href": tag.get("href", "")}
for tag in hreflangs
]
return meta
Meta Tags Reference Table
| Tag | Example | Purpose |
|---|---|---|
| robots | noindex, nofollow | Controls whether search engines index the page |
| canonical | <link rel="canonical" href="..."> | Resolves duplicate content by specifying the preferred URL |
| og:title | <meta property="og:title" content="..."> | Controls the title shown in social media shares |
| og:description | <meta property="og:description" content="..."> | Sets the description for social media shares |
| og:image | <meta property="og:image" content="..."> | Determines the image displayed in shares |
| hreflang | <link rel="alternate" hreflang="en" href="..."> | Enables language and region targeting for multilingual sites |
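The same Open Graph extraction can be sketched without BeautifulSoup, using the standard library's html.parser. This is a stripped-down illustration of the idea, not a replacement for the function above:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects Open Graph meta tags from a page."""
    def __init__(self) -> None:
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        prop = d.get("property") or ""
        if prop.startswith("og:"):
            self.og[prop] = d.get("content") or ""

collector = MetaCollector()
collector.feed('<head><meta property="og:title" content="Hello"></head>')
print(collector.og)  # {'og:title': 'Hello'}
```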
Step 4: Detecting Broken Links
Broken links damage user experience and waste crawl budget. This function checks all links on a page and measures response times:
import time
from dataclasses import dataclass
@dataclass
class LinkCheckResult:
"""Holds the result of a link check."""
url: str
status_code: int
response_time: float
redirect_chain: List[str]
is_broken: bool
def check_links(urls: List[str], timeout: int = 10) -> List[LinkCheckResult]:
"""Check the accessibility of a list of URLs.
Args:
urls: List of URLs to check.
timeout: Timeout per request in seconds.
Returns:
List of LinkCheckResult for each URL.
"""
results = []
headers = {
"User-Agent": "MiniSEOCrawler/1.0 (+https://yoursite.com/bot)"
}
for url in urls:
start = time.time()
try:
resp = requests.get(
url, headers=headers, timeout=timeout, allow_redirects=True
)
elapsed = time.time() - start
# Extract redirect chain
chain = [r.url for r in resp.history] if resp.history else []
results.append(LinkCheckResult(
url=url,
status_code=resp.status_code,
response_time=round(elapsed, 2),
redirect_chain=chain,
is_broken=resp.status_code >= 400,
))
except requests.RequestException:
elapsed = time.time() - start
results.append(LinkCheckResult(
url=url,
status_code=0,
response_time=round(elapsed, 2),
redirect_chain=[],
is_broken=True,
))
# Pause between requests to avoid overwhelming the server
time.sleep(0.5)
return results
Why add rate limiting? Sending hundreds of requests per second overloads the server. The result: your IP gets banned, the server slows down, or it crashes entirely. Adding time.sleep(0.5) between requests is both ethical and practical. For larger sites, increase this to 1-2 seconds.
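A fixed 0.5-second delay is easy for servers to fingerprint. Adding a little random jitter, a common courtesy tweak rather than anything requests provides, spreads the load more naturally:

```python
import random
import time

def polite_sleep(base: float = 0.5, jitter: float = 0.25) -> float:
    """Sleep for base plus a random extra of up to jitter seconds.

    Returns the delay actually used, which is handy for logging.
    """
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Between requests (tiny values here just for the demo):
waited = polite_sleep(0.05, 0.05)
print(f"waited {waited:.3f}s")
```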
Step 5: Building the Site Map
Time to put all the pieces together. This crawler starts from a seed URL, follows all internal links, and generates a sitemap.xml:
from urllib.parse import urljoin, urlparse
from collections import deque
import xml.etree.ElementTree as ET
from datetime import datetime
def crawl_site(
start_url: str,
max_pages: int = 100,
delay: float = 0.5,
) -> List[SEOData]:
"""Crawl a site starting from the given URL.
Args:
start_url: The seed URL to begin crawling.
max_pages: Maximum number of pages to crawl.
delay: Delay between requests in seconds.
Returns:
List of SEOData for each crawled page.
"""
parsed_start = urlparse(start_url)
base_domain = parsed_start.netloc
visited: set = set()
queue: deque = deque([start_url])
results: List[SEOData] = []
print(f"[START] Crawling {start_url} (max {max_pages} pages)")
while queue and len(visited) < max_pages:
current_url = queue.popleft()
# Normalize
current_url = current_url.split("#")[0] # Strip fragment
if current_url in visited:
continue
response = fetch_page(current_url)
if response is None:
continue
visited.add(current_url)
content_type = response.headers.get("Content-Type", "")
# Only process HTML pages
if "text/html" not in content_type:
continue
seo_data = extract_seo_data(current_url, response, base_domain)
results.append(seo_data)
print(f" [{len(visited)}/{max_pages}] {response.status_code} — {current_url}")
# Add internal links to queue
for link in seo_data.internal_links:
full_url = urljoin(current_url, link)
full_url = full_url.split("#")[0]
parsed = urlparse(full_url)
if parsed.netloc == base_domain and full_url not in visited:
queue.append(full_url)
time.sleep(delay)
print(f"[DONE] Crawled {len(results)} pages.")
return results
def generate_sitemap_xml(pages: List[SEOData], output_path: str = "sitemap.xml") -> None:
"""Generate a sitemap.xml from crawled pages.
Args:
pages: List of SEOData objects.
output_path: Output file path.
"""
urlset = ET.Element("urlset")
urlset.set("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9")
    written = 0
    for page in pages:
        if page.status_code != 200:
            continue
        url_el = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url_el, "loc")
        loc.text = page.url
        lastmod = ET.SubElement(url_el, "lastmod")
        lastmod.text = datetime.now().strftime("%Y-%m-%d")
        written += 1
    tree = ET.ElementTree(urlset)
    ET.indent(tree, space=" ")
    tree.write(output_path, encoding="unicode", xml_declaration=True)
    print(f"[SITEMAP] Generated {output_path} ({written} URLs)")
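To see the XML shape this produces without crawling anything, the same ElementTree calls work on hand-made data:

```python
import xml.etree.ElementTree as ET

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc_text in ["https://example.com/", "https://example.com/about"]:
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = loc_text

ET.indent(urlset, space="  ")  # pretty-print; added in Python 3.9
print(ET.tostring(urlset, encoding="unicode"))
```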
Why should you check robots.txt? Most websites have a robots.txt file at their root. This file specifies which crawlers can access which pages. Ignoring a Disallow: /admin/ directive and crawling /admin/ violates the site owner's rules. In production, use Python's urllib.robotparser module to programmatically respect robots.txt.
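urllib.robotparser is straightforward to use. In production you would point it at the live file with set_url() and read(); here it parses a sample robots.txt inline so the example stays offline:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Production: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines())

print(rp.can_fetch("MiniSEOCrawler/1.0", "https://example.com/admin/login"))  # False
print(rp.can_fetch("MiniSEOCrawler/1.0", "https://example.com/blog/post"))    # True
print(rp.crawl_delay("MiniSEOCrawler/1.0"))  # 2
```

Calling can_fetch() right before each request in crawl_site, and honoring crawl_delay() when it is set, is all it takes to make the crawler compliant.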
Bonus: Exporting Results to CSV and JSON
To analyze your crawl data, export it:
import csv
import json
def export_to_csv(pages: List[SEOData], output_path: str = "seo_report.csv") -> None:
"""Save SEO data in CSV format.
Args:
pages: List of SEOData objects.
output_path: Output file path.
"""
fieldnames = [
"url", "status_code", "title", "meta_description",
"h1_count", "h2_count", "internal_links", "external_links",
"images_without_alt", "word_count",
]
with open(output_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
for page in pages:
writer.writerow({
"url": page.url,
"status_code": page.status_code,
"title": page.title,
"meta_description": page.meta_description,
"h1_count": len(page.h1_tags),
"h2_count": len(page.h2_tags),
"internal_links": len(page.internal_links),
"external_links": len(page.external_links),
"images_without_alt": page.images_without_alt,
"word_count": page.word_count,
})
print(f"[CSV] Saved {output_path} ({len(pages)} rows)")
def export_to_json(pages: List[SEOData], output_path: str = "seo_report.json") -> None:
"""Save SEO data in JSON format.
Args:
pages: List of SEOData objects.
output_path: Output file path.
"""
from dataclasses import asdict
data = [asdict(page) for page in pages]
with open(output_path, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
print(f"[JSON] Saved {output_path} ({len(pages)} records)")
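The asdict call is what makes the JSON export a one-liner: it recursively converts a dataclass, list fields included, into plain dicts. A self-contained round trip with a trimmed stand-in for SEOData:

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class PageStub:  # trimmed stand-in for SEOData
    url: str
    status_code: int
    h1_tags: List[str] = field(default_factory=list)

rows = [PageStub("https://example.com/", 200, ["Welcome"])]
payload = json.dumps([asdict(p) for p in rows], ensure_ascii=False, indent=2)

restored = json.loads(payload)
print(restored[0]["h1_tags"])  # ['Welcome']
```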
Complete Working Script
The main function that ties everything together:
"""
MiniSEOCrawler — A simple SEO site crawler.
Usage:
python crawler.py https://example.com --max-pages 50
"""
import argparse
def main() -> None:
"""Launch the crawler with CLI arguments."""
parser = argparse.ArgumentParser(description="MiniSEOCrawler — SEO site crawler")
parser.add_argument("url", help="Seed URL to start crawling")
parser.add_argument("--max-pages", type=int, default=50, help="Max pages to crawl (default: 50)")
parser.add_argument("--delay", type=float, default=0.5, help="Delay between requests in seconds")
parser.add_argument("--output", default="seo_report", help="Output file name (without extension)")
args = parser.parse_args()
# Crawl the site
pages = crawl_site(args.url, max_pages=args.max_pages, delay=args.delay)
if not pages:
print("No pages were crawled.")
return
# Check for broken links
all_links = set()
for page in pages:
for link in page.internal_links:
full = urljoin(page.url, link)
all_links.add(full)
    to_check = sorted(all_links)[:200]  # cap the check to keep runtime bounded
    print(f"\n[LINK CHECK] Checking {len(to_check)} of {len(all_links)} unique internal links...")
    broken = [r for r in check_links(to_check) if r.is_broken]
if broken:
print(f"\n Found {len(broken)} broken links:")
for b in broken:
print(f" {b.status_code} — {b.url}")
else:
print(" No broken links found.")
# Export results
export_to_csv(pages, f"{args.output}.csv")
export_to_json(pages, f"{args.output}.json")
generate_sitemap_xml(pages, f"{args.output}_sitemap.xml")
# Summary report
print(f"\n{'='*50}")
print(f" CRAWL SUMMARY")
print(f"{'='*50}")
print(f" Pages crawled: {len(pages)}")
print(f" 200 OK: {sum(1 for p in pages if p.status_code == 200)}")
print(f" Redirects (3xx): {sum(1 for p in pages if 300 <= p.status_code < 400)}")
print(f" Errors (4xx/5xx): {sum(1 for p in pages if p.status_code >= 400)}")
print(f" Images missing alt: {sum(p.images_without_alt for p in pages)}")
print(f" Broken links: {len(broken)}")
print(f"{'='*50}")
if __name__ == "__main__":
main()
Run it:
python crawler.py https://yoursite.com --max-pages 100 --delay 1
Next Steps
This crawler covers the fundamentals. To take it to production, consider these improvements:
- Multi-threading: Use concurrent.futures.ThreadPoolExecutor for parallel requests. With a handful of workers, a 50-page crawl can drop from roughly 25 seconds to 5.
- JavaScript rendering: Sites built with React or Next.js load content via JavaScript, which requests cannot see. Integrate Playwright or Selenium for browser-based crawling.
- Scheduled crawling (cron): Set up weekly automated crawls. Compare results against previous runs to catch new broken links or indexing issues early.
- robots.txt compliance: Use Python's urllib.robotparser module to programmatically respect crawling rules.
- Database storage: For large sites, swap CSV for SQLite or PostgreSQL. Time-series analysis lets you track SEO trends over weeks and months.
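The multi-threading point can be sketched with the standard library alone. The slow_check function below is a stand-in for a real network round trip; with a real fetch, keep the worker count modest so the rate-limiting advice above still holds:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_check(url: str) -> str:
    time.sleep(0.1)  # stand-in for a network round trip
    return f"200 {url}"

urls = [f"https://example.com/page-{i}" for i in range(10)]

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(slow_check, urls))
elapsed = time.time() - start

# 10 sequential checks would take about 1 s; 5 workers finish in about 0.2 s.
print(f"{len(results)} checks in {elapsed:.2f}s")
```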
For a backend project, check out Build Your Own REST API.
For professional SEO auditing and technical analysis, explore our services.
Frequently Asked Questions
Which Python version is required?
Python 3.9 or later. The dataclass, typing, and f-string features used in this guide run on 3.8, but xml.etree.ElementTree.indent, used to pretty-print the sitemap, was added in 3.9. If you want to use match-case, you need Python 3.10+.
Can this crawler be used in production?
For prototypes and small sites (up to 500 pages), yes. For large-scale production use, you need to address: error tolerance (retry mechanisms), database support, robots.txt compliance, JavaScript rendering, and distributed crawling. Frameworks like Scrapy or Crawlee provide these features out of the box.
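One of the "error tolerance" items above, retries with exponential backoff, fits in a few lines. This wrapper is my own sketch, not part of requests; in fetch_page you would wrap the requests.get call with it:

```python
import time

def with_retries(fn, attempts: int = 3, backoff: float = 0.5):
    """Call fn(); on exception, wait backoff * 2**i seconds and retry.

    Re-raises the last exception once attempts are exhausted.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff * 2 ** i)

# Demo: a function that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, attempts=3, backoff=0.01))  # ok
```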
Is it legal to crawl a website?
Crawling your own site is completely legal. When crawling other sites, follow three rules: (1) respect robots.txt, (2) do not overload the server (implement rate limiting), and (3) do not collect personal data. Under GDPR in the EU and similar regulations elsewhere, scraping pages containing personal data carries additional legal obligations.
