How to Validate a URL in Python (Malformed or Not)

Introduction

Python provides several ways to validate URLs, from basic syntax checks to stricter format validation. The built-in urllib.parse.urlparse() splits a URL into components, so you can check that the required parts (scheme and netloc) are present. For stricter validation, the third-party validators library provides a one-line check. For production applications, combine syntax validation with a tested re pattern or the validators package, and optionally check reachability with requests or urllib.

Basic Validation with urllib.parse

python
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

print(is_valid_url("https://www.example.com"))       # True
print(is_valid_url("http://example.com/path?q=1"))   # True
print(is_valid_url("ftp://files.example.com"))       # True
print(is_valid_url("not a url"))                     # False
print(is_valid_url(""))                              # False
print(is_valid_url("example.com"))                   # False (no scheme)

urlparse() splits the URL into scheme, netloc, path, params, query, and fragment. It does not validate anything by itself, so the check above treats a URL as valid only when both a scheme (http, https, etc.) and a network location (netloc) are present.
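
The try/except in the check above matters because urlparse() raises ValueError for some malformed inputs, such as an unclosed IPv6 bracket, while other problems only surface when you read attributes like port. Here is a minimal sketch of a slightly stricter check, assuming you also want to reject empty hosts and out-of-range ports. The name is_valid_url_strict is hypothetical and not part of urllib.

```python
from urllib.parse import urlparse


def is_valid_url_strict(url):
    """Hypothetical helper: basic check plus host and port sanity checks."""
    try:
        result = urlparse(url)   # may raise ValueError, e.g. for "http://[::1"
        _ = result.port          # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return False
    # hostname is None when netloc holds only a port (e.g. "http://:80")
    return bool(result.scheme) and result.hostname is not None


for candidate in [
    "https://www.example.com",
    "http://example.com:99999",
    "http://[::1",
    "http://:80",
]:
    print(candidate, is_valid_url_strict(candidate))
```

Only the first candidate passes. The other three would either raise inside urlparse() or slip through a plain truthiness test on netloc.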

Stricter Validation with Scheme Whitelist

python
from urllib.parse import urlparse

def is_valid_http_url(url):
    try:
        result = urlparse(url)
        return result.scheme in ('http', 'https') and bool(result.netloc)
    except ValueError:
        return False

print(is_valid_http_url("https://example.com"))      # True
print(is_valid_http_url("ftp://example.com"))        # False (not http/https)
print(is_valid_http_url("javascript:alert(1)"))      # False
print(is_valid_http_url("file:///etc/passwd"))       # False

Always restrict the scheme to http and https when validating user-supplied web URLs to prevent scheme-based attacks such as javascript: or file: URLs.
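
To see why the whitelist matters, the sketch below compares the basic scheme-and-netloc check with the http/https-only check on a few inputs that carry a netloc. The helper names has_scheme_and_netloc and is_http_url are made up for this comparison.

```python
from urllib.parse import urlparse


def has_scheme_and_netloc(url):
    """The basic check: any scheme is accepted."""
    result = urlparse(url)
    return bool(result.scheme and result.netloc)


def is_http_url(url):
    """The whitelisted check: only http and https are accepted."""
    result = urlparse(url)
    return result.scheme in ("http", "https") and bool(result.netloc)


samples = [
    "javascript://example.com/%0Aalert(1)",  # non-web scheme with a netloc
    "file://localhost/etc/passwd",           # local file access with a netloc
    "HTTPS://EXAMPLE.COM",                   # urlparse lowercases the scheme
]
for url in samples:
    print(f"{url!r}: basic={has_scheme_and_netloc(url)}, http_only={is_http_url(url)}")
```

The first two samples pass the basic check but fail the whitelist. The third passes both, because urlparse() normalizes the scheme to lowercase before the comparison.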

Using the validators Library

bash
pip install validators
python
import validators

print(validators.url("https://www.example.com"))        # True
print(validators.url("http://example.com/path?q=1"))    # True
print(validators.url("not a url"))                      # ValidationError(...) object (returned, not raised)
print(validators.url("example.com"))                    # ValidationError(...)
print(validators.url("https://"))                       # ValidationError(...)

# Use in a boolean context (the failure object is falsy)
if validators.url("https://example.com"):
    print("Valid URL")

# Check the result type
result = validators.url("bad url")
if isinstance(result, validators.ValidationError):
    print("Invalid URL")

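validators.url() performs more thorough checks than urlparse, including verifying the domain format and TLD presence. On failure it returns, rather than raises, a falsy ValidationError object (named ValidationFailure in older releases), so test the truthiness of the result instead of wrapping the call in try/except.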
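
Because the return value is either True or a failure object, it can be convenient to wrap the call so callers always get a plain bool. The sketch below is one way to do that, assuming you want the code to keep working when the validators package is not installed. The wrapper name is_valid_web_url and the urlparse fallback are assumptions for illustration, and the fallback is looser than the library check.

```python
from urllib.parse import urlparse

try:
    import validators  # third-party; optional in this sketch
except ImportError:
    validators = None


def is_valid_web_url(url):
    """Hypothetical wrapper that always returns a plain bool."""
    if validators is not None:
        # validators.url() returns True or a falsy failure object
        return bool(validators.url(url))
    # Fallback when the package is missing: scheme whitelist plus netloc check
    try:
        result = urlparse(url)
    except ValueError:
        return False
    return result.scheme in ("http", "https") and bool(result.netloc)


print(is_valid_web_url("https://www.example.com"))
print(is_valid_web_url("not a url"))
```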

Regex-Based Validation

python
import re

URL_PATTERN = re.compile(
    r'^https?://'                  # scheme
    r'(?:[a-zA-Z0-9-]+\.)*'       # subdomains
    r'[a-zA-Z0-9-]+'              # domain
    r'\.[a-zA-Z]{2,}'             # TLD
    r'(?::\d{1,5})?'              # optional port
    r'(?:/[^\s]*)?$'              # optional path
)

def is_valid_url_regex(url):
    # fullmatch() requires the whole string to match; with match(), `$` would
    # also accept a URL followed by a trailing newline
    return bool(URL_PATTERN.fullmatch(url))

print(is_valid_url_regex("https://www.example.com"))       # True
print(is_valid_url_regex("http://example.com:8080/path"))  # True
print(is_valid_url_regex("https://example"))               # False (no TLD)
print(is_valid_url_regex("ftp://example.com"))             # False (not http/https)

Regex gives full control over what you accept but is harder to maintain and easy to get wrong. Prefer validators or urlparse for most use cases.
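
To make "easy to get wrong" concrete, the sketch below runs the same pattern against a few inputs it handles in surprising ways. The two list names are only labels for this demonstration.

```python
import re

# Same pattern as above, repeated so this snippet runs on its own
URL_PATTERN = re.compile(
    r'^https?://'
    r'(?:[a-zA-Z0-9-]+\.)*'
    r'[a-zA-Z0-9-]+'
    r'\.[a-zA-Z]{2,}'
    r'(?::\d{1,5})?'
    r'(?:/[^\s]*)?$'
)

rejected_but_plausible = [
    "http://localhost:8000",       # no TLD
    "http://192.168.1.1",          # IP address instead of a domain name
    "https://example.com?q=1",     # query string without a path
]
accepted_but_dubious = [
    "https://-bad-.com",           # label starts and ends with a hyphen
    "https://example.com:99999/",  # port above 65535
]

for url in rejected_but_plausible + accepted_but_dubious:
    print(f"{url}: {bool(URL_PATTERN.fullmatch(url))}")
```

The first three print False even though they are URLs many applications need to accept, and the last two print True even though the host label and the port are not valid.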

Checking URL Reachability

python
import requests

def is_url_reachable(url, timeout=5):
    try:
        # HEAD fetches only the headers; redirects are followed to the final target
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False

print(is_url_reachable("https://www.google.com"))      # True (when the network is available)
print(is_url_reachable("https://nonexistent.invalid")) # False

Only check reachability when necessary — it adds latency and makes network requests. Syntax validation is sufficient for most form inputs.
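
If you prefer to avoid the requests dependency, a similar check can be sketched with the standard library. The version below is an illustration under stated assumptions, not a complete defense: the function names are hypothetical, the address guard only catches IP literals (it does not resolve DNS names), and urlopen() follows redirects, so a real server-side deployment would need stronger SSRF protections.

```python
import http.client
import ipaddress
import urllib.error
import urllib.request
from urllib.parse import urlparse


def is_internal_ip_literal(hostname):
    """Return True if hostname is an IP literal in a non-public range."""
    try:
        address = ipaddress.ip_address(hostname)
    except ValueError:
        return False  # a domain name, not an IP literal; DNS is NOT resolved here
    return address.is_private or address.is_loopback or address.is_link_local


def is_url_reachable_stdlib(url, timeout=5):
    """Hypothetical stdlib counterpart of the requests-based check."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    # urlopen() also understands file: and ftp:, so restrict the scheme first
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if is_internal_ip_literal(parsed.hostname):
        return False
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTPError (raised for 4xx/5xx responses) is a subclass of URLError
        return False
```

Inputs such as "file:///etc/passwd" or "http://127.0.0.1:1/" return False before any network request is made.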

Validating URL Components

python
from urllib.parse import urlparse, parse_qs

def analyze_url(url):
    parsed = urlparse(url)
    return {
        'valid': bool(parsed.scheme and parsed.netloc),
        'scheme': parsed.scheme,
        'domain': parsed.netloc,
        'path': parsed.path,
        'query_params': parse_qs(parsed.query),
        'fragment': parsed.fragment,
    }

info = analyze_url("https://example.com/search?q=python&page=2#results")
print(info)
# {
#   'valid': True,
#   'scheme': 'https',
#   'domain': 'example.com',
#   'path': '/search',
#   'query_params': {'q': ['python'], 'page': ['2']},
#   'fragment': 'results'
# }
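
One detail worth knowing when you store the 'domain' field: netloc is the raw authority section, so it also contains any credentials and port. A short sketch of the difference, using a made-up URL:

```python
from urllib.parse import urlparse

parsed = urlparse("https://user:secret@Example.com:8443/search?q=python")

print(parsed.netloc)    # user:secret@Example.com:8443
print(parsed.hostname)  # example.com
print(parsed.port)      # 8443
print(parsed.username)  # user
```

If you only need the host for comparison or logging, hostname is usually the safer attribute, since it is lowercased and excludes credentials.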

Django and Pydantic Validators

python
# Django's URLValidator
from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

validate = URLValidator()
try:
    validate("https://www.example.com")
    print("Valid")
except ValidationError:
    print("Invalid")

# Pydantic v2: built-in URL type
from pydantic import BaseModel, HttpUrl

class Link(BaseModel):
    url: HttpUrl

link = Link(url="https://example.com")  # Valid
# Link(url="not a url")  # Raises ValidationError

Common Pitfalls

  • Accepting urlparse results without checking scheme and netloc: urlparse("not-a-url") does not raise an error — it returns a result with an empty scheme and the input as the path. Always check that both scheme and netloc are non-empty.
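  • Not restricting the URL scheme: urlparse("javascript:alert(1)") parses without error and reports the scheme javascript. Its netloc is empty, so the basic check happens to reject it, but a variant with an authority part, such as javascript://example.com/%0Aalert(1), has both a scheme and a netloc and passes. If you accept any scheme, you may allow XSS or file access attacks. Whitelist http and https for web URLs.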
  • Using regex that is too permissive or too strict: URL validation regex is notoriously hard to get right. It either rejects valid URLs (internationalized domains, unusual ports) or accepts invalid ones. Use a tested library instead of writing your own regex.
  • Checking reachability for every URL: Making an HTTP request for each URL validation adds latency and can be exploited for SSRF (Server-Side Request Forgery). Only check reachability when explicitly needed, and never for user-supplied URLs in server-side code without SSRF protections.
  • Not handling internationalized domain names (IDN): URLs like https://münchen.de are valid but use non-ASCII characters. urlparse parses them without complaint (it does not convert them), but custom regex patterns may reject them. Use idna encoding or the validators library, which supports IDN.
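
For the IDN pitfall, one option is to convert the hostname to its ASCII (Punycode) form before applying ASCII-only checks. The sketch below uses Python's built-in idna codec, which implements the older IDNA 2003 rules. The helper name ascii_hostname is hypothetical.

```python
from urllib.parse import urlparse


def ascii_hostname(url):
    """Hypothetical helper: return the URL's hostname in ASCII (Punycode) form."""
    hostname = urlparse(url).hostname
    if hostname is None:
        return None
    try:
        # Built-in codec; implements the older IDNA 2003 rules
        return hostname.encode("idna").decode("ascii")
    except UnicodeError:
        return None  # e.g. an empty or over-long label


print(ascii_hostname("https://münchen.de"))   # xn--mnchen-3ya.de
print(ascii_hostname("https://example.com"))  # example.com
```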

Summary

  • Use urlparse() with scheme and netloc checks for basic validation
  • Restrict schemes to http and https for user-supplied URLs
  • The validators library provides comprehensive one-line URL validation
  • Django's URLValidator and Pydantic's HttpUrl integrate with their respective frameworks
  • Only check URL reachability when explicitly needed — syntax validation is usually sufficient
  • Avoid custom regex for URL validation — use tested libraries that handle edge cases
