
Problem description:

I have a url from the user, and I have to reply with the fetched HTML.

How can I check whether the URL is malformed?

For example:

url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed

Solution 1:

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).

Solution 2:

A True-or-False version, based on @DMfll's answer:

try:
    # Python 2
    from urlparse import urlparse
except ImportError:
    # Python 3 (ImportError also covers ModuleNotFoundError)
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except AttributeError:
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))

Gives:

True
False
False
False
True
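Note that uri_validator accepts 'http://google', which the question treats as malformed, because urlparse only splits the string and never looks for a dot in the host. A minimal sketch of a stricter check follows; the function name is_wellformed_http_url and the rule "http(s) scheme plus at least one dot in the host" are assumptions chosen to match the question's four examples, not part of the original answer.

```python
from urllib.parse import urlparse


def is_wellformed_http_url(url):
    """Return True for http(s) URLs whose host contains a dot (assumed rule)."""
    try:
        result = urlparse(url)
        host = result.hostname
    except (AttributeError, ValueError):
        # non-string input, or a netloc that urlparse cannot split
        return False
    if result.scheme not in ("http", "https") or not host:
        return False
    labels = host.split(".")
    # require at least two labels and no empty label (e.g. "google." fails)
    return len(labels) >= 2 and all(labels)


for candidate in ("google", "google.com", "http://google.com", "http://google"):
    print(candidate, is_wellformed_http_url(candidate))
```

Only 'http://google.com' passes. This rule also rejects hosts such as localhost, so relax it if those matter to you.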

Solution 3:

Actually, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator()
try:
    val('httpx://www.google.com')
except (ValidationError,) as e: 
    print(e)

Edit: ah yes, this question is a duplicate of this one: How can I check if a URL exists with Django's validators?

Solution 4:

Django URL validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False

Solution 5:

Nowadays I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

It looks like this:

from urllib.parse import urlparse

def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

Just use is_url("http://www.asdf.com").

Hope it helps!
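One detail worth seeing in action: urlparse raises ValueError for some malformed strings (for example an unbalanced IPv6 bracket) but AttributeError for non-string input such as an int. The sketch below catches both; the name is_url_safe is hypothetical and the function is only a small variation on is_url above.

```python
from urllib.parse import urlparse


def is_url_safe(url):
    """Like is_url, but also returns False for non-string input."""
    try:
        result = urlparse(url)
    except (AttributeError, ValueError):
        return False
    return all([result.scheme, result.netloc])


print(is_url_safe("http://www.asdf.com"))  # True
print(is_url_safe(532))                    # False: AttributeError inside urlparse
print(is_url_safe("http://[::1"))          # False: ValueError (unbalanced bracket)
```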

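Solution 6:

I landed on this page trying to figure out a sane way to validate strings as "valid" URLs. I share here my solution using python3. No extra libraries required.

See https://docs.python.org/2/library/urlparse.html if you are using python2.

See https://docs.python.org/3.0/library/urllib.parse.html if you are using python3, as I am.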

import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)
    
min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all(getattr(token, attr) for attr in min_attributes):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')

ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')

'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.

'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all(getattr(tokens, qualifying_attr)
               for qualifying_attr in qualifying)

Here is a usage example. I prefer to do any exception handling outside the function.

my_list = [
    "http://www.cwi.nl:80/%7Eguido/Python.html",
    "/data/Python.html",
    532,
    type("FooObject", (), {"decode": None})(),
    "dkakasdkjdjakdjadjfalskdjfalk",
    "https://stackoverflow.com",
]

for item in my_list:
    try:
        print(f"{item} is valid: {is_valid(item)}")
    except (AttributeError, TypeError) as e:
        print(e)

Output:

http://www.cwi.nl:80/%7Eguido/Python.html is valid: True

/data/Python.html is valid: False

'int' object has no attribute 'decode'

'NoneType' object is not callable

dkakasdkjdjakdjadjfalskdjfalk is valid: False

https://stackoverflow.com is valid: True

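Solution 7:

Note: lepl is no longer supported, sorry (you're welcome to use it, and I think the code below works, but it won't get updates).

RFC 3696 (http://www.faqs.org/rfcs/rfc3696.html) defines how to do this (for HTTP URLs and email). I implemented its recommendations in Python using lepl (a parser library). See http://acooke.org/lepl/rfc3696.html

To use: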

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True

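Solution 8:

EDIT

As @Kwame pointed out, the code below validates the URL even if .com, .co, etc. is not present.
@Blaise also pointed out that URLs like https://www.google are valid URLs, and you need to do a DNS check separately to see whether the host resolves.

This is simple and works:

min_attr contains the basic set of attributes that must be present for a URL to count as valid, i.e. the http:// part and the google.com part.

result.scheme holds the scheme, e.g. http (without the ://), and

result.netloc holds the domain name, e.g. google.com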

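from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse


def url_check(url):
    # attributes that must be non-empty for the URL to count as valid
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        return all(getattr(result, attr) for attr in min_attr)
    except (AttributeError, TypeError, ValueError):
        # urlparse rejects non-string input and some malformed netlocs
        return False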

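all() returns True if every value passed to it is truthy. So if result.scheme and result.netloc are both present (i.e. have some value), the URL is considered valid and the function returns True.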
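The edit note in this answer says that a syntactically valid URL such as https://www.google still needs a separate DNS check. A minimal sketch of such a check using only the standard library follows; the name hostname_resolves is hypothetical, and the result depends on the resolver and network of the machine running it.

```python
import socket
from urllib.parse import urlparse


def hostname_resolves(url):
    """Return True if the URL's host can be resolved to an address."""
    try:
        host = urlparse(url).hostname
    except (AttributeError, ValueError):
        return False
    if not host:
        # no scheme://host part at all, e.g. "google"
        return False
    try:
        socket.getaddrinfo(host, None)
    except (socket.gaierror, UnicodeError):
        # lookup failed, or the host is not a legal IDNA name
        return False
    return True


print(hostname_resolves("http://127.0.0.1/"))  # IP literals need no DNS lookup
print(hostname_resolves("google"))             # no host to look up
```

A successful lookup only shows that the name resolves right now; it does not show that anything is listening on it.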

Solution 9:

Here's a regex solution, since the top-voted regex doesn't work for odd cases such as multi-part top-level domains. Some test cases are below.

import re

regex = re.compile(
    r"(\w+://)?"                # protocol                      (optional)
    r"(\w+\.)?"                 # host                          (optional)
    r"(([\w-]+)\.(\w+))"        # domain
    r"(\.\w+)*"                 # top-level domain              (optional, can have > 1)
    r"([\w\-._~/]*)*(?<!\.)"    # path, params, anchors, etc.   (optional)
)

cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com",
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co.",
    "https://google.co."
]
for c in cases:
    if regex.match(c):
        print(c, regex.match(c).span()[1] - regex.match(c).span()[0] == len(c))
    else:
        print(c, False)

Edit: added hyphens to the domain, as suggested by nickh.

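Solution 10:

Validate URLs with `urllib` and a Django-like regex

The Django URL validation regex is actually pretty good, but I needed to tweak it a little for my use case. Feel free to adapt it to yours!

Python 3.7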

import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url

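Explanation

The code validates only the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into the two corresponding parts, which are then matched against the corresponding regex terms.)
The netloc part stops before the first occurrence of a slash /, so the port number is still part of the netloc, e.g.: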

https://www.google.com:80/search?q=python
^^^^^   ^^^^^^^^^^^^^^^^^
  |             |      
  |             +-- netloc (aka "domain" in my code)
  +-- scheme
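To see that split in practice, here is a short sketch that prints the pieces urllib.parse.urlparse() returns for the URL in the diagram. It uses only the standard library; the variable name parts is arbitrary.

```python
from urllib.parse import urlparse

parts = urlparse("https://www.google.com:80/search?q=python")

print(parts.scheme)    # https
print(parts.netloc)    # www.google.com:80  (port is still included)
print(parts.hostname)  # www.google.com     (netloc without the port)
print(parts.port)      # 80
print(parts.path)      # /search
print(parts.query)     # q=python
```

The hostname and port attributes are convenient if you would rather validate the host and the port separately than match the whole netloc at once.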

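IPv4 addresses are validated as well.

IPv6 support

If you want the URL validator to also work with IPv6 addresses, do the following:

Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex
Add and not is_valid_ipv6(domain) to the last if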
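If you would rather not copy a regex, a rough stand-in for is_valid_ipv6 can be sketched with the standard-library ipaddress module. This swaps in a different technique from the regex in Markus Jarderot's answer, and it assumes you pass the bare address (urlparse(...).hostname), not the bracketed netloc with its port.

```python
import ipaddress
from urllib.parse import urlparse


def is_valid_ipv6(ip):
    """Return True if ip is a syntactically valid IPv6 address string."""
    try:
        ipaddress.IPv6Address(ip)
    except ValueError:
        # AddressValueError is a subclass of ValueError
        return False
    return True


# hostname strips the square brackets and the port from the netloc
host = urlparse("http://[2001:db8::1]:8080/index.html").hostname
print(host, is_valid_ipv6(host))
print(is_valid_ipv6("www.google.com"))
```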

Examples

Here are some examples of the regex for the netloc (aka domain) part in action:

IPv4 and alphanumeric: https://regex101.com/r/A326u1/5
IPv6: https://regex101.com/r/lKIIgq/1 (using the regex from Markus Jarderot's answer)

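Solution 11:

Pydantic can be used for this. I'm not very used to it, so I can't speak to its limitations. It is an option, though, and no one had suggested it.

I have seen that many people asked about ftp and file URLs in previous answers, so I recommend reading the documentation, as Pydantic has many validation types such as FileUrl, AnyUrl, and even database URL types.

A simple usage example: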

from requests import get, HTTPError, ConnectionError
from pydantic import BaseModel, AnyHttpUrl, ValidationError


class MyConfModel(BaseModel):
    URI: AnyHttpUrl


try:
    myAddress = MyConfModel(URI="http://myurl.com/")
    # str() keeps this working on Pydantic v2, where the field is a URL object rather than a str
    req = get(str(myAddress.URI), verify=False)
    print(myAddress.URI)
except ValidationError:
    print('Invalid destination')

Pydantic also raises an exception (pydantic.ValidationError) that can be used to handle errors.

I have tested it with these patterns:

http://localhost (pass)
http://localhost:8080 (pass)
http://example.com (pass)
http://user:password@example.com (pass)
http://_example.com (pass)
http://&example.com (fail)
http://-example.com (fail)

Solution 12:

All of the solutions above treat a string like "http://www.google.com/path,www.yahoo.com/path" as valid. This solution always works as it should.

import re

# URL-link validation
ip_middle_octet = r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    r"(?P<private_ip>"
    # IP address exclusion
    # private & local networks
    r"(?:localhost)|"
    r"(?:(?:10|127)" + ip_middle_octet + r"{2}" + ip_last_octet + r")|"
    r"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + r")|"
    r"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + r"))"
    r"|"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?P<public_ip>"
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"" + ip_middle_octet + r"{2}"
    r"" + ip_last_octet + r")"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    # query string
    r"(?:\?\S*)?"
    r"$",
    re.UNICODE | re.IGNORECASE
)


def url_validate(url):
    """ URL string validation: returns a match object, or None if the URL is rejected
    """
    return URL_PATTERN.match(url)
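The pattern above separates private and public IP hosts through its private_ip and public_ip groups. For hosts that are plain IP literals, a similar distinction can be sketched with the standard-library ipaddress module. This is a different technique from the regex, the helper name classify_ip_host is hypothetical, and the module's definition of "private" may not match the regex's ranges exactly.

```python
import ipaddress


def classify_ip_host(host):
    """Return 'private', 'public', or None when host is not an IP literal."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # a host name such as "example.com" or "localhost", not an IP literal
        return None
    return "private" if address.is_private else "public"


for host in ("10.0.0.1", "192.168.1.1", "8.8.8.8", "example.com"):
    print(host, classify_ip_host(host))
```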

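Solution 13:

Not directly relevant, but you often need to decide whether some token could be a URL, even if it is not 100% correctly formed (e.g. with the https part omitted). I read this thread and did not find a solution, so I am posting my own here for the sake of completeness.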

def get_domain_suffixes():
    import requests
    res=requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst=set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains=line.split('.')
            cand=domains[-1]
            if cand:
                lst.add('.'+cand)
    return tuple(sorted(lst))

domain_suffixes=get_domain_suffixes()

def reminds_url(txt:str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    
    """
    ltext=txt.lower().split('/')[0]
    return ltext.startswith(('http','www','ftp')) or ltext.endswith(domain_suffixes)

Solution 14:

Use this example to arrive at your own definition of a "URL" and apply it everywhere in your code:

#         DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
#                 Version 2, December 2004
#
# Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>
#
# Everyone is permitted to copy and distribute verbatim or modified
# copies of this license document, and changing it is allowed as long
# as the name is changed.
#
#         DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
# TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
#
# 0. You just DO WHAT THE FUCK YOU WANT TO.
#
# Copyright © 2023 Anthony anthony@example.com
#
# This work is free. You can redistribute it and/or modify it under the
# terms of the Do What The Fuck You Want To Public License, Version 2,
# as published by Sam Hocevar. See the LICENSE file for more details.

import operator as op

from urllib.parse import (
    ParseResult,
    urlparse,
)

import attrs
import pytest

from phantom import Phantom
from phantom.fn import compose2


def is_url_address(value: str) -> bool:
    return any(urlparse(value))


class URL(str, Phantom, predicate=is_url_address):
    pass


# presume that an empty URL is a nonsense
def test_empty_url():
    with pytest.raises(TypeError, match="Could not parse .* from ''"):
        URL.parse("")


# is it enough now?
def test_url():
    assert URL.parse("http://")


scheme_and_netloc = op.attrgetter("scheme", "netloc")


def has_scheme_and_netloc(value: ParseResult) -> bool:
    return all(scheme_and_netloc(value))
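The listing above defines has_scheme_and_netloc but never applies it. The sketch below shows how the stricter definition differs from the any()-based one, using only the standard library (no phantom or pytest). The wrapper name is_strict_url is hypothetical.

```python
import operator as op
from urllib.parse import urlparse

scheme_and_netloc = op.attrgetter("scheme", "netloc")


def is_strict_url(value):
    """True only when both scheme and netloc are non-empty."""
    return all(scheme_and_netloc(urlparse(value)))


for candidate in ("http://", "https://stackoverflow.com"):
    # any(...) is the loose definition used by is_url_address above
    print(candidate, any(urlparse(candidate)), is_strict_url(candidate))
```

"http://" satisfies the loose any() predicate because its scheme is non-empty, but it fails the stricter one because its netloc is empty.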
