Q: Imgur parser

Code Review user

Asked 2018-09-25 02:02:20

Answers: 2 | Views: 2.8K | Followers: 0 | Votes: 10

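I'm very new to Python, and I've been working through some Edabit challenges to get better at problem solving. I just finished a semi-difficult challenge and would appreciate some feedback.

The challenge itself:

Create a function that takes an imgur link (as a string) and extracts the unique id and type. Return an object containing the unique id, and a string indicating what type of link it is. The link could be pointing to:

An album (e.g. http://imgur.com/a/cjh4E)
A gallery (e.g. http://imgur.com/gallery/59npG)
An image (e.g. http://imgur.com/OzZUNMM)
An image (direct link) (e.g. http://i.imgur.com/altd8Ld.png)

Examples

"http://imgur.com/a/cjh4E" ➞ { id: "cjh4E", type: "album" }
"http://imgur.com/gallery/59npG" ➞ { id: "59npG", type: "gallery" }
"http://i.imgur.com/altd8Ld.png" ➞ { id: "altd8Ld", type: "image" }

I came up with the following.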

import re

def imgurUrlParser(url):

    url_regex          =    "^[http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/|www\.]*[imgur|i.imgur]*\.com"
    url = re.match(url_regex, url).string

    gallery_regex     =     re.match(url_regex + "(\/gallery\/)(\w+)", url)
    album_regex       =     re.match(url_regex + "(\/a\/)(\w+)", url)
    image_regex       =     re.match(url_regex + "\/(\w+)", url)
    direct_link_regex =     re.match(url_regex + "(\w+)(\.\w+)", url)

    if gallery_regex:
        return { "id" : gallery_regex.group(2), "type" : "gallery" } 
    elif album_regex:
        return { "id" : album_regex.group(2), "type" : "album" }
    elif image_regex:
        return { "id" : image_regex.group(1), "type" : "image" } 
    elif direct_link_regex:
        return { "id" : direct_link_regex.group(1), "type" : "image"}

Tags: python, python-3.x, programming-challenge, regex, url

2 Answers

Code Review user

Answered 2018-09-25 04:55:40

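According to PEP 8, the official Python style guide, function names should be lower_case_with_underscores. Furthermore, the function parses a URL rather than creating a parser object, so its name should be a verb phrase rather than a noun phrase.

By RFC 1738, the scheme and host parts of a URL are case-insensitive. It is also permissible for a URL to include a redundant port number.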
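As a rough illustration of those two points, here is a minimal sketch of a pattern that tolerates any letter case in the scheme and host, plus an optional port, while keeping the path case-sensitive. The names `ALBUM_URL` and `album_id` are hypothetical and only cover album links; the scoped `(?i:...)` group needs Python 3.6 or later.

```python
import re

# Hypothetical pattern for this sketch: only the scheme and host sit inside
# the scoped (?i:...) group, so only they are matched case-insensitively.
# (?::\d+)? allows an optional port such as ":80".
ALBUM_URL = re.compile(r'(?i:https?://imgur\.com)(?::\d+)?/a/(\w+)')

def album_id(url):
    """Return the album id, or None if the URL is not an album link."""
    match = ALBUM_URL.fullmatch(url)
    return match.group(1) if match else None

# Mixed-case scheme/host and a redundant port are tolerated.
print(album_id('HtTP://imgur.COM:80/a/cjh4E'))
# The path stays case-sensitive, so "/A/" is not treated as an album prefix.
print(album_id('http://imgur.com/A/cjh4E'))
```

Keeping the flag scoped matters because Imgur ids are themselves case-sensitive, so applying `re.IGNORECASE` to the whole pattern would blur a distinction the path relies on.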

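Imgur also has partnerships with certain other sites. For example, when you upload an image through the question editor on a Stack Exchange site, it ends up on https://i.stack.imgur.com.

The different rules have a lot in common. Consider merging them into a single regular expression. Use named capture groups to avoid magic group numbers.

A docstring with doctests would be very useful for this function.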

import re

def parse_imgur_url(url):
    """
    Extract the type and id from an Imgur URL.

    >>> parse_imgur_url('http://imgur.com/a/cjh4E')
    {'id': 'cjh4E', 'type': 'album'}
    >>> parse_imgur_url('HtTP://imgur.COM:80/gallery/59npG')
    {'id': '59npG', 'type': 'gallery'}
    >>> parse_imgur_url('https://i.imgur.com/altd8Ld.png')
    {'id': 'altd8Ld', 'type': 'image'}
    >>> parse_imgur_url('https://i.stack.imgur.com/ELmEk.png')
    {'id': 'ELmEk', 'type': 'image'}
    >>> parse_imgur_url('http://not-imgur.com/altd8Ld.png')
    Traceback (most recent call last):
      ...
    ValueError: "http://not-imgur.com/altd8Ld.png" is not a valid imgur URL
    >>> parse_imgur_url('tftp://imgur.com/gallery/59npG')
    Traceback (most recent call last):
      ...
    ValueError: "tftp://imgur.com/gallery/59npG" is not a valid imgur URL
    >>> parse_imgur_url('Blah')
    Traceback (most recent call last):
      ...
    ValueError: "Blah" is not a valid imgur URL
    """
    match = re.match(
        r'^(?i:https?://(?:[^/:]+\.)?imgur\.com)(:\d+)?'
        r'/(?:(?P<album>a/)|(?P<gallery>gallery/))?(?P<id>\w+)',
        url
    )
    if not match:
        raise ValueError('"{}" is not a valid imgur URL'.format(url))
    return {
        'id': match.group('id'),
        'type': 'album' if match.group('album') else
                'gallery' if match.group('gallery') else
                'image',
    }

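Note that the regular expression above relies on the (?aiLmsux-imsx:...) feature of Python 3.6, and the doctests rely on the predictable ordering of dictionary keys in Python 3.6 / 3.7.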
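If the dependence on dictionary ordering is a concern, one option is to compare against a dict with `==` inside the doctest, so the printed key order no longer matters. The sketch below shows that style and one way to run a function's doctests programmatically; `classify_path` and `run_doctests` are hypothetical helpers invented for this illustration, not part of the answer's code.

```python
import doctest

def classify_path(path):
    """
    Classify an Imgur-style URL path (hypothetical helper for this sketch).

    Comparing with == keeps the doctest independent of dict key order.

    >>> classify_path('/a/cjh4E') == {'id': 'cjh4E', 'type': 'album'}
    True
    >>> classify_path('/OzZUNMM') == {'id': 'OzZUNMM', 'type': 'image'}
    True
    """
    parts = [p for p in path.split('/') if p]
    if len(parts) == 2 and parts[0] == 'a':
        return {'id': parts[1], 'type': 'album'}
    return {'id': parts[-1], 'type': 'image'}

def run_doctests(func):
    """Run the doctests in func's docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)

results = run_doctests(classify_path)
print(results)
```

For a whole file, running `python -m doctest -v yourfile.py` from the command line is usually simpler than a custom runner.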

Votes: 10

Code Review user

Answered 2018-09-26 00:00:27

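The other answers are great, but I'll address something more fundamental: regex is not the right tool for URL parsing. Python has very good built-in modules for this. Take advantage of them. urlparse is great!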

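import re
from urllib.parse import urlparse

# Accept imgur.com, i.imgur.com and i.stack.imgur.com. The dots are escaped
# so that they only match a literal ".".
acceptable_hostname_regex = re.compile(r"(i\.(stack\.)?)?imgur\.com")

def parse_imgur_url(url):
    parsed = urlparse(url)

    # hostname is None for strings without a network location, e.g. "Blah";
    # fullmatch also rejects hosts such as "imgur.com.example.org".
    if not acceptable_hostname_regex.fullmatch(parsed.hostname or ""):
        raise ValueError(f"The string {url} is not a valid imgur link")

    path_components = [c for c in parsed.path.split("/") if c]

    if len(path_components) == 1:
        image_name = path_components[0]
        # Drop the file extension if there is one ("altd8Ld.png" -> "altd8Ld").
        id = image_name.partition(".")[0]
        type = "image"

    elif len(path_components) == 2:
        type = path_components[0]
        id = path_components[1]

        type_overwrite_table = { "a": "album" }

        type = type_overwrite_table.get(type, type)

    else:
        raise ValueError(f"The imgur link {url} has an unexpected number of path components.")

    return { 'id': id, 'type': type }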
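To see what `urlparse` hands to a function like the one above, here is a small sketch that prints the two pieces the parsing logic relies on: the hostname and the non-empty path segments. The helper name `describe` is hypothetical and exists only for this illustration.

```python
from urllib.parse import urlparse

def describe(url):
    """Return the hostname and the non-empty path segments of a URL."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split('/') if s]
    # parsed.hostname is already lowercased and has the port stripped.
    return parsed.hostname, segments

print(describe('http://imgur.com/a/cjh4E'))
print(describe('HTTP://I.IMGUR.COM:80/altd8Ld.png'))
```

Because `hostname` comes back lowercased and without the port, the case and port concerns from RFC 1738 are handled before any pattern matching is needed.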

Another issue: your local variables gallery_regex, album_regex, and so on don't actually store regex objects (of type re.Pattern) as their names suggest. Instead, they store re.Match objects.
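The distinction is easy to check interactively. In this minimal sketch, the variable names `gallery_pattern` and `gallery_match` are only suggestions for naming that reflects what each object is; `re.Pattern` and `re.Match` are available under those names in recent Python 3 versions.

```python
import re

# re.compile returns a compiled pattern object.
gallery_pattern = re.compile(r'/gallery/(\w+)')

# Searching or matching with it returns a match object, or None on failure.
gallery_match = gallery_pattern.search('http://imgur.com/gallery/59npG')
no_match = gallery_pattern.search('http://imgur.com/a/cjh4E')

print(isinstance(gallery_pattern, re.Pattern))
print(isinstance(gallery_match, re.Match))
print(gallery_match.group(1))
print(no_match)
```

Naming the results `..._match` makes the later `if gallery_match:` checks read naturally, since a failed match is `None`.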

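Votes: 2

Original page content provided by Code Review; the Chinese translation was supported by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://codereview.stackexchange.com/questions/204316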
