
XML to SRT conversion does not work after installing pytube

Stack Overflow user
Asked 2021-08-14 06:14:40
Answers: 2 | Views: 620 | Followers: 0 | Votes: 1

I installed pytube to extract captions from some YouTube videos. Both of the snippets below return the captions as XML.

Code language: python
from pytube import YouTube
yt = YouTube('https://www.youtube.com/watch?v=4ZQQofkz9eE')
caption = yt.captions['a.en']
print(caption.xml_captions)

The second one follows the documentation:

Code language: python
yt = YouTube('http://youtube.com/watch?v=2lAe1cqCOXo')
caption = yt.captions.get_by_language_code('en')
caption.xml_captions

In both cases I get XML output, but when I call

Code language: python
print(caption.generate_srt_captions())

I get the error shown below. Could you help me get the captions in SRT format?

Code language: text
KeyError
~/anaconda3/envs/myenv/lib/python3.6/site-packages/pytube/captions.py in generate_srt_captions(self)
49         recompiles them into the "SubRip Subtitle" format.
50         """
51         return self.xml_caption_to_srt(self.xml_captions)
52 
53     @staticmethod

~/anaconda3/envs/myenv/lib/python3.6/site-packages/pytube/captions.py in xml_caption_to_srt(self, xml_captions)
81             except KeyError:
82                 duration = 0.0
83             start = float(child.attrib["start"])
84             end = start + duration
85             sequence_number = i + 1  # convert from 0-indexed to 1.

KeyError: 'start'
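
The traceback shows the converter reading a `start` attribute from each direct child of the XML root. A minimal sketch of how one might check what the caption XML actually contains is below. `SAMPLE_XML` is a hand-written stand-in (an assumption, not real YouTube output) shaped like the format the answers in this thread describe, and `summarize_caption_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ElementTree

# Hand-written stand-in for caption XML (assumed shape, not real output):
# cues are <p> elements under <body>, carrying "t" and "d" attributes.
SAMPLE_XML = """<timedtext format="3">
  <head></head>
  <body>
    <p t="1000" d="2000">Hello</p>
    <p t="3500" d="1500">world</p>
  </body>
</timedtext>"""


def summarize_caption_xml(xml_text):
    """Return the root's direct child tags and the attribute names used on <p> cues."""
    root = ElementTree.fromstring(xml_text)
    child_tags = [child.tag for child in root]
    cue_attributes = sorted(
        {name for p in root.findall("body/p") for name in p.attrib}
    )
    return child_tags, cue_attributes


child_tags, cue_attributes = summarize_caption_xml(SAMPLE_XML)
print(child_tags)      # direct children of the root element
print(cue_attributes)  # attribute names found on the cues
```

With real data, `caption.xml_captions` would be passed instead of `SAMPLE_XML`. If the root's direct children turn out to be `head` and `body` rather than the cues themselves, then iterating over `list(root)` and reading `child.attrib["start"]` would raise the `KeyError` shown above.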
2 Answers

Stack Overflow user

Answered 2021-08-28 18:29:56

This is a bug in the library itself. Everything below was done with pytube 11.01. In captions.py, on line 76, replace:

Code language: python
for i, child in enumerate(list(root)):

with:

Code language: python
for i, child in enumerate(list(root.findall('body/p'))):

Then, on line 83, replace:

Code language: python
duration = float(child.attrib["dur"])

with:

Code language: python
duration = float(child.attrib["d"])

Then, on line 86, replace:

Code language: python
start = float(child.attrib["start"])

with:

Code language: python
start = float(child.attrib["t"])

If the output then contains only sequence numbers and timestamps but no caption text, replace line 77:

Code language: python
text = child.text or ""

with:

Code language: python
text = ''.join(child.itertext()).strip()
if not text:
    continue

This works for me with Python 3.9 and pytube 11.01. Good luck!
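
For readers who would rather not edit files under site-packages, the same idea can be sketched as a standalone function that takes the XML string from `caption.xml_captions`. This is a sketch under assumptions, not pytube's own code: `timedtext_to_srt` and `ms_to_srt_time` are hypothetical names, the `t` and `d` values are assumed to be milliseconds (the other answer in this thread divides them by 1000), and the sample XML is hand-written.

```python
import xml.etree.ElementTree as ElementTree
from html import unescape


def ms_to_srt_time(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(ms))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def timedtext_to_srt(xml_text):
    """Convert caption XML with <body><p t=".." d=".."> cues into SRT text."""
    root = ElementTree.fromstring(xml_text)
    segments = []
    for p in root.findall("body/p"):
        # itertext() also picks up text held in nested child elements.
        text = "".join(p.itertext()).strip()
        if not text:
            continue  # skip empty cues so numbering stays consecutive
        start = float(p.attrib["t"])
        duration = float(p.attrib.get("d", 0.0))
        caption = unescape(" ".join(text.split()))  # collapse whitespace
        segments.append(
            f"{len(segments) + 1}\n"
            f"{ms_to_srt_time(start)} --> {ms_to_srt_time(start + duration)}\n"
            f"{caption}\n"
        )
    return "\n".join(segments).strip()


# Hand-written sample (assumed shape), including an empty cue and nested text.
sample = (
    "<timedtext><body>"
    '<p t="1000" d="2000">Hello</p>'
    '<p t="3000" d="500"></p>'
    '<p t="3500" d="1500"><s>wor</s><s>ld</s></p>'
    "</body></timedtext>"
)
print(timedtext_to_srt(sample))
```

The timestamp helper uses integer arithmetic on milliseconds, so it does not depend on string-formatting a fractional second. With a real video, `timedtext_to_srt(caption.xml_captions)` would take the place of `caption.generate_srt_captions()`.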

Votes: 2

Stack Overflow user

Answered 2021-08-31 19:23:39

I did some work on the source code of captions.py. Simply replace the entire contents of that file with the following code:

Code language: python
import math
import os
import time
import xml.etree.ElementTree as ElementTree
from html import unescape
from typing import Dict, Optional

from pytube import request
from pytube.helpers import safe_filename, target_directory


class Caption:

    def __init__(self, caption_track: Dict):

        self.url = caption_track.get("baseUrl")

        name_dict = caption_track['name']
        if 'simpleText' in name_dict:
            self.name = name_dict['simpleText']
        else:
            for el in name_dict['runs']:
                if 'text' in el:
                    self.name = el['text']

        self.code = caption_track["vssId"]

        self.code = self.code.strip('.')

    @property
    def xml_captions(self) -> str:

        return request.get(self.url)

    def generate_srt_captions(self) -> str:

        return self.xml_caption_to_srt(self.xml_captions)

    @staticmethod
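    def float_to_srt_time_format(d: float) -> str:
        # d is in milliseconds (the "t"/"d" attribute values), so convert to
        # seconds before splitting into whole seconds and the fraction.
        fraction, whole = math.modf(d / 1000)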
        time_fmt = time.strftime("%H:%M:%S,", time.gmtime(whole))
        ms = f"{fraction:.3f}".replace("0.", "")
        return time_fmt + ms

    def xml_caption_to_srt(self, xml_captions: str) -> str:
        """Convert caption XML with <body><p t=".." d=".."> cues to SRT."""
        segments = []
        root = ElementTree.fromstring(xml_captions)
        # Look the <p> cues up once instead of on every iteration.
        paragraphs = root.findall('body/p')
        count_line = 0
        for i, child in enumerate(paragraphs):
            # itertext() also collects text held in nested child elements.
            text = ''.join(child.itertext()).strip()
            if not text:
                continue  # skip empty cues; they get no sequence number
            count_line += 1
            caption = unescape(text.replace("\n", " ").replace("  ", " "))
            try:
                duration = float(child.attrib["d"])
            except KeyError:
                duration = 0.0
            start = float(child.attrib["t"])
            try:
                # End this cue where the cue two positions later starts.
                end = float(paragraphs[i + 2].attrib['t'])
            except (IndexError, KeyError, ValueError):
                # Near the end of the track, fall back to start + duration.
                end = start + duration
            line = "{seq}\n{start} --> {end}\n{text}\n".format(
                seq=count_line,
                start=self.float_to_srt_time_format(start),
                end=self.float_to_srt_time_format(end),
                text=caption,
            )
            segments.append(line)

        return "\n".join(segments).strip()

    def download(
        self,
        title: str,
        srt: bool = True,
        output_path: Optional[str] = None,
        filename_prefix: Optional[str] = None,
    ) -> str:

        if title.endswith(".srt") or title.endswith(".xml"):
            filename = ".".join(title.split(".")[:-1])
        else:
            filename = title

        if filename_prefix:
            filename = f"{safe_filename(filename_prefix)}{filename}"

        filename = safe_filename(filename)

        filename += f" ({self.code})"

        if srt:
            filename += ".srt"
        else:
            filename += ".xml"

        file_path = os.path.join(target_directory(output_path), filename)

        with open(file_path, "w", encoding="utf-8") as file_handle:
            if srt:
                file_handle.write(self.generate_srt_captions())
            else:
                file_handle.write(self.xml_captions)

        return file_path

    def __repr__(self):
        return '<Caption lang="{s.name}" code="{s.code}">'.format(s=self)
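
One detail of the code above that is easy to miss is how it picks each cue's end time: it uses the start time of the cue two positions later, and only falls back to start plus duration when no such cue exists. A small sketch of that rule in isolation follows; `cue_end_times` is a hypothetical helper and the cue values are made up for illustration.

```python
def cue_end_times(cues, lookahead=2):
    """Compute end times for (start_ms, duration_ms) cues.

    Each cue ends where the cue `lookahead` positions later starts; the
    last cues, which have no such successor, end at start + duration.
    """
    ends = []
    for i, (start, duration) in enumerate(cues):
        if i + lookahead < len(cues):
            ends.append(cues[i + lookahead][0])
        else:
            ends.append(start + duration)
    return ends


# Made-up overlapping cues: each starts 2 s after the previous, lasts 4 s.
cues = [(0, 4000), (2000, 4000), (4000, 4000), (6000, 4000)]
print(cue_end_times(cues))  # [4000, 6000, 8000, 10000]
```

Whether this look-ahead suits a given video depends on how its cues overlap. Setting `lookahead=1` in the sketch would end each cue when the next one begins, while using start plus duration everywhere matches the first answer.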
Votes: 1

The original page content is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68780808