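I'm writing a small script in Python, but since I'm a beginner I've got stuck on one part: I need to get the timing and the text from a .srt file. For example, from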
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
I need to get:
00:00:01,000 --> 00:00:04,074
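and
Subtitles downloaded from www.OpenSubtitles.org.
I've already managed to write the regex for the timing, but I'm stuck on the text. I tried using a lookbehind on the timing:
( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+
but it had no effect. Personally, I think a lookbehind is the right way to solve this, but I can't figure out how to write it correctly. Can anyone help me? Thanks.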
Answered 2014-05-12 23:32:31
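Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like this:
the subtitle number
the start and end time codes, separated by " --> "
one or more lines of subtitle content
a blank line
... and repeat. Note the third item: after the time code you may have to capture 1, 2, or 20 lines of subtitle content.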
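So just take advantage of the structure. This way you can parse everything in one pass, without holding more than one line in memory at a time, and still keep all the information for each subtitle together.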
from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]
For example, using the example from the SRT page, I get:
res
Out[60]:
[['1\n',
  '00:02:17,440 --> 00:02:20,375\n',
  "Senator, we're making\n",
  'our final approach into Coruscant.\n'],
 ['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
I can massage that further into a list of meaningful objects:
from collections import namedtuple
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
 Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
Answered 2014-05-12 23:36:16
I disagree with @roippi. Regex is a perfectly good solution for text matching, and the regex for this problem is not complicated.
import re
# Read the whole file content into memory
with open(yoursrtfile) as f:
    content = f.read()
# Find all matches in the content.
# The first big (...) group captures the timing, \s+ matches the whitespace in between,
# and (.+) captures the text after that (only its first line, since . does not match a newline).
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
Answered 2017-01-26 16:52:01
Number: ^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
Text: [a-zA-Z]+
Hope this helps.
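As a rough sketch of how patterns like these might be applied line by line: the helper name `classify_line` and the sample text below are my own illustration, not part of any library. I have also assumed that any non-blank line which is neither a number nor a time line counts as text, because subtitle text usually contains more than the letters `[a-zA-Z]+` would match.

```python
import re

# Patterns for the number line and the time line (with {n} repeat counts).
NUMBER_RE = re.compile(r"^[0-9]+$")
TIME_RE = re.compile(
    r"^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> "
    r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$"
)


def classify_line(line):
    """Hypothetical helper: label one .srt line as blank, time, number or text."""
    line = line.strip()
    if not line:
        return "blank"
    if TIME_RE.match(line):
        return "time"
    if NUMBER_RE.match(line):
        # Caveat: a subtitle line made only of digits would also land here.
        return "number"
    # Assumption: everything else is subtitle text.
    return "text"


sample = """1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
"""

lines = sample.splitlines()
times = [l for l in lines if classify_line(l) == "time"]
texts = [l for l in lines if classify_line(l) == "text"]
print(times)
print(texts)
```

Classifying by line only works because each part of an .srt entry sits on its own line; it does not group text lines with their time code, so it suits quick extraction more than full parsing.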
https://stackoverflow.com/questions/23620423