
Q: Regular expression for detecting subtitle errors

Stack Overflow user

Asked 2014-03-04 16:15:18

2 answers, 359 views, 0 followers, score 1

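I have a problem with some subtitles and need a way to detect a specific error. I think a regular expression would help, but I need help figuring it out. In this example of SRT-format subtitles, line #13 ends at 00:01:10,130 and line #14 starts at 00:01:10,129.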

13
00:01:05,549 --> 00:01:10,130
some text here.

14
00:01:10,129 --> 00:01:14,109
some other text here.

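The problem is that the next line must not start before the current one ends; when that happens, the overlay embedding algorithm does not work. I need to check my SRT files and correct them, but searching by hand through roughly 20 videos, each an hour long, is not an option. Especially since I need it 'yesterday' (: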

The format of SRT subtitles is very specific:

XX 
START --> END 
TEXT
EMPTY LINE

[line number (digits)][new line character]
[start and end times in 00:00:00,000 format, separated by _space__minusSign__minusSign__greaterThanSign__space_][new line character]
[text - can be any character - letter, digit, punctuation sign.. pretty much anything][new line character]
[new line character]

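I need to check whether a subtitle's end time is greater than the start time of the subtitle that follows it. Any help would be appreciated.

P.S. I can use Notepad++, Eclipse (Aptana), Python or JavaScript...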

regex

2 Answers

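Stack Overflow user

Answered 2014-03-04 16:31:50

Regular expressions can help you achieve what you want, but they cannot do it on their own. Regular expressions are meant for matching patterns, not for comparing numeric ranges.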
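A minimal sketch of that division of labour, assuming well-formed timing lines: the regular expression only extracts the timestamps, and ordinary integer comparison does the range check. The names `TIMING`, `to_ms` and `find_overlaps` are hypothetical, and the pattern would also match a timing-like string that happens to appear inside subtitle text.

```python
import re

# Matches one SRT timing line, e.g. "00:01:05,549 --> 00:01:10,130".
TIMING = re.compile(
    r"(\d{2,}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2,}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to total milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def find_overlaps(srt_text):
    """Return (previous_end_ms, next_start_ms) pairs where a cue starts too early."""
    overlaps = []
    previous_end = None
    for match in TIMING.finditer(srt_text):
        groups = match.groups()
        start, end = to_ms(*groups[:4]), to_ms(*groups[4:])
        # The regex found the numbers; the comparison itself is plain arithmetic.
        if previous_end is not None and start < previous_end:
            overlaps.append((previous_end, start))
        previous_end = end
    return overlaps
```

Run on the two cues from the question, this reports a single overlap: the previous cue ends at 70130 ms and the next one starts at 70129 ms.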

If I were you, I would do the following:

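1. Parse the file and put the start and end times in one data structure (call it DS_A) and the text in another (call it DS_B).
2. Sort DS_A in ascending order. This should guarantee that you won't have overlapping ranges. (This previous SO post should point you in the right direction.)
3. Iterate and write the following to your file: j DS_A[i] --> DS_A[i + 1] <newline> DS_B[j], where i is the loop counter for DS_A and j is the loop counter for DS_B.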
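The steps above can be sketched roughly as follows. This is one reading of the answer, not a definitive implementation: it assumes DS_A is a flat list of every timestamp in the file, that i advances by two per cue, that hours are two digits wide, and that cues are renumbered from 1. The function name `resort_timestamps` is hypothetical.

```python
import re

# Two-digit hours are assumed so that timestamps sort correctly as plain strings.
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}")


def resort_timestamps(srt_text):
    """Rebuild an SRT string after sorting all timestamps in ascending order."""
    ds_a, ds_b = [], []  # DS_A: flat list of timestamps, DS_B: text of each cue
    blocks = srt_text.replace("\r\n", "\n").strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        if len(lines) < 2:
            continue  # skip malformed blocks
        stamps = TIMESTAMP.findall(lines[1])
        if len(stamps) != 2:
            continue  # the second line is not a timing line
        ds_a.extend(stamps)
        ds_b.append("\n".join(lines[2:]))
    # Zero-padded, equal-width timestamps compare correctly as strings.
    ds_a.sort()
    out = []
    for j, text in enumerate(ds_b):
        i = 2 * j  # each cue consumes two consecutive timestamps
        out.append("%d\n%s --> %s\n%s" % (j + 1, ds_a[i], ds_a[i + 1], text))
    return "\n\n".join(out) + "\n"
```

For the two cues in the question, the sorted timestamps pair up as 00:01:05,549 --> 00:01:10,129 and 00:01:10,130 --> 00:01:14,109, so the overlap disappears. Note that a global sort can move times between cues if a subtitle is far out of order, so the output is worth checking.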

Score: 1

Stack Overflow user

Answered 2014-03-04 20:49:41

I ended up writing a short script to solve this. Here it is:

# -*- coding: utf-8 -*-
from datetime import datetime
import getopt, os, re, sys

count = 0  # number of corrected start times; incremented in parse_times
def fix_srt(inputfile):
  # Read and parse the input; report failures instead of silently writing empty files.
  try:
    with open( inputfile , 'r') as f:
      srt_file = f.read()
    parsed_file, errors_file = parse_srt(srt_file)
  except (IOError, ValueError) as e:
    print('Could not process "%s": %s' % ( inputfile, e ))
    return
  # os.path.splitext strips only the extension, so dots elsewhere in the name survive.
  base = os.path.splitext(inputfile)[0]
  outputfile1 = base + '_fixed.srt'
  outputfile2 = base + '_error.srt'
  with open( outputfile1 , 'w') as f:
    f.write(parsed_file)
  with open( outputfile2 , 'w') as f:
    f.write(errors_file)
  # A single-argument print(...) runs on Python 2.7 and Python 3 alike.
  print('Detected %s errors in "%s". Fixed file saved as "%s" (Errors only as "%s").'
        % ( count, inputfile, outputfile1, outputfile2 ))

previous_end_time = datetime.strptime("00:00:00,000", "%H:%M:%S,%f")
def parse_times(times):
  global previous_end_time
  global count
  _error = False
  _times = []
  for time_code in times:
    t = datetime.strptime(time_code, "%H:%M:%S,%f")
    _times.append(t)

  if _times[0] < previous_end_time:
    _times[0] = previous_end_time
    count += 1
    _error = True
  previous_end_time = _times[1]

  _times[0] =  _times[0].strftime("%H:%M:%S,%f")[:12]
  _times[1] = _times[1].strftime("%H:%M:%S,%f")[:12]

  return _times, _error

def parse_srt(srt_file):
  parsed_srt = []
  parsed_err = []
  for srt_group in re.sub('\r\n', '\n', srt_file).split('\n\n'):
    lines = srt_group.split('\n')
    # Only groups with an index line, a timing line and text are checked;
    # anything shorter (e.g. a trailing blank group) is passed through unchanged.
    if len(lines) >= 3:
      times = lines[1].split(' --> ')
      correct_times, error = parse_times(times)
      if error:
        clean_text = [ x.strip(' ') for x in lines[2:] ]
        srt_group = lines[0].strip(' ') + '\n' + ' --> '.join( correct_times ) + '\n' + '\n'.join( clean_text )
        parsed_err.append( srt_group )
    parsed_srt.append( srt_group )
  # Groups were split on a blank line, so rejoin them with one to keep valid SRT.
  return '\n\n'.join( parsed_srt ), '\n\n'.join( parsed_err )

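def main(argv):
  inputfile = None
  try:
    options, arguments = getopt.getopt(argv, "hi:", ["input="])
  except getopt.GetoptError:
    # Exit here; otherwise 'options' would be undefined in the loop below.
    print('Usage: srtfix.py -i <input file>')
    sys.exit(2)

  for o, a in options:
    if o == '-h':
      print('Usage: srtfix.py -i <input file>')
      sys.exit()
    elif o in ['-i', '--input']:
      inputfile = a
  # An input file is required.
  if inputfile is None:
    print('Usage: srtfix.py -i <input file>')
    sys.exit(2)
  fix_srt(inputfile)

if __name__ == '__main__':
  main( sys.argv[1:] )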

In case anyone needs it: save the code as, for example, srtfix.py and run it from the command line:

python srtfix.py -i "my srt subtitle.srt"

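I was lazy and used the datetime module to handle the time codes, so I'm not sure the script works for subtitles longer than 24 hours (: I'm also not sure when milliseconds were added to Python's datetime module; I'm using version 2.7.5, so the script may not work on earlier versions...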
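If the 24-hour limit is a concern, one possible alternative to datetime is to treat each time code as an integer number of milliseconds. The sketch below is not part of the script above; the helper names `timecode_to_ms` and `ms_to_timecode` are hypothetical, and it assumes time codes in the `HH:MM:SS,mmm` form described in the question.

```python
def timecode_to_ms(timecode):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds; hours may exceed 23."""
    clock, millis = timecode.strip().split(",")
    hours, minutes, seconds = clock.split(":")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def ms_to_timecode(total_ms):
    """Convert integer milliseconds back to 'HH:MM:SS,mmm'."""
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)
```

Integers compare directly with `<`, so the overlap check stays the same, and only string formatting is used, which should keep it independent of the Python version.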

Score: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/22166344
