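I need help removing paragraphs from this text file (https://www.gutenberg.org/files/768/768.txt) in Google Colab. I need the text to start after "ccx074@pglaf.org" and end before "End of the Project Gutenberg EBook of Wuthering Heights", so that I get an accurate total word count. The code I have so far is listed below.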
# download and install pyspark in Colab
!pip install -q pyspark
# download Wuthering Heights, by Emily Bronte
!wget -q https://www.gutenberg.org/files/768/768.txt
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('/content/768.txt')
fileName = os.path.join(baseDir, inputPath)
with open('/content/768.txt','r') as f:
    print(f.read())
Posted on 2020-08-25 22:42:26
Just split the string at the positions where the markers you are looking for are found.
!wget -q https://www.gutenberg.org/files/768/768.txt
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('768.txt')
fileName = os.path.join(baseDir, inputPath)
with open('768.txt','r') as f:
    text = f.read()
# Start just after the marker (assumes the marker is present; find() returns -1 otherwise)
start_loc = text.find("ccx074@pglaf.org") + len("ccx074@pglaf.org")
# Offset of the first "***" after the start, relative to start_loc
end_loc = text[start_loc:].find("***")
# Slice between the two indexes; newlines become spaces so adjacent words are not fused
text[start_loc:start_loc+end_loc].replace("\n"," ")
Posted on 2020-08-25 22:53:38
You can use a regular expression to extract the text between two strings:
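import re
with open('768.txt','r') as f:
    text = f.read()
start = "ccx074@pglaf.org"
end = "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS"
# re.DOTALL lets "." match newlines; an inline (?s) in mid-pattern is an error on Python 3.11+
# re.escape keeps the "." characters in the markers literal
m = re.search(f'{re.escape(start)}(.*){re.escape(end)}', text, re.DOTALL)
print(m.group(1))
https://stackoverflow.com/questions/63581223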
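Since the stated goal is an accurate word count, it may help to see the extraction and the count together. The following is only a minimal sketch, not taken from either answer: `extract_body` and `count_words` are hypothetical helper names, and `sample` is a made-up stand-in for the contents of 768.txt so the snippet runs without downloading anything.

```python
def extract_body(text, start_marker, end_marker):
    """Return the text strictly between start_marker and end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


def count_words(body):
    # split() with no arguments splits on any run of whitespace,
    # so line breaks never glue two words together.
    return len(body.split())


# Made-up sample standing in for the real file contents (assumption).
sample = (
    "header text ccx074@pglaf.org\n"
    "\n"
    "I have just returned\n"
    "from a visit.\n"
    "\n"
    "End of the Project Gutenberg EBook of Wuthering Heights\n"
)

body = extract_body(sample, "ccx074@pglaf.org",
                    "End of the Project Gutenberg EBook")
print(count_words(body))  # prints 7
```

For the real file, `sample` would be replaced by the string read from 768.txt. Counting with `split()` avoids having to strip newlines first, and the explicit `ValueError` makes a missing marker obvious instead of silently returning a wrong slice.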