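I have a docx file and I want to extract all the text it contains. The file includes many images; thanks to Tika, I can extract both the text of the document and the text inside the images.
What I need is to replace each image tag with its corresponding text.
I'm using Python and BeautifulSoup for this.
I'm leaving the XML output here in case anyone can help.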
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-09-24T10:14:00Z" />
<meta name="extended-properties:DocSecurity" content="4" />
<meta name="extended-properties:AppVersion" content="16.0000" />
<meta name="meta:paragraph-count" content="18" />
<meta name="Word-Count" content="1403" />
<meta name="meta:line-count" content="66" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Name" content="General" />
<meta name="Template" content="Normal" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Extended_MSFT_Method" content="Automatic" />
<meta name="Paragraph-Count" content="18" />
<meta name="meta:character-count-with-spaces" content="9386" />
<meta name="dc:title" content="Introduction to Blob storage - Object storage in Azure | Microsoft Docs" />
<meta name="modified" content="2018-09-24T10:14:00Z" />
<meta name="meta:author" content="Aya Kamel" />
<meta name="meta:creation-date" content="2018-09-24T10:14:00Z" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="Creation-Date" content="2018-09-24T10:14:00Z" />
<meta name="Character-Count-With-Spaces" content="9386" />
<meta name="Last-Author" content="Tulasi Menon" />
<meta name="Character Count" content="8001" />
<meta name="Page-Count" content="10" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_SiteId" content="72f988bf-86f1-41af-91ab-2d7cd011db47" />
<meta name="Application-Version" content="16.0000" />
<meta name="extended-properties:Template" content="Normal" />
<meta name="custom:Sensitivity" content="General" />
<meta name="Author" content="Aya Kamel" />
<meta name="publisher" content="" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Owner" content="aykame@microsoft.com" />
<meta name="meta:page-count" content="10" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Enabled" content="True" />
<meta name="cp:revision" content="2" />
<meta name="meta:word-count" content="1403" />
<meta name="dc:creator" content="Aya Kamel" />
<meta name="extended-properties:Company" content="" />
<meta name="dcterms:created" content="2018-09-24T10:14:00Z" />
<meta name="dcterms:modified" content="2018-09-24T10:14:00Z" />
<meta name="Last-Modified" content="2018-09-24T10:14:00Z" />
<meta name="Last-Save-Date" content="2018-09-24T10:14:00Z" />
<meta name="meta:character-count" content="8001" />
<meta name="Line-Count" content="66" />
<meta name="meta:save-date" content="2018-09-24T10:14:00Z" />
<meta name="Application-Name" content="Microsoft Office Word" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="creator" content="Aya Kamel" />
<meta name="meta:last-author" content="Tulasi Menon" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_SetDate" content="2018-09-23T15:37:34.0264484Z" />
<meta name="xmpTPg:NPages" content="10" />
<meta name="custom:MSIP_Label_f42aa342-8706-4288-bd11-ebb85995028c_Application" content="Microsoft Azure Information Protection" />
<meta name="Revision-Number" content="2" />
<meta name="extended-properties:DocSecurityString" content="ReadOnlyEnforced" />
<meta name="dc:publisher" content="" />
<title>Introduction to Blob storage - Object storage in Azure | Microsoft Docs</title>
</head>
<body><p><a name="_GoBack" />Manage Azure Blob Storage resources with Storage Explorer</p>
<p> </p>
<h1>Overview</h1>
<p><a href="https://learn.microsoft.com/en-us/azure/storage/blobs/storage-dotnet-how-to-use-blobs">Azure Blob Storage</a> is a service for storing large amounts of unstructured data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. In this article, you'll learn how to use Storage Explorer to work with blob containers and blobs.</p>
<h1>Prerequisites</h1>
<p>To complete the steps in this article, you'll need the following:</p>
<p><a href="http://www.storageexplorer.com/">Download and install Storage Explorer</a></p>
<p>Connect to a Azure storage account or service</p>
<h1>Create a blob container</h1>
<p>All blobs must reside in a blob container, which is simply a logical grouping of blobs. An account can contain an unlimited number of containers, and each container can store an unlimited number of blobs.</p>
<p>The following steps illustrate how to create a blob container within Storage Explorer.</p>
<p>1. Open Storage Explorer.</p>
<p>2. In the left pane, expand the storage account within which you wish to create the blob container.</p>
<p>3. Right-click <b>Blob Containers</b>, and - from the context menu - select <b>Create Blob Container</b>.</p>
<p>4. A text box will appear below the <b>Blob Containers</b> folder. Enter the name for your blob container. See the Create the container and set permissions for information on rules and restrictions on naming blob containers.</p>
<p><img src="embedded:image2.jpg" alt="" /></p>
<p>5. Press <b>Enter</b> when done to create the blob container, or <b>Esc</b> to cancel. Once the blob container has been successfully created, it will be displayed under the <b>Blob Containers</b> folder for the selected storage account.</p>
<p><img src="embedded:image3.jpg" alt="" /></p>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Number of Tables" content="4 Huffman tables" />
<meta name="Compression Type" content="Baseline" />
<meta name="Data Precision" content="8 bits" />
<meta name="Number of Components" content="3" />
<meta name="tiff:ImageLength" content="124" />
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="Thumbnail Height Pixels" content="0" />
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 2 horiz/2 vert" />
<meta name="X Resolution" content="96 dots" />
<meta name="embeddedRelationshipId" content="rId10" />
<meta name="File Size" content="10645 bytes" />
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="File Name" content="apache-tika-10777883143042172609.tmp" />
<meta name="tiff:BitsPerSample" content="8" />
<meta name="Content-Type" content="image/jpeg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.jpeg.JpegParser" />
<meta name="Resolution Units" content="inch" />
<meta name="File Modified Date" content="Mon Jul 11 10:30:38 +00:00 2022" />
<meta name="resourceName" content="image2.jpg" />
<meta name="Image Height" content="124 pixels" />
<meta name="Thumbnail Width Pixels" content="0" />
<meta name="Image Width" content="290 pixels" />
<meta name="X-TIKA:embedded_depth" content="1" />
<meta name="X-TIKA:embedded_resource_path" content="/image2.jpg" />
<meta name="tiff:ImageWidth" content="290" />
<meta name="Y Resolution" content="96 dots" />
<title></title>
</head>
<body><div class="ocr">
4B tarcher
4 Bi Blob Containers
Bho,
TM Queues
> By Tables
FS 22016103 1050;
</div>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Number of Tables" content="4 Huffman tables" />
<meta name="Compression Type" content="Baseline" />
<meta name="Data Precision" content="8 bits" />
<meta name="Number of Components" content="3" />
<meta name="tiff:ImageLength" content="124" />
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="Thumbnail Height Pixels" content="0" />
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 2 horiz/2 vert" />
<meta name="X Resolution" content="96 dots" />
<meta name="embeddedRelationshipId" content="rId11" />
<meta name="File Size" content="10533 bytes" />
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert" />
<meta name="File Name" content="apache-tika-9241358526221461145.tmp" />
<meta name="tiff:BitsPerSample" content="8" />
<meta name="Content-Type" content="image/jpeg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.jpeg.JpegParser" />
<meta name="Resolution Units" content="inch" />
<meta name="File Modified Date" content="Mon Jul 11 10:30:38 +00:00 2022" />
<meta name="resourceName" content="image3.jpg" />
<meta name="Image Height" content="124 pixels" />
<meta name="Thumbnail Width Pixels" content="0" />
<meta name="Image Width" content="290 pixels" />
<meta name="X-TIKA:embedded_depth" content="1" />
<meta name="X-TIKA:embedded_resource_path" content="/image3.jpg" />
<meta name="tiff:ImageWidth" content="290" />
<meta name="Y Resolution" content="96 dots" />
<title></title>
</head>
<body><div class="ocr">
IM Queues
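</div></body></html>
For example, in this case I need to replace this tag:
<img src="embedded:image2.jpg" alt="" />
with the text of the div tag that has the class ocr and belongs to the document containing this meta tag:
<meta name="resourceName" content="image2.jpg" />
This is the div whose text should replace the img tag: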
<body><div class="ocr">
4B tarcher
4 Bi Blob Containers
Bho,
TM Queues
> By Tables
FS 22016103 1050;
</div>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
Posted on 2022-07-13 16:54:16
In the end I solved the problem myself; I'm leaving the code here:
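from bs4 import BeautifulSoup


# Written as a method (hence `self`); drop `self` to use it as a plain function.
def replace_image_for_text(self, xml):
    # Tika returns one <html> document per resource, concatenated into one string.
    soup = BeautifulSoup(xml, 'html.parser')
    headers = soup.find_all("head")
    bodies = soup.find_all("body")
    resource_names = []
    resource_texts = []
    # Pair each <head> with its <body>; only embedded resources have a resourceName meta tag.
    for header, body in zip(headers, bodies):
        resource_name = header.find("meta", {"name": "resourceName"})
        if resource_name:
            resource_names.append(resource_name['content'])
            resource_texts.append(body.get_text())
    # Replace each <img src="embedded:NAME"> with the OCR text of the matching resource.
    for resource_name, resource_text in zip(resource_names, resource_texts):
        img = soup.find("img", {"src": f"embedded:{resource_name}"})
        if img is not None:  # skip resources that have no matching <img> tag
            img.replace_with(resource_text)
    return soup

https://stackoverflow.com/questions/72968811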
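
For anyone who wants to try the idea without installing anything, here is a rough standard-library sketch of the same mapping (resourceName to OCR text, then substitution into the img tags). It swaps BeautifulSoup for plain regular expressions, so it only works if the Tika output has exactly the shape shown in the question. `SAMPLE`, `build_ocr_lookup` and `inline_ocr_text` are names I made up for illustration, and the sample is a trimmed stand-in rather than the real output.

```python
import re

# Trimmed, hypothetical stand-in for the concatenated Tika output in the question.
SAMPLE = (
    '<html><head><title>doc</title></head>'
    '<body><p>Step 4.</p><p><img src="embedded:image2.jpg" alt="" /></p>'
    '<p>Step 5.</p><p><img src="embedded:image3.jpg" alt="" /></p></body></html>'
    '<html><head><meta name="resourceName" content="image2.jpg" /></head>'
    '<body><div class="ocr">\nBlob Containers\nQueues\n</div>\n</body></html>'
    '<html><head><meta name="resourceName" content="image3.jpg" /></head>'
    '<body><div class="ocr">\nQueues\n</div></body></html>'
)

# One embedded document: its resourceName meta tag followed by its OCR <div>.
# The (?!</html>) guard keeps a match from running into the next document
# when an embedded resource has no OCR <div> of its own.
EMBEDDED_DOC = re.compile(
    r'<meta name="resourceName" content="([^"]+)"\s*/>'
    r'(?:(?!</html>).)*?<div class="ocr">(.*?)</div>',
    re.DOTALL,
)
IMG_TAG = re.compile(r'<img src="embedded:([^"]+)"[^>]*/>')


def build_ocr_lookup(xml):
    """Map each embedded resource name to its stripped OCR text."""
    return {name: text.strip() for name, text in EMBEDDED_DOC.findall(xml)}


def inline_ocr_text(xml):
    """Replace every <img src="embedded:NAME"> with the OCR text for NAME."""
    lookup = build_ocr_lookup(xml)
    # Images with no OCR text are left as they are.
    return IMG_TAG.sub(lambda m: lookup.get(m.group(1), m.group(0)), xml)


result = inline_ocr_text(SAMPLE)
print(result)
```

Building the lookup first means an image that appears more than once gets replaced everywhere, and an image with no OCR result is simply left in place. The embedded OCR documents are still appended at the end of the string afterwards; for real-world markup a proper parser such as BeautifulSoup is the safer choice.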