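I have a problem with my HTML parser. I convert emails full of HTML code into clean text, and it works except for the "<style> content </style>" part: the content of that block is not removed at all, and I can't see what I'm doing wrong: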
# Remove any HTML code from our raw content
htmlparse = re.sub(r'<.*?>', '', clean) \
    .replace("&nbsp;", '') \
    .replace("&eacute;", 'é') \
    .replace("&egrave;", 'è')
clean_email = htmlparse

What it should actually remove is:
<style> .MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; } .DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none; padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; } .Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; } .GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; } .DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; } .GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent; height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; } .li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; } .TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px; color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px; padding-right: 5px; background-color: #BBD8FF; } .TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif; vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; } .Separator { background-repeat: repeat-x; background-position: center; background: #666666; } .tableDetail { padding: 0 0 0 0; spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; } .style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; } </style>
What it actually does is remove the <style> and </style> tags, but leave all of the stylesheet junk in the parsed output:
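.MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; } .DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none; padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; } .Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; } .GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; } .DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; } .GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent; height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; } .li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; } .TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px; color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px; padding-right: 5px; background-color: #BBD8FF; } .TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif; vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; } .Separator { background-repeat: repeat-x; background-position: center; background: #666666; } .tableDetail { padding: 0 0 0 0; spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; } .style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; } Hello, this is a test mail.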
Can anyone help me?
Thanks in advance, regards.
Posted on 2015-04-08 08:43:40
First remove the style block itself, then, in a second step, do whatever you want with the rest.
import re
some = """
<style>.MailHeader { font: normal 10pt Tahoma, Verdana, Sans-Serif;
vertical-align: top; padding-bottom: 0px; padding-top: 0px; spacing: 0px 0px 0px 0px; }
.DataHeader { font-family: Tahoma; font-size: 8pt; color: #666666; text-decoration: none;
padding-left: 15px; padding-right: 15px; border: solid 1px #E0E0E0; vertical-align: text-top; }
.Data { font: normal 8pt Tahoma,Verdana; padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0;
background: #F9F9F9; font-size: 8pt; color: #666666; height: 20px !important; }
.GridHeader { font: normal 8pt Tahoma,Verdana; padding-left: 6px; background: #DAEBFF; height: 20px; }
.DataRow { padding-left: 3px; padding-right: 15px; border: solid 1px #E0E0E0; font-size: 8pt; color: #003399; }
.GridRow { font: normal 8pt Tahoma, Verdana, Sans-serif; padding-left: 6px; background: transparent;
height: 20px !important; min-height: 1%; } .GridAltRow { font: normal 8pt Tahoma, Verdana, Sans-serif;
padding-left: 6px; background: #F9F9F9; height: 20px !important; min-height: 1%; }
.li { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-left: 5px; }
.TableHeader { font-family: Tahoma,calibri,verdana; font-size: 8pt; font-weight: bold; height: 22px;
color: #003399; border: solid 1px #E0E0E0; border-collapse: collapse; padding-left: 5px;
padding-right: 5px; background-color: #BBD8FF; }
.TableSubHeader { font: normal 8pt Tahoma, Verdana, Sans-Serif;
vertical-align: middle; padding-left: 3px; font-weight: bold; color: #666666; }
.Separator { background-repeat: repeat-x; background-position: center; background: #666666; }
.tableDetail { padding: 0 0 0 0;
spacing: 0 0 0 0; border-collapse: collapse; width: 750px; margin-left: 5px; border: solid 1px #E0E0E0; }
.style1 { font: normal 10pt Tahoma, Verdana, Sans-Serif; vertical-align: top; padding-bottom:
0px; padding-top: 0px; spacing: 0px 0px 0px 0px; height: 18px; }
</style>
<h1>Hello, this is a test mail.</h1>
"""
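# Non-greedy match, so that several <style> blocks do not swallow the text between them
some1 = re.sub(r'<style>.*?</style>', '', some, flags=re.DOTALL)
print(some1)

Result:
I have no name!@sla-334:~/stack_o$ python stack_o_html.py
<h1>Hello, this is a test mail.</h1>

Now do whatever you want with this HTML.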
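The two steps described above (drop the style block, then strip the remaining tags) could be combined roughly as follows. This is only a sketch under some assumptions: the names `STYLE_RE`, `TAG_RE` and `html_to_text` are made up for illustration, the pattern is loosened to tolerate attributes, upper case and stray spaces inside the tags, and `html.unescape` (Python 3) stands in for the hand-written `.replace()` calls from the question.

```python
import html
import re

# Hypothetical names, not from the original post.
# Matches a whole <style>...</style> or <script>...</script> block,
# allowing attributes, mixed case and spaces such as "< style >".
STYLE_RE = re.compile(
    r'<\s*(style|script)\b[^>]*>.*?<\s*/\s*\1\s*>',
    re.DOTALL | re.IGNORECASE,
)
# Matches any single remaining tag.
TAG_RE = re.compile(r'<[^>]+>')


def html_to_text(raw):
    """Return the visible text of an HTML fragment (regex-based sketch)."""
    without_blocks = STYLE_RE.sub('', raw)     # step 1: drop style/script blocks
    without_tags = TAG_RE.sub('', without_blocks)  # step 2: drop remaining tags
    text = html.unescape(without_tags)         # &eacute; -> é, &nbsp; -> U+00A0
    return ' '.join(text.split())              # collapse runs of whitespace


if __name__ == '__main__':
    sample = '<style>.MailHeader { font: normal 10pt Tahoma; }</style><h1>Hello, this is a test mail.</h1>'
    print(html_to_text(sample))
```

The order matters: if the tags are stripped first, the CSS rules are left behind as plain text, which is the behaviour the question describes. Regular expressions remain a rough tool for HTML, so this is likely to be adequate for machine-generated mails but not for arbitrary markup.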
Posted on 2015-04-08 08:53:50
I fixed it by taking the already parsed text and passing it in as the argument:

cleaner = re.sub(r'{.*}', '', clean_email) \
    .replace(".MailHeader", '')

I'll try your solution.
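Removing `{...}` and then one selector name by hand depends on the exact stylesheet in this particular mail. As a possible alternative, and not what either post above uses, the standard-library `html.parser.HTMLParser` can collect text while ignoring everything inside `style` and `script` elements. The class name `TextExtractor` and the function `extract_text` below are hypothetical, chosen only for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <style> and <script>."""

    SKIP = {'style', 'script'}

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace
        return ' '.join(' '.join(self._chunks).split())


def extract_text(raw):
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


if __name__ == '__main__':
    print(extract_text('<style>.MailHeader { font: normal 10pt Tahoma; }</style>'
                       '<h1>Hello, this is a test mail.</h1>'))
```

Joining chunks with a space keeps neighbouring table cells or paragraphs apart, at the cost of also inserting a space where inline tags split a word; whether that trade-off is acceptable depends on the mails being processed.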
https://stackoverflow.com/questions/29509216