例如,我有一堆HTML页面,如下所示:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
</head><body
>
<!--l. 125--><div class="crosslinks"><p class="noindent">[<a
href="chapter1.html" >next</a>] [<a
href="#tailcontent.html">tail</a>] [<a
href="/sciences/index.html" >up</a>] </p></div>
<h2 class="likechapterHead"><a
id="x2-1000"></a>Table des matières</h2>
<div class="tableofcontents">但是不可能将这些HTML页面中的所有法语口音转换成"Table des matières“中的重音,"è”代替"è“。
我试了两件事:
for i in $(ls *.html); do iconv -f iso-8859-1 -t utf8 $i > $i"_new"; mv -f $i"_new" $i; done=>口音未被转换
for i in $(ls *.html); do recode ..html $i; done=>我有以下错误:
recode: section5.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section6.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section7.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section8.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section9.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
...我不知道该怎么做才能转换所有这些法国口音?
有没有人有任何想法或建议来转换所有可能的法语口音?我想使用iconv、recode或sed命令。
更新1:以一个基本示例为例,下面是我为单个文件获得的消息:
$ recode ..html table_of_contents.html
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2' 怎么了?
更新2:这里是我原来的页面的输出:
$file -i index.html
$ index.html: text/x-tex; charset=iso-8859-1
index.html的负责人:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);如果我应用该命令:
$ recode -vfd u8..html index.html
Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done和
<!DOCTYPE html>
<html>
<head><title>Table des matires</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>如你所见,"è“已经消失了。
我能做些什么?
发布于 2022-02-13 15:25:30
假设源文件编码为UTF-8。以下命令在我的环境中工作:
$ recode -vfd u8..html index.html输出:
$ locale charmap
UTF-8
$ file -i index.html
index.html: text/html; charset=utf-8
$ recode -vfd u8..html index.html
Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done您可以使用命令选项以这种方式调试错误:
-v详细输出。查找发生错误的步骤非常有用。-f也强制完成。您可以使用原始的比较输出文件来确定哪个字符/位置会带来麻烦。-d,编码不转换ASCII字符。避免转换< > " &等html字符。如果编码/字符集是,则需要使用iso-8859-1更新:
$ recode -vfd iso-8859-1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
#Or use following.
$ recode -vfd lat1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... doneISO-8859-1在重新编码中有以下别名:
l1
lat1
latin1
Latin-1
819/CR-LF
CP819/CR-LF
CSISOLATIN1
IBM819/CR-LF
ISO8859-1
iso-ir-100
ISO_8859-1
ISO_8859-1:1987您可以在命令中使用上述任何一个。
https://stackoverflow.com/questions/71061611
复制相似问题