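I am trying to extract some specific data from a text file using regular expressions in a shell script,
which means doing a multi-line grep. The tool I am using is pcregrep, so that I get compatibility with Perl's regular expressions.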
[58]Walid Chamoun Architects WLL
* [59]Map
* [60]Website
* [61]Email
* [62]Profile
* [63]Display Ad
Walid Chamoun Architects WLL
PO Box:
55803, Doha, Qatar
Location:
D-Ring Road, New Salata Shamail 40, Villa 340, Doha, Qatar
Tel:
(00974) 44568833
Fax:
(00974) 44568811
Mob:
(00974) 44568822
* Accurate Budget Costing
* Eco-Friendly Structural Design
* Exclusive & Unique Design
* Quality Architecture & Design
Company Profile
Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992,
committed to the concept of fully integrated design-build delivery of
projects. In late '90s, company established in-house architectural and
engineering services. As a full service provider, WCA expanded from
multi-family projects to industrial and office construction, which
added development services, including site acquisition and financing.
In 2001, WCA had opportunity and facilities to experience European
market and establish office in Puerto Banus, Marbella, Spain. By 2005,
WCA refined its structure to focus on specific market segments and new
office was opened in Doha, state of Qatar. From a solid foundation and
reputation built over eighteen years, WCA continually to provide
leadership in design-build through promotion of benefits and education
to its practitioners.
Project Planning: Project planning and investigation occurs before
design begins has greatest impact on cost, schedule and ultimately the
success of project. Creativity in Design: You can rely on our in-house
designers for design excellence in all aspects of the project. Our
designs have received recommendations and appreciations on national and
international levels. Creativity in Execution: Experienced in close
collaboration with the designers as part of the integrated team, our
construction managers, superintendents and field staff create value
throughout the project. Post Completion Services: Your needs can be
served through our skills and experience long after the last
construction crew has left the site. Performance: Corporate and
institutional clients, developers and public agencies repeatedly select
WCA on the basis of its consistent record of performance excellence.
Serving clients throughout the Middle East and GCC, WCA provides
complete planning for architectural, interior design and construction
on a single-responsibility basis. Our expertise spans industrial,
commercial, institutional, public and residential projects. Benefits of
Design-Build: Design-build is a system of contracting under which one
entity performs both design and construction. Benefits of design-build
project delivery include: Single point responsibility Early knowledge
of cost Time and Cost savings
Classification:
Architects - [64]Architects
[65]Al Ali Consulting & Engineering
* [66]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[67]Upgrade this free listing here
PO Box:
467, Doha, Qatar
Tel:
(00974) 44360011
Company Profile
Classification:
Architects - [68]Architects
[69]Al Gazeerah Consulting Engineering
* [70]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[71]Upgrade this free listing here
PO Box:
22414, Doha, Qatar
Tel:
(00974) 44352126
Company Profile
Classification:
Architects - [72]Architects
[73]Al Murgab Consulting Engineering
* [74]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[75]Upgrade this free listing here
PO Box:
2856, Doha, Qatar
Tel:
(00974) 44448623
Company Profile
Classification:
Architects - [76]Architects
References
Visible links
1. http://www.qatcom.com/useraccounts/login
2. http://www.qatcom.com/useraccounts/register
3. http://www.qatcom.com/
4. http://www.qatcom.com/
5. http://www.qatcom.com/qataryellowpages/map-of-doha
6. http://www.qatcom.com/qataryellowpages/about-qatcom
7. http://www.qatcom.com/qataryellowpages/advertise-with-qatcom
8. http://www.qatcom.com/qataryellowpages/advertiser_testimonials
9. http://www.qatcom.com/useraccounts/login
10. http://www.qatcom.com/useraccounts/register
11. http://www.qatcom.com/contact-qatcom
12. http://www.qatcom.com/qataryellowpages/companies
13. http://www.qatcom.com/classifications/index/A
14. http://www.qatcom.com/classifications/index/B
15. http://www.qatcom.com/classifications/index/C
16. http://www.qatcom.com/classifications/index/D
17. http://www.qatcom.com/classifications/index/E
18. http://www.qatcom.com/classifications/index/F
19. http://www.qatcom.com/classifications/index/G
20. http://www.qatcom.com/classifications/index/H
21. http://www.qatcom.com/classifications/index/I
22. http://www.qatcom.com/classifications/index/J
23. http://www.qatcom.com/classifications/index/K
24. http://www.qatcom.com/classifications/index/L
25. http://www.qatcom.com/classifications/index/M
26. http://www.qatcom.com/classifications/index/N
27. http://www.qatcom.com/classifications/index/O
28. http://www.qatcom.com/classifications/index/P
For sample data like this, I am trying to get the company details, namely
company name
po box
Tel
fax
mobile
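company profile
into a .csv file. I am also new to regular expressions and Linux. All I have managed to come up with is something like this:
\[\d*\][^\.]*[\(\d*\)\s\d*)]
Can anyone help me with this?
Update:
I came up with something like this:
$ awk '/^\[/ && ! /Upgrade this free listing/ {print $0} /:$/ && ! /Classification/ {printf $0 ; getline x ; print x}' file
but this is still not what I want.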
Posted on 2012-06-27 10:21:01
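You can do this in awk, but it would be better to parse the HTML itself. A good tool for that is Python with the Beautiful Soup module. But that's not very exciting, so here is how to do it the awkward (ha!) way: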
#!/usr/bin/awk -f
function trim(s) {
    gsub(/(^ +)|( +$)/, "", s)
    return s
}
BEGIN {
    count = 0
    fields[0] = "company"
    fields[1] = "pobox"
    fields[2] = "tel"
    fields[3] = "fax"
    fields[4] = "mob"
    fields[5] = "profile"
}
# company name
/^ +\[[0-9]+\].*$/ {
    sub(/^ +\[[0-9]+\]/, "")  # get rid of the Lynx reference
    # this is a bit naughty: our regex also matches this other link, but there's only one of them, so we just filter it
    if ($0 != "Upgrade this free listing here") data[count,"company"] = $0
}
# two line fields, easy!
/ +PO Box:$/ { getline; data[count,"pobox"] = $0 }
/ +Tel:$/ { getline; data[count,"tel"] = $0 }
/ +Fax:$/ { getline; data[count,"fax"] = $0 }
/ +Mob:$/ { getline; data[count,"mob"] = $0 }
# multi-line field, tricky because it can be empty
/^Company Profile$/ {
    getline  # skip empty line
    # process lines until encountering Classification field
    s = ""
    do {
        s = s $0
        if ((getline) <= 0) break  # stop at end of input instead of looping forever
    } while ($0 !~ / +Classification:$/)
    data[count,"profile"] = s
    count++  # the Classification field denotes the end of the company record
}
END {
    OFS = ","
    # output CSV header row
    for (key in fields) {
        printf "\"%s\",", fields[key]  # explicit %s so a % in the data is not read as a format
    }
    printf "\n"
    # output data
    for (i = 0; i < count; i++) {
        for (key in fields) {
            printf "\"%s\",", trim(data[i,fields[key]])
        }
        printf "\n"
    }
}
Save this as parse.awk, then invoke it with ./parse.awk < sample.txt. Out comes the CSV, like this:
"tel","fax","mob","profile","company","pobox",
"(00974) 44568833","(00974) 44568811","(00974) 44568822","Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992, committed to the blablabla","Walid Chamoun Architects WLL","55803, Doha, Qatar",
"(00974) 44360011","","","","Al Ali Consulting & Engineering","467, Doha, Qatar",
"(00974) 44352126","","","","Al Gazeerah Consulting Engineering","22414, Doha, Qatar",
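"(00974) 44448623","","","","Al Murgab Consulting Engineering","2856, Doha, Qatar",
There are some comments in there that hopefully explain what is going on. This runs in plain old awk and does not need any fancy gawk features. Bear in mind that awk arrays are arbitrarily ordered, which is why the columns above are not in the order the fields were declared. This could easily break in a whole bunch of ways on different input data, which is just one of the many reasons you should really parse the HTML rather than resort to lynx -dump shenanigans like this.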
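On the arbitrary ordering: one way to pin the column order is to loop over numeric indices instead of using `for (key in fields)`. The following is only a minimal sketch under some assumptions, not a drop-in replacement for the script. The names `ordered_csv`, `csvq` and `nfields` are made up for illustration, and the input is a hypothetical stream of already-extracted `key=value` lines rather than the lynx dump itself. It also doubles embedded double quotes (a common CSV convention) and drops the trailing comma.

```shell
#!/bin/sh
# Sketch: print a CSV header and one data row in a fixed column order.
# Assumption: stdin holds hypothetical "key=value" lines, one pair per line.
ordered_csv() {
  awk '
    # quote a value for CSV, doubling any embedded double quotes
    function csvq(s) {
      gsub(/"/, "\"\"", s)
      return "\"" s "\""
    }
    BEGIN {
      # split() fills fields[1..nfields] in the order written here
      nfields = split("company pobox tel fax mob profile", fields, " ")
    }
    {
      eq = index($0, "=")
      if (eq > 0) data[substr($0, 1, eq - 1)] = substr($0, eq + 1)
    }
    END {
      # a counting loop visits the columns in declaration order
      for (i = 1; i <= nfields; i++)
        printf "%s%s", csvq(fields[i]), (i < nfields ? "," : "\n")
      for (i = 1; i <= nfields; i++)
        printf "%s%s", csvq(data[fields[i]]), (i < nfields ? "," : "\n")
    }
  '
}

# Example call with made-up input in a shuffled order
printf 'tel=(00974) 44360011\ncompany=Al Ali Consulting & Engineering\npobox=467, Doha, Qatar\n' | ordered_csv
```

The same counting loop could replace both `for (key in fields)` loops in parse.awk, since its `fields` array is already indexed 0 to 5.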
https://stackoverflow.com/questions/11205308