Currently, I am using this algorithm to parse text into a CSV file for market research.
    import re

    def text2csv(inname, outname):
        with open(inname, 'r', encoding="UTF-8") as f:
            data = f.read().strip().replace('\n', '\t').replace(',', '')
        print(data)
        print("check1")
        #info = re.findall(r'\t(.*?)\ is\ (.*?\t\t.*?)\t\t.*?Founded Year:\ (.*?)\t\tHeadquarters:\ (.*?)\t\tWebsite:\ (.*?)\t\t.*?\tFounders:\ (.*?)\t', data, re.MULTILINE)
        info = re.findall(r'\t(.*?)\ is\ (.*?\t\t.*?)\t\t.*?Founding Year:\ (.*?)\t\tHeadquarters:\ (.*?)\t\tWebsite:\ (.*?)\t\t.*?\tFounders:\ (.*?)\t', data, re.MULTILINE)
        print(data)
        with open(outname, 'w', encoding="UTF-8") as f:
            f.write('Name,Description,Founding Year,Headquarters,Website,Founders\n')
            for i in info:
                f.write(','.join(i).replace('\t', '') + '\n')

    text2csv("proptech.txt", "proptech.csv")

This algorithm works for text structured like this:
Big Hit Entertainment is among the top media & adtech startups for 2020
Big Hit Entertainment is a South Korean entertainment company that currently manages soloist Lee Hyun and idol group BTS. It helps bring the music and content from various sources in one place on its innovative platform.
Founded Year: 2005
Headquarters: Seoul, Seoul-t’ukpyolsi, South Korea
Website: www.ibighit.com
Twitter: www.twitter.com/bighitent
Founders: Bang Si-Hyuk
One97 Communications is among the top media & adtech startups for 2020
One97 is a startup that delivers mobile content and commerce services to millions of mobile consumers. It does so through India’s most widely deployed telecom applications cloud platform.
Founded Year: 2000
Headquarters: Noida, Uttar Pradesh, India
Website: www.one97.com
LinkedIn: www.linkedin.com/company/one97-communications-limited
Twitter: www.twitter.com/One97
Founders: Vijay Shekhar Sharma

But when the structure changes to the following, the algorithm seems to break:
NestAway is one of the top proptech startups for 2020
This Bangalore-based startup is a home rental network that aims to provide better rental solutions via design and technology. Their motto is to assist customers in booking, finding, and moving into a rental home of choice across Indian cities. All of this is made possible within an application. They also help their customer’s move-in, ask for services from tap leakage to door lock broken, rental payment, etc. Alongside this, they also assist customers in moving out. Here is some more information about this venture, one of the top proptech startups for 2020.
Founding Year: 2015
Headquarters: Bangalore, India
Website: www.nestaway.com
LinkedIn: www.linkedin.com/company/9334060/
Founders: Amarendra Sahu, Deepak Dhar, Jitendra Jagadev, Smruti Parida
Ucommune is one of the top proptech startups for 2020
This startup offers co-working space solutions. They also have provision for long-term leasing, hot desk, and corporate customization and professional solutions. They provide services to small-to-medium enterprises across China, Singapore, New York City, San Francisco in California, and London in the United Kingdom. Here is some more information about this venture, one of the top proptech startups for 2020.
Founding Year: 2015
Headquarters: Beijing, China
Website: www.ucommune.com
LinkedIn: www.linkedin.com/company/ucommune
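Founders: Mao Daqing

I have only just started using regular expressions, and I would appreciate help fixing my code so that it a) works and b) works more generally. I am currently collecting market research data from the lists on hexgn.com. Their website is not dynamic, so it is hard to simply scrape the data with a Google Chrome plugin. Sadly, the structure of their lists is not always the same. What stumps me in this case is that the second line does not contain the company name, and the first word in the description is " is ". Thanks!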
Posted on 2020-05-22 23:27:33
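You can certainly try to do this with regular expressions, but in my opinion a simpler approach does without them, assuming a company entry always begins with a description (which itself starts with "<Name> is ...") and ends with a "Founders: ..." line.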
To get the data as a list of dictionaries, do the following:
    # Labels that mark a "Field: value" line ("Founders:" is handled separately).
    FIELDS = [
        "Founding Year:",
        "Headquarters:",
        "Website:",
        "LinkedIn:"
    ]

    def new_company():
        # Empty record, so every company has the same keys in the same order.
        return {
            "Name": None,
            "Description": "",
            "Founding Year": None,
            "Headquarters": None,
            "Website": None,
            "Founders": None,
            "LinkedIn": None
        }

    def text2csv(inname, outname):
        # outname is not used yet; this version only builds and prints the records.
        companies = [new_company()]
        with open(inname, "r", encoding="UTF-8") as input_data:
            for line in input_data:
                if not line.startswith("Founders:"):
                    data = line.strip()
                    if data:
                        data_field = next((field.replace(":", "") for field in FIELDS if data.startswith(field)), None)
                        if data_field:
                            companies[-1][data_field] = data.split(":", 1)[-1].strip()
                        else:
                            # Any other non-empty line belongs to the description.
                            companies[-1]["Description"] += data
                            if companies[-1]["Name"] is None:
                                # The first description line starts with "<Name> is ...".
                                companies[-1]["Name"] = data.split(" is")[0]
                else:
                    # "Founders:" closes the current company and starts a new one.
                    companies[-1]["Founders"] = line.strip().split(":", 1)[-1].strip()
                    companies.append(new_company())
        # Drop the empty record appended after the last "Founders:" line.
        companies.pop()
        for company in companies:
            print()
            for field, value in company.items():
                print(field, value)

Produces:
Name NestAway
Description NestAway is one of the top proptech startups for 2020This Bangalore-based startup is a home rental network that aims to provide better rental solutions via design and technology. Their motto is to assist customers in booking, finding, and moving into a rental home of choice across Indian cities. All of this is made possible within an application. They also help their customer’s move-in, ask for services from tap leakage to door lock broken, rental payment, etc. Alongside this, they also assist customers in moving out. Here is some more information about this venture, one of the top proptech startups for 2020.
Founding Year 2015
Headquarters Bangalore, India
Website www.nestaway.com
Founders Amarendra Sahu, Deepak Dhar, Jitendra Jagadev, Smruti Parida
LinkedIn www.linkedin.com/company/9334060/
Name Ucommune
Description Ucommune is one of the ....
Founding Year 2015
Headquarters Beijing, China
Website www.ucommune.com
Founders Mao Daqing
LinkedIn www.linkedin.com/company/ucommune

You can then use these dictionaries to write a CSV file quite easily; Python has a built-in module to help with this: https://docs.python.org/3/library/csv.html
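As a minimal sketch of that last step, the dictionaries could be written out with `csv.DictWriter`. The helper name `write_companies_csv` and the `COLUMNS` list are assumptions made for illustration and are not part of the code above; the column names simply mirror the dictionary keys used there.

```python
import csv

# Assumed column order; mirrors the keys of the company dictionaries.
COLUMNS = ["Name", "Description", "Founding Year", "Headquarters",
           "Website", "Founders", "LinkedIn"]

def write_companies_csv(companies, outname):
    """Write a list of company dictionaries to a CSV file (hypothetical helper)."""
    # newline="" lets the csv module control line endings, as its docs recommend.
    with open(outname, "w", encoding="UTF-8", newline="") as f:
        # restval fills in missing keys; extrasaction="ignore" skips unknown keys.
        writer = csv.DictWriter(f, fieldnames=COLUMNS, restval="",
                                extrasaction="ignore")
        writer.writeheader()
        # None values are written as empty cells; values containing commas
        # are quoted automatically.
        writer.writerows(companies)
```

Calling `write_companies_csv(companies, outname)` at the end of `text2csv` would replace the printing loop. Because the writer quotes fields that contain commas, values such as "Bangalore, India" should survive intact, so stripping commas from the input would likely no longer be needed.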
https://stackoverflow.com/questions/61956149