Q: Python regex for parsing financial data

Stack Overflow user

Asked 2015-09-04 17:12:39

Answers: 4 · Views: 582 · Followers: 0 · Votes: 0

I'm relatively new to regex (and for some reason I always struggle with it).

I have text of this form:

David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...

Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...

Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...

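Analysing the text reveals the following structure:

1. The two or more words at the start of the sentence, before the first comma, are the name of the person involved in the transaction.
2. The one or more words before (sold|bought|exercised|sold post-exercise) are the person's title.
3. The presence of any one of (sold|bought|exercised|sold post-exercise) after the title identifies the transaction type.
4. The first numeric string after the transaction type (sold|bought|exercised|sold post-exercise) denotes the size of the transaction.
5. 'price of ' precedes a numeric string, which specifies the price at which the deal was struck.

My question is:

How can I use this knowledge (and regex) to write a function that parses similar text and returns the variables of interest (listed 1-5 above)?

Pseudocode for the function I want to write: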

def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...

    pass

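How would I use regex to implement the function so that it returns the variables of interest when passed text that conforms to the structure above?

[Edit]

For some reason I have been struggling with regex for a while. If I am to learn from the correct answer here on S.O., it would be much better if it came with an explanation of why the magical expression (sorry, regexpr) actually works.

I want to actually learn this stuff rather than copy and paste expressions.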

python

regex

4 Answers

Stack Overflow user

Accepted answer

Posted 2015-09-04 17:42:58

You can use the following regex:

(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)

Demo

Python:

import re

financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...

Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...

Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""

# A raw string keeps backslashes such as \s and \d intact for the regex engine
print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)', financialData))

Output:

[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
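
To tie this back to the function in the question, here is a minimal sketch that wraps the pattern in `grok_directors_dealings_text`. It rests on a few assumptions that are not part of the answer: the decimal point in the price is escaped (`\.`), the constant name `DEALING_RE` is made up for illustration, and the conversion of the lot size and price to numbers is a hypothetical extra step.

```python
import re

# Same pattern as in the answer, except that the decimal point in the
# price is escaped (an assumption about the intended behaviour).
DEALING_RE = re.compile(
    r"(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*)"
    r".*price of\s*(\d*\.\d+?p)"
)


def grok_directors_dealings_text(text_input):
    """Return (name, title, transaction_type, lot_size, price) or None."""
    match = DEALING_RE.search(text_input)
    if match is None:
        return None
    name, title, transaction_type, lot_size, price = match.groups()
    # Hypothetical extra step: turn '15,000' into 15000 and '76.75p' into 76.75.
    lot_size = int(lot_size.replace(",", "")) if lot_size else None
    price = float(price.rstrip("p"))
    return name, title, transaction_type, lot_size, price


sample = (
    "Mark Brookes, Non Executive Director bought 811 shares in the company "
    "on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares"
)
print(grok_directors_dealings_text(sample))
```

Returning `None` for text that does not fit the structure is one possible design choice; raising an exception would work just as well if malformed input should not pass silently.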

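Edit 1

To understand what each part means, follow the demo link: the panel in the top-right corner explains what each character of the pattern does.

In addition, Debuggex helps you step through the string by showing which group matches which characters!

Here is a Debuggex demo for this specific case: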

(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)

Debuggex Demo
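
If the online demos are not at hand, the same kind of explanation can be kept next to the pattern itself. The sketch below rewrites the pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments; because of that, the literal spaces in "sold post-exercise" and "price of" are written as `[ ]`. The name `ANNOTATED` and the comments are illustrative additions, not part of the answer.

```python
import re

ANNOTATED = re.compile(r"""
    (.*?),                  # group 1: name, lazily up to the first comma
    \s                      # the whitespace after the comma
    (.*)                    # group 2: title, greedy, backtracks to the keyword
    \s
    (sold(?:[ ]post-exercise)?|bought|exercised)   # group 3: transaction type
    \s*
    ([\d,]*)                # group 4: lot size, digits and thousands separators
    .*price[ ]of\s*         # skip ahead to the literal text "price of"
    (\d*.\d+?p)             # group 5: price; the unescaped dot matches any character
""", re.VERBOSE)

line = (
    "Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd "
    "at a price of 52.00p. The Director now holds 465,085 shares."
)
m = ANNOTATED.search(line)
# Print each capture group with its number, similar to what the demo shows.
for number, value in enumerate(m.groups(), start=1):
    print(number, repr(value))
```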

Votes: 2

Stack Overflow user

Posted 2015-09-04 17:37:34

I came up with this regex:

([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p

Debuggex Demo

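Basically, we use parentheses to capture the important pieces of information you want, so let's go through each one:

([\w ]+): \w matches any word character (a-zA-Z0-9_) and the class also allows spaces; repeated one or more times, this gives us the person's name;
another ([\w ]+), after a comma and a space, captures the title;
(sold post-exercise|sold|bought|exercised): then we look for the transaction type. Note that I put "sold post-exercise" before "sold" so that the longer alternative is tried first;
([\d,\.]+): then we look for the number, which is made up of digits (\d), commas and possibly a dot;
([\d\.,]+): then we need the price, which is matched in basically the same way as the transaction size.

The parts of the regex that connect the groups are fairly basic as well.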
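
The point about the order of alternatives can be checked in isolation with a short sketch. Python's `re` tries alternatives from left to right and takes the first one that lets the match succeed, not the longest one. The variable names below are illustrative, and the example looks at the alternation on its own; inside a larger pattern, backtracking can still reach a later alternative when the shorter one leads to a failure further on.

```python
import re

text = "Financial Director sold post-exercise 15,000 shares"

# Longer alternative listed first: it is tried first and wins.
long_first = re.search(r"(sold post-exercise|sold|bought|exercised)", text)

# Shorter alternative listed first: "sold" already matches, so the
# engine never tries "sold post-exercise" at that position.
short_first = re.search(r"(sold|sold post-exercise|bought|exercised)", text)

print(long_first.group(1))
print(short_first.group(1))
```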

If you try it on regex101, it gives some explanation of the regex and generates the following Python code for you to use:

import re
# Plain raw string: the Python 2-only ur'' prefix is a syntax error in Python 3
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')

test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."

# Print the list of (name, title, transaction type, lot size, price) tuples
print(re.findall(p, test_str))
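
Since each group has a clear meaning, one optional variation is to name the groups so the result reads as a record instead of a positional tuple. This is a sketch using the same pattern with `(?P<name>...)` syntax; the group names and the `NAMED` constant are assumptions chosen to mirror the variables in the question.

```python
import re

NAMED = re.compile(
    r"(?P<name>[\w ]+), (?P<title>[\w ]+) "
    r"(?P<transaction_type>sold post-exercise|sold|bought|exercised) "
    r"(?P<lot_size>[\d,\.]+).*price of (?P<price>[\d\.,]+)p"
)

text = (
    "David Meredith, Financial Director sold post-exercise 15,000 shares "
    "in the company on YYYY-mm-dd at a price of 1044.00p. "
    "The Director now holds 6,290 shares"
)

# groupdict() maps each group name to the text it captured.
record = NAMED.search(text).groupdict()
print(record)
```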

Votes: 1

Stack Overflow user

Posted 2015-09-04 17:26:44

This is the regex that does it:

(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)

You use it like this:

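import re

def get_data(line):
    # Groups: name, title, transaction type, lot size, price
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)"
    m = re.match(pattern, line)
    # re.match returns None when the line does not fit the pattern
    return m.groups() if m else None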

For the first line, it returns

('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')

Edit: added explanation

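The regex works like this: the first part, (.*?), takes the string up to the next match (which is the ,)

. means any character

* means it can repeat many times (zero or more characters, not just one)

? means don't be greedy, so the group stops at the first ',' rather than a later one (if there are several ',')

After that comes another (.*?), which again takes characters until the next thing matches (the constant keywords).

After that there is (sold post-exercise|sold|bought|exercised), which means: find one of these words (they are separated by |)

After that there is a .*?, which again means take all the text up to the next match (this time it is not wrapped in (), so it is not captured as a group and will not be part of the output).

([\d|,]+) means take a digit (\d) or a comma; + stands for one or more times. (Inside a character class the | is a literal pipe character rather than alternation, so [\d,]+ would be enough.)

Then .*? again, as before

'price of ' finds the literal string 'price of '

Finally, ([\d|\.]+) means take a digit or a dot again (the dot is escaped because regex uses . for "any character"), one or more times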
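
Two points from the explanation above are easy to verify in a Python session: the difference between the lazy `.*?` and the greedy `.*`, and the way `|` behaves inside a character class. The sample strings and variable names below are made up for illustration.

```python
import re

text = "Ellis, CEO, bought 262 shares"

# Lazy: stops at the first comma.
lazy = re.match(r"(.*?),", text).group(1)
# Greedy: grabs as much as possible, so it runs up to the last comma.
greedy = re.match(r"(.*),", text).group(1)
print(repr(lazy), repr(greedy))

# Inside [...] the "|" is a literal character, not alternation,
# so the first class also accepts a pipe between the numbers.
with_pipe = re.findall(r"[\d|,]+", "1,000|2")
without_pipe = re.findall(r"[\d,]+", "1,000|2")
print(with_pipe, without_pipe)
```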

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/32403338
