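I'm new to ML/DL, but I'm looking for a way to extract metadata from text, and I think ML might be a good solution.
Goal
Input: sentences containing a field descriptor followed by one or more values, for example:
Non-current assets 5 675 5 512 4 789 4 586
Cash and cash equivalents 909 861 912 630
Inventories, trade and other receivables and other current assets 3 756 2 998 2 864 2 834
Total assets 10 340 9 372 8 565 8 051
Equity 5 649 4 560 2 365 1 969
Non-current liabilities 2 438 2 403 3 270 2 407
Current liabilities 2 253 2 409 2 931 3 675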
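Output: a tuple {field: value} for each line:
{non_current_assets: 5675}
{cash_and_cash_equivalents: 909}
{total_assets: 10340}
{equity: 5649}
{non_current_liabilities: 2438}
{current_liabilities: 2253}
{inventories: 3756}
Question
I have done some research and know that words need to be embedded (using Word2Vec or something similar). But how are numbers handled?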
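Posted on 2020-03-05 15:42:14
Your question is not entirely clear.
If you only have strings consisting of text followed by numbers and you want {text: number}, you should simply split at the first digit and skip ML altogether. Are your strings embedded in a longer text document? A complete example would make this easier to answer.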
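The split-at-the-first-digit idea can be sketched roughly as below. This is only a minimal sketch under a few assumptions that are not stated in the question: every line has the same known number of value columns (four here), a space is used as the thousands separator, and labels contain no digits. The helper names `to_key`, `group_numbers` and `parse_line` are made up for illustration.

```python
import re

# Label = everything before the first digit; the rest = space-separated digit groups.
LINE_RE = re.compile(r"^(?P<label>\D+?)\s*(?P<numbers>\d[\d ]*)$")


def to_key(label):
    """'Non-current assets' -> 'non_current_assets' (hypothetical naming rule)."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def group_numbers(tokens, n_columns):
    """Split digit tokens into n_columns integers, or return None if impossible.

    Assumes a space is the thousands separator, so every token after the
    first one inside a number must have exactly three digits.
    """
    if n_columns == 0:
        return [] if not tokens else None
    # Try the longest candidate first and backtrack if the rest does not fit.
    for size in range(len(tokens) - n_columns + 1, 0, -1):
        head, tail = tokens[:size], tokens[size:]
        if any(len(t) != 3 for t in head[1:]) or (size > 1 and len(head[0]) > 3):
            continue
        rest = group_numbers(tail, n_columns - 1)
        if rest is not None:
            return [int("".join(head))] + rest
    return None


def parse_line(line, n_columns=4):
    """Return (key, values) for one 'label numbers' line, or None if it does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    values = group_numbers(match.group("numbers").split(), n_columns)
    if values is None:
        return None
    return to_key(match.group("label")), values


lines = [
    "Non-current assets 5 675 5 512 4 789 4 586",
    "cash equivalents 909 861 912 630",
]
for line in lines:
    key, values = parse_line(line)
    print({key: values[0]})  # keep only the first column, as in the question
```

Note that the grouping is inherently ambiguous when every token has three digits (for example "100 200 300 400 500" with four columns), so the longest-first choice here is a heuristic rather than a guarantee.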
If your sentences are embedded in running text, for example:
text = " If your import is failing due to a missing package, you can use pip. Non-current assets 5 675 5 512 4 789 4 586. We also expect cash equivalents 909 861 912 630 in toal"
you can use part-of-speech tagging and chunking to detect the noun groups that precede numbers.
Something like this: https://medium.com/@acrosson/extracting-names-emails-and-phone-numbers-5d576354baa
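import nltk
from nltk.corpus import stopwords

# One-time downloads of the NLTK resources used below
# (recent NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Drop English stop words (the comparison is case-sensitive, so 'If' and 'We' survive)
stop = stopwords.words('english')
document = ' '.join([i for i in text.split() if i not in stop])

# Split into sentences, tokenize each sentence, then POS-tag the tokens
sentences = nltk.sent_tokenize(document)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
The tagged sentences are then: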
[[('If', 'IN'),
('import', 'NN'),
('failing', 'VBG'),
('due', 'JJ'),
('missing', 'VBG'),
('package', 'NN'),
(',', ','),
('use', 'NN'),
('pip', 'NN'),
('.', '.')],
[('Non-current', 'JJ'),
('assets', 'NNS'),
('5', 'CD'),
('675', 'CD'),
('5', 'CD'),
('512', 'CD'),
('4', 'CD'),
('789', 'CD'),
('4', 'CD'),
('586', 'CD'),
('.', '.')],
[('We', 'PRP'),
('also', 'RB'),
('expect', 'VBP'),
('cash', 'NN'),
('equivalents', 'NNS'),
('909', 'CD'),
('861', 'CD'),
('912', 'CD'),
('630', 'CD'),
('toal', 'NN')]]
You can then define a regex-based chunk grammar to detect noun groups followed by numbers, for example:
grammar = """MATCH:{<JJ><NNS><CD>}"""  # the grammar would need to be completed
cp = nltk.RegexpParser(grammar)
for sentence in sentences:
    print(cp.parse(sentence))
which prints:
(S
If/IN
import/NN
failing/VBG
due/JJ
missing/VBG
package/NN
,/,
use/NN
pip/NN
./.)
(S
(MATCH Non-current/JJ assets/NNS 5/CD)
675/CD
5/CD
512/CD
4/CD
789/CD
4/CD
586/CD
./.)
(S
We/PRP
also/RB
expect/VBP
cash/NN
equivalents/NNS
909/CD
861/CD
912/CD
630/CD
toal/NN)
Doing this end to end with TensorFlow is much harder if you are not an expert.
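Coming back to the chunking approach: the parse output shows that the grammar only catches adjective + plural noun + a single number, so "cash equivalents" is missed and only the first digit group is captured. A more complete grammar might look like `MATCH: {<JJ|NN.*>+<CD>+}`, though I have not tested that here. The sketch below expresses the same "noun group followed by numbers" pattern in plain Python over the tagged tuples, so it runs without NLTK downloads. The tagged tokens are copied from the output above; `LABEL_TAGS` and `extract_fields` are hypothetical names, and treating only JJ/NN/NNS as label tags is an assumption.

```python
# Tagged tokens copied from the tagger output shown above (one list per sentence).
tagged_sentences = [
    [("Non-current", "JJ"), ("assets", "NNS"), ("5", "CD"), ("675", "CD"),
     ("5", "CD"), ("512", "CD"), ("4", "CD"), ("789", "CD"),
     ("4", "CD"), ("586", "CD"), (".", ".")],
    [("We", "PRP"), ("also", "RB"), ("expect", "VBP"), ("cash", "NN"),
     ("equivalents", "NNS"), ("909", "CD"), ("861", "CD"), ("912", "CD"),
     ("630", "CD"), ("toal", "NN")],
]

LABEL_TAGS = {"JJ", "NN", "NNS"}  # assumption: labels are adjective/noun runs


def extract_fields(tagged):
    """Yield (label, number_tokens) for each noun group directly followed by CD tokens."""
    i = 0
    while i < len(tagged):
        start = i
        # Consume a run of adjective/noun tokens (the candidate label).
        while i < len(tagged) and tagged[i][1] in LABEL_TAGS:
            i += 1
        label_end = i
        # Consume the run of cardinal-number tokens that follows it.
        while i < len(tagged) and tagged[i][1] == "CD":
            i += 1
        if label_end > start and i > label_end:
            label = " ".join(word for word, _ in tagged[start:label_end])
            numbers = [word for word, _ in tagged[label_end:i]]
            yield label, numbers
        elif i == start:
            i += 1  # token is neither label nor number; skip it


for sentence in tagged_sentences:
    for label, numbers in extract_fields(sentence):
        print(label, numbers)
```

The number tokens are left as strings because digit groups such as "5" and "675" still have to be merged into values like 5675, which depends on how many columns each line has.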
https://stackoverflow.com/questions/60417835