首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >KDB -文本解析和文本数据编目

KDB -文本解析和文本数据编目
EN

Stack Overflow用户
提问于 2020-09-22 05:01:06
回答 2查看 189关注 0票数 0

我有由变化的周期性字符串组成的数据,这些字符串实际上是一个时间值列表,其中包含一个周期性标志。不幸的是,每个字符串长度可以有不同数量的元素,但不能超过7个。

下面的例子-(每个字符串末尾的#和#/M表示这些是月度值)从8/2020开始,而#/Y是年度数字,所以我们除以12就得到了月度值。#在开始时,简单的意思是从上一阶段继续。

从CSV复制

代码语言:javascript
复制
ID,seg,strField
AAA,1,8/2020 2333 2456 2544 2632 2678 #/M
AAA,2,# 3333 3456 3544 3632 3678 #
AAA,3,# 4333 4456 4544 4632 4678 #/M
AAA,4,11/2021 5333 5456 #/M
AAA,5,# 6333 6456 6544 6632 6678 #/Y
代码语言:javascript
复制
t:("SSS";enlist",") 0:`:./Data/src/strField.csv; // read in csv data above
t:update result:count[t]#enlist`float$() from t; // initiate empty result column

我通常会对7列进行标记化,然后将这7列中的每一列传递给一个函数,但限制是8个参数,除了这7个参数之外,我还想发送其他元数据。

代码语言:javascript
复制
t:@[t;`tok1`tok2`tok3`tok4`tok5`tok6`tok7;:;flip .Q.fu[{" " vs'x}]t `strField];  
代码语言:javascript
复制
t: ungroup t; 

//Desired result
ID   seg    iDate   result
AAA  1  8/31/2020   2333    
AAA  1  9/30/2020   2456    
AAA  1  10/31/2020  2544    
AAA  1  11/30/2020  2632    
AAA  1  12/31/2020  2678    
AAA  2  1/31/2021   3333    
AAA  2  2/28/2021   3456    
AAA  2  3/31/2021   3544    
AAA  2  4/30/2021   3632    
AAA  2  5/31/2021   3678    
AAA  3  6/30/2021   4333    
AAA  3  7/31/2021   4456    
AAA  3  8/31/2021   4544    
AAA  3  9/30/2021   4632    
AAA  3  10/31/2021  4678    
AAA  4  11/30/2021  5333    
AAA  4  12/31/2021  5456    
AAA  5  1/31/2022    527.75     <-- 6333/12
AAA  5  2/28/2022    527.75     
AAA  5  3/31/2022    527.75     
AAA  5  4/30/2022    527.75     
AAA  5  5/31/2022    527.75     
AAA  5  6/30/2022    527.75     
AAA  5  7/31/2022    527.75     
AAA  5  8/31/2022    527.75     
AAA  5  9/30/2022    527.75     
AAA  5  10/31/2022   527.75     
AAA  5  11/30/2022   527.75     
AAA  5  12/31/2022   527.75     
AAA  5  1/31/2023    538.00     <--6456/12
AAA  5  2/28/2023    538.00     
AAA  5  3/31/2023    538.00     
AAA  5  4/30/2023    538.00     
AAA  5  5/31/2023    538.00     
AAA  5  6/30/2023    538.00     
AAA  5  7/31/2023    538.00     
AAA  5  8/31/2023    538.00     
AAA  5  9/30/2023    538.00     
AAA  5  10/31/2023   538.00     
AAA  5  11/30/2023   538.00     
AAA  5  12/31/2023   538.00     
AAA  5  1/31/2024       etc..
AAA  5  2/29/2024       
AAA  5  3/31/2024       
AAA  5  4/30/2024       
AAA  5  5/31/2024       
AAA  5  6/30/2024       
AAA  5  7/31/2024       
EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2020-09-23 04:21:02

ddonelly是正确的,字典或列表绕过了函数8个参数的限制,但我认为这不是正确的方法。下面实现了所需的输出:

代码语言:javascript
复制
t:("SSS";enlist",") 0:`:so.csv;

// This will process each distinct ID separately as the date logic I have here would break if you had a BBB entry that starts date over
{[t]
    
    t:@[{[x;y] select from x where ID = y}[t;]';exec distinct ID from t];  

    raze {[t]
        t:@[t;`strField;{" "vs string x}'];
        t:ungroup update`$date from delete strField from @[t;`date`result`year;:;({first x}each t[`strField];"J"${-1_1_x}each t[`strField];
            `Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
        delete year from ungroup update date:`$'string date from update result:?[year;result%12;result],
            date:{x+til count x} each {max($[z;12#(x+12-x mod 12);1#x+1];y)}\[0;"M"$/:raze each reverse each 
                "/" vs/: string date;year] from t
     } each t
    
    }[t]

ID  seg date    result
AAA 1   2020.08 2333
AAA 1   2020.09 2456
AAA 1   2020.10 2544
AAA 1   2020.11 2632
AAA 1   2020.12 2678
AAA 2   2021.01 3333
AAA 2   2021.02 3456
AAA 2   2021.03 3544
AAA 2   2021.04 3632
AAA 2   2021.05 3678
AAA 3   2021.06 4333
AAA 3   2021.07 4456
AAA 3   2021.08 4544
AAA 3   2021.09 4632
AAA 3   2021.10 4678
AAA 4   2021.11 5333
AAA 4   2021.12 5456
AAA 5   2022.01 527.75
AAA 5   2022.02 527.75
AAA 5   2022.03 527.75
...
AAA 5   2023.01 538
AAA 5   2023.02 538
AAA 5   2023.03 538
AAA 5   2023.04 538
...
AAA 5   2024.01 545.3333
AAA 5   2024.02 545.3333
...

下面是嵌套函数中发生的事情的完整分析,如果你需要理解它的话。

代码语言:javascript
复制
// vs (vector from scalar) is useful for string manipulation to separate the strField column into a more manageable list of seperate strings 
t:@[t;`strField;{" "vs string x}'];

// split the strField out to more manageable columns
t:@[t;`date`result`year;:;
    
    // date column from the first part of strField 
    ({first x}each t[`strField];
    
    // result for the actual value fields in the middle
    "J"${-1_1_x}each t[`strField];
     
    // year column which is a boolean to indicate special handling is needed. 
    // I also forward fill to account for rows which are continuation of 
    // the previous rows time period, 
    // e.g. if you had 2 or 3 lines in a row of continuous yearly data 
    `Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];

// ungroup to split each result into individual rows
t:ungroup update`$date from delete strField from t;

t:update 
    // divide yearly rows where necessary with a vector conditional
    result:?[year;result%12;result],
    
    // change year into a progressive month list
    date:{x+til count x} each 
        
        // check if a month exists, if not take previous month + 1. 
        // If a year, previous month + 12 and convert to Jan
        // create a list of Jans for the year which I convert to Jan->Dec above
        {max($[z;12#(x+12-x mod 12);1#x+1];y)}\
             // reformat date to kdb month to feed with year into the scan iterator above
             [0;"M"$/:raze each reverse each "/" vs/: string date;year] from t;

// finally convert date to symbol again to ungroup year rows into individual rows
delete year from ungroup update date:`$'string date from t
票数 2
EN

Stack Overflow用户

发布于 2020-09-22 17:25:51

可以将列传递到字典中,然后将字典传递到函数中吗?这避免了最多有8个参数的问题,因为字典可以根据您的需要而定。

票数 2
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/63999927

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档