I have the following requirement:
I have a large file containing lines of JSON data:
{
"_length": "88",
"_id" : "1",
"_store": {
"meta": {
"value": {
"uid": "sam"
}
}
}
}
{
"_length": "22",
"_id" : "2",
"_store": {
"meta": {
"value": {
"uid": "uncle"
}
}
}
}
...
I have another file with the following content:
{
"uid" : "sam",
"zid" : "121212121"
}
{
"uid" : "aborted",
"zid" : "9989821"
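}
...
Now I need to generate a new file from the first file, containing all records whose uid does not appear in the second file.
I am new to Pig and would like to know what kinds of JOIN or set operations it supports.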
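Posted on 2017-03-11 13:57:02
I think elephant-bird can help you. I have never tried anything like this myself, but since your JSON is nested, you can use elephant-bird to read the two files into two relations, then join them to get the result you want.
Here are a few links to help you get started with elephant-bird.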
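A minimal sketch of the approach described above, assuming both files hold one JSON object per line and that the elephant-bird jars are already registered in the session. The file names `records.json` and `uid_zid.json`, the output path `/tmp/records_without_uid`, and all aliases are hypothetical. It does a left outer join on uid and keeps only the rows that found no match:

```pig
-- Assumption: elephant-bird and its dependencies were REGISTERed beforehand.
-- Assumption: the input file names and the output path below are placeholders.
records = LOAD 'records.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
ids = LOAD 'uid_zid.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- Pull the join key out of the nested map and keep the full record next to it.
rec_keyed = FOREACH records GENERATE (chararray) json#'_store'#'meta'#'value'#'uid' AS uid, json;
id_keyed = FOREACH ids GENERATE (chararray) json#'uid' AS uid;

-- LEFT OUTER keeps every record; unmatched ones get null on the right side.
joined = JOIN rec_keyed BY uid LEFT OUTER, id_keyed BY uid;

-- Keep only the records whose uid was not found in the second file.
missing = FILTER joined BY id_keyed::uid IS NULL;
result = FOREACH missing GENERATE rec_keyed::json;

STORE result INTO '/tmp/records_without_uid' USING JsonStorage();
```

With the two sample records from the question, only the record with uid "uncle" should remain, since "sam" appears in the second file.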
Posted on 2017-03-14 02:52:10
Here are the sample files, along with the corresponding intermediate and final results:
cat ids_test.json
{"A":"a1","B":"a2"}
cat part-test
{"content":"both_A_a1_B_a2","meta":{"A":"a1","B":"a2"}}
{"content":"only_B_a2","meta":{"A":"","B":"a2"}}
{"content":"only_A_a1","meta":{"A":"a1","B":""}}
{"content":"both_A_b1_B_b2","meta":{"A":"b1","B":"b2"}}
{"content":"only_A_c1","meta":{"A":"c1","B":""}}
cat /tmp/j1/part-m-00000
{"user_data::json":{"meta":"{B=a2, A=a1}","content":"both_A_a1_B_a2"},"ids::json":{"B":"a2","A":"a1"}}
{"user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"ids::json":null}
{"user_data::json":{"meta":"{B=, A=a1}","content":"only_A_a1"},"ids::json":{"B":"a2","A":"a1"}}
{"user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"ids::json":null}
{"user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"ids::json":null}
cat /tmp/j1_filter/part-m-00000
{"user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"ids::json":null}
{"user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"ids::json":null}
{"user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"ids::json":null}
cat /tmp/j2/part-m-00000
{"J1_FILTER::user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"J1_FILTER::ids::json":null,"ids::json":{"B":"a2","A":"a1"}}
{"J1_FILTER::user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"J1_FILTER::ids::json":null,"ids::json":null}
{"J1_FILTER::user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"J1_FILTER::ids::json":null,"ids::json":null}
cat /tmp/results/part-m-00000
{"J1_FILTER::user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"}}
{"J1_FILTER::user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"}}
Here is the script:
user_data = LOAD 'part-test' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
ids = LOAD 'ids_test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
J1 = JOIN user_data BY json#'meta'#'A' LEFT OUTER, ids BY json#'A' USING 'replicated';
rmf /tmp/j1
store J1 into '/tmp/j1' USING JsonStorage;
J1_FILTER = FILTER J1 BY ids::json is null;
rmf /tmp/j1_filter
store J1_FILTER into '/tmp/j1_filter' USING JsonStorage;
J2 = JOIN J1_FILTER BY user_data::json#'meta'#'B' left outer, ids BY json#'B' USING 'replicated';
rmf /tmp/j2
store J2 into '/tmp/j2' USING JsonStorage;
J2_FILTER = FILTER J2 BY ids::json is null;
RESULTS = FOREACH J2_FILTER GENERATE J1_FILTER::user_data::json;
--filtered_ids = FOREACH user_data_MINUS_ids GENERATE user_data AS data;
--DUMP filtered_ids;
rmf /tmp/results
store RESULTS into '/tmp/results' USING JsonStorage;
https://stackoverflow.com/questions/42706330
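As a possible alternative to each LEFT OUTER join plus `is null` filter in the script above, Pig's COGROUP combined with the built-in IsEmpty can express the same "no match" test. This is only a sketch, shown for key A; the aliases `user_keyed`, `ids_keyed`, `G1`, `G1_ONLY` and `NO_A_MATCH` are made up for illustration, and it has not been run against the sample files.

```pig
user_data = LOAD 'part-test' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
ids = LOAD 'ids_test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- Project the key explicitly so both sides group on a chararray.
user_keyed = FOREACH user_data GENERATE (chararray) json#'meta'#'A' AS a, json;
ids_keyed = FOREACH ids GENERATE (chararray) json#'A' AS a;

-- One output tuple per key value, holding a bag of rows from each input.
G1 = COGROUP user_keyed BY a, ids_keyed BY a;

-- An empty ids_keyed bag means no id row carried this key.
G1_ONLY = FILTER G1 BY IsEmpty(ids_keyed);

-- Unpack the surviving user records, one row per original record.
NO_A_MATCH = FOREACH G1_ONLY GENERATE FLATTEN(user_keyed.json);
```

Unlike the 'replicated' join, which runs map-side, COGROUP needs a reduce phase, so the join version may be preferable when the ids file is small.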