Lars Plessmann
2015-01-21 19:06:15 UTC
I have a really huge CSV file (about 240 GB) with several columns (let's say
columns A - H).
The first column, A, is the primary key of the main record (vertex
MainRecord). Columns D, E, F and G, however, should each be stored in a
vertex of their own, because their values repeat across many records and I
don't want to store them in the main record again and again. So each value
of columns D-G should be stored as a property called "title" in a new vertex
(without generating duplicates). Afterwards these vertices need to be linked
to the main record.
Is it possible to achieve this with a single orient-etl configuration? The
only way I know is to split the huge CSV file by columns and create separate
files for each vertex class. But I don't want to do this if it is not
necessary (the file is so big).
Can you give me some advice?
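To make the intended result concrete, here is a minimal Python sketch of the semantics I am after. It is not OrientDB code: the function name `build_graph`, the column names A - H used as CSV headers, and the tiny in-memory sample are all assumptions for illustration. It reads the rows once, keeps one "vertex" per distinct value of columns D-G, and records a link from each main record to it.

```python
import csv
import io

def build_graph(rows, shared_columns=("D", "E", "F", "G")):
    """Single pass over the rows: dedupe shared column values and link them."""
    main_records = {}                            # primary key -> remaining fields
    shared = {col: {} for col in shared_columns} # column -> {title: vertex number}
    edges = []                                   # (primary key, column, title)
    for row in rows:
        key = row["A"]
        # The main record keeps only the columns that are not factored out.
        main_records[key] = {k: v for k, v in row.items() if k not in shared_columns}
        for col in shared_columns:
            title = row[col]
            if title in (None, "", "NULL"):
                continue  # mirrors "nullValue": "NULL" in the CSV settings
            vertices = shared[col]
            if title not in vertices:
                vertices[title] = len(vertices)  # create the vertex only once
            edges.append((key, col, title))
    return main_records, shared, edges

# Hypothetical two-row sample standing in for the real 240 GB file.
sample = io.StringIO(
    "A,B,C,D,E,F,G,H\n"
    "1,b1,c1,red,x,p,m,h1\n"
    "2,b2,c2,red,y,p,NULL,h2\n"
)
main_records, shared, edges = build_graph(csv.DictReader(sample))
```

The rows are streamed, so only the distinct values of D-G and the collected results stay in memory; in a real import the main records and edges would be written to the database instead of being kept in Python structures.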
I'll try to describe what I need in the config JSON syntax (of course, this
will not work as written):
{
  "source": {
    "file": {
      "path": "dataexport.csv"
    }
  },
  "extractor": {"row": {}},
  "transformers": [
    {
      "csv": {
        "separator": ",",
        "nullValue": "NULL",
        "skipFrom": -1,
        "skipTo": -1
      }
    },
    {
      "field": {
        "fieldName": "_id",
        "expression": "$input._id.substring(9, 33)"
      }
    },
    {
      "field": {
        "fieldName": "colD",
        "class": "ColumnD",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colE",
        "class": "ColumnE",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colF",
        "class": "ColumnF",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colG",
        "class": "ColumnG",
        "classProperty": "title"
      }
    },
    {
      "vertex": {"class": "MainRecord"}
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:127.0.0.1/msales_testing",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbAutoCreate": true,
      "dbType": "graph",
      "classes": [
        {
          "name": "MainRecord",
          "extends": "V"
        },
        {
          "name": "ColumnD",
          "extends": "V"
        },
        {
          "name": "ColumnE",
          "extends": "V"
        },
        {
          "name": "ColumnF",
          "extends": "V"
        },
        {
          "name": "ColumnG",
          "extends": "V"
        }
      ],
      "indexes": [
        {
          "class": "MainRecord",
          "fields": ["_id:string"],
          "type": "UNIQUE"
        }
      ]
    }
  }
}
By the way: _id is in the MongoDB ObjectID format. I just want to store the
original hex value, so I used the SQL substring method to extract the hex
ID. Maybe there is a better way.
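To show what the substring call is meant to do, here is a small Python sketch. It assumes the raw column value looks like `ObjectID(507f1f77bcf86cd799439011)`, that is, a 9-character prefix followed by the 24 hex digits; the sample value and the helper name `extract_object_id_hex` are made up for illustration.

```python
import re

# An ObjectID is 12 bytes, i.e. 24 hexadecimal characters.
OBJECT_ID_HEX = re.compile(r"[0-9a-fA-F]{24}")

def extract_object_id_hex(raw):
    """Return the 24-character hex part of a raw ObjectID string."""
    match = OBJECT_ID_HEX.search(raw)
    if match is None:
        raise ValueError("no 24-character hex id found in %r" % raw)
    return match.group(0)

raw_value = "ObjectID(507f1f77bcf86cd799439011)"  # assumed export format
hex_id = extract_object_id_hex(raw_value)

# Equivalent of substring(9, 33): skip the 9-character "ObjectID(" prefix.
sliced = raw_value[9:33]
```

Unlike the fixed offsets, the pattern match does not depend on the exact length of the prefix, so it would also cope with a differently spelled wrapper around the hex value.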
regards
Lars
--
---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-database+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.