[orientdb] ETL: how to import a huge csv and generate different vertexes/edges based on the csv column values

Discussion:

Lars Plessmann

2015-01-21 19:06:15 UTC

I have a really huge CSV file (about 240 GB) with several columns (let's say
columns A to H).
The first column, A, is the primary key of the main record (vertex
MainRecord). But columns D, E, F and G should each be stored in their own
vertex, because their values repeat across the records and I don't want to
store them in the main record again and again. So each value of columns D to G
should be stored as a property called "title" in a new vertex, without
generating duplicates. Afterwards these vertices need to be linked to the
main record.
Is it possible to achieve this with a single orient-etl configuration? The
only way I know is to split the huge CSV file by column and create a separate
file for each vertex class. But I don't want to do that if it is not
necessary, because the file is so big.
Can you give me some advice?

Here is what I need, sketched in the config JSON syntax (of course, this
will not work as written):

{
  "source": {
    "file": {
      "path": "dataexport.csv"
    }
  },
  "extractor": {"row": {}},
  "transformers": [
    {
      "csv": {
        "separator": ",",
        "nullValue": "NULL",
        "skipFrom": -1,
        "skipTo": -1
      }
    },
    {
      "field": {
        "fieldName": "_id",
        "expression": "$input._id.substring(9, 33)"
      }
    },
    {
      "field": {
        "fieldName": "colD",
        "class": "ColumnD",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colE",
        "class": "ColumnE",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colF",
        "class": "ColumnF",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colG",
        "class": "ColumnG",
        "classProperty": "title"
      }
    },
    {
      "vertex": {"class": "MainRecord"}
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:127.0.0.1/msales_testing",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbAutoCreate": true,
      "dbType": "graph",
      "classes": [
        {
          "name": "MainRecord",
          "extends": "V"
        },
        {
          "name": "ColumnD",
          "extends": "V"
        },
        {
          "name": "ColumnE",
          "extends": "V"
        },
        {
          "name": "ColumnF",
          "extends": "V"
        },
        {
          "name": "ColumnG",
          "extends": "V"
        }
      ],
      "indexes": [
        {
          "class": "MainRecord",
          "fields": ["_id:string"],
          "type": "UNIQUE"
        }
      ]
    }
  }
}

By the way: _id is in the MongoDB ObjectID format. I just want to store the
original hex value, so I used the substring sql method to extract the hex
id. Maybe there is a better way.
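
To illustrate what the substring(9, 33) call is doing: assuming the exported _id values look like ObjectId(<24 hex characters>) (an assumption, since the thread shows no sample value) and assuming begin-inclusive, end-exclusive indexes, the prefix "ObjectId(" is 9 characters long, so positions 9 to 32 hold the hex id. A small Python sketch, independent of OrientDB, that does the same slice and adds a stricter variant that rejects values not in the assumed format:

```python
import re

# Assumed export format: ObjectId(<24 hex chars>). The thread shows no sample value.
OBJECT_ID_RE = re.compile(r"^ObjectId\(([0-9a-fA-F]{24})\)$")


def extract_hex_by_slice(raw: str) -> str:
    """Same idea as substring(9, 33): skip the 9-char prefix, keep 24 chars."""
    return raw[9:33]


def extract_hex_checked(raw: str) -> str:
    """Stricter variant: fail loudly if the value is not in the assumed format."""
    match = OBJECT_ID_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"not an ObjectId string: {raw!r}")
    return match.group(1)


if __name__ == "__main__":
    sample = "ObjectId(507f1f77bcf86cd799439011)"  # made-up example value
    print(extract_hex_by_slice(sample))
    print(extract_hex_checked(sample))
```

The plain slice silently returns garbage if a row has a different prefix; the checked variant makes such rows visible, which can matter on a file this size.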

regards
Lars

--
---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-database+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Luca Garulli

2015-01-21 22:09:17 UTC

Hi Lars,
I don't know of any "class" and "classProperty" fields in the "field"
transformer. This is the syntax:

http://www.orientechnologies.com/docs/last/orientdb-etl.wiki/Transformer.html#field

Furthermore, why do you need multiple classes?

Lars Plessmann

2015-01-22 10:11:03 UTC

Hi Luca,

I know these fields do not exist. They were only meant as an example of what
I'm looking for, to make it easier to understand (see my note above the
config file ;)).
The wiki is not very informative about complex transformations :-(

I think it's more efficient to store these values in their own classes instead
of building document records with lots of redundant fields (especially for
reporting/OLAP). That is why I want to use multiple classes: to reduce space
and improve query speed.

So let me give a concrete example of a CSV file I want to process:

id, voucherid, description, creationdate, brand, model, customerid
1, 4711, "Special Offer III", "2012-05-02T09:12:17", "Samsung", "Galaxy S3", 50112
2, 4712, "Special Offer II", "2012-05-03T09:14:17", "HTC", "One", 50113
3, 4713, "Special Offer III", "2012-05-04T10:17:17", "HTC", "One", 50002
4, 4714, "Special Offer for Mr XY", "2012-05-04T11:09:17", "Apple", "iPhone 5", 50017
[...]

So I want to create the classes "Order", "Brand", "Model" and "Customer"
(each extends V).
Brand, Model and Customer should be linked to the Order, and I do not want
duplicates in any of the classes Order, Brand, Model or Customer.

Class Order gets the properties voucherid, description and creationdate.
Class Brand gets the property title, where the value of the CSV field "brand"
is stored.
Class Model gets the property title, where the value of the CSV field "model"
is stored.
Class Customer gets the property customerid, where the CSV field "customerid"
is stored (plus firstname, lastname etc.).

Because it's a huge file with millions of records, I want to run the ETL job
just once instead of running it again for each class.
I hope it's clear what I want.
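
The single-pass requirement described above can be sketched outside the ETL tool. The following Python is only an illustration of the logic, not an OrientDB API: it streams the CSV once and yields "create vertex" and "create edge" operations, remembering which Brand, Model and Customer values it has already emitted. The edge class names (HasBrand, HasModel, HasCustomer) and the function name are hypothetical, and the sample rows are adapted from the post. It assumes the id column is unique, so Order vertices are never cached.

```python
import csv
import io
from typing import Iterable, Iterator, Set, Tuple

# (vertex class, CSV column, vertex property, edge class).
# The edge class names are hypothetical; the thread does not name them.
LOOKUPS = [
    ("Brand", "brand", "title", "HasBrand"),
    ("Model", "model", "title", "HasModel"),
    ("Customer", "customerid", "customerid", "HasCustomer"),
]

SAMPLE = """id, voucherid, description, creationdate, brand, model, customerid
1, 4711, "Special Offer III", "2012-05-02T09:12:17", "Samsung", "Galaxy S3", 50112
2, 4712, "Special Offer II", "2012-05-03T09:14:17", "HTC", "One", 50113
3, 4713, "Special Offer III", "2012-05-04T10:17:17", "HTC", "One", 50002
"""


def plan_import(lines: Iterable[str]) -> Iterator[tuple]:
    """Stream the CSV once and yield create-vertex / create-edge operations."""
    seen: Set[Tuple[str, str]] = set()  # lookup vertices already emitted
    # skipinitialspace handles the blank after each comma in the sample.
    for row in csv.DictReader(lines, skipinitialspace=True):
        # One Order per row; assumes the id column is unique in the file.
        yield ("vertex", "Order", {
            "id": row["id"],
            "voucherid": row["voucherid"],
            "description": row["description"],
            "creationdate": row["creationdate"],
        })
        for cls, column, prop, edge_cls in LOOKUPS:
            key = (cls, row[column])
            if key not in seen:  # emit each Brand/Model/Customer only once
                seen.add(key)
                yield ("vertex", cls, {prop: row[column]})
            yield ("edge", edge_cls, ("Order", row["id"]), key)


if __name__ == "__main__":
    for op in plan_import(io.StringIO(SAMPLE)):
        print(op)
```

Only the distinct brand, model and customer keys are held in memory, so the cache grows with the number of distinct values rather than with the file size. In a real import the yielded operations would be handed to whatever writes to the database, and a unique index on the lookup property would play the role of the seen set.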

regards
Lars

kurtuluş yılmaz

2015-03-30 08:02:51 UTC

Hi Lars,
Did you find a solution for migrating data from this type of table? I also
need to migrate data from a table like this. If you found a solution, could
you describe it?
Best regards.

sck2015

2015-03-31 21:10:38 UTC

Hi Lars and Luca, was this ever resolved? This is related to my question
about creating multiple vertices from within a single ETL config JSON file.
Similarly, another post asked how to create edges between two vertices that
already exist via the ETL (is this possible?). My perception is that
different vertex classes have to be created from separate source files in the
ETL. I'd like to understand whether this is the case and, if so, how to create
edges between *two pre-existing vertices* in the ETL, if possible. Thanks, Sonu

kurtuluş yılmaz

2015-04-03 20:17:40 UTC

Hi, I found a solution for this. You can create an edge between existing
nodes by using the "command" transformer in the ETL JSON file. Here is my
solution; I hope it helps.
The key part of the config JSON:
{
  "command": {
    "command": "create edge cookieUsedBy from (select from Cookie where compId = ${input.compId}) to (select from Member where member_id = ${input.member_id})",
    "output": "edge"
  }
}

Best regards.

{
  "config": {
    "log": "debug"
  },
  "extractor": {
    "jdbc": {
      "driver": "com.mysql.jdbc.Driver",
      "url": "",
      "userName": "log",
      "userPassword": "",
      "query": "select compId,member_id from login_log where member_id = 12034788",
      "fetchSize": 100
    }
  },
  "transformers": [
    {
      "command": {
        "command": "create edge cookieUsedBy from (select from Cookie where compId = ${input.compId}) to (select from Member where member_id = ${input.member_id})",
        "output": "edge"
      }
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote: localhost/",
      "dbUser": "root",
      "dbPassword": "",
      "dbAutoCreate": true,
      "tx": true,
      "batchCommit": 1000,
      "wal": false,
      "dbType": "graph",
      "classes": [
        {
          "name": "Cookie",
          "extends": "V"
        }
      ],
      "indexes": [
        {
          "class": "Cookie",
          "fields": [
            "compId: String"
          ],
          "type": "UNIQUE_HASH_INDEX"
        }
      ]
    }
  }
}
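
For readers unfamiliar with the ${input.…} placeholders in the command above: the idea is that the statement is filled in once per extracted row. The Python below is a rough illustration of that per-row expansion only; it is not how OrientDB implements it, and the expand helper and the sample row values are made up.

```python
import re

# Matches placeholders of the form ${input.<field>} as used in the command above.
PLACEHOLDER = re.compile(r"\$\{input\.(\w+)\}")

COMMAND = (
    "create edge cookieUsedBy "
    "from (select from Cookie where compId = ${input.compId}) "
    "to (select from Member where member_id = ${input.member_id})"
)


def expand(template: str, row: dict) -> str:
    """Replace every ${input.<field>} with the row's value for <field>."""
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)


if __name__ == "__main__":
    # Hypothetical row, shaped like one result of the JDBC query.
    print(expand(COMMAND, {"compId": 42, "member_id": 12034788}))
```

In this sketch values are inserted verbatim, so a string-typed compId would need quotes in the template; whether the ETL module handles quoting for you is worth checking against its documentation. Each expanded statement performs two lookups (Cookie by compId, Member by member_id), so indexes on both fields matter for large imports.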

Kunal Goyal

2018-03-08 07:32:36 UTC

Hi Lars,

Did you find a solution for this problem?

Maxim Nikolaev

2018-03-08 07:44:11 UTC

My suggestion is to use the Java API; it works 100%.
