Discussion:
[orientdb] Re: Bulk insert of a massive graph
Nicolas
2012-07-21 09:12:46 UTC
Hi,

I didn't succeed in making it work with the 1.0.2-SNAPSHOT. It didn't
work with the 1.1 SNAPSHOT either.

I tried another way, using a server with a lot of RAM: 64 GB.
I used OGraphDatabase to model my graph.
The loading is done in two phases:
-First, I load the vertices into the database and keep a HashMap mapping
each idTwitter to the vertex I loaded.
-Second, I load the edges, using the HashMap to look up the vertices
matching each idTwitter.

During the vertex load, RAM usage grows fast, then stops at 28 GB and
nothing more seems to happen. Is there a RAM limit in the OrientDB
config?
Post by Luca Garulli
Hi,
the OrientBatchGraph is quite recent. Make sure to use it against OrientDB
1.0.2-SNAPSHOT.
Post by Nicolas
Hi Luca,
I tested the OrientBatchGraph but I don't succeed in running my
program without an exception.

OrientBatchGraph g = new OrientBatchGraph("local:/home/nicolas/Recherche/OrientDb/testdb");
BufferedReader buff = new BufferedReader(new FileReader("/home/nicolas/Recherche/file"));
String line;
int left, right;
Vertex v1, v2;
String[] t;
while ((line = buff.readLine()) != null) {
    t = line.split(" ");
    left = Integer.parseInt(t[0]);
    right = Integer.parseInt(t[1]);
    v1 = g.getVertex(left);
    v2 = g.getVertex(right);
    if (v1 == null) {
        g.addVertex(left);
        v1 = g.getVertex(left);
    }
    if (v2 == null) {
        g.addVertex(right);
        v2 = g.getVertex(right);
    }
    g.addEdge(left + "to" + right, v1, v2, "f");
}
g.shutdown();
buff.close();

The error is that when "g.addEdge(left + "to" + right, v1, v2, "f");" is
executed, v1 and v2 are null. Do I use "getVertex" correctly?
Thanks again ;)
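
A likely explanation for the nulls, assuming the usual Blueprints
semantics of OrientDB's implementation: Graph.addVertex(Object id)
ignores a user-supplied id (OrientDB assigns its own record id), and
getVertex() expects a record id such as #6:24, so getVertex(left) returns
null both before and after the add. A minimal sketch of a workaround is
to use the vertex returned by addVertex() and track the id mapping
yourself (the map name here is illustrative):

HashMap<Integer, Vertex> byTwitterId = new HashMap<Integer, Vertex>();
...
Vertex v1 = byTwitterId.get(left);
if (v1 == null) {
    // addVertex() returns the created vertex; any passed id is ignored
    v1 = g.addVertex(null);
    v1.setProperty("idTwitter", left);
    byTwitterId.put(left, v1);
}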
Post by Nicolas
Hi everybody,
We have a text file which represents the edges of a graph:

id1 id2
id2 id3
id3 id1

That means there is a directed edge from vertex id1 to vertex id2, from
vertex id2 to vertex id3, and from vertex id3 to vertex id1.
The file is about 38 GB: roughly 50 million vertices and 2 billion edges.
What is the best way to load and store this graph in GraphDB?
I read that there may be a way by inserting a JSON file using the
Gremlin console, but I'm not sure that it can work.
Thanks a lot.
Nicolas
--
Luca Garulli
2012-07-21 10:33:22 UTC
Hi,
the 1.0.2-SNAPSHOT is quite old: it was renamed 1.1.0-SNAPSHOT 1-2 months
ago.

Please use the 1.1.0-SNAPSHOT.

About your use case: what is the JVM's heap size? Are you following the
suggestions at http://code.google.com/p/orient/wiki/PerformanceTuning ?

Please post here the relevant code that performs the insertion.
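
For scale, applying that page usually starts with giving the JVM a large
heap, along these lines (illustrative values and a hypothetical
GraphImporter main class; the right sizes depend on the machine):

java -server -Xmx16g -XX:+UseParallelGC GraphImporter local:/path/to/db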
--
anthony
2012-07-24 07:11:28 UTC
Hi Luca,

I work with Nicolas on this project; he is away for a while, so I'm
responding for him :-)

We tried with the 1.1.0-SNAPSHOT but still did not succeed with the
OrientBatchGraph.
Regarding our use case, we did not try the performance tuning suggestions
yet; this is something I will try in the next couple of days. Meanwhile,
here is the code we used to do our insertion:

OGraphDatabase db = new OGraphDatabase("local:" + args[0]);
if (db.exists())
    db.open("admin", "admin");
else
    db.create();
db.declareIntent(new OIntentMassiveInsert());

FileReader fr = new FileReader(args[1]);
BufferedReader buff = new BufferedReader(fr);
String line;
int left;
ODocument v;

HashMap<Integer, ODocument> map = new HashMap<Integer, ODocument>();

System.out.println("Starting vertex insertion");
while ((line = buff.readLine()) != null) {
    left = Integer.parseInt(line);
    v = (ODocument) db.createVertex().field("idTwitter", left).save();
    map.put(left, v);
}
System.out.println("Finished vertex insertion");
buff.close();
fr.close();

fr = new FileReader(args[2]);
buff = new BufferedReader(fr);

int right, k = 0;
String[] t;
ODocument v1, v2;
System.out.println("Starting edge insertion");
while ((line = buff.readLine()) != null) {
    t = line.split(" ");
    try {
        left = Integer.parseInt(t[0]);
        v1 = map.get(left);

        right = Integer.parseInt(t[1]);
        v2 = map.get(right);

        db.createEdge(v1, v2).save();
    } catch (Exception e) {
        // malformed lines are silently skipped
    }
    k++;
    if (k % 1000000 == 0) {
        k = 0;
        System.out.println("1000000 more edges inserted");
    }
}

buff.close();
fr.close();

db.close();
--
Luca Garulli
2012-07-24 07:26:29 UTC
Hi,
how many vertices and edges are you loading?

First suggestion: keep in memory only the RID, not the entire ODocument:

HashMap<Integer, ORID> map = new HashMap<Integer, ORID>();

...

map.put(left, v.getIdentity());
...

v1 = (ODocument) map.get(left).getRecord();

...

v2 = (ODocument) map.get(right).getRecord();

...
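
Folded into the loops posted above, the change looks roughly like this (a
sketch reusing Anthony's variables; a RID is a small fixed-size
identifier, so the map stays far lighter than one holding whole
ODocuments):

HashMap<Integer, ORID> map = new HashMap<Integer, ORID>();

// vertex phase: store only the record id
v = (ODocument) db.createVertex().field("idTwitter", left).save();
map.put(left, v.getIdentity());

// edge phase: lazy-load each document from its RID when needed
v1 = (ODocument) map.get(left).getRecord();
v2 = (ODocument) map.get(right).getRecord();
db.createEdge(v1, v2).save();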
--
anthony
2012-07-24 10:09:06 UTC
Hi,

Thanks for your quick answer. We are trying to load about 50 million
vertices and 2 billion arcs.
Your first suggestion improved the loading of the vertices; however, the
edges are now loading really slowly (something like 100,000 every 10
minutes...).

Is there a way to speed up this insertion? (Maybe with an index on the
vertices, or something?)
--
Luca Garulli
2012-07-24 10:22:56 UTC
You should find out whether most of the time is spent on:

1. *retrieving the vertices*. Fast solutions:
   1. create an index on the vertex's idTwitter field and remove the
      HashMap. This reduces the memory used (= less GC), but an index
      will always be slower than an in-memory map...
   2. or don't use the massive insert intent, letting OrientDB use the
      2-level cache. This could improve things a lot if you have a lot
      of memory, but it's not so far from your initial solution of
      keeping all the vertices in memory inside a HashMap.
2. *saving the vertex*: this could be due to the auto-defragmentation of
   the underlying storage layer (it works much like a file system). Fast
   solutions:
   1. wait for the 1.2.0-SNAPSHOT, which will improve this,
   2. or, in the meanwhile, enlarge the space each vertex occupies in
      storage by using the oversize. The "OGraphVertex" class has an
      oversize factor of 2, but if you have many edges you could
      probably enlarge it a bit and avoid using the Tree to handle the
      edge set... more on this in another email if this is the problem.
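
A hedged sketch of what those two knobs might look like with the 1.1.x
schema API (the index type, oversize value, and exact method signatures
are assumptions to check against the release in use):

OClass vClass = db.getMetadata().getSchema().getClass("OGraphVertex");

// point 1.1: index idTwitter so lookups no longer need the HashMap
vClass.createProperty("idTwitter", OType.INTEGER)
      .createIndex(OClass.INDEX_TYPE.UNIQUE);

// point 2.2: reserve extra room per vertex to limit defragmentation
vClass.setOverSize(4);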
--
anthony
2012-07-24 12:35:21 UTC
So, I've tried looking into the first point by removing the use of the
massive insert intent.
This drastically reduced the speed of vertex insertion: it inserted about
20 million vertices pretty fast, but won't insert another million (I
waited about an hour before posting).
The memory stopped growing, as if the script were frozen.


Actually, when I first did the test of inserting all the vertices and
then adding the edges, memory use was about 50% (we have 64 GB
available). Like you said, this probably means that adding an index will
not solve the problem, since retrieving the vertices should already be
quite fast with the HashMap.

So maybe the issue is located at the second point you mentioned?
For deadline reasons, it would be really nice if we could load the graph
before early August, so if the other solution is not too long to explain,
I would be glad to read it :-)
--
Luca Garulli
2012-07-24 12:47:00 UTC
Hi,
First: 2 billion edges over 50 million vertices means each vertex has
about 40 edges on average. Are they equally distributed (or close to it),
or are there some super-nodes that hold many of them?

Second: you can have much more memory than 64 GB, but the bottleneck with
the JVM is the number of old-gen objects, which makes the GC run for too
long. Unfortunately we don't have an off-heap cache yet (it's planned).
So the best strategy could be to keep only the RID in memory and
lazy-load the ODocument when you create edges.

If you use your HashMap, disable the cache (or just use the massive
insert intent), or try to remove it in favor of keeping only ORIDs and
using the OrientDB cache.

One way to improve edge creation, if you have about 40 edges per vertex,
is to avoid the automatic upgrade of the Set<Edge> by changing the
threshold.

Try setting this before any operation against OrientDB:

OGlobalConfiguration.MVRBTREE_RID_BINARY_THRESHOLD.setValue(-1);

and then try with:

OGlobalConfiguration.MVRBTREE_RID_BINARY_THRESHOLD.setValue(1);

The first disables the creation of a Tree to handle the edges, while the
second always uses a tree.
--
anthony
2012-07-25 08:05:28 UTC
Hi,

Unfortunately, no, the degrees are not equally distributed. The graph is
power-law, so we have a few vertices with very high degree (up to 3
million) and a lot of vertices with very small degree.

I have tried the following: keeping the massive insert intent for the
vertices (this is definitely needed) and removing it for the edges. This
improved the speed of insertion a bit, but it still remains quite slow
(something like 1 million in 10 minutes).

I have also tried both the configurations you suggested, and it seems
that the insertion is faster when the tree is disabled (maybe this comes
from the high-degree vertices?).

Just to be clear on what I'm doing right now: I keep the ORIDs in a
HashMap and then use it to create the edges (this is something you
advised previously). If I follow you correctly, there is a way to keep
the ORIDs in the OrientDB cache and thus avoid using the HashMap?

Again, thanks for your help and quick answers :)
Anthony
--
Luca Garulli
2012-07-25 14:25:25 UTC
Hi Anthony,
release 1.2 will have the defragmentation queue, so massive inserts and
updates (creating edges counts as a massive insert+update!) will speed up
a lot, but this is not yet available.

If you remove your HashMap you should create an index on the idTwitter
attribute; this would slow down the insertion but save memory. Since that
is not your case (you've plenty of memory), using the HashMap could help.

Another small hint: disable multi-threading support if you have a single
thread:

OGlobalConfiguration.ENVIRONMENT_CONCURRENT.setValue( false );
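
Pulling the thread's hints together, the top of a single-threaded loader
might start like this (a sketch; the -1 threshold is the variant Anthony
found faster, and the database path is a placeholder):

// single-threaded loader: skip concurrency management
OGlobalConfiguration.ENVIRONMENT_CONCURRENT.setValue(false);
// never upgrade edge sets into trees (faster here per Anthony's tests)
OGlobalConfiguration.MVRBTREE_RID_BINARY_THRESHOLD.setValue(-1);

OGraphDatabase db = new OGraphDatabase("local:/path/to/db");
if (db.exists())
    db.open("admin", "admin");
else
    db.create();
db.declareIntent(new OIntentMassiveInsert());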
--
Milen Dyankov
2012-07-26 12:30:43 UTC
Hi guys,

sorry for jumping into the middle of your discussion, but I'm
experiencing a similar problem and it seems I managed to optimize it a
bit. I'm inserting about 1M vertices and about 1.5M edges like this
(pseudo-code):

foreach (vertexDef) {
    insert vertex_N in DB
    put RID_N in cache
}
...
foreach (edgeDef) {
    insert edge (RID_1 from cache, RID_2 from cache)
}

On my machine this approach caused the application to "hang" a while
after it started creating vertices. I increased the cache size to 1M
entries (I'm using Apache's LRUMap) because I had too many misses, which
forced RIDs to be selected from the DB. Strangely enough, this didn't
have any positive impact on performance.

Profiling the application revealed that with a huge LRUMap (1M entries) it
takes very long (~200ms) to get the RID from the map (not sure if this is
the same with HashMap). So I reorganized the import process to insert
vertexes as soon as possible, like this (pseudocode):

insert vertex1 in DB
put RID_1 in cache
insert vertex2 in DB
put RID_2 in cache
insert edge (RID_1 from cache, RID_2 from cache)
insert vertex3 in DB
put RID_3 in cache
insert edge (RID_1 from cache, RID_3 from cache)
....

Now my cache size is only 10000 items and the whole process completes in
about 40 minutes on my machine. Monitoring the cache shows I have no
misses.

To sum up, make sure the HashMap performs well with your amount of data!
If not, try to reorganize the import to reduce the amount of data in the
map, or use a different structure.
Not sure it is even possible in your case, but I just thought I'd share my
findings.
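
A minimal Java sketch of this interleaved approach, assuming the edge file
is ordered so that a vertex's edges appear soon after the vertex itself;
the cache size of 10000, the "idTwitter" field, and the OGraphDatabase
calls follow the snippets earlier in this thread, while the class name and
file handling are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.commons.collections.map.LRUMap;
import com.orientechnologies.orient.core.db.graph.OGraphDatabase;
import com.orientechnologies.orient.core.id.ORID;
import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class InterleavedLoader {
    // Keep only the last 10000 RIDs in memory; with a favourable input
    // ordering the cache sees no misses, as described above.
    private final LRUMap cache = new LRUMap(10000);
    private final OGraphDatabase db;

    public InterleavedLoader(OGraphDatabase db) {
        this.db = db;
    }

    // Look up the vertex for an external id, creating and caching it on a miss.
    private ODocument vertexFor(int id) {
        ORID rid = (ORID) cache.get(id);
        if (rid != null)
            return (ODocument) rid.getRecord();
        ODocument v = (ODocument) db.createVertex().field("idTwitter", id).save();
        cache.put(id, v.getIdentity()); // cache the RID only, not the document
        return v;
    }

    public void load(String edgeFile) throws Exception {
        db.declareIntent(new OIntentMassiveInsert());
        BufferedReader buff = new BufferedReader(new FileReader(edgeFile));
        String line;
        while ((line = buff.readLine()) != null) {
            String[] t = line.split(" ");
            // Create each vertex the moment it is first seen, then insert the
            // edge immediately, instead of two separate passes.
            db.createEdge(vertexFor(Integer.parseInt(t[0])),
                          vertexFor(Integer.parseInt(t[1]))).save();
        }
        buff.close();
    }
}

The key design point is that only RIDs live in the cache, so the heap stays
small no matter how large the graph grows.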

Regards,
Milen
--
http://about.me/milen

--
Milen Dyankov
2012-07-26 13:40:05 UTC
Permalink
Just realized I've made a typo. This:

... the application to "hang" a while after it started to create vertexes

should be

... the application to "hang" a while after it started to create edges

Regards,
Milen
--
http://about.me/milen

--
Konrad
2012-08-01 18:32:39 UTC
Permalink
I've been reading around the Groups (both Gremlin and Orient) and still
haven't found out how to "disable transactions".

The OrientDB wiki says that disabling transactions (along with declaring
the intent) will speed up bulk inserts, but I haven't figured out how that
is done with OrientGraph or OrientBatchGraph.

Couldn't just using OrientBatchGraph both declare the "massive insert"
intent and disable transactions in the underlying Orient database? Or is
this what's already happening?
In that case, there should be no point in wrapping an OrientBatchGraph in
a BatchGraph, right?

Kind regards,
Konrad
Hi,
Okay, with your method, Marko, and the line you gave me and that I
added, Luca, it seems to work correctly and faster.
However, even if it's faster, to load 38GB, I will need a week.
So, if TransactionalGraph is just start/stop transactions, Marko,
is there a way to do a bulk insert with transactions disabled?
I use the BatchGraph class to load data into Neo4j too and it's
the same issue.
Thanks a lot for your help,
Nico
Hi,
Then disable transactions, but until the TinkerPop fix is ready
you will go slowly.
What fix are you talking about? Again, TransactionalGraph is just
start/stopTransactions -- there is no notion of a "transaction buffer." The
pattern is simply:
long counter = 0;
while (...) {
    // insert some data
    if (++counter % 1000 == 0)
        g.stopTransaction(SUCCESS);
}
Thanks,
Marko.
http://markorodriguez.com
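
For reference, a minimal, self-contained sketch of Marko's periodic-commit
pattern against a transactional OrientGraph (Blueprints 2.x API); the
database URL, property values, and batch size of 1000 are assumptions, not
values confirmed in the thread:

import com.tinkerpop.blueprints.TransactionalGraph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class PeriodicCommit {
    public static void main(String[] args) {
        OrientGraph g = new OrientGraph("local:/tmp/testdb"); // hypothetical path
        long counter = 0;
        try {
            for (int i = 0; i < 100000; i++) {
                Vertex v = g.addVertex(null);
                v.setProperty("idTwitter", i);
                // Flush the open transaction every 1000 inserts instead of
                // letting it grow for the whole load.
                if (++counter % 1000 == 0)
                    g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS);
            }
        } finally {
            g.shutdown(); // shutdown also closes any remaining open transaction
        }
    }
}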
--
Luca Garulli
2012-08-01 22:39:24 UTC
Permalink
Hi,
using the OrientBatchGraph class you avoid transactions, and the massive
insert intent is declared by default ;-)
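
So a bulk load can be as short as the sketch below; the URL is
hypothetical, and the addVertex/addEdge calls follow the usage shown
earlier in the thread (the import assumes the Blueprints 2.x package
layout):

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientBatchGraph;

// No intent declaration or transaction handling needed: OrientBatchGraph
// runs without transactions and with the massive insert intent by default.
OrientBatchGraph g = new OrientBatchGraph("local:/tmp/testdb"); // hypothetical path
Vertex v1 = g.addVertex(1);
Vertex v2 = g.addVertex(2);
g.addEdge("1to2", v1, v2, "f");
g.shutdown();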
--
darren.seay-daKfhaFl/
2013-02-15 21:59:34 UTC
Permalink
Hi All,

Has anyone come up with a preferred solution for this? Using TinkerPop
2.2.0 and Orient 1.3 I have tried four different approaches
(GraphMLReader, BatchGraph<OrientGraph>, OrientBatchGraph and
OGraphDatabase directly) to load a graph with about 100K vertices and 200K
edges that is structured as 100 trees going 4 levels deep. This load is
taking over a minute for all methods, ranging from 62 seconds to 78
seconds. There is not that much difference in time, so I'm assuming the
slowdown is due to some configuration issue and not the specific API used,
but I'm just not seeing it. This solution will need to scale to about 10
to 20 million vertices and 30 to 40 million edges in production and run
regularly. I have attached the 4 classes implementing the load.

Any feedback would be greatly appreciated.

Thanks
Darren
--
Jan Drake
2013-10-31 00:54:02 UTC
Permalink
Darren,

Did you get a resolution on this? I'm seeing similar issues with 1.5 and
50-150 edges.
--
Jan Drake
2013-10-31 19:51:58 UTC
Permalink
Luca,

Looks like Darren didn't get a resolution for insert speeds with OrientDB.
I found this:
http://blog.euranova.eu/wp-content/uploads/2013/10/an-empirical-comparison-of-graph-databases1.pdf

I'm seeing very similar problems with 1.5 and inserting into a graph with
about five vertex classes and seven edge classes, but with a fairly high
cardinality of edge instances to vertex instances.

What should we expect from OrientDB in terms of insertion speed?

Jan
--
tia
2014-02-20 18:33:51 UTC
Permalink
Hi,
Any updates on this? I've been struggling with this problem too. I test
loaded an RDF/XML file with 48631 triples and the total time to complete is:

TOTALTIME (ms): 971055

And this is a load onto an empty database. When I load a file that has
nodes with fewer edges, timing is very fast.

Here's my code for loading and the test data file, if that would help. I'm
running orientdb-server-1.7-SNAPSHOT.



// Imports needed by this method (assuming the TinkerPop 2.x package layout):
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.sql.Timestamp;
import org.openrdf.sail.Sail;
import com.orientechnologies.orient.client.remote.OEngineRemote;
import com.orientechnologies.orient.core.Orient;
import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;
import com.tinkerpop.blueprints.impls.sail.SailGraph;
import com.tinkerpop.blueprints.oupls.sail.GraphSail;

public void loadRDFFile_OrientGraphNoTx(
        String remoteDbsUrl, String remoteDbsUser, String remoteDbsPwd,
        String inputFile, String baseURI, String inputFormat) {

    Orient.instance().registerEngine(new OEngineRemote());

    // Non-transactional graph plus the massive insert intent for bulk loading.
    OrientGraphNoTx graph = new OrientGraphNoTx(remoteDbsUrl, remoteDbsUser, remoteDbsPwd);
    graph.getRawGraph().declareIntent(new OIntentMassiveInsert());

    // Expose the property graph as an RDF triple store via GraphSail.
    Sail s = new GraphSail<OrientGraphNoTx>(graph);
    SailGraph sailGraph = new SailGraph(s);

    long curr = System.currentTimeMillis();
    System.out.println(new Timestamp(new java.util.Date().getTime()) + "|STARTTIME: " + curr);
    try {
        sailGraph.loadRDF(new FileInputStream(inputFile), baseURI, inputFormat, null);
        System.out.println("TOTALTIME: " + (System.currentTimeMillis() - curr));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (java.lang.RuntimeException e2) {
        System.out.println("DATAFILE FAILED LOADING: " + inputFile);
        e2.printStackTrace();
    } finally {
        sailGraph.shutdown();
        graph.shutdown();
    }
}
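
A hypothetical invocation, assuming a local database named "rdfdb" and an
RDF/XML input file (SailGraph.loadRDF accepts format names such as
"rdf-xml" and "n-triples"):

loadRDFFile_OrientGraphNoTx("plocal:/data/databases/rdfdb", "admin", "admin",
        "/data/input/triples.rdf", "http://example.org/", "rdf-xml");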


Tia
--
Darren Seay
2014-02-20 18:37:12 UTC
Permalink
Sorry, I gave up on Orient and we went with a home-grown solution. We just
could not get the performance we needed.

Darren
--
Luca Garulli
2014-02-20 19:08:39 UTC
Permalink
Hi,
please try 1.7-rc1 or 1.7-rc2-SNAPSHOT. Furthermore, don't use remote but
rather a plocal URL directly.
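
For example (the paths are hypothetical), the difference is only in the
URL scheme passed to the constructor:

// Goes through the server's network protocol: slower for bulk loads.
OrientGraphNoTx remote = new OrientGraphNoTx("remote:localhost/rdfdb", "admin", "admin");

// Opens the database files directly in-process: preferred for bulk loads.
OrientGraphNoTx local = new OrientGraphNoTx("plocal:/data/databases/rdfdb");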
--
tia
2014-02-20 21:27:20 UTC
Permalink
Ah, yes. I do pass the plocal URL directly to the method. Sorry if my
code parameter naming is misleading.

Tia