Optimize Relationship Building between Many Neo4j Nodes

I have a database containing two particular node types: GenomicRange and GeneModel. The GenomicRange node set contains ~80 million nodes while GeneModel contains ~45,000 nodes.
The GenomicRange nodes contain a property posStart, which is stored as an integer. The GeneModel nodes contain two integer properties, geneStart and geneEnd. These coordinates refer to positions on a chromosome, which is stored as a chromosome property on both node types (e.g. 1 through 10).
What I would like to do is efficiently create relationships (e.g. [:RANGE_WITHIN]) between these two node types if (1) their chromosome properties match and (2) the posStart value of the GenomicRange node falls within the range given by the geneStart and geneEnd properties of the GeneModel node.
The problem I am currently having is that my querying/building process is incredibly slow. Is there a way to optimize this code?
Thanks for your help!
MATCH (model:GeneModel)
WITH model
MATCH (range:GenomicRange)
WHERE range.chromosome = model.chromosome AND range.posStart >= model.geneStart AND range.posStart <= model.geneEnd
CREATE (range)-[:RANGE_WITHIN]->(model)

A few suggestions:
Add indexes on the properties you are using for comparison.
Here: posStart, chromosome, geneEnd, geneStart.
`CREATE INDEX ON :GenomicRange(chromosome)`
Increase heap memory: creating indexes increases memory usage, so increase the heap size to up to 50% of your RAM. You can configure this in the neo4j.conf file.
Increase the page cache: the larger the page cache, the more data is cached in memory, which helps avoid costly disk access.
Read more about memory configuration here.
P.S. If you still get an out-of-memory error after increasing the heap size, swap GenomicRange and GeneModel on lines 1 and 3 of the query, or use the APOC plugin to create the relationships in batches.
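
To see why an index on the compared properties matters, here is a minimal Python sketch (not Neo4j code) of the same range join. It assumes the positions are grouped by chromosome and kept sorted, which is roughly what an index provides; the function names and the toy data are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_position_index(ranges):
    """Group posStart values by chromosome and sort them (a stand-in for an index)."""
    index = defaultdict(list)
    for chromosome, pos_start in ranges:
        index[chromosome].append(pos_start)
    for positions in index.values():
        positions.sort()
    return index


def ranges_within(index, chromosome, gene_start, gene_end):
    """Return every posStart with gene_start <= posStart <= gene_end on one chromosome."""
    positions = index.get(chromosome, [])
    lo = bisect_left(positions, gene_start)   # first position >= gene_start
    hi = bisect_right(positions, gene_end)    # one past the last position <= gene_end
    return positions[lo:hi]


# Hypothetical toy data: (chromosome, posStart) pairs.
ranges = [(1, 100), (1, 250), (1, 900), (2, 150), (2, 300)]
index = build_position_index(ranges)
matches = ranges_within(index, 1, 200, 900)
```

With the sorted structure, each gene model costs two binary searches plus the matching positions, instead of a comparison against every GenomicRange node.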


How to calculate storage requirement for neo4j?

We are using Neo4j in our Node.js-based application, which stores sensor data. The data is growing exponentially, so before long we will be dealing with 10 million nodes and the relationships between them. I am not sure how to calculate the storage requirement for this. Is there a formula that would help with capacity planning for this data growth?
You can take a look in the Neo4j’s Hardware Requirements documentation. The Disk Storage section says that:
Nodes occupy 15B of space, relationships occupy 34B of space and
properties occupy 41B of space.
So, the storage size does not just depend on the number of nodes, but relationships and properties too.
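An example disk space calculation (from the docs; note that it uses 14B per node and 33B per relationship rather than the figures quoted above, presumably because the record sizes differ slightly between Neo4j versions):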
10,000 Nodes x 14B = 140kB
1,000,000 Rels x 33B = 31.5MB
2,010,000 Props x 41B = 78.6MB
Total is 110.2MB
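
The same arithmetic can be written as a small Python sketch. The per-record sizes below are the ones used in the example calculation (14, 33 and 41 bytes); treat them as assumptions and replace them with the values for your Neo4j version. The function name is hypothetical.

```python
# Per-record sizes in bytes, taken from the example calculation above (assumed values).
RECORD_BYTES = {"node": 14, "relationship": 33, "property": 41}


def estimate_store_bytes(nodes, relationships, properties, record_bytes=RECORD_BYTES):
    """Estimate the raw store size in bytes from record counts."""
    return (
        nodes * record_bytes["node"]
        + relationships * record_bytes["relationship"]
        + properties * record_bytes["property"]
    )


total = estimate_store_bytes(nodes=10_000, relationships=1_000_000, properties=2_010_000)
total_mib = total / (1024 * 1024)  # the "MB" figures above are bytes divided by 1024^2
print(f"{total} bytes, about {total_mib:.1f} MiB")
```

This only covers the record stores. Indexes, transaction logs and long string or array values take additional space.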

How to know the configured size of the Chronicle Map?

We use Chronicle Map as persistent storage. As new data arrives all the time, we keep putting new data into the map. Thus we cannot predict the correct value for net.openhft.chronicle.map.ChronicleMapBuilder#entries(long). Chronicle 3 will not break when we put in more data than expected, but performance will degrade. So we would like to recreate this map with a new configuration from time to time.
Now the real question: given a Chronicle Map file, how can we know which configuration was used for that file? Then we can compare it with the actual amount of data (the source of this knowledge is irrelevant here) and recreate the map if needed.
entries() is a high-level config, that is not stored internally. What is stored internally is the number of segments, expected number of entries per segment, and the number of "chunks" allocated in the segment's entry space. They are configured via ChronicleMapBuilder.actualSegments(), entriesPerSegment() and actualChunksPerSegmentTier() respectively. However, there is no way at the moment to query the last two numbers from the created ChronicleMap, so it doesn't help much. (You can query the number of segments via ChronicleMap.segments().)
You can contribute to Chronicle-Map by adding getters to ChronicleMap to expose those configurations. Otherwise, you need to store the number of entries separately, e.g. in a file alongside the persisted ChronicleMap file.
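
The second option does not depend on any Chronicle Map API, so it can be sketched in any language. Here is a minimal Python sketch; the `.meta.json` suffix, the `configured_entries` field and the 0.8 threshold are hypothetical choices, not part of Chronicle Map.

```python
import json
import os
import tempfile


def write_sidecar(map_path, configured_entries):
    """Record the entries() value used at creation time next to the map file."""
    sidecar_path = map_path + ".meta.json"
    with open(sidecar_path, "w", encoding="utf-8") as f:
        json.dump({"configured_entries": configured_entries}, f)
    return sidecar_path


def needs_recreation(map_path, actual_entries, threshold=0.8):
    """Return True once the actual entry count reaches threshold * configured entries."""
    with open(map_path + ".meta.json", encoding="utf-8") as f:
        configured = json.load(f)["configured_entries"]
    return actual_entries >= threshold * configured


# Usage with a temporary directory standing in for the real storage location.
with tempfile.TemporaryDirectory() as tmp:
    map_path = os.path.join(tmp, "data.map")
    write_sidecar(map_path, configured_entries=1_000_000)
    half_full = needs_recreation(map_path, actual_entries=500_000)
    nearly_full = needs_recreation(map_path, actual_entries=900_000)
```

The sidecar file has to be rewritten whenever the map is recreated with a new configuration, otherwise the comparison uses a stale value.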

Simple Neo4j query is very slow on large database

I have a Neo4J database with the following properties:
Array Store 8.00 KiB
Logical Log 16 B
Node Store 174.54 MiB
Property Store 477.08 MiB
Relationship Store 3.99 GiB
String Store Size 174.34 MiB
Total Store Size 5.41 GiB
There are 12M nodes and 125M relationships.
So you could say this is a pretty large database.
My OS is Windows 10 64-bit, running on an Intel i7-4500U CPU @ 1.80GHz with 8GB of RAM.
This isn't a complete powerhouse, but it's a decent machine and in theory the total store could even fit in RAM.
However when I run a very simple query (using the Neo4j Browser)
MATCH (n {title:"A clockwork orange"}) RETURN n;
I get a result:
Returned 1 row in 17445 ms.
I also sent a POST request with the same query to http://localhost:7474/db/data/cypher; this took 19 seconds.
something like this:
is however executed in 23ms...
And I can confirm there is an index on title:
ON :Page(title) ONLINE
Does anyone have an idea why this might be running so slowly?
This has to scan all nodes in the db - if you re-run your query using n:Page instead of just n, it'll use the index on those nodes and you'll get better results.
To expand this a bit more - INDEX ON :Page(title) is only for nodes with a :Page label, and in order to take advantage of that index your MATCH() needs to specify that label in its search.
If a MATCH() is specified without a label, the query engine has no "clue" what you're looking for so it has to do a full db scan in order to find all the nodes with a title property and check its value.
That's why
MATCH (n {title:"A clockwork orange"}) RETURN n;
is taking so long - it has to scan the entire db.
If you tell the MATCH() you're looking for a node with a :Page label and a title property -
MATCH (n:Page {title:"A clockwork orange"}) RETURN n;
the query engine knows you're looking for nodes with that label, it also knows that there's an index on that label it can use - which means it can perform your search with the performance you're looking for.
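
The difference can be illustrated with a toy Python model (not how Neo4j is implemented internally): the index exists only for nodes carrying the :Page label, so a lookup without the label has to inspect every node. All names and data here are hypothetical.

```python
from collections import defaultdict

# Toy node store: each node has a set of labels and a title property.
nodes = [
    {"id": 0, "labels": {"Page"}, "title": "A clockwork orange"},
    {"id": 1, "labels": {"Page"}, "title": "Brazil"},
    {"id": 2, "labels": {"Category"}, "title": "Films"},
]

# Stand-in for the index ON :Page(title): it only covers nodes labelled Page.
page_title_index = defaultdict(list)
for node in nodes:
    if "Page" in node["labels"]:
        page_title_index[node["title"]].append(node)


def match_without_label(title):
    """Like MATCH (n {title: ...}): every node in the store is checked."""
    checked = 0
    hits = []
    for node in nodes:
        checked += 1
        if node.get("title") == title:
            hits.append(node)
    return hits, checked


def match_page(title):
    """Like MATCH (n:Page {title: ...}): a direct lookup in the label's index."""
    return page_title_index.get(title, [])


hits, checked = match_without_label("A clockwork orange")
indexed_hits = match_page("A clockwork orange")
```

Both calls find the same node, but the unlabelled version does work proportional to the total number of nodes, which is what makes the original query take seconds on 12M nodes.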

Indexing process in Hadoop

Could anybody please explain what is meant by the indexing process in Hadoop?
Is it something like the traditional indexing of data that we do in an RDBMS? Drawing the same analogy, in Hadoop we would index the data blocks and store the physical addresses of the blocks in some data structure.
So it will take additional space in the cluster.
I googled around this topic but could not find anything satisfactory and detailed.
Any pointers will help.
Thanks in advance
Hadoop stores data in files, and does not index them. To find something, we have to run a MapReduce job going through all the data. Hadoop is efficient where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high you can't easily index changing data.
However, we can add indexing on top of HDFS in two ways: file-based indexing and InputSplit-based indexing.
Let's assume that we have 2 files to store in HDFS for processing. The first one is 500 MB and the second one is around 250 MB. With a 128 MB split size, we'll have 4 InputSplits for the first file (three of 128 MB plus a smaller final one) and 2 InputSplits for the second file.
We can apply 2 types of indexing for the mentioned case -
1. With File based indexing, you will end up with 2 files (full data set here), meaning that your indexed query will be equivalent to a full scan query
2. With InputSplit based indexing, you will end up with 4 InputSplits. The performance should definitely be better than doing a full scan query.
Now, to implement an InputSplit-based index we need to perform the following steps:
1. Build an index from your full data set - This can be achieved by writing a MapReduce job that extracts the value we want to index and outputs it together with its InputSplit MD5 hash.
2. Get the InputSplit(s) for the indexed value you are looking for - The output of the MapReduce program will be reduced files (containing indices based on InputSplits), which will be stored in HDFS.
3. Execute your actual MapReduce job on the indexed InputSplits only - Hadoop can do this because it retrieves the InputSplits to be used through the FileInputFormat class. We will create our own IndexFileInputFormat class extending the default FileInputFormat class and overriding its getSplits() method. You have to read the file you created in the previous step, add all your indexed InputSplits to a list, and then compare this list with the one returned by the superclass. You will return to the JobTracker only the InputSplits that were found in your index.
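
The three steps above can be simulated in memory with a short Python sketch. This is not Hadoop code: the split identifiers, the choice to hash the file path, offset and length with MD5, and all function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def split_id(path, start, length):
    """Identify a split by an MD5 hash of its file, offset and length (assumed scheme)."""
    return hashlib.md5(f"{path}:{start}:{length}".encode("utf-8")).hexdigest()


def build_index(splits, extract_value):
    """Step 1: map each indexed value to the ids of the splits that contain it."""
    index = defaultdict(set)
    for (path, start, length), records in splits.items():
        sid = split_id(path, start, length)
        for record in records:
            index[extract_value(record)].add(sid)
    return index


def select_splits(all_splits, index, wanted_value):
    """Steps 2 and 3: keep only the splits the index lists for wanted_value."""
    wanted_ids = index.get(wanted_value, set())
    return [s for s in all_splits if split_id(*s) in wanted_ids]


# Hypothetical data set: (file, offset in MB, length in MB) -> records in that split.
splits = {
    ("file1", 0, 128): ["alice,1", "bob,2"],
    ("file1", 128, 128): ["carol,3"],
    ("file2", 0, 128): ["alice,4"],
}
index = build_index(splits, extract_value=lambda line: line.split(",")[0])
selected = select_splits(list(splits), index, "alice")
```

Here `select_splits` plays the role of the overridden getSplits(): it starts from the full list of splits and returns only those found in the index, so the job reads two splits instead of three.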
In the Driver class we now have to use this IndexFileInputFormat class by setting it as the job's InputFormatClass.
For Code Sample and other details Refer this -
We can identify 2 different levels of granularity for creating indices: Index based on File URI or index based on InputSplit. Let’s take 2 different examples of data set.
First example:
2 files in your data set fit in 25 blocks, and have been identified as 7 different InputSplits. The target you are looking for (grey highlighted) is available on file #1 (block #2,#8 and #13), and on file #2 (block #17)
With File based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full scan query.
With InputSplit based indexing, you will end up with 4 of the 7 available InputSplits. The performance should definitely be better than doing a full scan query.
Let’s take a second example:
This time the same data set has been sorted by the column you want to index. The target you are looking for (grey highlighted) is now available on file #1 (block #1,#2,#3 and #4).
With File based indexing, you will end up with only 1 file from your data set.
With InputSplit based indexing, you will end up with 1 of the 7 available InputSplits.
For this specific study, I decided to use a custom InputSplit based index. I believe such an approach should be quite a good balance between the effort it takes to implement, the added value it might bring in terms of performance optimization, and its expected applicability regardless of the data distribution.

Processing and splitting large data using Hadoop Map reduce?

I have a large amount of data in text files (1,000,000 lines). Each line has 128 columns.
Now I am trying to build a kd-tree with this data. I want to use MapReduce for the calculations.
Brute Force approach for my problem:
1) Write a MapReduce job to find the variance of each column and select the column with the highest variance.
2) Taking (column name, variance value) as inputs, write another MapReduce job to split the input data into 2 parts: one part has all the rows with a value less than the input value for the given column name, and the second part has all the rows with a greater value.
3) For each part, repeat step 1 and step 2, and continue the process until you are left with 500 values in each part.
The (column name, variance value) pair forms a single node of my tree. So with the brute-force approach, for a tree of height 10, I need to run 1024 MapReduce jobs.
My questions:
1) Is there any way I can improve the efficiency by running fewer MapReduce jobs?
2) I am reading the same data every time. Is there any way I can avoid that?
3) Are there any other frameworks, like Pig, Hive etc., which are efficient for this kind of task?
4) Are there any frameworks with which I can save the data into a data store and easily retrieve it?
Please help ...
Why don't you try using Apache Spark (https://spark.apache.org/) here? This seems like a perfect use case for Spark.
With an MR job per node of the tree you have O(2^n) jobs (where n is the height of the tree), which is bad given the per-job overhead of YARN. But with simple programming tricks you can bring it down to O(n).
Here are some ideas:
Add an extra partition column in front of your key; this column is the nodeID (each node in your tree has a unique ID). This creates independent data flows and ensures that keys from different branches of the tree do not mix, so that all of the variances are calculated in the context of their nodeID, in waves, one wave per layer of nodes. This removes the need for an MR job per node with very little change in the code and gives you O(n) jobs rather than O(2^n);
The data is not sorted around the split value, so while splitting, elements from the parent list have to travel to their destination child lists, which causes network traffic between the cluster nodes. Thus caching the whole data set on a cluster with multiple machines might not give significant improvements;
After calculating a couple of levels of the tree, certain nodeIDs may have few enough rows to fit in the memory of a mapper or reducer. You could then continue processing that sub-tree completely in memory and avoid a costly MR job. This reduces the number of MR jobs, or the amount of data they process, as you get closer to the bottom of the tree;
Another optimisation would be to write a single MR job that, in the mapper, does the splitting around the selected value of each node, outputs the parts via MultipleOutputs, and emits the keys with the child nodeIDs of the next tree level to the reducer, which calculates the variance of the columns within the child lists. Of course the very first run has no splitting value, but all subsequent runs will have multiple split values, one for each child nodeID.
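
The nodeID idea can be sketched in plain Python, with one function call standing in for one MR job per tree level. This is an in-memory illustration only: it assumes splitting around the column mean (the question does not fix the split value), numbers children as 2*id+1 and 2*id+2, and uses hypothetical names throughout.

```python
from collections import defaultdict
from statistics import mean, pvariance


def build_level(rows_by_node):
    """One pass over a whole tree level: rows are grouped by nodeID, as in the key prefix idea."""
    split_info = {}
    children = defaultdict(list)
    for node_id, rows in rows_by_node.items():
        columns = list(zip(*rows))
        variances = [pvariance(col) for col in columns]
        col = max(range(len(columns)), key=variances.__getitem__)
        if variances[col] == 0:
            continue  # all rows identical: nothing to split, treat as a leaf
        pivot = mean(columns[col])
        split_info[node_id] = (col, pivot)
        for row in rows:
            child_id = 2 * node_id + (1 if row[col] < pivot else 2)
            children[child_id].append(row)
    return split_info, dict(children)


def build_tree(rows, leaf_size):
    """Run one pass per level until every node holds at most leaf_size rows."""
    tree = {}
    level = {0: rows}
    passes = 0
    while True:
        to_split = {nid: r for nid, r in level.items() if len(r) > leaf_size}
        if not to_split:
            break
        split_info, level = build_level(to_split)
        tree.update(split_info)
        passes += 1
    return tree, passes


# Toy data: 8 rows with 2 columns; stop when a node has at most 2 rows.
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60), (7, 70), (8, 80)]
tree, passes = build_tree(rows, leaf_size=2)
```

In this toy run, three nodes are split in two passes, one pass per level rather than one per node. In a real job the grouping by nodeID would be done by the shuffle instead of a dictionary.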