Placement of blocks in Hadoop 3

I am currently concerned with block placement: we are trying to implement our own placement algorithm in Hadoop 3. In Hadoop 2, we can find a file where we can plug in our placement policy, but I can't find any such file in the Hadoop 3 source code. Can someone indicate where I can find a file that serves similar functionality?


Hadoop Data Blocks and Data Content

Hadoop breaks up the content of the input data into blocks without regard to the content.
As a post described:
HDFS has no idea (and doesn’t care) what’s stored inside the file, so raw files are not split in accordance with rules that we humans would understand. Humans, for example, would want record boundaries — the lines showing where a record begins and ends — to be respected.
The part I am unclear on is this: if the data is split based only on size, without regard to the content, would there not be implications for the accuracy of queries performed later? For instance, in the often-cited example of a list of cities and daily temperatures, could a city be in one block and its temperature somewhere else? How, then, does a map operation query the info correctly? There seems to be something fundamental about blocks and MR queries that I am missing.
Any help would be greatly appreciated.
could a city be in one block and its temperature somewhere else
That is entirely possible, yes. In this case, the record boundary crosses two blocks, and both parts are gathered.
Accuracy is not lost, but performance is, in terms of disk and network I/O. When the end of a block is reached before the end of the current record, the next block is read. Even if the record ends within the first few bytes of that following block, its byte stream is still fetched and processed.
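The situation can be sketched in plain Python (tiny made-up block size and data, purely for illustration; real HDFS readers work on streams, not in-memory slices):

```python
# A record (line) that straddles a block boundary is still read whole:
# the reader simply continues into the next block until it sees '\n'.
data = b"city_a,21\ncity_b,19\ncity_c,25\n"
BLOCK_SIZE = 12  # tiny block size for illustration

blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
# "city_b,19" starts in block 0 and ends in block 1, so neither block
# contains the whole record:
assert b"city_b,19" not in blocks[0] and b"city_b,19" not in blocks[1]

# Reading block 0 and continuing into block 1 until '\n' recovers it intact.
stream = b"".join(blocks)   # what the reader sees: a continuous byte stream
records = stream.split(b"\n")
print(records[1])           # b'city_b,19', complete despite the block cut
```

So the record is recovered correctly; the only cost is the extra read into the neighbouring block.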
Let's get into the basics of an ext filesystem (forget HDFS for the time being).
1. On your hard disk, data is stored in the form of tracks and sectors. When a file is stored, it is not necessary that a complete record fits within one block (4 KB); records can span blocks.
2. The process reading the file reads blocks and finds the record boundaries. A record is a logical entity.
3. The file saved to the hard disk as bytes has no understanding of records or file formats. File formats and records are logical entities.
Apply the same logic to HDFS.
1. The block size is 128 MB.
2. Just like an ext filesystem, HDFS has no clue about record boundaries.
3. What mappers do is logically find the record boundaries:
a. The mapper that reads from file offset 0 starts at the beginning of the file and reads until it finds \n.
b. Every mapper that does not start at offset 0 skips bytes until it reaches a \n and then continues reading. The skipped byte sequence up to the newline may be a complete or a partial record; either way, it is consumed by another mapper.
c. Mappers read the block they are assigned and continue reading until they find a \n, which may be in another block that is not local to them.
d. So, except for the first mapper, every mapper reads its local block plus a byte sequence from the next block, up to the first \n.
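The rules above can be simulated in Python. This is a simplified sketch (splits are fixed-size byte ranges over one in-memory buffer; real mappers work through InputSplits and a record reader), but it shows that every record is read exactly once, by exactly one mapper:

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the records 'owned' by the byte range [start, start+length).
    Rules, as described above:
    - a split that does not start at offset 0 first skips bytes up to and
      including the first newline (that partial record belongs to the
      previous split);
    - a split keeps reading lines while its read position is still within
      the range, so its last record may extend into the next block."""
    pos = start
    if start > 0:
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    end = start + length
    records = []
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:                  # last line has no trailing newline
            records.append(data[pos:])
            break
        records.append(data[pos:nl])
        pos = nl + 1
    return records

data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = [(off, 10) for off in range(0, len(data), 10)]
all_records = [rec for off, ln in splits for rec in read_split(data, off, ln)]
print(all_records)  # [b'alpha', b'bravo', b'charlie', b'delta']
```

Note how the record "bravo" starts inside the first split and ends inside the second, yet it is emitted once, by the first split's reader.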
See Ali,
how a file is split into blocks is decided by the Hadoop HDFS gateway during the store procedure. It depends on the Hadoop version (1.x or 2.x) and on the size of the file you are moving from local storage to the gateway. After the -put command, the gateway splits the file into blocks and stores them in your datanode directory /data/dfs/data/current/ (if you are running on a single node, this is inside your Hadoop directory) as files named blk_<id>, each with a matching file of the same name and a .meta extension holding metadata about the block.
The block size in Hadoop 1 is 64 MB, and in Hadoop 2 it was increased to 128 MB; thereafter, depending on the file size, the file is split into blocks as I said above. Apart from the configured block size, there is no tool to control this in HDFS. If anything is missing, please let me know!
In Hadoop 1, we simply put a file into the cluster as below; if the file size is 100 MB, then what happens is this:
bin/hadoop fs -put <full path of the input file, including extension> </user/datanode/(target-dir)>
The Hadoop 1 gateway divides the file into two blocks (64 MB and 36 MB), whereas Hadoop 2 makes it a single block, and then replicates those blocks sequentially as per your configuration.
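The arithmetic can be sketched in Python (sizes in MB; this only shows how a file is cut into blocks, not how replicas are placed):

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int) -> list[int]:
    """Return the sizes of the HDFS blocks a file of the given size occupies:
    full blocks of block_size_mb, plus one final partial block if needed."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(100, 64))   # [64, 36] -> two blocks under Hadoop 1
print(split_into_blocks(100, 128))  # [100]    -> one block under Hadoop 2
```

Note that the last block occupies only as much space as it needs; a 100 MB file does not consume a full 128 MB block.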
If you are building a jar for MapReduce jobs, you can set the number of reduce tasks to 1 on the org.apache.hadoop.mapreduce.Job object inside your Mapper-Reducer driver class, then export the jar after testing and run the MR job as below.
// Setting the results to a single target file, inside the main method: job.setNumReduceTasks(1);
and later run the hadoop command like:
bin/hadoop jar <full path of your jar file> <fully qualified main class inside the jar> <input directory or file path> <output target directory>
If you are using Sqoop to import data from an RDBMS, you can use "-m 1" to produce a single result file, but that is somewhat apart from your question.
I hope my answer gives you some insight into the issue, thanks.

hadoop - map/reduce functionality

I just started looking into the hadoop and made the wordcount example work on a cluster(two datanodes) after going through some struggles.
But I have a question about Map/Reduce functionality. I read that during map, the input files/data is transformed into another form of data that can be efficiently processed during the reduce step.
Let's say that I have four input files(input1.txt, input2.txt, input3.txt, input4.txt) and want to read input files and transform into another form of data for reduce.
So here is the question. If I run the application (wordcount) on a cluster environment (two datanodes), are these four input files read on each datanode or two input files read on each datanode? And how can I check which file is read on which datanode?
Or does map(on each datanode) read files as some kind of block instead of reading an individual file?
See, Hadoop works on the basis of blocks rather than whole files, but input splits are computed per file: if each of the four files is smaller than the block size (128 MB, or 64 MB depending on configuration), each file forms a single InputSplit and is read by its own mapper, so four mappers run (unless something like CombineFileInputFormat is used to pack small files into one split). The chunk read by a mapper is known as an InputSplit. I hope that answers your question.
Files are divided into blocks, and blocks are spread across the datanodes in the cluster. Blocks are also replicated according to the replication factor (default 3), so each block can live on multiple nodes. The scheduler decides where to run each mapper based on which datanodes are available to process and where the data is located (this is where data locality comes into the picture). In your program (wordcount), each line is fed to the mapper one by one, not the whole file or block.
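The flow described in these answers (each line fed to the map function, results grouped by key before reduce) can be mimicked in a few lines of Python. This is a single-process sketch of the WordCount logic, not Hadoop API code:

```python
from collections import defaultdict

lines = ["hello world", "hello hadoop"]

# Map: each input line produces (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key, as the framework does between map and reduce.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # {'hello': 2, 'world': 1, 'hadoop': 1}
```

In a real cluster, the map calls run on whichever nodes hold the input splits, and the shuffle moves the grouped pairs across the network to the reducers.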

Reading contents of blocks directly in a datanode

In HDFS, the blocks are distributed among the active nodes/slaves. The content of the blocks is simple text, so is there any way to read or access the blocks present on each datanode?
As an entire file or to read a single block (say block number 3) out of sequence?
You can read the file via various mechanisms including the Java API but you cannot start reading in the middle of the file (for example at the start of block 3).
Hadoop reads a block of data and feeds each line to the mapper for further processing. Also, the Hadoop client fetches the blocks belonging to a file from different datanodes and concatenates them, so it should be possible to get the data from a particular block.
The Hadoop client code might be a good place to start looking. But HDFS provides a filesystem abstraction, and I am not sure what the requirement for reading the data from a particular block would be.
Assuming you have ssh access (and appropriate permissions) to the datanodes, you can cd to the path where the blocks are stored and read the blocks stored on that node (e.g., do a cat blk_XXXX). The configuration parameter that tells you where the blocks are stored is dfs.datanode.data.dir, which defaults to file://${hadoop.tmp.dir}/dfs/data.
Caveat: the block names are coded by HDFS depending on their internal block ID. Just by looking at their names, you cannot know to which file a block belongs.
Finally, I assume you want to do this for debugging purposes or just to satisfy your curiosity. Normally, there is no reason to do this and you should just use the HDFS web-UI or command-line tools to look at the contents of your files.
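If you do go down the ssh route described above, the manual inspection can be scripted. A minimal Python sketch (the layout of blk_* files with .meta companions follows the naming convention described above; the example path is made up):

```python
import os

def list_block_files(data_dir: str) -> list[str]:
    """Walk a datanode data directory and return the paths of block files:
    files named blk_<id>, excluding the .meta checksum companions."""
    found = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("blk_") and not name.endswith(".meta"):
                found.append(os.path.join(root, name))
    return found

# e.g. list_block_files("/tmp/hadoop/dfs/data/current")
```

Each returned path can be opened like an ordinary file (the equivalent of cat blk_XXXX), but as noted above, the file name alone does not tell you which HDFS file the block belongs to.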

Hadoop datanode blocks storing information

I want to find out how many blocks are stored on a particular datanode in the Hadoop cluster, and which files those blocks belong to. I have only a 2-node cluster.
Since you have only a 2-node cluster, all the blocks will be stored there. In general I don't think you can easily find which blocks are present on a datanode. What is the use case for this, by the way?
Go to the HDFS web UI by pointing your browser to NameNode_Machine:50070. Scroll down to the Cluster Summary and click on Live Datanodes. It will show all the datanodes currently available in a table whose last column gives the number of blocks on each.
And to find the relation between a file and its blocks, you can open that file in the web UI and scroll down. It will show all the blocks of that file along with the location of each block.
You can use the hadoop fsck command with the -locations argument to see the locations of each block.
Usage: hadoop fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
See the Hadoop commands reference page (search for fsck) for more information.

replace text in input file with hadoop MR

I am a newbie on the MR and Hadoop front.
I wrote an MR job to find missing values in a CSV file and it is working fine.
Now I have a use case where I need to parse a CSV file and encode it with the corresponding category.
ex: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38",.............
now this has to be replaced as "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3",.............
Here I am doing a mod of 10, but there will be different cases of mods.
The data size is in GBs.
I want to know how to replace the content of the input file in place. Is this achievable with MR?
Basically, I have not seen any Hadoop examples based on file handling or writing anywhere.
At this point I do not want to go to HBase or other DB tools.
You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs.
Adopting it is not as serious an infrastructure decision as adopting HBase.
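If Hive is not an option either, the usual MR alternative is a map-only job that writes a transformed copy to a new output directory. Below is a Hadoop-Streaming-style mapper sketch in Python. Note that the field rule here is inferred from the sample above (11 becomes 1 and 51 becomes 5, which is dropping the last digit rather than a strict mod 10), so adjust transform_field for your real cases:

```python
import io
import sys

def transform_field(field: str) -> str:
    """Drop the last digit of numeric fields (11 -> 1, 51 -> 5);
    pass non-numeric fields through unchanged."""
    return str(int(field) // 10) if field.isdigit() else field

def transform_line(line: str) -> str:
    return ",".join(transform_field(f) for f in line.split(","))

def run_mapper(in_stream, out_stream) -> None:
    # Hadoop Streaming feeds input splits line by line on stdin; the job's
    # output goes to a new HDFS directory, never back into the input file.
    for line in in_stream:
        out_stream.write(transform_line(line.rstrip("\n")) + "\n")

# Local dry run; a real streaming job would call run_mapper(sys.stdin, sys.stdout).
out = io.StringIO()
run_mapper(io.StringIO("11,abc,xyz,51,61,78\n"), out)
print(out.getvalue())  # 1,abc,xyz,5,6,7
```

You would then point the new output directory at downstream jobs and, if needed, delete the original input once the copy is verified.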