Review Google Research: Map Reduce

Nov 4 2015

Definition

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.

Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logic "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.

Hardware

dual-processor x86 processors running Linux, with 2-4GB of memory performance
commodity networking hardware - 100 MB/secs
machine failures are common
storage is in expensive IDE disk

Execution Overview

Master Data Strcutures

For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. ... The information is pushed incrementally to works that have in-progress reduce tasks.

Fault Tolerance

The master pings every worker periodically.

MapReduce is resilient to large-scale worker failures. The MapReduce master simply re-executed the work done by the unreachable worker machines.

However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master failes.

Locality

The MapReduce master takes the location information of the input files into account and attempts to schedule a map taks on a machine that contains a replica of the corresponding input data.

Backup Tasks

One of the common causes that lengthens the total time taken for a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few map or reduce taks in the computation. We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.

Partitioning/Combiner Function

Remaining questions after review

How reduce workers fetch data from different temporary files that map workers generated?
How to explain the map/reduce function as below with more concrete examples:

map (k1,v1) -> list(k2,v2) reduce (k2,list(v2)) -> list(v2)

Db2 Sql Tuning Tips

Sep 27 2015

List most critical tips from DB2 SQL Tuning Tips for z/OS Developers

Stay Away from Distinct if Possbile Details: group by or in or exists subquery if duplicates due to one-to-many relationship

Try Rewriting Range Predicates as Between Predicates

1 2	WHERE HIREDATE > :HV-DATE WHERE HIREDATE BETWEEN :HV-DATE and '9999-12-31'

Use Tools for Monitoring Details: Some of the most common z/OS DB2 monitoring tools are OMEGAMON®, TMon, MainView, Apptune, Insight for DB2, and DB2PM. Some of the most common DB2 LUW monitoring tools are IBM’s Optim™ Query Tuner, Precise for DB2, Quest, DBI Software, Embarcadero’s Performance Analyst, and ITGAIN.

Avoid Discrepancies with Non-Column Expressions

1 2	WHERE EDLEVEL = 123.45 * 12 WHERE EDLEVEL = SMALLINT(ROUND(123.45*12,0))

Ensure That Subquery Predicates Involving Min and Max Have the Possibility of Nulls Being Returned Handled
Always Code For Fetch Only or For Read Only with Cursor Processing When a Query Is Only Selecting Data
Avoid Selecting a Row from a Table to Help Decide Whether the Logic in the Code Should Execute an Update or an Insert
Use the DB2 V9 MERGE Statement
Keep Table and Index Files Healthy and Organized Details: SYSIBM.SYSTABLESPACESTATS and SYSIBM.SYSINDEXSPACESTATS tables
Consider OLTP Front-End Processing
1
OPTIMIZE FOR 14 ROWS

Take Advantage of Materialized Query Tables to Improve Response Time (Dynamic SQL Only)

CREATE TABLE DEPT_SUM_MQT_TABLE AS
(SELECT  WORKDEPT, AVG(SALARY) AS AVG_SAL,
AVG(BONUS) AS AVG_BONUS
AVG(COMM)  AS AVG_COMM,
SUM(SALARY) AS SUM_SAL,
SUM(BONUS) AS SUM_BONUS
SUM(COMM)  AS SUM_COMM
FROM EMP
GROUP BY  WORKDEPT)
DATA INITIALLY DEFERRED REFRESH DEFERRED
MAINTAINED BY SYSTEM
ENABLE QUERY OPTIMIZATION;
--original query
SELECT SUM(BONUS)
FROM EMP
WHERE  WORKDEPT = 'A00'
--rewrite by optimize
SELECT AVG_BONUS
FROM DEPT_SUM_MQT_TABLE
WHERE  WORKDEPT = 'A00'

Insert with Select

SELECT EMPNO, LASTNAME, MIDINIT, FIRSTNME,
DEPTNO, SALARY, BONUS, COMM
FROM FINAL TABLE
(INSERT   (EMPNO, LASTNAME, MIDINIT, FIRSTNME, DEPTNO)
VALUES ('123456', 'SMITH', 'A', 'JOE', 'A00')
)

Identify Times for Volatile Tables Details: Using the VOLATILE keyword on the Create or Alter Table statement identifies tables whose data volumes fluctuate.
1
2
3
4
CREATE TABLE EMP_TEMP
(EMPNO CHAR(6) Not Null,
LASTNAME VARCHAR(15) Not Null )
VOLATILE

Use the ON COMMIT DROP Enhancement

DECLARE GLOBAL TEMPORARY TABLE
SESSION.EMP_TEMP1  (COL1 CHAR (20)
COL2 CHAR (10)
COL3 SMALLINT )
ON COMMIT DROP TABLE
ON COMMIT DELETE ROWS
ON COMMIT PRESERVE ROWS

Use Multiple Distincts Details: This is an automatic performance gain that alleviates the need for separate calls to DB2.
1
2
3
SELECT SUM (DISTINCT SALARY),
AVG(DISTINCT BONUS)
FROM EMP;
Take Advantage of Backward Index Scanning Details: For each column that is in the Order By clause, the ordering that is specified in the index has to be exactly the opposite of the ordering that is specified in the Order By clause.
Set Your Clustering Index Correctly
Take Advantage of DB2 V8 Enhanced DISCARD Capabilities When It Comes to Mass Deletes
Consider Parallelism https://www-01.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/com.ibm.db2z11.doc.perf/src/tpc/db2z_enableparallelprocess.dita
Consider Direct Row Access Using ROWID Datatype (V8) or RID Function (V9) Details: If the ROWID value is then used in the search condition of subsequent updates or deletes, DB2 may navigate directly to the row, bypassing any index processing or tablespace scans.
1
EMP_ID ROWID NOT NULL GENERATED ALWAYS
Specify the Leading Index Columns in WHERE Clauses Details: For a composite index, the following query would use the index and execute matching processing as long as the leading column of the index is specified in the WHERE clause.
Keep in Mind Index Only Processing Whenever Possible Many times, developers code queries that need to execute one of the four column aggregate functions (Sum, Avg, Min, or Max). They can get some great results by including an extra column as part of an index. Take, for example, the following query and associated index, Index1 = (YEAR, PERIOD):
1
2
3
SELECT SUM(JRNL_AMT)
FROM LEDGER_TABLE
WHERE YEAR = 2008 AND PERIOD IN (1,2,3)
DB2 is good at recognizing the JRNL_AMT column in the index, even though there is no specific predicate on it, and it will choose index only processing.

Take Advantage of DB2 V9 Optimistic Locking

CREATE TABLE EMP
 (EMPNO  CHAR(6)  NOT NULL,
   ....
   .....
   EMP_UPD  TIMESTAMP NOT NULL
   GENERATED BY DEFAULT
   FOR EACH ROW ON UPDATE
   AS ROW CHANGE TIMESTAMP)
)

Use the DB2 V9 MERGE Statement https://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.sql.ref.doc/doc/r0010873.html

Code Boolean Term Predicates Whenever Possible Details: A Boolean term predicate is a predicate that is connected by And logic, so when one predicate is evaluated as false for a row, it then makes the entire Where clause evaluate to false.

SELECT EMPNO, LASTNAME, SALARY
FROM EMP
WHERE (LASTNAME > 'SMITH')
OR (LASTNAME = 'SMITH' and FIRSTNME > 'BOB')
SELECT EMPNO, LASTNAME, SALARY
FROM EMP
WHERE  LASTNAME >= 'SMITH'
  AND (LASTNAME >  'SMITH'   OR
      (LASTNAME = 'SMITH' AND FIRSTNME > 'BOB'))

Avoid Sorts with Order By Details: DB2 may avoid sorts if there are leading columns in an index to match the Order By or if the query contains an equal predicate on the leading column(s) of an index.

Watch Out for Case Logic

SELECT EMPNO, LASTNAME, SALARY,
       CASE WHEN SALARY < ...
            WHEN SALARY > ...
       ELSE
       END
FROM EMP

Details: If most of the rows have a Salary < 50000.00, then that piece of logic should be coded first.

Know Your Version of DB2

1 2	SELECT GETVARIABLE('SYSIBM.VERSION') FROM SYSIBM.SYSDUMMY1

If You Need True Uniqueness, Try the V8 Generate_Unique Function

CREATE TABLE ORDER_TBL
 (ORDER_ID        CHAR(13) FOR BIT DATA,
  ORDER_DATE      DATE,
  ORDER_LOC       CHAR(03) );
INSERT INTO ORDER_TBL
  VALUES(GENERATE_UNIQUE(), CURRENT DATE, 'NY');
SELECT TIMESTAMP(ORDER_ID), ORDER DATE, ....
FROM ORDER_TBL;

Update and Delete with Select (V9) Details: The words Old Table are SQL DB2 reserved words that need to be coded in order for this statement to work.

SELECT SALARY, BONUS
FROM OLD TABLE
 (UPDATE EMP
    SET SALARY = :HV1-SALARY,
    BONUS = :HV2-BONUS
    WHERE EMPNO = :HV-EMPNO)
SELECT LASTNAME, FIRSTNME, WORKDEPT
FROM OLD TABLE
  (DELETE FROM EMP
   WHERE EMPNO = :HV-EMPNO)

Avoid Unnecessary Sorting Details: Data sorting can be caused by Group By, Order By, Distinct, Union, Intersect, Except, and Join processing.

List most critical tips of store procedure related from DB2 SQL Tuning Tips for z/OS Developers

Thread Pool Executor or Producer Consumer Pattern

Sep 27 2015

In this topic, I would like to talk about more ThreadPoolExecutor in Java, comapred with ProducerConsumer multi-threading pattern in tradition.

TODO

References:

http://ifeve.com/blocking-threadpool-executor/

Java Generics Tips and Tricks

May 12 2015

Parameterized type: List<String>

Actual type parameter: String

Generic type: List<E>

Formal type parameter: E

Unbounded wildcard type: List<?>

Raw type: List

Bounded type parameter: <E extends Number>

Recursive type bound: <T extends Comparable<T>>

Bounded wildcard type: Bounded wildcard type

List<? extends Number>: static <E> List<E> asList(E[] a)

Type token: String.class

Jvm Internals

Feb 20 2015

1. JVM Structure

2. Class Loader SubSystem

Loading: finding and importing the binary data for a type
Linking: performing verification, preparation, and (optionally) resolution
1. Verification: ensuring the correctness of the imported type
2. Preparation: allocating memory for class variables and initializing the memory to default values
3. Resolution: transforming symbolic references from the type into direct references.
Initialization: invoking Java code that initializes class variables to their proper starting values.

Parent-Delegation Model

Bootstrap Class Loader: $JAVA_HOME/jre/lib/rt.jar, controlled by -Xbootclasspath option
Extension Class Loader: %JAVA_HOME/jre/lib/ext/*.jar, controlled by -Djava.ext.dirs system property
Application/System Class Loader: $CLASSPATH, controlled by -classpath/-cp/$CLASSPATH
User defined classloader: custom classpath [path on disk, a network address etc]

Itemreader Writer in Spring Batch

Jan 18 2015

References:

http://docs.spring.io/spring-batch/trunk/reference/html/readersAndWriters.html

Atomic Db Operation in Web Apps

Dec 28 2014

Recently I met concurrency issues while trying to design a table that multiple thread on different hosts can access in a high frequency. Therfore I had to reconsider the solution to this situation.

After digging couples of references, list possbile solutions as below:

distributed lock in memory, e.g. redis, zookeeper
pessimistic lock(select .. for update), you cannot bear update failure while high-frequency collision occurs.
optimistic lock(version field), you bears update failure when collision occurs.
mysql last_insert_id function, that only works with the same db connection. This will be infeasible for web application while using db connection pools.
split data across the machine and synchronized local thread
batch fetch and maintain in memory, this cannot avoid concurrency issues but on the other hand can decrease the chances on race condition.
You can also create counter table(UPDATE hit_counter SET cnt = cnt + 1 WHERE slot = RAND() * 100;), and retrieve statistics, just use aggregate queries(retrieve statistics, just use aggregate queries)

When concurrency issues happens in a high frequency rate, it is recommended to use distributed in-memory lock and perssimistic lock in order to keep avoiding update failure and retry logic in optimistic lock.

References:

Mysql High Performance Datatypes/Indexing

Nov 25 2014

Choosing Optimal Data Types

Smaller is usually better.
Simple is good. For example, integers are cheaper to compare than characters, because character sets and collations (sorting rules) make character comparisons compli- cated.
Avoid NULL if possible.

VARCHAR and CHAR types

BLOB and TEXT types

Using ENUM instead of a string type

DATETIME and TIMESTAMP types

Bit-Packed Data Types

A Mixture of Normalized and Denormalized

The most common way to denormalize data is to duplicate, or cache, selected columns from one table in another table. In MySQL 5.0 and newer, you can use triggers to update the cached values, which makes the implementation easier.

Cache and Summary Tables

Materialized Views

Counter Tables

Types of Indexes

Pending...

Transcation Management in Java

Nov 11 2014

Recently I encountered a great number of topics with transcation management whatever in a interview or web application.

How to handle read-write splitting?
How to handle transcation in distributed system, i.e. distributed transcation?

There are lots of Java solution in the wild, e.g. Spring transaction, Java Transaction API (JTA), JDBC, Hibernate, Java Persistence API (JPA), and Java Data Objects (JDO).

Pending...

References:

http://docs.spring.io/spring-framework/docs/current/spring-framework-reference/html/transaction.html

Learning Unit Test in Java

Nov 8 2014

The post aims to share my some thoughts of how to learn unit test in java, I will give a little bit more insights on how I am thinking on this. Hope this will bring you more inspirations.

There are two popular testing philosophys a.k.a TDD(Test-Driven Development) & BDD(Behaviour-Driven Development), I started to search some good frameworks with BDD. I knew about BDD from javascipt testing framework, e.g. Mocha, Jasmine, so I am pretty eager to know whether there are any popular BDD frameworks in Java perspective. After reading couples of posts, JBehave and Cucumber are relatively welcomed in the community. Cucumber works with Ruby, Java, .NET, Flex or web applications written in any language. It has been translated to over 40 spoken languages, and gains more stars on github. Moreiver, TDD is much more well-known for most Java developer like JUnit, TestNG etc.

BobbyTang's Tech Blog Learn Share Inspire

Definition

Hardware

Execution Overview

Master Data Strcutures

Fault Tolerance

Locality

Backup Tasks

Partitioning/Combiner Function

Remaining questions after review

1. JVM Structure

2. Class Loader SubSystem

Choosing Optimal Data Types

VARCHAR and CHAR types

BLOB and TEXT types

Using ENUM instead of a string type

DATETIME and TIMESTAMP types

Bit-Packed Data Types

A Mixture of Normalized and Denormalized

Cache and Summary Tables

Materialized Views

Counter Tables

Types of Indexes