Haifeng's Random Walk

LLM

Haifeng Li Jul 1, 2023

Generative artificial intelligence (GenAI), especially ChatGPT, has captured everyone’s attention. Transformer-based large language models (LLMs), trained on vast quantities of unlabeled data, demonstrate the ability to generalize to many different tasks. To understand why LLMs are so powerful, this post takes a deep dive into how they work.

Figure: LLM evolutionary tree.

Formally, a decoder-only language model is simply a conditional distribution p(x_i | x_1, ..., x_{i-1}) over the next token x_i given the context x_1, ..., x_{i-1}. With a finite context window, such a formulation is an example of a (higher-order) Markov process, which has been studied in many use cases. This simple setup also allows us to generate text token by token in an autoregressive way.
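To make the autoregressive setup concrete, here is a minimal Python sketch. The table NEXT_TOKEN_PROBS and the helper generate are hypothetical and exist only for illustration. The toy table conditions on the last token alone (a first-order Markov chain), which is far simpler than an LLM, but the sampling loop follows the same idea: sample a token, append it to the context, repeat.

```python
import random

# Hypothetical toy model: p(next token | previous token).
# A real LLM conditions on the whole context window, not just one token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}


def generate(rng, max_tokens=10):
    """Sample one token at a time, feeding each sample back as context."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        words = list(dist)
        weights = [dist[w] for w in words]
        nxt = rng.choices(words, weights=weights, k=1)[0]
        if nxt == "</s>":  # end-of-sequence token stops generation
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the start token


print(generate(random.Random(0)))
```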

Before our deep dive, I have to call out the limitation of this formulation for reaching artificial general intelligence (AGI). Thinking is a non-linear process, but our communication device, the mouth, can speak only linearly. Therefore, language appears as a linear sequence of words. It is a reasonable start to model language with a Markov process, but I doubt that this formulation can capture the thinking process (or AGI) completely. On the other hand, thinking and language are interrelated. A strong enough language model may still demonstrate some sort of thinking capability, as GPT4 shows. In what follows, let’s check out the scientific innovations that make LLMs appear intelligent.

Transformer

There are many ways to model or represent the conditional distribution p(x_i | x_1, ..., x_{i-1}). In LLMs, we estimate this conditional distribution with a neural network architecture called the Transformer. In fact, neural networks, especially a variety of recurrent neural networks (RNNs), had been employed in language modeling for a long time before the Transformer. RNNs process tokens sequentially, maintaining a state vector that contains a representation of the data seen prior to the current token. To process the n-th token, the model combines the state representing the sentence up to token n-1 with the information of the new token to create a new state, representing the sentence up to token n. Theoretically, the information from one token can propagate arbitrarily far down the sequence, if at every point the state continues to encode contextual information about the token. Unfortunately, the vanishing gradient problem leaves the model’s state at the end of a long sentence without precise, extractable information about preceding tokens. The dependency of token computations on the results of previous token computations also makes it hard to parallelize computation on modern GPU hardware.

These problems were addressed by the self-attention mechanism in the Transformer. The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The attention layer can access all previous states and weigh them according to a learned measure of relevance, providing relevant information about far-away tokens. Importantly, Transformers use an attention mechanism without an RNN, processing all tokens simultaneously and calculating attention weights between them in successive layers. Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel, which leads to improved training speed.

The input text is parsed into tokens by a byte-pair encoding (BPE) tokenizer, and each token is converted into an embedding vector. Then, positional information of the token is added to the embedding. The Transformer building blocks are scaled dot-product attention units. When a sentence is passed into a Transformer model, attention weights are calculated between every pair of tokens simultaneously. The attention unit produces an embedding for every token in context that contains information about the token itself along with a weighted combination of other relevant tokens, each weighted by its attention weight.

For each attention unit, the Transformer learns three weight matrices: the query weights W_Q, the key weights W_K, and the value weights W_V. For each token i, the input embedding is multiplied by each of the three weight matrices to produce a query vector q_i, a key vector k_i, and a value vector v_i. The attention weight from token i to token j is the dot product between q_i and k_j, divided by the square root of the key dimension d_k and normalized through a softmax over j. The output of the attention unit for token i is the weighted sum of the value vectors of all tokens, weighted by the attention from token i to each token j. With the q_i, k_i, and v_i stacked as the rows of matrices Q, K, and V, the attention calculation for all tokens can be expressed as one large matrix calculation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
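The attention computation described above can be sketched in a few lines of plain Python. This is an illustration only: the function names softmax and attention are my own, vectors are plain lists, and the queries, keys, and values are passed in directly instead of being produced by learned weight matrices.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return outputs


# Two tokens with 2-dimensional queries/keys and 1-dimensional values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [3.0]]
print(attention(Q, K, V))
```

Each output row is a convex combination of the value vectors, so here every output lies between 1.0 and 3.0, pulled toward the value whose key best matches the query.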

One set of (W_Q, W_K, W_V) matrices is called an attention head, and each Transformer layer has multiple attention heads. With multiple attention heads, the model can capture different kinds of relevance between tokens. The computations for each attention head can be performed in parallel, and the outputs are concatenated and projected back to the same input dimension by a matrix W_O.

In an encoder, there is a fully-connected multilayer perceptron (MLP) after the self-attention mechanism. The MLP block further processes each output encoding individually. In the encoder-decoder setting (e.g. for translation), an additional attention mechanism is inserted between the self-attention and the MLP in the decoder to draw relevant information from the encodings generated by the encoders. In a decoder-only architecture, this is not necessary. In both the encoder-decoder and the decoder-only architecture, the decoder must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow, which allows for autoregressive text generation. To generate token by token, the last decoder block is followed by a softmax layer that produces the output probabilities over the vocabulary.
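The masking step can be sketched as follows. This is a simplified illustration in plain Python, and the name causal_softmax is hypothetical. It takes a square matrix of raw attention scores and lets position i attend only to positions j <= i, so the prediction of the next token never sees that token or anything after it.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # drop the future positions j > i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


scores = [
    [0.0, 5.0, 5.0],
    [0.0, 0.0, 9.0],
    [0.0, 0.0, 0.0],
]
for row in causal_softmax(scores):
    print(row)
```

Note that the large scores above the diagonal have no effect on the result, which is the point of the mask.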

Supervised Fine-Tuning

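Decoder-only GPT is essentially an unsupervised (or self-supervised) pre-training algorithm. Given an unlabeled corpus of tokens U = {u_1, ..., u_n} and model parameters Θ, it maximizes the following likelihood:

$$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$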

where k is the size of the context window. While the architecture is task-agnostic, GPT demonstrates that large gains on natural language inference, question answering, semantic similarity, and text classification can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.

After pre-training the model with the above objective, we can adapt the parameters to the supervised target task. Assume a labeled dataset C, where each instance consists of a sequence of input tokens x^1, ..., x^m along with a label y. The inputs are passed through the pre-trained model to obtain the final transformer block’s activation h_l^m, which is then fed into an added linear output layer with parameters W_y to predict y:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$$

Correspondingly, we have the following objective function to maximize:

$$L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$$

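In addition, it helps to include language modeling as an auxiliary objective, as it improves the generalization of the supervised model and accelerates convergence. That is, we optimize the following objective, where L_1 is the language modeling likelihood evaluated on C, L_2 is the supervised objective, and λ is a weight:

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$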

Text classification can be directly fine-tuned as described above. Other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since the pre-trained model was trained on contiguous sequences of text, it needs some modifications to apply to these tasks.

Textual entailment: concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.
Similarity: there is no inherent ordering of the two sentences being compared. Therefore, the input is built in both possible sentence orderings (with a delimiter in between), and each ordering is processed independently to produce two sequence representations, which are added element-wise before being fed into the linear output layer.
Question Answering and Commonsense Reasoning: each sample has a context document z, a question q, and a set of possible answers {a_k}. GPT concatenates the document context and question with each possible answer, adding a delimiter token in between to get [z;q;$;a_k]. Each of these sequences is processed independently, and the results are then normalized via a softmax layer to produce an output distribution over possible answers.
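The three input transformations above can be sketched as simple string builders. This is a rough illustration: the function names are hypothetical, and the sketch works on whitespace-joined strings, whereas the real model operates on token sequences (possibly with additional special tokens).

```python
DELIM = "$"  # delimiter token placed between the parts of a structured input


def entailment_input(premise, hypothesis):
    """Premise and hypothesis joined by the delimiter."""
    return f"{premise} {DELIM} {hypothesis}"


def similarity_inputs(sentence_a, sentence_b):
    """Both orderings; the model processes each one independently."""
    return [
        f"{sentence_a} {DELIM} {sentence_b}",
        f"{sentence_b} {DELIM} {sentence_a}",
    ]


def qa_inputs(document, question, answers):
    """One sequence [z; q; $; a_k] per candidate answer."""
    return [f"{document} {question} {DELIM} {answer}" for answer in answers]


for seq in qa_inputs("Tom fed the cat.", "Who fed the cat?", ["Tom", "Ann"]):
    print(seq)
```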

Zero-Shot Transfer (aka Meta Learning)

While GPT shows that supervised fine-tuning works well on task-specific datasets, achieving strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Interestingly, GPT2 demonstrates that language models begin to learn multiple tasks without any explicit supervision, conditioned on a document plus questions (aka prompts).

Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model p(output|input, task). Previously, task conditioning was often implemented at an architectural level or at an algorithmic level. But language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). In particular, to translate, GPT2 is conditioned on a context of example pairs of the format english sentence = french sentence. After a final prompt of english sentence =, we sample from the model with greedy decoding and use the first generated sentence as the translation.

Similarly, to induce summarization behavior, GPT2 adds the text TL;DR: after the article and generates 100 tokens with top-k random sampling with k = 2, which reduces repetition and encourages more abstractive summaries than greedy decoding. Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer).
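Top-k sampling restricts each sampling step to the k most likely tokens. Below is a minimal sketch in plain Python, with a hypothetical function name and a made-up list of logits. With k = 1 it reduces to greedy decoding, while k = 2 leaves some randomness.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    # Unnormalized softmax weights over the surviving tokens;
    # rng.choices normalizes them internally.
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.1, 2.0, 1.5, -3.0]  # made-up scores over a 4-token vocabulary
print([top_k_sample(logits, 2, rng) for _ in range(10)])
```

With these logits, only indices 1 and 2 can ever be sampled when k = 2.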

Note that zero-shot transfer is different from zero-shot learning in the next section. In zero-shot transfer, “zero-shot” means that no gradient updates are performed, but it often involves providing inference-time demonstrations to the model (e.g. the above translation example), so it is not truly learning from zero examples.

I find an interesting connection between this meta learning approach and Montague semantics, which is a theory of natural language semantics and of its relationship with syntax. In 1970, Montague formulated his views:

There is in my opinion no important theoretical difference between natural languages and the artificial languages of logicians; indeed I consider it possible to comprehend the syntax and semantics of both kinds of languages with a single natural and mathematically precise theory.

Philosophically, both zero-shot transfer and Montague semantics treat natural language the same way as a programming language. LLMs capture the task through the embedding vectors in a black-box approach, and it is not clear to us how this really works. In contrast, the most important feature of Montague semantics is its adherence to the principle of compositionality: the meaning of the whole is a function of the meanings of its parts and their mode of syntactic combination. This may be an approach to improve LLMs.

In Context Learning

GPT3 shows that scaling up language models greatly improves task-agnostic, few-shot performance. GPT3 further specializes the description to “zero-shot”, “one-shot”, or “few-shot” depending on how many demonstrations are provided at inference time: (a) “few-shot learning”, or in-context learning, where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot learning”, where no demonstrations are allowed and only an instruction in natural language is given to the model.

For few-shot learning, GPT3 evaluates each example in the evaluation set by randomly drawing K examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. K can be any value from 0 to the maximum amount allowed by the model’s context window, which is n_ctx = 2048 for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better.
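The few-shot setup can be sketched as a prompt builder. The details here are assumptions for illustration: the function name is hypothetical, the "source = target" demonstration format is borrowed from the translation example earlier in this post, and whitespace-separated words stand in as a crude proxy for real tokens when checking the context budget.

```python
import random


def build_few_shot_prompt(train_pairs, query, k, rng, n_ctx=2048, sep="\n\n"):
    """Draw up to k random demonstrations, keeping only those that fit."""
    demos = rng.sample(train_pairs, min(k, len(train_pairs)))
    tail = f"{query} ="  # the final, unanswered example
    used = len(tail.split())
    parts = []
    for source, target in demos:
        demo = f"{source} = {target}"
        cost = len(demo.split())
        if used + cost > n_ctx:  # stop once the context budget is exhausted
            break
        parts.append(demo)
        used += cost
    return sep.join(parts + [tail])


pairs = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
print(build_few_shot_prompt(pairs, "cheese", k=2, rng=random.Random(0)))
```

Setting k = 0 yields the zero-shot prompt, which contains only the query.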

For some tasks GPT3 also uses a natural language prompt in addition to (or for K = 0, instead of) demonstrations. On tasks that involve choosing one correct completion from several options (multiple choice), the prompt includes K examples of context plus correct completion, followed by one example of context only, and the evaluation process compares the model likelihood of each completion.
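For multiple choice, the evaluation boils down to scoring every candidate completion under the model and picking the best. In the sketch below, pick_completion, toy_log_prob, and the score table are all hypothetical. A real evaluation would query the language model for the log-likelihood of each completion given the prompt.

```python
def pick_completion(context, options, log_prob):
    """Return the option the model scores highest given the context."""
    return max(options, key=lambda option: log_prob(context, option))


# Hypothetical stand-in for a language model's log-likelihoods.
TOY_SCORES = {"Paris": -0.2, "London": -2.5, "Rome": -3.1}


def toy_log_prob(context, completion):
    return TOY_SCORES[completion]


print(pick_completion("The capital of France is", list(TOY_SCORES), toy_log_prob))
```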

On tasks that involve binary classification, GPT3 gives the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treats the task like multiple choice.

On tasks with free-form completion, GPT3 uses beam search. The evaluation process scores the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.

Fine-Tuning
Pro: strong performance.
Con: the need for a new large dataset for every task; the potential for poor generalization out-of-distribution; the potential to exploit spurious features of the training data, potentially resulting in an unfair comparison with human performance.
Notes: involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task.

Few-Shot
Pro: a major reduction in the need for task-specific data; reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset.
Con: a small amount of task-specific data is still required.
Notes: the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed.

One-Shot
Notes: same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task. The reason to distinguish one-shot from few-shot and zero-shot is that it most closely matches the way in which some tasks are communicated to humans.

Zero-Shot
Pro: maximum convenience; potential for robustness; avoidance of spurious correlations.
Con: the most challenging setting.
Notes: no demonstrations are allowed, and the model is only given a natural language instruction describing the task.

Model Size Matters (So Far)

The capacity of the language model is essential to the success of task-agnostic learning and increasing it improves performance in a log-linear fashion across tasks. GPT-2 was created as a direct scale-up of GPT-1, with both its parameter count and dataset size increased by a factor of 10. But it can perform downstream tasks in a zero-shot transfer setting – without any parameter or architecture modification.

GPT3 uses the same model and architecture as GPT2, with the exception of using alternating dense and locally banded sparse attention patterns in the layers of the transformer.

Model Size

On TriviaQA, GPT3’s performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains over zero-shot behavior.

Data Quality Matters

While less discussed, data quality matters too. Datasets for language models have rapidly expanded. For example, the CommonCrawl dataset constitutes nearly a trillion words, which is sufficient to train the largest models without ever updating on the same sequence twice. However, it was found that unfiltered or lightly filtered versions of CommonCrawl tend to have lower quality than more curated datasets.

Therefore, GPT2 created a new web scrape that emphasizes document quality by scraping all outbound links from Reddit that received at least 3 karma, a heuristic indicator for whether other users found the link interesting, educational, or just funny. The final dataset contains slightly over 8 million documents for a total of 40 GB of text after de-duplication and some heuristic-based cleaning.

Further, GPT3 took 3 steps to improve the average quality of its datasets: (1) filtered CommonCrawl based on similarity to a range of high-quality reference corpora, (2) performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of the held-out validation set as an accurate measure of overfitting, and (3) added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Similarly, GLaM develops a text quality classifier to produce a high-quality web corpus out of an original larger raw corpus. This classifier is trained to classify between a collection of curated text (Wikipedia, books and a few selected web-sites) and other webpages. GLaM uses this classifier to estimate the content quality of a webpage and then uses a Pareto distribution to sample webpages according to their score. This allows some lower-quality webpages to be included to prevent systematic biases in the classifier.
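The idea of sampling pages by classifier score with a heavy-tailed distribution can be sketched as follows. The exact rule and parameters are not given above, so both the keep condition (a Pareto draw compared against 1 - score) and alpha = 9 are assumptions chosen for illustration, and keep_document is a hypothetical name. The effect is that high-scoring pages almost always pass while low-scoring pages still pass occasionally.

```python
import random


def keep_document(score, rng, alpha=9.0):
    """Stochastically keep a page given its quality score in [0, 1]."""
    # random.paretovariate returns values >= 1, so subtract 1 to get a
    # heavy-tailed draw that starts at 0.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


rng = random.Random(0)
for score in (0.0, 0.5, 0.9, 1.0):
    kept = sum(keep_document(score, rng) for _ in range(10_000))
    print(f"score={score:.1f} kept {kept} of 10000")
```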

Table: Data and mixture weights in the GLaM training set.

GLaM also sets the mixture weights based on the performance of each data component in a smaller model, while preventing small sources such as Wikipedia from being over-sampled.

Chain of Thought

As pointed out earlier, predicting the next token is not the same as the thinking process. Interestingly, some reasoning and arithmetic ability of LLMs can be unlocked by chain-of-thought prompting. A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output. Sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting: ⟨input, chain of thought, output⟩. Why and how it works is not clear to us though.

Reinforcement Learning from Human Feedback (RLHF)

The language modeling objective used for LLMs—predicting the next token—is different from the objective “follow the user’s instructions helpfully and safely”. Thus, we say that the language modeling objective is misaligned.

InstructGPT aligns language models with user intent on a wide range of tasks by using reinforcement learning from human feedback (RLHF). This technique uses human preferences as a reward signal to fine-tune models.

Step 1: Collect demonstration data, and train a supervised policy. Labelers provide demonstrations of the desired behavior on the input prompt distribution. Then fine-tune a pre-trained GPT3 model on this data using supervised learning.

Step 2: Collect comparison data, and train a reward model (RM). Collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. Then train a reward model to predict the human-preferred output.

Step 3: Optimize a policy against the reward model using PPO. Use the output of the RM as a scalar reward. Fine-tune the supervised policy to optimize this reward using the PPO algorithm.

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy.
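To give a feel for Step 2, here is a sketch of a pairwise logistic loss that is commonly used to train a reward model on comparisons. The steps above do not spell out the loss, so treat this form and the name pairwise_loss as assumptions. The loss is small when the reward model scores the human-preferred output above the rejected one, and large otherwise.

```python
import math


def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed stably."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(pairwise_loss(2.0, 0.0))  # RM agrees with the labeler: low loss
print(pairwise_loss(0.0, 2.0))  # RM disagrees with the labeler: high loss
```

In Step 3, the scalar output of the trained RM is then used as the reward that the policy is optimized against.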

Instruction Fine-Tuning

While supervised fine-tuning introduced in GPT-1 focuses on task specific tuning, T5 is trained with a maximum likelihood objective (using “teacher forcing”) regardless of the task. Essentially, T5 leverages the same intuition as zero-shot transfer that NLP tasks can be described via natural language instructions, such as “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” To specify which task the model should perform, T5 adds a task-specific (text) prefix to the original input sequence before feeding it to the model. Further, FLAN explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data.

For each dataset, FLAN manually composes ten unique templates that use natural language instructions to describe the task for that dataset. While most of the ten templates describe the original task, to increase diversity, FLAN also includes up to three templates per dataset that “turned the task around” (e.g., for sentiment classification, templates asking to generate a movie review). FLAN then instruction-tunes a pretrained language model on the mixture of all datasets, with examples in each dataset formatted via a randomly selected instruction template for that dataset.

So-called prompt engineering is essentially reverse engineering of how the training data were prepared for instruction fine-tuning and in-context learning.

Retrieval Augmented Generation (RAG)

Because of the cost and time of training, LLMs in production often lag in terms of training data freshness. To address this issue, we may use LLMs with Retrieval Augmented Generation (RAG). In this use case, we do not want the LLM to generate text based solely on the data it was trained on, but rather want it to incorporate other external data in some way. With RAG, LLMs can also answer (private) domain-specific questions. Therefore, RAG is also referred to as “open-book” question answering. LLM + RAG could be an alternative to the classic search engine. In other words, it acts as information retrieval with hallucination.

Currently, the retrieval part of RAG is often implemented as a k-nearest neighbor (similarity) search on a vector database that contains the vector embeddings of external text data. For example, Dense Passage Retrieval (DPR) formulates encoder training as a metric learning problem. However, we should note that information retrieval is generally based on relevance, which is different from similarity. I expect that there will be many more improvements in this area in the future.
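The retrieval step can be sketched with a brute-force nearest neighbor search. Everything here is illustrative: the function names are hypothetical, the 3-dimensional "embeddings" are hand-written, and a real system would use a learned encoder and a vector database with approximate search. The retrieved passages are then placed into the prompt so that the LLM can ground its answer in them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k (text, score) pairs most similar to the query."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(question, passages):
    """Put the retrieved passages in front of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hand-written toy embeddings; a real system would compute these.
index = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Refund requests need a receipt.", [0.8, 0.3, 0.1]),
]
hits = retrieve([1.0, 0.0, 0.0], index, k=2)
print(build_prompt("How long do refunds take?", [text for text, _ in hits]))
```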

Conclusion

LLMs are an exciting area and will experience rapid innovation. I hope that this post helps you understand a little of how they work. Besides the excitement, we should also note that LLMs learn language in a very different way from humans: they lack access to the social and perceptual context that human language learners use to infer the relationship between utterances and speakers’ mental states. They are also trained in a way that differs from the human thinking process. These could be the areas in which to improve LLMs or to invent new paradigms of learning algorithms.

http://haifengl.wordpress.com/?p=1212

Will Libra Succeed?

Haifeng Li Jun 21, 2019

Facebook has finally revealed its cryptocurrency, Libra, which it will launch in 2020. As with other cryptocurrencies, users can access Libra through apps and use it to pay for things or to send money to each other. While Libra shows a lot more potential thanks to its careful design choices, Facebook lacks expertise and experience in regulations and financial service operations. In this post, we will investigate the potential and challenges of Libra.

First of all, Libra explicitly targets people who are unbanked (or underbanked). And it promises a cheaper way (near-zero fees) for people to transact and borrow money compared to traditional banking or payday loans. Nonconsumption is the best starting point for disruptive innovations. In fact, nonconsumption is the fiercest competitor of incumbents, and it is winning on almost every battlefield. Clearly, Facebook is a believer in the theory of disruptive innovation and has wisely chosen its beachhead. Both Google and Apple have tried to get into the payment business but have achieved only modest adoption because they chose the wrong market. iPhone users already have multiple credit cards in their wallets. Android users must have a phone with NFC to pay in stores with Google Pay. Those high-end Android phone owners are not in desperate need of yet another payment approach. In contrast, Alipay and WeChat achieved huge success in mobile payment because China was a cash-based society. Not many people had credit cards, and small businesses had no reliable and secure way to accept online payments. Alipay and WeChat filled the void and provided a simple solution with QR codes. Even the cheapest phone has a camera, right? Similarly, Facebook proudly talks about a $40 mobile phone in the Libra White Paper. It is an open secret that Facebook wants to replicate WeChat’s success with its own messaging platform.

Libra is a currency, not an asset like Bitcoin. To avoid the gambling vibe of Bitcoin and other cryptocurrencies, Libra’s value is backed by a reserve of real assets. A basket of bank deposits and short-term government securities will be held in the Libra Reserve for every Libra that is created, building trust in its intrinsic value. This isn’t a coin that you buy because you think it will grow 100 times as valuable. It is more like exchanging a dollar for a Euro.

Facebook knows that it will face a hard fight and that it cannot win by itself. Facebook needs allies, a lot of them. And it did a really good job of building the Libra Association. The Libra Association is an independent, not-for-profit membership organization headquartered in Geneva, Switzerland. It has an impressive list of initial participants, which cover a variety of industries and form a solid foundation for the whole ecosystem. Although Facebook is expected to maintain a leadership role through 2019, final decision-making authority rests with the association. Facebook should be praised for good leadership and fair play.

Libra also makes good decisions on technical details. For example, it chooses a permissioned blockchain (for now) to ensure the scalability to serve billions of users. It also adopts a Byzantine Fault Tolerant (BFT) consensus approach, which enables high transaction throughput, low latency, and a more energy-efficient approach to consensus than the “proof of work” used in Bitcoin. Overall, Libra resolves many issues with Bitcoin, which we discussed in What’s Wrong with Bitcoin two years ago (when Bitcoin was reaching its peak). However, Libra is not flawless. The following challenges will have a big impact on how it fares.

“Don’t fight the Fed” is an old investing mantra on Wall Street. Although it refers to an investment policy, I would like to borrow it here literally. Libra is a currency, and the Libra Association is essentially the central bank of Libra: it is the only party able to mint and burn Libra. In fact, the Libra Reserve acts as a “buyer of last resort”. Of course, central banks don’t like a new competitor, especially one that will operate globally. French Finance Minister Bruno Le Maire said that Libra shouldn’t be allowed to become a sovereign currency, while Markus Ferber, a German member of the European Parliament, warned that Facebook could become a “shadow bank.” US lawmakers are not shy either. They called for Facebook to delay the launch of Libra.

Bank of England Governor Mark Carney appeared more open to the scheme, saying he’s keeping an open mind, although he added the caveat that Libra would have to face regulation: “Anything that works in this world will become instantly systemic and will have to be subject to the highest standards of regulation.”

To be fair, central bankers and regulators are not paranoid. It is a reasonable requirement that the Libra Association know who its customers are (KYC) and that it have strong anti-money-laundering (AML) controls. But Libra, like other cryptocurrencies, is pseudonymous. It is challenging (if possible at all) to achieve KYC and AML goals under such settings. Besides, it is well known that complying with all the rules can be an onerous and expensive business for any money-transfer firm. All major banks spend well above $100 million annually on AML and still struggle to satisfy the regulators. Facebook seems ignorant on this topic. Even though the white paper says “We believe that collaborating and innovating with the financial sector, including regulators and experts across a variety of industries, is the only way to ensure that a sustainable, secure and trusted framework underpins this new system.”, there is not a single word about KYC and AML.

Another issue with Libra is that, to be a stable global currency, it is backed by a basket of fiat assets. It’s important to highlight that one Libra will not always convert into the same amount of a given local currency (i.e., Libra is not pegged to a single currency). Rather, as the value of the underlying assets moves, the value of one Libra in any local currency may fluctuate. Most of us make only domestic payments in everyday life. If so, why would I have to take on exchange-rate risk?

A deeper question is “do we really need a global currency?” In the 1990s, Motorola believed that businesspeople would carry a satellite phone around the world and make calls anywhere. Unfortunately, the imagined demand didn’t exist, and the failure of Iridium became a classic business case study. Facebook says many times in the white paper that a global currency should be designed but doesn’t provide any evidence to support the claim. If I were a $40 mobile phone owner, how likely would I be to need global transactions? Although I can start a presentation with the phrase “Imagining a world in which …” and list some use cases, any business venture needs to start with real demand that exists now.

In summary, Facebook has developed a better cryptocurrency and avoids many of the issues with Bitcoin that are caused by an anti-economics mindset. However, Facebook lacks expertise and experience in regulations and financial service operations. Its ignorance of KYC, AML, FX risk, etc. may be relieved a little by other founding members of the Libra Association. But the Libra Association is skewed toward tech companies and doesn’t have a heavyweight banking partner yet. I would not expect Libra to enjoy overnight success in OECD countries. On the other hand, it does provide value in high-inflation areas and where people are unbanked or underbanked.

http://haifengl.wordpress.com/?p=1206

Two Fundamental Changes in Apache Spark

Haifeng Li May 9, 2019

Barrier execution mode and Delta Lake are two new Apache Spark features. Interestingly, they break away from the roots of Apache Spark. Let’s figure out together what they are and why they were developed. More importantly, will they be a success?

Essentially, Spark is a better implementation of MapReduce. In MapReduce/Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. This fundamental assumption enables Spark to hide the complexity of scheduling from developers. It also brings elasticity with the help of an underlying resource manager. Finally, fault tolerance becomes simple, as the scheduler can rerun a task at any time (with the great help of the immutability of RDDs).

However, this nice yet simple parallel compute model meets new challenges when AI jobs show up. Spark has been very popular for data wrangling. After cleaning up the data with Spark, developers naturally want to build machine learning models with Spark too, in the same pipeline. However, the MapReduce pattern is not suitable for machine learning in most cases. Many machine learning algorithms rely on complex communication patterns. As a baby step, SPARK-24374 introduces the barrier execution mode. Using this new execution mode, Spark launches all training tasks together and restarts all tasks in case of task failures. Spark also introduces a new fault tolerance mechanism for barrier tasks: when any barrier task fails in the middle, Spark aborts all the tasks and restarts the stage. This mechanism can be used to support the All-Reduce pattern to accelerate distributed TensorFlow training.
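The all-or-nothing restart semantics can be illustrated with a small simulation in plain Python. This is not the Spark API: run_barrier_stage, steady, and flaky are hypothetical names, and the tasks run one after another here, whereas real barrier tasks are launched together and run concurrently. The point is only that a single failing task throws away the work of every other task in the stage.

```python
def run_barrier_stage(tasks, max_attempts=3):
    """Run every task; if any one fails, discard all results and rerun all."""
    for attempt in range(1, max_attempts + 1):
        try:
            results = [task(attempt) for task in tasks]
        except RuntimeError:
            continue  # one failure aborts the whole stage
        return attempt, results
    raise RuntimeError("stage failed after all attempts")


def steady(attempt):
    return "ok"


def flaky(attempt):
    # Simulated failure on the first attempt only.
    if attempt == 1:
        raise RuntimeError("lost executor")
    return "ok"


print(run_barrier_stage([steady, flaky, steady]))
```

Under the ordinary MapReduce-style model, only the failed task would be rerun. Here the first task has to run twice even though it never failed.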

However, there are two major issues with this new barrier scheduling. First, it is far from a complete solution for the various communication patterns used in distributed machine learning. Should the community keep introducing new scheduling and/or communication APIs for AI jobs? It is a hard question. An important merit of Spark is its clean and easy-to-use API that hides all the complexity of parallel computing, all based on the MapReduce model. In contrast, MPI supports all kinds of communication and synchronization patterns but exposes a very complicated interface. We might gradually lose the simplicity of Spark if we keep adding new compute models. Second, SPARK-24374 is a step back from elasticity and fault tolerance. In a large cluster, hardware and software failures happen almost every day. A week-long deep learning job might struggle to finish in a large cluster, as the scheduler aborts all the tasks and restarts the stage whenever any barrier task fails. Also, resources are dynamically allocated in a large cluster to maximize utilization. As the barrier scheduler starts all the tasks at the same time, we would lose many opportunities to leverage elastic compute resources.
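A back-of-the-envelope sketch shows why restart-on-any-failure hurts long jobs. The failure rate below (one failure per 10,000 task-hours) and the assumption that failures are independent are purely illustrative, not measurements. Under those assumptions, the chance that a barrier stage runs for a week without a single task failing shrinks exponentially with the number of tasks.

```python
def prob_no_failure(num_tasks, hours, hourly_failure_rate):
    """Probability that no task fails during the run, assuming independent failures."""
    return (1.0 - hourly_failure_rate) ** (num_tasks * hours)


if __name__ == "__main__":
    week = 7 * 24                  # hours in a week-long training job
    assumed_rate = 1e-4            # hypothetical per-task, per-hour failure probability
    for num_tasks in (10, 100, 1000):
        p = prob_no_failure(num_tasks, week, assumed_rate)
        print(f"{num_tasks:5d} tasks: P(no restart in a week) = {p:.2e}")
```

With independent task scheduling a failure costs one task rerun; with barrier scheduling every failure costs a restart of the whole stage, so checkpointing becomes essential.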

To meet customer demand, Spark vendors have to add support for AI jobs. The question is how to strike a balance between simplicity of the programming model, flexible communication and synchronization patterns, and elasticity and fault tolerance of the compute infrastructure. Computer scientists have been struggling for 50 years to find such a perfect model.

On the other hand, we should also ask ourselves if we have to do everything in one system, especially when we use a product in a way that it was not designed for.

Alternatively, I suggest that we develop a new platform for distributed machine learning. Like Spark, it should do one thing, and do it well. Meanwhile, this new system should run side by side with Spark on the same infrastructure, which is managed by a general-purpose resource manager. In addition, it should be able to access RDDs and DataFrames from Spark (the lifecycle of RDDs and DataFrames should go beyond jobs). That way, we will get the best of both worlds.

Delta Lake is another interesting development by Databricks. It is a storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of an existing data lake and is fully compatible with Apache Spark APIs.

I will focus on the business impact rather than the technical details of Delta Lake in this post, as I speculate that Delta Lake was developed to meet the (new) business strategy of Databricks. Unlike Hadoop, Spark has been a pure compute engine. Now Spark steps into the storage layer too with Delta Lake. To understand why, let’s look at Hadoop. After 10 years, people realized that what they want from Hadoop is neither HDFS nor MapReduce, but a data warehouse. With Hive and Impala (and many other SQL-on-Hadoop solutions), Hadoop has indeed become a legitimate data warehouse.

To survive, Databricks needs to go beyond ETL, and Spark could become a good data warehouse solution with a lot of innovations in SparkSQL. However, innovations in the compute engine alone are not sufficient to compete with market leaders. Any successful data warehouse must invest heavily in the storage layer to gain a performance advantage. On the other hand, in the era of cloud computing, the compute layer and the storage layer are decoupled. It doesn’t make sense to introduce a full-blown storage solution nowadays as Hadoop did 10 years ago. Therefore, Databricks wisely designed Delta Lake to work on an existing data lake while adding many missing pieces of the data warehouse puzzle, especially ACID transactions, metadata and schema management. I expect they will continue their efforts in these areas in the next phase because the current implementation still falls short of a real data warehouse.

If my prediction is correct, Databricks may become a strong cloud-native data warehouse provider in a couple of years, competing with the market leader Snowflake Computing. Maybe they took a page from Snowflake’s playbook.

http://haifengl.wordpress.com/?p=1200

Cloud is Not Cheap

Haifeng Li Mar 20, 2019

Lyft is going public. Its S-1 filing reveals that “[Lyft] committed to spend an aggregate of at least $300 million between January 2019 and December 2021, with a minimum amount of $80 million in each of the three years, on AWS services.” If its usage of Amazon’s cloud doesn’t hit or exceed that $300 million threshold, Lyft will have to pay the difference.

It echoes Snap’s (Snapchat parent company) S-1 filing in 2017: “[Snap] have committed to spend $2 billion with Google Cloud over the next five years and have built our software and computer systems to use computing, storage capabilities, bandwidth, and other services provided by Google, some of which do not have an alternative in the market.”

Cloud is great for a startup to grow its business when it lacks the capital and skills to build its IT infrastructure. But once it grows up, cloud may not make sense anymore. Some large-scale web companies have actually moved from the cloud to their own data centers. Dropbox shared its story in its S-1 filing. Over two years, Dropbox shaved $74.6 million off its operational expenses primarily because of the move.

It is not only about cost but also about innovation. If compute is part of your core competency, building your own infrastructure is worth heavy investment, as every business has its unique compute challenges. With AWS, Dropbox struggled to handle the explosion of data it was dealing with. While building in-house data centers, Dropbox designed and built its own storage system. Codenamed Diskotech, the system hosts 500 petabytes of data with 3-5x better performance at tail latency.

It is time for a reality check of your cloud strategy. Check out my previous post on private cloud strategy.

http://haifengl.wordpress.com/?p=1195

Private cloud is dead. Long live private cloud.

Haifeng Li Feb 26, 2019

To remain valuable and relevant to their lines of business (LOB), in-house IT organizations have to be able to deliver private cloud services that are competitive with public cloud services. Correspondingly, the community has been developing private cloud solutions such as OpenStack, which mimic public cloud services by offering virtual machines and other resources. However, corporations are realizing they could never build an AWS-like cloud in-house. In 2018, Walmart, the biggest cheerleader of OpenStack, announced that it had signed a five-year deal to make Microsoft Azure its preferred cloud provider. Although Walmart won’t leave OpenStack completely in the dust, the deal does raise questions about Walmart’s future use of OpenStack.

The idea of private cloud may not be dead yet but it has definitely gone through near-death experiences. There are plenty of articles declaring that private cloud is dead. The analysis ranges from the technical difficulties, operational challenges, and slowness of enterprise IT processes to the economics of public cloud, etc. However, it is very difficult for a large enterprise to go all-in on public cloud. The value of private cloud is still undeniable. So there are new solutions such as AzureStack to deploy a public cloud solution on premises. Hipsters also claim that containers and Kubernetes will rule the hyperscale data centers.

Unfortunately, all these efforts are not sufficient unless we understand the fundamental problems with corporate data centers. To find out why, let’s start with this video, an interview with actors Gene Hackman and Dustin Hoffman.

If you cannot access YouTube for whatever reason, here is the story. Hoffman asked to borrow money from Hackman when the two men were broke young actors living in Los Angeles. Hackman went to his friend’s apartment and saw on a shelf several jars labeled with various household expenses: rent, electricity, etc. They were all stuffed with cash, except the empty one labeled “food,” which Hoffman wanted money to fill.

This interesting phenomenon is called mental accounting, a tendency people have to separate their money into different accounts based on miscellaneous subjective criteria, including the source of the money and the intended use for each account.

Richard Thaler, the 2017 Nobel laureate in economics, has been studying mental accounting intensively. Thaler points out that mental accounting violates the economist’s basic assumption that money is fungible. While many people (probably all of us including smart economists) use mental accounting in some way, we may not realize that this line of thinking often results in an irrational and detrimental set of behaviors. For example, some people keep a special “money jar” for a vacation or a new home while at the same time carrying substantial credit card debt.

Mental accounting happens in corporate data centers too. When you walk into a data center of a Fortune 500 company, you will find thousands of servers in rows of racks. They all look similar, right? Indeed, most servers run Linux on Intel processors. Just like money, general-purpose computers are fungible by design. If you can run a program on a computer, you should be able to run it on any other computer with the same hardware and software specs.

In reality, however, you cannot run your application on another server because each server is “labeled” based on budget source or intended use (HR, Finance, FICC, Risk, Hadoop, Cassandra, etc.). When you need more computing power, you have to go through a painful procurement process (six months if you are lucky) while many servers in other silos sit idle most of the time. Because machines are allocated into silos, the overall utilization is also terribly low while each silo may suffer from limited resources at peak time. It takes billions of dollars to build a modern data center and hundreds of millions of dollars to operate it. Low utilization simply means that we burn hundreds of millions of dollars every year while people rarely notice it.

Why are enterprise data centers operated in this way? Budgets and service-level agreements (SLAs). Budgets serve as a crude way to keep costs under control while giving employees discretion to spend as they see fit. IT departments typically charge back LOBs monthly based on cluster size because it makes budget planning predictable and also simplifies the internal payment process. Meanwhile, LOBs care about the SLA first and foremost. During the budget planning season, they fight really hard to secure sufficient funds to own clusters large enough to meet the SLA at peak time. To guarantee the SLA, they also rarely want to share their clusters with other LOBs, to avoid any potential interference. As a result, huge data centers are carved into hundreds of clusters (silos) with little resource sharing. Moreover, enterprise applications are rarely busy 24x7. In fact, many internal applications are batch jobs that run on a schedule. Therefore, the overall utilization stays very low. But data centers become bigger and bigger to accommodate more applications. Although budgets and SLAs exist for sensible, understandable reasons, they lead to silly outcomes such as IT infrastructure and costs growing out of control.
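The effect of silos on utilization can be sketched with a few made-up numbers. In the toy example below, the LOB names and hourly demand figures are hypothetical. Each silo must be sized for its own peak, whereas a shared pool only needs to cover the peak of the combined demand, which is smaller whenever the peaks do not coincide.

```python
def capacity_needed(demand_by_lob):
    """Compare servers needed when each LOB owns a silo vs. when all share one pool.

    demand_by_lob maps an LOB name to its server demand per hour (same hours for all).
    Returns (siloed_capacity, pooled_capacity, siloed_utilization, pooled_utilization).
    """
    series = list(demand_by_lob.values())
    # Siloed: every LOB buys enough servers for its own peak hour.
    siloed = sum(max(s) for s in series)
    # Pooled: one shared fleet sized for the peak of the total demand.
    hourly_total = [sum(hour) for hour in zip(*series)]
    pooled = max(hourly_total)
    average = sum(hourly_total) / len(hourly_total)
    return siloed, pooled, average / siloed, average / pooled


if __name__ == "__main__":
    demand = {                      # hypothetical demand over four time slots
        "risk":    [10, 10, 80, 10],
        "finance": [60, 10, 10, 10],
        "hr":      [5, 5, 5, 40],
    }
    siloed, pooled, util_siloed, util_pooled = capacity_needed(demand)
    print(f"siloed capacity: {siloed}, utilization {util_siloed:.0%}")
    print(f"pooled capacity: {pooled}, utilization {util_pooled:.0%}")
```

In this toy case the shared pool needs roughly half the servers for the same workload, which is the waste that cluster-by-cluster budgeting hides.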

Now it is clear why most private cloud projects failed. OpenStack or AzureStack is introduced for additional technical capabilities (e.g. self-service VMs, managed databases, object storage, etc.). They are not used to solve the aforementioned fundamental problems. In fact, they make the situation even a little bit worse as they are deployed as yet another cluster and operate on dedicated budgets.

Since we have found the root cause of the inefficiency of corporate data centers, we now have an opportunity to get private cloud right. But first we need to figure out what exactly cloud computing is. Both Amazon AWS and Microsoft Azure agree that cloud computing is the delivery of computing services (VMs, storage, etc.) over the Internet (“the cloud”). But what they try to sell is neither “the cloud” nor computing power. What they really sell are:

Elasticity: The ability to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible. Remember that AWS EC2 stands for Elastic Compute Cloud.
Agility: Elasticity requires that vast amounts of computing resources (VMs) can be provisioned in minutes. It could be down to seconds or milliseconds if the resource orchestration units are lighter than VMs. Agility is the competitive advantage in a digital age.
Pay-as-you-go pricing: Without usage-based charges, elasticity makes no sense at all. Just like mobile plans, all-you-can-eat monthly plans expect people to use less while usage-based plans expect people to use smartly.

All other selling points of cloud computing are secondary to giant enterprises. For example,

Cost: With cloud, startups can avoid the heavy upfront cost of buying servers. But enterprises avoid the upfront cost by leasing the servers in their data centers. Besides, they have strong bargaining power in procurement negotiations.
Scale: The data centers of Fortune 100 companies are no smaller than those of public clouds. Global enterprises have a global data center footprint too.
Productivity: By leveraging cloud, startups can focus on developing their core applications and minimize the infrastructure team. Enterprise IT departments have strong teams of skilled technologists. The slowness of enterprise IT processes is often due to regulatory and compliance requirements rather than a lack of skills. Moving to cloud can hardly improve productivity without reengineering the IT processes. Regulatory and compliance requirements cannot be relieved anyway.
Security: Security is a must-have, whether on-premises or in the cloud. No excuse.

The business goal of private cloud is to introduce elasticity, agility, and a usage-based payment model. Equipped with our analysis, it is clear that enterprises cannot achieve these goals by deploying some cloud technology without a change of mindset. To build a successful private cloud, enterprises should follow the principles below:

Manage a data center as a whole rather than as 300 individual clusters. Public clouds, as service providers, naturally manage their data centers as a set of fungible computers. Enterprise IT departments are service providers too. However, their SLAs are defined at the application/cluster level with each corresponding LOB. To operate as a private cloud provider, IT departments have to change their mindset and define the SLA at the data center level for the whole firm.
Build a usage-based chargeback system. Elasticity is meaningless without a usage-based pricing system. The cost effectiveness of cloud comes from the behavior changes nudged by the service fee model. Even if a cloud can provide unlimited computing power, no one has an unlimited budget. To reduce the cost, we have to change the way we consume computing power, driven by the usage-based service fee.
Develop a data center operating system. To achieve elasticity and agility, we have to develop an efficient, scalable, and fault-tolerant operating system that manages the data center resources globally. It is aware of the workload and resource demand of each application, tracks the available resources, dynamically allocates/deallocates resources to/from applications in real time, efficiently manages priority and preemption to meet SLAs, provides centralized logging for debugging, troubleshooting, and reporting, etc. It should be able to orchestrate diverse workloads such as HPC, data analytics, batch jobs, web services, long-running services, etc. Although it is very challenging for such a data center operating system (DCOS) to meet the different latency and throughput requirements of various workloads, we can achieve it with the rich research and development results of recent years.
Application-oriented infrastructure. Every company is a software company today. We win the competition with applications, not machines. Containerization enables the data center operations teams to transform from being machine-oriented to being application-oriented. The shift will dramatically improve application deployment and introspection and help enterprises compete at a fast pace.
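The difference between the two charging models in the second principle can be sketched in a few lines. The rate of 2 dollars per server-hour and the demand figures are hypothetical. Under the flat model an LOB pays for its peak-sized cluster around the clock; under the usage-based model it pays only for the server-hours it actually consumes, so shifting or trimming work directly lowers its bill.

```python
def flat_charge(cluster_size, hours, rate_per_server_hour):
    """Chargeback based on cluster size: the LOB pays for every server, busy or idle."""
    return cluster_size * hours * rate_per_server_hour


def usage_charge(servers_used_per_hour, rate_per_server_hour):
    """Usage-based chargeback: the LOB pays only for server-hours actually consumed."""
    return sum(servers_used_per_hour) * rate_per_server_hour


if __name__ == "__main__":
    rate = 2                              # hypothetical dollars per server-hour
    usage = [10, 10, 80, 10]              # hypothetical servers in use per hour
    peak_sized_cluster = max(usage)       # a silo has to be sized for the peak
    print("flat :", flat_charge(peak_sized_cluster, len(usage), rate))
    print("usage:", usage_charge(usage, rate))
```

A real chargeback system would also need to price reserved capacity and priority, but even this simple model makes idle capacity visible to the people who pay for it.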

It is really hard to move legacy mission-critical applications to public cloud. Wait for a slow death? Or fight disruptive innovations now with elastic and agile private clouds? Enterprise decision makers have to ask themselves this question.

This is also an enormous opportunity for solution providers and consulting firms. Fortune 100 companies typically operate three or more data centers. The market for private clouds is way bigger than that of all public clouds combined. It is waiting for a mindset-shifting solution.

http://haifengl.wordpress.com/?p=1188

What are missing in Kubernetes?

Haifeng Li Dec 17, 2018

As Kubernetes becomes the de facto solution for container orchestration, more and more people expect that it will be the orchestrator of data centers. For example, ZDNet predicted Kubernetes will rule the hyperscale data center in 2018. In a little over four years’ time, the project born from Google seems set to change everything. Tracing its roots back to Google Borg, Kubernetes is nicely designed to run web services. As StatefulSets became stable in 1.9, it is also able to manage stateful applications such as databases, message queues, etc. To conquer enterprise data centers, however, there are still several missing pieces.

In the data centers of large corporations (e.g. banks, pharmaceutical and energy companies), there are a variety of workloads such as HPC (high-performance computing), HPA (high-performance analytics), and batch jobs. Compared to them, web services use only a small portion of compute resources. Unfortunately, Kubernetes has been weak at orchestrating these workloads so far.

HPC

There are many kinds of HPC workloads. For simplicity, let’s just consider Monte Carlo simulation here. It is a simple use case but consumes a lot of compute time in many enterprise data centers. A typical Monte Carlo simulation involves millions of tasks with complicated dependencies. The scheduling algorithm is generally task driven. Since each task doesn’t run very long (seconds), low scheduling latency is critical. In contrast, the median k8s pod startup latency on a large cluster could be as long as 25 seconds, of which 80% is spent deploying container images. Although one may argue that a local cache of Docker images should help, quick release of new versions/images is the norm today with agile development. Therefore, hiccups or even choking happen frequently.

Even worse, thousands of machines will simultaneously request the Docker image from the Docker registry server when an HPC job starts. The central registry is not only the bottleneck but also may not survive the heavy volume of requests. A distributed registry solution is a better approach. For example, in NERSC’s Shifter project, Docker images are converted to tgz files and transferred to the Lustre parallel distributed file system.
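To put the 25-second startup latency in perspective, the sketch below computes what fraction of wall-clock time goes to useful work if every short task pays the full pod startup cost. The 5-second task duration is an assumption chosen to represent "seconds"; the 25-second latency and the 80% image-deployment share come from the figures quoted above.

```python
def useful_fraction(task_seconds, startup_seconds):
    """Share of wall-clock time spent on real work when each task pays the startup cost."""
    return task_seconds / (task_seconds + startup_seconds)


if __name__ == "__main__":
    pod_startup = 25.0                     # median pod startup latency quoted above
    image_share = 0.8                      # share of startup spent deploying images
    task = 5.0                             # assumed duration of one Monte Carlo task

    cold = useful_fraction(task, pod_startup)
    # If the image were already on the node, only the remaining 20% would be paid.
    warm = useful_fraction(task, pod_startup * (1 - image_share))
    print(f"cold start : {cold:.0%} useful work")
    print(f"image cached: {warm:.0%} useful work")
```

This is why HPC schedulers typically keep long-lived workers and dispatch tasks to them, rather than starting a fresh container per task.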

HPA

Since Spark 2.3, we can submit Spark jobs to Kubernetes. However, the current integration takes a static resource allocation approach. When submitting a job, the user needs to configure the number of executors, which reserves the resources from Kubernetes for the lifetime of the job. Note that a Spark application often runs several or many Spark jobs, which are decomposed into stages and tasks for scheduling. Each job and stage generally has a different number of tasks and requires a different amount of resources. But the user has to allocate the maximum number of executors up front. The static resource allocation approach will certainly waste a lot of CPU time and RAM.
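The waste from static allocation is easy to quantify in a sketch. The stage profile below (executors needed and duration per stage) is hypothetical. With static allocation the application holds its peak number of executors for its whole lifetime, so every stage that needs fewer executors leaves the rest idle.

```python
def static_allocation_waste(stages):
    """Estimate idle executor time under static allocation.

    stages is a list of (executors_needed, duration_seconds) tuples.
    Returns (reserved_executor_seconds, used_executor_seconds, wasted_fraction).
    """
    peak = max(needed for needed, _ in stages)
    # Static allocation: the peak number of executors is held for every stage.
    reserved = sum(peak * duration for _, duration in stages)
    # What the stages actually need.
    used = sum(needed * duration for needed, duration in stages)
    return reserved, used, 1 - used / reserved


if __name__ == "__main__":
    # Hypothetical application: a wide scan, a long narrow stage, a medium stage.
    stages = [(100, 60), (10, 300), (40, 120)]
    reserved, used, wasted = static_allocation_waste(stages)
    print(f"reserved {reserved} executor-seconds, used {used}, wasted {wasted:.0%}")
```

Dynamic allocation, which releases idle executors and requests new ones per stage, would bring the reserved figure closer to the used one.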

Batch Jobs

Kubernetes’s batch job support is extremely simple, basically run to completion. However, enterprise batch jobs are way more complicated than that. For example, a batch job may execute in parallel across many hundreds or even thousands of nodes using a message passing library to synchronize state. It may also require specialized resources like GPUs or require access to limited software licenses. Organizations may enforce policies around what types of resources can be used by whom to ensure projects are adequately resourced and deadlines are met. Therefore, capabilities like array jobs, configurable priority and preemption, user, group or service based quotas, and a variety of other features are mandatory. There is a SIG project, kube-batch, working on a batch scheduler for Kubernetes. But the road map and expected GA date are not available yet.

http://haifengl.wordpress.com/?p=1184

IBM Is Missing A Disruptive Innovation In Her Own Hand

Haifeng Li Dec 8, 2017

With $80B revenue but a streak of 21 consecutive quarters of declining revenue, IBM has a big appetite for growth. The hunger for growth drives IBM to invest billions in its AI business, namely IBM Watson Group. However, it also drives IBM away from a disruptive innovation even though it is in her own hand. The disruptive innovation in the spotlight is Watson for Oncology.

Watson for Oncology is an AI-driven oncology clinical decision support system. Watson employs Natural Language Processing (NLP) technologies to derive hundreds of attributes from a patient’s electronic health record, including doctors’ notes and lab reports. Then, it provides clinicians with confidence-ranked treatment options and supporting evidence to help them make treatment decisions. As a typical supervised Machine Learning system, Watson for Oncology learns the treatment decisions from trainers, who currently are a couple dozen physicians at New York’s Memorial Sloan Kettering Cancer Center (MSK), one of the world’s leading cancer centers.

The marketing strategy of IBM is to promote Watson as a world-changing technology. At HIMSS17, IBM CEO Ginni Rometty said that when she became CEO in 2012, soon after Watson won Jeopardy!, she decided using Watson to change parts of health care would “be our next moonshot.” In the media, we frequently see articles about Watson with quixotic headlines, e.g. “IBM’s Watson Supercomputer May Soon Be The Best Doctor In The World.”

This means that IBM presents Watson for Oncology as a “sustaining innovation” to satisfy the most demanding and sophisticated customers. In fact, Watson for Oncology is being used by more than 50 hospitals around the world, which are large medical centers and hospitals in general. This is a “rational” decision given IBM’s size and appetite because by charging the highest prices to the most demanding and sophisticated customers at the top of market, IBM would achieve the greatest profitability. For example, MD Anderson, among IBM’s first partners and the early face of Watson in health care, spent more than three years and $60 million — much of it on outside consultants — to create its own expert oncology advisor by using Watson before shelving the effort.

Unfortunately, a STAT investigation has found that Watson for Oncology isn’t living up to the lofty expectations IBM created for it. This indicates the failure of the business strategy that promotes Watson as a sustaining innovation, despite its advanced technology. Like any other supervised machine learning system, Watson learns a complicated function that maps hundreds of attributes to a set of categories (treatment decisions here) from the training data. Therefore, Watson doesn’t have knowledge beyond the training data. As Dr. Sujal Shah, a medical oncologist, said in the STAT investigation, Watson gave him more confidence that using a specific chemotherapy was a sound idea. But the system did not directly help him make that decision, nor did it tell him anything he didn’t already know. In other words, Watson doesn’t provide better solutions than its competitors and existing service providers: human physicians.

Another issue with Watson for Oncology is the one-size-fits-all strategy. So far, Memorial Sloan Kettering is the only trainer of Watson. Although it is one of the world’s most renowned cancer hospitals, Memorial Sloan Kettering’s training (recommended treatment) is not always the best practice in other parts of the world. Obviously, the patient population at any single hospital doesn’t reflect the diversity of people around the world. Beyond science, Watson doesn’t take into account economic and social issues such as medical insurance.

Although Watson for Oncology faces the aforementioned challenges, it may become a promising business if IBM repositions it as a disruptive innovation. Disruptive innovation, a term coined by Clayton Christensen, describes a process by which a product or service takes root initially in simple applications at the bottom of a market and then relentlessly moves up market, eventually displacing established competitors. Characteristics of disruptive businesses, at least in their initial stages, can include: lower gross margins, smaller target markets, and simpler products and services that may not appear as attractive as existing solutions when compared against traditional performance metrics.

To be a disruptive innovation, Watson for Oncology should target the right segment of the market, i.e. the bottom of the market. Or even better, the non-consumers. Although well-trained oncologists don’t need to ask Watson about treatments, there are many small hospitals in rural areas across the world that don’t have any oncology specialists. Patients in those areas either have to travel to big cities for treatment at high expense or are treated by generalists with little cancer training. With Watson, they can benefit from top-level expertise at an affordable cost. Besides making a treatment plan, Watson can also give patients a comprehensive information package including relevant scientific articles. Patients can do their own research about their treatments. In addition, hospitals with few specialists can also benefit from Watson.

To make this new strategy work, IBM needs to create a variety of localized versions of Watson to accommodate the diversity of patients around the world and the associated economic and social issues. This not only provides more personalized service but is also important for making the new business model viable. Since few small institutions can afford expensive consulting fees and integration with electronic medical records, Watson simply needs to cover enough customers that a plain per-patient fee business model works out.

IBM used to be a master of disruptive innovation. When the mini-computer disrupted the mainframe, only IBM made it into the mini-computer business while all other mainframe computer makers were killed. IBM kept making mainframes in Poughkeepsie but went to Rochester, Minnesota to make the minis with a different business model. That is because the gross margins in mainframes were 60 percent but the gross margins in minis were 45 percent. When the personal computer disrupted the mini-computer, IBM went to Florida and set up yet another business model that could make money at 25 percent gross margins.

IBM can be the master of disruptive innovation again with Watson for Oncology. They have to forget the consecutive quarters of declining revenue and turn their attention to what brings customers value rather than just big fat contracts. The sales reps should stop making overreaching claims and instead search for non-consumers. So far Watson focuses only on treatment decisions. But the fight with cancer is a long and painful process. A complete consideration of the whole user experience will bring much more value to patients. Besides the functional part of treatment, the emotional and social parts of the job to be done need to be covered too.

http://haifengl.wordpress.com/?p=1163

What’s Wrong with Bitcoin

Haifeng Li Oct 30, 2017

There are many debates on the technical shortfalls in the design of Bitcoin such as the block size, transaction throughput, etc. But the biggest problem of Bitcoin is its anti-economics mindset.

Wrong Assumption

Although Satoshi Nakamoto never mentioned the financial crisis in the Bitcoin white paper, many people believe that it was the reason (or at least one of the reasons) that drove the mysterious genius to create the cryptocurrency. Some people also believe that cryptocurrencies will be the cure for the next financial crisis. For example, a recent article in IEEE Spectrum starts with:

Bitcoin was hatched as an act of defiance. Unleashed in the wake of the Great Recession, the cryptocurrency was touted by its early champions as an antidote to the inequities and corruption of the traditional financial system. They cherished the belief that as this parallel currency took off, it would compete with and ultimately dismantle the institutions that had brought about the crisis.

To be clear, the financial crisis was not caused by which currency we used, and the next one won’t be prevented by a new currency, whether physical or virtual, crypto or not. As history has shown, the euro brought a lot of benefits to eurozone states but didn’t prevent the European sovereign-debt crisis, even though the member states are meant to meet strict criteria, such as a budget deficit of less than three percent of their GDP, a debt ratio of less than sixty percent of GDP, low inflation, and interest rates close to the EU average.

The problem Satoshi tried to solve is how to make payments over a communication channel without a trusted party:

What is needed is an electronic payment system based on cryptographic proof instead of trust, allowing any two willing parties to transact directly with each other without the need for a trusted third party.

Bitcoin beautifully solves this problem, theoretically. If we lived in a bitcoin-only world, we might achieve this noble goal. The reality is that there are third parties involved in the overall process. For example, we need brokers to exchange Bitcoin for fiat currency and other cryptocurrencies. You have probably heard of Coinbase, Gemini, or BitPagos, right? If we still need trusted third parties in the ecosystem, will Bitcoin ever fully achieve its goal?

Even worse, the security of a system is only as strong as its weakest link. Although Bitcoin itself has been very secure so far, the ecosystem is not. Mt. Gox was a bitcoin exchange handling over 70% of all bitcoin transactions worldwide, making it the largest bitcoin intermediary in 2013. However, Mt. Gox closed its exchange service and filed for bankruptcy protection in 2014 when it discovered that approximately 850,000 bitcoins (valued at $450+MM at the time) belonging to customers and the company were missing. Tokyo security company WizSec concluded that “most or all of the missing bitcoins were stolen straight out of the Mt. Gox hot wallet over time, beginning in late 2011.”

Even if we lived in a Bitcoin Utopia where there is no third party at all in any payment transaction, could we avoid financial crises forever? NO. Getting rid of the third party in payments doesn’t mean the elimination of banks, which Bitcoin lovers hate. Banks are mostly the brokers between capital and borrowers, not just the intermediaries of payments. It is easy for us to point fingers at banks when a crisis happens. Sure, banks played a notorious role in the financial crisis with their risk appetite and innovations in balance sheet “optimization”. But let us not forget the unsustainable debt level we created ourselves. A cryptocurrency or any technical innovation is not the answer if we don’t change our financial behavior.

Decreasing Coin Supply Rate

Bitcoins are created each time a user discovers a new block. The number of bitcoins generated per block is set to decrease geometrically, with a 50% reduction every 210,000 blocks, or approximately four years. The decreasing-supply algorithm of Bitcoin approximates the rate at which commodities like gold are mined (actually Satoshi has never justified or explained many of the constants in the algorithm). But which modern economy links its currency to gold today?

The rate at which a central bank supplies currency is supposed to match the growth in the amount of goods exchanged, so that these goods can be traded at stable prices. The supply algorithm of Bitcoin, fixed in advance, all but guarantees that it will never become the major currency of a stable economy.

Finite Supply

Because of the decreasing supply rate, the number of Bitcoins in existence is not expected to exceed 21 million. Every day we create more wealth, but the cap on Bitcoin is fixed. No wonder the price of Bitcoin has been skyrocketing (let’s forget speculation, a major driver of the rising price, for a second). The CFTC is spot on in treating Bitcoin and other cryptocurrencies as commodities rather than currencies.
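To see where the 21 million cap comes from, here is a minimal sketch of the halving schedule described above. The 210,000-block interval is from the text; the starting reward of 50 bitcoins per block and the 100,000,000 smallest units per bitcoin are not stated in this post and are filled in here as assumptions about the protocol. The function names are made up for illustration.

```python
HALVING_INTERVAL = 210_000           # blocks between halvings (from the text)
UNITS_PER_COIN = 100_000_000         # assumption: smallest units per bitcoin
INITIAL_SUBSIDY = 50 * UNITS_PER_COIN  # assumption: 50 bitcoins for the first blocks


def block_subsidy(height: int) -> int:
    """Reward for the block at the given height, in smallest units."""
    halvings = height // HALVING_INTERVAL
    # Integer halving: each shift cuts the reward by 50%, rounding down.
    return INITIAL_SUBSIDY >> halvings


def total_supply() -> int:
    """Sum the rewards of every halving period until the reward reaches zero."""
    total = 0
    halvings = 0
    while (subsidy := INITIAL_SUBSIDY >> halvings) > 0:
        total += HALVING_INTERVAL * subsidy
        halvings += 1
    return total


if __name__ == "__main__":
    print(total_supply() / UNITS_PER_COIN)  # just under 21 million
```

The sum is a geometric series: 210,000 blocks times 50 coins times (1 + 1/2 + 1/4 + ...) approaches 21 million but never reaches it, because each halving rounds down.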

No Governance

As a peer-to-peer system, Bitcoin has been sold to us as being better off without central governance. But we do need governance of the coin supply to better serve our economy. Given the extreme complexity and volatile dynamics of the economy, any prescribed supply algorithm will fail. One may criticize central banks over imperfect monetary policy. But no governance at all is not a better alternative.

Environmental Cost

Satoshi argues that the transaction cost of traditional electronic payments is high because of the cost of mediation by financial institutions. It is intuitive that we can reduce the cost by removing the middleman. But the truth is that we as a society pay a very high cost for Bitcoin. The community of miners uses vast quantities of electrical power in the process. The current estimated annual electricity consumption is 23.32 terawatt hours. The plain truth is that Bitcoin is not environmentally friendly.

A new cryptocurrency is proposed almost every day to improve some technical aspect of Bitcoin. But none of them has yet attempted to address the aforementioned issues.

http://haifengl.wordpress.com/?p=1168


How to Kill Bad Projects

Haifeng Li Sep 11, 2017

It is an open secret how hard it is to kill projects in development. In the Harvard Business Review article “Why Bad Projects Are So Hard to Kill”, professor Isabelle Royer says that many projects are hard to kill because of a “fervent and widespread belief among managers in the inevitability of their projects’ ultimate success.” The desire to believe in something is primal. The excitement and exuberance associated with a project typically originate with the project champion, whose unyielding conviction that the project will succeed is often based on a hunch rather than on strong evidence. The champion’s exuberance spreads because others also want to believe, especially if the champion is charismatic and well networked within the company.

Even worse, when a project is going monumentally off the rails, people and organizations keep adding resources to the project despite all the evidence of impending disaster. Throwing good money after bad in this way is known as escalation of commitment. Four factors drive this behavior. The first is the sunk cost fallacy: when estimating the value of a future investment, we have trouble ignoring what we have already invested in the past. The second is anticipated regret: we would be sorry if we didn’t give this another chance. The third is project completion: if we keep investing, we can finish the project. The last but probably most powerful factor is ego threat: if we don’t keep investing, we will look and feel like a fool.

The million dollar question is how to find out whether your team has fallen victim to the subtle development of entrapment. Rita Gunther McGrath and Ian C. MacMillan offer a simple way in their “Discovery-Driven Growth” methodology. According to them, each team member should answer the following “yes” or “no” questions, which reflect the four reasons for escalation of commitment:

I feel we will lose the respect of others if this project is shut down; nobody respects a failure.
Giving up now would just be an admission of weakness.
Stopping this project would have a negative effect on my career: bonus, raise, promotion, or position.
Stopping this project would have a negative effect on the rest of the team’s careers: bonus, raise, promotion, or position.
We made a public commitment to this project.
It will destroy our record of past success.
We have had some good results — it would be premature to stop the project now.
There will be a big payoff if we succeed in the end.
We’re nearly at a turning point; it would be a shame to stop now, when we are so close.
We have already spent a lot of time and money, which would be wasted if we stopped now.
It would cost us more to stop now than it would to finish the project.
We won’t get anything back if we close the project now.
Our part of the business is counting on us to succeed.
People who want us to fail (rivals, enemies, competitors) will gloat.
A lot of people are depending on us to succeed here.
A lot of people left steady, secure positions to join this project.
We’ve made commitments to outside parties that depend on the success of the project: investors, suppliers, distributors, customers.
We’ve made commitments to inside parties to continue with the project: the board, top management, other divisions, employees.
The firm’s reputation with banks and investment analysts has been staked on the success of this project.
The firm’s reputation with regional, national or foreign government officials has been staked on the success of this project.

If a third or more of your answers are “yes”, your team is at risk of escalated commitment. Each of these questions reflects a reason why people have consciously or unconsciously continued committing their talent and resources to projects that reasonably should have been shut down. If these subtle pressures are overcoming the team’s better judgment, we have to make a tough call.
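The scoring rule is simple enough to write down. The following is only a sketch of the “a third or more” threshold as I read it, applied to one person’s answers; the function name and the list-of-booleans representation are my own choices for illustration, not part of the Discovery-Driven Growth methodology.

```python
def at_risk_of_escalation(answers: list[bool]) -> bool:
    """Return True when a third or more of the answers are "yes".

    `answers` holds one boolean per question (True means "yes").
    """
    if not answers:
        raise ValueError("at least one answer is required")
    yes_count = sum(answers)
    # Compare with integers to avoid rounding: yes/total >= 1/3.
    return 3 * yes_count >= len(answers)


if __name__ == "__main__":
    # 20 questions, 7 answered "yes": 7/20 is above one third.
    sample = [True] * 7 + [False] * 13
    print(at_risk_of_escalation(sample))
```

With the 20 questions above, seven “yes” answers cross the threshold and six do not.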

http://haifengl.wordpress.com/?p=1159


The Future Business Model of Payroll

Haifeng Li May 12, 2017

                  ADP               Paypal
Money Movement    $1.7 trillion     $354 billion
Revenue           $12.21 billion    $11.27 billion
Profit            $1.75 billion     $1.42 billion
Market Cap        $43.3 billion     $59.7 billion

ADP revenue includes full HCM services besides payroll.

Notice something here? ADP moves a lot more money than Paypal, but makes less revenue on money movement (once the revenue from other HCM services is taken out). It has a smaller market cap too. Why? Well, ADP is in the solution shop and value-adding process businesses, while Paypal is a facilitated network.
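A quick back-of-the-envelope calculation makes the gap concrete. The sketch below divides revenue by money moved, using only the figures quoted above; the function name is mine. Keep in mind that the ADP revenue includes other HCM services, so its ratio is an upper bound on what ADP earns per dollar of payroll moved.

```python
def revenue_per_dollar_moved(revenue: float, money_moved: float) -> float:
    """Revenue earned for each dollar of money movement."""
    return revenue / money_moved


# Figures from the comparison above, in US dollars.
companies = {
    "ADP": {"money_moved": 1.7e12, "revenue": 12.21e9},
    "Paypal": {"money_moved": 354e9, "revenue": 11.27e9},
}

if __name__ == "__main__":
    for name, figures in companies.items():
        rate = revenue_per_dollar_moved(figures["revenue"], figures["money_moved"])
        print(f"{name}: {rate:.2%} of money moved")
```

ADP comes out at roughly 0.7% and Paypal at roughly 3.2%, so Paypal earns more than four times as much per dollar moved.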

There are three general types of business models: solution shops, value-adding process businesses, and facilitated networks. Solution shops are institutions structured to diagnose and recommend solutions to unstructured problems. Almost always, solution shops charge their clients on a fee-for-service basis. Value-adding process businesses transform inputs of resources into outputs of higher value. Because value-adding process organizations tend to do their work in repetitive ways, the capability to deliver value tends to be embedded in processes and equipment. Facilitated networks operate systems in which customers buy and sell, and deliver and receive things from other participants. Much of consumer banking is a network business in which customers make deposits into and withdrawals from a collective pool.

When onboarding clients, ADP operates as a solution shop, where a large team executes a time-consuming and highly customized process for each major client. Once a client is on board, ADP performs the repetitive payroll service with computers in every pay cycle. In return, clients pay ADP service fees. No matter how big or small the paycheck is, for the CEO or for the average Joe, the service fee is the same. In contrast, Paypal is a facilitated network, and its service fee is proportional to the transaction size, just like credit card services.

Given its dominance in the payroll business, it is very challenging for ADP to achieve high growth in this area by grabbing more market share. But the above analysis suggests that high growth is still possible by changing the business model. That is, ADP should become a facilitated network, more specifically a bank!

It sounds ridiculous, but ADP has a unique advantage that could make it a great bank: it can manage risk well. The open secret is its massive payroll and HR data. By knowing incomes in advance, work history, performance metrics, time management data, etc., ADP can reduce the risk a lot with data science. More good news is that there is a huge market. Many households try to make a go of it week to week, paycheck to paycheck, expense to expense. In fact, 63% of Americans don’t have enough savings to cover a $500 emergency. Often they have to pay a very high interest rate to meet a small financial need. With good risk management based on its data, ADP can potentially help them with a much lower rate. It is a win-win for everyone.

To learn more about how this works, please check out Payroll: An Overlooked Area in Fintech.

http://haifengl.wordpress.com/?p=1145
