Hackerearth — GeistHaus

Computing accurate skill percentile with DDSketch

Sep 17, 2023

Introduction

HackerEarth has lots of candidates getting evaluated on a daily basis. We have a feature that benchmarks candidates across the platform. Benchmarking is the process of creating the profile of the ideal candidate for a position, and then measuring all candidates against that profile. To benchmark candidate skills against our millions of candidates, we decided to move away from our regular cron solution to build a more reliable and accurate data pipeline. To support this, we created a new data ingestion flow and data read flow. We moved away from our deterministic algorithms to probabilistic algorithms with DDSketch.

Problem

Our old benchmarking solution computed a candidate’s global benchmark on the fly: it calculated the solve percentage for each individual skill and returned the geometric mean of all the skill benchmarks. We handle huge volumes of data every day, and analyzing this data directly, for example to calculate a quantile, was not optimal in terms of resources.

Solution

We came up with a solution that computes an approximate quantile from a compressed representation of that data. We first need to appropriately summarize that data without incurring an excessive loss of fidelity. We do this by creating a sketch. Sketch algorithms generate sketches: smaller, more manageable data structures, from which we can calculate some properties of the original data.
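To make the idea of a sketch concrete, here is a minimal toy version of a relative-error quantile sketch in plain Python. It is only an illustration of the logarithmic-bucket idea behind sketches such as DDSketch, not the library's implementation; the class name `ToyLogSketch`, its methods, and the sample scores are all assumptions made for this example.

```python
import math
from collections import Counter


class ToyLogSketch:
    """Toy quantile sketch with logarithmic buckets (positive values only)."""

    def __init__(self, relative_accuracy=0.01):
        self.alpha = relative_accuracy
        # With this growth factor, every bucket has a representative value
        # that is within `alpha` (relative) of any value in the bucket.
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.buckets = Counter()  # bucket index -> number of values
        self.count = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this toy sketch only handles positive values")
        # Values in (gamma**(i-1), gamma**i] land in bucket i.
        index = math.ceil(math.log(value, self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q):
        if self.count == 0:
            raise ValueError("empty sketch")
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Representative value of bucket `index`.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise AssertionError("unreachable")


scores = [float(s) for s in range(1, 101)]  # made-up scores 1..100
sketch = ToyLogSketch(relative_accuracy=0.01)
for score in scores:
    sketch.add(score)

estimate = sketch.quantile(0.9)
exact = sorted(scores)[int(0.9 * (len(scores) - 1))]
print(f"exact p90={exact}, estimated p90={estimate:.2f}")
```

The sketch stores only bucket counts, so its size depends on the range of the values rather than on how many values were added, and the estimate stays within the configured relative accuracy of the exact answer.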

We considered various algorithms to accurately compute percentiles on the noisy, large-scale, real-time data that we receive from candidates’ skill scores, and shortlisted TDigest and DDSketch. We did a POC and compared the accuracy of both algorithms, as shown below, to finalize one. For our use case, DDSketch served the purpose.

POC results and observations

We compared the actual percentile thresholds with the estimates from the two probabilistic approaches (DDSketch and TDigest), and these were the results. Note: for the POC, we ran these tests on random data samples.

Initial sample size: 1000(unbiased)
New sample size:  2% of actual samples i.e., 20

actual_percentile_thresholds:  
{'p99': 99.1, 'p97': 97.56, 'p95': 96.31, 'p90': 91.37, 'p85': 80.44, 'p80': 80.44, 'p75': 75.79, 'p70': 71.27}
percentile_thresholds_ddsketch:  
{'p99': 98.5, 'p97': 98.5, 'p95': 96.55, 'p90': 90.93, 'p85': 80.65, 'p80': 80.65, 'p75': 75.95, 'p70': 71.53}


Deviation for ddsketch:
{'p99': 0.61, 'p97': -0.96, 'p95': -0.25, 'p90': 0.48, 'p85': -0.26, 'p80': -0.26, 'p75': -0.21, 'p70': -0.36}
Deviation for tdigest:
{'p99': -0.08, 'p97': -0.03, 'p95': 0.0, 'p90': -0.02, 'p85': -0.01, 'p80': -0.01, 'p75': 0.0, 'p70': -0.21}

 --------------------------------------------------------------------------------
Initial sample size: 500(unbiased)
New sample size:  2% of actual samples i.e., 10
actual_percentile_thresholds:  
{'p99': 99.18, 'p97': 96.88, 'p95': 94.98, 'p90': 90.79, 'p85': 80.95, 'p80': 80.95, 'p75': 75.65, 'p70': 70.55}
percentile_thresholds_ddsketch:  
{'p99': 98.5, 'p97': 96.55, 'p95': 94.64, 'p90': 90.93, 'p85': 80.65, 'p80': 80.65, 'p75': 75.95, 'p70': 70.11}
percentile_thresholds_tdigest:  
{'p99': 99.33, 'p97': 97.0, 'p95': 94.99, 'p90': 90.84, 'p85': 80.97, 'p80': 80.97, 'p75': 75.83, 'p70': 70.58}


Deviation for ddsketch:
{'p99': 0.69, 'p97': 0.34, 'p95': 0.36, 'p90': -0.15, 'p85': 0.37, 'p80': 0.37, 'p75': -0.4, 'p70': 0.62}
Deviation for tdigest:
{'p99': -0.15, 'p97': -0.12, 'p95': -0.01, 'p90': -0.06, 'p85': -0.02, 'p80': -0.02, 'p75': -0.24, 'p70': -0.04}

--------------------------------------------------------------------------------
Initial sample size: 5000(unbiased)
actual_percentile_thresholds:  
{'p99': 99.23, 'p97': 97.25, 'p95': 95.25, 'p90': 90.24, 'p85': 81.05, 'p80': 81.05, 'p75': 76.16, 'p70': 71.07}
percentile_thresholds_ddsketch:  
{'p99': 98.5, 'p97': 96.55, 'p95': 94.64, 'p90': 90.93, 'p85': 80.65, 'p80': 80.65, 'p75': 75.95, 'p70': 71.53}
percentile_thresholds_tdigest:  
{'p99': 99.23, 'p97': 97.25, 'p95': 95.23, 'p90': 90.24, 'p85': 81.08, 'p80': 81.08, 'p75': 76.1, 'p70': 71.08}


Deviation for ddsketch:
{'p99': 0.74, 'p97': 0.72, 'p95': 0.64, 'p90': -0.76, 'p85': 0.49, 'p80': 0.49, 'p75': 0.28, 'p70': -0.65}
Deviation for tdigest:
{'p99': 0.0, 'p97': 0.0, 'p95': 0.02, 'p90': 0.0, 'p85': -0.04, 'p80': -0.04, 'p75': 0.08, 'p70': -0.01}

From the above calculations, we can deduce that the thresholds from TDigest deviate very little from the actual percentile thresholds (approx. < 0.2%), whereas the thresholds from DDSketch deviate by approx. < 0.6%, which is still very close to the actual percentiles.

We also compared the time and space costs of both algorithms. Below are the observations.

Sample size: 1000
Time taken to add samples to sketch(DDSketch): 4.124 ms
Time taken to add samples to tdigest: 76.15 ms

sample size: 10000
Time taken to add samples to DDSketch: 27.54 ms
Time taken to add samples to TDigest: 658.18 ms

sample size: 100000
Time taken to add samples to DDSketch: 298.89 ms
Time taken to add samples to TDigest:  7243 ms -> 7.243 sec

Based on the above measurements, we can see that for the same sample size (100000), DDSketch takes 298.89 ms to build the sketch with a deviation of < 0.6% from the actual percentile, whereas TDigest takes 7.243 sec with a deviation of 0.2%.

Serialized object size comparison: since the sketch or digest objects are stored as serialized files, we also measured the size of the serialized objects.

Sample size: 1000
Size of serialized DDSketch object: 4127 bytes
Size of serialized TDigest: 10015 bytes


sample size: 10000
Size of serialized DDSketch object: 5138 bytes
Size of serialized TDigest: 17496 bytes

sample size: 100000
Size of serialized DDSketch object: 6224 bytes, at relative accuracy(0.01)
Size of serialized TDigest: 22049 bytes, at relative accuracy(0.01)

POC Conclusion

Based on the above calculations, we can conclude that TDigest deviates less from the actual percentiles than DDSketch, but consumes more memory and time. In our case, we can afford a deviation close to 1%, whereas time and memory play an important role in computing sketches quickly.

Hence, we went with the DDSketch algorithm which takes a nominal time and memory for creating sketches.

Data Ingestion Pipeline (Improved Architecture)

Now that we know we need sketches, we have to keep them up to date with the new data points that regularly come in from millions of candidates taking tests on our platform. We needed a data ingestion pipeline to update these sketches in near real time.

We built a data pipeline to update the sketch. Individual candidate skills and scores are stored in DynamoDB. The end of a participation triggers the flow of data from DynamoDB into the map-reduce flow; the candidate’s skill data is consumed by the reducer SQS queue. The reducer Lambda takes data in batches of 10000 or at 5-minute intervals and reduces it to skill-wise scores. These messages are then consumed by the SQS FIFO queue, which groups the data by problem template and skill. This data is in turn consumed by the sketch update Lambda, which generates the new sketch, merges it with the old sketch, and then calculates the percentile thresholds. The output is consumed by an SQS queue, which updates the data in the SQL table.
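The sketch update step (build a sketch for the new batch, merge it into the stored one, then recompute thresholds) can be illustrated with a small toy model. This is a sketch under stated assumptions rather than our Lambda code: the bucket layout, the helper names `build_sketch`, `merge_sketches`, and `percentile_thresholds`, and the sample scores are all hypothetical.

```python
import math
from collections import Counter

GAMMA = 1.01 / 0.99  # bucket growth factor for ~1% relative accuracy (assumed)


def build_sketch(scores):
    """Summarize positive scores as {bucket index: count}."""
    return Counter(math.ceil(math.log(s, GAMMA)) for s in scores)


def merge_sketches(old, new):
    """Merging two sketches is just adding their per-bucket counts."""
    return old + new


def percentile_thresholds(sketch, percentiles=(70, 75, 80, 85, 90, 95, 97, 99)):
    """Walk the buckets in order and read off one threshold per percentile."""
    total = sum(sketch.values())
    ordered = sorted(sketch)
    thresholds = {}
    for p in percentiles:
        rank = p / 100 * (total - 1)
        seen = 0
        for index in ordered:
            seen += sketch[index]
            if seen > rank:
                thresholds[f"p{p}"] = round(2 * GAMMA ** index / (GAMMA + 1), 2)
                break
    return thresholds


stored = build_sketch([35.0, 48.5, 62.0, 71.0])  # stands in for the old sketch
incoming = build_sketch([55.0, 80.0, 91.5])      # sketch of the new batch
merged = merge_sketches(stored, incoming)
thresholds = percentile_thresholds(merged)
print(thresholds)
```

Because merging only adds counts, the merged sketch is identical to one built from all the scores at once, which is what lets the pipeline update thresholds without re-reading historical data.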

Model for Storing the Percentile Threshold:


class ProblemsPercentileThreshold(Base, Generic):
  """
  Model to store the percentile thresholds of a problem.
  """
  percentile_thresholds = JsonField()
  denominator = models.IntegerField()
  sketch_file = models.FileField() 
  users_attempted = models.IntegerField() # approach 4
  last_updated_timestamp = models.DateTimeField()

Script to do the initial precomputation:

import json

# DDSketch
from ddsketch import DDSketch
from ddsketch.pb.ddsketch_pb2 import DDSketch as DDSketch_PB
from ddsketch.pb.proto import DDSketchProto

# normalized_scores: iterable of numeric scores prepared beforehand
sketch = DDSketch()
for score in normalized_scores:
    sketch.add(score)

# to serialize to string and store sketch
protobuf_obj = DDSketchProto.to_proto(sketch)
data = protobuf_obj.SerializeToString()

# to deserialize back to obj
protobuf_obj = DDSketch_PB()
protobuf_obj.ParseFromString(data)
sketch = DDSketchProto.from_proto(protobuf_obj)
###############################################################################################
# T-digest
from tdigest import TDigest

digest = TDigest()
digest.batch_update(normalized_scores)

# to serialize to json
data = json.dumps(digest.to_dict())

# to deserialize
digest_dict = json.loads(data)
digest = TDigest()
digest.update_from_dict(digest_dict)

Conclusion

We built global_benchmarking to reliably and effectively run resource-intensive and time-intensive percentile calculations asynchronously in the background. It is now responsible for running the asynchronous flows that support benchmarking analysis on 1 million or more candidates. These insights help recruiters make the best decisions and help candidates improve their skill set.

Posted by [Raunak choudhary] (https://www.linkedin.com/in/raunak-chowdhary-b49406b1)

http://engineering.hackerearth.com/2023/09/17/building-a-relaible-global-benchmarking-platform

Logging millions of requests reliably with our new data ingestion pipeline

Jul 1, 2022

Introduction

HackerEarth handles millions of requests every day. To understand the user access patterns, get the usage of any particular feature or a page, figure out the daily active users or users who have been active for the past 6 months, etc in near real time, it is important to stream that data from across different services and ingest it to the analytics pipeline reliably.

Problem

Our old request logging architecture was complex and had many moving components. Maintaining and scaling each of those components independently, and ensuring that all the self-hosted components were up and running all the time, involved a lot of operational overhead.

Architecture

Solution

Last year, we revamped the way we log our web requests. It was done mainly to increase the reliability in logging the HTTP/HTTPS request data from across web services and also to reduce the operational overheads and the infrastructure cost associated with it. The new flow is making use of Kinesis Firehose data streams to deliver the request data from our web servers to Redshift (the database that we use to log and query request data) reliably with much lower cost. Amazon Kinesis Firehose is a fully managed service that automatically scales to match the throughput of our incoming request log data and requires no ongoing administration. It also allows us to compress and encrypt the data before loading it, minimizing the amount of storage used at the destination with increased security and we only need to pay for the amount of data we transmit through the service.

The new flow is a fully managed solution with almost no operational overhead. We tried to keep the flow simple and straightforward, with fewer moving components. The request data is now guaranteed to appear in the Redshift table within 10-15 minutes of the request reaching our web server. Retries are configured between consecutive infra components in the new flow to make sure no messages are dropped in case of intermittent component failures or unavailability.

Architecture

Improvements

This pipeline worked flawlessly for more than a year, logging a lot of request data to our Redshift servers. The size of our Redshift cluster grew from a single node to a lot of nodes within a short period of time. This had a direct impact on our monthly AWS bill. We were forced to reduce the retention period of our request log data to keep the costs under control, and decided to keep only the last 12 months of data at any given point. We started vacuuming the older data from Redshift periodically and were able to reduce the cluster size to around 2 to 3 nodes. However, over the next year, the business grew significantly, and so did the number of requests and the size of the data. In just 10 to 12 months, we ingested almost the same amount of data that we used to ingest in 1.5 - 2 years. That’s a lot of data to handle. Our Redshift cluster grew to almost double its size again. Running that big a Redshift cluster just to hold the request log data seemed unsustainable from both the maintenance and the overall cost perspective. On top of that, the reduced retention period meant that the amount of request data available to our data analysts had gone down significantly, from almost 3 years to just 1 year. Moreover, we were storing this request data in 2 different places, Redshift and S3, with S3 acting as a backup.

Architecture

The data is getting segregated in the following way while the Firehose stream is writing it to S3. The S3 key prefix for each file would be in the format logs/yyyy/MM/dd/HH.

Following are the list of gzip compressed files that are getting stored every hour.

That’s when we started exploring alternate approaches to reduce the overall storage and compute costs while maximizing the data retention period. We wanted to see if there is a way to directly use the data in the S3 storage for our queries without having to load it in another on-demand datastore. And that’s exactly what AWS Athena does. It is an interactive query service that makes it easy to query and analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and we pay only for the queries that we run. Athena has the following functionalities:

Scales automatically
Running queries in parallel so results are fast, even with large datasets and complex queries

We quickly did a POC around this and pointed Athena to read data from the S3 bucket where we log our request data. It worked well, but we observed that some queries took too long to process. We figured out that this was happening mainly because Athena was looking into all the files we have and scanning the entire data set every time we ran a query. We had lots of gzip files holding the request data for the past 3 years, and the file count and the data size are only going to increase with time. This became a serious problem, as each query was going to scan TBs of data and Athena charges $5 per TB of data scanned!
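A quick back-of-the-envelope calculation shows why the amount of data scanned matters at that price. The data volumes below are made-up numbers for illustration, and treating 1 TB as 1024^4 bytes is an assumption of this sketch.

```python
PRICE_PER_TB_USD = 5.0  # per-TB price quoted above
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabytes


def scan_cost_usd(bytes_scanned):
    """Approximate cost of a single query given the bytes it scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical: 3 TB of logs covering 3 years, spread evenly over the days.
total_bytes = 3 * BYTES_PER_TB
full_scan = scan_cost_usd(total_bytes)
one_day = scan_cost_usd(total_bytes / (3 * 365))

print(f"full scan: ${full_scan:.2f}, one day only: ${one_day:.4f}")
```

Under these assumptions, a query that only needs one day of data costs about a thousand times less when it can avoid scanning everything else.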

After a quick search for ways to address this problem, we realized that Athena supports data partitioning. By partitioning the data, we can restrict the amount of data scanned by each query, thus improving performance and reducing cost. Athena recently added a partition projection feature that allows us to specify configuration information such as the patterns (for example, YYYY/MM/DD) that are commonly used to form partitions. This gives Athena the information necessary to build partitions without retrieving such metadata from a remote metadata store. This reduces the runtime of queries against highly partitioned tables like ours, since in-memory operations are often faster than remote operations.

Here is the sample SQL definition of the request log table that we have created in Athena.

CREATE EXTERNAL TABLE IF NOT EXISTS requestlog(
  accept_encoding STRING,
  http_accept_language STRING,
  http_host STRING,
  http_referer STRING,
  http_user_agent STRING,
  http_x_forwarded_for STRING,
  internal_reference BOOLEAN,
  is_ajax BOOLEAN,
  meta_queue_client STRING,
  path_info STRING,
  query_string STRING,
  remote_addr STRING,
  remote_host STRING,
  request_method STRING,
  request_scheme STRING,
  request_timestamp TIMESTAMP,
  request_uri STRING,
  server_addr STRING,
  server_name STRING,
  server_port STRING,
  site_hostname STRING,
  tracking_session_id STRING,
  user_id INT,
  session_id STRING,
  client_id STRING,
  landing_page STRING,
  timezone STRING)
PARTITIONED BY (datehour STRING)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://mcs-request-logs/logs/'
TBLPROPERTIES
(
 'projection.enabled' = 'true',
 'projection.datehour.type' = 'date',
 'projection.datehour.range' = '2019/01/01/01,NOW',
 'projection.datehour.format' = 'yyyy/MM/dd/HH',
 'projection.datehour.interval' = '1',
 'projection.datehour.interval.unit' = 'HOURS',
 'storage.location.template' = 's3://mcs-request-logs/logs/${datehour}',
 'has_encrypted_data' = 'true'
)

Here, you can see that the minimum partition granularity we configured is 1 hour (refer to ‘projection.datehour.interval’ property) and the partition format (yyyy/MM/dd/HH) is the same as the directory structure in S3. And to take advantage of the data partitions, we need to use the ‘datehour’ field (dynamically projected column) in the WHERE clause of all queries we make. This will help us in significantly reducing the amount of data we are scanning per query. Athena supports a variety of compression formats including gzip (.gz) for reading data from S3. This enables us to store the request log data in a compressed format in S3 and thus minimizing the overall amount of data we are storing there.
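As an illustration of filtering on the projected column, the following minimal sketch builds a query that restricts the scan to a time range. The helper names and the example query are assumptions for this illustration; only the table name, the `path_info` column, and the `datehour` format come from the table definition above.

```python
from datetime import datetime, timedelta


def datehour(ts):
    """Format a timestamp the way the partitions are laid out (yyyy/MM/dd/HH)."""
    return ts.strftime("%Y/%m/%d/%H")


def build_path_count_query(start, end):
    """Count requests per path for [start, end), touching only those hours."""
    last_hour = end - timedelta(hours=1)
    return (
        "SELECT path_info, COUNT(*) AS hits "
        "FROM requestlog "
        f"WHERE datehour BETWEEN '{datehour(start)}' AND '{datehour(last_hour)}' "
        "GROUP BY path_info"
    )


query = build_path_count_query(datetime(2022, 6, 1, 0), datetime(2022, 6, 2, 0))
print(query)
```

Because the partition values are zero-padded, comparing them as strings orders them chronologically, so a range filter like this one selects exactly the hourly prefixes for that day.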

In Athena, we can enforce cost controls by configuring data usage limits that apply to all queries running in an Athena workgroup. A workgroup is a logical grouping inside Athena that can be used to separate query execution and query history between users, teams, or applications running under the same AWS account. In a workgroup, we can set a limit on the amount of data scanned per-query, enforced on a running query. If a query crosses the configured threshold, Athena cancels the query. A workgroup allows us to set thresholds on the amount of data scanned on an hourly, or on a daily basis as well. This gave us full control over the maximum cost that we can incur per query.
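A per-query scan limit can be set when creating a workgroup through boto3. The snippet below is a minimal sketch, not our actual setup: the function names, the workgroup name, the bucket, and the 100 GB limit are placeholders chosen for illustration, and the hourly or daily thresholds mentioned above are not covered here.

```python
def workgroup_configuration(max_bytes_per_query, output_location):
    """Build the Configuration argument for Athena's create_work_group call."""
    return {
        "ResultConfiguration": {"OutputLocation": output_location},
        # Make the workgroup settings override client-side settings.
        "EnforceWorkGroupConfiguration": True,
        # Athena cancels any query that scans more than this many bytes.
        "BytesScannedCutoffPerQuery": max_bytes_per_query,
    }


def create_limited_workgroup(name, max_bytes_per_query, output_location):
    """Create the workgroup; requires AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the helper above works without AWS access

    athena = boto3.client("athena")
    athena.create_work_group(
        Name=name,
        Configuration=workgroup_configuration(max_bytes_per_query, output_location),
    )


# Placeholder values: a 100 GB per-query cap and a hypothetical results bucket.
config = workgroup_configuration(100 * 1024 ** 3, "s3://example-bucket/athena-results/")
print(config)
```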

Athena automatically stores query results and metadata information for each query that runs in S3 and keeps the query history for 45 days. It means during that 45 day period, if you try re-executing the query that was already executed, Athena will return the query result directly from S3, without actually executing it again. This helps us in improving the query performance and significantly reducing the amount of data scanned, especially when there are queries which run periodically (daily or weekly) with a good amount of intersection between the datehour ranges of consecutive queries, which are broken into smaller batches.

Implementation

Python client

Here is the sample implementation of the Python client, on top of Athena, that we use to query our request log data.

import hashlib
import logging
import time

import boto3

logger = logging.getLogger(__name__)


class RequestLogClient:
    """
    Implements a client on top of Athena for querying request log data stored in S3
    """

    MAX_RETRY_INTERVAL = 10  # In seconds
    MAX_QUERY_WAIT_TIME = 30 * 60  # In seconds

    _database_name = "<athena database>"
    _workgroup = "<athena workgroup>"
    _athena_client = boto3.client("athena")

    @staticmethod
    def _get_query_hash(query):
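        # hashlib needs bytes; the 32-character hex digest is used as the
        # idempotency token (ClientRequestToken) for the query
        return hashlib.md5(query.encode("utf-8")).hexdigest()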

    @classmethod
    def _start_query_execution(cls, query):
        response = cls._athena_client.start_query_execution(
            QueryString=query,
            ClientRequestToken=cls._get_query_hash(query),
            QueryExecutionContext={
                'Database': cls._database_name
            },
            WorkGroup=cls._workgroup
        )
        logger.info("Query Execution Id: {}".format(
            response["QueryExecutionId"]))
        return response["QueryExecutionId"]

    @classmethod
    def _get_query_status(cls, query_execution_id):
        exec_time = None
        state_change_reason = None

        query_status = cls._athena_client.get_query_execution(
            QueryExecutionId=query_execution_id)
        query_exec = query_status["QueryExecution"]
        query_state = query_exec["Status"]["State"]

        if "StateChangeReason" in query_exec["Status"]:
            state_change_reason = query_exec["Status"]["StateChangeReason"]

        if "EngineExecutionTimeInMillis" in query_exec.get("Statistics", {}):
            exec_time = query_exec["Statistics"]["EngineExecutionTimeInMillis"]

        return (query_state, state_change_reason, exec_time)

    @classmethod
    def _wait_for_result(cls, query_execution_id):
        total_wait_time = 0
        retry_interval = 1  # In seconds
        while total_wait_time <= cls.MAX_QUERY_WAIT_TIME:
            query_state, state_change_reason, exec_time = cls._get_query_status(
                query_execution_id)
            if query_state in ["SUCCEEDED"]:
                logger.info("Execution time: {} ms".format(exec_time))
                return None
            elif query_state in ["CANCELLED", "FAILED"]:
                error_string = "Query Execution Failed, Id: {}".format(
                    query_execution_id
                )
                raise RuntimeError(error_string)
            else:  # Either Queued or Running
                pass

            retry_interval = min((retry_interval * 2), cls.MAX_RETRY_INTERVAL)
            time.sleep(retry_interval)
            total_wait_time += retry_interval

        raise TimeoutError("The Athena is taking too long to process the query.")

    @staticmethod
    def _get_row_values(row):
        # NULL cells come back without a 'VarCharValue' key, so use .get()
        return [item.get('VarCharValue') for item in row['Data']]

    @classmethod
    def _get_result(cls, query_execution_id):
        result = []
        next_token = None
        while next_token != "END":
            kwargs = {
                "QueryExecutionId": query_execution_id,
                "MaxResults": 1000
            }
            if next_token:
                kwargs.update({
                    "NextToken": next_token
                })
            response = cls._athena_client.get_query_results(**kwargs)
            next_token = response.get("NextToken", "END")
            result_data = response["ResultSet"]
            rows = result_data["Rows"]
            result_batch = [cls._get_row_values(row) for row in rows]
            result.extend(result_batch)
        return result[1:]

    @classmethod
    def execute_query(cls, query):
        query_execution_id = cls._start_query_execution(query)
        cls._wait_for_result(query_execution_id)
        result = cls._get_result(query_execution_id)
        return result

Conclusion

It is important to build a reliable, cost effective, and a fully-managed data ingestion pipeline because collecting data, analyzing it to get useful insights, and using those insights to drive the product growth, is crucial for any company to make informed decisions. At HackerEarth, we extract a lot of such insights from different kinds of data points, on a day-to-day basis. Request log is one such data source which we use to figure out a lot of different things, including common access patterns, user behavior, etc. If you are interested in working on projects like this and helping recruiters find the right talent they need, HackerEarth is hiring!

Posted by Jagannadh Vangala

http://engineering.hackerearth.com/2022/07/01/logging-millions-of-requests-reliably-with-our-new-data-ingestion-pipeline

Building a reliable asynchronous job pipeline

Jun 17, 2022

Asynchronous background jobs can dramatically improve the performance and scalability of web applications by offloading resource-intensive and time consuming processing from the request-response cycle of an application.

Last year, in an effort to make our asynchronous flows more reliable, secure, and scalable, we decided to move away from our self-hosted solution that was based on RabbitMQ and Kafka, to a fully-managed one. This was done mainly to reduce the operational overheads in managing and scaling the underlying infrastructure and also to improve our overall security posture.

To support this flow, we created a new library called he-messenger that implements an end-to-end queuing solution for allowing different services or different components of the same service to communicate with each other asynchronously. This library is built on top of the SQS, SNS, and S3 - managed services provided by AWS. It is a fully serverless solution that ingests events from different services, buffers them, and then delivers those events to the subscribed services in a reliable way. It simplifies the otherwise laborious process of provisioning and scaling self-hosted infrastructure.

Since its introduction, this library has become one of the critical pieces of our architecture, powering a lot of different use cases with a very high number of transactions.

There are many open-source libraries available in the market that use AWS managed services to support asynchronous background jobs. However, none of them offered an end-to-end solution and the kind of guarantees we needed. Therefore, we decided to implement our own custom library to support this flow.

The purpose of this blog is to give you an overview of the internals and the code-level details of the he-messenger library, along with a sample reference architecture, supported flows, features, and benefits of this solution.

Terminology

Channel: An abstract communication layer responsible for passing messages between producers and consumers. A channel abstracts out all the underlying implementation details around the SNS, SQS and S3 services that we use for different purposes internally. The name of a channel can be either of the following based on the type of consumer:
- Single consumer case
  - In this case, the channel consists of an SQS queue only.
  - Format: <queue_name>
- Multi-consumer case
  - In this case, the channel consists of both the SNS topic and the SQS queue subscribed to that topic.
  - Format: <exchange_name>.<queue_name>
Producer: Implements the functionality required to push a message to a channel
Consumer: Implements the functionality required to listen, consume, and process a message that was pushed to a channel

Key concepts

1. One-to-one flow

2. One-to-many flow

The main difference between the one-to-one and one-to-many flows is the type of resource to which the producer pushes a message. In the one-to-one flow, the producer pushes a message directly to an SQS queue, whereas in the one-to-many flow, the message is first pushed to an SNS topic and from there it is broadcast to all the subscribed SQS queues.
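The difference between the two flows can be modelled with a few lines of in-memory Python. This is only a toy model with hypothetical `Queue` and `Topic` classes standing in for SQS and SNS; it does not reflect the he-messenger API.

```python
from collections import deque


class Queue:
    """In-memory stand-in for an SQS queue."""

    def __init__(self, name):
        self.name = name
        self.messages = deque()

    def push(self, message):
        self.messages.append(message)


class Topic:
    """In-memory stand-in for an SNS topic that fans out to subscribed queues."""

    def __init__(self, name):
        self.name = name
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def push(self, message):
        # Every subscribed queue gets its own copy of the message.
        for queue in self.subscribers:
            queue.push(message)


# One-to-one: the producer pushes straight to a single queue.
orders = Queue("orders")
orders.push("order-created")

# One-to-many: the producer pushes to a topic; each subscriber queue receives it.
events = Topic("wishlist_events_exchange")
audit_log = Queue("audit_log")
mobile_events = Queue("process_mobile_events")
events.subscribe(audit_log)
events.subscribe(mobile_events)
events.push("item-added")
```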

Infrastructure

Implementation

Producer

A producer implements the functionality required to push a message to a channel. Producer instances are implemented as thread-safe singleton objects. During the initialization, we ensure that all the infrastructure dependencies of the channel are fulfilled before we start using it for the message delivery. This will remove the overhead of checking the infrastructure dependencies every time the message is pushed.

Producers don’t know anything about the subscribers/consumers of their message channel; the two sides are totally decoupled. A producer always pushes its messages to a single channel. Consumer groups then subscribe to that channel without knowing anything about the producers. This allows us to have the producer class defined in one microservice and the corresponding consumer classes in a totally different set of microservices.

Ideally, all producers should push their messages to the SNS topics and the consumers should be solely responsible to create the SQS queues and to subscribe to the SNS topics that they are interested in. Like a proper publish-subscribe model.

However, due to cost concerns, we decided to go ahead with an approach where we will push messages to either an SNS topic or an SQS queue depending on the subscriber type that we configure when we define the producer.

While pushing a message, if an infrastructure provisioning error occurs, the producer will try to reprovision all the required resources in an idempotent manner. And if some transient infrastructure exception occurs, for example API throttling, then the producer will retry that operation with an exponential backoff for a maximum duration of 15-20 seconds. In cases where there are any other runtime errors, the exception will be thrown explicitly.
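The retry behaviour for transient errors can be sketched as follows. This is a minimal illustration of exponential backoff with a cap on the total wait, not the library's code: `TransientError`, `push_with_backoff`, and the delay values are assumptions made for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as API throttling."""


def push_with_backoff(push, max_total_wait=15.0, first_delay=0.5, sleep=time.sleep):
    """Call `push`, retrying transient errors with exponentially growing delays."""
    waited = 0.0
    delay = first_delay
    while True:
        try:
            return push()
        except TransientError:
            # Give up once the next wait would exceed the overall budget.
            if waited + delay > max_total_wait:
                raise
            sleep(delay)
            waited += delay
            delay *= 2


# Demo: a push that is throttled three times before it succeeds.
attempts = {"count": 0}


def flaky_push():
    attempts["count"] += 1
    if attempts["count"] < 4:
        raise TransientError("throttled")
    return "message-id"


delays = []  # record the delays instead of actually sleeping
result = push_with_backoff(flaky_push, sleep=delays.append)
print(result, delays)
```

Any exception other than the transient one propagates immediately, which mirrors the rule that other runtime errors are thrown explicitly.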

Message ordering is another important capability that the producer supports. If it is enabled, the order in which messages are sent and received is strictly preserved and each message is delivered exactly once. Moreover, any message that is published with the same content to an ordered channel within a five-minute interval will be rejected by the system (message deduplication). This will enable enhanced messaging between applications where the order of messages is critical or where duplicate messages cannot be tolerated. End-to-end message ordering is supported in both one-to-one and one-to-many flows.
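The deduplication rule can be pictured with a small content-based filter. This is a toy model of the behaviour described above, not how the underlying service is implemented; the `DedupFilter` class and the use of explicit timestamps are assumptions made so the example is deterministic.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 5 * 60  # five-minute deduplication interval


class DedupFilter:
    """Rejects a message whose content was already accepted within the window."""

    def __init__(self):
        self._accepted_at = {}  # content hash -> time the content was last accepted

    def accept(self, body, now):
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        last = self._accepted_at.get(digest)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return False  # duplicate inside the window
        self._accepted_at[digest] = now
        return True


dedup = DedupFilter()
first = dedup.accept("order-42 created", now=0)      # new content
repeat = dedup.accept("order-42 created", now=120)   # same content, 2 minutes later
later = dedup.accept("order-42 created", now=300)    # window has passed
other = dedup.accept("order-43 created", now=130)    # different content
```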

In a one-to-many scenario, a special key called a routing_key can be passed along with each message that a producer produces. The consumer groups can choose to listen to only a subset of the incoming messages depending on the routing key value of the message. Supported filter types on the consumer side are:

Exact match filter
Prefix match filter
Exclude filter(with exact match)

# Producer class
class WishListEventProducer(MultiProducer):
    channel = 'wishlist_events_exchange'
    ordered = True

WishListEventProducer().push(body=..., routing_key='MOBILE.APPLE')
WishListEventProducer().push(body=..., routing_key='LAPTOP.LENOVO')
WishListEventProducer().push(body=..., routing_key='MOBILE.ONEPLUS')

# Consumer classes

# Receives all messages
class AuditWishlistEventsConsumer(Consumer):
    channel = 'wishlist_events_exchange.audit_log'
    ordered = True
    delay = 10  # 10 sec delay

# Receives only those messages with 'MOBILE.' prefix in their routing keys.
class ProcessMobileEventsConsumer(Consumer):
    channel = 'wishlist_events_exchange.process_mobile_events'
    ordered = True
    delay = 30  # 30 sec delay
    message_filter = MessageFilter(filter_type='prefix', values=['MOBILE.'])


# Receives only those messages with 'MOBILE.APPLE' as the routing key.
class ProcessAppleMobilesConsumer(Consumer):
    channel = 'wishlist_events_exchange.process_apple_mobile_events'
    ordered = True
    message_filter = MessageFilter(filter_type='exact', values=['MOBILE.APPLE'])
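The semantics of the three filter types can be summarized with a small matching function. This is a hedged sketch, not the library's implementation: the `matches` function is hypothetical, and the `'exclude'` filter type name is an assumption (only `'exact'` and `'prefix'` appear in the examples above).

```python
def matches(filter_type, values, routing_key):
    """Return True if a message with `routing_key` passes the filter."""
    if filter_type == "exact":
        return routing_key in values
    if filter_type == "prefix":
        return any(routing_key.startswith(prefix) for prefix in values)
    if filter_type == "exclude":  # assumed name: everything except exact matches
        return routing_key not in values
    raise ValueError(f"unknown filter type: {filter_type}")


keys = ["MOBILE.APPLE", "LAPTOP.LENOVO", "MOBILE.ONEPLUS"]

mobile_only = [k for k in keys if matches("prefix", ["MOBILE."], k)]
apple_only = [k for k in keys if matches("exact", ["MOBILE.APPLE"], k)]
not_apple = [k for k in keys if matches("exclude", ["MOBILE.APPLE"], k)]
```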

Consumer

A consumer implements the functionality required to listen, consume, and process a message that was pushed to a channel. It will always be listening to an SQS queue and will be the owner of that queue, thus responsible for provisioning and maintaining the underlying queue infra.

We wanted our consumer implementation to be simple and reliable. Inside the container environments where we usually run our consumers, we wanted to make sure that we run only the main entry point process and nothing else. We did not want to spawn additional processes/threads to monitor the health of the consumer or to send heartbeat messages to keep the consumer connection alive, as in our previous (RabbitMQ-based) implementation, because that would introduce an overhead in monitoring those additional processes and in making sure that they are all up and running alongside the main container process (PID 1). The reliability of the container entry point process is taken care of by AWS ECS, which we use for container orchestration and management.

The consumer is implemented as a state machine. This helps us in configuring lifecycle hooks and triggering event handlers on state transitions. This pattern gives a lot of flexibility to the developers in implementing custom wrappers around the message-processing part and custom handlers that get triggered at different stages of the consumer lifecycle. Moreover, this allows us to have a different set of event handlers for different consumers unlike in our previous implementation where we could only have a single common wrapper for all the consumers. Furthermore, this pattern significantly improves the code readability by placing the event handlers alongside the main consumer logic.
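The state machine idea can be illustrated with a toy consumer that runs registered handlers on each transition. The state names, event names, and the `ToyConsumer` class are all hypothetical and chosen for illustration; they are not the states or hooks of the actual consumer.

```python
class ToyConsumer:
    """Tiny state machine: handlers registered per event run on each transition."""

    # state -> {event: next state}; states and events are made up for this sketch
    TRANSITIONS = {
        "idle": {"receive": "processing"},
        "processing": {"succeed": "idle", "fail": "idle"},
    }

    def __init__(self):
        self.state = "idle"
        self._hooks = {}

    def on(self, event, handler):
        """Register a lifecycle hook for an event."""
        self._hooks.setdefault(event, []).append(handler)

    def fire(self, event, message):
        next_state = self.TRANSITIONS[self.state].get(event)
        if next_state is None:
            raise ValueError(f"event {event!r} is not allowed in state {self.state!r}")
        self.state = next_state
        for handler in self._hooks.get(event, []):
            handler(message)


log = []
consumer = ToyConsumer()
consumer.on("receive", lambda m: log.append(("received", m)))
consumer.on("fail", lambda m: log.append(("failed", m)))

consumer.fire("receive", "msg-1")
consumer.fire("fail", "msg-1")
```

Each consumer instance carries its own set of hooks, which is what makes it possible to give different consumers different event handlers.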

On the consumer side, if the processing of a message takes a lot of time, the overall throughput of that consumer pool may drop. This could lead to longer wait times in the queue, which will in turn affect the average processing latency of that flow. What if a lot of such rogue messages are being processed by the consumer pool at almost the same time? What if the processing time of a specific message is directly proportional to the length of the inputs and some of them are taking forever to get processed? How do we handle such rogue or long-running messages without affecting the overall throughput of the consumer pool? Yes, what you guessed is right. We need to put a cap on the processing time of a single message depending on the use case of the given consumer. That’s exactly why we introduced a new parameter called processing timeout in our consumer config.

Processing timeout is the maximum time the consumer is allowed to spend processing any given message. The developer has to set this to a sensible value so that this threshold is never reached in normal scenarios. This number can eventually be tuned to a much more accurate limit with the help of APM tools like NewRelic or DataDog, based on the historical performance of that consumer. If the processing timeout threshold is reached before the message processing is complete, the consumer will abruptly stop processing the message and raise the ProcessingTimeoutException exception. If a dead letter queue (exception queue) is configured, the failed message will be forwarded to it and the consumer will start processing the next available message in the queue. Otherwise, the exception will be raised explicitly and the consumer process will be terminated.

We allow developers to set the value of the processing timeout to any integer between 1 second and 1800 seconds (30 minutes) depending on the use case. Here is the sample implementation:

import signal
from contextlib import contextmanager

def register_signal(self, _signal, _handler):
    signal.signal(_signal, _handler)
    # Make sure that the system calls are restarted when
    # interrupted by the given signal.
    signal.siginterrupt(_signal, False)

def timeout_handler(self, _signal, _frame):
    raise ProcessingTimeoutException

@contextmanager
def ticker(self, timeout):
    register_signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout)  # Setting SIGALRM timeout to `timeout` seconds
    try:
        yield
    finally:
        if self.should_reset_timeout_alarm:
            signal.alarm(0)  # Resetting SIGALRM timeout to 0


with ticker(processing_timeout):
    handler(message) # Process the message

Delay tolerance is another new parameter that we added to our consumer config. It is the maximum allowed delay in processing a message, measured from the time it was pushed to the corresponding channel. The developers need to set this value to a sensible number based on the processing time of the message plus a sufficient buffer. In our case, the default value is 2 minutes. If this option is configured, the system will try to process the message within the tolerated delay on a best-effort basis. As of now, this parameter is not being used actively in any of our flows, but the plan is to use it to adjust the capacity of the consumer pool dynamically so that the time taken for message processing falls within the delay-tolerance limits of the given consumer.
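To make the definition concrete, here is a minimal sketch of a delay-tolerance check. The function name `exceeds_delay_tolerance` and the way timestamps are passed in are assumptions made for illustration; the post does not show how he-messenger evaluates this parameter.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_DELAY_TOLERANCE = timedelta(minutes=2)  # default mentioned above


def exceeds_delay_tolerance(pushed_at, now, tolerance=DEFAULT_DELAY_TOLERANCE):
    """Return True if the message has waited longer than the tolerance."""
    return (now - pushed_at) > tolerance


pushed_at = datetime(2022, 6, 17, 10, 0, 0, tzinfo=timezone.utc)
on_time = pushed_at + timedelta(seconds=90)
late = pushed_at + timedelta(minutes=3)

print(exceeds_delay_tolerance(pushed_at, on_time))  # False
print(exceeds_delay_tolerance(pushed_at, late))     # True
```

A capacity controller could count how many recent messages return True here and use that as one input when deciding whether to grow the consumer pool.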

SQS moves messages from the source queue to its corresponding dead-letter queue if the consumer of the source queue fails to process a message a specified number of times. In our case, we retry a message a maximum of three times.
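In SQS, this behaviour is controlled by the `RedrivePolicy` queue attribute, a JSON string with `deadLetterTargetArn` and `maxReceiveCount` fields. The sketch below only builds that attribute value; the helper name `build_redrive_policy` and the ARN are placeholders, and how he-messenger actually provisions its queues is not shown in this post.

```python
import json


def build_redrive_policy(dlq_arn, max_receive_count=3):
    """Build the value of the SQS `RedrivePolicy` queue attribute."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })


# Placeholder ARN, for illustration only.
policy = build_redrive_policy("arn:aws:sqs:us-east-1:123456789012:example-dlq")
print(policy)
```

The resulting string would be supplied under the `RedrivePolicy` key of the queue's attributes when creating or updating the source queue.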

def handle_processing_timeout_error(message, err):
    # Trigger `ON_PROCESSING_TIMEOUT` life-cycle hook
    trigger_hook(ConsumerHooks.ON_PROCESSING_TIMEOUT, message)
    if dlq:
        # Forward the message to the corresponding dead letter queue
        forward_message_to_dlq(message)
    else:
        # Re-raise the exception if no dlq is configured.
        raise err

def consume():
    # Transition the consumer to the `LISTENING` state
    transition_to_listening_state()
    # Continue to listen to the queue for new messages until a graceful shutdown signal is received
    while not signal_received:
        should_reprocess_message = False
        # Receive a message from the queue if available. Otherwise, wait for
        # a maximum of 20 seconds before concluding that no messages are available.
        message = receive_message()
        if not message:
            continue

        try:
            # Transition the consumer to the `ConsumerStates.PROCESSING` state
            transition_to_processing_state(message)
            # Start a ticker to make sure that the processing of the message
            # would not exceed the `ProcessingTimeout` seconds.
            with ticker(processing_timeout):
                # Process the message
                handler(message)
        except ProcessingTimeoutException as err:
            handle_processing_timeout_error(message, err)  # Consumer took too long to process the message
        except Exception as err:
            handle_runtime_error(message, err)  # Runtime error while processing the message
        finally:
            if should_reprocess_message:
                # Redelivers the message to the same queue with a 5 second delay.
                message.redeliver(delay=5)
            else:
                # Deletes the message from the queue.
                message.delete()
            # Transition the consumer to the `ConsumerStates.IDLE` state
            transition_to_idle_state()

Consumer states

Custom wrappers

The processing of each message can be wrapped within a custom functionality as shown below.

Example: Internationalization

To make all our asynchronous background jobs internationalization (i18n) aware, we pass the locale context along with the main message payload.

# On the producer side
class MyProducer(Producer):
    # Mainly to pass some extra context alongside the message payload
    def meta_headers(self):
        meta_headers = super(MyProducer, self).meta_headers()
        meta_headers.update({
            'locale': get_current_locale()
        })
        return meta_headers

The locale context associated with each message can be used to make the message processing part i18n aware on the consumer’s end.

# On the consumer side
class MyConsumer(Consumer):
    @register_hook(LifecycleHooks.MSG_PROCESSING_START)
    def message_processing_start_handler(self, message):
        locale = message.meta_headers['locale']
        activate_locale(locale)

    @register_hook(LifecycleHooks.MSG_PROCESSING_END)
    def message_processing_end_handler(self):
        deactivate_locale()


# The handlers will be executed in the following order:
# 1. Receives a message from the SQS queue
# 2. Calls MSG_PROCESSING_START handler
# 3. Calls the main message handler
# 4. Calls MSG_PROCESSING_END handler

Consumer instrumentation

Instrumentation of the message handler can be done by slightly modifying the consumer-handler logic. The following example shows a way to integrate the NewRelic APM client with the consumer class for instrumenting the message-handler logic.

import os


class MyConsumer(Consumer):
    channel = "update_phonenumber"

    def handler(self, message):
        if os.getenv("APM_INSTRUMENTATION_ENABLED"):
            import newrelic.agent
            with newrelic.agent.BackgroundTask(newrelic.agent.application(),
                    name=self.channel):
                # Call the base Consumer handler inside the APM transaction
                super(MyConsumer, self).handler(message)
        else:
            super(MyConsumer, self).handler(message)

Consumer health checks

We periodically monitor the health of our consumer containers to ensure that they are actively listening to the queue, consuming messages, and processing them within the expected timeframe.

If a consumer is stuck in any of the states INITIALIZING, INITIALIZED, PROCESSING, or EXITING for some reason, we try to kill that container gracefully and spawn a new one to replace it. Health checks are configured using the Docker HEALTHCHECK command.

Here is the health check configuration from one of our consumer task definitions:

"healthCheck": {
    "command": [
        "CMD-SHELL", "check_consumer_health || exit 1"
    ],  
    "interval": 30, 
    "timeout": 10, 
    "retries": 5,
    "startPeriod": 300
}

Here, we use a file for inter-process communication between the main consumer process and the health check process. Whenever the consumer process transitions from one state to another, it writes the required data to a file. The health check process periodically reads and processes that data to determine the current state of the consumer process. We lock that file while reading from or writing to it to ensure data correctness.
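The consumer-side writer is not shown in this post, so the following is only a sketch of what it might look like. The function name `write_health_record` is hypothetical, and to stay within the standard library this sketch swaps the file lock for an atomic rename, which is a different technique from the locking described above. The field names and timestamp format match what the health check script expects.

```python
import datetime
import json
import os
import tempfile


def write_health_record(path, state, healthcheck_timeout, now=None):
    """Atomically write the consumer's current state for the health checker."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    record = {
        "state": state,
        "healthcheck_timeout": healthcheck_timeout,
        "transition_timestamp": now.strftime("%Y-%m-%d %H:%M:%S.%f"),
    }
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(record, f)
    os.replace(tmp_path, path)  # atomic swap on the same filesystem
    return record


with tempfile.TemporaryDirectory() as tmp_dir:
    health_file = os.path.join(tmp_dir, "health_check_file")
    write_health_record(health_file, "PROCESSING", 60)
    with open(health_file) as f:
        loaded = json.load(f)
    print(loaded["state"])
```

With a rename, a reader sees either the old record or the new one in full, never a partially written file.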

Here is the sample implementation of the script that is responsible for performing consumer health checks:

# check_consumer_health.py

import datetime, json, os, sys
from filelock import FileLock, Timeout


CONSUMER_STATES_TO_MONITOR = ['INITIALIZING', 'INITIALIZED', 'PROCESSING', 'EXITING']

def process_health_check_data(data):
    state = data['state']
    healthcheck_timeout = data['healthcheck_timeout']
    state_transition_timestamp = data['transition_timestamp']
    state_transition_datetime = datetime.datetime.strptime(
        state_transition_timestamp, '%Y-%m-%d %H:%M:%S.%f')
    now = datetime.datetime.utcnow()
    time_elapsed = (now - state_transition_datetime).total_seconds()
    if state in CONSUMER_STATES_TO_MONITOR and time_elapsed > healthcheck_timeout:
        return False
    return True

def check_consumer_status():
    try:
        # Wait for a maximum of 5 seconds for acquiring the lock
        with FileLock('/tmp/health_check_file.lock', timeout=5):
            contents = None
            with open('/tmp/health_check_file', 'r') as f:
                contents = json.loads(f.read())
            status = process_health_check_data(contents)
            if status is True:
                sys.exit(0)
            else:
                sys.exit(1)
    except IOError:
        sys.exit(0)
    except Timeout:
        sys.exit(0)
    except SystemExit:
        raise
    except Exception:
        sys.exit(1)

if __name__ == "__main__":
    check_consumer_status()

Graceful consumer shutdown

Shutting down consumer processes gracefully is important to prevent partial processing, data loss, bad exits, or unreleased resources. While terminating a container, the ECS system will first send a SIGTERM signal to the container’s entry-point process (usually PID 1) to notify it that it will be killed. Once the consumer process gets this signal, it will stop consuming new messages from the queue, finish the ongoing processing if any and clean up the resources it used.

When the consumer receives a SIGINT or SIGTERM signal, it waits for a maximum of 30 seconds before exiting abruptly. This ensures that the consumer process gets sufficient time to finish the pending processing and make a clean exit. The following diagram explains the container shutdown flow of an ECS task.

(Source: AWS Documentation)
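A minimal sketch of this shutdown logic, assuming a flag that the consume loop checks between messages and a 30-second grace period. The class `GracefulShutdown` and its methods are hypothetical names and do not come from he-messenger; the demo calls the handler directly instead of sending a real signal.

```python
import signal
import time

SHUTDOWN_GRACE_SECONDS = 30  # grace period mentioned above


class GracefulShutdown:
    def __init__(self, clock=time.monotonic):
        self.signal_received = False
        self.deadline = None
        self._clock = clock

    def handle(self, signum, frame):
        # Stop picking up new messages and start the grace-period countdown.
        self.signal_received = True
        self.deadline = self._clock() + SHUTDOWN_GRACE_SECONDS

    def grace_period_expired(self):
        return self.deadline is not None and self._clock() >= self.deadline

    def install(self):
        # Must be called from the main thread of the process.
        for signum in (signal.SIGINT, signal.SIGTERM):
            signal.signal(signum, self.handle)


# Simulate receiving SIGTERM with a fake clock.
fake_now = [100.0]
shutdown = GracefulShutdown(clock=lambda: fake_now[0])
shutdown.handle(signal.SIGTERM, None)
print(shutdown.signal_received, shutdown.grace_period_expired())  # True False
fake_now[0] += 31
print(shutdown.grace_period_expired())  # True
```

In a real consumer, `install()` would be called once at startup, and the loop would exit as soon as `signal_received` is set and the in-flight message is done, or when the grace period expires.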

Handling large messages

If the total message payload size (including body and attributes) crosses the 256KB threshold, then:

The contents of the message will be offloaded to the S3 storage
The S3 object reference will be sent as part of the main SNS/SQS message payload

he-messenger abstracts out the complexity around handling large messages and exposes only the fixed contracts for pushing and receiving messages irrespective of the size of the message.
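The offloading decision can be sketched roughly as follows. A plain dictionary stands in for S3, and the function names, payload shape, and key format are all assumptions made for illustration rather than he-messenger's real contract.

```python
import json
import uuid

MAX_INLINE_BYTES = 256 * 1024  # threshold mentioned above


def prepare_payload(body, attributes, store):
    """Return the payload to publish, offloading oversized messages to `store`."""
    encoded = json.dumps({"body": body, "attributes": attributes}).encode("utf-8")
    if len(encoded) <= MAX_INLINE_BYTES:
        return {"inline": True, "body": body, "attributes": attributes}
    key = "messages/" + uuid.uuid4().hex
    store[key] = encoded  # stands in for an S3 upload
    return {"inline": False, "s3_key": key}


def resolve_payload(payload, store):
    """Inverse of prepare_payload: fetch offloaded content if needed."""
    if payload["inline"]:
        return payload["body"]
    return json.loads(store[payload["s3_key"]].decode("utf-8"))["body"]


fake_s3 = {}
small = prepare_payload("hello", {}, fake_s3)
large = prepare_payload("x" * (300 * 1024), {}, fake_s3)
print(small["inline"], large["inline"], len(fake_s3))  # True False 1
```

Producers and consumers only ever call the two functions, so the message size never leaks into their code.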

Message retention

If, for some reason, the messages that are pushed to a channel by the producer are not consumed on the other end, they may accumulate inside the channel’s buffer (either in SQS or S3). Such messages will continue to be available for consumption until the maximum message retention period of 14 days is reached. The system will automatically delete or expire the messages that exceed the retention period. Longer message retention gives developers greater flexibility to debug, fix, and requeue any problematic messages from DLQs to their corresponding source queues easily.

Content encryption

Encryption at rest (server-side encryption) and encryption in transit have been enabled in all three underlying AWS services—SNS, SQS, and S3. Messages stored in both the standard and the ordered channels are encrypted using a customer-managed, KMS encryption key.

Message requeuing

We have implemented a requeuer utility that helps us in consuming messages from a queue, for example, a dead-letter queue, and pushing them to the destination channel.

This is how we trigger message requeuing from one channel to another:

requeuer = Requeuer(
    source_queue='<source queue name>',
    destination_channel='<destination channel name>',
    destination_type=SubscriberTypes.MultiConsumer
)
requeuer.start()

Local development

In this mode, the producers and consumers connect to a locally running Localstack server, which is a mocking framework for AWS services, instead of talking to the actual AWS services. This allows us to develop and test the asynchronous flows on our local machines without ever talking to the AWS cloud. Activation of this mode is handled by environment-specific variables. To achieve this without a lot of code-level changes, we monkey patch the entry points of the boto3 package with Localstack-specific ones.

Here is the sample implementation code:

import boto3


def patch_boto3():
    import localstack_client.session
    localstack_session = localstack_client.session.Session()
    boto3.client = localstack_session.client
    boto3.resource = localstack_session.resource


if LOCAL_ENVIRONMENT:
    patch_boto3()

Infrastructure ownership and tracking

he-messenger owns all the infrastructure that it provides as part of our asynchronous flows. It takes responsibility for provisioning, maintaining, and cleaning up those resources. We use AWS resource-level tags to store ownership information, alarm thresholds, and other related meta information. Tags help us get a detailed breakdown of the AWS costs per resource or per use case. Here is a sample list of tags that we usually store along with each resource that is provisioned through he-messenger.

Error handling

We made sure that transient errors are handled gracefully by configuring retries at the interfaces between any two components, for example, the producer-SNS interface, the SNS-SQS interface, etc. Messages that cannot be delivered by an SNS topic to its subscriber queues due to client errors or server errors are eventually routed to the corresponding dead-letter queue for further analysis or reprocessing.

If there is a ProcessingTimeoutException or any other runtime exception, the message will eventually be routed to the dead-letter queue. Therefore, a message that is published from a producer to the channel will always be available either in the source queue or the dead-letter queue if it has not been deleted by the consumer.

Alerting

he-messenger auto-configures alerts for all the SQS queues that it owns to send notifications about the problematic queues if any. It runs two lambda functions periodically to monitor both the source queues and the dead-letter queues every 15 minutes and 12 hours respectively.

A lambda function checks the value of the ApproximateNumberOfMessagesVisible metric for each of the queues and also checks if that value is crossing the configured alarm threshold. It also checks whether any of those problematic queues are tagged as critical. It will flag these during notifications.
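The core of that check can be sketched as a small function. The input shape (a list of dictionaries with `visible_messages`, `alarm_threshold`, and `critical` keys) and the function name are hypothetical; in practice the values would come from CloudWatch metrics and resource tags.

```python
def find_problematic_queues(queues):
    """Return (name, is_critical) for queues whose backlog crosses the threshold."""
    flagged = []
    for q in queues:
        if q["visible_messages"] > q["alarm_threshold"]:
            flagged.append((q["name"], q.get("critical", False)))
    return flagged


# Made-up sample data, for illustration only.
queues = [
    {"name": "orders", "visible_messages": 1200, "alarm_threshold": 1000, "critical": True},
    {"name": "emails", "visible_messages": 10, "alarm_threshold": 500},
    {"name": "reports", "visible_messages": 75, "alarm_threshold": 50},
]
print(find_problematic_queues(queues))  # [('orders', True), ('reports', False)]
```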

Here are the two different message variants (non-critical and critical) that he-messenger posts to our internal Slack channels if there are any problematic queues:

Resource cleaner

All unused or inactive AWS resources provisioned by the he-messenger service will eventually be deleted by the resource-cleaner utility, which is an AWS lambda function that runs once every day.

Here are the steps that are used to classify and clean up unused resources:

Get all SNS topics and SQS queues owned by the he-messenger service.
Filter the SNS topics that have zero metric value consistently for the past 1 day for the following metrics:
- NumberOfMessagesPublished
- NumberOfNotificationsDelivered
Mark those SNS topics as inactive
Get a list of active SNS topics by removing the inactive topics from the list of valid SNS topics owned by he-messenger
Mark all those SQS queues that are subscribed to these topics as active.
Get the values of the following metrics for each of the valid queues for a 3-day duration:
- ApproximateNumberOfMessagesVisible
- NumberOfMessagesSent
- NumberOfMessagesReceived
- NumberOfEmptyReceives
The sum of the values of the NumberOfMessagesReceived and NumberOfEmptyReceives metrics represents the behavior of the consumer pool listening to the given SQS queue. If the sum is zero, there are no consumers actively listening to this queue.
Classify the queues into the following categories based on the values of the metrics mentioned above:
- Inactive queues with no messages (to be deleted)
- Inactive queues with stale messages (to be notified)
- Queues with no producers (to be notified)
- Queues with no consumers (to be notified)
If the values of the NumberOfMessagesSent and (NumberOfMessagesReceived + NumberOfEmptyReceives) metrics are consistently zero, the queue is marked as inactive, provided it is not subscribed to any of the active SNS topics.
If the value of the ApproximateNumberOfMessagesVisible metric is zero for any of those inactive queues then they will be marked for deletion. Otherwise, they will be categorized as ‘Inactive queues with stale messages’.
In the remaining queues, if the value of the NumberOfMessagesSent metric is consistently 0, those will be categorized as ‘Queues with no producers’.
In the remaining queues, if the value of (NumberOfMessagesReceived + NumberOfEmptyReceives) metric is consistently 0, those will be categorized as ‘Queues with no consumers’.
The infra cleaner reports the following metrics to the Engineering team:
- Unused SNS topics deleted
- Unused SQS queues deleted
- SQS queues with stale messages
- SQS queues without active producers
- SQS queues without active consumers
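The queue classification rules above can be sketched as a single function. The metric names are the CloudWatch names listed in the steps, but the input shape (one list of datapoints per metric), the returned labels, and the use of the maximum backlog value are assumptions made for this illustration.

```python
def classify_queue(metrics, subscribed_to_active_topic):
    """Classify one SQS queue from its 3-day metric datapoints (hypothetical shape)."""
    sent = sum(metrics["NumberOfMessagesSent"])
    # Receives plus empty receives approximate consumer polling activity.
    polls = (sum(metrics["NumberOfMessagesReceived"])
             + sum(metrics["NumberOfEmptyReceives"]))
    visible = max(metrics["ApproximateNumberOfMessagesVisible"], default=0)

    if sent == 0 and polls == 0 and not subscribed_to_active_topic:
        if visible == 0:
            return "inactive_no_messages"   # to be deleted
        return "inactive_stale_messages"    # to be notified
    if sent == 0:
        return "no_producers"
    if polls == 0:
        return "no_consumers"
    return "active"


idle = {
    "NumberOfMessagesSent": [0, 0, 0],
    "NumberOfMessagesReceived": [0, 0, 0],
    "NumberOfEmptyReceives": [0, 0, 0],
    "ApproximateNumberOfMessagesVisible": [0, 0, 0],
}
unread = dict(idle, NumberOfMessagesSent=[5, 0, 2],
              ApproximateNumberOfMessagesVisible=[5, 5, 7])

print(classify_queue(idle, subscribed_to_active_topic=False))    # inactive_no_messages
print(classify_queue(unread, subscribed_to_active_topic=False))  # no_consumers
```

Since all of these metrics are non-negative counts, a sum of zero is equivalent to every datapoint being zero, which is how "consistently zero" is interpreted here.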

Here are the sample messages that he-messenger sends after cleaning up unused resources.

Conclusion

We built he-messenger to reliably and effectively run resource-intensive or time-intensive tasks asynchronously in the background. We were able to migrate all our existing flows to this new flow smoothly. It now runs hundreds of asynchronous flows and has been serving some of the most critical use cases at HackerEarth for the past year. We are planning to make this library open source soon. If you are interested in working on projects like this and helping recruiters find the right talent they need, HackerEarth is hiring!

If you have any queries or wish to talk more about this architecture or any of the technologies involved, you can mail me at jagannadh@hackerearth.com.

Posted by Jagannadh Vangala

http://engineering.hackerearth.com/2022/06/17/building-a-reliable-asynchronous-job-pipeline

How to set a React Component or dom element as a background image

Aug 12, 2021

During my internship at HackerEarth, I faced an interesting problem. This blog is about that and how I solved it.

Problem: To set a background image to the textarea element.

My initial impression on seeing the design was that it would be easy. I thought the background was an image, but after exploring the code I came to know that what we had to show as the background was not an image but a React Component. So now what? To solve this problem, we need to think from scratch.

Solution:

First, we will think about how to do it and then will implement it step by step.

1. Think From Scratch

In this example, as you can see, the content in the body element overlaps the background image. To solve our problem, we just need to overlap the React Component with the textarea in the same way.

2. Implementation

Create a textarea and React Component which we are going to use. The text area is to the left and to the right is the React Component which we are going to set as a background image in the textarea element.

Steps:

1. Create a div that wraps the textarea and the React Component. Set position: relative on the div, and position: absolute, top: 0 and left: 0 on the React Component.

<div className="editor-container"> {/* position: relative */}
  <textarea className="editor"/>
  <BackGround /> {/* position: absolute; top: 0; left: 0; */}
</div>

2. To place the textarea on top of the React Component, we need to set z-index: -1 on the React Component.

.bg-img {  
  position: absolute;  
  top: 0;  
  left: 0;  
  width: 250px;  
  font-family: monospace;
  text-align: center;  
  color: rgba(0, 0, 0, 0.29);  
  z-index: -1;
}

As you can see, we have a problem now: the React Component is below the textarea and we are not able to see it.

3. To solve the problem, we need to make the textarea background transparent. But then, if you click on the React Component, you won’t be able to edit the textarea. To solve this, set pointer-events: none on the React Component.

.editor {
  height: 400px;
  width: 400px;  
  background: rgba(0, 0, 0, 0); /* transparent background */
}
.bg-img {  
  position: absolute;  
  top: 0;  
  left: 0;  
  width: 250px;  
  font-family: monospace;
  text-align: center;  
  color: rgba(0, 0, 0, 0.29);  
  z-index: -1;
  pointer-events: none; /* do not react to pointer events */
}

4. (Optional) Set the position of the React Component at the centre.

.bg-img {
  position: absolute;
  width: 250px;
  font-family: monospace;
  text-align: center;
  color: rgba(0, 0, 0, 0.29);
  z-index: -1;
  pointer-events: none;  
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}

End Result

Posted by Ashu Deshwal, Frontend Engineer

http://engineering.hackerearth.com/2021/08/12/how-to-set-a-react-component-as-a-background-image

Zero to One and Beyond: HackerEarth's journey to Continuous Delivery

Aug 1, 2021

“Hey, the deployment is broken again. Can you push this change again?”
“Hey, my changes were merged in the morning. I still don’t see them in production.”
“Argh, the static files are not updated. We have to run the deployment again.”
These voices hollered across the hallway, and each was followed by a huddle to sort things out.

These voices soon grew louder and then we realised: Our Deployment is broken.

Prologue

At HackerEarth, we have always been good at embracing bleeding-edge technologies. We have always taken pride in doing what is right and in acknowledging when something needs a fix.
A faster delivery cadence and a quicker release cycle are very important for a startup of our scale. Our deployment-related problems threatened our fundamental need: “Pace”.

This prompted us to not just fix what was broken but to introduce a new paradigm to deployment - Continuous Deployment.

The key tenet of agile is to push small, consistent pieces of software to customers frequently and get feedback. Like any growing team, we were at one point struggling with more deployment failures and with critical issues leaking into production. What followed was our path to redemption.

Circa 2019, HackerEarth was already doing frequent deployments. We had a process to collect, merge, tag, and release code into production. But this was not enough. As must be obvious by now, our feedback cycle was not close to the point of failure. Integration happened close to production, so any failure was expensive to fix. This is where our journey up the CI-CD ladder began.

“You’re doing continuous delivery when:

Your software is deployable throughout its lifecycle

Your team prioritises keeping the software deployable over working on new features

Anybody can get fast, automated feedback on the production readiness of their systems any time somebody makes a change to them

You can perform push-button deployments of any version of the software to any environment on demand”
— Martin Fowler

When it all began

CI to CD in 3 steps

We decided to redraw the lines to achieve Continuous Delivery. The first step towards Continuous Delivery is Continuous Integration. Continuous Integration is the art of integrating different sources into a single outcome and evaluating whether the combination of changes works without issues.

This has to be done frequently, and the feedback should be close to the point of failure.
The next step is Continuous Delivery. This step ensures that you have not only tested your integrations continuously but are also deploying to various environments as frequently as possible.
The last step is Continuous Deployment, where part of the product can reach customers as soon as it is deemed fit. We were gunning for this.

The first step in our journey was self-assessment. An honest self-assessment is important here. Unless you know what is broken, you can never be sure when it is fixed. There are many methods to assess where you stand in your journey towards continuous delivery. We built our self-assessment on the maturity model laid out by Jez Humble in his book Continuous Delivery.

The model outlines a framework where you rate yourself on a scale of 0 to 3 under various parameters such as Build, Deploy, Test and Database.

The CI/CD Maturity Model

Our self-assessment revealed the obvious: we scored 0 across all the areas that we considered.

The StrikeOut! Take 1: Build

The biggest problem was our CI setup. Our branching and merging strategy needed a relook. We had to integrate and test our code more frequently. We needed to fail first, fail fast. To achieve this, we moved away from a custom integration setup to one based on an orchestrator. Enter Jenkins.

We created our build pipelines to run for every logical set of commits. We introduced a Pull Request based merge strategy and added safety nets around each PR. The result: a green build on a PR reduced our chances of a broken trunk significantly. Each PR is now a testable artefact capable of being deployed independently in a lower environment. Our build is now Repeatable, Consistent and Automated without any need for manual intervention. Strike one!

Take 2: Deploy

We extended our wins from the Build strategy into deployments. We introduced a doer-checker system for compliance and also created a set of repeatable functions that were orchestrated from trunk to release to production. Deployment was decoupled from Build, and this enabled us to push any release version anytime, anywhere. We introduced Lower Environments mimicking the setup of production, so the configurations were also similar and extensible. The result: code can be deployed to any environment without any major changes. Environments were now first-class citizens. The decoupled deployments also helped us do hot-fixes without affecting the regular deployment cadence. The CI tool (Jenkins) also made sure there is an audit trail. We now have build and deployment metrics. Strike two!

Take 3: Release

With the steps taken to fix deployments, accountability was already in place. We wired our Project Management system (JIRA) with our deployment orchestration. Each release was tagged, and the appropriate tickets were tagged with this release tag. This started giving visibility to the other stakeholders. “Hey, has my story been released yet?” was being answered. We started slow with our cadence. For our commit rate, we decided on doing one deployment a day. Each candidate build was tested with automated unit and acceptance tests and approved by the QA team. We also integrated a short benchmarking step to diagnose any performance degradation (more on this later). With this, we were able to reduce the deployment timelines by about 70%. Strike three!

Each of these steps involved multiple rounds of optimisations to achieve this. We also achieved a 98% rate of successful deployments.

The other tenets were Testing, Data Management and Configuration Management.

Testing

We had unit tests written, and automated acceptance tests were run... manually. Our build and deploy process was inclusive of testing right from the word go. The existing tests were leveraged, and we had already identified means to measure quality metrics. We put in extra effort to increase our test coverage across the platform and across the test levels. Our tests were environment agnostic and were able to give feedback on the state of the system as early as possible.

Data Management

Our database is hosted in Amazon RDS and thus by nature gives us the ability to deploy and roll back changes. We made process changes to test and run database migrations in lower environments before running them in production. We added checks in the deployment system to ensure the state of the database does not change without formal approval.

Configuration Management

We were already using SCM to version code, and we extended this to environment configurations, migrations, deployment scripts, etc. We introduced Poetry for dependency management and embraced Docker for containerisation. We reached a point where we followed IaaC guidelines to manage infrastructure.

The Epiphany

CD is a paradigm shift
Change does not happen overnight. Embracing Continuous Deployment is a behavioural shift. The team should move away from large deployments to smaller, shorter release cycles. This also means establishing a sturdy safety net to catch failures at each step.
Embrace DevOps culture
This requires the teams to abolish silos and work as cross-functional teams. Treat your test code and test infrastructure as first-class citizens.
Invest in Feature Flags
Feature flags (feature toggles) have been in use for a while. Start using these to push changes to trunk without affecting customer usage. Avoid long-running feature branches.
Improve - continuously
Start measuring your build and deploy metrics. Pick a north star metric around each tenet and work towards moving the needle. Celebrate small wins; every input to get there counts. Accelerate!

We assessed ourselves every quarter to see where we stood. We now stand at level 2, raring to go to level 3.
How do we get there? Accelerate!
Continuous Delivery is also a behavioural change. Now that the team is ready, we are adopting the “Accelerate” metrics, or four key metrics, to measure and improve: Lead Time, Deployment Frequency, Mean Time to Recovery, and Change Fail Percentage. Stay tuned for more updates from our rocket-ship as we discover what lies ahead of us.

Posted by Navaneethakrishnan R, Director of Quality Engineering

http://engineering.hackerearth.com/2021/08/01/CI-CD-Journey

How I built my first search component in React

Jul 17, 2020

In 2016, I was working to build a platform to help NGOs raise donations. The platform was supposed to be built using React. The beautiful designs were in place. I was excited to build some new features as well as to try out the new architecture using Redux on a larger scale.

Yes, this was the time when Redux was fairly new. React still had PropTypes package attached to its core. The lifecycle method componentWillReceiveProps used to dominate the scene.

Out of the many design components that I worked on while building that platform, I am going to discuss my first search component in React here.

Elements of a search bar

The search bar was a simple input field placed at the middle of the main header. The design idea was to have a search component which displays results as soon as the user inputs something in it. Then, to provide a simple cross icon at the right end of the input field to clear the inputs and hide the search results.

The search results were supposed to appear inside a modal starting below the main header of the site. Below are the designs to help you visualise things better:

If you typed on the search bar, a search results component appeared and showed NGOs, Live Projects and Campaigns.

Development Redux Architecture

While building apps using Redux architecture, one should be cognizant of its three principles (Single source of truth, State is read-only and Changes are made with pure functions).

In computer programming, a pure function is a function that has the following properties:

Its return value is the same for the same arguments

Its evaluation has no side effects
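A small sketch of the difference, using made-up function names: the first function is impure because it depends on and mutates outside state, while the other two are pure. The last one follows the reducer style, returning a new object instead of mutating its input.

```javascript
// Impure: reads and mutates state outside the function.
let total = 0;
function addToTotal(amount) {
  total += amount;
  return total;
}

// Pure: the result depends only on the arguments and nothing outside is touched.
function add(a, b) {
  return a + b;
}

// Reducer-style pure function: returns a new object instead of mutating the input.
function setInputValue(state, inputValue) {
  return { ...state, input_value: inputValue };
}

const before = { input_value: '', loading: false };
const after = setInputValue(before, 'ngo');
console.log(before.input_value === '', after.input_value === 'ngo');
```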

In our case, a single module (file) was created to include the action types, reducers and action creators. The flow was as follows: when the user typed something in the search field, we took that value and passed it to an action creator, which fetched the results for the queried string from the server and dispatched an action carrying the action type and the result. There was also a service layer in the middle, which handled caching and server errors.

Based on the action type, our reducer updated the main store’s state with the response data and an information key to show the search results modal. As soon as the store was updated, the views subscribed to it rendered again and the search container was added to the DOM to display the results.

Here is the module code for the Search component:

export const SEARCH_MOBILE_OPEN = 'SEARCH_MOBILE_OPEN';
export const SEARCH_TEXT_CHANGE = 'SEARCH_TEXT_CHANGE';
export const SEARCH_SET = 'SEARCH_SET';
export const SEARCH_APPEND = 'SEARCH_APPEND';
export const SEARCH_LOADER_STATE = 'SEARCH_LOADER_STATE';
export const SEARCH_RESET = 'SEARCH_RESET';

const initialState = {
  loading: true,
  input_value: '',
  is_load_more_visible: false,
  show_mobile_search: false,
  data: [] /* Search result items */
};
const paginationItemsCount = 6;

export default function search (state = initialState, action = {}) {
  switch (action.type) {
    case SEARCH_TEXT_CHANGE:
      return {
        ...state,
        loading: true,
        input_value: action.inputValue
      };

    case SEARCH_SET:
      return {
        ...state,
        data: action.data.docs,
        loading: false,
        is_load_more_visible: action.isLoadMoreVisible
      };

    case SEARCH_APPEND:
      return {
        ...state,
        data: state.data.concat(action.data.docs),
        loading: false,
        is_load_more_visible: action.isLoadMoreVisible
      };

    case SEARCH_LOADER_STATE:
      return {
        ...state,
        loading: true,
        is_load_more_visible: action.isLoadMoreVisible
      };

    case SEARCH_RESET:
      return {
        ...initialState,
        input_value: action.inputValue,
        show_mobile_search: false
      };

    case SEARCH_MOBILE_OPEN:
      return {
        ...state,
        show_mobile_search: true
      };

    case '@@router/LOCATION_CHANGE':
      return {...initialState};

    default:
      return state;
  }
}

export function loadSearchResultsFail(error) {
  return {
    type: SEARCH_LOADER_STATE,
    error,
    isLoadMoreVisible: false
  };
}

export function loadSearchResultsSuccess(data) {
  return {
    type: SEARCH_SET,
    data,
    isLoadMoreVisible: data.numFound > paginationItemsCount
  };
}

export function clearSearch(e) {
  e.preventDefault();
  return {
    type: SEARCH_RESET,
    inputValue: ''
  };
}
export function showMobileSearch(e) {
  e.stopPropagation();
  e.preventDefault();
  return {
    type: SEARCH_MOBILE_OPEN
  };
}

/* Note: api hits return data object -> data.response (object) has numFound,
docs and others -> data.response.docs (array) has search result items */
export function searchTextChange(inputValue, shouldCallApi) {
  return (dispatch, getState, services) => {
    dispatch({
      type: SEARCH_TEXT_CHANGE,
      inputValue: inputValue
    });

    let requestData = {
      params: {
        q: inputValue+'~',
        rows: paginationItemsCount
      }
    };
    if(shouldCallApi && inputValue.length>2) {
      return services.search.get(requestData)
        .then(data => {
          dispatch(loadSearchResultsSuccess(data.response));
        })
        .catch(error => {
          dispatch(loadSearchResultsFail(error));
        }
      );
    }
  };
}

export function loadSearchResults() {
  return (dispatch, getState, services) => {
    var searchInputValue = getState().search.input_value;
    dispatch({
      type: SEARCH_TEXT_CHANGE,
      inputValue: searchInputValue
    });

    let requestData = {
      params: {
        q: searchInputValue+'~',
        rows: paginationItemsCount
      }
    };
    if (searchInputValue.length > 2) {
      return services.search.get(requestData)
        .then(data => {
          dispatch(loadSearchResultsSuccess(data.response));
        })
        .catch(error => {
            dispatch(loadSearchResultsFail(error));
        });
    }
  };
}

export function loadMoreResults(itemsCount) {
  return (dispatch, getState, services) => {
    var searchInputValue = getState().search.input_value;

    dispatch({
      type: SEARCH_LOADER_STATE,
      isLoadMoreVisible: false
    });

    let requestData = {
      params: {
        q: searchInputValue+'~',
        rows: paginationItemsCount,
        start: itemsCount
      }
    };
    return services.search.get(requestData)
      .then(data => {
        dispatch({
          type: SEARCH_APPEND,
          data: data.response,
          isLoadMoreVisible: data.response.numFound > (itemsCount + data.response.docs.length)
        });
      })
      .catch(error => {
        dispatch(loadSearchResultsFail(error));
      });
  };
}
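The action creators above return functions that receive (dispatch, getState, services). The sketch below shows how such a function can be driven, using a tiny hand-rolled store rather than the real Redux and thunk middleware. `createTinyStore`, `fakeServices` and the trimmed-down reducer are stand-ins written for this example only.

```javascript
// Minimal stand-in for a Redux store with thunk support (not the real libraries).
function createTinyStore(reducer, services) {
  let state = reducer(undefined, {});
  const getState = () => state;
  const dispatch = (action) => {
    // Thunks are functions: call them with dispatch, getState and the injected services.
    if (typeof action === 'function') return action(dispatch, getState, services);
    state = reducer(state, action);
    return action;
  };
  return { dispatch, getState };
}

// Trimmed-down version of the search reducer above.
function search(state = { loading: false, input_value: '', data: [] }, action = {}) {
  switch (action.type) {
    case 'SEARCH_TEXT_CHANGE':
      return { ...state, loading: true, input_value: action.inputValue };
    case 'SEARCH_SET':
      return { ...state, loading: false, data: action.data.docs };
    default:
      return state;
  }
}

// Same shape as searchTextChange above, minus pagination and error handling.
function searchTextChange(inputValue, shouldCallApi) {
  return (dispatch, getState, services) => {
    dispatch({ type: 'SEARCH_TEXT_CHANGE', inputValue });
    if (shouldCallApi && inputValue.length > 2) {
      return services.search
        .get({ params: { q: inputValue + '~' } })
        .then((data) => dispatch({ type: 'SEARCH_SET', data: data.response }));
    }
  };
}

// Fake service that records calls and resolves with a canned response.
const calls = [];
const fakeServices = {
  search: {
    get(request) {
      calls.push(request.params.q);
      return Promise.resolve({ response: { numFound: 1, docs: [{ name: 'Demo NGO' }] } });
    },
  },
};

const store = createTinyStore(search, fakeServices);
// Too short: the text is stored but the service is not called.
store.dispatch(searchTextChange('ed', true));
// Long enough: the service is called and the results land in the store.
const pendingSearch = store.dispatch(searchTextChange('education', true));
pendingSearch.then(() => console.log(store.getState().data));
```

Injecting the service layer as the third argument keeps the action creators easy to exercise with a fake, as done here.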

React Components

Note: I have stripped out unrelated code for better clarity.

The dumb one
Disclaimer: All the code shared in this blog was written in 2016. Be mindful when using it. Do not categorise yourself under the same category as the below input component. :P

Input

import React, {Component} from 'react';
import styles from './Input.scss';

const Input = ({type, placeHolder, autoComplete, leftIcon, rightIcon, onChange, value, onRightIconClick}) => (
  <span className="input">
    <i className={leftIcon} aria-hidden="true"></i>
    <input autoFocus type={type} name={placeHolder}
           autoComplete={autoComplete}
           placeholder={placeHolder}
           onChange={onChange}
           value={value}/>
    <i className={rightIcon} aria-hidden="true" onClick={onRightIconClick}></i>
  </span>
);

Input.defaultProps = {
  type: "text",
  placeHolder: "Input",
  autoComplete: "on" /* Possible values - on and off */
};

Input.propTypes = {
  type: React.PropTypes.string.isRequired,
  placeHolder: React.PropTypes.string.isRequired,
  autoComplete: React.PropTypes.string,
  leftIcon: React.PropTypes.string,
  rightIcon: React.PropTypes.string,
  onChange: React.PropTypes.func,
  value: React.PropTypes.string,
  onRightIconClick: React.PropTypes.func
};

export default Input;

The onRightIconClick prop is expected so that the search input can be cleared.

The smart ones

Once a developer named Dan Abramov wanted to hire a smart guy to speed up the development of the React library. However, weeks went by and he had no success. Hence, he started naming his components as the smart ones.

SearchInput

import React, {Component} from 'react';
import styles from './SearchInput.scss';
import { connect } from 'react-redux';
import { bindActionCreators } from 'redux';
import * as Search from '../../redux/modules/Search';
import Input from '../../components/Input/Input';

class SearchInput extends Component {
    search_debounce_timer;
    on_search_text_change = (event) => {
        let inputValue = event.target.value;
        let _Search = this.props.Search;
        let search = this.props.search;
        if (this.search_debounce_timer) {
            window.clearTimeout(this.search_debounce_timer);
        }

        _Search.searchTextChange(inputValue, false); /*false - so no call to api, handles multiple/fast inputs via keyboard. Check searchTextChange action creator code in the search module mentioned above for better understanding.*/
        this.search_debounce_timer = window.setTimeout(function () {
            _Search.searchTextChange(inputValue, true);
        }, 500);
    };

    render() {
        const {search, Search} = this.props;
        let rightCrossIcon = (search.input_value.length>0) ? "fa fa-times-thin" : "";
        return (
            <Input leftIcon="fa fa-search" type="text"
                   placeHolder="Search for NGOs, Projects, Campaigns…"
                   onChange={this.on_search_text_change} rightIcon={rightCrossIcon}
                   onRightIconClick={Search.clearSearch}
                   value={search.input_value||""}/>
        );
    }
}

const mapStateToProps = (state) => ({
    search: state.search
});

const mapActionToProps = (dispatch) => ({
    Search: bindActionCreators(Search, dispatch)
});

export default connect(
    mapStateToProps,
    mapActionToProps
)(SearchInput);
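The timer handling in on_search_text_change is a hand-written debounce: every keystroke cancels the pending API call and schedules a new one. A generic version of that idea could look like the sketch below. The `timers` parameter and `createManualTimers` are additions made only so that the example runs without real delays; they are not part of the original component.

```javascript
// Generic debounce: only the last call within `wait` ms reaches `fn`.
// `timers` is injectable purely so the sketch can be exercised without real delays.
function debounce(fn, wait, timers = {
  setTimeout: (callback, ms) => setTimeout(callback, ms),
  clearTimeout: (id) => clearTimeout(id),
}) {
  let timerId = null;
  return function debounced(...args) {
    if (timerId !== null) timers.clearTimeout(timerId);
    timerId = timers.setTimeout(() => {
      timerId = null;
      fn(...args);
    }, wait);
  };
}

// Tiny manual scheduler standing in for window.setTimeout/clearTimeout.
function createManualTimers() {
  const pending = new Map();
  let nextId = 1;
  return {
    setTimeout(callback) {
      const id = nextId++;
      pending.set(id, callback);
      return id;
    },
    clearTimeout(id) {
      pending.delete(id);
    },
    // Runs every callback that is still scheduled, as if the wait had elapsed.
    flush() {
      const callbacks = [...pending.values()];
      pending.clear();
      callbacks.forEach((callback) => callback());
    },
  };
}

const manualTimers = createManualTimers();
const apiCalls = [];
const searchApi = debounce((query) => apiCalls.push(query), 500, manualTimers);

// Three quick keystrokes, then the user pauses.
searchApi('e');
searchApi('ed');
searchApi('edu');
manualTimers.flush();
console.log(apiCalls); // → [ 'edu' ]
```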

SearchResults

import React, {Component} from 'react';
import styles from './SearchResults.scss';
import { connect } from 'react-redux';
import { bindActionCreators } from 'redux';
import Tabs from '../../components/Tabs/Tabs';
import * as Search from '../../redux/modules/Search';
import NoSearchResults from '../../components/NoSearchResults/NoSearchResults';
import Loader from '../../components/Loaders/Loader/Loader';
import Button from '../../components/Button/Button';

class SearchResults extends Component {
  componentDidMount() {
    if(this.props.show) {
    // remove scrolling from body
      document.getElementsByTagName('body')[0].style.overflow = 'hidden';
    }
  }

  componentWillReceiveProps(nextProps) {
    // remove scroll from body and give transparent bg to header
    let headerContainer = document.getElementsByClassName('header-container')[0];
    if(nextProps.show) {
      document.getElementsByTagName('body')[0].style.overflow = 'hidden';
      headerContainer.classList.add('header-transparent-bg');
    } else {
      document.getElementsByTagName('body')[0].style.overflow = '';
      headerContainer.classList.remove('header-transparent-bg');
    }
  }

  render() {
    const {show, search, Search} = this.props;

    var items = search.data.map(function (item, i) {
      // Render the item's name; the key helps React track the list items
      return <li key={i}>{item.name}</li>
    });

    return (show) ? <div>
      <div className="overlay-bg" onClick={Search.clearSearch}></div>
      <div className="search-wrapper">
        <div className="search-results">
            {(items.length>0) ? items :
              (search.loading) ? <Loader/> :
                <NoSearchResults clearSearch={Search.clearSearch}
                                 link="/discover"/>}
        </div>
        {(search.is_load_more_visible) ?
          <div className="more-results-btn">
            <Button buttonClass="btn2" name="SHOW MORE RESULTS"
                    onClick={function() {
                      Search.loadMoreResults(search.data.length);
                    }}/>
          </div> : null
        }
      </div>
    </div> : null;
  }
}

SearchResults.propTypes = {
  show: React.PropTypes.bool.isRequired
};

const mapStateToProps = (state) => ({
  search: state.search
});

const mapActionToProps = (dispatch) => ({
  Search: bindActionCreators(Search, dispatch)
});

export default connect(
    mapStateToProps,
    mapActionToProps
)(SearchResults);

Note: That Dan Abramov story is made-up. Do not believe everything that you read on the internet.

Header: It was divided into 3 parts viz. the logo, the search field and the links container.

import React, {Component} from 'react';
import ReactDOM from 'react-dom';
import {Link} from 'react-router';
import styles from './Header.scss';
import { connect } from 'react-redux';
import { bindActionCreators } from 'redux';
import * as Search from '../../redux/modules/Search';
import Button from '../Button/Button';
import letzLogo from '../../resources/images/Letz-Logo.png';
import SearchInput from '../../connect-views/SearchInput/SearchInput';
import SearchResults from '../../connect-views/SearchResults/SearchResults';
import * as Utils from '../../helpers/Utils';

class Header extends Component {
  constructor(props, context) {
    super(props, context);
    this.state = {fixed: false};
  }

  componentDidMount() {
    window.addEventListener('scroll', this.handleScroll);
  }

  componentWillUnmount() {
    window.removeEventListener('scroll', this.handleScroll);
  }

  handleScroll = () => {
    let willHeaderFix = Utils.WillElementFixOnScroll(this.refs.header);
    if(this.state.fixed !== willHeaderFix){
      this.setState({fixed: willHeaderFix});
    }
  };

  render() {
    const {search, Search, isDiscoverBtnVisible, isHeaderFixed} = this.props;
    let headerFixedClass = (this.state.fixed && isHeaderFixed) ? 'header-fixed' : '';

    return (
      <div>
        <header>
          <div className={"header-container " + headerFixedClass} ref="header">
            <div className="header-wrapper">
              <div className="letz-logo">
                <a href="/">
                  <span className="lc-logo"></span>
                </a>
              </div>
              <div style={{display: search.show_mobile_search ? "" : 'none'}}>
                <div className="mobile-top-search top-search">
                  <SearchInput/>
                </div>
                {(search.input_value.length>2) ? null : <div className="overlay-mobnav overlay-mobile-transparent" onClick={Search.clearSearch}></div>}
              </div>
              <div className="top-search">
                <SearchInput/>
              </div>
              <div className="account-top">
                {isDiscoverBtnVisible ? <Link to="/discover">
                  <Button buttonClass="btn3 donate-btn" iconClass="fa fa-heart"
                          name="DISCOVER & DONATE"/>
                </Link> : null}
                <div className="user-account"><a href={'/dashboard'} target="_self" className="user-color"><i className="fa fa-user" aria-hidden="true"></i></a></div>
              </div>
              <div className="mobile-nav">
                <button onClick={Search.showMobileSearch}>
                  <i className="fa fa-search" aria-hidden="true"></i>
                </button>
                <a href={'/dashboard'} target="_self" className="user-color">
                  <button>
                    <i className="fa fa-user" aria-hidden="true"></i>
                  </button>
                </a>
              </div>
            </div>
          </div>
        </header>
        <SearchResults show={search.input_value.length>2}/>
      </div>
    );
  }
}

Header.defaultProps = {
  isDiscoverBtnVisible: true,
  isHeaderFixed: true
};

Header.propTypes = {
  isDiscoverBtnVisible: React.PropTypes.bool,
  isHeaderFixed: React.PropTypes.bool
};

const mapStateToProps = (state) => ({
  search: state.search
});

const mapActionToProps = (dispatch) => ({
  Search: bindActionCreators(Search, dispatch)
});

export default connect(
  mapStateToProps,
  mapActionToProps
)(Header);

These were sufficient to get the search bar up and rolling.

Parting notes

I wanted to share the code to help new developers who want to try their hand at building a basic Search component as well as learn about the evolution of code patterns. Certain naming conventions and methods do not hold true today. However, feel free to modernise and improve this implementation as a side project.

I also wanted to take this opportunity to inform people that there is a website which transfers 100% of your donations to the NGOs.

To people who do not code: “Many many lorem ipsum of the day. May your life be full of Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque consequat eleifend justo vitae facilisis. Praesent ut felis in velit feugiat accumsan”.

Thank you for your time.

…

Adios amigos!

Posted by Chandransh Srivastava

http://engineering.hackerearth.com/2020/07/17/how-i-built-my-first-search-component

Hassle-free automated assessments

Feb 2, 2020

Our older test creation flow had several inherent problems associated with it:

1. Profile driven
Previously, test creation was based on job profiles (role-based), which was restrictive because the skills were tightly coupled to the roles and could not be customized by recruiters. For example, if I selected a profile such as front-end engineer, the generated questions would only be from HTML, CSS and JavaScript. So, if a recruiter wanted to generate questions for a front-end engineer profile to test React skills, they had to add them manually from our questions library or theirs.

2. Complex UI
The older interface was filled with unwanted form elements and had almost no provision to support newer question types. We had failed to provide a scalable and intuitive step-by-step test creation flow that catered to the needs of our non-technical recruiters.

3. Rigid algorithm
The algorithm we used for selecting the test questions was relatively rigid, so there was limited scope for picking questions from a mix of skills and varied difficulties, whether from HackerEarth’s questions library, from the company’s library or from both.

4. Restricted question types
Though there was a demand from our customers to include additional question types such as SQL, front-end project and Java project questions to the test creation flow, we could not accommodate them as the interface was already bloated.

Design

Based on customer feedback and research data, it was important for us to improve the old flow in accordance with the correct persona of our users. More focus was put on enterprise talent acquisition teams. We found that, in general, such teams are bound by the following:

Want - To hire good people with minimum effort and time
Capabilities - Familiar with the technical terms and the skills needed for a role
Limitations - Have low technical knowledge

The new design was implemented by our star designer, Neha Singh, keeping all the requirements in mind. The idea was to keep the interface minimal and distraction-free, so as to reduce the cognitive load for the recruiter.

Benefits of the new design

Skill-driven approach to test candidates better when auto-generating a test in HackerEarth Assessments
Scalable system
Better user controls and flexibility
User-friendly terminology and actions
Focuses on the persona of a non-tech test administrator

Features of the new flow

The new skill-based test creation flow involves a stepwise process where bare minimum inputs from a user are taken to auto-generate a test. Primarily, the flow is divided into two screens:

Screen #1

Skills dropdown - Since the definitions of a profile required by different companies and geographies do not overlap, skill is the primary domain used for tests by the recruiters.
- In the new flow, users can search for the set of skills they desire. The list will be in alphabetical order and will auto-suggest on typing
- Multiple skills can be selected
- The recruiter should select at least one skill to auto-generate a test
Experience - Years of experience is one of the prime components in a job description. We are starting out with the following options and based upon the customers’ feedback, we may alter or add more in the next version.
- 0-3 years
- 4-6 years
- More than 6 years

The default selected option would be 0-3 years of experience.

The above mentioned inputs are sufficient to auto-generate a test.
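The rules for this screen (at least one skill, experience defaulting to 0-3 years) can be sketched as a small helper. `buildGenerationRequest` and the shape of its return value are hypothetical and only illustrate the checks described above.

```javascript
// Experience options listed above; the first one is the default.
const EXPERIENCE_OPTIONS = ['0-3 years', '4-6 years', 'More than 6 years'];

// Hypothetical helper: normalises the Screen #1 inputs before a test is generated.
function buildGenerationRequest({ skills = [], experience = EXPERIENCE_OPTIONS[0] } = {}) {
  if (skills.length === 0) {
    return { ok: false, error: 'Select at least one skill to auto-generate a test' };
  }
  if (!EXPERIENCE_OPTIONS.includes(experience)) {
    return { ok: false, error: `Unknown experience option: ${experience}` };
  }
  // Drop duplicate skills while keeping the order in which they were selected.
  return { ok: true, request: { skills: [...new Set(skills)], experience } };
}

console.log(buildGenerationRequest({ skills: ['Java', 'SQL'] }));
console.log(buildGenerationRequest({ skills: [] }));
```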

Screen #2

After selecting the skills and experience, the user is shown the summary of the questions selected using our algorithm. They can simply click on the “Create Test” button to get the task done. The following information would be shown, in case they want to edit any information:

Test name
- Default name is the first selected skill followed by the text “Test”, i.e. <First skill name> Test. This was based on the assumption that the recruiter would usually choose the most important skill first. This implementation removes the need to type a heading every time
- The name would be inline editable
- The interface will not allow an empty test name
Experience
- This cannot be changed in this step
- Assumption is that the need for it would be only in case of a wrong selection in the first step. Therefore, the user has to start afresh if they want to change the value.
Duration
In the general scenario, with our unique algorithm, the default test duration would be 90 minutes. This would change:
- When a question set is deleted
- When a question set is edited
- When a question set is added
Test summary table
- Number of skills selected and the question type count in a question set
- The user can edit or delete an already added question set
- The user can add a new question set. For adding a set, the user is asked to select the following information in a sequence in a separate screen:
  - Skill such as Java, Python or Basic Programming
  - Question type such as MCQ or Programming
  - Difficulty levels viz. easy, medium and hard
  - Count of questions for each difficulty level

What we built

The new skill-based test creation was a complex implementation. More than 5,000 lines of code were added to get the feature up and running without bugs. I will try to be precise in explaining the front-end engineering behind it.

The app architecture

The app was built using React v15+ (now using v16.8.5) and Redux. The modules were bundled using Webpack. We maintain a default config for all our Webpack apps. So, to create the config for new test creation app, the following changes were added:

const testCreationConfigGenerator = new WebpackConfigGenerator({
    name: "test-creation",
    entryPath: "./src/test-creation/index.js",
    extractBundles: [
        {
            name: 'vendors',
            criteria: isExternal,
        },
    ]
});

WebpackConfigGenerator is our in-house implementation. It is a class consisting of various methods to handle the default configuration such as entry, output and module loaders. The isExternal criterion checks whether a module comes from the node_modules directory or any other external library. So, with the above config, we get two JS files, viz. test-creation.js and vendors.js, and their dependent CSS files.
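Since WebpackConfigGenerator is in-house, its internals are not shown here. As a rough idea of what a criterion like isExternal might do, the sketch below assumes it receives a module object that exposes its directory as a `context` string; that shape is an assumption for this example and the real implementation may differ.

```javascript
// Hypothetical version of the isExternal criterion. It assumes the bundler
// passes a module object exposing its directory as `context`.
function isExternal(mod) {
  const context = mod && mod.context;
  if (typeof context !== 'string') return false;
  // Normalise Windows separators so the check works on any platform.
  return context.replace(/\\/g, '/').split('/').includes('node_modules');
}

console.log(isExternal({ context: '/app/node_modules/react/lib' }));
console.log(isExternal({ context: '/app/src/test-creation' }));
```

Matching on whole path segments avoids false positives for app folders whose names merely contain the text node_modules.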

Building the interface

Based on the new interface, the following components were required:

Disabled create test link
We planned to restrict the new skill-based test creation to a few customers at first. Moreover, in case of an API or server failure, a recruiter should not have access to the test creation modal but should still be able to create a blank test. Hence, we separated out this component.

Modal
We already had a full width modal component and planned to use that.

Custom select dropdown with tags
This was built over the React Select package.

render() {
    return (
        <div className='select-dropdown-container'>
            <div className="form-field">
                <Select
                    name={this.props.name}
                    value={this.state.value}
                    options={this.props.options}
                    onChange={this.handleSelectChange}
                    onFocus={this.onFocus}
                    multi={!!this.props.multi}
                    required={this.props.required || false}
                    disabled={this.props.disabled}
                    searchable={this.props.searchable}
                    filterOptions={this.props.filterOptions}
                    noResultsText={this.props.noResultsText || NO_RESULTS_FOUND}
                    openOnFocus={this.props.openOnFocus !== false}
                    tabSelectsValue={this.props.tabSelectsValue || false}
                    clearable={this.props.clearable || false}
                />
                <label className={this.props.labelClass} htmlFor={this.props.name}>
                    {this.props.label}
                </label>
            </div>
        </div>
    );
}

The main task was to create a separate component for the options dropdown. The Select component provided by the package expects an optionComponent prop, and the CustomSelectOption component was used for that. It was responsible for handling mouse events like mousemove, mouseenter and mousedown. The options object, containing the selected status of every option, was provided as props by the parent component SkillsInputContainer.

SkillsInputContainer was responsible for handling skills and experience selections, providing a link to create a blank test and showing statuses such as “You can add more than one skill to your test” and “Maximum number of skills selected. Please remove existing skills to add more”.
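How the status text could be chosen is sketched below. `MAX_SKILLS` is a made-up limit, and the rule for when the first message appears (as soon as one skill is selected) is an assumption; neither is taken from the actual component.

```javascript
// MAX_SKILLS is a made-up limit for illustration; the real value is not stated here.
const MAX_SKILLS = 5;

// Returns the status text shown under the skills dropdown (assumed rules).
function getSkillsStatus(selectedSkills) {
  if (selectedSkills.length >= MAX_SKILLS) {
    return 'Maximum number of skills selected. Please remove existing skills to add more';
  }
  if (selectedSkills.length >= 1) {
    return 'You can add more than one skill to your test';
  }
  return '';
}

console.log(getSkillsStatus(['Java']));
```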

We use Stylus for writing CSS. We already had custom design for input fields, but the design for this particular use case was a bit different. As we were using common code from the React Select package, we had to overwrite most of the default CSS code to make the component as per the design.

.form-field
    .Select
        &.Select--single
            &.has-value 
                > .Select-control 
                    .Select-value 
                        .Select-value-label
                            width: auto;
        &.is-focused, &.is-open
            ~label
                font-size: 10px
                top: -12px
                color: $brand-dark-gray
        &.Select--single
        &.Select--multi
            .Select-control
                padding: 0
                box-shadow: none
                .Select-placeholder
                    opacity: 0
                .Select-input
                    margin-left: 0
            &.is-open
                .Select-control
                    border-bottom: 1px solid rgba(0,153,255,0.8)
            .Select-value
                border: none
                color: $brand-dark-gray
                padding-left: 0 /* Overwrites default */
                top: 4px /* Overwrites default */
            .Select-value-label
                float: left
            .Select-value-icon
                border: none
                &:hover
                    color: white
                    background-color: $brand-blue
                    font-weight: 400
            .Select-clear-zone
                display: none
        .Select-menu-outer
            border: none
            box-shadow: 0 2px 10px 0 rgba(0, 0, 0, 0.1)
            border-radius: 0
            .Select-menu
                .Select-option
                    font-size: 16px
                    box-sizing: border-box
                    border: 0
                    color: $brand-dark-gray
                    cursor: pointer
                    display: block
                    padding: 5px 10px
                    max-height: initial
                    &.is-focused
                        background-color: $hover-blue
                        font-weight: 400
                        color: $brand-dark-gray
        .Select--single
            .Select-control
                .Select-value
                    font-weight: 400
                    background-color: #fff
                    padding: 0
                    .Select-value-label
                        font-size: 14px
                        color: $brand-dark-gray

        .Select-placeholder
            padding: 0

        // Handle flipping of arrow (caret)
        &.is-open
            > .Select-control
                .Select-arrow
                    top: 4px // Half the height of caret
                    border-color: $brand-gray
                    border-width: 1px 0 0 1px

Editable heading
By default, the test name is created using the first skill selected by the user. If the recruiter wants to edit the test name, they can click the heading and edit it. This component also handles the case where the recruiter clears the test name. In such a case, the default name is populated again as soon as the input loses focus.
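The naming rule can be sketched in a few lines. `defaultTestName` and `resolveTestName` are hypothetical helper names; the component itself is not shown in this post.

```javascript
// Hypothetical helpers for the editable heading.
function defaultTestName(skills) {
  // "<First skill name> Test"
  return `${skills[0]} Test`;
}

// Called when the heading input loses focus.
function resolveTestName(typedName, skills) {
  const trimmed = typedName.trim();
  // An empty (or whitespace-only) name falls back to the default.
  return trimmed.length > 0 ? trimmed : defaultTestName(skills);
}

console.log(resolveTestName('', ['Java', 'SQL']));
console.log(resolveTestName('Backend hiring', ['Java', 'SQL']));
```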

Alert
Notifies which question set was updated along with the test duration update.

Test summary container
We used our own UI framework, Nuskha, for building out tables. We had to follow a particular HTML structure to build the required table.

Basic structure to generate a table with 2 rows:

<table class="he-table he-table-hover">
    <thead>
        <tr>
            <th><span>Skills&nbsp;(1)</span></th>
            <th><span>Question type&nbsp;(2)</span></th>
            <th><span>Difficulty level</span></th>
            <th class="align-right"><span>Question count&nbsp;(19)</span></th>
            <th class="align-right"><span>Total score&nbsp;(128)</span></th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td class="weight-600">Algorithms</td>
            <td>MCQ</td>
            <td>Easy, Medium, Hard</td>
            <td class="align-right">17</td>
            <td class="align-right">58
                <div class="action-icons-container hidden">
                    <div class="vertical-align-middle inline-block edit-set"><i class="icon ui-pencil" style="font-size: 12px;"></i></div>
                    <div class="vertical-align-middle inline-block delete-set"><i class="icon ui-trash" style="font-size: 12px;"></i></div>
                </div>
            </td>
        </tr>
        <tr>
            <td class="weight-600">Algorithms</td>
            <td>Programming</td>
            <td>Easy, Medium</td>
            <td class="align-right">2</td>
            <td class="align-right">70
                <div class="action-icons-container hidden">
                    <div class="vertical-align-middle inline-block edit-set"><i class="icon ui-pencil" style="font-size: 12px;"></i></div>
                    <div class="vertical-align-middle inline-block delete-set"><i class="icon ui-trash" style="font-size: 12px;"></i></div>
                </div>
            </td>
        </tr>
    </tbody>
</table>

In the above code snippet, action-icons-container contains two icon blocks, viz. edit and delete. The delete icon loads a popup component for confirmation, while the edit icon takes the recruiter to edit the existing question set.
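The counts in the table header are derived from the rows. A minimal sketch of that aggregation, assuming each question set is a plain object with `skill`, `type`, `count` and `score` fields (the field names are made up for this example):

```javascript
// Question sets matching the two rows of the table above.
const questionSets = [
  { skill: 'Algorithms', type: 'MCQ', count: 17, score: 58 },
  { skill: 'Algorithms', type: 'Programming', count: 2, score: 70 },
];

function summarise(sets) {
  return {
    // Distinct skills and question types across all sets.
    skills: new Set(sets.map((s) => s.skill)).size,
    questionTypes: new Set(sets.map((s) => s.type)).size,
    // Plain sums for the right-aligned columns.
    questionCount: sets.reduce((sum, s) => sum + s.count, 0),
    totalScore: sets.reduce((sum, s) => sum + s.score, 0),
  };
}

console.log(summarise(questionSets));
// → { skills: 1, questionTypes: 2, questionCount: 19, totalScore: 128 }
```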

Question set table
For recruiters, we kept the options to edit an existing question set and to add a new one. While editing or creating a question set, they can update the skills, the question type (such as MCQ or Programming) and the question counts for each difficulty level. The interface takes care of updating the duration.

If a recruiter is mistakenly trying to override an existing set, we prompt them for confirmation to save them the hassle.

The secret sauce

The algorithm that we use to populate initial set of questions is a complex one. We use parameters such as experience levels, skills selected by the recruiter, difficulty levels, and types of skills selected to display a summary table with appropriate question sets.

Our internal algorithm first tries to create a test of 90 min duration; if there are not enough questions, it falls back to 75, 60 and 45 minutes, in that order. These are the durations we started with, along with the following restrictions or limitations:

It will never auto-generate a test with duration less than 45 min
A test’s actual configuration can be displayed +5 or -10 min
If too many skills are added the duration can go beyond 90 min
The duration would alter if the initial load of skill set is changed
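Only the fallback ladder is sketched below, not the real algorithm. `availableMinutes` is a hypothetical stand-in for how much test time the matching questions can fill; the actual logic weighs many more inputs, and the sketch ignores the case where many skills push the duration beyond 90 min.

```javascript
const DURATIONS = [90, 75, 60, 45]; // minutes, tried in this order

// Picks the longest duration the available questions can fill (assumed rule).
function pickDuration(availableMinutes) {
  const match = DURATIONS.find((duration) => availableMinutes >= duration);
  // null means no test is auto-generated (never below 45 min).
  return match === undefined ? null : match;
}

console.log(pickDuration(100), pickDuration(80), pickDuration(30));
```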

Of course, the above logic is just the tip of the iceberg of the logic that we use to auto-generate a skill-based test.

Aspirations

With this feature in place, we aspire to add a few differentiating aspects to it as well.

Give the option to save a custom profile, based upon the skills selected
Auto-select the languages in which the candidate can code while generating test questions
Warn if sufficient questions are not added or if too many questions are present
Suggest skills based on the current skill selected in the dropdown while configuring a test
Allow the user to request for the skill from the interface in case the skill is absent in the platform
Ability to create random sets
Ability to use questions from the recruiter’s library
Ability to change the years of experience in the test summary page

…

I will also take this opportunity to thank one of the great engineers that I have worked with, Jagannadh Vangala aka Jaggu for being the perfect ally in diligently completing this project together.

Adios amigos!

Posted by Chandransh Srivastava

http://engineering.hackerearth.com/2020/02/02/hassle-free-automated-assessments

Profiling Django views with Scooby profiler

Sep 20, 2018

Earlier in 2016, I came up with the idea of creating a Python module that could profile functions with respect to SQL queries and tell exactly which lines of a function the queries were happening on. I called that package Goofy and built it using Python AST manipulation. You can read the post on it here later. It helped us profile views serving AJAX requests, but it had some limitations too. For example, we couldn’t see the whole call stack of the queries and couldn’t analyze the queries.

In an internal hackathon at HackerEarth in Nov 2017, I revisited this problem and tried to come up with a profiler that would show the stats on the front-end, be much more lightweight, and work with AJAX requests. I named this package Scooby.

Idea

We had been using the django-debug-toolbar package, but it wasn’t enough because it couldn’t profile AJAX requests. And because of the size of our codebase, it made serving web pages slower in development mode: it injects the stats into the HTML by rendering them, and rendering takes time. So we needed an alternative.

The idea behind Scooby was to build a package similar to django-debug-toolbar that, instead of rendering the stats into HTML, dumps the stats data to a backend store (e.g. Redis) where it resides temporarily, and shows the stats on the front-end using a chrome extension.

Implementation

We had to create a Python/Django package for the backend and an npm package for the front-end which would build the chrome-extension. We decided to use ReactJS as the rendering framework for the chrome-extension.

On the backend, we just had to add a new middleware that collects profiled stats for the different plugins (e.g. SQL, Memcache etc.) and stores them in Redis with a UUID as the key. We put that key in a custom header (X-Scooby) of the HTTP response, so that the chrome-extension can later collect the stats for that request-response using the key.

The chrome-extension hooks into network calls in the browser and, whenever a response arrives, checks whether it has the custom header containing the UUID. If it does, the extension collects the data from the url /scooby/get-data/<uuid>/ on the same domain.
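The read side of that flow, i.e. what the /scooby/get-data/<uuid>/ endpoint has to do, is roughly a key lookup. The sketch below is not the package's actual view: `get_scooby_data` and `InMemoryStore` are hypothetical names, and the in-memory store only stands in for Redis so that the snippet runs on its own.

```python
import json
import uuid


class InMemoryStore(object):
    """Stand-in for Redis in this sketch (no expiry handling)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def get_scooby_data(store, key):
    """Return (status_code, payload) for the stats stored under `key`."""
    raw = store.get(key)
    if raw is None:
        # The key was never written, or it has already expired.
        return 404, {'error': 'unknown or expired key'}
    if isinstance(raw, bytes):
        # A real Redis client may hand back bytes rather than str.
        raw = raw.decode('utf-8')
    return 200, json.loads(raw)


# Usage: the middleware writes under a fresh key, the endpoint reads it back.
store = InMemoryStore()
key = uuid.uuid4().hex
store.set(key, json.dumps({'plugins_data': {}}))
status, payload = get_scooby_data(store, key)
```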

This is the code of middleware (You can skip directly to Further improvements section if you are not interested in code):

import uuid

from .base import ScoobyData
from .utils import get_redis


class ScoobyMiddleware(object):
    def process_request(self, request):
        request.scooby_data = ScoobyData()
        request.scooby_data.on_process_request(request)

    def process_view(self, request, view, view_args, view_kwargs):
        request.scooby_data.on_process_view(
            request, view, view_args, view_kwargs)

    def process_response(self, request, response):
        request.scooby_data.on_process_response(request, response)
        unique_hex = uuid.uuid4().hex
        response['X-Scooby'] = unique_hex
        # Set data in redis for 10 minutes.
        redis = get_redis()
        redis.set(unique_hex, request.scooby_data.as_json(), 600)
        return response

And this is how the ScoobyData class is defined. For each stage of the middleware (process_request, process_view and process_response), it defers to all available plugins so that they can collect their respective data on their own.

import json

from .plugin_finder import get_plugins

class ScoobyData(object):
    def __init__(self):
        self.plugins = get_plugins() # Instances of different plugin classes.
        self.plugins_data = {}
        for plugin in self.plugins:
            self.plugins_data[plugin.name] = plugin.Data()

    def on_process_request(self, request):
        for plugin in self.plugins:
            plugin_data = self.plugins_data[plugin.name]
            plugin.on_process_request(plugin_data, request)

    def on_process_view(self, request, view, view_args, view_kwargs):
        for plugin in self.plugins:
            plugin_data = self.plugins_data[plugin.name]
            plugin.on_process_view(plugin_data, request,
                                   view, view_args, view_kwargs)

    def on_process_response(self, request, response):
        for plugin in self.plugins:
            plugin_data = self.plugins_data[plugin.name]
            plugin.on_process_response(plugin_data, request, response)

    def as_json(self):
        plugins_data_json = {}
        for plugin_name in self.plugins_data:
            plugin_data = self.plugins_data[plugin_name]
            plugins_data_json[plugin_name] = plugin_data.as_json_dict()
        return json.dumps({
            'plugins_data': plugins_data_json
        })

Here is the code of a very simple plugin that records which view was hit and which args and kwargs were passed to it:

class ViewNamePluginData(object):
    def __init__(self):
        self.view_name = None
        self.args = ()
        self.kwargs = {}

    def as_json_dict(self):
        return {
            'view_name': self.view_name,
            'args': self.args,
            'kwargs': self.kwargs,
        }

class ViewNamePlugin(object):
    Data = ViewNamePluginData

    def __init__(self):
        self.name = 'ViewName'

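    # ScoobyData passes the plugin's data object as the first argument
    # to every hook, so each hook must accept plugin_data.
    def on_process_request(self, plugin_data, request):
        # Nothing to record at this stage.
        pass

    def on_process_view(self, plugin_data, request,
                        view, view_args, view_kwargs):
        plugin_data.view_name = '%s.%s' % (view.__module__, view.__name__)
        plugin_data.args = view_args
        plugin_data.kwargs = view_kwargs

    def on_process_response(self, plugin_data, request, response):
        # Nothing to record at this stage.
        pass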

This is the generic way of adding new plugins, and of collecting and serving their data. We added the plugin for SQL queries following the same pattern.

Further improvements

After this was built in the hackathon, there have been many additions to this package.

We added plugins for queries happening in Memcache and Thriftpy along with SQL.
Added a plugin for raw Python cProfile, with which you can cProfile your views just by enabling it in the chrome-extension. You don’t need to put temporary code for cProfiling views anymore.
Added the option to enable profiling from the front-end instead of always collecting/dumping data on the backend. This way the profiler adds no overhead on the backend when you are not using the chrome-extension.
Added a logger to the scooby package which works as an alternative to print/logger.debug for debugging purposes. With this logger you don’t have to look at the console; you look at the chrome extension in the browser for those logs instead. Here’s how you can do that:

import scooby
scooby.log("foo", "bar")

Screenshot

Here is the screenshot of how SQL queries look in chrome-extension.

In the right panel under the SQL tab, you can see stack traces of all the different queries. There is an option to group similar queries together. Say you are making the same type of SQL query in a for loop, with just a few parameters changing between queries. By grouping them together you can see how many there are and how much total time they took. This makes N+1 query problems easy to spot.
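One simple way to group such queries is to replace the literal values with placeholders and count the resulting shapes. The snippet below is a minimal sketch of that idea, not how the extension actually does it; `normalize_sql` and `group_queries` are hypothetical names, and the regexes only cover quoted strings and plain numbers.

```python
import re
from collections import defaultdict

# Single-quoted SQL strings (with '' as an escaped quote) and bare numbers.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def normalize_sql(sql):
    """Replace literal values with '?' so similar queries compare equal."""
    sql = _STRING_LITERAL.sub("?", sql)
    sql = _NUMBER_LITERAL.sub("?", sql)
    # Collapse whitespace so formatting differences do not matter.
    return " ".join(sql.split())


def group_queries(queries):
    """Group (sql, duration_ms) pairs by their normalized shape."""
    groups = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for sql, duration_ms in queries:
        group = groups[normalize_sql(sql)]
        group["count"] += 1
        group["total_ms"] += duration_ms
    return dict(groups)
```

A group with a high count and a shape like `SELECT ... WHERE author_id = ?` is the typical signature of an N+1 query issued from a loop.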

Open sourcing it

Both backend and front-end packages of Scooby profiler are open-sourced.

If you are using Django as the backend framework, give this profiler a try. We would be glad to hear the feedback.

The Python package is available on PyPI as django-scooby-profiler. This is the GitHub link: https://github.com/shhaumb/django-scooby-profiler. That page contains the documentation on how to integrate it with your project.
The chrome extension is available in the Chrome Web Store: https://chrome.google.com/webstore/detail/scooby-profiler/kicgfdanpohconjegfkojbpceodecjad.

Posted by Shubham Jain. You can follow me on Github and Twitter.

http://engineering.hackerearth.com/2018/09/20/profiling-django-views-with-scooby

What you see is what you get!

Jul 19, 2018

Introduction

HackerEarth has grown into a platform that serves a huge number of customers for technical assessment. To make this possible, we try our best to make the platform as easy-to-use as it can get.

At several places in our Recruiter Dashboard, we used to have a Markdown editor to allow users to edit free text. Many of our recruiters struggled to create content using the Markdown editor. They need not worry anymore. After many requests to improve this, we came up with a fix. Say hello to CKEditor (version 4), the well-known WYSIWYG rich text editor.

Why CKEditor?

In the battle of the titans (of WYSIWYG editing) between CKEditor and TinyMCE, we decided to go with CKEditor because of the following reasons:

It has a huge community of active developers. The strength of the community around an open source project is strongly related to the project’s success.
As compared to TinyMCE, it provides better support for the following:
- Multiple languages
- Source editing
- Tables
- Image and media handling etc.
It was designed with modularity in mind which allows you to go much deeper if you’re a developer.
It is doing much better as compared to TinyMCE. One of the easy tricks while surveying software is to compare how alternatives are doing on Google and Stack Overflow trends.

Google search comparison (past 5 years)

Number of Stack Overflow questions asked

Integration

The integration of WYSIWYG editor across HackerEarth’s Recruit platform is broadly divided into three steps:

Adding the Django CKEditor package

As the Recruiter dashboard is written entirely in Django, we decided to integrate CKEditor using the django-ckeditor package. CKEditor provides a huge list of out-of-the-box functionalities. Thinking from the perspective of recruiters and problem setters, we decided to opt for a few of them only. The Django CKEditor package reads the configuration from the settings.py file.

Here is the snapshot of what the CKEditor configuration in the code looks like:

# CKEditor UI and plugins configuration
CKEDITOR_CONFIGS = {
    'default': {
        # Toolbar configuration
        # name - Toolbar name
        # items - The buttons enabled in the toolbar
        'toolbar_DefaultToolbarConfig': [
            {
                'name': 'basicstyles',
                'items': ['Bold', 'Italic', 'Underline', 'Strike', 'Subscript',
                          'Superscript', ],
            },
            {
                'name': 'clipboard',
                'items': ['Undo', 'Redo', ],
            },
            {
                'name': 'paragraph',
                'items': ['NumberedList', 'BulletedList', 'Outdent', 'Indent',
                          'HorizontalRule', 'JustifyLeft', 'JustifyCenter',
                          'JustifyRight', 'JustifyBlock', ],
            },
            {
                'name': 'format',
                'items': ['Format', ],
            },
            {
                'name': 'extra',
                'items': ['Link', 'Unlink', 'Blockquote', 'Image', 'Table',
                          'CodeSnippet', 'Mathjax', 'Embed', ],
            },
            {
                'name': 'source',
                'items': ['Maximize', 'Source', ],
            },
        ],

        # This hides the default title provided by CKEditor
        'title': False,

        # Use this toolbar
        'toolbar': 'DefaultToolbarConfig',

        # Which tags to allow in format tab
        'format_tags': 'p;h1;h2',

        # Remove these dialog tabs (semicolon separated dialog:tab)
        'removeDialogTabs': ';'.join([
            'image:advanced',
            'image:Link',
            'link:upload',
            'table:advanced',
            'tableProperties:advanced',
        ]),
        'linkShowTargetTab': False,
        'linkShowAdvancedTab': False,

        # CKEditor height and width settings
        'height': '250px',
        'width': 'auto',
        'forcePasteAsPlainText': True,

        # Class used inside span to render mathematical formulae using latex
        'mathJaxClass': 'mathjax-latex',

        # Mathjax library link to be used to render mathematical formulae
        'mathJaxLib': 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_SVG',

        # Tab = 4 spaces inside the editor
        'tabSpaces': 4,

        # Extra plugins to be used in the editor
        'extraPlugins': ','.join([
            # 'devtools',  # Shows a tooltip in dialog boxes for developers
            'mathjax',  # Used to render mathematical formulae
            'codesnippet',  # Used to add code snippets
            'image2',  # Loads new and better image dialog
            'embed',  # Used for embedding media (YouTube/Slideshare etc)
            'tableresize',  # Used to allow resizing of columns in tables
        ]),
    }
}

Rendering the editor in the front-end is super easy. Out of the two widgets provided by the Django CKEditor package (CKEditorWidget and CKEditorUploadingWidget), we decided to go with CKEditorUploadingWidget because we wanted to include the support for uploading files.

Suppose your models.py file contains a model named MyModel with a CharField named my_field. To attach CKEditor to this field, create a form as shown below and you are good to go.

from ckeditor_uploader.widgets import CKEditorUploadingWidget
from django import forms

from .models import MyModel

class MyForm(forms.ModelForm):
    class Meta:
        model = MyModel
        fields = ('my_field',)
        widgets = {
            # The attrs end up as HTML attributes on the rendered textarea.
            'my_field': CKEditorUploadingWidget(attrs={
                'class': 'my-ckeditor-class',
                'id': 'my-ckeditor-id',
            })
        }

By default, CKEditorUploadingWidget fetches the configuration from CKEDITOR_CONFIGS['default'] which is defined in settings.py. If you want to use different configurations, say for rendering multiple editors, you can define my_config in the settings.py file instead of default and pass it in the widget as follows:

widget = CKEditorUploadingWidget(config_name='my_config')
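For reference, a second configuration is just another key next to 'default' in CKEDITOR_CONFIGS. The values below are illustrative only ('my_config' and its settings are assumptions, not our production configuration); 'Basic' is one of the toolbars that django-ckeditor defines out of the box.

```python
# settings.py (sketch): two named editor configurations.
CKEDITOR_CONFIGS = {
    'default': {
        'toolbar': 'DefaultToolbarConfig',
        'height': '250px',
    },
    # Hypothetical second configuration, e.g. for a compact editor.
    'my_config': {
        'toolbar': 'Basic',
        'height': '120px',
    },
}
```

A widget created with config_name='my_config' then picks up the second block, while widgets without a config_name keep using 'default'.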

Modifying the code of Django CKEditor to suit our needs

Supporting custom language for internationalization

We defined a utility function get_ckeditor_language to provide the language in which we want to render the CKEditor.

  from django.conf import settings

  # CKEditor localization mapping
  CKEDITOR_LOCALE_MAP = {
      'en-us': 'en',
      'ja': 'ja',
      'zh': 'zh-cn',
      'fr': 'fr',
      'es': 'es',
      'pt-br': 'pt-br',
      'id': 'id',
  }

  def get_ckeditor_language():
      """ Returns the UI language localization to be used with CKEditor """
      default_language_code = settings.LANGUAGE_CODE
      default_plugin_language = CKEDITOR_LOCALE_MAP.get(default_language_code,
                                                        default_language_code)
      return default_plugin_language

In the settings.py file:

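  # The user interface language localization to be used with CKEditor.
  # import_string expects a full dotted path, so prefix the function name
  # with the module that defines get_ckeditor_language.
  CKEDITOR_UI_LANGUAGE_SELECTOR = 'get_ckeditor_language'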

In the django-ckeditor ckeditor/widgets.py file, we modified the _set_config method as follows:

  from django.utils.module_loading import import_string
  def _set_config(self):
      lang = import_string(getattr(settings, 'CKEDITOR_UI_LANGUAGE_SELECTOR', 'django.utils.translation.get_language'))()
      if lang == 'zh-hans':
          lang = 'zh-cn'
      elif lang == 'zh-hant':
          lang = 'zh'
      self.config['language'] = lang

Using custom storage method for image upload

When files are uploaded through the DefaultStorage provided by Django, query-string authentication is enabled by default. We need to store the URL of the image while uploading it, and therefore, query-string authentication cannot be used in this case. To avoid this, we created a custom storage class named PublicMediaRootS3BotoStorage which inherits from the S3BotoStorage class. In the settings.py file:
```
  CKEDITOR_STORAGE_BACKEND = 'custom_storages.PublicMediaRootS3BotoStorage'
```
In the ckeditor_uploader/utils.py file, we added a new method to fetch the new storage:
```
  # Allow for a custom storage backend defined in settings.
  def get_storage_class():
      return import_string(getattr(settings, 'CKEDITOR_STORAGE_BACKEND', 'django.core.files.storage.DefaultStorage'))()
  storage = get_storage_class()
```
We replaced default_storage with storage in all the respective files to make it work seamlessly.

Migrating the existing problem data to make it compatible with CKEditor
- Problem
  
  There are approximately 2.5 lakh problems that contain mathematical symbols spread across multiple tables in our database. All the problems had LaTeX code written between a pair of $$ delimiters. CKEditor supports rendering mathematical symbols through the MathJax plugin, which reads the LaTeX written between \( and \) and enclosed by a span with a specific class. The class to be used has to be defined in the settings.py file as we did above. For example: <span class="mathjax-latex">\(Z_{i} = P*X(Z_{i-1})+Q\)</span>
- Solution
  
  We wrote a script that uses RegEx to fetch all the mathematical symbols enclosed within $$ from a problem and makes them compatible with CKEditor. Running the script on a whopping 2.5 lakh problems took only 15 minutes to complete!
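The core of such a conversion can be sketched with a single regular expression. This is an illustration rather than the script we ran: `convert_math` is a hypothetical name, and it assumes every formula sits between one opening and one closing $$.

```python
import html
import re

# Non-greedy match of everything between a pair of $$ delimiters.
_DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def convert_math(text, css_class="mathjax-latex"):
    """Rewrite $$...$$ blocks into the span format the MathJax plugin reads."""

    def wrap(match):
        # Escape &, < and > so the LaTeX survives being placed inside HTML.
        latex = html.escape(match.group(1).strip(), quote=False)
        return '<span class="%s">\\(%s\\)</span>' % (css_class, latex)

    # A function replacement avoids backslash handling in the template.
    return _DISPLAY_MATH.sub(wrap, text)
```

Running it on `Let $$Z_{i} = P*X(Z_{i-1})+Q$$ hold` yields the span shown in the example above, with the surrounding text left untouched.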

Here is a snapshot of what the editor looks like in the Recruiter dashboard:

Last words…

The integration of CKEditor with HackerEarth’s Recruit platform has brought a plethora of new features that make the job of setting problems easier and interesting. On a platform where hundreds of problems are created and reviewed every day, the amount of work required to deploy the editor into production was worth the effort.

Peace out!

Posted by Himanshu Malhotra

http://engineering.hackerearth.com/2018/07/19/what-you-see-is-what-you-get

Introducing Nuskha

Jul 7, 2018

History

In late 2017, we at HackerEarth created a single-page document with common and special CSS classes for layouts, grids, buttons, inputs, tables, tooltips, and form elements. That was our first attempt to develop our own front-end framework.

Old Nuskha screens:

Framework is a platform, foundation on which ready software solutions are built, in this particular case – web interfaces. For this purpose front-end framework consists of ready components, which are used by a developer when working on a project. What is more, aforementioned components, if necessary, can be modified or adjusted to current needs. - Merix Studio

Our development was inspired by Bootstrap. But we still had miles to go before calling it a framework.

To be a part of something that will impact the whole organization was exciting for the bunch of us. After rejecting Kriya Kalaap, Kalakari, Retro, Tattva, Lipstick, and many more, we named our nascent framework Nuskha.

“The name ‘Nuskha’ is inspired by one of the art deities "Nuska" from the Mesopotamian mythology. The word ‘Nuskha’ is a Hindi word which translates to ‘formula’ in English - a formula to create or build something.”

Old Nuskha helped but was still inefficient. We did not have any React components. We had started developing one of our products in React (version 16+) while the other product was in the transitioning phase. With strict deadlines for other important tasks, we were unable to contribute much to Nuskha and inevitably the implementation of the same components in different projects was duplicated. We needed a better framework to unify the components. We needed it for consistency, ease of use, and faster development. Yes, we also wanted to DRY up our code base.

Fast forward a few months: as the tradition at HackerEarth goes, we had our internal hackathon scheduled. The timing was perfect. I paired up with Akanksha, another Frontend Engineer at HackerEarth, this time to build Nuskha 2.0, which would eventually be known simply as Nuskha.

The pain points we solved

During one of our Product All-Hands meetings, we learned about the ‘easiest’ way to use our own HackerEarth icon fonts that had been created recently. The process was as follows:

A designer gives the Invision file to the developer.
The developer copies the unicode for the icon that they intend to use.
The developer then searches the icon class in the PDF file (which contains the icons list) provided to them.

These steps needed to be repeated for every icon, which was incredibly frustrating and time-consuming!

There were multiple small React applications being made simultaneously in the team. One of the code reviews revealed that we had been duplicating a lot of code for React components and other small functionalities. The maintenance of design consistency always required extra inputs in the code base. This was frustrating too.

A small change needed in the font size or color required thorough inspection from a developer and then a proper regression testing. This consumed a lot of time and effort and slowed down the release cycle.

What we built

The idea was simple. We wanted to create a Single Page Application (SPA) where all the common React components would be linked in the sidebar. Their individual pages would give details about how to use them along with React and HTML codes. Their expected props and associated details would be shown in the table. We had our own icon fonts. We wanted to make these components and icons searchable. The basic interface design was to have a header, sidebar, and body section.

Creating the app architecture

We used create-react-app, an excellent package from Facebook, to get started with our application. But soon, we ran into a problem. While writing the components’ methods, we were accustomed to using different decorators. One of them was autobind from core-decorators. To use decorators, a config change was required in Babel because these decorators are not available natively. Facebook’s create-react-app has a limitation in this case.

Every time I started the webpack server, a new tab would open in the browser. This was irritating because I already had a tab open at the same address. We also wanted some configuration changes to handle .styl files, and finally decided to eject the default configuration.

After ejecting the default config, we updated our packages and the Babel config to get the decorators up and running.

To stop the browser tab from opening automatically, we looked at the open-browser util (react-dev-utils/openBrowser), which is used in start.js inside the scripts directory. In the devServer.listen callback, we commented out the call that opens the browser. This gave us the desired result.

Building the pages

We started building different pages for common components such as buttons and icons. We tried to make development as modular as possible during the Hackathon. For the routes, we even separated the URLs of the pages in a file.

Every page was supposed to show all the use cases of the component as examples. For example, in case of buttons, we showed all the 13 different types of buttons. Another important task was to display all the proptypes. This was displayed in a table, which was another common element.

We were building these pages for developers, and therefore, the most important section was how to use these components. We planned to show the implementation of the component along with the code. Earlier, we thought of providing editable code but later went on to implement multiple read-only editors due to limited time and to avoid confusion.

We implemented two editors. One showed a pure HTML implementation while the other showed the React implementation. To implement the editor, we used the Brace editor. The basic config for the React editor was as follows:

<AceEditor
    value={value}
    mode='html'
    theme='crimson_editor'
    height={height || '250px'}
    readOnly
    showGutter={false}
    highlightActiveLine={false}
    highlightGutterLine={false}
    showPrintMargin={false}
    wrapEnabled
    fontSize={14}
    setOptions={{showLineNumbers: false}}
    editorProps={{$blockScrolling: true}}
/>

This editor had a wrapper which included only one functionality–the Copy button. To implement the copy functionality, we faked a textarea with the desired content. A click on the button transitioned the text value to ‘Copied’ for less than a second if the content was successfully copied to the clipboard. Once all these were done, the page was made accessible from the sidebar.

Icons directory

We wanted to implement a searchable list of icons. We used another common component, card, to list the icons. A card component is a simple rectangular box with shadows. It showed the icon and its details required to use it as an HTML or a React component. We created an icon React component also.

There were 576 icons. To make the listing modular, we created an icons map file. This file contained the CSS content and name of each icon. To make searching easy, we introduced tags. Tags contained a list of synonyms or related words for the icon names. For example, an icon named “safety-locker” had “almirah, cupboard, and cabinet” as its tags. The React component created for an icon expected the name of the icon. It was also configurable via props for color, size, tooltip, and click handler.

For the search bar, we used an Input component that was already available. It was a controlled React component, and its onChange handler updated the state. The logic for updating the list is as follows:

const {iconsMap, searchInput} = this.state;
let filteredIconsMap = null;
function checkIfStringPresent(stringToCheck) {
  return stringToCheck.toLowerCase().indexOf(searchInput.toLowerCase()) > -1;
}

if(searchInput.length > 0) {
  filteredIconsMap = iconsMap.filter(function(icon) {
    return (checkIfStringPresent(icon.name)
      || checkIfStringPresent(icon.tags.toString())
      ||  checkIfStringPresent(icon.character));
  });
} else {
  filteredIconsMap = iconsMap;
}
const filteredIconsList = filteredIconsMap.map(function(iconData, i) {
  return (
    <Card key={i.toString()} klass='icons-container align-center'>
      <HEIcon name={iconData.name} size='30px' />
      <p className='align-left padding-top-10 no-margin'>
        Class: {iconData.name}
      </p>
      <p className='align-left no-margin'>
        Character: {iconData.character}
      </p>
      <p className='align-left no-margin'>
        Tags: {iconData.tags.toString()}
      </p>
    </Card>
  );
});

The icons map had a name, character, and tags. So, whenever an input was given, all the three were searched. A filtered list was created, then that list was mapped (filteredIconsList) to create the list of icons. We were wary of the performance, but it worked smoothly.

End product

Future scope

More components
More utility functions
Open source

We have recently converted our legacy search bar code written in jQuery into a React component following the Redux architecture. We had to rewrite most of the functionalities because of the complex code. However, now the component is ready to be added to our internal products.

On 1st June, I pushed a couple of helper functions to check if all the keys in a JavaScript object have boolean true values and to compare two JavaScript objects for equality. Not only are we making it a better front-end framework for web interfaces but also adding potential generic code that we might need in multiple products day-by-day.

I am happy to know that Nuskha, which started as a hackathon product, is now evolving into something bigger and better.

Adios amigos!

Posted by Chandransh Srivastava

http://engineering.hackerearth.com/2018/07/07/introducing-nuskha

Streaming Android applications via the browser

Apr 3, 2017

HackerEarth prides itself on its scalable and automated evaluation system. What was initially designed with standard programming problems in mind (check this post out) gradually evolved to accommodate a plethora of problem types across various tech domains.

Currently supported problem types:

Programming
Frontend
Objective
Android
Subjective
File based
Multiplayer
Approximate
Golf
Machine Learning
SQL
Regex
File eval
Map Reduce

Note: Not all of the problem types are accessible by end users publicly. Some are reserved for HackerEarth’s Recruit product.

While most of them have their own evaluation stack and are automated in the complete sense of the word, evaluating submissions for some of these problem types requires partial manual intervention. Evaluation of Android submissions for instance, is not automated.

Why though?

An Android submission is basically an APK. Given a requirement, the user has the freedom to design and implement an app in whatever way he/she deems fit. Given the nature of such submissions, it would be ill-conceived to design a one-size-fits-all evaluation system. Hence the need for manual intervention.

Evaluating Android Submissions

Each submission is rated based on the following parameters:

Requirement compliance
Bugs
Performance benchmarks
Look and feel of the application
User experience

For a given hiring challenge, a dashboard containing all the submissions along with the candidate details is provided to the recruiter. The recruiter then follows each of these steps:

Download a candidate’s apk onto his/her local machine
Install the apk onto a connected android device or emulator
Test and interact with the app on a device or emulator
Update score for the candidate in the dashboard

Straight off, you can identify serious cons to this approach.

Android Studio should be set up on your local machine (for a non-technical person, this can be a daunting task in itself)
Manually install & uninstall apps from an emulator/device
Error-prone bookkeeping while updating scores (more pronounced when evaluating 100s of submissions)
No means of running automated tests

This was tedious and not scalable.

The Fix

We set out to solve these problems and defined the following objectives:

A mechanism to provision and allot emulators on the fly
A means to interact with the emulator from the browser
Programmatically run certain defined operations on the emulator (like install apk, start package, unlock screen, etc.)
Integration in the recruiter dashboard for android submissions
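To give a feel for the third objective, the operations listed there map onto ordinary adb commands. The sketch below only builds the command lines and is not our provisioning code: the function names are hypothetical, and the use of the monkey tool to launch an app and of key event 82 (the menu key) to dismiss an emulator lock screen are common conventions that we assume here.

```python
import subprocess


def adb_command(serial, *args):
    """Build an adb command line targeting one emulator or device."""
    return ["adb", "-s", serial] + list(args)


def install_apk(serial, apk_path):
    # -r replaces the app if it is already installed.
    return adb_command(serial, "install", "-r", apk_path)


def start_package(serial, package):
    # Ask the monkey tool to send one launcher event to the package.
    return adb_command(serial, "shell", "monkey", "-p", package,
                       "-c", "android.intent.category.LAUNCHER", "1")


def unlock_screen(serial):
    # Key event 82 is the menu key; assumes no PIN or pattern is set.
    return adb_command(serial, "shell", "input", "keyevent", "82")


def run(command):
    """Execute a built command and return its output (needs adb on PATH)."""
    return subprocess.check_output(command)
```

Keeping command construction separate from execution makes the operations easy to log and to test without a running emulator.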

Note: The technique we apply for streaming applications is not tailor made for the emulator, it applies to any kind of GUI application.

Before I present a system-level overview, let’s run through the core components. We will introduce each component and its functionality via use cases.

Running GUI apps on a Headless Server

Any GUI application requires a graphical system that provides a basic framework, or primitives, for building GUI environments. Basic primitives like:

Drawing and moving windows on the display
Interacting with a mouse, keyboard or touchscreen.

All *nix based systems provide the X Window System for the same. Every GUI application that runs on a Linux machine interacts with what is called an X Server.

As depicted above, the X Server relies on input and output devices like the Monitor, Keyboard and Mouse.

Note: DISPLAY is an environment variable that tells an X client (read: GUI application) which X server to connect to by default.

Headless Server

A Headless Server does not have a screen, keyboard or mouse attached. Most servers are configured to be headless so as to reduce operational cost. Since the server has no associated display, there is no X Server running on such machines.

This is why you get something like this when you try running a GUI app on a headless server.

$ glxgears
Error: couldn't open display (null)

Note: glxgears is just a GUI app that we’ll utilise for the purpose of this demo.

So how does one go about running GUI apps on a headless server? Enter Xvfb.

Xvfb

Xvfb or X virtual framebuffer is a display server that performs all graphical operations in memory without showing any screen output. This virtual server does not require the computer it is running on to have a screen or any input device. Only a network layer is necessary.

Here’s how you’d run glxgears, with Xvfb set up.

$ Xvfb :1 -screen 0 1024x768x24 > /dev/null 2>&1 &
$ DISPLAY=:1 glxgears
11749 frames in 5.0 seconds = 2349.722 FPS
12433 frames in 5.0 seconds = 2486.563 FPS
12668 frames in 5.0 seconds = 2533.599 FPS
...(truncated)

This time, the app was able to run successfully. But you won’t see anything, because this machine is by definition headless (i.e. without a display). In order to actually interact with it, we need to set up a VNC server.

Virtual Network Computing Primer

RFB protocol

RFB (remote framebuffer) is a simple open protocol for remote access to graphical user interfaces. It enables one to transmit GUI frames and input events between a server and a client, over the network.

VNC

VNC (Virtual Network Computing) is a graphical desktop sharing system that uses the Remote Frame Buffer protocol (RFB) to remotely control another computer.

A VNC implementation comprises:

Server: The program on the machine that shares its screen. It passively allows the client to take control of it.
Client (or viewer): The program that watches, controls, and interacts with the server. The client controls the server.
Protocol: The Server and Client talk using the RFB protocol.

Various implementations for VNC Servers and clients exist. Most VNC Servers create their own virtual X Display. But we already had that covered by using Xvfb. We simply needed to export an existing X Server, which is why we opted for the x11vnc server.

Here’s how you’d start an x11vnc server pointing to a DISPLAY at :1.

$ x11vnc -display :1 -quiet -nopw

The VNC desktop is:      deathstar:0
PORT=5900

Note: Make note of the port 5900, which is the port that the server is listening on

Connecting to this VNC session can be done using a VNC viewer (read: client) or via SSH X11 forwarding.

Here we connect to a VNC server running on localhost at port 5900 using a vncviewer.

$ vncviewer localhost:0

which then allows us to interact with the DISPLAY :1, which is what the VNC Server was connected to.
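It is easy to mix up the two display numbers in play here: the X DISPLAY (:1, served by Xvfb) and the VNC display (:0, served by x11vnc). By VNC convention, display N is served on TCP port 5900 + N, which is why `localhost:0` reaches the server listening on port 5900. A minimal sketch of that mapping (the function names are ours, for illustration only):

```python
VNC_BASE_PORT = 5900  # VNC convention: display N listens on TCP port 5900 + N


def vnc_port(display: int) -> int:
    """Return the TCP port that corresponds to a VNC display number."""
    if display < 0:
        raise ValueError("display number must be non-negative")
    return VNC_BASE_PORT + display


def vnc_display(port: int) -> int:
    """Return the VNC display number that corresponds to a TCP port."""
    if port < VNC_BASE_PORT:
        raise ValueError("port is below the VNC base port")
    return port - VNC_BASE_PORT


if __name__ == "__main__":
    # "vncviewer localhost:0" talks to the server on this port
    print(vnc_port(0))
```

Note that this number is unrelated to the X DISPLAY that x11vnc was pointed at with `-display :1`.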

Connecting to a VNC Server from the browser

In order to interact with a VNC Server, one would need to install a VNC viewer. This is a no-go if we wish to eliminate dependencies at the recruiter’s end. Ideally, we want to establish a VNC session from within the browser. Basically we needed a VNC client built for the browser.

NoVNC

NoVNC is a VNC client using HTML5 (Web Sockets, Canvas). The project is available on GitHub. Particularly check out the Integration section.

Connecting to a VNC server from javascript is as simple as:

// Initialise rfb object (refer to the module's documentation in the project)
var rfb = new RFB({'target': document.getElementById('noVNC_canvas')});
// Initialise host and port of the VNC server
var host = 'localhost';
var port = 5900;
// Establish session. On success, this starts drawing on #noVNC_canvas.
rfb.connect(host, port);

With the VNC Server process running, if you try establishing a connection from the noVNC client, it will fail with the following error.

Skipping unsupported WebSocket binary sub-protocol

This happens because the noVNC client communicates using WebSockets, which is different from the raw TCP protocol that the VNC server uses. To bridge this gap, the folks at noVNC built a WebSocket proxy called websockify, which translates low-level TCP traffic into WebSocket traffic and vice versa.

The following command listens for WebSocket traffic on port 6080, translates it and forwards it to port 5900 (which is where the VNC server is listening). Communication is bi-directional.

$ websockify :6080 :5900
WebSocket server settings:
  - Listen on :6080
  - Flash security policy server
  - No SSL/TLS support (no cert file)
  - proxying from :6080 to :5900

Consequently, we update our noVNC client to instead connect to the websocket port.

// port 5900 is no longer relevant to the client
var port = 6080;

The noVNC client should now be able to establish a connection.

Setting up the Android Emulator

The Android Emulator supports several hardware acceleration features to improve performance, sometimes drastically. An Android Developer usually concerns himself/herself with these details only once. One can find a comprehensive guide in the Android Docs.

Configuring the emulator VM for acceleration

The emulator runs inside a Virtual Machine. In order to configure the VM for acceleration, the underlying hardware needs to expose what is called a hypervisor. On a Linux machine this comes in the form of the Kernel-based Virtual Machine (KVM).

In order to get access to the hypervisor, we needed to spin up a bare metal machine (as opposed to running the emulator on a machine provisioned by a cloud provider, which is itself a VM).

You can refer to the docs for how to go about exposing the hypervisor.

Custom AVDs

An Android Virtual Device (AVD) definition lets you define the characteristics of an Android phone, tablet, Android Wear, or Android TV device that you want to simulate in the Android Emulator.

We pretty much stuck to the procedure defined in the docs and defined a bunch of AVDs, one for each emulator that we were planning to run. A few things worth mentioning here are:

An AVD can be used by at most one emulator at a time
Virtual Memory and Heap Size can be defined in the AVD
Internal Storage and SD card size can be defined as well
Set the GPU option to auto to dynamically choose between a hardware/software renderer

Starting the emulator

In order to start an emulator, you need to first define at least one AVD. While there are multiple parameters that can be specified, we will concern ourselves with only 2 of them.

gpu - We use a software renderer called swiftshader. We use this because the host machine does not have a dedicated GPU.
qemu - We configure the emulator to utilise kvm which results in a great performance boost.

# Point to the running Xvfb process
export DISPLAY=:1
# Start emulator
emulator -avd <name-of-predefined-avd> -no-boot-anim -nojni -netfast -gpu swiftshader -qemu -enable-kvm

Note: As we saw in the Running GUI apps on Headless Server section, we need to point the DISPLAY to the running Xvfb process first.

Sizing up the machine

Let’s break down the requirements for running a single emulator:

Memory : 2 GiB of RAM (defined by the AVD)
Compute : 1 CPU dedicated for a VM
Storage : 1 GiB of disk space (Internal Storage + SD Card; defined by AVD)
Rendering : We utilised an OpenGL Software Renderer - Swiftshader, which implied CPU overhead.
Hypervisor : Underlying platform needs to expose the hypervisor for performant emulators (this constraint applies to all emulators)

Note: In software renderers, graphical computations are performed on the CPU itself. One would opt for a software renderer only when no GPU is available, since it introduces quite a bit of computational overhead.

Factoring in these requirements, we came up with the following machine specifications to run 6 emulators in parallel:

Memory : Min. 16 GiB RAM
Compute : 8 CPUs (factoring in software rendering overhead)
Storage : Min. 256 GiB of disk space (factoring in dependencies, kernel, sdk, etc.)
Bare Metalness == True (for the hypervisor)

We purchased a dedicated server meeting these specifications from this vendor called Hetzner. Their pricing is surprisingly reasonable and their servers have consistently performed well.
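The sizing arithmetic above can be sketched as a small calculation. The per-emulator figures (2 GiB RAM, 1 CPU) come from the requirements listed earlier; the amount reserved for the host OS and software rendering (4 GiB and 2 CPUs here) is purely our assumption, chosen to show how a 16 GiB, 8 CPU machine ends up at 6 emulators.

```python
def max_emulators(
    total_ram_gib: int,
    total_cpus: int,
    ram_per_emulator_gib: int = 2,   # from the AVD definition
    cpus_per_emulator: int = 1,      # one CPU dedicated to each VM
    reserved_ram_gib: int = 4,       # assumption: headroom for host OS and tooling
    reserved_cpus: int = 2,          # assumption: headroom for software rendering
) -> int:
    """Return how many emulators fit once the reserved headroom is set aside."""
    usable_ram = max(total_ram_gib - reserved_ram_gib, 0)
    usable_cpus = max(total_cpus - reserved_cpus, 0)
    by_ram = usable_ram // ram_per_emulator_gib
    by_cpu = usable_cpus // cpus_per_emulator
    # The scarcer resource decides the limit
    return min(by_ram, by_cpu)


if __name__ == "__main__":
    print(max_emulators(total_ram_gib=16, total_cpus=8))
```

Whichever resource runs out first caps the count, so adding RAM alone would not raise the limit on this machine.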

Interacting with the Emulator

While there was now a way to facilitate live interaction with an emulator, we needed certain primitives to be exposed on the emulator so that we may run them programmatically.

For this, we leveraged ADB (Android Debug Bridge), which let us send commands, among other things, into a running emulator by exposing a port (ADB PORT) on the host machine.

Note: When running more than one emulator on a machine, one needs to explicitly specify the ADB PORT for each subsequent emulator to avoid port clashes.
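One way to avoid clashes is to hand out ports up front. The emulator uses an even-numbered console port starting at 5554, with the matching adb port one above it, and adb addresses the instance by the serial `emulator-<console port>`. The helper below is a sketch of such an allocation; the function and type names are ours.

```python
from typing import List, NamedTuple


class EmulatorPorts(NamedTuple):
    console_port: int  # even port, passed to the emulator via -port
    adb_port: int      # console_port + 1
    serial: str        # what "adb -s <serial>" expects


def allocate_ports(count: int, first_console_port: int = 5554) -> List[EmulatorPorts]:
    """Assign a distinct console/adb port pair to each of `count` emulators."""
    if first_console_port % 2 != 0:
        raise ValueError("console port must be even")
    allocations = []
    for index in range(count):
        console = first_console_port + 2 * index
        allocations.append(EmulatorPorts(console, console + 1, f"emulator-{console}"))
    return allocations


if __name__ == "__main__":
    for ports in allocate_ports(6):
        print(ports)
```

Each emulator would then be started with `-port <console_port>`, and commands directed at it with `adb -s <serial> ...`.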

We were now able to perform operations like:

Installing/uninstalling apks

$ adb install /path/to/apk

Get/Set various states in the emulator

$ adb shell getprop sys.boot_completed

Starting packages

$ adb shell am start -n com.package.name/com.package.name.ActivityName

Running instrumented tests

$ adb shell am instrument -w <test_package_name>/<runner_class>

Sending intents to activities, clearing the activity stack, etc.
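These commands are easy to drive from a script. As an illustration, here is a sketch that polls `sys.boot_completed` until the emulator reports that it has booted. It assumes `adb` is on the PATH; the function names, timeout and polling interval are our own choices, and the `adb` and `sleep` parameters exist only so the loop can be exercised without a real emulator.

```python
import subprocess
import time
from typing import Callable, List


def run_adb(args: List[str]) -> str:
    """Run an adb command and return its stdout with whitespace stripped."""
    result = subprocess.run(["adb", *args], capture_output=True, text=True, check=False)
    return result.stdout.strip()


def wait_for_boot(
    serial: str,
    timeout_s: float = 120.0,
    poll_interval_s: float = 2.0,
    adb: Callable[[List[str]], str] = run_adb,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll sys.boot_completed until it reads "1" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # The property is empty while the emulator is still booting
        if adb(["-s", serial, "shell", "getprop", "sys.boot_completed"]) == "1":
            return True
        sleep(poll_interval_s)
    return False
```

Only once this returns True does it make sense to install an apk or start a package on that emulator.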

Putting it all together

And here’s a simple bash script that incorporates all of the components, to finally expose a websocket that a noVNC client can connect to.

# Start Xvfb at DISPLAY :1
Xvfb :1 -screen 0 1024x768x24 > xvfb.log 2>&1 &
# Point DISPLAY to virtual X Server
export DISPLAY=:1
# Start emulator for a pre-defined avd
emulator -avd Nexus_5X_API_24 -gpu swiftshader -no-boot-anim -nojni -netfast -qemu -enable-kvm > emulator.log 2>&1 &
# Start a VNC server and point it to the same display
x11vnc -display :1 -quiet -nopw -rfbport 5900 -bg -o vnc.log
# Proxy websocket traffic to raw tcp traffic
websockify -D :6080 :5900 > websockify.log 2>&1

Note: This is a grossly oversimplified version. Because there are so many moving parts, we need to ensure that each of these processes has been initialised and is running before starting the next one.

The entire setup can be considered as one unit, which we will refer to as an endpoint from here on. We built a Thrift service that provisions such endpoints and exposes certain other primitives, such as installing/uninstalling apks and flushing the Activity Stack.
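For the network-facing processes, one simple readiness check is to wait until the expected TCP port accepts connections before launching the next step. The sketch below shows that idea; it is not the check we run in production, and the function name and defaults are assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  poll_interval_s: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=poll_interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval_s)
```

In the script above, one could wait for port 5900 after starting x11vnc and for port 6080 after starting websockify. Xvfb and the emulator need different checks, since they are not reached over these ports.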

And here’s the thrift definition for Droid Service.

enum ErrCode {
  DEFAULT = 0,
  NO_DROIDS_AVAILABLE = 1,
}

struct ConnParams {
  1: string host,
  2: string port,
  3: optional string password,
}

exception ApplicationException {
  1: string msg,
  2: optional i32 code = ErrCode.DEFAULT,
}

service DroidService {
   void ping(),

   string get_package_name(1: string apk_url) throws (
          1: ApplicationException ae),

   ConnParams get_endpoint(1: string endpoint_id),

   bool run_operation(1: string endpoint_id, 2: string operation, 3: string apk_url) throws (
          1: ApplicationException ae),
}

Scaling out

In Sizing up the machine we saw that we could run at most 6 emulators on a machine with those specs. To accommodate more endpoints, we had to be able to scale out horizontally.

We employed the Apache ZooKeeper project to this effect. One of its use cases is centralized configuration management, and it provides recipes for designing distributed systems.

Concepts

Each Droid Service is configured to attach itself to ZooKeeper on startup
Scaling up is simply a matter of starting another Droid Service
A Master Droid Service simply directs RPC calls to the relevant Droid Service and relays the response back to external clients (e.g. a Django app).

And here’s the thrift definition for the Master Droid Service

enum ErrCode {
  DEFAULT = 0,
  NO_DROIDS_AVAILABLE = 1,
}

struct DroidRequest {
  1: string user,
  2: optional string apk_url,
  3: optional string op,
}

struct ConnParams {
  1: string host,
  2: string port,
  3: optional string password,
}

exception ApplicationException {
  1: string msg,
  2: optional i32 code = ErrCode.DEFAULT,
}

service DroidKeeper {
   void ping(),

   string get_package_name(1: string apk_url) throws (
          1: ApplicationException ae),

   ConnParams get_endpoint_for_user(1: string user) throws (
          1: ApplicationException ae),

   bool interact_with_endpoint(1: DroidRequest dr) throws (
          1: ApplicationException ae),

   oneway void release_endpoint_for_user(1: string user),
}
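To make the semantics of get_endpoint_for_user and release_endpoint_for_user concrete, here is an in-memory sketch of the bookkeeping a master could do. It is a stand-in for illustration only: the real service coordinates through ZooKeeper and speaks Thrift, and the allocation policy shown (first free endpoint, one endpoint per user) is our assumption.

```python
from typing import Dict, Iterable, List


class NoDroidsAvailable(Exception):
    """Raised when every endpoint is taken (cf. ErrCode.NO_DROIDS_AVAILABLE)."""


class EndpointPool:
    """Tracks which endpoint is allotted to which user."""

    def __init__(self, endpoint_ids: Iterable[str]) -> None:
        self._free: List[str] = list(endpoint_ids)
        self._by_user: Dict[str, str] = {}

    def get_endpoint_for_user(self, user: str) -> str:
        # A user who already holds an endpoint gets the same one back
        if user in self._by_user:
            return self._by_user[user]
        if not self._free:
            raise NoDroidsAvailable("no free endpoints")
        endpoint_id = self._free.pop(0)
        self._by_user[user] = endpoint_id
        return endpoint_id

    def release_endpoint_for_user(self, user: str) -> None:
        # Releasing is a no-op for users without an endpoint, like a oneway call
        endpoint_id = self._by_user.pop(user, None)
        if endpoint_id is not None:
            self._free.append(endpoint_id)
```

Releasing an endpoint returns it to the pool, so the next recruiter who opens a submission can pick it up.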

What’s next?

Leveraging VirtualGL and exploring the TurboVNC project to spin emulators that use host GPUs.
Parameterize AVD creation so as to specify API Levels.

Posted by Vishal Gowda · @VishalGowdaN · GitHub

http://engineering.hackerearth.com/2017/04/03/streaming-android-apps-via-the-browser

Monitoring and alert system using Graphite and Cabot

Mar 21, 2017

Introduction

The infrastructure that powers a product and all of the services that it provides can be huge and complex because the product is scaled to serve millions of users. In most cases, each service might depend on various components for seamless functioning. With a product that houses a variety of features with critical infrastructure components and services powering these features, it becomes vital to monitor these components and services and keep them running at any cost.

This monitoring system has to handle the following:

Gathering data from all the components and services
Storing the data efficiently and in an easily accessible manner
Visualizing the data for faster comprehension
Making sense of this data and relaying alerts to the respective owners of the services and components
Managing the on call team and alerting them immediately

At HackerEarth…

Initially when we began facing problems because of some of our machines and services going down, we wrote ad hoc monitoring scripts that ran as crons to send email alerts. We also set up AWS Cloudwatch alarms to send notifications via email. There came a time when we had a very high number of components to be monitored and we realised that we were not getting enough insight into the load and usage of our machines. This is when we decided to put a system in place to collect data from these monitors and services. We also added a monitoring component to send alerts in more reliable ways (through phone calls) to the product owners and our on-call team, in case of any downtime.

Components of this system

This system consists of the following components that work together:

Collection: Tool for collecting metrics from all the infrastructure components
Forwarding: Tool for aggregating the metrics that are received from various machines and services and routing them to different storage backends
Visualization: Tool for generating graphs and visualizing metrics
APIs: Backend that provides APIs to query the metrics data for dashboards and alerting systems
Storage backends: Database for storing the time-series metrics data
Monitoring: Monitoring system for performing checks on the metrics and sending alert notifications

There are a lot of tools out there for each of these components that can be put together to build this system rather than writing it from scratch. We were very happy with what Cabot had to offer as a monitoring and alert system. Since it used Graphite APIs for the monitoring checks, we decided to use Graphite to record and serve all our metrics.

Traditional approach

Graphite is usually used with the following components at its core:

Carbon as a forwarding/relaying component
Graphite-Web as a web and API component
Whisper as a library for storing time-series data with MySQL or PostgreSQL as the database
Grafana for visualization and any tool like collectd or statsd as a metric collector

Limitations

While these are useful, there are many issues that surface as you try to scale this system. Some of the major limitations seen with this approach include:

Carbon is written in Python, and cannot scale to millions of metrics that are being reported to it. It starts dropping metrics when it is unable to handle the load
Using databases like MySQL and PostgreSQL for storing time-series data is not efficient. Graphs on Grafana start rendering very slowly. It’s understandable, as these databases were not built for time-series data

Influx to the rescue!

InfluxDB is a database that is written in Go, primarily for time-series metric data. This database is by far the best option for a database to be used with Graphite. We also found alternative components for:

Using the API with InfluxDB as a storage component
Mitigating the limitations of using Graphite with the components mentioned above

The components that we found to be better include:

Collectd: This is the collection component that will collect metrics from all the machines
carbon-relay-ng: This is a Carbon relay written in Go. It can scale to 2500 metrics/s, or 150k per minute. It is used for forwarding the metrics to InfluxDB, which has a Graphite listener to receive the data in the Graphite format
InfluxDB serves as a kickass backend to store the metrics data!
Grafana: This has the capability of adding InfluxDB as a data source and helps in creating and rendering graphs effortlessly
graphite-api: This is the API component. It is the APIs in graphite-web without the web interface and is much lighter. It uses InfluxGraph as a storage backend.
Cabot: This is the monitoring component that conducts periodic checks on the metrics in InfluxDB using the graphite-api and sends alerts.

Putting these components together

Collectd

Collectd is used for the metrics-collection component. You can get collectd from here. Collectd provides a lot of plugins for collecting various metrics. Some important plugins to look at are CPU, memory, and disk. It also provides plugins for other tools like MySQL, Redis, PostgreSQL and amqp.

Custom metrics can be reported to carbon-relay-ng using the Python plugin that runs a Python script to collect metrics. This is an example of writing custom-service plugins (in Python) that are used by collectd. Most importantly the write_graphite plugin is used to report metrics to carbon-relay-ng in the graphite line format.

This is the format of the write_graphite plugin that should be added in the collectd config file

<Plugin write_graphite>     
    <Node "graphite">       
        Host "<carbon-server-ip>"
        Port "<carbon-server-port>"         
        Protocol "tcp"      
        LogSendErrors true  
        Prefix "random_prefix."
        StoreRates true     
        AlwaysAppendDS false
        EscapeCharacter "_" 
    </Node>                 
</Plugin>

# Important: You must use the right prefix. If it does not match with
# any of the carbon routes that have been configured, then carbon-relay-ng will
# drop the metrics.

The interval at which collectd sends metrics is configured at the global level in the config file for all the plugins by using the Interval parameter. It can also be specified inside a plugin to override the interval for that plugin.

Collectd is one of the tools that is used to report metrics. Many other tools can be used to report metrics to graphite with this architecture. Some alternatives can be found here.

Carbon Relay Ng

You can install carbon-relay-ng from here. The following points have to be kept in mind while using it:

The conf file in /etc/carbon-relay-ng/carbon-relay-ng.conf has the params spool-dir and pid-file. In case of errors due to these parameters, set them to /var/spool/carbon-relay-ng and /var/run/carbon-relay-ng.pid. Create the directory in spool, if required.
It supports routing to various sources by creating routes in the config file. For more information, read the documentation.
It allows the blacklisting of metrics based on the prefix, regex, or substring match.
It provides aggregators to batch metrics for an interval and applies aggregation functions on the values. The aggregated value is then routed to the destination.
It also allows rewriters to rewrite metric names.
It comes with a UI where new routes can be added during runtime.

Sample Route

Format : addRoute <type> <key> [opts]   <dest>  [<dest>[...]]
'addRoute sendAllMatch collectd  127.0.0.1:2005 prefix=hemetrics'

Multiple such routes can be added to the init list in the config file. Here the command addRoute adds a new route of the type sendAllMatch, which sends metrics to all destinations that match! Metrics can be matched by using prefixes, substring, or regex matches.

In this case the route has the key collectd and sends metric data with the prefix hemetrics to all matching destinations where the destination is expected to be a TCP endpoint (IP:port combination).

Carbon, or carbon-relay-ng in our case, can also be fed externally by using the plaintext line protocol. For more information, read the relevant documentation here.
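In that protocol, each metric is a single line of the form `<metric path> <value> <timestamp>` terminated by a newline, sent over a TCP connection. A minimal sketch of formatting and sending such a line follows; the function names are ours, and any metric path, host and port you pass in depend on your own setup.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one Graphite plaintext line: "<path> <value> <timestamp>\\n"."""
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{path} {value} {timestamp}\n"


def send_metric(host: str, port: int, path: str, value: float,
                timestamp: Optional[int] = None) -> None:
    """Send a single metric line to a Carbon-compatible TCP listener."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
```

The metric path has to start with a prefix that one of the configured routes matches, otherwise the relay will drop it.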

carbon-relay-ng provides a lot more options for matching and transforming data. For more information, read the documentation. :)

InfluxDB

InfluxDB is a very good choice for storing our metrics data as it is a time-series database and has very high performance (written in Go). Some of the other features that make InfluxDB awesome:

Supports multiple protocols for data ingestion including Graphite
Has a query language that is very SQL-like
Allows tagging of data which is indexed thus making queries efficient
Provides retention policies to automatically expire data
Has a built in web interface
Provides continuous queries that automatically aggregate data making frequent queries faster

# Config for the graphite listener
[[graphite]]
  # Determines whether the graphite endpoint is enabled.
  enabled = true
  database = "graphite"
  # This is what the carbon-relay-ng route should use as the destination
  bind-address = "localhost:2005"
  protocol = "tcp"

Grafana

Grafana, as a visualization component, provides a plethora of options like histograms, heat maps, and graphs. Grafana provides alerting options and notifications on alerts to PagerDuty, Slack, and a few other services. However, since we wanted something more than that, we decided to deploy Cabot as the monitoring and alerting component.

Grafana enables us to create beautiful dashboards and renders them seamlessly, so the on-call team can monitor various components in a single dashboard. Installing Grafana and creating dashboards is a cakewalk. Start creating your dashboards here.

Graphite API & Influx Graph

Graphite-API is graphite-web with just the HTTP APIs and not the web interface. It implements most of the APIs. To make Graphite-API use InfluxDB as the data source, install Influx Graph (storage plugin for Graphite-API).

The configuration for Graphite-API resides in /etc/graphite-api.yaml. An example for it can be found here

Installing Influx Graph installs Graphite-API and a lot more dependencies including InfluxDB. Graphite-API provides a lot of configuration options for using standard templates for queries, caching queries in memcache, aggregation functions, and grouping data based on intervals combined with query-retention policies to cache queries of certain intervals for specific amounts of time.

When templates are added, the API expects queries for the metrics of the format that is specified in these templates only. Any other queries will return zero results.

Graphite-API can be deployed in various ways. At HackerEarth we have deployed it using NGINX and uWSGI.

Cabot

Cabot is an open-source monitoring and alert system written in Python and Django. It provides the best features of PagerDuty free of cost.

The documentation is pretty good for setting it up and getting it to work without any hassles. Some of the best features of Cabot include:

Web interface to configure services, checks, and users
Good coupling of services, checks, and users. This allows certain default users to be notified for every service.
Checks that include graphite metric checks, jenkins job checks, and http endpoint checks.
Alerts sent through email, Hipchat, Slack, phone and SMS (Twilio).
Recovery instructions can be tagged with every check
Easy integration of custom alerts.
On-call management, users on call are picked up from a Google Calendar and updated every half hour.
Easy deployment

All of the cabot configuration options can be found here.

Note: Cabot uses Celery to run all the service check tasks. The configuration parameter CELERY_BROKER_URL requires a redis host with the right authentication. Also set the WWW_HTTP_HOST to the FQDN of the server.

Check if all the checks are running by visiting the /status/ endpoint.
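To give a feel for what a graphite metric check boils down to, here is a sketch that evaluates a threshold against the JSON returned by the Graphite render API (a list of series, each with a `target` and a list of `[value, timestamp]` datapoints). This is not Cabot's implementation; the function names, the "greater than" comparison and the choice to treat missing data as a failure are all our assumptions.

```python
from typing import Any, Dict, List, Optional


def latest_value(series: Dict[str, Any]) -> Optional[float]:
    """Return the most recent non-null value in a render API series, if any."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def failing_targets(render_response: List[Dict[str, Any]], threshold: float) -> List[str]:
    """List targets whose latest value exceeds the threshold or is missing."""
    failing = []
    for series in render_response:
        value = latest_value(series)
        # Assumption: a series with no data at all counts as a failure
        if value is None or value > threshold:
            failing.append(series["target"])
    return failing
```

A check would fetch the series for a target pattern from graphite-api with `format=json`, run a comparison like this one, and alert the owners of the service if anything fails.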

Complete workflow

Final words

It has been two months since we started using this system at HackerEarth. It has been very stable and completely reliable. We collect hundreds of metrics from various machines and services every minute. This has given us a better understanding of the load on our critical machines and services, thus helping us manage our infrastructure more efficiently and minimising our downtime.

Posted by Karthik Srivatsa

http://engineering.hackerearth.com/2017/03/21/monitoring-and-alert-system-using-graphite-and-cabot

Leveraging ReactJS in HackerEarth Assessment Environment

Mar 7, 2017

ReactJS, as the name suggests, helps create reactive (read interactive) UIs. If we have a UI with many interactive elements and on each interaction a bunch of elements change, ReactJS efficiently updates and renders the required elements. At HackerEarth, one such UI is the programming assessment environment.

The programming assessment environment is one of the most critical products of HackerEarth. The mockup below gives a broad idea of what it is composed of.

The system in place

Let’s briefly understand the components in the mockup above. There are three major components:

The left pane
- This is the primary navigational part of the interface. This controls which question is visible in the right pane.
- It enables candidates to switch between questions as per their convenience.
The upper right pane
- This contains the detailed description of the question which is selected in the left pane.
The lower right pane
- This is a medium for the candidate to submit an answer to the question above. As of now, we are assuming that we have a programming question. So let’s have a Code Editor here.

Now, let’s take a look at how things work in the current architecture.

In the first HTTP response, we render all the questions in the left pane. There’s no html in the right pane. On selecting a question in the left pane, an ajax call is made to fetch the data. The call returns the pre-rendered html of the entire question description, and it provides the candidate with a way to make further ajax calls to load the code editor. We won’t get into the details of each part of the question description; for now, let’s assume that it is composed of many smaller components.

When the candidate is ready to attempt the question, another ajax call is made to render the code editor. We have extended ace editor and written a wrapper over it to fit in our requirements. One such requirement is to record every keystroke and create a frame for it. Later, we can play all those frames and see the entire code editing session as a video. If that is of interest to you, read more about how we went about doing that.

Pain points in the current system

So far, we are good. Now, let’s talk about some of the pain points in the implementation above.

Every time we switch between questions using the list in the left pane, we make an ajax call to fetch that data. Keep in mind that a candidate has, on average, 1.5 hours to attempt the test. Assuming that switching and re-rendering takes about 2 seconds and that there are about 20 questions in the test, a candidate loses about 40 seconds (considering only 20 switches). As a candidate, you would also want to revisit each question at the end and go through your answers. 40 seconds may seem like a small duration, but imagine the frustration of waiting 2 seconds every time you want to view a question.

Next comes the part where we load the code editor via ajax. Every time, a code editor is rendered, there is a list of files that are needed:

Ace.js, the main component
Mode files for each of the languages we support in our editor; it provides highlighting depending on the language
Autocomplete files, again for all languages, help in providing realtime suggestions and completion of statements or functions
CodePlayer.js, used to record keystrokes and frames to play as video
AceWrapper.js, our own custom file, the final wrapper over Ace.js

Mode files, when combined and minified, used to take up more than 1000 KB, autocomplete files up to 700 KB, Ace.js up to 400 KB, and CodePlayer.js and AceWrapper.js combined about 50 KB. So, on loading the code editor, we had to fetch more than 2 MB of data. On a good connection this isn’t much of a problem. However, when companies conducted campus tests at universities in remote areas where the Internet speed is about 10 Mbps, things didn’t work out well, because that bandwidth is shared among hundreds of candidates taking the test simultaneously. On top of that, loading the editor triggers a few more ajax calls to fetch the correct code editor settings for the question.
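A back-of-the-envelope calculation shows how bad this gets. The sketch below adds up the file sizes quoted above and estimates the download time when a link is split evenly between candidates. The figure of 100 simultaneous candidates, the even split, the link speed being in megabits per second, and the absence of caching are all simplifying assumptions.

```python
# Sizes quoted above, in KB (1 KB taken as 1000 bytes)
EDITOR_PAYLOAD_KB = 1000 + 700 + 400 + 50  # modes + autocomplete + Ace.js + wrappers


def download_seconds(payload_kb: float, link_mbps: float, concurrent_users: int) -> float:
    """Estimate download time when a link is shared evenly between users."""
    if concurrent_users < 1:
        raise ValueError("need at least one user")
    payload_bits = payload_kb * 1000 * 8
    per_user_bits_per_second = link_mbps * 1_000_000 / concurrent_users
    return payload_bits / per_user_bits_per_second


if __name__ == "__main__":
    seconds = download_seconds(EDITOR_PAYLOAD_KB, link_mbps=10, concurrent_users=100)
    print(f"{seconds:.0f} seconds to load the editor")
```

Under these assumptions each candidate waits close to three minutes for the editor alone, before a single line of code can be written.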

Leveraging ReactJS to tackle the pain points

First, let’s solve the problem of an ajax request being made for every question switch made by the candidate. There were multiple thoughts around this. We can bring in the entire html of all the questions and simply hide/unhide at client side. This proved to be ineffective because the html was sized more than 5 MB for just 10 questions. So, this approach is out of the window.

Next, we thought of returning json instead of html and render it on the client side. Once, in the past, we used handlebars to facilitate this. However, we were open to exploring newer technologies and that’s when we stumbled upon ReactJS. After about 2 weeks of research and analyzing how we can fit ReactJS in our architecture, we decided to go ahead with it.

We knew it was still going to take some time to load the heavy ReactJS vendor files. So, we introduced a loader (inspired by Gmail) while rendering the first HTTP response.

This gave us enough time to load all heavy js files and make some ajax calls. The json returned by ajax call was less than 100 KB for 30 questions. This was pretty good! We let React do what it does best and render the json on the client side.

Another beautiful feature of ReactJS is its ability to bind the rendered elements with the state of the elements. If we want to update the content or child elements of any element, we just update the state of the parent element and React will re-render the component. We will use the words element and component interchangeably; component refers to the js code of the html element.

In a complex web page with multiple action points, one action often triggers another and changes the interface. For instance, when a candidate answers a question in the right pane, we make changes in the left pane indicating that this question has been attempted. Instead of using tons of listeners for various actions and making changes in the interface by hand, just updating the state of the component will suffice in ReactJS.

Also, we have many different types of questions such as Programming, SQL, Android, etc. There are a lot of common child components among these components. Using React, we were easily able to reduce the code complexity and repetition by creating independent child components and reusing them to compose larger components.

Leveraging all these advantages of ReactJS, we were able to render all the questions at once in the test interface and hide/unhide the question depending on the selection in the left pane. This solved the first part of waiting for about 2 seconds for each question switch.

More complexity to the above structure

We have Multiple Choice Questions that come with a capability of having a timer per question. For example: The question description will not be visible to the user in the beginning (as shown in the image below). Once the user loads the question description, a timer will start for that particular question.

The problem with this is that we cannot send the question description in the initial ajax call that gets the json. A smart candidate could inspect the response of the ajax request, and the whole concept would turn into a disaster. We made a compromise in this case and allowed an ajax call to be made for each timed multiple choice question. However, this time around, we do not send html from the backend; we still return json and update the state of one of the child components. Initially, it sounded like a troublesome problem, but the solution was quite easy. And this whole process of making the ajax call to fetch json and re-rendering that child component did not take more than 200 ms.

Code Editor in ReactJS

The AceWrapper.js that we wrote had become quite messy over time, and it had turned into legacy code that nobody dared to touch. After all, the code editor is one of the primary user action elements, and if that breaks, we are doomed. We found that somebody had already written a React wrapper for Ace.js, react-ace. This was a good starting point for us. We forked it right away and wrote yet another wrapper component over it to support our own requirements such as the code video player. Let’s keep the technicals of AceEditor in React for another day.

We ended up with a separate instance of AceEditor for every question. This did make the DOM a little heavy. However, the benefits listed below made us ignore this little problem.

Save the state of the code written for any question.
The state is saved even if the Internet is interrupted for a while.
A candidate can still switch between questions and write code without having to worry that the code in another problem was not saved in our backend. We lazily update the database using the state of all the editor instances.
The render time reduced from 4-5 seconds to 300 ms.

Wrapping up

Conclusions from the above discussion:

We were able to avoid ajax calls to fetch each problem and decrease the time from ~2 seconds to ~200 milliseconds.
The load time for the code editor was reduced from ~4 seconds to ~300 milliseconds.
The React way of composing larger components using smaller components helped in writing reusable and maintainable code.
We replaced tons of js listeners by writing handlers in specific React components.
One downside is the initial load time, which fetches react and ace vendor js files.

Future scope of improvement

We can try improving the initial load time. Maybe we can create chunks of vendors depending on the type of test. For instance, if there are no programming questions in a test, we can avoid loading the vendor files for ace.js.
We can update the question in realtime while the candidate is taking the test. For instance, due to some reason, let’s say a question description was altered in the backend by test admin, we ask the candidate to reload the page. If we can push the change to all test taking candidates and update the state of that child, the test taking experience can be further enhanced.

If you’d like to have a first-hand experience of the test environment, go ahead and take this test. Let us know your feedback and how we can improve further.

Evíva!

Posted by Ravi Ojha · @ivarojha · Rookie’s Lab

http://engineering.hackerearth.com/2017/03/07/leveraging-reactjs-in-hackerearth-assessment-environment

WTF is MVP ?

Nov 17, 2016

If you are here searching for answers about Minimum Viable Product or you are here as a result of watching the first episode of the first season of Silicon Valley, this might not be the blog you are looking for. If you are a software engineer and you develop apps (especially on Android), this is a must read. Either way, you can share this blog among other software developers you know! :)

Model-View-Presenter

MVP stands for Model, View, Presenter. MVP is a way to abstract or decouple different components to make them independent of each other. This makes the codebase cleaner, improves readability, improves maintainability and also helps in rigorous testing.

Model : Data access layer such as database API or remote server API.

View : Layer that shows/displays data and reacts to user actions. This could be an Activity, Fragment, View or Dialog. This contains almost no logic. It converts the presenter’s commands to UI actions and listens to user actions, which are passed to the presenter.

Presenter : Layer that provides the View with the data from the Model. Presenters essentially sit in between Models and Views.

Why do we need MVP ?

KISS : Stands for Keep It Simple, Stupid or Keep It Stupid Simple. Don’t fight with the Views. Fight with business logic.
Decouple : Helps in concentrating on the problem. Helps in solving issues like configuration changes, background tasks, etc.
Most problems will be handled by the architecture itself and the app wouldn’t need external libraries to handle specific issues.
Rigorous testing : Helps in building testable apps by writing automation tests.

How do we develop apps for the next billion users ?

We start defining contracts for each layer. A contract is a class or an agreement. We define contracts for all the layers in the architecture - Model, View and Presenter.

Let’s take an example of showing the latest challenges on HackerEarth to users. ChallengesContract.java class defines two interfaces, one each for the View and the Presenter. The view specific functions and variables are declared in the ChallengesContract.View interface. This should mostly include functions to update the UI and to listen to user actions. ChallengesContract.Presenter interface which is defined in the ChallengesContract.java class declares functions to get data from the remote server and functions to handle user actions like clicks, long clicks, etc.
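The actual contracts are Java interfaces in ChallengesContract.java. As a rough, language-neutral illustration of the same idea, here is a minimal Python sketch. The method names, the `load_challenges` callable standing in for the Model layer, and the `RecordingView` test double are all assumptions made for this example, not the app's real API.

```python
from abc import ABC, abstractmethod


class View(ABC):
    """Plays the role of ChallengesContract.View: UI updates only."""

    @abstractmethod
    def show_challenges(self, challenges):
        ...

    @abstractmethod
    def show_challenge_details(self, challenge):
        ...


class Presenter(ABC):
    """Plays the role of ChallengesContract.Presenter: data loading and user actions."""

    @abstractmethod
    def start(self):
        ...

    @abstractmethod
    def on_challenge_clicked(self, challenge):
        ...


class ChallengesPresenter(Presenter):
    def __init__(self, view, load_challenges):
        self._view = view
        # load_challenges is a hypothetical callable standing in for the Model layer.
        self._load_challenges = load_challenges

    def start(self):
        self._view.show_challenges(self._load_challenges())

    def on_challenge_clicked(self, challenge):
        self._view.show_challenge_details(challenge)


class RecordingView(View):
    """Test double: records what the presenter asked the UI to do."""

    def __init__(self):
        self.calls = []

    def show_challenges(self, challenges):
        self.calls.append(("show_challenges", list(challenges)))

    def show_challenge_details(self, challenge):
        self.calls.append(("show_challenge_details", challenge))


view = RecordingView()
presenter = ChallengesPresenter(view, load_challenges=lambda: ["Challenge A", "Challenge B"])
presenter.start()
presenter.on_challenge_clicked("Challenge A")
```

Because the presenter only talks to the View through the contract, it can be exercised with a plain test double and no real UI, which is the testability benefit described earlier.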

ChallengesContract.View can be implemented in two Android components - Activity and Fragment. We choose fragments to implement the ChallengesContract.View for two main reasons :

Fragments help us in defining layouts for tablet devices.
As the Fragment is controlled by the Activity, the control of creating the presenter and the view remains with the Activity too.

The ChallengesListFragment.java implements ChallengesContract.View.

While the fragment is made visible to the user, we instantiate the presenter object and call the start() method on the presenter. We used onStart() and not onResume() as we don’t want users to keep waiting for the updated view while the fragment is actively running in the foreground. challengesPresenter.start() starts by deleting old data in the cache (if there is any) and makes a request to remote servers and receives the latest challenges on HackerEarth. The presenter handles all the business logic. It requests for data from the Model and provides necessary data to the View. The presenter now needs the Model layer.

The Model layer needs a Contract or an agreement too. This helps the other layers (View and Presenter) communicate with it. ChallengesPresistenceContract.java defines the authorities, the MIME types for each Uri and the scheme for all the Uris used in the ContentProvider. We define the Entry classes as well. Entry classes essentially implement BaseColumns, and each one represents a table. BaseColumns is an interface which defines constants for the count of rows in the directory/table and for the unique ID of each row in the table. The ChallengeEntry class defines the table name and columns, and provides the Uri for the table.

As we know, for any persistence to work we need to define models for each table. We create immutable final classes for the models which represent each table. An immutable model class has its advantages :

We won’t face synchronization issues as immutable objects are thread-safe
It also helps in parallelization as there are no conflicts between the objects.
References to immutable objects will not change, which helps in caching those objects and reusing them later.
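The app's models are final Java classes. To illustrate the properties listed above in a compact form, here is a minimal Python sketch that uses a frozen dataclass; the `Challenge` fields are hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Challenge:
    """Hypothetical immutable model for one row of a challenges table."""
    challenge_id: int
    title: str
    url: str


original = Challenge(challenge_id=1, title="Sample Challenge", url="https://example.com/c/1")

# Attempting to mutate a frozen instance raises an error.
try:
    original.title = "Changed"
    mutated = True
except FrozenInstanceError:
    mutated = False

# "Updating" means building a new object; the original is left untouched,
# so it can be shared between threads or cached safely.
renamed = replace(original, title="Renamed Challenge")
```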

Once we have our models and contracts ready we can go ahead and create tables by using the SQLiteOpenHelper class. Though SQLiteOpenHelper helps us in creating and/or upgrading the database for the app, we will have to extend it and provide custom database names, versions and tables.

Creating databases will bring us closer to the core of the Model layer - Data Store or Data Access Object.

I can hear you say : ” Hey I have heard about Data Access Object. That is DAO!! Did you use greenDAO?! It’s awesome!”.

I’m here to tell you that we didn’t use greenDAO or any other DAO libraries as we wanted to reduce the apk size as much as possible and reduce the dependencies in the codebase.

DAO abstracts and encapsulates all access to the data source. The data source could be persistent storage like an SQLite database on Android, a remote data store or a cache. The DAO completely hides the data source implementation details from its clients, which lets us adapt to different data source implementations without affecting other components. The ChallengesDataStore.java interface serves as the main entry point for accessing data. We also define callbacks for the various operations. The callbacks help us determine the state of each operation and act accordingly in the Presenter.

The interface ChallengesDataStore gives us the opportunity to define different implementations for the local/persistent storage and for handling remote server connections. We create two classes for each - One for handling persistent data and another for handling remote data. ChallengesLocalDataStore.java handles the local persistence and ChallengesRemoteDataStore.java handles the remote data. Both classes are implementations of ChallengesDataStore.

We follow the Repository Model for the Model layer. As Martin Fowler puts it A Repository mediates between the domain and data mapping layers using a collection-like interface for accessing domain objects. What this means is that a Repository should be responsible in deciding the source of the data (local or remote) to be used for updating the View through the Presenter.

In the architecture and the use cases we have here, the repository is responsible for fetching data from the remote server, storing it and notifying the ContentProvider. The ContentProvider notifies the change in data to the Uri, and every client observing that Uri receives an update. We create the ChallengesRepo.java class, which implements ChallengesDataStore. It holds the logic for deciding which data source to use and when. This also helps in building an offline mode for apps. The ChallengesRepo can also define callbacks for data availability. This helps in showing relevant views through the presenter.
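To make the repository's role concrete, here is a minimal Python sketch rather than the actual Android code. The class names, the in-memory stores and the "remote first, fall back to local" policy are assumptions chosen for illustration; the real ChallengesRepo may decide differently.

```python
class LocalStore:
    """In-memory stand-in for the local (SQLite-backed) data store."""

    def __init__(self):
        self._rows = []

    def get_challenges(self):
        return list(self._rows)

    def save_challenges(self, challenges):
        self._rows = list(challenges)


class RemoteStore:
    """Stand-in for the remote data store; `fetch` is a hypothetical callable."""

    def __init__(self, fetch):
        self._fetch = fetch

    def get_challenges(self):
        return self._fetch()


class ChallengesRepository:
    """Decides which data source answers a request."""

    def __init__(self, local, remote):
        self._local = local
        self._remote = remote

    def get_challenges(self):
        # Prefer fresh remote data; fall back to the local copy when offline.
        try:
            challenges = self._remote.get_challenges()
        except ConnectionError:
            return self._local.get_challenges()
        self._local.save_challenges(challenges)
        return challenges


def offline_fetch():
    raise ConnectionError("no network")


local = LocalStore()
first = ChallengesRepository(local, RemoteStore(lambda: ["Challenge A"])).get_challenges()
second = ChallengesRepository(local, RemoteStore(offline_fetch)).get_challenges()
```

The caller never learns which source answered, which is what lets the presenter stay unaware of where the data came from.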

It’s always recommended to have one ContentProvider per app. A single ContentProvider should handle different types of Uri. One can create 2 or more ContentProviders if there are 2 sets of data: one that needs to be shared and another that should not be. We can also have 2 or more ContentProviders per app if there are 2 or more databases. A ContentProvider manages access to a structured set of data, routes data access for a database to its different tables and helps us execute CRUD operations. Loaders help in loading data and observe changes to that data. ChallengesAppContentProvider.java shows the implementation of the ContentProvider.

The presenter contains all the business logic. The presenter never knows the source of the data; its sole responsibility in handling data is to query for data, process it and update the View with it. It listens to user interactions from the UI such as View.OnClickListener, View.OnLongClickListener, View.OnTouchListener, etc. Presenters are also responsible for retrieving data from the repository and updating the UI as required. ChallengesPresenter.java implements ChallengesContract.Presenter, ChallengesRemoteDataSource.GetChallengesCallback, ChallengesRepo.LoadDataCallback and [LoaderManager.LoaderCallbacks](https://developer.android.com/reference/android/app/LoaderManager.LoaderCallbacks.html).

If you have come this far, you have come far enough! MVP is the first step towards building an app that scales to a billion users. It helps us developers build testable apps, test them on device farms like Amazon Device Farm or Firebase Test Lab, and build apps that work offline.

Stay tuned for more updates! Feel free to comment below or ping us at support@hackerearth.com if you have any suggestions!

Posted by Vishnu Sosale
Follow me @vishnusosale

http://engineering.hackerearth.com/2016/11/17/wtf-is-mvp

Sending emails to our half million and growing user community

Feb 11, 2016

At HackerEarth, we send emails to keep our users updated on upcoming challenges and their activities: for example, when a user successfully solves a problem, receives a test invitation, or when there are updates on user comments. Basically, whenever it is appropriate.

Architecture

It takes a lot of computational power to send emails in such large quantities synchronously. So we have implemented an asynchronous architecture to send emails.

Here is a brief overview of the architecture:

Step 1: Construct an email and save the serialized email object in database.
Step 2: Queue the metadata for later consumption.
Step 3: Consume the metadata, recreate the email object and deliver.

The diagram below shows the high-level architecture of the emailing system. The solid lines represent the data flow between different components. The dotted lines represent the communications. The HackerEarth email infrastructure consists of a MySQL database, a MongoDB database and RabbitMQ queues.

Journey Of Email

Step 1 - Construct email:

There are two different types of emails.

Text - Plain text emails
Html - Emails with rich interface using html elements. These emails are made using django templates

API used by hackerearth developers for sending email -

    send_email(ctx, template, subject, from_email, html=False, async=True,
                **kwargs)

The above API creates Sendgrid Mail object, serializes and saves it in the db with some additional data.

A piece of code similar to the bit shown below is used to create sendgrid Mail object


    import sendgrid

    sg = sendgrid.SendGridClient('YOUR_SENDGRID_API_KEY')

    message = sendgrid.Mail()
    message.add_to('John Doe <john@email.com>')
    message.set_subject('Example')
    message.set_html('Body')
    message.set_text('Body')
    message.set_from('Doe John <doe@email.com>')
    status, msg = sg.send(message)

Model below is used for storing the serialized mail object and additional data.

    from datetime import datetime

    from django.db import models

    class Message(models.Model):
        # The actual data - a pickled sendgrid.Mail object
        message_data = models.TextField()
        when_added = models.DateTimeField(default=datetime.now, db_index=True)

After constructing and saving the email object in the database, the metadata is queued in the RabbitMQ queues. The following section explains this in detail.

Note: The send_email() API can also send synchronous emails. Set the ‘async’ flag to False to send an email synchronously. This bypasses the asynchronous architecture and delivers the email directly. However, this is used only for extremely important emails, for example, infrastructure monitoring, alarms, and monitoring of the email infrastructure itself.

Step 2 - Queue the metadata:

Not all emails have the same importance in terms of delivery time. So, we have created multiple queues to reduce the time that important emails wait in a queue.

High priority queue
Medium priority queue
Low priority queue

It’s up to the application developer to decide the importance of the email and put it in the appropriate queue.

We queue the following metadata in the queue as a json object:

{"message_id": 123}
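Steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: a dict stands in for the Message table, `queue.Queue` objects stand in for the RabbitMQ queues, and `save_and_queue` and the priority names are hypothetical, not the real send_email internals.

```python
import json
import pickle
import queue

# In-memory stand-ins: a dict for the Message table, one queue per priority.
message_table = {}
queues = {"high": queue.Queue(), "medium": queue.Queue(), "low": queue.Queue()}


def save_and_queue(email, priority="low"):
    """Persist the serialized email, then queue only its id as JSON metadata."""
    if priority not in queues:
        raise ValueError("unknown priority: %s" % priority)
    message_id = len(message_table) + 1
    message_table[message_id] = pickle.dumps(email)
    queues[priority].put(json.dumps({"message_id": message_id}))
    return message_id


first_id = save_and_queue({"subject": "Test invitation"}, priority="high")
second_id = save_and_queue({"subject": "Weekly digest"})
```

Keeping the queued payload down to an id means the queues stay light, while the full email body lives in the database until a worker needs it.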

Step 3 - Reconstruct and deliver:

We run delivery workers, which consume metadata from the queues, reconstruct the email object and deliver it.

These workers consume the messages from the RabbitMQ queues, fetch the message object from the Message model (explained in the section above) and deserialize the data to reconstruct the sendgrid Mail object.

We run a different number of workers for each queue, depending on its volume of emails.

Before sending an email, we do final checks that help us decide whether to deliver it or not: for example, whether the email id is blacklisted and whether the email has a non-zero number of receivers.

After the request is sent to sendgrid for delivering the email, the email object is logged into MongoDB to maintain the history of delivered emails.
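A delivery worker's handling of one message might look roughly like the sketch below. The `process` function, the `BLACKLIST` set, the dict-shaped emails and the `history` list (a stand-in for the MongoDB log) are all assumptions for illustration; the real workers operate on sendgrid Mail objects.

```python
import json
import pickle

BLACKLIST = {"blocked@example.com"}  # hypothetical blacklist


def process(metadata_json, message_table, send, history):
    """Rebuild the email named by the metadata, run final checks, then deliver."""
    metadata = json.loads(metadata_json)
    email = pickle.loads(message_table[metadata["message_id"]])

    # Final checks: drop blacklisted recipients and skip if nobody is left.
    recipients = [r for r in email["to"] if r not in BLACKLIST]
    if not recipients:
        return False

    email = dict(email, to=recipients)
    send(email)
    history.append(email)  # stand-in for logging the delivery to MongoDB
    return True


table = {
    1: pickle.dumps({"to": ["john@example.com", "blocked@example.com"], "subject": "Hi"}),
    2: pickle.dumps({"to": ["blocked@example.com"], "subject": "Hi"}),
}
sent, history = [], []
delivered_first = process('{"message_id": 1}', table, sent.append, history)
delivered_second = process('{"message_id": 2}', table, sent.append, history)
```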

A/B Test In Emails

Sending emails at this scale requires optimization to improve the user experience. This is done through A/B tests on email types. We can test emails for subject and content variations. Every user on HackerEarth is assigned a bucket number to ensure that the emails they receive are consistent during the experiment. Every A/B experiment is defined as a dictionary constant that holds all of its information.

Here is one example of an A/B test with subject variation.

"""
    EMAIL_VERSION_A_B
    format of writing A/B test
    key: test_email_type_version_number
    value: email_dict


    format for email_dict
    keys: tuple(user_buckets)
    values: category, subject, template
"""

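# Each variation is keyed by its own tuple of user bucket numbers.
# bucket_numbers_v1 / bucket_numbers_v2 are placeholder names; the two
# tuples must not overlap, otherwise one entry would overwrite the other.
EMAIL_VERSION_A_B = {
    'A_B_TEST_1': {
        tuple(bucket_numbers_v1): {
            'a_b_category': 'email_category_v1',
            'subject': 'Hello hackerearth',
            'template': 'emails/email.html',
        },
        tuple(bucket_numbers_v2): {
            'a_b_category': 'email_category_v2',
            'subject': 'Welcome hackerearth',
            'template': 'emails/email.html',
        },
    },
}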

New experiments must update EMAIL_VERSION_A_B with the experiment data. Information from EMAIL_VERSION_A_B is used to update the keyword arguments of the HackerEarth email-sending API (send_email). The category is propagated to the category of the sendgrid Mail object. Categories are used to see the variations in open rate and click rate for different A/B experiments.
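As an illustration of how an experiment entry could drive the send_email keyword arguments, here is a small sketch. The helper names `pick_variant` and `build_send_kwargs`, the bucket ranges and the `category` keyword are assumptions; the post does not show this part of the code.

```python
def pick_variant(experiment, user_bucket):
    """Return the email settings whose bucket tuple contains the user's bucket."""
    for buckets, email_dict in experiment.items():
        if user_bucket in buckets:
            return email_dict
    return None


def build_send_kwargs(experiment, user_bucket, **defaults):
    """Override the default keyword arguments with the user's variation, if any."""
    kwargs = dict(defaults)
    variant = pick_variant(experiment, user_bucket)
    if variant is not None:
        kwargs.update(
            subject=variant["subject"],
            template=variant["template"],
            category=variant["a_b_category"],
        )
    return kwargs


# Hypothetical experiment: buckets 0-4 get version 1, buckets 5-9 get version 2.
experiment = {
    (0, 1, 2, 3, 4): {
        "a_b_category": "email_category_v1",
        "subject": "Hello hackerearth",
        "template": "emails/email.html",
    },
    (5, 6, 7, 8, 9): {
        "a_b_category": "email_category_v2",
        "subject": "Welcome hackerearth",
        "template": "emails/email.html",
    },
}

kwargs_for_bucket_7 = build_send_kwargs(experiment, 7, subject="Default subject")
```

Since a user's bucket number is fixed, the same user always lands in the same variation for the lifetime of the experiment.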

Feel free to comment below or ping us at support@hackerearth.com if you have any suggestions!

Posted by Kaushik Kumar.

Thanks to Pradeepkumar Gayam for improving it

http://engineering.hackerearth.com/2016/02/11/sending-emails-to-our-half-million-and-growing-user-community

Analyzing submissions in real time for social media updates

Feb 2, 2016

Objective

In Jan 2015, HackerEarth conducted nearly 10-12 hiring challenges, 5-6 coding challenges and numerous college challenges. HackerEarth has a decent social media presence and we wanted to inform our followers about the events at HackerEarth. One of the main objectives of this project was to provide flexibility to the marketing team to automate simple jobs and to focus on sophisticated campaigns.

Design Goals

As a first step, we decided to post about our events and their highlights on twitter. We covered event reminders, start/end of contests, who scored the first AC and leaderboard updates at the end of a contest. We chose to do it by reading from Submissions, the biggest and meanest table in our database.

Challenges

The submissions table is a very large table. An additional query on the submissions table during peak hours was not favourable. Hence, we did not count the submissions in-place and instead queued them to be processed later.

Preventing duplicate tweets while maintaining state is also a challenge.

Solution

The application made a high volume of reads and few writes/updates, so any key/value store would do the job. We chose Redis over Memcached.

Redis offers data persistence in the event of node failure. This is very useful to avoid duplicate tweets, for instance, two different users being credited for the first AC submission in an event.
By setting a key expiry time and keeping the number of keys for a single event small, we prevented our Redis server from being overloaded.
We maintained a key in Redis to keep count of submissions for an event. Reading from the database was not recommended because it would make a read call per submission during peak time, and Redis performed faster reads.

The application is an asynchronous worker, and a payload containing submission_id and event_id is passed to it using Kafka. When the Redis key counter hits one of the magic numbers (1, 100, 500, multiples of 1000), the worker makes a DB query and posts a tweet.
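The milestone check itself is simple. Here is a minimal sketch in which a `Counter` stands in for the per-event Redis counter; `is_milestone` and `record_submission` are hypothetical names, not functions from the worker.

```python
from collections import Counter


def is_milestone(count):
    """True for the 1st, 100th and 500th submission and every multiple of 1000."""
    return count in (1, 100, 500) or (count > 0 and count % 1000 == 0)


submission_counts = Counter()  # stand-in for an incrementing Redis key per event


def record_submission(event_id):
    """Increment the event counter and report whether a tweet is due."""
    submission_counts[event_id] += 1
    count = submission_counts[event_id]
    return count, is_milestone(count)


milestones = []
for _ in range(2000):
    count, due = record_submission("event-1")
    if due:
        milestones.append(count)
```

Only the handful of submissions that hit a milestone trigger a database query; every other submission costs a single counter increment.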

The worker subscribes to a Kafka broker on the submission topic to receive the payloads pertaining to it. Here is the code of the worker.

import json
import traceback


class ConsumePostTweets(KafkaConsumer):
    def __init__(self):
        routing_key = KafkaConsumer.KAFKA_SUBMISSION_TOPIC
        self.redis_cache = get_redis(
            settings.REDIS_DATA_CONNECTION_URL)
        super(ConsumePostTweets, self).__init__(routing_key)

    def on_message(self, body):
        message = json.loads(body)
        submission_id = message.get('submission_id')
        if submission_id is None:
            # Nothing to process for this payload.
            autocommit_transaction()
            return
        try:
            message.update({"redis_cache": self.redis_cache})
            process_tweets(**message)
            autocommit_transaction()
        except Exception:
            # Log the traceback instead of crashing the consumer.
            log_tags = ["tweets", "queue", "consumer", str(submission_id)]
            tb = traceback.format_exc()
            silo.log(tb, tags=log_tags, type="LiveTweeter")
            autocommit_transaction()

We also run a Cron job to post event reminders.

import random
from datetime import datetime, timedelta

from django.db.models import Q


def post_challenge_reminder():
    """Task to post challenge reminders."""
    now = datetime.now()
    later_1_hour = now + timedelta(hours=1)
    # Events that start within the next hour.
    events = Event.objects.filter(
        Q(start__gt=now) & Q(start__lte=later_1_hour))
    events = [event for event in events if is_tweet_allowed_for_event(event)]
    for event in events:
        # Pick one of the reminder templates at random.
        tweet_grammar = random.choice(grammar.CHALLENGE_REMINDER_FEEDS)
        post_tweet(tweet_grammar, event)

As the number of challenges in a time slot increased, we needed to lower the number of tweets in a particular time interval. We queued our tweet payloads with a delay and posted them at intervals.
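One simple way to space queued tweets out is to give each one a posting time a fixed gap after the previous one. The sketch below shows that idea only; `schedule_tweets` and the five-minute gap are assumptions, since the post does not say how the delay was chosen.

```python
from datetime import datetime, timedelta


def schedule_tweets(tweets, start, min_gap=timedelta(minutes=5)):
    """Pair each tweet with a posting time, spacing consecutive tweets min_gap apart."""
    return [(start + index * min_gap, tweet) for index, tweet in enumerate(tweets)]


schedule = schedule_tweets(
    ["Contest A has started", "First AC in Contest B", "Contest C ends soon"],
    start=datetime(2016, 2, 2, 12, 0),
)
```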

Here is a screenshot of the app in production.

Epilogue

Finally, we wrapped it as a valentine’s day gift for the marketing team and they have been loving us more since that day.

Send an email to support@hackerearth.com for any bugs or suggestions. Posted by Sreeram Boyapati

http://engineering.hackerearth.com/2016/02/02/analyzing-submissions-realtime-for-social-media-updates

Beautiful Math Symbols

Feb 2, 2016

Introduction:

By nature, HackerEarth has many programming puzzles. These puzzles are bound to have mathematical equations, statements and symbols. Problem setters often requested us to add support for Latex.

Implementation:

The obvious and easy solution is to implement Latex support for individual pages. But that’s neither scalable nor maintainable. So we came up with a solution where we didn’t have to write custom code for every page. We examined the site and figured out three types of rendering that happen in the browser. Before going through the solutions case by case, let’s first understand how MathJax, the library we use for typesetting Latex, works.

Everything in MathJax works asynchronously using queues. After initializing, MathJax executes a set of tasks and then typesets the queued content. The MathJax docs explain it in detail.

Types of rendering:

Rendering of preview section in editor
Static content rendering (synchronous)
Rendering of dynamic content via Ajax (Asynchronous)

Read through the rest of the article for this categorization to make sense.

Editor:

We use the pagedown editor throughout our site. It has a preview section which displays the converted markdown content. This preview should also display typeset Latex content.

The pagedown editor has a feature called hooks. This is a mechanism for plugging external text processors in between various steps of markdown processing. We have written a hook for typesetting Latex macros. It is chained at the end of the hooks, after all the markdown processing is completed. This hook takes the markdown-processed content as input and spits out Latex-typeset content.

function renderLatex(text) {
    // Hidden scratch element that holds the content to typeset.
    // ("display:hidden" is not valid CSS; "visibility:hidden" hides the div.)
    var invisible_div = document.createElement("div");
    invisible_div.style.cssText = "visibility:hidden";
    invisible_div.id = "mathjax_text";
    invisible_div.innerHTML = text;
    document.body.appendChild(invisible_div);

    // Queue the element for typesetting.
    var elem = document.getElementById("mathjax_text");
    MathJax.Hub.Queue(["Typeset", MathJax.Hub, elem]);

    // Detach the element again and return its markup.
    var child_node = document.body.removeChild(elem);
    return child_node.innerHTML;
}

This function creates an invisible div, sets the input text as the content of this div, and queues the div in the MathJax queue. MathJax then typesets the div.

We do this because MathJax works asynchronously. It doesn’t take input text and spit out typeset text like normal functions do. Hence this workaround. Although, when you’re working with MathJax, this is not really a workaround; it is the ideal way.

Webpages:

The content that’s written in the editor is saved in the database and shown in the web pages. This content is displayed in the browser in two different ways.

Synchronous
Asynchronous aka Ajax

The way MathJax works, we can typeset the Latex content only after it has loaded into the page. Once the page has loaded, we typeset the whole page. But when the user interacts with the page, the content keeps changing. New content is added to the page via ajax. It is really inefficient to typeset the whole page every time a bit of content changes. Sometimes it even causes errors.

Synchronous:

This bit is easy. After the page loads completely, ask MathJax to render all the Latex content in the page. This can be achieved with the following piece of code.

<script type="text/javascript">
    window.addEventListener("load", function() {
        MathJax.Hub.Queue(["Typeset", MathJax.Hub]);
    });
</script>

Asynchronous:

In this case, instead of queueing the whole page for typesetting, we queue only the divs that are modified. Luckily (or rather, pragmatically), we use a very small set of javascript functions throughout the site for fetching ajax content. All we have to do is modify these functions so that they queue the modified divs for typesetting.

This is the function for typesetting divs.

function latexifyAjaxDiv(div) {
    setTimeout(function() {
        MathJax.Hub.Queue(["Typeset", MathJax.Hub, div[0]]);
    }, 10);
}

This function is called in ajax utility functions, right after setting content to the divs.

This approach makes almost all the pages in our site Latex compatible.

Posted by Pradeepkumar Gayam

Thanks to Ravi Ojha for improving it

http://engineering.hackerearth.com/2016/02/02/latex-support-using-MathJax

Profiling django views for SQL queries

Feb 1, 2016

We at HackerEarth regularly conduct 24-hour internal hackathons, usually once a month, to get familiar with new technologies and to come up with great ideas and hacks to increase our productivity. A hackathon project can be anything from creating a new product to creating tools which help our own development. In the hackathon in Dec 2015, I came up with the idea of creating a profiler which could tell which code inside django views causes SQL queries so that we can optimize them easily.

Initial thoughts

There are already many good django packages to profile views for SQL queries. One of them is django-toolbar which we already use. Django-toolbar is great but it shows all raw SQL queries which are happening inside a view and you have to analyze each query, see the whole traceback and figure out which line of code inside the view is triggering the query. This way, you can only figure out the line number of the code in a file, not the exact function or expression or attribute access which is causing the query. What I wanted was that a profiler should tell about the exact expressions which trigger the SQL queries.

A few solutions came to mind, like using tracebacks or manipulating the AST of the python function. Praveen had just told me about python’s ast module, which can parse and modify code at runtime, and I was fascinated by the idea of using it someday. I chose AST manipulation to implement the profiler; the trick here is to patch every function call, attribute access and other constructs. Here is what I thought:

Suppose there is a function f which internally calls get_user triggering a SQL query

@profile
def f(request):
    ...
    user = get_user(request)
    ...

and decorating the function f with profile decorator will manipulate its AST code. It will find all locations of function calls inside function body and will encapsulate it in our special function call_handler. So the function f’s definition will become

def f(request):
    ...
    user = call_handler(get_user, request)
    ...

call_handler will call the passed function with the given arguments. Before calling, it will start tracking SQL queries. After the function has been called, it will collect the SQL queries and store them with the function so that we can later see whether any SQL query was made during the call. Moreover, to collect queries I will have to patch the django code which makes SQL queries (I will talk about it later). Its definition would be something like

def call_handler(func, *args, **kwargs):
    sql_counter.start_tracking()
    result = func(*args, **kwargs)
    sql_queries = sql_counter.collect_queries()
    store_queries(func, sql_queries)
    return result

And then we’ll be able to see what calls inside function made SQL queries.
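To see the idea end to end without Django, here is a toy, self-contained version. `SqlCounter`, `store_queries` and the fake `get_user` are stand-ins invented for this sketch; the real profiler patches Django's query layer instead of calling `record` by hand.

```python
class SqlCounter:
    """Toy stand-in for the patched Django query layer."""

    def __init__(self):
        self._queries = []

    def start_tracking(self):
        self._queries = []

    def record(self, sql):
        self._queries.append(sql)

    def collect_queries(self):
        return list(self._queries)


sql_counter = SqlCounter()
queries_by_function = {}


def store_queries(func, sql_queries):
    """Remember which queries ran while `func` was being called."""
    queries_by_function.setdefault(func.__name__, []).extend(sql_queries)


def call_handler(func, *args, **kwargs):
    sql_counter.start_tracking()
    result = func(*args, **kwargs)
    store_queries(func, sql_counter.collect_queries())
    return result


def get_user(user_id):
    # Pretend this hits the database.
    sql_counter.record("SELECT * FROM auth_user WHERE id = %d" % user_id)
    return {"id": user_id}


def f(user_id):
    # What the decorated view looks like after the AST rewrite.
    return call_handler(get_user, user_id)


user = f(7)
```

This toy counter resets on every call, so it does not handle nested wrapped calls; a real implementation would need to track queries per call depth.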

Implementation

How to manipulate AST?

The python docs for ast don’t give much information on how to use it. Then I found Green Tree Snakes. I also found an example of converting python code to javascript using ast transformation in a stackoverflow answer. I found them pretty helpful.

I made the following improvements to the initial idea while implementing:

There are many cases other than function calls where python code gets executed, e.g. on attribute access, automatic coercion to bool in if/else if statements and automatic coercion to an iterator in for statements. So I had to cover them too.
The call_handler was making the whole function body unhygienic because any other usage of the name call_handler would clash with it. So I thought of making a namespace with a very unique name that nobody would define anywhere else. I came up with the word goofy, replaced the decorator profile with goofy.profile() and replaced call_handler with goofy.call_handler. The only thing that I have to import is goofy.
In goofy.call_handler I need to pass other information like lineno and colno too, along with the function being called.

Here is the code which transforms the AST

import ast

class Transformer(ast.NodeTransformer):
    def __init__(self, lineno):
        """ lineno is the actual position of function code in source file
        """
        self._lineno = lineno
        self._dec_lineno = 1

    def visit_FunctionDef(self, func):
        """ Remove decorators so that the decorators doesn't get applied more.
        """
        for dec in func.decorator_list:
            if (isinstance(dec, ast.Call) and isinstance(dec.func,
                    ast.Attribute) and isinstance(dec.func.value,
                    ast.Name) and dec.func.value.id == 'goofy' and
                    dec.func.attr == 'profile'):
                self._dec_lineno = dec.lineno
        func.decorator_list = []
        func_ast = self.generic_visit(func)
        return func_ast

    def visit_Call(self, call):
        """ Change the function calling syntax
        func(*args, **kwargs) will be transformed to
        goofy.call_handler(func, line_no, col_no, *args, **kwargs)
        """
        call_ast = self.generic_visit(call)
        call_ast.args.insert(0, call_ast.func)
        call_ast.func = ast.Attribute(
                value=ast.Name(id='goofy', ctx=ast.Load()),
                attr='call_handler', ctx=ast.Load())
        call_lineno = self._lineno - self._dec_lineno + call_ast.lineno
        call_colno = call_ast.col_offset + 1
        call_ast.args.insert(1, ast.Num(call_lineno))
        call_ast.args.insert(2, ast.Num(call_colno))
        return call_ast

There are many variants of statements and expressions in Python. You can find the details in the Abstract Grammar section of the ast docs. For every type of node that has to be changed while transforming the AST, a corresponding visit_<NodeClass> method has to be written in the Transformer class (like visit_FunctionDef and visit_Call in the code above). In the actual code, I also had to implement visit_Assign, visit_Attribute, visit_If, visit_BoolOp, visit_For and so on.
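To make the visit_<NodeClass> mechanism concrete, here is a minimal, self-contained sketch written for Python 3 (the article's code targets Python 2). It is not the goofy implementation: the call_handler function, the CallWrapper class and the sample source string are all hypothetical names chosen for illustration, and only the line number is passed along.

```python
import ast

calls_seen = []  # filled in by the hypothetical handler below


def call_handler(func, lineno, *args, **kwargs):
    """Hypothetical stand-in for goofy.call_handler: record, then call."""
    calls_seen.append((getattr(func, "__name__", repr(func)), lineno))
    return func(*args, **kwargs)


class CallWrapper(ast.NodeTransformer):
    def visit_Call(self, node):
        # Transform nested calls first, then wrap this one.
        self.generic_visit(node)
        wrapped = ast.Call(
            func=ast.Name(id="call_handler", ctx=ast.Load()),
            args=[node.func, ast.Constant(value=node.lineno)] + node.args,
            keywords=node.keywords,
        )
        return ast.copy_location(wrapped, node)


source = "result = max(len('abc'), 2)\n"
tree = CallWrapper().visit(ast.parse(source))
ast.fix_missing_locations(tree)  # give the new nodes line/column info

namespace = {"call_handler": call_handler}
exec(compile(tree, filename="<sketch>", mode="exec"), namespace)
```

After running this, namespace["result"] holds the value of the original expression, and calls_seen lists every call that went through the handler, innermost first.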

And here’s the code for the goofy class.

import ast
import inspect
from datetime import datetime
from functools import wraps

def goofy_profiler(f):
    frame, filename, line_number, _, lines, _ = inspect.stack()[1]
    if lines[0].startswith('def'):
        line_number -= 1
    source = inspect.getsource(f)
    decorator_lineno = source.count('\n', 0, source.index('@goofy.profile')+1) + 1
    tree = ast.parse(source)

    transformer = Transformer(line_number)
    transformed_tree = transformer.visit(tree)

    ast.fix_missing_locations(transformed_tree)
    ast.increment_lineno(transformed_tree, line_number - decorator_lineno)

    module_globals = inspect.getmodule(f).__dict__
    exec(compile(transformed_tree, filename=filename, mode="exec"), module_globals)
    func = eval(f.__name__, module_globals)
    func = deco(func)
    return func

def deco(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        SQLCounter.clear_data()
        SQLCounter.current_func(func)
        start_time = datetime.now()
        result = func(*args, **kwargs)
        timedelta = datetime.now() - start_time
        print 'Function: {}.{}'.format(
                func.__module__, func.func_name)
        print 'Total processing time: {} ms'.format(
                timedelta.total_seconds()*1000)
        SQLCounter.show_data()
        return result
    return wrapper

class goofy(object):
    @staticmethod
    def profile():
        return goofy_profiler

    @staticmethod
    def call_handler(func, lineno, colno,  *args, **kwargs):
        sargs = (" {}(..)".format(func.__name__), lineno, colno)
        SQLCounter.before(*sargs)
        result = func(*args, **kwargs)
        SQLCounter.after(*sargs)
        return result

I also had to patch the execute_sql method of SQLCompiler to collect SQL queries.

from datetime import datetime

from django.db.models.sql.compiler import SQLCompiler
from django.db.models.sql.datastructures import EmptyResultSet
from django.db.models.sql.constants import MULTI

def execute_sql(self, *args, **kwargs):
    try:
        q, params = self.as_sql()
        if not q:
            raise EmptyResultSet
    except EmptyResultSet:
        if kwargs.get('result_type', MULTI) == MULTI:
            return iter([])
        else:
            return
    start = datetime.now()
    try:
        return self.__execute_sql(*args, **kwargs)
    finally:
        d = (datetime.now() - start)
        SQLCounter.insert({
            'query' : q, 'type' : 'sql',
            'time' : 0.0 + d.seconds * 1000.0 + d.microseconds/1000.0
        })

SQLCompiler.__execute_sql = SQLCompiler.execute_sql
SQLCompiler.execute_sql = execute_sql
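The patching technique used above (keep a reference to the original method, then replace it with a wrapper that times the call) can be tried without Django. The sketch below is only an illustration under that assumption: Compiler is a hypothetical stand-in for SQLCompiler, and recorded plays the role of SQLCounter.

```python
import time


class Compiler(object):
    """Hypothetical stand-in for Django's SQLCompiler."""

    def execute_sql(self, query):
        return "rows for " + query


recorded = []  # plays the role of SQLCounter in this sketch


def timed_execute_sql(self, query):
    start = time.time()
    try:
        # Delegate to the original method saved below.
        return self._original_execute_sql(query)
    finally:
        # Runs even if the query raises, so failures are timed too.
        recorded.append({
            "query": query,
            "type": "sql",
            "time": (time.time() - start) * 1000.0,  # milliseconds
        })


# Keep the original around, then swap in the wrapper.
Compiler._original_execute_sql = Compiler.execute_sql
Compiler.execute_sql = timed_execute_sql

rows = Compiler().execute_sql("SELECT 1")
```

Callers keep using execute_sql as before, and every call leaves one entry in recorded.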

And here is the code of SQLCounter, which collects SQL queries and also records which queries were executed at each position inside a function.

from collections import defaultdict

from tabulate import tabulate

class SQLCounter(object):
    @classmethod
    def clear_data(cls):
        cls.current_code = ''
        cls.check = False
        cls.data = defaultdict(list)
        cls.hit_count = defaultdict(int)

    @classmethod
    def before(cls, activity, lineno, colno):
        cls.check = True
        cls.current_code = ("line no {:<4}: {}".format(lineno, activity),
                lineno)
        cls.hit_count[cls.current_code] += 1

    @classmethod
    def after(cls, activity, lineno, colno):
        cls.check = False

    @classmethod
    def insert(cls, data):
        cls.data[cls.current_code].append(data)

    @classmethod
    def current_func(cls, func):
        cls.func = func

    @classmethod
    def show_data(cls):
        if not cls.data:
            return
        data = sorted(cls.data.items(),key=lambda a: a[0][1])
        table = []
        headers = ['Location', 'Hit', 'Queries', 'Time (ms)', 'Avg Time (ms)']
        total_tm = 0.0
        total_qs = 0
        for current_code, qdata in data:
            activity, _ = current_code
            qs = len(qdata)
            total_qs += qs
            tm = sum(k['time'] for k in qdata)
            total_tm += tm
            avg_tm = tm / qs
            hit = cls.hit_count[current_code]
            table.append([activity, hit, qs, tm, avg_tm])
        print "Total SQL queries: {},  Total time: {} ms".format(
                total_qs, total_tm)
        tabular_table = tabulate(table, headers, tablefmt="simple")
        print tabular_table

Usage

There is a view get_bot_submission_response in our codebase, and we applied the goofy.profile() decorator to it in order to profile it.

from goofy import goofy

@goofy.profile()
def get_bot_submission_response(request, game):
    ...

When the view gets called, it prints the following output to the console.

Function: problems.views.get_bot_submission_response
Total processing time: 201.499 ms
Total SQL queries: 8,  Total time: 9.856 ms
Location                                        Hit    Queries    Time (ms)    Avg Time (ms)
--------------------------------------------  -----  ---------  -----------  ---------------
line no 100 : .user                               1          1        0.97            0.97
line no 100 : .player1                            1          1        1.38            1.38
line no 112 : .problem                            1          1        1.52            1.52
line no 119 :  player_2_name(..)                  1          2        2.193           1.0965
line no 138 :  get_game_data(..)                  1          2        2.244           1.122
line no 163 :  render_to_string(..)               1          1        1.549           1.549

Each .<attribute> entry gives the location of an attribute access that triggered SQL queries. .user, .player1 and .problem are attribute accesses, while player_2_name(..), get_game_data(..) and render_to_string(..) are function calls. I’ve used the tabulate package to pretty-print the table.

What’s next

Not only SQL but any type of query can be profiled with this model. We have also added support for profiling memcached queries to the goofy profiler. Many more improvements can be made to it. We’ll soon clean up the goofy profiler’s code and open source it on GitHub. Stay tuned!

Posted by Shubham Jain. You can follow me on twitter @shhaumb

http://engineering.hackerearth.com/2016/02/01/profiling-django-views

A/B testing using Django

Jan 29, 2016

Whenever we roll out an improvement on our platform, we at HackerEarth love to conduct A/B tests on it to understand which iteration helps our users use the platform better. Since the available third-party libraries did not quite meet our needs, we wrote our own A/B testing framework in Django. In this post we will share a few insights into how we accomplished this.

The basics

A lot of products, especially on web, use a method called A/B testing or split testing to quantify how well a new page or layout performs as compared to the old one. The crux of the method is to show layout ‘A’ to a certain set or bucket of users and layout ‘B’ to another set of users. The next step is to track user actions leading to certain milestones, which would provide critical data regarding the ‘effectiveness’ of both the pages or layouts.

Before we began writing code for the framework, we made a list of all the things that we wanted the framework to do -

Route users to multiple views (with different templates)
Route users to a single view with different templates
Make the views/templates stick for users
A/B test visitors who do not have an account on HackerEarth (anonymous users)
Sticky views/templates for anonymous users as well
Support for A/A/B or A/B/C/D…./n/ testing (just for the heck of it!)
Analytics

We went out to grab some pizza and beer, and when we got back we came up with this wire-frame -

A/B for Views

A/B for Templates

Getting the logic right

To begin with, we had to categorize our users into buckets. So all our users were assigned a bucket number ranging from 1 to 120. This numbering is not strict and the range can be arbitrary or as per your needs. Next, we defined two constants - the first one specifies which view a user is routed to, and the second one specifies the fallback or primary view. The tuples in the first constant are the bucket numbers assigned to users. The primary view in the second constant will be used when we do not want to A/B test on anonymous users.

AB_TEST = {
        tuple(xrange(1,61)): 'example_app.views.view_a',
        tuple(xrange(61,121)): 'example_app.views.view_b',
    }

AB_TEST_PRIMARY = 'example_app.views.view_a'

Next we wrote two decorators which we could wrap around views - one for handling views and the other for handling templates. In the first scenario, the decorator would take a dictionary of views i.e. the first constant that we defined, a primary view i.e. the second constant, and a boolean value which specifies if anonymous users should be A/B tested as well.

Here’s what the decorator essentially does for logged in users -

Get the user’s bucket number
Check which view is assigned to that bucket number
Return the corresponding view
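
The three steps for logged-in users amount to a lookup in the AB_TEST constant. Here is a minimal sketch of that lookup, written for Python 3 (hence range instead of xrange). The function name view_for_bucket is an assumption made for illustration, not part of the framework.

```python
AB_TEST = {
    tuple(range(1, 61)): 'example_app.views.view_a',
    tuple(range(61, 121)): 'example_app.views.view_b',
}
AB_TEST_PRIMARY = 'example_app.views.view_a'


def view_for_bucket(bucket, view_dict, primary_view):
    """Return the view path whose bucket tuple contains `bucket`."""
    for buckets, view in view_dict.items():
        if bucket in buckets:
            return view
    # Unknown bucket numbers fall back to the primary view.
    return primary_view


view_low = view_for_bucket(17, AB_TEST, AB_TEST_PRIMARY)
view_high = view_for_bucket(61, AB_TEST, AB_TEST_PRIMARY)
view_unknown = view_for_bucket(500, AB_TEST, AB_TEST_PRIMARY)
```

Falling back to the primary view for an unknown bucket is a design choice of this sketch. It mirrors how the decorator falls back when anything goes wrong.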

The flow is a bit different in case of anonymous users. If we do not want to perform A/B testing on anonymous users, then we just return the primary or fallback view that we had defined earlier. However, if we want to include anonymous users in the A/B tests, we need a couple of extra things to begin with -

Set a unique cookie for the user which is independent of the session
A simple and fast key-value pair storage e.g. Redis

Once we have these things in place, here’s what we need to do -

Get the user’s unique cookie
Check if a key exists in redis for that cookie value
If a key is found, get the value of the key and return it
If no key is found, choose a view randomly from the view dictionary
Set a key in redis corresponding to the user with the chosen view as value
Return the chosen view

Now, the A/B will work perfectly for anonymous users as well. Once an anonymous user gets routed to one of the views, that view will stick for him or her.
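
The sticky assignment for anonymous users can be sketched as follows. This is an illustration under simplifying assumptions: a plain dict stands in for Redis, the cookie is just a string, and sticky_view is a hypothetical helper name.

```python
import random


def sticky_view(cookie, store, view_dict):
    """Return the view assigned to this cookie, assigning one if needed."""
    view = store.get(cookie)
    if view is None:
        # No key yet: pick a view at random and remember the choice.
        view = random.choice(sorted(set(view_dict.values())))
        store[cookie] = view
    return view


AB_TEST = {
    tuple(range(1, 61)): 'example_app.views.view_a',
    tuple(range(61, 121)): 'example_app.views.view_b',
}

store = {}  # stand-in for Redis in this sketch
first = sticky_view('cookie-123', store, AB_TEST)
second = sticky_view('cookie-123', store, AB_TEST)
```

The second call returns whatever the first call chose, which is what makes the view stick for the visitor.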

Let’s dive into code

An example of the view decorator is given below -

import random
from functools import wraps

def ab_views(
        primary_view=None,
        anon_sticky=False,
        view_dict={}):
    """
    Decorator to A/B test different views.
    Args:
        primary_view:       Fallback view.
        anon_sticky:        Determines whether A/B testing should be performed on
                            anonymous users as well.
        view_dict:          A dictionary of views (as strings) with buckets as keys.
    """
    def decorator(f):
        @wraps(f)
        def _ab_views(request, *args, **kwargs):
            # if you want to do something with the dict returned
            # by the view, you can do it here.
            # ctx = f(request, *args, **kwargs)
            view = None
            try:
                if user_is_logged_in():
                    view = _get_view(request, f, view_dict, primary_view)
                else:
                    redis = initialize_redis_obj()
                    view = _get_view_anonymous(request, redis, f, view_dict,
                            primary_view, anon_sticky)
            except Exception:
                # on any failure, fall back to the primary view
                view = primary_view
            view = str_to_func(view)
            return view(request, *args, **kwargs)

        def _get_view(request, f, view_dict, primary_view):
            bucket = get_user_bucket(request)
            view = get_view_for_bucket(bucket)
            return view

        def _get_view_anonymous(request, redis, f, view_dict,
                primary_view, anon_sticky):
            view = None
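            if anon_sticky:
                cookie = get_cookie_from_request(request)
                # reuse the view stored in redis for this cookie, if any
                view = get_value_from_redis(cookie)
                if not view:
                    # no key found: pick a view and store it for next time
                    view = random.choice(list(view_dict.values()))
                    set_cookie_value_in_redis(cookie, view)
            else:
                view = primary_view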
            return view

        return _ab_views
    return decorator

The noteworthy piece of code here is the function str_to_func(), which returns a view object from a view path (string).

from importlib import import_module

def str_to_func(func_string):
    func = None
    func_string_splitted = func_string.split('.')
    module_name = '.'.join(func_string_splitted[:-1])
    function_name = func_string_splitted[-1]
    module = import_module(module_name)
    if module and function_name:
        func = getattr(module, function_name)
    return func

We can write another decorator for A/B testing multiple templates using the same view in a similar way. Instead of passing a view dictionary, pass a template dictionary and return a template.
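
A template decorator along those lines might look like the sketch below. It is framework-free so that it can be run on its own: the request is a plain dict, get_bucket is a hypothetical callable that extracts the bucket number, and the chosen template name is passed to the view as a keyword argument. None of these choices come from the original framework.

```python
from functools import wraps


def ab_templates(template_dict, primary_template, get_bucket):
    def decorator(view):
        @wraps(view)
        def wrapper(request, *args, **kwargs):
            template = primary_template
            try:
                bucket = get_bucket(request)
            except Exception:
                bucket = None  # fall back to the primary template
            for buckets, name in template_dict.items():
                if bucket in buckets:
                    template = name
                    break
            return view(request, *args, template=template, **kwargs)
        return wrapper
    return decorator


AB_TEMPLATES = {
    tuple(range(1, 61)): 'landing_a.html',
    tuple(range(61, 121)): 'landing_b.html',
}


@ab_templates(AB_TEMPLATES, 'landing_a.html',
              get_bucket=lambda request: request['bucket'])
def landing(request, template):
    # A real view would render the template; here we just report it.
    return 'rendered ' + template
```

Calling landing({'bucket': 75}) goes to the 'B' template, while a request without a bucket falls back to the primary one.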

Putting things together

Now, let’s assume that we have already written the ‘A’ and ‘B’ views which are to be A/B tested. Let’s call them ‘view_a’ and ‘view_b’. To get the entire thing working, we will write a new view. Let’s call this view ‘view_ab’. We will wrap this view with one of the decorators we wrote above and create a new URL that points to it. You may refer to the code snippet below -

@ab_views(
        primary_view=AB_TEST_PRIMARY,
        anon_sticky=True,
        view_dict=AB_TEST,
        )
def view_ab(request):
    ctx = {}
    return ctx

Just for the sake of convenience we require that this new view returns a dictionary.

Finally, we need to integrate analytics into this framework so that we have quantifiable data regarding the performance or effectiveness of both the views or layouts. We decided to use mixpanel at the JavaScript end to track user behaviour on these pages. You can also use any analytics or event tracking tool out there for this purpose.

This is just one of the ways you can do A/B testing using Django. You can always take this basic framework and improve it or add new features.

P.S.: If you want to experiment with A/A/B or A/B/C testing, all you need to do is change the first constant that we defined, i.e. AB_TEST.
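
As a rough illustration of such a change, the sketch below builds an AB_TEST-style dictionary for any number of variants by splitting the 120 buckets as evenly as possible. The helper split_buckets and the view_c path are hypothetical and written for Python 3.

```python
def split_buckets(views, total_buckets=120):
    """Map contiguous bucket ranges (starting at 1) to the given views."""
    size, remainder = divmod(total_buckets, len(views))
    mapping = {}
    start = 1
    for index, view in enumerate(views):
        # The first `remainder` variants take one extra bucket each.
        end = start + size + (1 if index < remainder else 0)
        mapping[tuple(range(start, end))] = view
        start = end
    return mapping


AB_TEST = split_buckets([
    'example_app.views.view_a',
    'example_app.views.view_b',
    'example_app.views.view_c',
])
```

For an A/A/B test, pass the 'A' view twice. The two bucket ranges are different keys, so both entries survive in the dictionary.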

Feel free to comment below or ping us at support@hackerearth.com if you have any suggestions!

Posted by Arindam Mani Das.
Follow me @arindammanidas

http://engineering.hackerearth.com/2016/01/29/ab-testing-using-django

Logging Javascript errors in production

Jan 29, 2016

We implemented a JavaScript logger to capture the pesky issues our users ran into while using the site. When we came up with the idea, we thought it would be a five-minute job: add a snippet and be done. But, as is well known, nothing is simple when it comes to real-world production issues.

After analyzing a lot of loggers, we decided to use errorception as our JavaScript logger. It was simple in its approach and it provided the data we needed. That was the easy part. Next came the integration. Yes, there was a snippet which we just had to paste into our base JavaScript file, but we had forgotten one thing: CORS.

We host our static files on S3 and they are served via fastly, because of which they are delivered through another domain. The error tracking snippet could not log the errors because ‘window.onerror’ would not return the necessary stack trace and information.

Many of you may have heard about CORS. It is a mechanism that allows restricted resources (e.g. fonts) on a web page to be requested from another domain outside the domain from which the resource originated. For security reasons, browsers restrict cross-origin HTTP requests initiated from within scripts. The only way to solve this and get errors posted to errorception was to allow CORS in the headers of the static files.

After some research (errorception also has a good blog on it), we finally managed to have the error logging implemented. Below are the steps we had to go through:

Configuring S3

S3 has this unnecessarily complicated “CORS configuration” that you need to create. Here are the steps to get that right:

Log into your AWS S3 console, select your bucket, and select “Properties”. S3 CORS configurations seem to apply at the level of the bucket, and not the file. I have no clue why.
Expand the “Permissions” pane, and click on “Add CORS configuration” or “Edit CORS configuration” depending on what you see.
You should already be provided with a default permission configuration XML. If not, use the following XML to get started.

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <CORSRule>
        <AllowedOrigin>*</AllowedOrigin>
        <AllowedMethod>GET</AllowedMethod>
        <MaxAgeSeconds>3000</MaxAgeSeconds>
        <AllowedHeader>Authorization</AllowedHeader>
    </CORSRule>
</CORSConfiguration>

You should look at Amazon’s docs to see what this configuration means.

Make sure you add the <?xml ?> declaration. If you omit it, Amazon will fail silently, showing you a happy-looking green tick!
Once you’ve saved the configuration, give it a couple of minutes.
Test if everything’s looking right. You could use a tool like curl to specify the additional headers needed for a “correct” CORS request:

$ curl -sI -H "Origin: example.com" -H "Access-Control-Request-Method: GET" https://s3.amazonaws.com/bucket/script.js

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2014 13:37:20 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET
Access-Control-Max-Age: 3000
Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
Cache-Control: max-age=604800, public
...snip...

You should see the “Access-Control-Allow-Origin: *” header, and the “Vary: Origin” header in the output. If you do, you’re golden.

Configuring Fastly

For configuring fastly, we just had to follow a few simple steps and then we were done:

Set up a custom HTTP header via the Content pane for your service.
Then, create a custom header with the following information that adds the required “Access-Control-Allow-Origin” header for all requests.

Name            -   CORS S3 Allow
Type/Action     -   Cache       Set
Destination     -   http.Access-Control-Allow-Origin
Source          -   "*"
Ignore if Set   -   No
Priority        -   10

Finally test it out: Running the command curl -I your-hostname.com/path/to/resource should include similar information to the following in your header:

Access-Control-Allow-Origin: http://your-hostname.tld
Access-Control-Allow-Methods: GET
Access-Control-Expose-Headers: Content-Length, Connection, Date...

Once we had “Access-Control-Allow-Origin” set up for both S3 and Fastly, one major thing was left to do: adding crossorigin='anonymous' to all our script tags. For this we used a simple regex to modify the existing script tags in all the files:

Find:       (<script)(.*?)(STATIC_URL)(.*?)(.js)(.*?)("|')(>)

Replace:    $1$2$3$4$5$6$7 crossorigin='anonymous'$8
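
For readers who prefer to script the change, the same find/replace can be expressed with Python's re module. This is a sketch rather than the exact procedure we ran: the function name add_crossorigin and the sample tag are made up for illustration, Python writes backreferences as \1 instead of $1, and the dot in .js is escaped here so that it only matches a literal dot.

```python
import re

# Same eight groups as the Find pattern above, with the dot escaped.
SCRIPT_TAG = re.compile(
    r"""(<script)(.*?)(STATIC_URL)(.*?)(\.js)(.*?)("|')(>)"""
)


def add_crossorigin(html):
    """Insert crossorigin='anonymous' before the closing '>' of script tags."""
    return SCRIPT_TAG.sub(r"\1\2\3\4\5\6\7 crossorigin='anonymous'\8", html)


before = '<script src="{{ STATIC_URL }}js/base.js"></script>'
after = add_crossorigin(before)

untouched = add_crossorigin('<script src="/local/app.js"></script>')
```

Script tags that do not reference STATIC_URL are left alone. Note that running the substitution twice on the same file would add the attribute twice, so this is meant as a one-off pass.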

After this we just bumped the version of all js files (to clear cache) and then we were done. We finally had the js logger implemented.

This helped us to identify the javascript issues in realtime and make the user experience better across all browsers.

And as always, *Happy Coding!*

References:
http://blog.errorception.com/2012/12/catching-cross-domain-js-errors.html
http://blog.errorception.com/2014/11/enabling-cors-on-amazon-cloudfront-with.html
https://docs.fastly.com/guides/performance-tuning/enabling-cross-origin-resource-sharing

Send an email to support@hackerearth.com for any bugs or suggestions.
Posted by Shivindera Singh.

http://engineering.hackerearth.com/2016/01/29/hackerearth-logging-javascript-errors-production

Managing roles and access control in a web application

Jan 29, 2016

HackerEarth Recruit is a platform for technical recruitment. Many companies use this platform for candidate assessments and interviewing. There can be multiple admins for a company account. As teams grow in size, access control becomes a special concern for applications that deal with financial and private data. Access control is concerned with determining the allowed activities of legitimate users. We required something more sophisticated: a control that mediates every attempt by a user to access a resource in the application, based on the sensitivity level of the various features.

A state of access control is said to be safe if no permission can be leaked to an unauthorized or uninvited principal.

We figured that the simplest solution to restrict access was to use an ACL.

What is an ACL?

An access control list (ACL), with respect to a computer file system, is a list of permissions attached to an object. An ACL specifies which users or system processes are granted access to objects, as well as what operations are allowed on given objects.

Many kinds of systems implement ACLs, or have done so historically, for example filesystem ACLs and networking ACLs.

A filesystem ACL is a data structure (usually a table) containing entries that specify individual user or group rights to specific system objects such as programs, processes, or files.

In networking, an ACL refers to rules that are applied to port numbers or IP addresses available on a host or other layer 3 device, each with a list of hosts and/or networks permitted to use the service.

For Recruit, the approach had to be role-based access restriction for authorized admins. This access control mechanism is defined around roles and privileges.

Implementation (Python/Django)

Access control lists can be configured to map roles to features. In this ACL implementation, roles are named after the existing features that require access control. Each access right should have a unique name and a unique value.

The example code snippets below are self explanatory.

acl.py

# defining account admin roles based on the required criteria.
SUPERADMIN = 1
TEST_ADMIN = 2
INTERVIEW_ADMIN = 3
LIBRARY_ADMIN = 4

# access permissions are mapped to human readable names.
COMPANY_ADMIN_ROLES = {
    SUPERADMIN: 'Super Admin',
    TEST_ADMIN: 'Tests Admin',
    INTERVIEW_ADMIN: 'Interviews Admin',
    LIBRARY_ADMIN: 'Library Admin',
}

# used as variable names in context processors, explained below.
MAP_ROLE_ID_NAME = {
    SUPERADMIN : 'SUPERADMIN',
    TEST_ADMIN : 'TEST_ADMIN',
    INTERVIEW_ADMIN : 'INTERVIEW_ADMIN',
    LIBRARY_ADMIN : 'LIBRARY_ADMIN',
}

acl.py

To retrieve the roles assigned to a given account admin, we wrote a utility. If the given admin is a SuperAdmin, all the roles are returned, since a SuperAdmin has access to all the features.

def get_company_admin_roles(user):
    roles = []
    admin = user.admin

    if admin is not None:
        if SUPERADMIN in admin.roles_list:
            # a SuperAdmin implicitly holds every role
            roles = list(COMPANY_ADMIN_ROLES.keys())
        else:
            roles = admin.roles_list
    return roles

decorators.py

At the view level, access restriction is handled by wrapping views with a decorator that checks access permissions. The decorator raises a 404 if the admin has no access permission for the feature being accessed.

from functools import wraps

from django.contrib.auth.decorators import login_required
from django.http import Http404

# assumes acl.py (shown above) is importable as acl
from acl import SUPERADMIN, get_company_admin_roles


def has_admin_access(role):
    def decorator(f):
        @wraps(f)
        @login_required
        def _company_acl(request, *args, **kwargs):

            # Check whether the admin is a superadmin or has
            # permission to access the view
            roles = get_company_admin_roles(request.user)
            if SUPERADMIN in roles or role in roles:
                return f(request, *args, **kwargs)

            # the admin has no access, so respond with a 404
            raise Http404
        return _company_acl
    return decorator
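
The role check itself does not depend on Django, so it can be tried in isolation. The sketch below is a simplified illustration: AccessDenied stands in for Http404, users are plain dicts, and has_role and roles_of are hypothetical names.

```python
from functools import wraps

SUPERADMIN, TEST_ADMIN, INTERVIEW_ADMIN, LIBRARY_ADMIN = 1, 2, 3, 4


class AccessDenied(Exception):
    """Stand-in for django.http.Http404 in this sketch."""


def has_role(required_role, get_roles):
    def decorator(view):
        @wraps(view)
        def wrapper(user, *args, **kwargs):
            roles = get_roles(user)
            # A superadmin passes every check.
            if SUPERADMIN in roles or required_role in roles:
                return view(user, *args, **kwargs)
            raise AccessDenied(view.__name__)
        return wrapper
    return decorator


def roles_of(user):
    return user["roles"]


@has_role(LIBRARY_ADMIN, get_roles=roles_of)
def library(user):
    return "library page"
```

A user holding LIBRARY_ADMIN or SUPERADMIN gets the page, and anyone else triggers AccessDenied.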


views.py

from acl import LIBRARY_ADMIN
from decorators import has_admin_access

# decorator check before processing the request.

@has_admin_access(LIBRARY_ADMIN)
def library(request):

    template = 'library.html'
    ...
    ...

In the Recruit app, menu options and page contents are also customized based on account admin roles, hence the need to implement access restriction at the template level too.

This is achieved by writing a context processor which makes the account admin roles available as variables to the templates. This can also be done at the view level, but that violates the DRY principle.

context_processors.py

def company_admin_roles(request):
    return_dict = {}

    admin = request.user.admin
    admin_roles = admin.roles_list

    # if admin is superadmin set all roles to true

    if SUPERADMIN in admin_roles:
        for key, value in MAP_ROLE_ID_NAME.items():
            return_dict.update({value: True})
    else:
        for role in admin_roles:
            return_dict.update({MAP_ROLE_ID_NAME[role]: True})

    return return_dict
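
To see what the templates receive, here is the same mapping logic as a plain function that can be run without Django. The name roles_to_context is an assumption made for this sketch. The constants repeat the ones from acl.py.

```python
SUPERADMIN, TEST_ADMIN, INTERVIEW_ADMIN, LIBRARY_ADMIN = 1, 2, 3, 4

MAP_ROLE_ID_NAME = {
    SUPERADMIN: 'SUPERADMIN',
    TEST_ADMIN: 'TEST_ADMIN',
    INTERVIEW_ADMIN: 'INTERVIEW_ADMIN',
    LIBRARY_ADMIN: 'LIBRARY_ADMIN',
}


def roles_to_context(role_ids):
    """Turn a list of role ids into {'ROLE_NAME': True} template flags."""
    if SUPERADMIN in role_ids:
        # A superadmin gets every flag.
        role_ids = MAP_ROLE_ID_NAME.keys()
    return {MAP_ROLE_ID_NAME[role]: True for role in role_ids}


library_only = roles_to_context([LIBRARY_ADMIN])
everything = roles_to_context([SUPERADMIN])
```

Roles the admin does not hold are simply absent from the dictionary, and Django templates treat a missing variable as false inside an if tag.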

In templates, the context variables can be used to check access permissions. Refer to the code below:

menu.html

<ul>
    <li><div class="">HackerEarth</div></li>
    <li><a href="">Home</a></li>

    {% if TEST_ADMIN %}
    <li><a href="">Tests</a></li>
    {% endif %}

    {% if LIBRARY_ADMIN %}
    <li><a href="">Questions Library</a></li>
    {% endif %}


</ul>

Posted by Aishwarya Reddy.

http://engineering.hackerearth.com/2016/01/29/managing-roles-and-access-control

Smart suggestions with Django, Elasticsearch and Haystack

Jan 29, 2016

Introduction

One of the primary issues when gathering information from users is suggesting the right options that they are looking for. At HackerEarth, we gather information from all our developers, which helps us provide them a better experience. So there came a time when we had to suggest very smartly to our users! :D

When humongous amounts of data have to be indexed and suggested intelligently, one of the efficient ways to do it is by using an inverted index. An inverted index is basically a map from the words that appear in documents to the list of documents each word is found in. Popular Lucene-based search servers like Elasticsearch and Solr are tools to maintain large inverted indexes and provide an efficient means to look up documents.
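
As a toy illustration of the idea (not of how Lucene stores its index), an inverted index can be built in a few lines of Python. The sample documents and the function name build_inverted_index are made up for this sketch.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


documents = {
    1: "Python developer in Bangalore",
    2: "Java developer",
    3: "Python and hiking",
}
index = build_inverted_index(documents)
python_docs = index["python"]
```

Looking up a word is a single dictionary access, however many documents were indexed. That is what makes the structure attractive for suggestions.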

Here is an example from the profiles page on HackerEarth.

We use Elasticsearch to index millions of documents with various fields. Two hurdles to be crossed while solving this problem are latency and relevance. Relevant documents have to be suggested to the user while keeping the time taken to retrieve them (i.e. latency) as low as possible. Elasticsearch uses analyzers that help in achieving good relevance, but only if they are used wisely. It also allows us to build our own custom analyzers. So by studying the user input, well-designed analyzers can be built to increase relevance. A simple example of a document can be something like,

{ 
    '_id' : 'AVJUN6QaLYvICHZxvYEq',
    'username': 'ksrvtsa',
    'location': 'Bangalore',
    'hobbies': ['music', 'reading', 'hiking'],
}

So what are analyzers?

An analyzer converts the text to be indexed and creates lookups for finding the text when needed using appropriate search terms. An analyzer is composed of a tokenizer that splits your text into multiple tokens which is followed by many token filters which modify, delete or add new tokens. The tokenizer can be preceded by character filters which modify the text before passing it to the tokenizer.

Every field in a document has an index analyzer and a search analyzer. The index analyzer is used while the text for that field is being indexed for a particular document. And the search analyzer is used when a search is being made for documents based on that particular field. These analyzers for all the fields can be provided using the mapping for the particular index type in the index. Various combinations of these tokenizers, token filters and character filters can be used to build custom analyzers in the settings. An example of a mapping and a setting are,

"mappings": {
    "user": {
        "properties": {
            "username": {
                "type": "string",
                "index_analyzer": "name_analyzer",
                "search_analyzer": "standard"
                },
            "email_stub":{
                "type": "string",
                "analyzer": "name_analyzer"
            }
        }
    }
}

"settings": {
    "analysis": {
        "filter": {
            "custom_ngram": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 10
            },
            "custom_edge_ngram": {
                "type": "edgeNGram",
                "min_gram": 4,
                "max_gram": 8,
                "side": "front"
            }
        },
        "analyzer": {
            "name_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["custom_ngram"]
            },
            "name_edge_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["custom_edge_ngram"]
            }
        }
    }
}

By default, Elasticsearch uses the Standard analyzer for indexing and searching. The Standard analyzer consists of the Standard Tokenizer with the Standard Token Filter, Lower Case Token Filter and Stop Token Filter. It splits the text on word boundaries (whitespace and most punctuation) and converts all tokens to lowercase.

An example of its usage

curl 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'This is HackerEarth'

{
    "tokens" : [ {
        "token" : "this",
        "start_offset" : 0,
        "end_offset" : 4,
        "type" : "<ALPHANUM>",
        "position" : 1
    }, {
        "token" : "is",
        "start_offset" : 5,
        "end_offset" : 7,
        "type" : "<ALPHANUM>",
        "position" : 2
    }, {
        "token" : "hackerearth",
        "start_offset" : 8,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 3
    } ]
}

Notice that the text is split on spaces and then converted to lowercase.

So the tokens generated are ‘this’, ‘is’, ‘hackerearth’, but unless the user queries with these words Elasticsearch will not look up the document. So to increase the discoverability and the relevancy of the search Ngrams and Edge Ngrams are used. The topic below explains them in detail.

Elasticsearch provides many filters, tokenizers and analyzers. So go ahead and read about them as Elasticsearch gives complete freedom to mash them up to build our own analyzers.

The secret sauce!

Ngrams and Edge Ngrams are the secret ingredients when it comes to suggesting the right document based on a user query. So what are they? Wikipedia explains Ngrams as a contiguous sequence of n items from a given sequence of text or speech. They are basically a set of co-occurring letters in a piece of text in the case of Elasticsearch. For example,

If N = 4, and the text is 'hackerearth', the ngrams generated are,

        'hack'  'acke'  'cker'
        'kere'  'erea'  'rear'
        'eart'  'arth'

Elasticsearch provides both an Ngram tokenizer and an Ngram token filter, which split a token into its ngrams for lookup.

In the settings example shown above, a custom Ngram analyzer is created with an Ngram filter. Note the two parameters that are provided, min_gram and max_gram. These are the minimum and maximum sizes of the ngrams that are generated for the lookup tokens. For example,

If min_gram = 4 and max_gram=6, and the text is “hackerearth”, the ngrams generated are, 

    ‘hack’  ‘acke’  ‘cker’      ( N = 4 )
    ‘kere’  ‘erea’  ‘rear’
    ‘eart’  ‘arth’

    ‘hacke’ ‘acker’ ‘ckere’     ( N = 5 )
    ‘kerea’ ‘erear’ ‘reart’
    ‘earth’

    ‘hacker’ ‘ackere’ ‘ckerea’  ( N = 6 )
    ‘kerear’ ‘ereart’ ‘rearth’

Notice that ngrams are generated for sizes 4, 5 and 6.
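The expansion above can be reproduced with a few lines of Python. This is only an illustrative sketch of what an ngram filter produces for a single token; the function name `ngrams` is made up for this example and is not part of Elasticsearch.

```python
def ngrams(text, min_gram, max_gram):
    """Return every contiguous substring of `text` whose length is between
    min_gram and max_gram (inclusive), shortest sizes first."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        # Slide a window of `size` characters across the text.
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


if __name__ == "__main__":
    for gram in ngrams("hackerearth", 4, 6):
        print(gram)
```

For an 11-letter token this yields 8 + 7 + 6 = 21 ngrams, which is why a wide min_gram/max_gram range can noticeably increase index size.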

The only difference between Edge Ngram and Ngram is that the Edge Ngram generates the ngrams from one of the two edges of the text, and these are used for the lookup. Elasticsearch provides an Edge Ngram filter and a tokenizer which again do the same thing, and either can be used depending on how you design your custom analyzer. Edge Ngrams take an extra parameter “side” which denotes the side of the text from which the ngrams are generated. An example is provided in the settings above. An edge ngram example,

If the text is ‘hacker’, min_gram is 2, max_gram is 6 and side is left.
The ngrams generated are,

    ‘ha’, ‘hac’, ‘hack’, ‘hacke’, ‘hacker’
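As a rough illustration of the same idea in Python (again a sketch, with a hypothetical function name and a hypothetical `from_front` flag standing in for the “side” parameter):

```python
def edge_ngrams(text, min_gram, max_gram, from_front=True):
    """Return the ngrams anchored at one edge of `text`.

    from_front=True takes prefixes, from_front=False takes suffixes.
    """
    grams = []
    longest = min(max_gram, len(text))
    for size in range(min_gram, longest + 1):
        grams.append(text[:size] if from_front else text[-size:])
    return grams


if __name__ == "__main__":
    print(edge_ngrams("hacker", 2, 6))
```

Because every edge ngram starts at the same edge, this form suits prefix-style autocomplete, where the user types the beginning of a word.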

So for an intelligent way to suggest documents to the user, Ngrams or Edge Ngrams can be used to create custom analyzers for indexing and querying on the fields of the document type.

Deployment

For deployment, we have used Haystack to index the models and query the index. Haystack provides an easy way of creating, updating, building and rebuilding indexes. Since some of the fields require their own analyzers for indexing and searching, we have created custom fields for the search indexes.

from haystack import indexes
from dummy_app.models import Dummy

class CustomNgramField(indexes.CharField):
    field_type = 'ngram'

    def __init__(self, **kwargs):
        # Pull out the analyzer options before CharField sees them,
        # since the base class does not accept these keyword arguments.
        if kwargs.get('search_analyzer'):
            self.search_analyzer = kwargs.pop('search_analyzer')
        if kwargs.get('index_analyzer'):
            self.index_analyzer = kwargs.pop('index_analyzer')

        super(CustomNgramField, self).__init__(**kwargs)

class DummyIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    dummy_field = CustomNgramField(model_attr='dummy_field',
                                    index_analyzer='analyzer_1',
                                    search_analyzer='analyzer_2')
    
    def get_model(self):
        return Dummy

    def index_queryset(self, using=None):
        return Dummy.objects.all()

To create our own custom analyzers, we have overridden the build_schema function by creating a custom backend for Elasticsearch. The backend inherits from ElasticsearchSearchBackend, and the ‘DEFAULT_SETTINGS’ parameter is set to our custom Elasticsearch settings. This creates all our custom analyzers for usage.

from haystack.backends import BaseEngine
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend 
from haystack.backends.elasticsearch_backend import ElasticsearchSearchQuery
from myapp.settings import CUSTOM_ELASTIC_SETTINGS   
                                                                            
                                                                            
class CustomElasticSearchBackend(ElasticsearchSearchBackend):                  
                                                                            
    def __init__(self, connection_alias, **connection_options):                
        super(CustomElasticSearchBackend, self).__init__(connection_alias,     
                                                        **connection_options) 
                                                                            
        setattr(self, 'DEFAULT_SETTINGS', CUSTOM_ELASTIC_SETTINGS)         
                                                                            
    def build_schema(self, fields):                                            
        content_field_name = ''                                                
        mapping = {}                                                           
                                                                            
        content_field_name, mapping = super(CustomElasticSearchBackend,        
                                            self).build_schema(fields)         
                                                                            
        for field_name, field_class in fields.items():                         
            field_mapping = mapping[field_class.index_fieldname]               
            if hasattr(field_class, 'index_analyzer'):                         
                field_mapping['index_analyzer'] = field_class.index_analyzer   
                if 'analyzer' in field_mapping:                                
                    del(field_mapping['analyzer'])                             
            if hasattr(field_class, 'search_analyzer'):                        
                if 'analyzer' in field_mapping:                                
                    del(field_mapping['analyzer'])                             
                field_mapping['search_analyzer'] = field_class.search_analyzer 
            mapping[field_class.index_fieldname] = field_mapping               
                                                                            
        return (content_field_name, mapping)                                   
                                                                            
                                                                            
class CustomElasticsearchSearchEngine(BaseEngine):                             
    backend = CustomElasticSearchBackend                                       
    query = ElasticsearchSearchQuery

Now that all of this is set up, index your data and suggest smartly!

Going ahead, we plan to deploy this site wide and make suggestions better by analysing the user input for creating new options in the dropdowns.

Send an email to support@hackerearth.com for any bugs or suggestions.
Posted by Karthik Srivatsa

http://engineering.hackerearth.com/2016/01/29/smart-sugesstions-with-elasticsearch

HackerEarth Question Library: Stats, Usage Analysis and Health

Oct 31, 2015

We, at HackerEarth, offer a huge number of questions in our Assessment tool. Recruiters can choose from a wide variety of Multiple Choice and Programming/Coding questions to assess candidates. Every week or two, new questions are added to the Questions Library. As time passed, thousands of questions stacked up and recruiters started to have a hard time figuring out which questions to choose from such a huge library.

We, the developers, work closely with our sales team to understand recruiters’ needs. Sometimes we get in touch with recruiters directly to provide on-call technical support and to understand how they use the product and what improvements would make it easier to use. That’s how we figured it was about time we helped recruiters find the best questions in our library, and hence we released a feature called “Health”.

What does it mean?

Health, in the context of a library question, indicates how usable the question is. It is not only about the question’s difficulty level. We consider various factors when determining the Health of a question, such as the number of users who attempted it, the number who solved it, how many times it has been used, and when it was last used. In short: the higher the health, the more usable the question.

How do we calculate the Health value?

The whole process was split into 3 parts:

Data Segregation and Analysis
Data Structure to store Health data
Mathematical formula to calculate Health value

1. Data Segregation and Analysis

We tried to gather as much data as possible for each question, and then filtered out the items that could help us find the health value:

Number of tests in which the question was used
When was the question last used in any test
Question accuracy (Number of users who solved it correctly / Number of users attempted)
Problem ratings (Ratings submitted by user for any question)
Problem tags
When was the question added
How frequently the question is used

We figured that the most heavily weighted factors in the Health of a question were question accuracy, the number of times the question has been used, and when the question was last used. We stuck to only these 3 factors to keep things easy and started to build a basic version of question health.

2. Data Structure to store Health data

Each question can be used in many tests. Every time we load questions in the library, we would have to calculate the health of each of them by analyzing their use in all those tests. So a brute-force approach is probably not a good idea, considering the number of database hits it would make.

Our requirement is that it should be possible to get health data of a list of questions in single query. This calls for a simple generic Health model that goes somewhat like this:

class Health(Base, Generic):
    """
    Generic model to store health of any object
    """
    percentage = models.FloatField(
        validators = [MinValueValidator(0.0), MaxValueValidator(100.0)],
        default=0.0, db_index=True)
    usage_data = models.ForeignKey(ProblemUsageData, null=True, blank=True)

    class Meta:
        verbose_name = 'Health'

This Health model is generic, meaning that it can be used for any object of any model. Next, we need a helper model from which we can populate the Health model whenever we want. The helper model will be specific to objects of different models. Such a structure would help us in the future, in case we want to extend the Health feature.

For question health, we create a helper model, ProblemUsageData, which stores question analytics based on the factors we discussed earlier. We update this model as and when a question is used in any test. Later, to update the Health model, we formulate a small mathematical equation using the attributes of the ProblemUsageData model.

class ProblemUsageData(Base, Generic):
    """
    Generic model to store usage data of any question in library
    """
    times_used = models.PositiveIntegerField(default=0, db_index=True)

    # Last used in Event can be used to find the last used date
    last_used_event = models.ForeignKey(Event, null=True, blank=True)

    # Following two fields can be used to calculate accuracy
    users_attempted = models.PositiveIntegerField(default=0, db_index=True)
    users_solved = models.PositiveIntegerField(default=0, db_index=True)

That’s all with the data structure, let’s move on to Health calculation.

3. Mathematical formula to calculate Health value

The three factors that we consider while generating health can be obtained through ProblemUsageData model as follows:

First comes accuracy, which is as simple as this:

accuracy = (users_solved * 100.0) / users_attempted

Next, we have times_used, which we multiply by a constant that can be found from the following graph. Pick the range according to the question accuracy and look up the respective multiplier. This graph was generated after a lot of trial and error, and it differs for different types of questions such as Multiple Choice, Programming/Coding etc.

# get_question_accuracy_weight is just a utility function which gets the multiplier weight
times_used = times_used*get_question_accuracy_weight(accuracy, question_type)

Lastly, from the last_used_event field of ProblemUsageData we get when the question was last used. We have a threshold cooldown period, in days, after which a question is safely reusable again.

delta = datetime.now() - last_used_event.timestamp
# We have kept HEALTH_COOLDOWN_DAYS as 30 as of now
days_factor = (delta.days - HEALTH_COOLDOWN_DAYS)

Finally, we calculate the health percentage directly with the following formula:

# Final health_percentage by multiplying above factors with some weightage.
# acc_factor is presumably the accuracy expressed as a fraction (accuracy / 100).
health_percentage = acc_factor*100*0.65 - times_used*0.15 + days_factor*0.20

The multiplying factors 0.65, 0.15 and 0.20 were again determined through trial and error on known data sets. We also check that none of these factors exceeds 100.
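Putting the three factors together, the calculation might look roughly like the sketch below. Several things here are assumptions rather than the production code: the function name and signature are hypothetical, the graph-derived multiplier is passed in as a plain `accuracy_weight` argument instead of calling get_question_accuracy_weight, each factor is capped at 100 as one reading of the check mentioned above, and the final score is clamped to the 0-100 range that the Health model validates.

```python
from datetime import datetime, timedelta

HEALTH_COOLDOWN_DAYS = 30  # value quoted in the post


def health_percentage(users_solved, users_attempted, times_used,
                      accuracy_weight, last_used, now=None):
    """Combine accuracy, weighted usage and recency into one score."""
    now = now or datetime.now()

    # Accuracy as a percentage; guard against questions nobody attempted.
    if users_attempted:
        accuracy = users_solved * 100.0 / users_attempted
    else:
        accuracy = 0.0

    # Usage weighted by the (assumed) accuracy-dependent multiplier.
    usage = times_used * accuracy_weight

    # Days elapsed beyond the cooldown window (negative while cooling down).
    days_factor = (now - last_used).days - HEALTH_COOLDOWN_DAYS

    # Assumption: no single factor may exceed 100.
    accuracy = min(accuracy, 100.0)
    usage = min(usage, 100.0)
    days_factor = min(days_factor, 100.0)

    score = accuracy * 0.65 - usage * 0.15 + days_factor * 0.20

    # Assumption: keep the result inside the range the model accepts.
    return max(0.0, min(100.0, score))


if __name__ == "__main__":
    today = datetime(2015, 10, 31)
    print(health_percentage(50, 100, 10, 2.0, today - timedelta(days=60), now=today))
```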

Now that we are done with all the calculation, we can show data using one simple query which looks like this:

Health.objects.filter(content_type_id=<id of question type>, object_id__in=<question ids>)

Final data at UI level looks like this, on hover we show some stats about the question:

Got a burning question you want answered? Ask it in the comments or mail me at ravi[at]hackerearth[dot]com.

Posted by Ravi Ojha.

http://engineering.hackerearth.com/2015/10/31/hackerearth-question-library-data-analysis-and-health

Logging millions of requests everyday and what it takes

Feb 26, 2015

HackerEarth’s web servers handle millions of requests every day. These request logs can be analyzed to mine useful insights as well as metrics critical for the business, for example the number of views per day, the number of views per sub-product, the most popular user navigation flow etc.

Initial Thoughts

HackerEarth uses Django as its primary web development framework, along with a host of other components which have been customized for performance and scalability. During normal operations, our servers handle 80-90 requests/sec on average, and this surges to 200-250 requests/sec when multiple contests overlap in time. We needed a system which could easily scale to a peak traffic of 500 requests/sec. This system should also add minimal processing overhead to the web servers, and the data collected should be stored for crunching and offline processing.

Architecture

The diagram above shows a high level architecture of our request log collection system. The solid connection lines represent the data flow between different components and the dotted lines represent the communications. The whole architecture is message based and stateless and so individual components can easily be removed/replaced without any downtime.

Below is a more detailed explanation about each component in the order of data flow.

Web Servers

On the web servers, we employ a Django middleware that asynchronously retrieves the required data for a given request and then forwards it to the Transporter Cluster servers. This is done using a thread, and the middleware adds an overhead of 2 milliseconds to the Django request/response cycle.

class RequestLoggerMiddleware(object):
    """
    Logs data from requests
    """
    def process_request(self, request):
        if settings.LOCAL or settings.DEBUG:
            return None

        # Always define is_ajax so the assignment below cannot raise NameError
        # for non-AJAX requests.
        is_ajax = request.is_ajax()
        request.META['IS_AJAX'] = is_ajax

        before = datetime.datetime.now()

        DISALLOWED_USER_AGENTS = ["ELB-HealthChecker/1.0"]


        http_user_agent = request.environ.get('HTTP_USER_AGENT','')

        if http_user_agent in DISALLOWED_USER_AGENTS:
            return None

        # this creates a thread which collects required data and forwards
        # it to the transporter cluster
        run_async(log_request_async, request)
        after = datetime.datetime.now()

        log("TotalTimeTakenByMiddleware %s"%((after-before).total_seconds()))
        return None
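The run_async helper is not shown in the post. A minimal sketch of what such a helper could look like, assuming it simply starts a daemon thread and returns without waiting for it, is:

```python
import threading


def run_async(func, *args, **kwargs):
    """Run func(*args, **kwargs) in a background daemon thread.

    Returns the Thread object immediately; the caller does not wait
    for the work to finish.
    """
    thread = threading.Thread(target=func, args=args, kwargs=kwargs)
    # Daemon threads do not keep the process alive on shutdown.
    thread.daemon = True
    thread.start()
    return thread


if __name__ == "__main__":
    collected = []
    worker = run_async(collected.append, "logged")
    worker.join(timeout=5)
    print(collected)
```

Starting one thread per request keeps the sketch short; under heavy load, a bounded worker pool or queue may be a safer choice, since unbounded thread creation has its own cost.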

Transporter Cluster

The transporter cluster is an array of non-blocking Thrift servers whose sole purpose is to receive data from the web servers and route it to other components such as MongoDB, RabbitMQ or Kafka. The destination of each message is specified in the message itself by the web server. Communication is one-way, from the web servers to the transporter servers, which saves the time otherwise spent waiting for the Thrift servers to acknowledge each message. We may lose some request logs because of this, but we can afford to. The request logs are currently routed to the Kafka cluster. Communication between the web servers and the transporter servers takes 1-2 milliseconds on average, and the cluster can be scaled horizontally to handle an increase in load.

Following is a part of the Thrift config file. The file defines a DataTransporter service with a method that uses the oneway modifier, which means that the RPC call returns immediately.

service DataTransporter {
    oneway void transport(1:map<string, string> message)
}

Kafka Cluster

Kafka is a high-throughput distributed messaging system that supports the publish/subscribe messaging pattern. This messaging infrastructure enables us to build other pipelines that depend upon this stream of request logs. Our Kafka cluster stores the last 15 days’ worth of logs, so any new consumer that we implement can start processing data from 15 days back in time.

Useful reference for setting up a kafka cluster.

Pipeline Manager Server

This server manages the consumption of request log messages from the Kafka topics, storing them in MongoDB and then later moving them to Amazon S3 as well as Amazon Redshift. MongoDB acts merely as a staging area for the data consumed from the Kafka topics and this data is transferred to S3 at hourly intervals. Every file that is saved in S3 is loaded into Amazon Redshift which is a data warehouse solution that can scale to petabytes of data. We use Amazon Redshift for analyzing/metrics calculation from request log data. This server works in conjunction with a RabbitMQ cluster which it uses to communicate about task completion and initiation.

Here is the script that loads data from S3 into Redshift. This script handles insertion of duplicate data first by removing any duplicate rows and then inserting the new data.

import os
import sys
import subprocess

from django.conf import settings


def load_s3_delta_into_redshift(s3_delta_file_path):
    """s3_delta_file_path is path after the bucket
    name.
    """
    bigdata_bucket = settings.BIGDATA_S3_BUCKET

    attrs = {
        'bigdata_bucket': bigdata_bucket,
        's3_delta_file_path': s3_delta_file_path,
    }

    complete_delta_file_path = "s3://{bigdata_bucket}/{s3_delta_file_path}".format(**attrs)

    schema_file_path = "s3://{bigdata_bucket}/request_log/s3_col_schema.json".format(**attrs)

    data = {
            'AWS_ACCESS_KEY_ID': settings.AWS_ACCESS_KEY_ID,
            'AWS_SECRET_ACCESS_KEY': settings.AWS_SECRET_ACCESS_KEY,
            'LOG_FILE':  complete_delta_file_path,
            'schema_file_path': schema_file_path
          }

    S3_REDSHIFT_COPY_COMMAND = " ".join([
        "copy requestlog_staging from '{LOG_FILE}' ",
        "CREDENTIALS 'aws_access_key_id={AWS_ACCESS_KEY_ID};aws_secret_access_key={AWS_SECRET_ACCESS_KEY}'",
        "json '{schema_file_path}';"
    ]).format(**data)


    LOADDATA_COMMAND = " ".join([
        "begin transaction;",
        "create temp table if not exists requestlog_staging(like requestlog);",
        S3_REDSHIFT_COPY_COMMAND,
        'delete from requestlog using requestlog_staging where requestlog.row_id=requestlog_staging.row_id;',
        'insert into requestlog select * from requestlog_staging;',
        "drop table requestlog_staging;",
        'end transaction;'
        #'vacuum;' #sorts new data added
    ])

    redshift_conn_args = {
        'host': settings.REDSHIFT_HOST,
        'port': settings.REDSHIFT_PORT,
        'username': settings.REDSHIFT_DB_USERNAME
    }

    REDSHIFT_CONNECT_CMD = 'psql -U {username} -h {host} -p {port}'.format(**redshift_conn_args)

    PSQL_LOADDATA_CMD = '%s -c "%s"'%(REDSHIFT_CONNECT_CMD, LOADDATA_COMMAND)

    returncode = subprocess.call(PSQL_LOADDATA_CMD, shell=True)
    if returncode != 0:
        raise Exception("Unable to load s3 delta file into redshift ",
                s3_delta_file_path)

What’s next

Data is like gold for any web application. The insights that it can provide and the growth it can drive are amazing, if done the right way. There are dozens of features and insights that can be built with the request logs, including a recommendation engine, better content delivery, and improvements to the overall product. All of this is a step towards making HackerEarth better each and every day for our users.

If you have any queries or wish to talk more about this architecture or any of the technologies involved, you can mail me at praveen@hackerearth.com.

Posted by Praveen Kumar.

http://engineering.hackerearth.com/2015/02/26/logging-millions-requests-what-it-takes

Patching django sessions to control user sessions

Feb 14, 2015

###Introduction

HackerEarth uses the django framework at its heart. We use two third party django packages for the purpose of user authentication and session management:

django-allauth: Provides pre-built modules for email-based as well as all popular social authentication mechanisms.
django-redis-sessions: Allows storage of user session data in redis (an in-memory data store that also writes to disk) for fast retrieval. We used MySQL earlier for this purpose but the retrieval became very slow as the number of users grew.

Django sessions are simple dictionaries which look something like this:

{
    '_session_cache': {
        '_auth_user_id': 2L,
        '_auth_user_backend': 'allauth.account.auth_backends.AuthenticationBackend',
    },
    '_session_key': '44617f83e234b6aa7e632abb8b44b906',
    'modified': False,
    'accessed': True
}

The _session_cache contains the information about the user who is logged in and the backend that is used for user authentication (since we do not use django’s default authentication backend, the value is different here). Any other data you set on the session will also be present inside the _session_cache dictionary. The _session_key is generated by the session backend using a random function.

All the sessions are stored in redis in the form of key value pairs where the key is the _session_key and value is _session_cache in encoded format.

###The problem

As it can be clearly seen, there is no way to determine which key belongs to which user apart from getting that key’s data from the data-store and checking the user id associated with that data.

Now consider a scenario where you want to find all the sessions associated with a given user. One of the use cases can be when a user changes their password, we would want to delete all their existing sessions. In such a scenario, you will have to iterate over all the rows of data, decode it and check if it belongs to a certain user. This might work well when there are a few hundred users on your site, but with a large number of users, this is not such a good idea.

###The solution

Redis lets you fetch values of keys containing a certain pattern. If a user’s session keys can contain a certain constant string, we can get all their sessions using that constant string.

We realized that inserting a constant string inside the session key was all we needed to do to solve our problem.

###The implementation

The implementation is divided into two steps:

Change the key creation logic in the SessionStore class:

The session objects in django are abstracted using a class called SessionStore. This class has a method _get_new_session_key which is responsible for generating session_keys. We define our own CustomSessionStore which only overrides the above mentioned method.

from django.contrib.sessions.backends.db import SessionStore

class CustomSessionStore(SessionStore):

    def _get_new_session_key(self):
        session_key = super(CustomSessionStore, self)._get_new_session_key()

        # If the user's information is present in the session, get it and
        # inject it inside the session key, else inject a random string to
        # keep the session key pattern consistent
        if '_auth_user_id' in self._session:
            user_id = self._session.get('_auth_user_id')
            encoded_user_id = user_encoder_function(user_id)
            session_key = '%s:%s' % (encoded_user_id, session_key)
        else:
            session_key = '%s:%s' % (some_random_string, session_key)
        return session_key

Overriding django Session middleware

Django has a SessionMiddleware which is responsible for initializing the session object on the request as well as setting the session cookie on the response object. We only need to override the process_request function so that the newly defined CustomSessionStore class can be used.

from importlib import import_module

from django.conf import settings
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.sessions.middleware import SessionMiddleware

class CustomSessionMiddleware(SessionMiddleware):

    def process_request(self, request):

        # settings.SESSION_ENGINE is the path to your session store class.
        # It will be the path to CustomSessionStore in this case.
        engine = import_module(settings.SESSION_ENGINE)

        # session_key name is defined in the settings file
        session_key = request.COOKIES.get(settings.SESSION_COOKIE_NAME, None)

        # Earlier the session_key was a single string with no delimiters in
        # between. We inserted the ':' delimiter in between for easy
        # segregation of the two components of the session_key. If an old
        # session is found we copy its data to the new style session class and cycle
        # its key. The cycle_key method internally calls the
        # _get_new_session_key which now will generate a session key in
        # the new format but the old data will remain intact. All this
        # hassle is for preserving user authentication state when we deploy this code.
        # If we change the keys directly, users' existing sessions will get lost and
        # they will get logged out resulting in an unpleasant experience.
        if session_key is not None and len(session_key.split(':')) != 2:
            old_session = SessionStore(session_key=session_key)
            # Load the old-style session data once and reuse it below
            old_data = old_session.load()
            request.session = engine.CustomSessionStore(session_key=session_key)
            request.session._session_cache = old_data
            request.session.cycle_key()
        else:
            request.session = engine.CustomSessionStore(session_key=session_key)

###Conclusion

All the sessions for a given user_id can be fetched using the following pseudo code:

redis_conn = get a redis connection
encoded_user_id = user_encoder_function(user_id)
# This pattern represents any key starting with encoded_user_id followed by
# a ':' and any string after that, which is how sessions are stored in
# redis.
key_pattern = encoded_user_id + ':*'
keys = redis_conn.keys(key_pattern)
for key in keys:
    session = redis_conn.get(key)
    # Do something with the session
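One practical note on the pseudo code above: the Redis KEYS command walks the whole keyspace in a single blocking call, so on a large store an incremental SCAN is usually gentler. Below is a sketch of the password-change use case using redis-py’s scan_iter and delete methods. The function names are made up for this example, and `redis_conn` is assumed to be a redis-py client (or anything exposing the same two methods).

```python
def iter_user_session_keys(redis_conn, encoded_user_id):
    """Yield every session key stored as '<encoded_user_id>:<random part>'."""
    pattern = "%s:*" % encoded_user_id
    # scan_iter walks the keyspace in small batches instead of one big call.
    for key in redis_conn.scan_iter(match=pattern):
        yield key


def delete_user_sessions(redis_conn, encoded_user_id):
    """Delete all sessions of one user and return how many were removed."""
    deleted = 0
    for key in iter_user_session_keys(redis_conn, encoded_user_id):
        deleted += redis_conn.delete(key)
    return deleted
```

This assumes the encoded user id contains no glob characters such as `*`, `?` or `[`, since those would be interpreted as part of the match pattern.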

This approach helped us solve a lot of problems like deleting all user sessions on password change, keeping track of active user sessions to name a few.

There might be multiple ways of implementing this but we preferred this approach because it did not involve any change in django’s source and only a couple of the existing methods were overridden, which makes it less prone to bugs.

Feel free to comment below or reach us at support@hackerearth.com for any suggestions, queries or bugs.

Posted by Virendra Jain
Follow me @virendra2334

http://engineering.hackerearth.com/2015/02/14/django-sessions-patch

Building a powerful comment system

Jan 27, 2015

In this post, I am going to briefly describe the challenges we faced while building a powerful comment system.

Comments have become an integral part of our website. They are integrated almost everywhere - challenges, problems, notes etc. and soon will be added to our new products. We have been working to make it more powerful and usable.

Here is what we did:

Pluggable architecture
Ajaxifying comments
Realtime sync
Tagging people

From the beginning

Our comment system is built using an open source django app, named django-threadedcomments. In threadedcomments, commenters can reply both to the original item, and reply to other comments as well. This open source app best suited our requirements in terms of UX, hence we decided to use it. But later we realised it was not powerful enough, and we decided to add our own features in it.

Pluggable architecture

Our commenting system is a plug and play app. We added a layer on top of django-threadedcomments which lets us integrate comments anywhere on our website easily without writing the same code again.

Below is the snippet which we can include in any django template to add comments.

<div id="comments-{{model.get_content_type.id}}-{{model.id}}" class="pagelet-inview standard-margin comments" ajax="{% url render_comments model.id model.get_content_type.id %}" target="comments-{{model.get_content_type.id}}-{{model.id}}"></div>

The single line of code above renders the complete comment list (including the reply form) using bigpipe. Isn’t it cool?

One more reason I call our comment system plug and play is that we can easily override most of the comment display logic and show comments differently on different pages. For example, comments appearing on a note page and on a problem page need to be shown differently, based on who the moderator of the content is. This would not have been possible without the django template block tag.

comments/base/list.html

<!-- This block handles logic to determine if user is a normal user or a moderator -->
{% block userextrainfo %}
<!-- Override this block in your app -->
{% endblock %}

comments/problem/list.html

{% block userextrainfo %}
<!-- A moderator can be problem owner, tester, editorialist or event admin -->
{% endblock %}

comments/notes/list.html

{% block userextrainfo %}
<!-- A moderator can be note owner -->
{% endblock %}

Ajaxifying comments

This was the most challenging task because of the way django-threadedcomments is built. Special thanks to Virendra for taking the initiative and finding an easy to implement solution.

Posting a comment via an AJAX request was relatively easy compared to deleting one, because of the threaded nature of comments. Whenever a comment is deleted, we first determine whether that comment has at least one child which is not deleted. Based on that, we decide the format in which the deleted comment will be shown to the user. If you didn’t understand a word of what I wrote above, look at the images below.

Initial comments

After deleting comment 2, child comment 2.1 should be visible

After deleting comment 2.1, delete complete tree

We implemented a BFS algorithm to handle all the scenarios and corner cases.

class ThreadedComments(Comment):
  """
  ThreadedComment model
  """

  def _child_exists(self):
    """
    Returns boolean
    Implements BFS to check if comment obj has
    at least one child which is not removed.
    Uses cache to avoid running BFS every time.
    """
    key = self.get_child_exists_key()
    is_child = cache.get(key, None)
    if is_child is None:
        queue = deque()
        queue.append(self)
        while len(queue):
            comment = queue.popleft()
            children_exists, childs = self.get_child_exists_and_childs(comment)
            if children_exists:
                is_child = True
                break
            else:
                for child in childs:
                    queue.append(child)

        if is_child is None:
            is_child = False

        cache.set(key, is_child, CACHE_TIME_MONTH)
    return is_child
  child_exists = property(_child_exists)
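To make the traversal concrete, here is a minimal, self-contained sketch of the same idea outside Django. The `Comment` dataclass and the `has_visible_descendant` function are hypothetical names used only for illustration; they are not part of django-threadedcomments, and the caching step is left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    """Hypothetical stand-in for a threaded comment (not the Django model)."""
    id: str
    is_removed: bool = False
    children: List["Comment"] = field(default_factory=list)


def has_visible_descendant(comment: Comment) -> bool:
    """Breadth-first search for any descendant that is not removed."""
    queue = deque(comment.children)
    while queue:
        current = queue.popleft()
        if not current.is_removed:
            return True
        # A removed comment may still have live replies further down.
        queue.extend(current.children)
    return False


# Comment 2 has one reply, 2.1, mirroring the example above.
reply = Comment("2.1")
comment = Comment("2", children=[reply])

comment.is_removed = True
print(has_visible_descendant(comment))  # True: 2.1 is still visible

reply.is_removed = True
print(has_visible_descendant(comment))  # False: the whole tree can be dropped
```

A deleted comment with a visible descendant would be kept as a placeholder, while one without any can be removed together with its subtree.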

Realtime sync

After ajaxifying comments, we decided to put the cherry on top. Making comments appear in realtime was not easy at all. We are experimenting with Pusher to do the realtime job for us.

Below is generic Python code for pushing data to Pusher via RabbitMQ:

class PusherClient(BaseClient):
    def __init__(self):
        routing_key = PUSHER_ROUTING_KEY
        retry(super(PusherClient, self).__init__, routing_key)

    def call(self, message):
        retry(super(PusherClient, self)._call, message)

class PusherWorker(ConsumeQueue):
    """
    Push data to pusher service
    """

    def on_message(self, body):
        message = json.loads(body)
        channel = message.get('channel', None)
        event = message.get('event', None)
        data = message.get('data', '')
        if channel is not None and event is not None:
            pusher_instance = get_pusher_instance()
            # Socket id to exclude
            if data:
                socket_id = data.get('socket_id', None)
            else:
                socket_id = None
            if socket_id:
                pusher_instance[channel].trigger(event, data, socket_id)
            else:
                pusher_instance[channel].trigger(event, data)

Pusher is great for broadcasting messages in realtime, but it has some drawbacks. It doesn’t have a scalable presence system, which means it’s difficult to store info for more than 100 clients on their servers. This makes it difficult to write complex logic on the client side.

JavaScript code to add/delete a comment on the client:

function subscribeComment(channel_name) {
    var pusher = get_pusher_instance();
    if(pusher) {
      var channel = pusher.subscribe(channel_name);
      channel.bind('comment_added', function(data) {
        var comment_html = data.html;
        var parent_comment_id = data.parent_id;
        addComment(parent_comment_id, comment_html);
        /* Some hacks to decide whether to keep reply, PM, delete link */
      });
      channel.bind('comment_removed', function(data) {
        // Comment id to be deleted
        var comment_id = data.comment_id;
        var has_child = data.has_child;
        deleteComment(comment_id, has_child);
      });
    }
}

Tagging people

It does exactly what it says: you can now tag people in comments and they will be notified by email. Check out the screenshots below.

Tagging people using @

Comment posted after tagging

I worked on this feature in our very first internal hackathon. I tried to make it as generic as possible by binding the event handler to the ‘mentionable’ class.

<textarea ajax="{{AJAX_URL}}" rows="10" result-div-id="search-users-dropdown" name="comment" id="id_comment" cols="40" class="mentionable"></textarea>

$('.mentionable').live('keyup', function(e) {
    var url = $(this).attr('ajax');
    var result_div_id = $(this).attr('result-div-id');
    var name_str = 'developer'; // will be made generic
    var val = $(this).val();
    var cursorPos = $(this).prop('selectionStart');
    var result_div = $('#' + result_div_id);
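    val = val.substr(0, cursorPos);
    // Assumption: the query is the text typed after the last '@' before the cursor
    var q = val.substr(val.lastIndexOf('@') + 1);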

    $.ajax({
       url: url,
       type: 'GET',
       data: {'q': q},
       id: $.now(),
    }).done(function(data, method) {
       if(method==='success') {
           var r_time = this.id;
       } else {
           var r_time = $.now();
       }
       var data_time = result_div.attr('timestamp');
       if(data_time===undefined || data_time<r_time) {
           var html = $.trim(data.html);
           result_div.html(html);
           if(html.length>0) {
               result_div.show();
           }
           result_div.attr('timestamp', r_time);
       }
    }).fail(function() {
    });
});

On the backend, we query a graph search database.

More to come

There are still a lot of improvements in the pipeline, such as UI changes, which will be rolled out soon. If you have any suggestions, do let us know.

Hope these improvements have made comments on HackerEarth better and easier to engage.

Posted by Lalit Khattar. Follow me @LalitKhattar

http://engineering.hackerearth.com/2015/01/27/making-comments-more-powerful

Aggregating Apache logs with Fluentd and Amazon S3

Oct 17, 2014

HackerEarth infrastructure is hosted on Amazon services. At any given point of time, many webservers are running concurrently, serving thousands of requests. This generates tons of access and error logs on each server separately. The task here was to parse the logs on all these webservers and store them in one place, in a format that can be used to derive meaningful insights from the data. We tried to accomplish this using fluentd and Amazon S3.

Mechanism

Fluentd does the following things:

Continuously tails apache log files.
Parses incoming entries into meaningful fields such as IP address and buffers them.
Writes the buffered data to Amazon S3 periodically.

Installation

The stable version of fluentd is called td-agent, and that is what we use here. On Ubuntu 12.04 LTS, the following shell command installs td-agent on your system:

curl -L http://toolbelt.treasuredata.com/sh/install-ubuntu-precise-td-agent2.sh | sh

The other supported operating systems and installation methods are listed here.

Please note that if you are installing fluentd using RubyGems, you will have to install the Amazon S3 output plugin separately. This can be done with:

gem install fluent-plugin-s3

Configuration

Once td-agent is installed, you will find a td-agent.conf file in the /etc/td-agent/ directory. For parsing apache access logs, you will need to add the following configuration:

<source>
    type tail                           # for continuously tailing the log
    format apache2                      # for default format of apache access logs
    time_format %d/%b/%Y:%H:%M:%S %z    # time format in access logs
    path /var/log/apache2/access.log    # path from where log is to be read

    # if td-agent restarts, it resumes reading from the
    # last position it read before the restart
    pos_file /var/log/td-agent/apache2.access_log.pos

    tag s3.apache.access                # for identifying the log stream uniquely
</source>

<match s3.*.*>
    type s3                             # plugin for writing the log to s3
    aws_key_id <YOUR AWS KEY ID>
    aws_sec_key <YOUR AWS SECRET KEY>
    s3_bucket <YOUR S3 BUCKET NAME>
    path <PATH ON BUCKET>

    #place where the stream is stored before being written on s3
    buffer_path /var/log/td-agent/s3/

    # this specifies the interval at which logs are to be written to s3.
    # this format specifies daily writes
    time_slice_format %Y%m%d

    # the amount of time fluentd will wait for old logs to arrive
    time_slice_wait 10m

    buffer_chunk_limit 256m            # max size of a buffer chunk
</match>

This configuration is for the default apache access.log file, and the parser for this is predefined in td-agent (i.e. format apache2). If you want to use some other log format, you will need to write a regular expression for parsing those logs.
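To get a feel for what such a format expression does, here is a small Python sketch that parses one access-log line into named fields. It is an illustration only: fluentd uses Ruby regular expressions (named groups are written `(?<name>...)` there and `(?P<name>...)` in Python), and the pattern and sample line below are simplified assumptions, not the pattern behind fluentd's built-in apache2 format.

```python
import re

# Simplified pattern for the Apache "combined" log format (an assumption for
# illustration; it is not the pattern used by fluentd itself).
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    '127.0.0.1 - - [17/Oct/2014:10:15:32 +0530] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "ApacheBench/2.3"'
)
record = parse_line(sample)
print(record["host"], record["method"], record["path"], record["code"])
# 127.0.0.1 GET /index.html 200
```

Each named group becomes one field of the parsed record, which is the same role the named groups play in a fluentd format expression.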

If you are using the vhost_combined format for access logs, all you need to do is replace the format apache2 line (the second line of the source block) with this:

format /^(?<virtualhost>[^ ]*)[:](?<port>[^ ]*) (?<host>[^]*)"(?<forwardedfor>[^\"]*)" [^ ]* (?<user>[^ ]*)\[(?<time>[^\]]*)\]"(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*)(?<size>[^ ]*)(?:"(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$/

Please refer to this article for other pre-available log formats.

Testing

To test your configuration, run this command in your terminal:

ab -n 100 -c 10 http://localhost/

Now log in to your Amazon console and check the generated logs. With the above configuration, you should be able to write your apache access logs to Amazon S3 on a daily basis.

Further, these logs can be fed into ElasticSearch and LogStash/Kibana to analyze all the requests that your web servers receive.

This blog is mostly a reproduction of the official fluentd blog with a slightly more detailed explanation.

P.S. I am a developer at HackerEarth. Reach out to me at virendra@hackerearth.com for any suggestion, bugs or even if you just want to chat! Follow me @virendra2334

Posted by Virendra Jain

http://engineering.hackerearth.com/2014/10/17/using-fluentd

Using APIs with Python Requests Module

Aug 21, 2014

One of the most liked features of the newly launched HackerEarth profile is account connections, through which you can boast about your coding activity on various platforms.

Github and StackOverflow provide their API to pull out various kinds of data. The API documentation of Github and StackOverflow can be found here.

Github : https://developer.github.com/v3/
StackOverflow : http://api.stackexchange.com/docs

But what do we use to communicate with these APIs?

Working with HTTP is a painful task. Python includes a module called urllib2 but working with it can become cumbersome.

Requests, written by Kenneth Reitz, simplifies the common use cases and is the tool HackerEarth uses for all its HTTP operations.

Here is a code sample by @kenneth himself comparing urllib2 and requests

The code above makes it clear why we went for the requests module.

Installation

Installing Requests via pip is fairly simple; just run this in your terminal:

$ pip install requests

Making your first Request

First of all, you need to import requests

>>> import requests

Now let’s make a GET request to get Github’s public timeline:

>>> r = requests.get('https://github.com/timeline.json')

Now we have a Response object called r, from which we can get all the information we need.

Requests’ simple API means that all forms of HTTP request are as obvious. For example, this is how you make an HTTP POST request:

>>> r = requests.post("http://httpbin.org/post")

The other HTTP request types, PUT, DELETE, HEAD and OPTIONS, are just as simple:

>>> r = requests.put("http://httpbin.org/put")
>>> r = requests.delete("http://httpbin.org/delete")
>>> r = requests.head("http://httpbin.org/get")
>>> r = requests.options("http://httpbin.org/get")

Now, let’s consider the GitHub timeline again:

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
u'[{"created_at":"2014-06-08T20:50:27-07:00","payload":{"sha...

Requests will automatically decode content from the server. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

There’s a built-in JSON decoder, on which we rely heavily:

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.json()
[{u'actor_attributes': {u'name': u'Tin...

Passing parameters with URLs

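You might often need to pass parameters in a URL’s query string. If you were constructing the URL by hand, this data would be given as key/value pairs in the URL after a question mark, e.g. httpbin.org/get?key=val. Requests lets you provide these arguments as a dictionary, using the params keyword argument: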

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)

You can see that the URL has been correctly encoded by printing the URL:

>>> print(r.url)
http://httpbin.org/get?key2=value2&key1=value1
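Under the hood this is ordinary URL encoding. As a rough illustration using only the standard library (Python 3's urllib.parse; this is not Requests' internal code, and the second value is made up to show escaping), the same kind of query string can be built by hand, with special characters percent-encoded for you:

```python
from urllib.parse import urlencode

payload = {'key1': 'value1', 'key2': 'value 2 & more'}

# urlencode escapes characters that are not safe inside a query string.
query = urlencode(payload)
url = 'http://httpbin.org/get?' + query
print(url)
# http://httpbin.org/get?key1=value1&key2=value+2+%26+more
```

Passing a dictionary through params saves you from doing this escaping yourself.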

Let’s take a similar use case from HackerEarth, where we fetch information about a user’s repositories. According to Github’s pagination rules, requests that return multiple items are paginated to 30 items by default. But we can set a custom page size using the ?per_page parameter.

'https://api.github.com/user/repos?per_page=100'

If you want to request a specific page, you need to pass the ?page parameter. Page numbering is 1-based, and omitting the ?page parameter returns the first page.

So, the URL for requesting 100 items from second page might look like

'https://api.github.com/user/repos?page=2&per_page=100'

Let’s perform this via requests.

>>> params = {'page': 2, 'per_page':100}
>>> r = requests.get('https://api.github.com/user/repos', params=params)

Custom Headers

If you’d like to add HTTP headers to a request, simply pass in a dict to the headers parameter.

For example, we didn’t specify our content-type in the previous example:

>>> import json
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> headers = {'content-type': 'application/json'}
>>> r = requests.post(url, data=json.dumps(payload), headers=headers)

Let’s take the earlier example of fetching data from repositories. Most APIs require an access token for requesting data. The access token needs to be added to the HTTP headers.

>>> headers = {'Authorization':'token %s' % token}
>>> params  = {'page': 2, 'per_page': 100}
>>> r = requests.get(url, params=params, headers=headers)

Response Status Codes

We can check the status codes for the response using:

>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200

If we made a bad request (a 4XX client error or 5XX server error response), we can raise it with Response.raise_for_status():

>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404

>>> bad_r.raise_for_status()
Traceback (most recent call last):
File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error

Response.raise_for_status() returns None when the status code is 200.

Response Headers

We can view the server’s response headers using a Python dictionary:

>>> r.headers
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}

>>> r.headers['Content-Type']
'application/json'

Let’s take the earlier repository example again. Github uses pagination in their API.

>>> url = 'https://api.github.com/users/sayanchowdhury/repos?page=1&per_page=10'
>>> r = requests.head(url=url)
>>> r.headers['link']
'<https://api.github.com/user/500628/repos?page=2&per_page=10>; rel="next", <https://api.github.com/user/500628/repos?page=8&per_page=10>; rel="last"'

So, we parsed the next URL out of the headers:

>>> link = r.headers['link'].split(',')[0]
>>> link
'<https://api.github.com/user/500628/repos?page=2&per_page=10>; rel="next"'
>>> import re
>>> url = re.findall(r'<(.*?)>', link)[0]
>>> url
'https://api.github.com/user/500628/repos?page=2&per_page=10'

But Requests has a more intuitive way to do it:

>>> r.links['next']['url']
'https://api.github.com/user/500628/repos?page=2&per_page=10'

>>> r.links['last']['url']
'https://api.github.com/user/500628/repos?page=8&per_page=10'
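For the curious, the parsing that r.links saves you from can be sketched in a few lines of standard-library Python. The function name `parse_link_header` is our own, and this simplified version assumes well-formed headers with a single rel value per link; Requests' own parser handles more cases.

```python
import re


def parse_link_header(value):
    """Map each rel name to its URL for a simple, well-formed Link header."""
    links = {}
    # Each entry looks like: <url>; rel="name"
    for url, rel in re.findall(r'<([^>]*)>\s*;\s*rel="([^"]*)"', value):
        links[rel] = url
    return links


header = (
    '<https://api.github.com/user/500628/repos?page=2&per_page=10>; rel="next", '
    '<https://api.github.com/user/500628/repos?page=8&per_page=10>; rel="last"'
)
links = parse_link_header(header)
print(links['next'])
print(links['last'])
```

To walk through every page, you would keep requesting the next URL until the header no longer contains a rel="next" entry.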

Timeouts

You can tell requests to stop waiting for a response after a given number of seconds with the timeout parameter:

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: Request timed out.

Errors and Exceptions

In case of network problem (e.g. DNS failure, refused connection, etc), Requests will raise a ConnectionError exception.
In the event of the rare invalid HTTP response, Requests will raise an HTTPError exception.
If a request times out, a Timeout exception is raised.
If a request exceeds the configured number of maximum redirections, a TooManyRedirects exception is raised.
All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.

References - http://docs.python-requests.org/en/latest/

Posted by Sayan Chowdhury. Follow me @chowdhury_sayan. Write to me at sayan@hackerearth.com.

http://engineering.hackerearth.com/2014/08/21/python-requests-module

HackerEarth Streak: An exciting data about your HackerEarth activity

Jun 18, 2014

We have been constantly adding features to our HackerEarth Developer Profile, making it better with every new update.
One of the exciting data points in it is the HackerEarth Streak on the HackerEarth Activity page.

This post explains how we process user data and compute such results. We split it into two parts:

Code Streak: Maximum number of unique problems solved continuously
Day Streak: Maximum number of days such that one new problem is solved each day

Code Streak:

Extracting Relevant Data

All we have is the data about all the submissions made by any user. In order to process Code Streak we need to identify two things.

To which problem was the submission made?

We have several different types of problems in Challenges and Practice Problems, for example Programming Problem, Approximate Problem, Golf Problem etc.

Each problem is assigned two attributes. Type (Programming Problem, Approximate Problem, Golf Problem etc) and ID (1,2,3 and so on).

So in order to uniquely identify any problem we need both the attributes, because a Programming Problem can have the same ID as Approximate Problem, the difference is in the Type.

Was the submission graded Correct or Incorrect?

This is directly accessible from a single boolean attribute Solved which is True if Correct and False if Incorrect.

Now that we have extracted the data enough to calculate Code Streak, we move to the Data Structure part.

Solution - The Data Structure behind

Recently, a question inspired by Code Streak appeared in the June Easy Challenge as Roy and Code Streak. It was simplified, and the ID range was restricted, to make the problem easier. A linear DP (Dynamic Programming) solution was enough to solve it.

Algorithm steps are as follows:

Find all the ranges of Correct submissions
For each range, count all the unique problems which have never been solved before
Store the count in an array
Find the maximum from the array

The second point is the challenge here. How do we know whether the problem with a particular Type and ID was solved before or not? Also, the range of IDs is not restricted here, unlike the problem in the June Easy Challenge where the DP solution worked, so DP fails here and we switched to Hashing. When a problem is solved for the first time, we increase the count and add the problem to the Hash. The next time we encounter the same problem, it is already in the Hash, so we don’t increase the count.

After processing all the ranges of Correct Submissions, all we have to do is find the max of all the counts. That’s it. Code Streak calculated.
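The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than our production code: the function name and the shape of the input (a chronological list of `(type, id, solved)` tuples) are assumptions made for the example, and a Python set plays the role of the Hash.

```python
def code_streak(submissions):
    """Longest run of correct submissions, counting each problem only once.

    `submissions` is a chronological list of (problem_type, problem_id, solved)
    tuples; this shape is an assumption made for the example.
    """
    seen = set()      # (type, id) pairs that have already been solved
    best = current = 0
    for problem_type, problem_id, solved in submissions:
        if not solved:
            current = 0          # an incorrect submission ends the range
            continue
        key = (problem_type, problem_id)
        if key not in seen:      # first time this problem is solved
            seen.add(key)
            current += 1
            best = max(best, current)
    return best


history = [
    ("programming", 1, True),
    ("programming", 2, True),
    ("approximate", 1, True),   # same ID as above, different type
    ("programming", 2, True),   # already solved, not counted again
    ("golf", 7, False),         # breaks the streak
    ("golf", 7, True),
    ("programming", 1, True),   # solved in an earlier range
]
print(code_streak(history))  # 3
```

Keeping a running maximum avoids storing the per-range counts in a separate array, but the result is the same as taking the max over that array.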

Day Streak:

Extracting Relevant Data

All the data that we extracted for Code Streak is reused here. Additional info is Timestamp (only Date part), which is also noted down when any solution is submitted.

The Problem - The Solution

The problem is: how do we process this data efficiently? One easy way out is to extract all the user submissions for each day and check whether any new problem was solved that day. But making such a query for each day means querying in a loop, which is inefficient.

So I used a workaround: pre-calculate all the dates of the current century, which comes to around 365*100 days. We have the advantage that both the pre-calculated dates and the extracted user data are sorted by the Date in the Timestamp.

So we can traverse the pre-calculated dates and the extracted user data simultaneously, maintaining two different index variables. Whenever the pre-calculated date and the user-submission date match, we check whether any unique problem was solved that day. (This check is similar to the one in Code Streak, and again it was solved using Hashing.)

Further, we maintain another array of the same size as the number of days in a century. A day is marked True if any new unique problem was solved that day else it is marked False. Now all we have to do is find the longest range of all Trues in this array. This can be easily done in linear time. There it is. Day Streak calculated.

Thus we avoided querying in a loop, although the running time of the above algorithm still depends heavily on the number of submissions made by the user on any particular day.
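Here is a minimal sketch of the Day Streak idea in Python. It is a simplification under stated assumptions: the function name and input shape are made up for the example, and instead of pre-calculating a whole century of dates it only covers the days between the first and the last submission, indexing the boolean array by day offset.

```python
from datetime import date, timedelta


def day_streak(submissions):
    """Longest run of consecutive days with at least one newly solved problem.

    `submissions` is a list of (day, problem_type, problem_id, solved) tuples
    sorted by day; this shape is an assumption made for the example.
    """
    if not submissions:
        return 0
    first_day, last_day = submissions[0][0], submissions[-1][0]
    total_days = (last_day - first_day).days + 1
    # One flag per calendar day, True if a new problem was solved that day.
    solved_new = [False] * total_days

    seen = set()
    for day, problem_type, problem_id, solved in submissions:
        key = (problem_type, problem_id)
        if solved and key not in seen:
            seen.add(key)
            solved_new[(day - first_day).days] = True

    # Longest run of True values, found in a single linear pass.
    best = current = 0
    for flag in solved_new:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


start = date(2014, 6, 1)
history = [
    (start, "programming", 1, True),
    (start + timedelta(days=1), "programming", 2, True),
    (start + timedelta(days=2), "programming", 2, True),   # nothing new
    (start + timedelta(days=3), "golf", 5, True),
    (start + timedelta(days=4), "golf", 6, False),
    (start + timedelta(days=4), "approximate", 1, True),
    (start + timedelta(days=5), "programming", 9, True),
]
print(day_streak(history))  # 3
```

The submissions are fetched once and scanned once, so there is no query inside the loop over days.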

Note: I believe we can still optimize these calculations by using any way other than Hashing. Please feel free to drop a comment below.

Posted by Ravi Ojha.

http://engineering.hackerearth.com/2014/06/18/hackerearth-streak-an-exciting-data-about-your-hackerearth-activity

Using Google Data APIs with django apps

Jun 7, 2014

Setting up a Google Project

In order to use any of the Google APIs for your application, first you need to set up a project in the Google Developer’s Console. Enable all the APIs that you want to use in the APIs tab under APIs and auth. Under the Credentials tab, create a Client ID and Client secret which is used for communication between your application and the API. Enter all the allowed redirect urls in the Redirect URIs field. These are the URLs to which the application redirects after a user is successfully authenticated. You can change these URLs any time you want. Now your Google App is ready for use.

We wanted to create a small application where users can invite their google contacts to join HackerEarth.

Dependencies

We use the GData python client library, which makes it easy to interact with these Google services. You can install it using pip:

sudo pip install gdata

or install it from [source](https://code.google.com/p/gdata-python-client/downloads/list).

Authentication

First we need to define some constants that we will use throughout the application:

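#Obtained from Google Project Settings
GOOGLE_CLIENT_ID = '<Your Client ID>'
GOOGLE_CLIENT_SECRET = '<Your Client Secret>'

#Variable that specifies the data you want to access
GOOGLE_SCOPE = "https://www.google.com/m8/feeds/"

#URL where the flow should go on successful authentication
GOOGLE_APPLICATION_REDIRECT_URI = '<Some URL>'

#Session key that stores the page to return to after authentication
GOOGLE_REDIRECT_SESSION_VAR = '<some arbitrary value>'

#Session key that marks a successful authentication (used in the views below)
GOOGLE_COOKIE_CONSENT = '<another arbitrary value>'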

Since we wanted to use the Contacts API we used the above mentioned scope. A list of other scopes is given here.

The first step for authentication is generation of an authentication token. Since we have written a separate view to handle the auth token after successful login, we are setting the authentication token in the session so that the same token can be accessed in both the views.

#Try to fetch the authentication token from the session
auth_token = request.session.get('google_auth_token')

#If an authentication token does not exist already,
#create one and store it in the session.
if not auth_token:
    auth_token = gdata.gauth.OAuth2Token(
            client_id=GOOGLE_CLIENT_ID,
            client_secret=GOOGLE_CLIENT_SECRET, 
            scope=GOOGLE_SCOPE, 
            user_agent=USER_AGENT)
    request.session['google_auth_token'] = auth_token

The login view

After successful authentication, Google returns a code which can be used to generate an access token, which acts as a confirmation of the authentication. Here again we set the authentication token on the session and redirect to the view we came from.

def google_login(request):
    
    #Fetch the auth_token that we set in our base view
    auth_token = request.session.get('google_auth_token')
    
    #The code that google sends in case of a successful authentication
    code = request.GET.get('code')
    
    if code and auth_token:
        #Set the redirect url on the token
        auth_token.redirect_uri = GOOGLE_APPLICATION_REDIRECT_URI
        
        #Generate the access token
        auth_token.get_access_token(code)
        
        request.session['google_auth_token'] = auth_token
        
        #Populate a session variable indicating successful authentication
        request.session[GOOGLE_COOKIE_CONSENT] = code
        
        #Redirect to your base page
        return redirect(request.session.get(GOOGLE_REDIRECT_SESSION_VAR))
    
    #If user has not authenticated the app   
    return redirect('wherever you want to')

Making API calls

Now that our authentication token has an access token, we can make the API calls to fetch the data.

if request.session.get(GOOGLE_COOKIE_CONSENT):
    
    #Create a data client, in this case for the Contacts API
    gd_client = gdata.contacts.client.ContactsClient()
    
    #Authorize it with your authentication token
    auth_token.authorize(gd_client)

    #Get the data feed
    feed = gd_client.GetContacts()

else:
    
    #Since we want to get redirected back to the same page
    request.session[GOOGLE_REDIRECT_SESSION_VAR] = request.path
    
    #Generate the url on which authentication request will be sent
    authorize_url = auth_token.generate_authorize_url(
            redirect_uri=GOOGLE_APPLICATION_REDIRECT_URI)
    
    return redirect(authorize_url)

The flow will remain the same for other APIs. The only things that will change are:

The scope URL
The GData client

P.S. I am a developer at HackerEarth. Reach out to me at virendra@hackerearth.com for any suggestion, bugs or even if you just want to chat! Follow me @virendra2334

Posted by Virendra Jain

http://engineering.hackerearth.com/2014/06/07/using-google-apis-in-django

Post-mortem: The big outage on January 25, 2014

Jan 27, 2014

25th January was a rather unfortunate day for us. The monthly challenge, January Jackpot 2014, which was scheduled for 9:30 PM that day, was cancelled because things went wrong at the worst possible time. We once again regret the inconvenience caused to you, and this is a postmortem of what really happened behind the scenes.

It was Saturday, and the day was sunny here. Everything was running smoothly as usual. The whole HackerEarth team was out roaming in Bangalore: we went to a lake, took a boat ride, went for lunch and came back to the office in the evening. We all sang with the guitar and played Counter-Strike. At the same time, there was a college contest, Epiphany January Challenge, going on. Everything was smooth, the servers were sending replies happily and we didn’t have to worry about anything.

This is the email I received in the evening from Ravi from NIT Surat who handles the coding contests there.

from: Ravi Ojha
to: Vivek Prakash
date: Sat, Jan 25, 2014 at 7:37 PM

First, Thank You for HackerEarth!!  Its such a lucid platform for organizing
online contests.

How it went wrong

Little did we know that there was a catastrophe waiting for us in the contest that was going to run at 9:30 PM. First let me assure you that the last thing we need to worry about is the number of requests that hit our servers. Read [Scaling Python/Django application with Apache and mod_wsgi](http://engineering.hackerearth.com/2013/11/21/scaling-python-django-application-apache-mod_wsgi/) to understand why.

And in this case, there were not many, as the website had stopped responding anyway. Here is the request graph of the last one week:

The graph shows two significant spikes and two small spikes. The first one at 15,000 was on 24th January. The second one at 20,000+ was in the day on 25th January when Epiphany January contest was going on. The spike which just crosses 5000 in the later part of 25th was for the January Jackpot challenge. It’s clear that something else was wrong.

We got a notification that things were not right and sprang into action in no time. We found that the servers were not sending replies to the requests; the requests were just sitting idle in queues, waiting for something to finish. That was the real bottleneck. There was no extra load on the servers, with their CPU utilization below 30% at any point of time. The Status Server was continuously sending notifications that requests were timing out. In a situation like that, throwing extra servers at the problem didn’t help either.

What exactly went wrong

The reason it happened was that we had recently started collecting lots of data on user activity on HackerEarth. This data will be used in the Search Engine that we are rolling out this week. I didn’t want to announce it this way, and there will be a separate blog post on its details later. But here is what happened, in a nutshell:

We use the haystack wrapper for our search, and we had been using an old version of it. It is not thread-safe and throws exceptions non-deterministically for requests. Our Apache server runs in a multi-process and multi-threaded configuration. This issue just exploded that night.
In Django, signals are used to couple different applications. Signals notify other applications when an application’s data changes. For example, if a user registers on the site, you would want to update the search index so that the user becomes searchable. This happened for each and every object generated on the site: a signal was sent to haystack, which then did the job. We figured out that this was overkill, and not the right way.
We also saw that the servers reported no database connection to our analytics database, though rather rarely. This was not expected. All non-AJAX requests are logged in the analytics database, so this was something to be worried about.

What we did

We made the haystack thread-safe by using locking mechanisms.
We disabled all the unnecessary signals including those of haystack.
We have disabled the requests logging for now until we introduce fault tolerance for the analytics database, the resiliency that we have for master database right now. Read more about it.
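As a rough illustration of the first point (a generic pattern, not necessarily how the haystack fix was done), a shared object that is not thread-safe can be guarded with a lock so that only one thread touches it at a time. All class and function names below are hypothetical.

```python
import threading


class UnsafeIndex:
    """Hypothetical stand-in for a search backend that is not thread-safe."""

    def __init__(self):
        self.documents = []

    def update(self, doc):
        self.documents.append(doc)


class LockedIndex:
    """Serializes access to the wrapped index with a single lock."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def update(self, doc):
        with self._lock:  # only one thread at a time past this point
            self._index.update(doc)

    def count(self):
        with self._lock:
            return len(self._index.documents)


def worker(index, worker_id, n_docs=1000):
    """Simulates one request thread pushing updates to the index."""
    for i in range(n_docs):
        index.update((worker_id, i))


index = LockedIndex(UnsafeIndex())
threads = [threading.Thread(target=worker, args=(index, n)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(index.count())  # 8000
```

The trade-off is that threads now queue up behind the lock, which is one more reason to cut down the number of index updates in the first place.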

These incidents remind us to be more careful and make us realize the impact that they have, and we are progressively moving towards a much more stable product. Despite these incidents, we are committed to building the best product out there and will continue to do so. You might be interested in reading the post on Programming challenges, uptime, and mistakes in 2013.

Thank you all for being so patient. We appreciate that, and we are doing our best to not let that happen again.

Let me know if you have any comments, suggestions or anything else.

Posted by Vivek Prakash. Follow me @vivekprakash. Write to me at vivek@hackerearth.com.

http://engineering.hackerearth.com/2014/01/27/big-outage-25-january

Programming challenges, uptime, and mistakes in 2013

Jan 22, 2014

HackerEarth hosted more than a thousand contests in the year 2013 alone. Among them were more than two dozen public programming contests by HackerEarth itself, including our monthly challenges and hiring challenges. There were over 200 internal and public contests by colleges in the previous year, including IIT Delhi, IIT Guwahati, IIT Ropar, NIT Warangal, IIIT Jabalpur, NIT Raipur, NIT Calicut, BITS Pilani and many others. And we have been able to do that without breaking a sweat. But sometimes, we made mistakes too, most of them in the early half of 2013.

Mayhem

To tell you the truth, in the beginning it was chaotic and scary. We had to monitor that everything was working right. Sometimes, we would give everything just to keep the site up and running. The problem of scaling always takes a toll on you, all the more so when you want to build a world-class product, and particularly for a platform like ours, where the concept of adding more servers on demand (auto-scaling) fails because sudden bursts in traffic leave no time to bring more servers into action. Below is a request graph from a production server on a usual day.

It’s important to realize that nothing scales automatically. 100% uptime is a constant struggle. But we were ready to roll up our sleeves and move towards that. And in the later half of 2013, things moved ahead at a really amazing pace. On a related note, whenever I read the Deploying Django post by Randall Degges, it gives me a good laugh. Particularly these lines:

Yes, grasshopper! You now see it: you have only begun to discover the amount of work that lays ahead. You’ve barely scratched the surface as to the tools, methods, and skills necessary to manage and operate even the simplest of production sites. Your work is cut out for you.

####What We Did

We undertook a series of steps to make the experience nicer for the end user:

We rewrote our code-checker server queueing system in early 2013 to make it asynchronous. This significantly reduced the process overhead on our frontend servers. Read more.
We wrote a very robust realtime server in Tornado which handles the live update of webpages. A very visible case is when you receive the result in browser on compiling/submitting the code in realtime. Read more.
We sharded our database and wrote database routers to reduce overhead on single database and further reduce the latency in the queries. Read more.
Now was the time to do some heavy stress testing of our website, and we found that the results were terrible. This led us to optimize our servers and tweak them to make them faster and more resilient. Read more.

Amidst all this, with contests being hosted on HackerEarth almost every day, we had to roll out new features as soon as they were ready. This led us to write an in-house continuous deployment system which now allows us to put anything in production as soon as it’s ready. Tests are run automatically and then the code is deployed to production, which keeps us sane and makes us brave in pushing fast. We still mess up sometimes during complicated deployments, which require a series of steps to be done right and in the correct order: for example, when a new commit needs a slightly complicated package dependency, or when a huge schema migration is required. We are working on many aspects and have invested a lot in infrastructure and builder tools to make sure we don’t mess up at least 99% of the time.

####Uptime

Our uptime increased from 99.65% in April 2013 to 99.97% in November 2013. In December 2013, we made some mistakes due to extremely fast deployment, which brought the uptime down to 99.89% for that month. Since 1st December 2013, we have already deployed over 500 times. Here are our uptime stats for the last 6 months:

July 2013: 99.81%
August 2013: 99.64%
September 2013: 99.98%
October 2013: 99.84%
November 2013: 99.97%
December 2013: 99.89%
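
To put these percentages in perspective, an uptime figure can be converted into minutes of downtime. The sketch below assumes uptime is measured over the whole calendar month; the function name is illustrative and not part of any monitoring tool we use.

```python
import calendar


def downtime_minutes(uptime_percent, year, month):
    """Minutes of downtime implied by an uptime percentage for one calendar month."""
    days = calendar.monthrange(year, month)[1]
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


for label, year, month, pct in [
    ("November 2013", 2013, 11, 99.97),
    ("December 2013", 2013, 12, 99.89),
]:
    print("%s: about %.0f minutes down" % (label, downtime_minutes(pct, year, month)))
```

Under that assumption, 99.97% in November works out to roughly 13 minutes of downtime, and 99.89% in December to roughly 49 minutes.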

Because our product is used at any time across different timezones, we can’t afford to shut down the site for 2 hours to do an upgrade, migration or deployment without any care. We knew this from the beginning and have invested a lot of our time and thought in making sure everything always works smoothly.

####Monthly Challenges & Mistakes

December Monthly Challenge

Remember December was the first challenge in the last 4 months which didn’t see 100% uptime. The problem was something unexpected and baffling. The culture here is that we don’t throw servers at a large number of requests and expect them to handle everything. We don’t take pride in running 100 servers. In fact, there are usually just two web servers that receive your request and send the response. We call them the frontend servers. Most of the time, you are getting data from the cache. The cache is set or invalidated for almost every piece of data. As of today, there are over a million key-value pairs in our memcached store. Your sessions are maintained in redis. Any other persistent data goes into MySQL or S3, but most of it is cached for some suitable lifetime. More importantly, any request that reaches our servers is designed not to query the database more than 20 times, whether the data is in the cache or not. And when we say 20, it’s twenty. We count that. Because we don’t take pride in throwing more servers or databases at such problems.

In the December monthly challenge, we discovered that we had defied our own rule of reading data from the cache and were instead making multiple hits to the database and S3. To make the situation worse, the read operations from S3 were expensive in terms of time. Worse still, this was happening for the users’ code that gets loaded in the code editor for each language. Cumulatively, each visit to a problem page was now sending about 30 more queries to the database, plus some other operations that were expensive in time and memory. The ludicrous thing was that it only happened on the first page load, as after that the cache was being accessed. To salvage the situation, we threw 2-3 more servers at your requests, bringing everything back to normal in 16 minutes. When we found the cause, it turned out to be a recent deployment of an upcoming feature. We also found that we already had the data in the cache and didn’t even need to query the database or read from S3. Mistakes happen when you are building a complex product with a team of just 3-4 engineers, but I agree we should have been more careful.
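
The rule described above is the usual cache-aside read: look in the cache first and touch the database or S3 only on a miss. Here is a minimal sketch of that pattern, not our production code. A plain dict stands in for memcached, and `load_from_store` is a hypothetical callable standing in for the expensive read.

```python
class CacheAside(object):
    """Cache-aside reader; a dict stands in for memcached here."""

    def __init__(self, load_from_store):
        self._cache = {}
        self._load = load_from_store  # hypothetical expensive read (database or S3)
        self.store_reads = 0          # how often we fell through to the store

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        # Cache miss: read from the store once and remember the result.
        self.store_reads += 1
        value = self._load(key)
        self._cache[key] = value
        return value


templates = CacheAside(lambda lang: "// default code for %s" % lang)
for _ in range(3):
    templates.get("python")
print(templates.store_reads)
```

Only the first call reaches the store; the other two are served from the cache, which is the behaviour the faulty deployment had bypassed.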

Btw, the feature was CodePlayer which got announced yesterday.

####Current Architecture at HackerEarth

To give you more insight into the architecture at HackerEarth, here are the different types of production servers that we run. They are not necessarily on different machines, and some of them run as multiple instances load balanced behind ELB or HAProxy, depending on the server.

Frontend server(s)
API server(s)
Code-checker server(s)
Search server(s) - Apache Solr & Elastic Search
Realtime server - written using Tornado
Status server
Toolchain server (Mainly used for continuous deployment)
Integration Test server (For integration testing of commits before deploying in production)
Log server
Memcached server
A few more servers for data crunching, processing our analytics database and background jobs.

There are many other components like RabbitMQ, Celery, etc. which glue many of the servers together. Then there are monitoring servers, which monitor all the other servers and also push the data to the status server. Our databases are sharded and load balanced behind HAProxy. You might be interested in reading about our technology stack, though it’s a bit outdated and many layers have evolved since then. This investment in infrastructure allows us to take more breaks and roam the streets of Bangalore while our servers are happily serving thousands of requests every minute.

####Dream for tomorrow

As we grow and scale further, the infrastructure and product will grow even beyond comprehension. But from the very beginning, we have been clear about one thing: we will continue to do this right, whatever may come in the way. Today we can proudly say that we reached a user base of 50,000 in a breeze, hosted over 1000 contests, and are stronger than ever in our vision to solve the problem of technical recruiting. The products in the pipeline are going to redefine many things, while our contests will become even more robust. The journey ahead is like never before, and the nights are sleepless because of the excitement. We are here to make a dent, while singing and dancing all along the way.

‘Tis the witching hour of night,
Orbed is the moon and bright,
And the stars they glisten, glisten,
Seeming with bright eyes to listen…
…
I sing an infant’s lullaby,
A pretty lullaby.

John Keats

P.S. If you are associated with your college programming club, reach out to me directly at vivek@hackerearth.com. Refer to this quora answer.

If you are a recruiter, do try out HackerEarth Recruit. It’s free to signup and get started with. If you face any issue, please send us an email at support@hackerearth.com and we will be in touch with you asap.

Let me know if you have any comments, suggestions or anything else.

Posted by Vivek Prakash. Follow me @vivekprakash. Write to me at vivek@hackerearth.com.

http://engineering.hackerearth.com/2014/01/22/programming-challenges-uptime-mistakes

Introducing CodePlayer - watch your code like a movie

Jan 21, 2014

Ever thought of sharing the solution of a coding problem with someone in the form of a video, to teach them how you implemented it? Or wanted to see the thought process of a potential employee when he/she solves a difficult problem? Of course you have. But it was not possible to watch it as seamlessly as a movie until now.

Today, we are announcing the release of HackerEarth’s CodePlayer, which does exactly that.

tl;dr Watch a demo code video or Make your own code video.

How it works

CodePlayer is tightly integrated with all code editors on HackerEarth and our sister site, CodeTable. That is, wherever you find a code editor on our sites, CodePlayer is activated too. By choice, we have built it to be activated automatically in stealth mode on page load or on the first keystroke.

Our code editor is built on top of Ace Editor. In ace editor, keystrokes or deltas can be captured and applied programmatically using Ace API. This is where the idea of playing code like a video seemed possible.

CodePlayer was built on the following line of development:

Setup Video
Recording Keystrokes
Playing Video

Setup Video

After the code editor is loaded, an AJAX POST request is made to the server to set up the video info in a Django model.

/**
 * This function sends ajax request to setup video.
 * This code is made generic to integrate it with code editor anywhere in site.
 * @arg video_obj: contains video info
 * @arg callback: called on success
 * @arg callback_arg_obj: above callback argument 
 */
function setup_video(video_obj, callback, callback_arg_obj) {
    $.ajax({
        url: '*****',
        type: 'POST',
        data: video_obj,
        dataType: 'json',
        callback: callback,
        callback_arg_obj: callback_arg_obj,
        success: function(response_obj) {
            this.callback(response_obj, this.callback_arg_obj);
        },  
        error: function(err) {
        }   
    }); 
}

Video model (Backend)

class CodeVideo(Generic):
    final_code = models.TextField(null=True) # used as thumbnail
    lang = models.CharField(max_length=10)
    last_updated = models.DateTimeField(null=True)
    owner_id = models.PositiveIntegerField(null=True)
    uuid = UUIDField(auto=True) # unique id of video

Recording Keystrokes

One of the difficult tasks in recording keystrokes is reducing the number of web requests and database insert queries. On average there are 1k keystrokes in a single instance of the code editor. So with only 100 active users on the site, there would be 1000 * 100 web requests and db insert queries.

Our web servers are capable of handling that many requests, but we didn’t want to waste resources. So we used batch requests, in which keystrokes are first grouped locally (in JavaScript) and then sent to the web servers in batches. Below is the JS code that enqueues a keystroke/changeset.

/**
 * Enqueues changeset in changeset queue.
 */
this.enqueue_changeset = function(delta, source, timestamp) {
    if(!delta)
        return false;

    var changeset = {
        delta: delta,
        source: source,
        // Use the caller's timestamp when given, otherwise the current time.
        timestamp: timestamp || new Date().getTime()
    };
    this.changeset_queue.push(changeset);

    return true;
};

On the backend, instead of multiple insert queries, we use the bulk_create method of Django’s model API for batch inserts.

Now, one question is still unanswered: at what intervals are these batch requests sent? Actually, there is no fixed interval. A batch request is sent after an inactivity period of 3 seconds in the editor (i.e. the user has stopped typing for at least 3 seconds) if there are still keystrokes left to be sent. If the user tries to move away from the page or close the tab, all pending changesets are sent to the server immediately. This ensures that no changes are lost.
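
The effect of the inactivity rule can be sketched offline. The real logic runs in the browser with timers. The function below is only an illustration with an assumed name: it takes keystroke times in seconds and groups them the way the 3-second rule would.

```python
def batch_by_inactivity(timestamps, idle_seconds=3.0):
    """Group sorted keystroke times (in seconds) into batches.

    A batch is closed once the gap to the next keystroke is at least
    idle_seconds, mirroring "send after 3 seconds of inactivity".
    """
    batches = []
    current = []
    for t in timestamps:
        if current and t - current[-1] >= idle_seconds:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        # Remaining keystrokes go out on the final idle period or on page unload.
        batches.append(current)
    return batches


keystrokes = [0.0, 0.4, 0.9, 5.0, 5.2, 12.0]
print(batch_by_inactivity(keystrokes))
```

Here six keystrokes turn into three requests instead of six, and the saving grows with longer bursts of typing.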

‘All changes saved’ animation on the top-right of editor confirms that a batch request is sent successfully.

Often people don’t write code continuously. There can be large intervals between successive coding sessions which means video length will increase drastically. To tackle this problem, we added a CodeSession model.

class CodeSession():
    code_video = models.ForeignKey(CodeVideo)
    initial_code = models.TextField()
    # session start time
    start = models.DateTimeField()
    # session end time
    end = models.DateTimeField()

Now, total video length is calculated using formula:

Σ_i (session_end_time_i - session_start_time_i), summed over all sessions i

The time interval between successive sessions is kept to at least 2 minutes.
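
As a worked example of the formula, the sketch below splits a list of changeset times into sessions and sums their lengths. How sessions are actually detected on the server is not shown in this post, so the splitting rule here (a new session after a pause of at least 2 minutes) and both function names are assumptions.

```python
def split_into_sessions(timestamps, min_gap_seconds=120):
    """Split sorted changeset times (seconds) into (start, end) sessions.

    A new session starts whenever the pause since the previous changeset
    is at least min_gap_seconds.
    """
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][1] < min_gap_seconds:
            sessions[-1][1] = t      # extend the current session
        else:
            sessions.append([t, t])  # open a new session
    return [tuple(s) for s in sessions]


def video_length(sessions):
    """Sum of (end - start) over all sessions."""
    return sum(end - start for start, end in sessions)


sessions = split_into_sessions([0, 30, 90, 600, 640])
print(sessions, video_length(sessions))
```

The 510-second pause between the two sessions does not count, so the video is 130 seconds long rather than 640.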

####Playing Video

We decided to deploy the ‘recording’ functionality before ‘playing’ so that we could generate some data and test the scalability of the system. Frankly, we were scared, as we had just hacked the whole code editor session and user code rendering systems. The downside could have been that all contests went for a toss. But we tested extensively before deployment and the results were amazing. Since deployment in the last week of December 2013, we already have over 100,000 code videos.

Now that we had the data, it was a matter of playing it back using the Ace API.

Each delta/changeset has a timestamp associated with it, which is converted to video time according to the video length. All these deltas are scheduled and then applied programmatically (using applyDeltas) according to their video time. This, basically, is the play functionality.
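
One plausible way to do this conversion is to count only the time spent inside sessions and skip the idle gaps between them. The sketch below works under that assumption; the function name and the (start, end) session format are illustrative and not taken from the CodePlayer source.

```python
def to_video_time(timestamp, sessions):
    """Map a wall-clock timestamp to a position in the video.

    sessions is a sorted list of (start, end) pairs; time between
    sessions is skipped, so it contributes nothing to the video.
    """
    elapsed = 0
    for start, end in sessions:
        if timestamp <= end:
            # Inside this session, or in the idle gap just before it.
            return elapsed + max(0, timestamp - start)
        elapsed += end - start
    return elapsed  # after the last session: clamp to the video length


sessions = [(0, 90), (600, 640)]
print(to_video_time(620, sessions))
```

A changeset made 20 seconds into the second session lands at 110 seconds of video time: the 90 seconds of the first session plus 20.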

/**
 * Applies deltas/changesets, slides seekbar
 */
var play_timeout = function(changeset) {
    return function() {
        // Apply delta
        if(changeset.delta) {
            var delta = changeset.delta;
            editor.moveCursorToPosition(delta.range.start);
            var doc = new Document(editor.getValue());
            doc.applyDeltas([delta]);
            editor.setValue(doc.getValue(), 1);
            if(delta.action=='removeText')
                editor.moveCursorToPosition(delta.range.start);
            else
                editor.moveCursorToPosition(delta.range.end);
            video_state['cursor_position'] = delta.range.end;
        }
        // Save video state
        video_state['session_index'] = changeset['session_index'];
        video_state['changeset_index'] = changeset['changeset_index'];
        video_state['time'] = changeset['video_time'];
        video_state['code'] = editor.getValue();
        // Slide seekbar to last applied delta time
        seekbar.slider('value', changeset.video_time);
    }
};

To pause video at any point, all scheduled timeouts are cleared.

/**
 * Pauses video.
 */
this.pause = function() {
    // Clear all previously scheduled play timeouts.
    for(var i=0; i<play_timeout_ids.length; i++) {
        clearTimeout(play_timeout_ids[i]);
    }
    play_timeout_ids = [];
    video_playing = false;
    var play_id = this.player_elements['play_id'];
    show_play_menu_button(play_id);
};

Changing speed of video was a piece of cake. Say user clicks on 5x, divide timestamp(t) by 5 i.e. t=t/5.

/**
 * Time after which deltas will be applied.
 */
var play_after = function(video_time) {
    // Relative video time
    var r_video_time = video_time-video_time_copy;
    // Convert to milliseconds
    r_video_time *= 1000;
    // Divide by play speed
    r_video_time /= play_speed;
    return r_video_time;
};

####Make your own code video

I know you are excited to try out our CodePlayer. Go to http://code.hackerearth.com and start writing code. As soon as the default code is changed, you will see a ‘Replay Code’ button. Click on it to watch the video.

If you are writing code on HackerEarth too, you will see the buttons under ‘Replay your code in CodePlayer’ on the right-hand side. Each language has a separate link to its code video. You can share the link to a code video with anyone; no login or other form of access is required to view it. Try solving the Fizz Buzz Test and then watch the video of your code.

A side effect of building CodePlayer was that auto-save in the code editor came for free. Now you never need to worry about losing the code you have written in the editor. The next time you log in and visit the page, your latest code will be there for you.

PS: I worked on this project during my winter internship at HackerEarth. I also worked there as a summer intern. Read my summer internship experience here.

PPS: I will be joining the folks at HackerEarth full time after graduation in summer 2014 :)

Posted by Lalit Khattar. Follow me @LalitKhattar

http://engineering.hackerearth.com/2014/01/21/introducing-codeplayer

Scaling Python/Django application with Apache and mod_wsgi

Nov 21, 2013

HackerEarth is primarily based on Python & Django. And we use Apache with mod_wsgi for hosting the application. There is a general complaint that Apache sucks when it comes to hosting Python web applications. It’s said that it’s slow, bloated, uses lots of memory and doesn’t perform very well. It’s also said that it doesn’t handle a high number of concurrent requests.

All that is true if you are not running the Python application in the right way. If configured properly, Apache works fantastically and is rarely the reason for slowness. That is almost always due to application bottlenecks and database latency.

Now I am ashamed to admit that we ran the HackerEarth frontend servers for a long time under a bad configuration of Apache and mod_wsgi. This came up when we started load testing our servers with thousands of concurrent connections. Also, we previously used to see memory usage go on a rocket trajectory when faced with a sudden spike in traffic, which forced us to scale up more than was actually required.

There are a couple of reasons for excessive memory usage when running Python applications in Apache. First of all, such applications are very heavy to start with. The multi-processing module (apache2-mpm-prefork) that comes by default with Apache makes it even worse, and a poor configuration of it is a disaster waiting to happen. And most importantly, there are tons of Apache modules installed and loaded into memory, while most of them are never going to be used.

If Apache is set up properly, taking the Python web application and the machine’s resource constraints into account, it is fast and reliable. I will explain further what we did at HackerEarth. But before that, I will present some real data from the experiments that we did.

We created an exact replica of one of our production machines, the kind that directly handles requests and runs the Apache server. Then we tweaked the Apache configuration one setting at a time and recorded the improvement. Each time, we sent 5000 requests at 100 req/sec to the optimized machine, and also to one of the production machines. The results were unbelievable.

Stats with optimized apache configuration

Total: connections: 5000, requests: 4673, replies: 4626, test-duration: 64.938 s

Connection rate: 77.0 conn/s (13.0 ms/conn, <=1004 concurrent connections)

Request rate: 72.0 req/s (13.9 ms/req)

Errors: total 374 client-timo 369 socket-timo 0 connrefused 0 connreset 5
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

These statistics show that out of 5000 connections made, 4626 replies were received with the timeout set to 10 seconds. More importantly, 72 requests/sec were made and there were 1000+ concurrent connections at one time. After further tweaking, we improved this further, bringing the errors down to the 0-5 range.

But let’s have a look at what happened on one of the production machines.

Stats with default apache configuration

Total: connections 2024 requests 363 replies 0 test-duration 94.801 s

Connection rate: 21.4 conn/s (46.8 ms/conn, <=1022 concurrent connections)

Request rate: 3.8 req/s (261.2 ms/req)

Errors: total 5000 client-timo 2024 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 2976 addrunavail 0 ftab-full 0 other 0

If you look closely, you would say Holy Shit! Without explaining in detail what happened here, it’s sufficient to say that at that number of requests, zero replies were received. The machine reported 100% memory and CPU utilization almost instantaneously after the requests were made. Apache didn’t know how to handle this situation and just went on creating more processes, bringing everything to a standstill. This is what CPU utilization looked like each time we sent some heavy requests to it.
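
The two runs can be compared with one number: the share of the 5000 attempted connections that got a reply. A small sketch using the figures from the stats above (the helper name is illustrative):

```python
def reply_rate(connections, replies):
    """Percentage of attempted connections that received a reply."""
    return 100.0 * replies / connections


optimized = reply_rate(5000, 4626)
default = reply_rate(5000, 0)
print("optimized: %.1f%%, default: %.1f%%" % (optimized, default))
```

That is about 92.5% of connections answered on the optimized machine against 0% on the default configuration.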

Let’s go through the steps one by one to properly setup Apache and mod_wsgi.

1. Remove unnecessary modules

The first thing to do is to remove all the unnecessary Apache modules that have been installed and are being loaded at runtime. At the end, only these modules should be enabled:

mod_alias.so
mod_authz_host.so
mod_deflate.so
mod_mime.so
mod_negotiation.so
mod_rewrite.so
mod_wsgi.so

You may also want to enable the following modules for debugging purposes. I will explain their usage later in this article. You can skip them on a production setup.

mod_status.so
mod_info.so

Before stripping down, about twice as many modules were being loaded, including libphp5.so. Now why would we ever want to do that!

2. Use Apache MPM worker

Now purge apache2-mpm-prefork. It was built for PHP type applications which were not multi-thread safe. You can safely ditch it for the Python application, and only good days will follow.

On UNIX systems there are two main MPMs that are used. These are the prefork MPM and the worker MPM. The prefork MPM implements a multi process configuration where each process is single threaded. The worker MPM implements a multi process configuration but where each process is multi threaded.

On Ubuntu, you need to follow these steps:

sudo apt-get purge apache2-mpm-prefork
sudo apt-get install apache2-mpm-worker apache2-threaded-dev

You can read about MPMs in detail here.

3. KeepAlive Off

KeepAlive: Whether or not to allow persistent connections (more than one request per connection). Set to “Off” to deactivate.

KeepAlive can be turned off when you are not serving static files from the same server, the case in which On is more beneficial. We serve static files from CloudFront, so we decided to turn it off after doing dozens of experiments. As you might have guessed, the caveat is that a new connection is created for every request. But the advantage is that processes/threads are free to handle new requests instantaneously rather than waiting for a request to arrive on an older connection.

4. Daemon Mode of mod_wsgi

By default, the setup is usually embedded mode + mpm-prefork, which is an absolute affliction if not understood properly. It’s best to set up mod_wsgi in daemon mode + mpm-worker and properly configure the MPM settings. We can get away with this setup almost always on a machine with limited memory and CPU cores. Embedded mode + mpm-prefork is less forgiving when it encounters a sudden spike in traffic: it will keep creating processes and swamp the whole machine, leaving it useless when you need it most.

When using daemon mode, the number of processes and threads is constant, which makes the resource consumption predictable. Also, with mod_wsgi running in daemon mode, when the Python web application is updated you just need to update the modification timestamp of the WSGI file using touch. In embedded mode, you would have to restart the Apache server. And try doing that when you are getting even a meagre 20 requests/sec on a production machine!
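
If the reload is triggered from a deployment script rather than a shell, the same effect as touch can be had from Python. This is a minimal sketch; the function name is illustrative and the path you pass in is whatever WSGI script file your Apache configuration points at.

```python
import os


def touch_wsgi_file(path):
    """Set the file's access and modification times to now, like `touch`."""
    os.utime(path, None)
```

Calling `touch_wsgi_file` on the WSGI script after a deploy updates only its timestamp; the file contents are left unchanged.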

There is a fantastic post on this topic by Graham Dumpleton - the author of mod_wsgi. You can read this here.

5. Tweaking mpm-worker configuration

After doing several experiments on a replica of our production machine, we arrived at this configuration.

<IfModule mpm_worker_module>
    StartServers         2
    MinSpareThreads      10
    MaxSpareThreads      25
    ThreadLimit          25
    ThreadsPerChild      25
    MaxClients           75
    MaxRequestsPerChild   0
</IfModule>

This configuration enforces following rules:

Initial number of server processes started is two.
Maximum number of clients is restricted to 75.
Each process has 25 threads.
Maximum number of processes that could be created is 75/25 = 3.
Our process size is ~220 MB (very very fat, I know!), so that means we only need ~660 MB in the worst case.
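
The arithmetic behind these rules can be written out as a tiny helper. The function name is illustrative; the inputs are the MaxClients and ThreadsPerChild values from the configuration above and our approximate process size.

```python
def worker_capacity(max_clients, threads_per_child, process_size_mb):
    """Return (max processes, worst-case memory in MB) for mpm-worker."""
    max_processes = max_clients // threads_per_child
    return max_processes, max_processes * process_size_mb


print(worker_capacity(75, 25, 220))
```

With 75 clients, 25 threads per process and ~220 MB per process, this gives 3 processes and ~660 MB in the worst case.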

It turns out that our application is more CPU intensive than memory intensive in the way it’s written. So, we intentionally restricted the number of processes although we had much higher memory on the production box, so that we are ready for the worst case.

6. Check configuration

The two modules mod_status.so and mod_info.so give direct information on how Apache is being run. Put this snippet in your httpd.conf file.

<Location /server-status>
SetHandler server-status

Order Deny,Allow
Allow from all
</Location>

<Location /server-info>
SetHandler server-info

Order Deny,Allow
Allow from all
</Location>

Now access http://yourdomain.com/server-info/. It will show a page with the following information, in addition to tons of other useful info.

Loaded Modules: 
mod_wsgi.c, mod_status.c, mod_rewrite.c, mod_negotiation.c, mod_mime.c, mod_info.c, mod_deflate.c, mod_authz_host.c, mod_alias.c, mod_so.c, http_core.c, worker.c, mod_logio.c, mod_log_config.c, core.c

Server Settings
MPM Name: Worker
MPM Information: Max Daemons: 3 Threaded: yes Forked: yes

http://yourdomain.com/server-status/ will give information about the running server status.

Current Time: Thursday, 21-Nov-2013 03:08:29 CST
Restart Time: Thursday, 21-Nov-2013 00:27:16 CST
Parent Server Generation: 0
Server uptime: 2 hours 41 minutes 13 seconds
Total accesses: 5798 - Total Traffic: 56.5 MB
CPU Usage: u17.8 s7.75 cu.01 cs0 - .264% CPU load
3.1 requests/sec - 6.0 kB/second - 10.0 kB/request
6 requests currently being processed, 19 idle workers

All this significantly reduced the number of servers we had to run and made the application more stable and resilient to traffic bursts. Besides these, we have done tons of optimizations in the application itself and written it in a very distributed fashion. The frontend application server is just 10% of the whole server stack, but it’s the most important as it interfaces directly with the website user. I will talk very soon about the other optimizations which have enabled us to comfortably adapt to sudden bursts in traffic without any sweat.

(We might have been sweating because we were in our weekend dance class while all servers were getting bombarded :D)

On a related note, you might be interested in reading engineering behind our database scaling.

Let me know if you have any comments, suggestions or any potential quirks in what all I have written here. Besides if you are passionate about solving such problems everyday, we are hiring.

Posted by Vivek Prakash. Follow me @vivekprakash. Write to me at vivek@hackerearth.com.

http://engineering.hackerearth.com/2013/11/21/scaling-python-django-application-apache-mod_wsgi

Scaling database with Django and HAProxy

Oct 7, 2013

###MySQL - Primary data store

At HackerEarth, we use MySQL as the primary data store. We have experimented with a few NoSQL databases along the way, but the results have been largely unsatisfactory. In our experience, distributed databases like MongoDB or CouchDB weren’t very scalable or stable. Right now, our status monitoring services use RethinkDB to store data in JSON format, and that is the extent of our NoSQL usage.

With growing data and number of requests/sec, the database becomes the major bottleneck to scaling the application dynamically. If at this point you are thinking that there are mythical (cloud) providers who can handle the growing needs of your application, you couldn’t be more wrong. To make the problem even harder, you can’t spin up a new database whenever you want, the way you can with frontend servers. Achieving horizontal scalability at all levels requires a massive rearchitecture of the system while remaining completely transparent to the end user. This is what a part of our team has focussed on in the last few months, resulting in very high uptime and availability.

The master (and only) MySQL database had recently come under heavy load. We had thought we would delay any scalability work at this level for as long as the single database could handle the load, and work on other high priority things. But that did not go as planned and we experienced a few downtimes. After that we rearchitected our application, sharded the database, wrote database routers and wrappers on top of the Django ORM, put an HAProxy load balancer in front of the MySQL databases, and refactored our codebase to optimize it significantly.

The image below shows a part of the architecture we have at HackerEarth. Many other components have been omitted for simplicity.

###Database slaves and router

The idea was to create read replicas and route the write queries to the master database and the read queries to the slave (read replica) databases. But again, that was not so simple. We couldn’t, and wouldn’t want to, route all read queries to the slaves. Some read queries couldn’t afford stale data, which is an inherent part of database replication. Though the data might be stale by only a few seconds, this small number of read queries couldn’t afford even that.

The first database router was simple:

class MasterSlaveRouter(object):
    """
    Represents the router for database lookup.
    """
    def __init__(self):
        if settings.LOCAL:
            self._SLAVES = []
        else:
            self._SLAVES = SLAVES

    def db_for_read(self, model, **hints):
        """
        Reads go to default for now.
        """
        return 'default'

    def db_for_write(self, model, **hints):
        """
        Writes always go to default.
        """
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        """
        Relations between objects are allowed if both objects are
        in the default/slave pool.
        """
        db_list = ('default',) + tuple(self._SLAVES)

        if obj1._state.db in db_list and obj2._state.db in db_list:
            return True
        return None

    def allow_migrate(self, db, model):
        return True

All the write and read queries go to the master database, which you might think is weird here. Instead, we wrote get_from_slave(), filter_from_slave(), get_object_or_404_from_slave(), get_list_or_404_from_slave(), etc. on top of the Django ORM in our custom managers to read from a slave. So whenever we know we can read from the slaves, we call one of these functions. This was a sacrifice made for the small number of read queries which couldn’t afford stale data.

Custom database manager to fetch data from slave:

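import random

from django.conf import settings
from django.db import models
from django.db.models import Q

# proxy_slave_X is the HAProxy endpoint, which does load balancing
# over all the databases.
SLAVES = ['proxy_slave_1', 'proxy_slave_2']

def get_slave():
    """
    Returns a slave randomly from the list, or 'default' when no
    slaves are configured (e.g. in local development).
    """
    if settings.LOCAL or not SLAVES:
        return 'default'

    return random.choice(SLAVES)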

class BaseManager(models.Manager):
    # Wrappers to read from slave databases. using() selects the database
    # for this one query only, leaving the manager's default untouched.
    def get_from_slave(self, *args, **kwargs):
        return super(BaseManager, self).get_query_set().using(
                get_slave()).get(*args, **kwargs)

    def filter_from_slave(self, *args, **kwargs):
        return super(BaseManager, self).get_query_set().using(
                get_slave()).filter(*args, **kwargs).exclude(
                Q(hidden=True) | Q(trashed=True))

###HAProxy for load balancing

The number of slaves could change at any time. One option was to update the database configuration in settings whenever we added or removed a slave. But that was very cumbersome and inefficient. The better way was to put an HAProxy load balancer in front of all the databases and let it detect which ones are up or down and route the read queries accordingly. This would mean never editing the database configuration in our codebase, which is just what we wanted.
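
To make the idea concrete, here is a toy model of what the load balancer does for us: hand out servers in round-robin order while skipping any that failed their last health check. This is a sketch of the concept in plain Python, not how HAProxy is implemented, and the class and method names are made up for illustration.

```python
import itertools


class RoundRobinPool(object):
    """Toy model of round-robin balancing over servers that pass a health check."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._cycle = itertools.cycle(self._servers)

    def mark(self, server, is_up):
        """Record the result of a health check for one server."""
        if is_up:
            self._healthy.add(server)
        else:
            self._healthy.discard(server)

    def next_server(self):
        """Return the next healthy server, skipping any that are down."""
        for _ in range(len(self._servers)):
            server = next(self._cycle)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy database servers")


pool = RoundRobinPool(["db00", "db01", "db02"])
pool.mark("db01", False)
print([pool.next_server() for _ in range(4)])
```

With db01 marked down, reads alternate between db00 and db02, and the application code never needs to know which servers are in rotation.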

A snippet of /etc/haproxy/haproxy.cfg:

listen mysql *:3305
    mode tcp
    balance roundrobin
    option mysql-check user haproxyuser
    option log-health-checks
    server db00 db00.xxxxx.yyyyyyyyyy:3306 check port 3306 inter 1000
    server db01 db00.xxxxx.yyyyyyyyyy:3306 check port 3306 inter 1000
    server db02 db00.xxxxx.yyyyyyyyyy:3306 check port 3306 inter 1000

The configuration for slave in settings now looked like this:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'db_name',
        'USER': 'username',
        'PASSWORD': 'password',
        'HOST': 'db00.xxxxx.yyyyyyyyyy',
        'PORT': '3306',
    },
    'proxy_slave_1': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'db_name',
        'USER': 'username',
        'PASSWORD': 'password',
        'HOST': '127.0.0.1',
        'PORT': '3305',
    },
    'analytics': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'db_name',
        'USER': 'username',
        'PASSWORD': 'password',
        'HOST': 'db-analytics.xxxxx.yyyyyyyyyy',
        'PORT': '3306',
    },
}

But there is a caveat here too. If you spin up a new server with a haproxy configuration containing endpoints which don’t exist, haproxy will throw an error and won’t start, making the slave useless. It turns out there is no easy solution to this, and haproxy.cfg has to contain only existing server endpoints at startup. The solution then was to let the webserver update its haproxy configuration from a central location whenever it starts. We wrote a simple script in fabric to do this. Besides, the webserver already updated its binary when spun up from an old image.

###Database sharding

Next, we sharded the database. We created another database, analytics. It stores all the computed data, which accounts for a major part of the read queries. All queries to the analytics database are routed using the following router:

class AnalyticsRouter(object):
    """
    Represents the router for analytics database lookup.
    """
    def __init__(self):
        if settings.LOCAL:
            self._SLAVES = []
            self._db = 'default'
        else:
            self._SLAVES = []
            self._db = 'analytics'

    def db_for_read(self, model, **hints):
        """
        All reads go to analytics for now.
        """
        if model._meta.app_label == 'analytics':
            return self._db
        else:
            return None

    def db_for_write(self, model, **hints):
        """
        Writes always go to analytics.
        """
        if model._meta.app_label == 'analytics':
            return self._db
        else:
            return None

    def allow_relation(self, obj1, obj2, **hints):
        """
        Relations are allowed if either object belongs to the
        analytics app; otherwise the decision is left to the next router.
        """

        if obj1._meta.app_label == 'analytics' or \
                obj2._meta.app_label == 'analytics': 
            return True
        else:
            return None

    def allow_migrate(self, db, model):
        if db == self._db:
            return model._meta.app_label == 'analytics'
        elif model._meta.app_label == 'analytics':
            return False
        else:
            return None

To enable the two routers, we need to add them in our global settings:

DATABASE_ROUTERS = ['core.routers.AnalyticsRouter', 'core.routers.MasterSlaveRouter']

Here the order of routers is important. All the queries for analytics are routed to the analytics database, and all the other queries are routed to the master database or its slaves according to the nature of the query. For now, we have not put slaves behind the analytics database, but as the usage grows that will be fairly straightforward to do.
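
Django asks each router in DATABASE_ROUTERS in turn and takes the first answer that is not None. The sketch below imitates that lookup in plain Python to show why the order matters. The two toy routers take an app label instead of a model class, and the master/slave stand-in is assumed to answer every read with 'default'.

```python
class ToyAnalyticsRouter(object):
    def db_for_read(self, app_label):
        # Only has an opinion about the analytics app.
        return "analytics" if app_label == "analytics" else None


class ToyMasterSlaveRouter(object):
    def db_for_read(self, app_label):
        # Assumption: this stand-in sends every read to the master.
        return "default"


def route_read(routers, app_label):
    # The first router that returns something other than None wins.
    for router in routers:
        chosen = router.db_for_read(app_label)
        if chosen is not None:
            return chosen
    return "default"


analytics_first = [ToyAnalyticsRouter(), ToyMasterSlaveRouter()]
master_first = [ToyMasterSlaveRouter(), ToyAnalyticsRouter()]

good = route_read(analytics_first, "analytics")
bad = route_read(master_first, "analytics")
```

With the analytics router listed first, analytics reads reach the analytics database. With the order swapped, the catch-all router answers first and the analytics router is never consulted.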

In the end, we had an architecture where we could spin up new read replicas, route the queries fairly simply, and have a high performance load balancer in front of the databases. All this has resulted in much higher uptime and stability in our application, and we could focus more on what we love to do - building products for programmers. We already had an automated deployment system in place, which made the experimentation easier and enabled us to test everything thoroughly. The refactoring and optimization that we did in the codebase and architecture also helped us cut the server count by more than half. This has been a huge win for us, and we are now focusing on rolling out exciting products in the next few weeks. Stay tuned!

I would love to hear how others have solved similar problems, along with any suggestions and potential quirks you can point out.

P.S. You might be interested in The HackerEarth Data Challenge that we are running.

Posted by Vivek Prakash. Follow me @vivekprakash. Write to me at vivek@hackerearth.com.

http://engineering.hackerearth.com/2013/10/07/scaling-database-with-django-and-haproxy

The HackerEarth Data Challenge

Sep 13, 2013

40,000+ programmers use HackerEarth. Every day, people from all over India and other countries submit code on HackerEarth, solve problems and participate in online coding tests. Our CodeFactory server has processed over 500,000 requests so far. Different types of challenges run every month. The technology stack consists of multiple servers of different types, e.g. search-server, realtime-server, web-server, log-server, etc., running at any time. Over 100,000 lines of code are running to serve your requests, and we deploy a dozen times every day.

And we have been able to achieve that with very high uptime all along. To make this possible, we have written many monitoring services for our backend. The status page listing a few of the services is now publicly available at http://status.hackerearth.com/.

To make it even more interesting, we are making the data collected by the status monitoring services public. All the data is in JSON format, there are over 800,000 documents, and they are available in a schema-less database - RethinkDB. Are you curious what that data looks like? There might be a gold rush in there, and we invite you to find that gold: find something interesting in the data and show what you can do with it. There are umpteen stories to uncover, you just need to dig!

###Data Access

The data is available in JSON format in RethinkDB. Following are the details of host, database and table:

Endpoint: status-data-challenge.hackerearth.com
Port: 80
Database name: careerstack
Tables
- hackerearth_status: for HackerEarth webserver
- api_status: for HackerEarth API
- realtime_status: for Realtime server
- code_checker_status: for CodeChecker server
- celery_status: for task queue
- rabbitmq_status: for message queue
Web UI: http://status-data-challenge.hackerearth.com:8080/

To get started, you need to install rethinkdb-client drivers on your machine.

The query language is very simple and easy to get. You should go through RethinkDB QL for getting started with the database query.

Below is sample Python code for reading the hackerearth_status table:

import rethinkdb as r

# connect to rethinkdb
r.connect(host='status-data-challenge.hackerearth.com', db='careerstack', port=80).repl()

# get first 10 JSON data from hackerearth_status table
data = r.table('hackerearth_status').slice(0, 10).run()

for d in data:
    print d

It prints the following output in the console:

{u'status': 200, u'response_received_time': 1374794680.909191, u'message': u'OK', u'request_time': 1374794680.889908, u'id': u'00004e52-2934-4446-850a-39414ab2e64e'}
{u'status': 200, u'response_received_time': 1375733282.5145938, u'message': u'OK', u'request_time': 1375733282.4913878, u'id': u'00019762-f0f7-4bf2-8a99-c9d6a148e647'}
{u'status': 200, u'response_received_time': 1374901211.485287, u'message': u'OK', u'request_time': 1374901211.444667, u'id': u'0003d0d1-b4fc-4efe-8bbb-566241bc19de'}
{u'status': 200, u'response_received_time': 1373712173.611913, u'message': u'OK', u'request_time': 1373712173.500822, u'id': u'0003e502-4a97-4214-9ebb-4264314c1523'}
{u'status': 200, u'response_received_time': 1376753907.3009229, u'message': u'OK', u'request_time': 1376753904.8444479, u'id': u'000608b2-529e-41de-956e-70b50ff8c585'}
{u'status': 200, u'response_received_time': 1374598010.706149, u'message': u'OK', u'request_time': 1374598008.899234, u'id': u'00078bfc-c9ac-4fce-a660-1a4bd6de6148'}
{u'status': 200, u'response_received_time': 1375150088.026571, u'message': u'OK', u'request_time': 1375150088.00546, u'id': u'000acc7f-f26d-499a-a395-d4c914fcf7c4'}
{u'status': 200, u'response_received_time': 1379057035.742864, u'message': u'OK', u'request_time': 1379057035.7246768, u'id': u'000b31df-9987-4f38-b0a9-61c659670317'}
{u'status': 200, u'response_received_time': 1374929387.17773, u'message': u'OK', u'request_time': 1374929387.126168, u'id': u'000b81e6-2e7e-491f-906a-c818c95a96f1'}
{u'status': 200, u'response_received_time': 1375804570.5828838, u'message': u'OK', u'request_time': 1375804570.4476259, u'id': u'000e3149-de7a-4b93-acb9-4eef5b030299'}

The data format varies a little for code_checker_status, celery_status, and rabbitmq_status. They have more key-value data in message.

As you might have noticed, each document has the following primary key-value pairs:

id: Unique identifier for the JSON document
status: Status code returned from the service
message: Message returned from the service ping-pong
request_time: Number of seconds since epoch when the service was pinged
response_received_time: Number of seconds since epoch when the service responded

You might have realized that response_received_time - request_time is the service latency.
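
As a quick illustration, the latency can be computed from documents shaped like the sample output above. This is a small sketch over three of those sample documents (trimmed to the relevant keys); the function and variable names are only for illustration.

```python
samples = [
    {"status": 200, "request_time": 1374794680.889908,
     "response_received_time": 1374794680.909191},
    {"status": 200, "request_time": 1376753904.8444479,
     "response_received_time": 1376753907.3009229},
    {"status": 200, "request_time": 1374598008.899234,
     "response_received_time": 1374598010.706149},
]


def latency_seconds(doc):
    # Both timestamps are seconds since the epoch.
    return doc["response_received_time"] - doc["request_time"]


latencies = sorted(latency_seconds(doc) for doc in samples)
fastest = latencies[0]
slowest = latencies[-1]
mean_latency = sum(latencies) / len(latencies)
```

From here it is a short step to grouping latencies by hour or by service and plotting them.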

To see the data stored in the code_checker_status table, copy-paste the following script into the Data Explorer in the Web UI and hit ‘Run’.

r.db('careerstack').table('code_checker_status').slice(0, 1)

The data explorer and Web UI are completely exposed. This means anyone can delete the data too, but please don’t. In any case, the data is restored to its original state every 10 minutes by a periodic asynchronous task.

Querying data is very easy and intuitive in RethinkDB, in multiple languages. For creating visualizations in the frontend, we recommend using d3js, a very good JavaScript library for creating graphs and other visualizations. You can read the basic tutorials here.

###To Enter Data Challenge

You have to host your code repository on Github and send a link to the repository along with images of your graph(s), table(s), or any other data analysis to vivek@hackerearth.com before midnight, October 13, 2013 IST.

###Prizes

We will vote on the favorite visualization and there will be a cash prize of $100 for the top entry. The winning entry will be featured in our blog. We will also send HackerEarth T-shirts to the next 5 entries. Winners will be announced in the week of October 21st, 2013.

Good luck with Gold rush!

Posted by Vivek Prakash. Follow me @vivekprakash. Write to me at vivek@hackerearth.com.

http://engineering.hackerearth.com/2013/09/13/the-hackerearth-data-challenge

HackerEarth API v2: Introducing asynchronous callbacks

Sep 9, 2013

We had already published HackerEarth API v1 in February, 2012 at http://developer.hackerearth.com. The API v1 was synchronous in nature. This means that your request kept hanging until the code evaluation was done and response was received. This seriously limited anyone from writing robust applications using the API.

We have been using the asynchronous API for a long time ourselves at HackerEarth for processing all the code submissions. It’s fast, it’s robust and works flawlessly. Today, we are making the async API public.

There is no major change in the way an API request is made, if you have already checked out the synchronous API v1. Now, you get a confirmation response as soon as you make the request. The actual response containing the code evaluation data arrives later at the callback URL as POST data, with the response contained in the ‘payload’ POST parameter. The response format is always the same, and it is JSON.

Below is a technical description of how the Asynchronous API works.

The API defines the following endpoints:

http://api.hackerearth.com/code/compile/
http://api.hackerearth.com/code/run/

If you haven’t already got your secret key, register here.

To make an asynchronous request, you now just need to pass 1 as the value of the ‘async’ parameter, along with the other required parameters.

    
    import urllib

    COMPILE_URL = 'http://api.hackerearth.com/code/compile/'
    RUN_URL = 'http://api.hackerearth.com/code/run/'

    CLIENT_SECRET = '5db3f1c12c59caa1002d1cb5757e72c96d969a1a'

    source = open('sample.c').read()
    lang = 'C'

    post_data = {
        'client_secret': CLIENT_SECRET,
        
        # Asynchronous mode on
        'async': 1,

        'source': source,
        'lang': lang,
        'time_limit': 5,
        'memory_limit': 262144,

        # Id to keep track of request
        'id': 123,

        # Callback URL where processed response will later arrive
        # - this response is same as the response received in
        # synchronous API request.
        'callback': 'http://example.com/receive-hackerearth-response/'
    }

    post_data = urllib.urlencode(post_data)

    response = urllib.urlopen(RUN_URL, post_data)
    print "post_data: ",post_data
    print
    print response.read()

Output from the above script is:

post_data:  lang=C&source=%23include+%3Cstdio.h%3E%0A%0Aint+main%28%29+%7B%0A++++printf%28%22Hello%22%29%3B%0A++++int+n%3B%0A++++scanf%28%22%25d%22%2C+%26n%29%3B%0A++++printf%28%22%5Cn%25d%5Cn%22%2C+n%29%3B%0A++++return+0%3B%0A%7D%0A&callback=http%3A%2F%2Fexample.com%2Freceive-hackerearth-response%2F&async=1&time_limit=5&client_secret=4df77c2c2eb62f9adb20bd1127f6f44a4ce6cda4&id=123&memory_limit=262144

{"errors": {}, "code_id": "3d255bX", "id": 123, "message": "OK", "run_status": {"status": "NA", "time_limit": 5, "async": 1, "memory_limit": 262144}, "compile_status": "Compiling...", "web_link": "http://code.hackerearth.com/3d255bX"}

In the asynchronous API, ‘id’ and ‘callback’ are mandatory parameters. ‘id’ is given by client and is returned in the response. It’s also returned in the response sent later at callback URL. You might have noticed the following map in the JSON response:

"run_status": {"status": "NA", "time_limit": 5, "async": 1, "memory_limit": 262144}
"compile_status": "Compiling..."

The run_status.status emits ‘NA’, which means the result is not yet available. The compile_status shows ‘Compiling…’ which means the code compilation has started.
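
On the client side, the confirmation can be checked with a few lines of Python. This is a minimal sketch that parses the confirmation shown above; `is_pending` is a hypothetical helper name, not part of the API.

```python
import json

confirmation = json.loads(
    '{"errors": {}, "code_id": "3d255bX", "id": 123, "message": "OK", '
    '"run_status": {"status": "NA", "time_limit": 5, "async": 1, '
    '"memory_limit": 262144}, "compile_status": "Compiling...", '
    '"web_link": "http://code.hackerearth.com/3d255bX"}'
)


def is_pending(response):
    # 'NA' means the run result is not available yet and will
    # arrive later at the callback URL.
    return response.get("run_status", {}).get("status") == "NA"


pending = is_pending(confirmation)
request_id = confirmation["id"]
```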

The response received at the callback URL that you gave in the API request can be processed as follows:

    import json

    from django.http import HttpResponse

    def api_response(request):
        payload = request.POST.get('payload', '') 
        
        """
        This payload is in JSON format. You need to load it using
        native method to convert it into dictionary for easy operations
        later on.
        """
        payload = json.loads(payload)
        print payload

        """
        {u'errors': {}, u'code_id': u'3d255bX', u'web_link': u'http://code.hackerearth.com/3d255bX', u'compile_status': u'OK', u'id': u'123', u'async': 1, u'run_status': {u'status': u'AC', u'memory_used': u'64', u'output_html': u'Hello<br>1</br>', u'time_used': u'0.1006', u'signal': u'OTHER', u'status_detail': u'N/A', u'output': u'Hello\\n1\\n'}, u'message': u'OK'}
        """

        run_status = payload.get('run_status')
        o = run_status['output']
        print o

        """
        Hello
        1
        """
        return HttpResponse('API Response Received!')

The data is received as a POST request, and the JSON response is contained in the payload key of the POST request dictionary. request.POST.get('payload') returns the JSON response.

{u'errors': {}, u'code_id': u'3d255bX', u'web_link': u'http://code.hackerearth.com/3d255bX', u'compile_status': u'OK', u'id': u'123', u'async': 1, u'run_status': {u'status': u'AC', u'memory_used': u'64', u'output_html': u'Hello<br>1</br>', u'time_used': u'0.1006', u'signal': u'OTHER',u'status_detail': u'N/A', u'output': u'Hello\\n1\\n'}, u'message': u'OK'}

payload has the id that was sent in the request. This way, if you are sending many requests in a batch, you can tell on the client side which response arriving at the callback URL belongs to which request. payload also contains the full information about run_status and compile_status, giving you all the data about the code evaluation.
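
One way to do this bookkeeping is a dictionary of pending requests keyed by id. The sketch below is an assumption about how a client might do it, not part of the API; note that in the sample above the id was sent as the number 123 but came back as the string '123', so the keys are normalised to strings.

```python
import json

# request id (as a string) -> whatever the client wants to remember
pending_requests = {}


def register_request(request_id, source_name):
    pending_requests[str(request_id)] = source_name


def handle_callback(payload_json):
    payload = json.loads(payload_json)
    request_id = str(payload["id"])
    # pop() also guards against handling the same callback twice.
    source_name = pending_requests.pop(request_id, None)
    if source_name is None:
        return None
    return source_name, payload["run_status"]["status"]


register_request(123, "sample.c")
callback_body = '{"id": "123", "run_status": {"status": "AC"}}'
first = handle_callback(callback_body)
second = handle_callback(callback_body)
```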

The asynchronous API has many advantages, with no limit on the number of requests per second. You can also build your own website for code evaluation just using the API. If you still haven’t got your secret key, click here. For more detailed information, please go through the documentation at http://developer.hackerearth.com. Now build something interesting and dazzle the world :)

Contributions are welcome for improving the documentation of http://developer.hackerearth.com. You can fork the repository at https://github.com/HackerEarth/developer.hackerearth.com and send pull requests. Sample code in multiple languages is highly welcome.

Posted by Vivek Prakash. Follow me @vivekprakash. Write to me at vivek@hackerearth.com.

http://engineering.hackerearth.com/2013/09/09/hackerearth-api-v2-asynchronous-callbacks

Continuous Deployment System

Aug 5, 2013

This is one of the coolest and most important things we recently built at HackerEarth. What’s so cool about it? Just have a little patience, you will soon find out. But make sure you read till the end :)

I will try to make this post as resourceful and clear as possible, so that people who have always wondered how to implement a Continuous Deployment System (CDS) can gain insights.

At HackerEarth, we iterate over our product quickly and roll out new features as soon as they are production ready. In the last two weeks, we deployed 100+ commits in production, and a major release comprising 150+ commits is scheduled to launch within a few days. Those commits consist of changes to the backend app, website, static files, database and much more. We have over a dozen different types of servers running, e.g. webserver, code-checker server, log server, wiki server, realtime server, NoSQL server, etc. And all of them are running on multiple ec2 instances at any point of time. Our codebase is still tightly integrated as one single project, with many different components required for each server. And when there are changes to the codebase, all the related servers and components need to be updated when deploying in production. Doing that manually would have just driven us crazy, and would have been a total waste of time!

See the table of commits deployed on a single day, and that too on a lighter day!

With such speed of work, we needed an automated deployment system along with automated testing. Our implementation of CDS helps the team roll out features in production with just a single command: git push origin master. Also, another reason to use CDS is that we are trying to automate the crap out of everything, and I see us going in the right direction.

####CDS Model

The process begins with a developer pushing a bunch of commits from their master branch to the remote repository, which in our case is set up on Bitbucket. We have set up a post hook on Bitbucket, so as soon as Bitbucket receives commits from a developer, it generates a payload (containing information about the commits) and sends it to the toolchain server.

The toolchain server back-end receives the payload and filters the commits by branch, ignoring merge commits and any commit that is not on the master branch.

    from django.conf import settings

    def filter_commits(branch=settings.MASTER_BRANCH, all_commits=None):
        """
        Filter commits by branch
        """
        commits = []

        # Walk the commits in reverse so that we have branch info in first
        # commit. reversed() leaves the caller's list untouched, and the None
        # default avoids sharing one mutable list between calls.
        for commit in reversed(all_commits or []):
            if commit['branch'] is None:
                parents = commit['parents']
                # Ignore merge commits for now
                if len(parents) > 1:
                    # It's a merge commit and
                    # We don't know what to do yet!
                    continue

                # Check if we just stored the child commit.
                for lcommit in commits:
                    if commit['node'] in lcommit['parents']:
                        commit['branch'] = branch
                        commits.append(commit)
                        break
            elif commit['branch'] == branch:
                commits.append(commit)

        # Restore commits order
        commits.reverse()
        return commits

Filtered commits are then grouped intelligently using a file dependency algorithm.

    from collections import deque

    def group_commits(commits):
        """
        Creates groups of commits based on file dependency algorithm
        """

        # List of groups
        # Each group is a list of commits
        # In list, commits will be in the order they arrived
        groups_of_commits = []

        # Visited commits
        visited = {}

        # Store order of commits in which they arrived
        # Will be used later to sort commits inside each group
        for i, commit in enumerate(commits):
            commit['index'] = i

        # Loop over commits
        for commit in commits:
            queue = deque()

            # This may be one of the group in groups_of commits,
            # if not empty in the end
            commits_group = []

            commit_visited = visited.get(commit['raw_node'], None)
            if not commit_visited:
                queue.append(commit)

            while len(queue):
                c = queue.popleft()
                visited[c['raw_node']] = True
                commits_group.append(c)
                dependent_commits = get_dependent_commits_of(c, commits)

                for dep_commit in dependent_commits:
                    commit_visited = visited.get(dep_commit['raw_node'], None)
                    if not commit_visited:
                        queue.append(dep_commit)
            
            if len(commits_group)>0:
                # Remove duplicates
                nodes = []
                commits_group_new = []
                for commit in commits_group:
                    if commit['node'] not in nodes:
                        nodes.append(commit['node'])
                        commits_group_new.append(commit)
                commits_group = commits_group_new

                # Sort list using index key set earlier
                commits_group_sorted = sorted(commits_group, key= lambda
                        k: k['index'])
                groups_of_commits.append(commits_group_sorted)

        return groups_of_commits
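
The helper get_dependent_commits_of is not shown here. A minimal sketch of what it might look like, assuming two commits depend on each other when they touch at least one common file and that each commit dict carries a 'files' list of {'file': path} entries (both are assumptions about the payload, not confirmed details). The small grouping function is a condensed stand-in for group_commits so that the example runs on its own.

```python
from collections import deque


def touched_files(commit):
    # Assumed payload shape: commit["files"] is a list of {"file": path}.
    return set(entry["file"] for entry in commit["files"])


def get_dependent_commits_of(commit, commits):
    # Hypothetical helper: every other commit sharing at least one file.
    files = touched_files(commit)
    return [other for other in commits
            if other["raw_node"] != commit["raw_node"]
            and files & touched_files(other)]


def group_by_shared_files(commits):
    order = dict((c["raw_node"], i) for i, c in enumerate(commits))
    visited = set()
    groups = []
    for start in commits:
        if start["raw_node"] in visited:
            continue
        visited.add(start["raw_node"])
        queue = deque([start])
        group = []
        # Breadth-first walk over the "shares a file" relation.
        while queue:
            current = queue.popleft()
            group.append(current)
            for dep in get_dependent_commits_of(current, commits):
                if dep["raw_node"] not in visited:
                    visited.add(dep["raw_node"])
                    queue.append(dep)
        # Keep the commits of a group in arrival order.
        group.sort(key=lambda c: order[c["raw_node"]])
        groups.append(group)
    return groups


def make_commit(node, paths):
    return {"raw_node": node, "files": [{"file": p} for p in paths]}


commits = [
    make_commit("a1", ["app/views.py"]),
    make_commit("b2", ["static/site.css"]),
    make_commit("c3", ["app/views.py", "app/urls.py"]),
    make_commit("d4", ["app/urls.py"]),
]
groups = group_by_shared_files(commits)
group_nodes = [[c["raw_node"] for c in group] for group in groups]
top_commits = [nodes[-1] for nodes in group_nodes]
```

Here a1, c3 and d4 end up in one group because they are chained through shared files, while b2 forms its own group, so only two commits (d4 and b2) would need a test run instead of four.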

The top commit of each group is sent to the integration test server for testing via rabbitmq. At first I wrote code which sent every commit for testing, but it was too slow, so Vivek suggested grouping the commits from the payload and running tests on the top commit of each group, which drastically reduces the number of times tests are run.

Integration tests are run on the integration test server. There is a separate branch called test on which tests are run. Commits are cherry-picked from master onto the test branch. The integration test server is a simulated setup that replicates production behavior. If the tests pass, the commits are put in a release queue from where they are released in production. Otherwise the test branch is rolled back to the previous stable commit, and clean up actions are performed, including notifying the developer whose commits failed the tests.

####Git Branch Model

In the previous section you might have noticed that we use three branches, namely master, test and release. Master is the one where developers push their code. This branch can be unstable. The test branch is for the integration test server and the release branch is for the production servers. The release and test branches move in parallel and are always stable. As we write more and more tests, the chance of a bad commit being deployed in production will keep shrinking.

####Django Models

Each commit(or revision) is stored in database. This data is helpful in many circumstances like finding previously failed commits, relate commits to each other using file dependency algorithm, monitoring deployment etc.

The Django models used are:

Revision- commit_hash, commit_author, etc
Revision Status- revision_id, test_passed, deployed_on_production etc.
Revision Files- revision_id, file_path
Revision Dependencies

When top commit of each group is passed to integration test server, we first find its dependencies i.e. previously failed commits using file dependency algorithm and save it in Revision Dependencies model so that next time we can directly query from database.

from collections import deque
from operator import attrgetter

def get_dependencies(revision_obj):
    dependencies = set()
    visited = {}

    queue = deque()
    filter_id = revision_obj.id
    queue.append(revision_obj)
    
    while len(queue):
        rev = queue.popleft()
        visited[rev.id] = True
        dependencies.add(rev)
        dependent_revs = get_all_dependent_revs(rev, filter_id)
        
        for rev in dependent_revs:
            r_visited = visited.get(rev.id, None)
            if not r_visited:
                queue.append(rev)
    # Remove the revision from its own dependencies set.
    # Makes sense, right?
    dependencies.remove(revision_obj)
    dependencies = list(dependencies)
    dependencies = sorted(dependencies, key=attrgetter('id'))
    return dependencies 
    
def get_all_dependent_revs(rev, filter_id):
    deps = rev.health_dependency.all()
    if len(deps)>0:
        return deps

    files_in_rev = rev.files.all()
    files_in_rev = [f.filepath for f in files_in_rev]
    
    reqd_revisions = Revision.objects.filter(files__filepath__in=files_in_rev, id__lt=filter_id, status__health_status=False) 
    return reqd_revisions

As mentioned earlier in the overview section, these commits are then cherry-picked onto the test branch from the master branch and the process continues.

####Deploying on Production

Commits that passed integration tests are now ready to be deployed, but there are a few things to keep in mind when deploying code on production, like restarting the webserver, deploying static files, running database migrations, etc. The toolchain code intelligently decides which servers to restart, whether to collect static files or run database migrations, and which servers to deploy on, based on what changes were made in the commits. As you might have realized, we decide all this from the types and categories of files changed, modified or deleted in the commits to be released.
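
To illustrate the idea of deciding from file types, here is a deliberately simple sketch. The rules, the extensions and the function name plan_deployment are all hypothetical; the real toolchain rules are not shown in this post.

```python
import os

STATIC_EXTENSIONS = (".css", ".js", ".png")


def plan_deployment(changed_files):
    # Start with nothing to do and switch actions on as files require them.
    plan = {
        "restart_webserver": False,
        "collect_static": False,
        "run_migrations": False,
    }
    for path in changed_files:
        extension = os.path.splitext(path)[1]
        if "/migrations/" in path:
            plan["run_migrations"] = True
        if extension == ".py":
            plan["restart_webserver"] = True
        if extension in STATIC_EXTENSIONS:
            plan["collect_static"] = True
    return plan


static_only = plan_deployment(["static/css/site.css"])
backend_change = plan_deployment(
    ["app/migrations/0002_add_field.py", "app/views.py"])
```

A push that only touches a stylesheet then needs no webserver restart, while a push with a migration and a view change triggers both a migration run and a restart.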

You might also have realized that we control deployment on the production and test servers from the toolchain server (the one which receives the payload from Bitbucket). We use fabric for this purpose. A great tool indeed for executing remote administrative tasks!

from fabric.api import run, env, task, execute, parallel, sudo
@task
def deploy_prod(config, **kwargs):
    """
    Deploy code on production servers.
    """

    revision = kwargs['revision']
    commits_to_release = kwargs['commits_to_release']

    revisions = []
    for commit in commits_to_release:
        revisions.append(Revision.objects.get(raw_node=commit))

    result = init_deploy_static(revision, revisions=revisions, config=config,
                                commits_to_release=commits_to_release)
    is_restart_required = toolchain.deploy_utils.is_restart_required(revisions)
    if result is True:
        init_deploy_default(config=config, restart=is_restart_required)

The whole process takes about 2 minutes to deploy a group of commits or a single push on all machines. This has made our life a lot easier: we no longer fear pushing our code, and we can see a feature, bug fix or anything else live in production in just a few minutes. Undoubtedly, this will also help us release new features without wasting much time. Now deploying is as simple as writing code and testing on a local machine. We also deployed our 100th commit in production a few days ago using automated deployment, which stands testimony to the robustness of this system.

P.S. I am an undergraduate student at IIT Roorkee. You can find me @LalitKhattar or on HackerEarth.

Posted by Lalit Khattar, Summer Intern 2013 @HackerEarth

http://engineering.hackerearth.com/2013/08/05/continuous-deployment-system

Scheduling emails with celery in Django

Jun 5, 2013

After a long journey with Django, you come to a place where you feel the need to get some tasks done asynchronously without any human supervision. Some tasks need to be scheduled to run once at a particular time or after some delay, and some tasks have to be run periodically, like crontab. One such task is sending emails on specific triggers.

Here at HackerEarth, one of the major chunks of emails is sent to recruiters and participants after a contest is finished or when a participant triggers the finish-test button. Till now we had done this using crontab. But things have changed, and scaling such a process is time and resource consuming. Also, having a crontab process look into the database to see whether any task is due is not a good method, at least for tasks that have to run only once in their lifetime.

####Django-Celery

Django-Celery comes to the rescue here. Celery gets tasks done asynchronously and also supports scheduling of tasks. Integrating Celery with a Django codebase is easy enough; you just need some patience to go through the steps given on the official Celery site. There are two sides to Celery: broker and worker. Celery requires a solution to send and receive messages, and usually this comes in the form of a separate service called a message broker. We use the default broker, RabbitMQ, to get this done. The worker fetches tasks from the queue and runs them asynchronously at the time for which they were scheduled. You will have to download the celery init scripts to run the worker as a daemon in production. You can get those init scripts from GitHub.

This is the configuration we used to run celery in our project:

# Name of nodes to start
CELERYD_NODES="w1 w2 w3"

# Where to chdir at start.
CELERYD_CHDIR="/hackerearth/"

# How to call "manage.py celeryd_multi"
CELERYD_MULTI="$CELERYD_CHDIR/manage.py celeryd_multi"

# How to call "manage.py celeryctl"
CELERYCTL="$CELERYD_CHDIR/manage.py celeryctl --settings=settings.hackerearth_settings"

# Extra arguments to celeryd
CELERYD_OPTS="--time-limit=300 --concurrency=8"

# %n will be replaced with the nodename.
CELERYD_LOG_FILE="/var/log/celery/%n.log"
CELERYD_PID_FILE="/var/run/celery/%n.pid"

# Workers should run as an unprivileged user.
CELERYD_USER="hackerearth"
CELERYD_GROUP="hackerearth"

# Name of the projects settings module.
export DJANGO_SETTINGS_MODULE="settings.hackerearth_settings"

####Another Problem

After linking triggers to send emails once the contest time is over or the participant has finished the test prematurely, everything was working properly. Now I could easily schedule a task to run asynchronously at any time. But I ran into a problem: there is no way to check whether a particular task associated with some Model instance has already been scheduled. This happens when there is more than one trigger for the same task, and it can easily happen in a fairly complicated system. To get this done I had to store the task_id together with that model instance in the database using a generic ContentType. So here is the hack that I came up with:

Generic ModelTask

This model stores the information of the scheduled task (task_id, name) and the information of the Model instance with which the task is associated.

from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models

class ModelTask(models.Model):
    """
    For storing all scheduled tasks
    """
    task_id = models.CharField(max_length = 36)
    name = models.CharField(max_length = 200)
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')

    def __unicode__(self):
        return "%s - %s" % (self.name, self.content_object)

    @staticmethod
    def create(async_result, instance):
        return ModelTask.objects.create(task_id=async_result.task_id,
                name=async_result.task_name, content_object=instance)

    @staticmethod
    def filter(task, instance):
        content_type = ContentType.objects.get_for_model(instance)
        object_id = instance.id
        return ModelTask.objects.filter(content_type=content_type,
                object_id=object_id, name=task.name)

A custom overridden task decorator ‘model_task’

Overrides the methods ‘apply_async’ and ‘AsyncResult’, and attaches a new method, ‘exists_for’.

import types

from django.db import models
from celery import task

from appname.models import ModelTask

def model_task(*args, **kwargs):
    def dec(func):
        task_dec = task(*args, **kwargs)
        task_instance = task_dec(func)

        def exists_for(self, instance):
            return ModelTask.filter(self,instance).exists()
        task_instance.exists_for = types.MethodType(exists_for, task_instance)

        def apply_async(self, *args, **kwargs):
            instance = kwargs.pop('instance',None)
            async_result = super(type(self), self).apply_async(*args, **kwargs)
            if instance and not self.exists_for(instance):
                ModelTask.create(async_result, instance)
            return async_result
        task_instance.apply_async = types.MethodType(apply_async, task_instance)

        def AsyncResult(self, *args, **kwargs):
            if args and isinstance(args[0], models.Model) and\
                    self.exists_for(args[0]):
                task_id = ModelTask.filter(self, args[0])[0].task_id
                return super(type(self), self).AsyncResult(task_id)
            else:
                return super(type(self), self).AsyncResult(*args, **kwargs)
        task_instance.AsyncResult = types.MethodType(AsyncResult, task_instance)

        return task_instance
    return dec

That’s it.

####The Use Case

Participation Model

This model contains the information of a User participating in an Event.

class Participation(models.Model):
    user = models.ForeignKey(User)
    event = models.ForeignKey(Event)
    ...
    ...

Task for sending email to participant

@model_task()
def send_email_on_participation_complete(participation):
    # code for sending email
    ...
    ...

Scheduling the task

duration = calculate_duration_in_seconds(participation)

# The extra keyword argument 'instance' is necessary as it will create a 
# ModelTask object.
send_email_on_participation_complete.apply_async((participation,),
        countdown=duration, instance=participation)

Check whether a task associated with a participation object has already been scheduled

is_scheduled_before = send_email_on_participation_complete.exists_for(participation)

Get the AsyncResult object

# Returns the async_result object of the scheduled task that is associated
# with the given Model instance (participation in our case)
async_result = send_email_on_participation_complete.AsyncResult(participation)

# Gives the status of the scheduled task: PENDING/STARTED/SUCCESS/FAILURE
async_result.status

# Contains the return value of the task (None in our case)
async_result.result
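
With these pieces, guarding against double scheduling comes down to checking exists_for before calling apply_async. The sketch below mimics that check with an in-memory dictionary standing in for the ModelTask table, so it runs without Django or Celery. The names `TaskRegistry`, `schedule_once` and `FakeParticipation` are hypothetical and exist only for this illustration.

```python
import uuid


class TaskRegistry(object):
    """In-memory stand-in for the ModelTask table (illustration only)."""

    def __init__(self):
        # (task name, model class name, object id) -> task_id
        self._task_ids = {}

    def _key(self, task_name, instance):
        return (task_name, type(instance).__name__, instance.id)

    def exists_for(self, task_name, instance):
        return self._key(task_name, instance) in self._task_ids

    def record(self, task_name, instance, task_id):
        self._task_ids[self._key(task_name, instance)] = task_id

    def task_id_for(self, task_name, instance):
        return self._task_ids.get(self._key(task_name, instance))


def schedule_once(registry, task_name, instance, schedule):
    """Call schedule() only if no task of this name is recorded for instance.

    schedule is any callable that enqueues the task and returns its id.
    Returns the id of the task that ends up scheduled for the instance.
    """
    if registry.exists_for(task_name, instance):
        return registry.task_id_for(task_name, instance)
    task_id = schedule()
    registry.record(task_name, instance, task_id)
    return task_id


class FakeParticipation(object):
    """Hypothetical model instance; only the id attribute matters here."""

    def __init__(self, id):
        self.id = id


registry = TaskRegistry()
participation = FakeParticipation(id=7)

# Two triggers fire for the same participation; only the first one schedules.
first = schedule_once(registry, "send_email", participation,
                      lambda: str(uuid.uuid4()))
second = schedule_once(registry, "send_email", participation,
                       lambda: str(uuid.uuid4()))
print(first == second)
```

Note that this check-then-record step is not atomic. With a real database and concurrent triggers, something like a unique constraint would probably be needed as well.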

All this replaced the cron jobs, custom scripts and some manual tasks with a robust task (email) scheduling mechanism. This also lays the foundation for triggering many other types of tasks on top of django-celery architecture set up by me. And this will certainly make us more efficient and help us to focus on other core products, while tasks are performed asynchronously and we can enjoy the awesome weather on a fine day! :)

P.S. I am an undergraduate student at IIT Roorkee. You can reach out to me at shubham@hackerearth.com with any suggestions, bugs or improvements. You can also find me @ShubhamJain.

Posted by Shubham Jain, Summer Intern 2013 @HackerEarth

http://engineering.hackerearth.com/2013/06/05/scheduling-emails-with-celery-in-django

The Robust Realtime Server

May 31, 2013

This is going to be a long blog post but I promise you will find some interesting piece of engineering here, so stay till the end.

The realtime server manages the live update of webpages when the data changes in the data storage system (database or cache). We had a realtime server in-place but there was a big problem with scaling it.

####Problem with nowjs

I was told beforehand that my first major task, among many other things, would be writing a realtime server. Vivek Prakash told me he had written a realtime server implementation some time ago with nowjs. The problem is that it doesn’t scale well beyond ~200 simultaneous connections. In a conversation on Google Groups, I came across this:

In my experience, the underlying “socket.io” module is not able to scale well (more than 150 connections was a problem for me), so I had to retreat from using “nowjs” or more specifically, “socket.io” in one of my applications.

After further inspection, we also saw a file descriptor leak: the nowjs server reported ENFILE/EMFILE (Too many open files). Also, the nowjs project was abandoned in 2012, and the last commit in its GitHub repo is a year old. So we needed a good alternative that can handle a large number of simultaneous connections (or users). I didn’t have to do much research, as Vivek had already looked into it. He found Tornado and Meteor.js to be good alternatives. Going by order of preference and popularity I chose Tornado, also because its integration with the existing system looked simpler and more efficient.

####The Use Case

Vivek explained to me how the different components of code submission work. Here is a quick explanation. The user submits the code and a POST request is sent to the webserver, which in turn sends the submission details to a message queue in the RabbitMQ server (a message broker that connects various application components). The code-checker engine (the RabbitMQ consumer here) gets the submission details, evaluates the code and submits the result back to another message queue. It also notifies the webservers about the result so that the appropriate database entry is made. The whole process is completely asynchronous. An AMQP listener takes the result out of the message queue and finally sends it to the client (browser) using the nowjs communication APIs. The flowchart will give you a good idea of how the different components are connected.

Now my first job was to replace the nowjs module with Tornado.

####A basic implementation

Let’s code! I knew the Tornado server had to read submission results from the message queue and send them back to the submission page (or ‘pages’, in case the user has opened the same problem page in more than one tab). I used the pika module inside the Tornado IO loop to connect to RabbitMQ and read messages from it. On the client side I used an HTML5 WebSocket to connect to the Tornado server. This basic implementation was completed in two days.

#####Frontend *(code snippet)*
    ...
    function openWebSocketConnection() {
        var ws = new WebSocket(url);

        // On websocket open
        ws.onopen = function() {
            var clientInfo = {
                'name': getName()
            }
            clientInfo = JSON.stringify(clientInfo);
            ws.send(clientInfo);
        }

        // On receiving a message from the server
        ws.onmessage = function(evt) {
            onReceiveMessage(JSON.parse(evt.data));
        };
    }
    ...

#####Backend (code snippet)

    ...
    class RealtimeWebSocketHandler(websocket.WebSocketHandler):

        def on_message(self, message):
            '''
            Messages received from client
            '''
            message = json.loads(message)
            self.name = message['name']
            #listen to submission result messages from rabbitmq
            self.application.pc.add_listener(self)
            ...

        def on_close(self):
            #remove listener
            self.application.pc.remove_listener(self)

        def send(self, message):
            '''
            Send response back to client
            '''
            self.write_message(message)

    class PikaClient(object):

        def connect(self):
            '''
            Connect to RabbitMQ Server
            '''
            ...
                
        def on_message(self, channel, method, header, body):
            '''
            Submission result messages from RabbitMQ
            '''
            message = json.loads(body)
            self.notify_listeners(message)
            ...

        def notify_listeners(self, message):
            listeners = self.listeners.copy() 

            for listener in listeners:
                #send message to specific listener
                if listener.name==message['name']:
                    listener.send(message)
    ...
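
The snippet above leaves out how listeners are registered. As a minimal sketch, assuming the listeners live in a plain Python set (the `copy()` call hints at this, but the original registry is not shown), the bookkeeping could look like the following. `ListenerRegistry`, `FakeListener` and `OneShotListener` are hypothetical names. Iterating over a copy matters because a listener may unregister itself while it is being notified.

```python
class ListenerRegistry(object):
    """Keeps track of connected clients and routes messages by name."""

    def __init__(self):
        self.listeners = set()

    def add_listener(self, listener):
        self.listeners.add(listener)

    def remove_listener(self, listener):
        # discard() does not raise if the listener is already gone
        self.listeners.discard(listener)

    def notify_listeners(self, message):
        """Send message to every listener with a matching name.

        Returns the number of listeners that received it.
        """
        delivered = 0
        # Iterate over a copy: send() may remove the listener from the set.
        for listener in self.listeners.copy():
            if listener.name == message['name']:
                listener.send(message)
                delivered += 1
        return delivered


class FakeListener(object):
    """Stand-in for a connected browser tab."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def send(self, message):
        self.received.append(message)


class OneShotListener(FakeListener):
    """Unregisters itself after one message, like a long-polling request."""

    def __init__(self, name, registry):
        FakeListener.__init__(self, name)
        self.registry = registry

    def send(self, message):
        FakeListener.send(self, message)
        self.registry.remove_listener(self)


registry = ListenerRegistry()
tab_one = FakeListener("alice")
tab_two = OneShotListener("alice", registry)
other = FakeListener("bob")
for listener in (tab_one, tab_two, other):
    registry.add_listener(listener)

delivered = registry.notify_listeners({"name": "alice", "id": 1})
print(delivered)
```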

####Testing locally

Everything worked as expected in modern browsers, but not when I tested it on IE 7, 8 and 9. My reaction was: “IE sucks, man!” Of course, these IE versions don’t support WebSocket; how did that not occur to me? So I was left with only one option: write a fallback implementation using long polling (also called comet programming) on both the client and the server side. But wait, the problem was not yet solved. Cross-domain requests are not supported by default, and the HackerEarth webserver and the realtime server (Tornado) are on different top-level domains. I had to use either CORS long polling (supported only in major browsers, but more secure) or JSONP long polling (supported in every browser, but insecure). I eventually used both. Here is a code snippet:

#####Frontend
    function connectToTornado() {
        // check for browser's websocket support
        if("WebSocket" in window) {
            openWebSocketConnection();
        }
        // fallback to cross origin requests(CORS) long polling
        else if($.support.cors || "XDomainRequest" in window) {
            openCORSLongPollingConnection();
        }
        // fallback to jsonp long polling
        else {
            // setTimeout is used to suppress the loading sign
            // in some old browsers.
            setTimeout(openJSONPLongPollingConnection, 1);
        }
    }

    function openCORSLongPollingConnection() {
        $.ajax({
            ...
            data: {
                'name':getName()
            },
            dataType: "json",
            success: function(data) {
                onReceiveMessage(data);
                // connect again to receive new messages
                window.setTimeout(openCORSLongPollingConnection, 0);
            },
            ...
        });
    }

    function openJSONPLongPollingConnection() {
        $.ajax({
            ...
            data: {
                'name':getName(),
                'callback':'success' // **do not delete this line**
            },
            dataType: "jsonp", // must for jsonp
            jsonp: false,
            jsonpCallback: 'success', // **do not delete this line**
            success: function(data) {
                onReceiveMessage(data);
                // connect again to receive new messages
                window.setTimeout(openJSONPLongPollingConnection, 0);
            },
            ...
        });
    }

#####Backend
    class RealtimeLongPollingHandler(web.RequestHandler):

        @tornado.web.asynchronous
        def post(self):
            self.transport = 'cors'
            self.name = urllib.unquote(self.request.body.split('=')[1])
            self.application.pc.add_listener(self)

        @tornado.web.asynchronous
        def get(self):
            self.transport = 'jsonp'
            self.name = self.get_argument('name', None)
            self.callback = self.get_argument('callback', None)
            self.application.pc.add_listener(self)

        def on_connection_close(self):
            self.application.pc.remove_listener(self)

        def send(self, message):
            if(self.transport=='jsonp' and self.callback is not None):
                self.finish(self.callback+'('+json.dumps(message)+')')
            elif(self.transport=='cors'):
                self.finish(message)
            self.application.pc.remove_listener(self)

####What if the client gets disconnected?

On slow internet connections, especially with browsers using long polling, messages sometimes get lost. So I had to create a buffer to store unsent messages. When the client (browser) reconnects, the server looks in the buffer for the latest message, sends it back to the client, and then listens again for new messages from RabbitMQ. Here is a code snippet:

#####Reconnect with tornado  *(JavaScript)*
    function reconnectToTornado(callback) {
        // Sleep time is constant after 5 minutes
        if(errorSleepTime<300000)
            errorSleepTime *= 2;
        // callback can be websocket, cors, jsonp function
        window.setTimeout(callback, errorSleepTime);
    }
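
The reconnect delay above doubles on each failure until it reaches five minutes. As a rough illustration in Python, the same schedule can be written as below. The function names, the 1-second starting delay and the use of `min()` to clamp exactly at the cap are assumptions for this sketch, not the original code.

```python
def next_sleep_ms(current_ms, cap_ms=300000):
    """Double the delay, but never exceed cap_ms (5 minutes by default)."""
    return min(current_ms * 2, cap_ms)


def backoff_schedule(start_ms, attempts, cap_ms=300000):
    """Return the delays used for the given number of reconnect attempts."""
    delays = []
    delay = start_ms
    for _ in range(attempts):
        delay = next_sleep_ms(delay, cap_ms)
        delays.append(delay)
    return delays


schedule = backoff_schedule(1000, 10)
print(schedule)
```

Unlike the JavaScript snippet, which can overshoot the five-minute mark on its last doubling, this variant clamps the delay to the cap exactly.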
 
#####Look for unsent messages *(Python)*
    ...
        newest_message = self.application.pc.unsent_messages.newest(self.name)
        if(newest_message is not None):
            self.send(newest_message)
        else:
            self.application.pc.add_listener(self)
    ...

    class UnsentMessageBuffer(object):

        def __init__(self):
            #unsent messages deque with capacity of 1000 objects
            self.messages = collections.deque([], maxlen=1000)

        def add(self, message):
            self.messages.append(message)

        def remove(self, message):
            try:
                self.messages.remove(message)
            except ValueError:
                # message is no longer in the buffer
                pass

        def newest(self, client_name):
            messages = copy.deepcopy(self.messages)
            newest_message = None
            for message in messages:
                if(message['name']==client_name):
                    if(newest_message is None or message['id']>newest_message['id']):
                        newest_message = message
                    self.remove(message)
            return newest_message

I also made sure that older messages don’t replace newer ones in the browser if delivery is out of order. I also wrote a lot of fallback code to prevent issues in older browsers, tested with a thousand simultaneous connections, and fixed some bugs that were already present. And that’s all that I did in just two weeks! Everything was now working fine on my local machine.
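
To illustrate the ordering guard, here is a minimal sketch in Python (the real check runs in the browser, in JavaScript). It assumes that each message carries an `id` that grows with every new result, as the `message['id']` comparison in the buffer code suggests. `LatestMessageFilter` is a hypothetical name.

```python
class LatestMessageFilter(object):
    """Accepts a message only if it is newer than the last one shown."""

    def __init__(self):
        self.last_seen_id = None

    def accept(self, message):
        """Return True if the message should be displayed."""
        if self.last_seen_id is not None and message['id'] <= self.last_seen_id:
            # Stale or duplicate message that arrived late: ignore it.
            return False
        self.last_seen_id = message['id']
        return True


message_filter = LatestMessageFilter()
# Message 2 arrives after message 3 and must not overwrite it.
arrivals = [{'id': 1}, {'id': 3}, {'id': 2}, {'id': 4}]
shown = [m['id'] for m in arrivals if message_filter.accept(m)]
print(shown)
```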

A few days ago, we tested it on the development server. After successful testing and a few more bug fixes, it was pushed to production on 30th May, 2013 :) Some bugs might still be there and we are fixing them, but I am confident it will be more robust than ever before!

P.S. I am an undergraduate student at IIT Roorkee. You can find me @LalitKhattar or on HackerEarth.

Posted by Lalit Khattar, Summer Intern 2013 @HackerEarth

http://engineering.hackerearth.com/2013/05/31/the-robust-realtime-server

HackerEarth Technology Stack

Mar 20, 2013

Originally posted as Quora answer.

This might take a while, so go grab some popcorn. You are going to enjoy this :)

At HackerEarth, we deeply believe in open-source. Why not, our roots are in there. We use open-source software, hack it according to our needs and create something amazing out of that. At the same time, we don’t fear writing a seemingly complex project from scratch and turning it into a beautiful piece of code. And we have done that so many times in a very short span.

Our application backend is primarily in Python/Django. We modified django-allauth for some of our custom needs. For example, it allows you to log in on all sites - HackerEarth, MyCareerStack, and CodeTable - with the same login credentials. We modified django-threadedcomments for spam control (they can really eat you up inside!) and enhanced moderator permissions. We wrote our own generic newsfeed system from scratch, which can be plugged into any information schema to generate feeds. For example, the feed that you see on MyCareerStack and the recent submissions that you see on HackerEarth all come filtered from the same core engine. This feed engine will form one of the cores of the upcoming platform, and there is some interesting work being done there, like faster filtering with advanced algorithms, a relevant feed system, and other exciting stuff. We wrote a notification engine on top of the newsfeed system which generates notifications for you. These are part of MyCareerStack for now, but as we integrate everything, they will be at the core of the whole product. We also wrote a generic poll application which tracks all the actions of a user on any item. For example, upvote/downvote on a question, like/dislike on a tutorial, etc. are all powered and regulated by the same application. These user actions can be easily integrated with any object model, e.g. a programming problem, with small snippets of code. As this application develops, we will eventually make it open-source. We did some nice hacks with the avatar application to make images load faster from Amazon S3. Similarly, we messed around with apache-solr to customize some of the backends involved in searching.

And all this is just a fraction. Currently there are around 60 django applications written in a very generic and modular way which communicate with each other to power everything.

Wait, I haven’t told you about the amazing stuff yet. We wrote the code-checker from scratch, and it’s not the usual college project. It is known internally as the [CodeFactory Engine](http://engineering.hackerearth.com/2013/03/12/100000-strong/) and forms the core of HackerEarth. The core engine is written in C and the server is written in C++ using Apache Thrift. It’s a very robust client-server architecture with auto-scaling and auto-deployment which gives the result of each testcase in real time. We have built an API around it and we are going to build some amazing things on top of it. The [Vim plugin to compile/run code using API](http://engineering.hackerearth.com/2013/03/11/hackerearth-vim-plugin/) is just a start. We wrote a realtime server using node.js & nowjs which pushes data to your browser as it gets updated or changed live. The result you get in real time on submission of a program is powered by this server.

At the start of January 2013, we undertook one major task which was not strictly needed then, but we knew it would be essential very soon. We wanted to reduce the page load time using pipelining techniques, similar to BigPipe: Pipelining web pages for high performance. But there wasn’t anything out there which we could have directly integrated. This led to our native implementation of bigpipe, which is still at a very nascent stage. But it did wonders for page load times, reducing them by almost half. Another major contribution was memcached; we integrated it at view level and core level throughout the site. Most things are cached, and there are systems in place which invalidate and update them as the data changes. Remember the hit-miss algorithm from Comp. Science 101? We did that for webpages!
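
The hit-miss idea with invalidation can be sketched in a few lines of Python. This is only an illustration: a plain dictionary stands in for memcached, and `ViewCache`, `get_or_compute` and `invalidate` are hypothetical names rather than the code that runs on the site.

```python
class ViewCache(object):
    """Tiny cache-aside helper; a dict stands in for memcached."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached value, computing and storing it on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, e.g. after the underlying data changed."""
        self._store.pop(key, None)


cache = ViewCache()
render_calls = []


def render_leaderboard():
    # Pretend this is an expensive view; count how often it really runs.
    render_calls.append(1)
    return "<html>leaderboard v%d</html>" % len(render_calls)


first = cache.get_or_compute("leaderboard", render_leaderboard)   # miss
second = cache.get_or_compute("leaderboard", render_leaderboard)  # hit
cache.invalidate("leaderboard")                                   # data changed
third = cache.get_or_compute("leaderboard", render_leaderboard)   # miss again
print(cache.hits, cache.misses)
```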

One weekend in October 2012, we built CodeTable to test our CodeFactory server. But then we integrated collaborative coding into it and it went on to become a full-fledged product in itself. It’s widely used by people who want to hack on some code instantly and share it with others in real time, without bothering to install compilers/interpreters locally.

At the frontend, we write Stylus, JavaScript and jQuery. We have written custom jQuery plugins and JavaScript functions to make lots of frequently required tasks easy. For example, we wrote a generic plugin for lazy scrolling, and it powers all the pages with infinite scrolling like newsfeed, recent activity, user submission list, etc.

All the sites, everything, are part of one big project. This keeps us sane and allows us to have greater control. Having said that, as the code base grows over 100,000 lines of code and multiple servers are running, there are totally different sets of challenges. There are 5 different servers running right now - apache server, CodeFactory server, realtime server, collaboration server, and search server. The apache servers (webservers) and CodeFactory servers run on a few different EC2 instances at any given moment. We wrote custom auto-deployment scripts and builder & developer tools to make our lives easier and improve our own productivity. We have been using Git from day zero to manage all the source code and have written some nice hooks and wrappers on top of it to abstract a few tasks.

And we have done all this with just two of us working full time since October 2012, and that too with so many other things going on around us. I am still a college student! Two others joined to work remotely from January and their contribution has been significant too. And you might gather from all this that we don’t fear anything: we deploy fast, we fail fast and we are growing aggressively. We still work the way we used to work in college, enjoying life each and every day. We are expanding the team in summer and are going to release tons of cool products that you will love and programmers will love. But most importantly, they will disrupt the way tech recruitment is done in India.

And all this has not been for nothing. Our user base has increased significantly in the past few months. Many companies have become our customers. And we are working resolutely towards solving the problem we set out to solve, i.e. to connect smart programmers to awesome product companies coming out of India!

If all this sounds interesting and exciting to you, let’s have a chat. Email me at vivek@hackerearth.com, and I will be as enthusiastic to talk to you as you will be! You can also find me @vivekprakash.

Posted by Vivek Prakash, Co-founder - HackerEarth

http://engineering.hackerearth.com/2013/03/20/hackerearth-technology-stack

Analytics for Challenges

Mar 13, 2013

First view the awesome charts and graphs for the [HackerEarth Practice Challenge](http://www.hackerearth.com/hackerearth-practice-challenge/analytics/). I will explain its implementation details later in the post.

Why bother for Challenge Analytics?

We thrive on challenges, but a challenge is no fun without detailed analytics. In this number-driven world, analytics has evolved into a blanket term for a number of techniques that turn raw data into useful information.

In late February, I started working on implementing Analytics using the data collected during the challenges over time, and presenting it in an attractive manner. After a random wild-goose chase through various JavaScript charting libraries, I decided to stop at Google Charts Tools.

Why use Google Charts Tools?

There are a lot of reasons to choose Google Charts over the other charting libraries.

Free.
Lightweight and Reliable.
Healthy Documentation.

What you gain from Challenge Analytics?

The HackerEarth Challenge Analytics gives the following details:

Submission Count Analytics
The Submission Count Analytics displays a Line-Chart with the total number of submissions at a particular instance of time. This chart is updated at regular time intervals.
Language Analytics
The Language Analytics displays a Pie-Chart depicting the popularity of each programming language supported by HackerEarth during a challenge. Personally, this chart is my favorite as it helps me figure out the current trend of a particular language.
Submission Analytics
The Submission Analytics displays a Pie-Chart depicting the submission status namely
- AC - Accepted
- CE - Compilation Error
- TLE - Time Limit Exceeded
- MLE - Memory Limit Exceeded
- RE - Runtime Error
Pinhole Analytics
The Pinhole Analytics displays a Pie-Chart depicting the run-status for the solution against various test-cases.
Multilingual Users
A true programmer proves himself/herself by programming in various kinds of languages. This table is a leaderboard for them: the more languages you use in a challenge, the higher you get in the table.
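
The ranking behind the Multilingual Users table can be sketched as follows. This is an illustrative Python version, not the production query: the input format, the sample user names and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict


def multilingual_leaderboard(submissions):
    """Rank users by the number of distinct languages they used.

    submissions is an iterable of (user, language) pairs.
    Ties are broken alphabetically by user name (an assumption).
    """
    languages_by_user = defaultdict(set)
    for user, language in submissions:
        languages_by_user[user].add(language)
    rows = [(user, len(langs)) for user, langs in languages_by_user.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


submissions = [
    ("asha", "Python"), ("asha", "C"), ("asha", "Python"),
    ("ravi", "Java"), ("ravi", "C"), ("ravi", "Haskell"),
    ("meera", "C"),
]
leaderboard = multilingual_leaderboard(submissions)
print(leaderboard)
```

Repeated submissions in the same language count only once, because each user's languages are stored in a set.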

Implementing Challenge Analytics

The Challenge Analytics uses two of the Google Charts

Line-Chart
Pie-Chart

As mentioned earlier, Submission Count Analytics uses a Line Chart to depict the computed data.

The Line Chart(Google Chart Tools) is documented here.

Google Charts requires the JSAPI library. This library is loaded by:

    <script type="text/javascript" src="https://www.google.com/jsapi"></script>

The script given below loads the Google Visualization and the Chart libraries. This is also responsible for displaying the chart.

    <script type="text/javascript">
        google.load("visualization", "1",{ callback : drawChart, packages:["corechart"]});
        function drawChart() {
        var data = new google.visualization.DataTable();
        data.addColumn('string', 'Time');
        data.addColumn('number', 'Submissions');
        data.addColumn({type:'string',role:'tooltip'});
        data.addRows(  );

        var options = {
            'width': 850,
            'height': 500,
            'chartArea': {left:100, top:70},
            'pointSize': 4,
            hAxis: {
                title: 'Time',
                slantedText: true,
                slantedTextAngle: 20,
                textStyle: {fontSize: 10}
            },
            vAxis: {
                title: 'Submissions',
            }
        };
        var chart = new google.visualization.LineChart(document.getElementById('submission-count-chart'));
        chart.draw(data, options);
    }
    </script>
    
    <div id="submission-count-chart"></div>

The google.load package name is “corechart”

    google.load("visualization", "1", {packages: ["corechart"]});

The visualization’s class name is google.visualization.LineChart

    var visualization = new google.visualization.LineChart(container);

The drawChart() function creates a DataTable, which is populated with the computed data. The required columns are added, each specifying its data format.

    data.addColumn('number', 'Submissions');

For customizing the tooltip, an extra column was added and the ‘role’:’tooltip’ is specified.

    data.addColumn({type:'string',role:'tooltip'});
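
With these three columns, every row handed to addRows() needs three values: a time label, a submission count and a tooltip string. As a rough sketch of how such rows might be prepared on the server in Python (the function name, the sample data and the tooltip wording are all hypothetical):

```python
import json


def build_rows(points):
    """Turn (time label, count) pairs into [label, count, tooltip] rows."""
    rows = []
    for label, count in points:
        tooltip = "%s: %d submissions" % (label, count)
        rows.append([label, count, tooltip])
    return rows


rows = build_rows([("10:00", 12), ("10:30", 40)])
# JSON output can be embedded in the page and passed to data.addRows().
rows_json = json.dumps(rows)
print(rows_json)
```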

The option object is used for customizing the chart to be displayed. The customizable option for a Line-Chart is available here.

This instantiates the chart class you want to use, e.g. LineChart, PieChart etc. The Google chart constructors take a single parameter: a reference to the DOM element in which to draw the visualization. The options are passed later, to draw().

      var chart = new google.visualization.LineChart(document.getElementById('submission-count-chart'));

An HTML element with a matching id is needed to hold the chart.

    <div id="submission-count-chart"></div>

PieChart works on a similar principle, except for:

The visualization’s class name is google.visualization.PieChart

        var visualization = new google.visualization.PieChart(container);

The options for PieChart are documented here.

While implementing Google Charts, I faced a weird problem: when google.load() was called after the page had loaded, the page would go blank instantly. The problem was solved by using the callback parameter to google.load(). A nice blog-post can be read here.

For any queries or suggestions, you can shoot me a mail at sayan@hackerearth.com.

P.S. I am an undergraduate student at Dr. B.C.Roy Engineering College, Durgapur. You can also find me @chowdhury_sayan

Posted by Sayan Chowdhury, Intern @HackerEarth

http://engineering.hackerearth.com/2013/03/13/challenges-analytics

100,000 strong - CodeFactory server

Mar 12, 2013

The Inception

January 2012 was an idyllic time for us. Three of us had just teamed up to build something cool. There was no planning for the future, no sort of agreement - we were just three geeks sitting in a dorm room who wanted to build a product. We started working on MyCareerStack, which was supposed to have interview questions, tutorials etc. Soon, we realized that a code editor was needed, but that was easy. The harder part was that people don’t only want to write code online, they want to run it. Now that I had never done before!

What resulted then was a hacky few hundred lines of code in C, which could compile and run code in just C, C++ and Java. It might have been even the worst piece of code I had ever written, but it worked and it was sweet!

	/* create FIFO to be used here & later for IPC. */
	char fifo_1[NAME_MAX];
	createFilePath(fifo_1, id, dir, FIFO_FILE_1);

	umask(0);		/* reset file creation mask. */
	
	int readfd_1, writefd_1, readfd_2;
 
	if (mkfifo(fifo_1, S_IRWXU | S_IRWXG | S_IRWXO) < 0)
		err(1, NULL);

	/* Fork a child to exec the process: {id}_a.out */
	pid_t pid;
	int status;

	signal(SIGALRM, handle_tle);

	/* change current working directory to mycareerstack */
	chdir(RUNTIME_DIR);

Wait, you haven’t seen the ugly part yet! This will make you cringe, it made me too. See what the hell I am doing with re-tries, but it was required!

    int tries = 0;

    /* try the exec 100 times. */
    while(tries < 100) {
        /* run the program. */
        if(execl(executable, executable, (char*) 0) < 0) {
            //
        }
        ++tries;
    }

    printf("Couldn't run the program!");
    _exit(0);

There were some serious reasons why this hack was done. We were hosted on a shared webfaction server, with no root access. This meant I couldn’t drop the user process privileges to any user with limited access permissions. I needed to create another user with limited access which could be controlled by the original user. And again, I didn’t know of any way to tell whether a process was already running with privileges dropped to the limited user. And that explains the number of retries.

By now, this hacky code-checker had executed over 10,000 code submissions. This was a big achievement in itself.

Apache Thrift

Around August, I came to realize that this was going to be one of the core parts of our future product - HackerEarth. I researched a little about existing frameworks which allow cross-language services development and help in easily building a distributed system. Apache Thrift turned out to be an obvious choice. This was the first commit that I did; it had stubs and other files, but it laid the foundation for one of the most robust engines.

    class CodeCheckerHandler : virtual public CodeCheckerIf {
     public:
      CodeCheckerHandler() {
        // Your initialization goes here
      }

      void ping() {
        // Your implementation goes here
        printf("cpp-server got pinged...\n");
      }

      void run_code(RunResult& _return, const CodeInfo& code) {
        // Your implementation goes here
        printf("run_code\n");
      }

    };

    int main(int argc, char **argv) {
      int port = 9090;
      shared_ptr<CodeCheckerHandler> handler(new CodeCheckerHandler());
      shared_ptr<TProcessor> processor(new CodeCheckerProcessor(handler));
      shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
      shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
      shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());

      TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
      server.serve();
      return 0;
    }

What came next was the implementation of the whole client-server architecture using Thrift, which came to be known as the CodeFactory server. I rewrote the whole evaluation part in C and added many languages to it like Perl, PHP, Python, Ruby, Haskell, etc.; this came to be known as the CodeFactory engine. But this was not the end.

The MVP - CodeTable

To test the CodeFactory server, it was pointless to wait for HackerEarth to be launched. So over a weekend, I built CodeTable and put the server out to be tested by real humans in the real world. I even posted on HackerNews, and it brought more traffic than we were ready for. And boy, you can’t comprehend the bugs that will come knocking at your door. This was a very good example of an MVP, which allowed us to fix the bugs in no time.

There was still a serious issue left to resolve. The architecture at the time allowed the Python client to send a request to the CodeFactory server, but it also left the Python client hanging while it waited for the response. A large number of submissions meant more and more Python clients waiting for responses, which dramatically increased the memory consumption of the machine. After a while, the machine could even stop responding. But we had not hit that phase yet, and there was not much point in building a complex system if it was not going to be used.

The Asynchronous System

But since December 2012, traffic had started increasing significantly. We could now see a higher and higher failure rate in submissions. It was now absolutely mandatory to resolve it. This resulted in the implementation of a message queue and a redefinition of the whole architecture.

    def _call(self, message):
        self.channel.basic_publish(exchange='',
                                   routing_key=self.routing_key,
                                   properties=pika.BasicProperties(
                                       # Make message persistent
                                       delivery_mode = 2,
                                       content_type = 'application/json',
                                       ),
                                   body=message)

        if DEBUG:
            print '[%s] Sent %r' % (self.routing_key, message,)
        self.connection.close()

The new architecture is completely asynchronous, which means that all the testcases for a submission are evaluated asynchronously, and all of them are evaluated even if you exceed the time limit or memory limit on a few of them. This was also a feature request, as it helps users learn more about their code execution. But hold on, pause a moment, and think about an asynchronous system. It brings with it a totally new set of challenges. In layman’s terms, you need to define and detect when your operation is over. And I needed to detect it as soon as possible and send the submission result to the user.
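To illustrate that detection step, here is a minimal Python 3 sketch. It is not the actual CodeFactory code: the SubmissionTracker class, its method names and the status strings are all assumptions made up for this example. The idea is simply to count the results as they arrive, in any order, and fire a callback exactly once when the last one is in.

```python
import threading


class SubmissionTracker:
    """Hypothetical helper: collects async testcase results for one submission."""

    def __init__(self, submission_id, testcase_count, on_complete):
        self.submission_id = submission_id
        self.expected = testcase_count
        self.on_complete = on_complete
        self.results = {}
        self._done = False
        self._lock = threading.Lock()

    def record(self, testcase_id, status):
        """Store one result. Returns True only for the call that completes the set."""
        with self._lock:
            self.results[testcase_id] = status
            if self._done or len(self.results) < self.expected:
                return False
            self._done = True
            snapshot = dict(self.results)
        # Run the callback outside the lock so a slow callback cannot block workers.
        self.on_complete(self.submission_id, snapshot)
        return True


if __name__ == "__main__":
    def notify(submission_id, results):
        print("submission %s finished: %r" % (submission_id, results))

    tracker = SubmissionTracker("sub-1", 3, notify)
    tracker.record("t2", "TLE")  # a failed testcase does not stop the others
    tracker.record("t1", "AC")
    tracker.record("t3", "AC")   # the third result triggers notify
```

The lock matters because results may be reported by several workers at once, and the done flag guards against a duplicate delivery firing the callback twice.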

The Realtime Server

I could have done it using one of the easiest but nastiest ways - polling. But such decisions affect the product in subtle ways. I decided to implement a generic push server instead, which came to be known as the realtime product. nowjs appeared to be a good choice. I wrote the first few hundred lines of the node.js server with it; the server could maintain the client information, receive notifications from the message queue itself, and deliver them to the client’s browser.

    // Receive messages
    q.subscribe(function (message, headers, deliveryInfo) {
        message = JSON.stringify(message);
        var data = JSON.parse(message);
        if (DEBUG) {
            console.log('Received message for ' + data.name);
        } 
        sendMessage(data.name, data.message, data.html_id);
    });

It’s this little piece of code, coupled with other utility code, that updates your browser with the result of each testcase in realtime. It made the evaluation process more engaging for users, and reduced the first response time by orders of magnitude. The realtime server is still at a very nascent stage, obviously with some bugs, and it even stops working sometimes. But I am sure it will become the backbone of all our live push systems and many more in the upcoming platform. We are working hard to make your experience as seamless and simple as possible.
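To make the routing step concrete, here is a minimal Python 3 sketch of what a function like sendMessage has to do behind the scenes. It is not the nowjs code from our server: PushRegistry, connect, disconnect and deliver are hypothetical names, and the only thing taken from the snippet above is the message shape (name, message, html_id).

```python
import json
from collections import defaultdict


class PushRegistry:
    """Hypothetical in-memory map from client name to its open connections."""

    def __init__(self):
        self._clients = defaultdict(list)

    def connect(self, name, send):
        # `send` is any callable that pushes a payload to one browser connection.
        self._clients[name].append(send)

    def disconnect(self, name, send):
        callbacks = self._clients.get(name, [])
        if send in callbacks:
            callbacks.remove(send)
        if not callbacks:
            self._clients.pop(name, None)

    def deliver(self, raw_message):
        """Route one queue message to every connection of the named client."""
        data = json.loads(raw_message)
        payload = {"message": data["message"], "html_id": data["html_id"]}
        delivered = 0
        for send in list(self._clients.get(data["name"], [])):
            send(payload)
            delivered += 1
        return delivered


if __name__ == "__main__":
    registry = PushRegistry()
    registry.connect("alice", print)
    registry.deliver(json.dumps(
        {"name": "alice", "message": "testcase 1: AC", "html_id": "tc-1"}))
```

Compared with polling, the browser never has to ask whether a result is ready; the server pushes it as soon as the queue delivers the message.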

Just a few days ago, the CodeFactory server processed its 100,000th run request, and an almost equal number of compile requests. The stats are growing exponentially, and they are pretty interesting, with submissions in all the languages supported on the platform. This is really exciting. We also released our API last month. It’s still in the alpha stage and we are working on making it more robust and useful. Feel free to use the API to build your own codechef/spoj/topcoder :)

P.S. I am the co-founder of HackerEarth. Reach out to me at vivek@hackerearth.com with any suggestions, bugs, or even if you just want to chat! Follow me @vivekprakash

Posted by Vivek Prakash

http://engineering.hackerearth.com/2013/03/12/100000-strong

Vim plugin to compile/run code using API

Mar 11, 2013

tl;dr

Check out https://github.com/HackerEarth/hackerearth.vim

I love the Vim editor. And the idea of a plugin to compile and run code from my favorite code editor sounded exciting. The HackerEarth API made it easier.

Oh man - this is going to be the coolest thing I have ever built. Whoa, wait a minute. Aren’t you forgetting something? You have never written a plugin before.

There always comes a time when you have to do something for the first time! ;)

Where to begin?

Google is the answer :) After exploring and reading some tutorials, I found out that VimL, also known as Vim script, is the language used to build Vim plugins. Here’s the best part: you need very little knowledge of VimL to write plugins if you know Python (or Ruby). I chose Python.

Why Python?

Because using Python gives so much flexibility. Think about using urllib/httplib/json/vim to access a web service that helps with editing in Vim. This is why most plugins that work with web services are written in VimL+Python. Also, I am learning Python these days, which gave me one more solid reason to use it.

Let’s write a vim plugin!

Generally, Vim plugins start with a few checks. In this case, the plugin checks for Python support in Vim and also looks for the HackerEarth API client key. Obviously, if there is an error, the plugin should not be loaded.

    " check for python
    if !has('python')
        echoerr "HackerEarth: Plugin needs to be compiled with python support."
        finish
    endif

    " check for client key
    if exists("g:HackerEarthApiClientKey")
        let s:client_key = g:HackerEarthApiClientKey
    else
        echoerr "You need to set client key in your vimrc file"
        finish
    endif

As said earlier, Python can easily be integrated with VimL. Python code inside VimFunction starts after python << EOF and ends before EOF. Here is an example of how to do it.

    function! s:VimFunction()
    python << EOF
    # write python code here
    import os
    print "Hello World!"
    EOF
    endfunction

OOPS!

I have written most of the plugin code in Python, and I love Object Oriented Programming (OOP), so I have used it here as well.

HackerEarth.vim python code contains three main classes: Api, Argument and VimInterface.

The Api class handles all the HackerEarth API related work, like calling the web service, setting the POST data and handling the JSON response. The Argument class is responsible for evaluating command line arguments, and it also decides the default argument values. VimInterface is an interesting piece of code. It basically loads a buffer, appends the output and saves it, if the user wishes to do so.

You may be wondering when all these classes get instantiated. That happens inside the s:HackerEarth function. To call the s:HackerEarth function, some commands are defined.

    function! s:HackerEarth(action, ...)
    python << EOF
    action = vim.eval("a:action")
    argslist = vim.eval("a:000")
    args = None if(not argslist) else argslist[0]
    arg = Argument.evalargs(args)
    arg.setaction(action)
    if(not arg.Help):
        api = Api(arg)
        api.call()
    else:
        vim.command("Hhelp")
    EOF
    endfunction

    command! -nargs=? -complete=file Hrun :call <SID>HackerEarth("run", <f-args>)
    command! -nargs=? -complete=file Hcompile :call <SID>HackerEarth("compile", <f-args>)
    command! -nargs=0 Hhelp :call <SID>Hhelp()
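To show the dispatch logic without Vim in the way, here is a small standalone Python 3 sketch. The plugin's real Argument class lives in the repository linked above and may look quite different: the -h/--help flags, the InputFile attribute and the dispatch helper are assumptions invented for this example. Only evalargs, setaction and Help come from the function above.

```python
class Argument:
    """Hypothetical stand-in for the plugin's Argument class."""

    def __init__(self, input_file=None, show_help=False):
        self.InputFile = input_file  # assumed attribute, not from the plugin
        self.Help = show_help
        self.action = None

    @classmethod
    def evalargs(cls, args):
        """Turn the optional command argument into an Argument with defaults."""
        if args is None:
            return cls()
        args = args.strip()
        if args in ("-h", "--help"):  # assumed flags
            return cls(show_help=True)
        return cls(input_file=args or None)

    def setaction(self, action):
        if action not in ("run", "compile"):
            raise ValueError("unknown action: %r" % (action,))
        self.action = action


def dispatch(action, argslist):
    """Mirror the branch in s:HackerEarth: show help or call the API."""
    args = argslist[0] if argslist else None
    arg = Argument.evalargs(args)
    arg.setaction(action)
    return "help" if arg.Help else "call-api"


if __name__ == "__main__":
    print(dispatch("run", []))
    print(dispatch("compile", ["-h"]))
```

Keeping the argument handling in its own class means the same parsing and defaults serve both Hrun and Hcompile.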

It’s almost finished!

The only part left is to map the above commands to keyboard shortcuts. Ok, let’s do it.

    map <C-h>r :Hrun<CR>
    map <C-h>c :Hcompile<CR>
    map <C-h>h :Hhelp<CR>

What’s next?

Check out https://github.com/HackerEarth/hackerearth.vim. Try a real run on your machine!

If you want to improve or fix anything, just do it and send us a pull request. Or send the diff to lalit@hackerearth.com. Feel free to report any issues or contribute to the GitHub repository.

P.S. I am an undergraduate student at IIT Roorkee, and I will be joining the folks at HackerEarth in the summer for a two-month internship. Follow me @LalitKhattar

Posted by Lalit Khattar, Summer Intern 2013

http://engineering.hackerearth.com/2013/03/11/hackerearth-vim-plugin

https://engineering.hackerearth.com/rss

Posts