Curtis Miller's Personal Website

ntguardian Oct 5, 2020

I describe my interview process.

Show full content

I wrote about my start of a job search two months ago, and… well, I have a job!

In fact, for me, finding a job was not that hard. I think I applied to around 50 jobs, more or less (but if I had to guess, I’d say less), and had two interviews. Considering the existence of posts like this on Reddit (also seen below) and the fact that it only took me a month and a half to get an offer for a well-paying (six figures!) position with great benefits (and I have zero years experience!), I feel like I have a good reason to be proud of myself. I guess having (or will have in December) a PhD in mathematical statistics paid off.

Image by u/Majikthese

For the sake of my own privacy, I will not disclose my new employer (in fact, despite this website being a place to showcase myself, I will be taking down my resume, too). What I will say is that I will primarily be working as a statistician and statistical consultant, and a part of what I will be doing is test science and experimental design, perhaps some research too. At least, that’s what I’ve inferred so far.

This, by the way, is a new field for me; I have studied and worked with econometrics during my studies and previous internships, so experimental design is new to me. Consulting is new to me, as well. I’ve been reading a lot before my start date to learn about these new roles for me, particularly the books Statistical Consulting by Cabrera and McDougall and Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building by Box, Hunter, and Hunter. I may share some further details of my studies (such as a side project to learn about factorial designs) in future blog posts.

I’m honestly nervous about this job (which I will start in a few weeks). I don’t think I deceived my interviewers nor that they made a mistake in choosing me (especially after that interview process; more on that later). Perhaps it’s a combination of imposter syndrome and just the butterflies that come with a significant life change such as this. Naturally, I want to do an excellent job with my new and exciting employer, and I don’t want to disappoint. Hopefully, the butterflies will go away soon.

That all said, people read my blog posts to obtain something for themselves, some insight or new knowledge. So I’ll describe my job search and interview experience.

The Job Search

I may have already described this part in my previous post, and I’ll just summarize the search experience. I decided early on that I was not going to stay in Salt Lake City or anywhere in Utah, for personal reasons. So I used several websites to find positions to apply for.

I found Glassdoor the most useful in the search, and it was the website where I found the job I got. Other websites I tried included Monster.com, LinkedIn, TripleByte, ZipRecruiter, and USAJobs for US government jobs. USAJobs was also a great site for jobs, with a uniform system for describing the jobs and applying for them; unfortunately, getting a government job can be hard, and if that’s not what you want, this site is not for you. TripleByte seemed like it was not made for the type of job I wanted. ZipRecruiter didn’t seem to do anything for me as a job seeker (though I hear ads for them a lot in podcasts targeting employers). Monster was laughable, giving me updates for janitorial jobs in NYC that I clearly didn’t want. Glassdoor had great data on employers, including reviews by employees and applicants, salary data, and the "easy apply" button. I did also apply to some companies I was interested in directly. My employer was not one of these companies.

I decided I wanted to live in a major coastal city like Seattle, San Francisco, Portland, New York City, Boston, or Washington (Denver also being on that list, being a large enough city in a state resembling Salt Lake City but without the rest of Utah holding it down). I looked for jobs in these areas with salaries over $100,000 and at companies with at least a four-star rating by employees. The job description also had to resemble what I wanted to do, of course, and list the languages I am familiar with (R and Python). I avoided jobs that looked too much like programming; you may be surprised, but while I’m willing to program, I don’t want to just write code, with that being my primary task. When my interviewers for my current job said, "You won’t be writing as much code here as you did as a graduate student," I said, "That’s great! I’m willing to write code, but I want to write less."

That, by the way, is a tight search for someone with zero years of experience. But I have (or almost have) a PhD in mathematical statistics, which… helps.

In the end, the city I got was Washington, DC. I’ve lived in Washington before, as an intern at Brownstein Hyatt Farber Schreck, a law and lobbying firm (oh, man, I can’t wait to give some awful task I don’t want to do to some poor intern). I liked the city, and I’m a political person, so I’m quite excited to return. When I lived there as an intern, I feel I did not get out enough; sure, I went to the museums and saw monuments, but instead of joining in on nightlife, I played Civilization V with my roommate (it was fun, but also wasting the city). When I move there early in 2021, I plan to take more advantage of the city. Having the money to do so also helps.

The Interview

The job I got involved an initial interview that lasted an hour; this was the interview prompting my earlier blog post. The interview was conducted via Zoom, and Zoom interview etiquette is still… being defined (do you wear a tie? should you wear pants even though no one can see them? my answers were "yes" and "no," respectively). But the interview was fine and pleasant, and the people I spoke with sold me on the organization as much as I sold them the idea that I was the right candidate.

The first part of the interview I remember best was the statistician’s equivalent of the "whiteboard" I hear programmers gripe about for programming interviews. I was given a scenario that came in three parts. The first part was a suggestion of a research issue a client was confronting and proposed trials. The interviewers wanted me to talk about the statistical issues potentially involved in the study and how they could be overcome. I was allowed to ask questions; in fact, in learning about statistical consulting, the consultant should ask a lot of questions (watch the YouTube video below to get some ideas for why), and that is what I did. There were no "tricks" per se; the interviewers said what they wanted to see was my thought process. The second part was, "You made your recommendations to the client, and they didn’t listen to you. This is the data set they obtained. [shows data set in Excel spreadsheet] How would you analyze this data?" A big part of this part, one should note, was identifying the "response" and "explanatory" variables that would be relevant to addressing the issue at hand. And finally, there was a scenario where the results would need to be presented to an end stakeholder, and I had to provide a way to explain my results to someone who could be presumed educated and intelligent but was not a statistician. This was the place where interpreting regression models and statistical significance for others mattered.

Other questions were at a more personal level, such as why I liked the job, why I may be good at it, and so on. I don’t have much to say about these questions; they’re standard. But all told, the interview was pleasant, and I felt like it went well.

It did go well. A week after, I was asked to do a follow-up interview. This interview was one where I usually would be flown from my home to DC to interview in person, with them footing the bill. Unfortunately, due to COVID-19 concerns, this was not going to happen (my job is remote now, and they’re reluctant to reopen their offices until the pandemic is under control). That said, perhaps this was a blessing. This second-round interview was a full-day, eight-hour interview, meeting several people at the organization. From looking at Glassdoor reviews, this part may also have other applicants too, and I would have met them. I was already stressing somewhat about my interview, and I think that seeing other applicants would have made me feel even more nervous and pressured, as I could be looking at my competition. Since this was done via Zoom, I could pretend that I was the only applicant for the job. That’s much less pressure and allowed me to act more naturally.

Before this interview, the interviewers requested letters from three references, and I recommended my advisor, a co-author, and my former overseer at Voices for Utah Children. The interviewers sent to my references a list of topics they wanted the letter authors to touch on in their letters. My advisor said that these topics, while likely asking something about my technical abilities, mostly concerned my ability to explain statistics to non-statisticians. They also wanted writing samples.

I had to prepare an hour-long presentation on a technical topic, but for a general, educated audience, and I had to ensure there would be time for questions. I spent the week before preparing the interview. I recorded rehearsal videos (which I plan to release eventually) and had my cousin, a PhD student in accounting, watch them; he was my proxy for the kind of audience I needed to address, as he’s well-educated and generally knowledgeable but not a professional statistician and not familiar with the area of the talk.

During the interview, I met with more individuals at the organization. Some just wanted to talk and answer some questions I may have. Some were more direct in their interview questions. I was once asked, "If you had to pick one statistical method to use, what would it be?" (linear regression). I then had to explain what linear regression is and how to do it properly to a hypothetical researcher. Another asked how I would persuade a client who did not want to do any further research (yes, that’s something the job needs to deal with) to do the recommended procedures and not just rely on what they think they already know. There were other statistical scenarios resembling the scenario from the first interview I was presented with and needed to discuss. There were more general conversations about the work environment and what was expected of the task. After my lunch, I gave my presentation, which went well. After two further interviews (one with HR, so less an interview and more discussing conventional employment topics such as raises and benefits), I was done… and exhausted, but not worn down.

Other Interview

I did have another interview with a different company for a role related to financial oversight, and that would work closely with the US SEC. This interview did involve a scenario, that being how I would design an application that would get data for an SEC commissioner that would get her essential information and how I would explain its use. This job seemed more interested in my technical abilities as a programmer and demanded less of my statistical skills. It was a strange interview; I thought it went okay, but not great, yet the interviewer wanted to consider me for other positions at the company. I did agree and was eventually offered nothing. However, I’m not too upset.

A funny story from that experience; I was asked how much I would be expecting to make for the job. I was initially planning on this job being in DC, but it was moved to New York. I listed the lower bound of the job I had applied for in DC as an expected salary, $120,000, to which the interviewer said, "Yeah, that sounds fair, or it could be $130,000 or $140,000." I replied, "Oh, yeah, this is in New York instead of DC, so $130,000 sounds good!"

Here I am bidding down my own salaries… (And that’s just one of my two "screw-ups" in that interview.)

Suggestions to Others

Nearly a month after my long interview, I was offered a job, and I accepted. Here I am now, about to start in a couple weeks, and with my academic career looking to be behind me. I had a goal of no longer looking into the future for the next job. I didn’t want to take a position, like an internship, that was only meant to prepare me for the final job and the start of my career. I succeeded. I don’t know how long I will be at this new job, but I have no immediate plans to leave, nor an expectation that I will leave; I could work in this position for the next 35 years and, assuming I was getting good raises, be just fine.

That feels so good.

With that said, my experience did provide me some insight into how my entire education helped me get this job. By that, I also include my high school debate team and literary magazine, along with my tutoring, my internship at Voices for Utah Children, and my experience as an instructor.

My job wanted a statistician but they were more willing to take for granted that I knew what I needed to know or could learn it. I told them that there were some subjects that I was not familiar with that they needed for this job. They said, "That’s common for us, and many of our new hires don’t know test science specifically; we’re hiring you for your brainpower, not for what specifically you know." What they tested me hard on was my communication abilities, especially when dealing with people who are professionals but not statisticians.

So I guess I would recommend to anyone wanting a well-paying statistics job, practice writing, speaking, and explanation. Math and coding only get you so far. Communication skills get you good jobs.

Will the Blog Go Away?

Well, now that I’m leaving academia, I feel less guilty about blogging. See, blogging felt like a distraction from writing papers and doing research. Now that I’m working a more conventional job (but with much of what comes with academia, like reading papers and attending conferences, and the occasional publication), I feel less like blogging is a poor use of my time.

I don’t know how much I will get into blogging again. As you can tell, my blog posts are long articles, which is never intentional but is the inevitable end when I start to write. That said, I do think I can write more now about personal projects and interests.

So keep an eye out. There may be more coming.

And finally, congratulations to me!

http://ntguardian.wordpress.com/?p=3790

Extensions

Into the Job Market I Go! The Search for a Statistical Job

ntguardian Aug 4, 2020

Show full content

It’s been a long time since I wrote a blog post. Actually, this is my 100th post, according to my metrics. I wanted to write a fancy article about my history of blogging, how I started doing this because I simply couldn’t justify not blogging anymore at the time, how my articles on stock market data went surprisingly viral and caused me to get far more daily views than I ever expected and even landed me book deals, but then I was never able to get hits like those again. I wanted a deep discussion of my blogging history and what I really like writing about.

This isn’t that article. That article is never coming. Oh, well. Maybe for my 1000th post. But for now, holding off writing until I can write that article means I’m just not writing blog posts, and I don’t like that.

So today, I’m writing an article that’s mostly for me, about my sudden leap into the job market.

Since my last blog post the world collapsed into the COVID-19 apocalypse. In 2018 I wrote two blog articles, one suggesting the US was due for a recession before the 2020 election and the other attempting to justify the claim using what I’m calling methods of a “mad statistician.” My forecast stumbled into truth; GDP dropped at an annualized rate of 32.9%, the worst drop ever, and the economy is in a recession. (Seriously, look at the chart in the article; it’s insane, and that drop will likely complicate econometric models for decades to come.) Life for me is not too bad; I’m teaching at home and am still being paid. Meeting the needs of my students during this crisis pushed more work onto me, and I recorded video lectures so I could flip the classroom. (They are available on YouTube, and are 48 hours of videos on statistics and R programming; I should write a second post about these.) The experience has been good; I actually liked the flipped classroom format as it made interacting with students and figuring out where there heads are at easier. I liked engaging with them. But my research also had to take a back seat for weeks, setting me back.

The virus has not materially affected me negatively yet. (Personally and emotionally I’m not doing as well; I moved to Salt Lake City in December partly with the hope to experience city life and meet new people, and then the virus not only prevented that but took me away from the friends I had, and now most days I’m just alone in my apartment without even a cat to cuddle) But it may have irreparably altered my career path. This is supposed to be my last year as a graduate student, the year I start searching for jobs. I was planning on going to statistics conferences in hopes of connecting with academics to get an academic job; I was going to JSM this year. When JSM went online, I cancelled; the point was to meet new people, and I’m meeting no one in a virtual poster session! (Why didn’t the ASA just cancel JSM? The virtual conference is silly.) When talking about what to do about jobs with my advisor, he first recommended I stay on another year. The academic job market was gutted by the virus, with no universities hiring and old faculty choosing not to retire. Better to stay on another year to ride out the bad market.

A couple weeks ago, when I was e-mailing him again, he said that this glut was not going to clear up in a year, and the market would be awful for years to come. He basically recommended I look for jobs in other sectors of the economy.

I’m a statistics graduate student because I love mathematical statistics, enough to devote my life to it. I went to graduate school because it felt like the right thing to do for me. Additionally, the fact that the non-academic job market for individuals with PhD’s in statistics is also good helped justify the economics of my choice. I like the idea of being a tenured faculty member advancing the field of statistics, studying and inventing procedures and publishing packages and papers, and not having to ever worry about a dress code. But I also knew that academic jobs can be hard to get. I’ve heard horror stories of post docs and adjunct faculty (a recent one being the story of Thea Hunter, published in The Atlantic), and want no part of it. I promised myself years ago not to harm myself just to try and stay in academia; it’s not worth it. And besides, “industry” (which for me just means everything but academia) pays better and is much more flexible if only because jobs are far more plentiful.

So goodbye academia. Industry, here I come!

I’m not looking for internships. I did two internships when I was an undergraduate, and I have been a grad student for years. Honestly, I’m tired of “someday” having a real job. I want a real, well paying job, now. I have been preparing for having a paying job since I was seven. I’ve been in college since 2010. That’s 10 years of college, or 22 years of education overall. I’m getting a PhD in statistics; my advisor says I have materials to complete a thesis, so I’m all-but-dissertation at this point, and by May of next year I’ll be just done, with dissertation. So no more preparation. I’m done preparing. I want in. I do want to squeeze out one more paper (two more are in the submission process currently) on functional data analysis, but yeah. No more. Need job now.

I started applying last week. I redid my website, in particular my résumé and my professional information section, as well as the welcome page. I used LaTeX for my resume, and I hope it’s not too flashy. I also hope it won’t be a problem it was created with LaTeX; I heard about some employers using automated systems to automatically reject résumés and one way to get rejected is if the system can’t read the résumé well; I wonder if I already got a rejection for that reason alone, since the rejection seemed to come very fast, only hours after the application.

I don’t know how many jobs I’ve applied to; let’s say around 20. Clearly I’m early in the process, and honestly, I’m nervous. I feel I have good reason to be. I’m almost 29 and leaving a 22-year journey for good. And not only that, I’m rather committed to leaving my home state of Utah, where I grew up. I just don’t feel I fit in here. I want to work in a big coastal city and experience the urban lifestyle. I lie awake at night thinking about what my life in six months will look like, where I will be, whether I will be happy, and how I can stay connected with my family, who are not going anywhere and will not be able to afford buying plane tickets to visit me. And meanwhile there are the jobs themselves. It felt really good to describe my education level as “Doctoral” on Monster.com, but that feeling was fleeting; now I get to read employer expectations and ask myself whether they’re matching me. There’s always something in the job posting that means I’m not a perfect match. Does that mean I don’t stand a chance, PhD or not? The last time I was applying for jobs like this, I was a senior in high school or just graduated, right when the 2008 financial crisis ended and the recession began. I got rejected for a job to be a cashier at the local grocery store and Sears. A cashier! Why was I not qualified to be a cashier? I’m in a much different place in life now and far more skilled, but I still remember that first job search, and it was not fun.

So yes, I’m anxious. I even posted on Reddit about my anxieties. But the Reddit comments I found reassuring. It doesn’t matter too much that I don’t know SAS, for example; I probably don’t want to work in a place that’s using SAS or SPSS. In any case, thanks Reddit people. You’re awesome.

So far in my search, while few have responded to my applications, I got four rejections and one interview. The rest have yet to respond (if they ever do). The interview was Friday of last week. And that’s largely why I started writing this blog post.

It’s been years since I did an interview, and this is my first interview for a real data job. I don’t know how it’s going to go. I don’t know what questions will be asked of me. The best I can go for is a Google search on “data job interview questions”. These articles have listed out some potential questions an interviewer may ask. This blog post is an excuse for me to practice answering those questions.

Just for fun, here are my answers. These are coming from the following websites: [1] and [2].

What do data analysts do?

In my view, data analysts use data to learn about phenomena. Data can be used for learning facts, such as what the typical pay of a person with a certain job title is, or what the typical trajectory of a rocket may be, or whether one drug or another is better at reducing blood pressure. Data can also be used for prediction, such as whether someone will develop high blood pressure based on their lifestyle, or what the trajectory of a rocket will be. Sometimes there is a clear question that needs to be answered and the data analyst will help refine the question into something that data can answer, establishing clear hypotheses and designs for studies and data sets. Sometimes there is not a clear question and the data analyst explores data sets to discover patterns that could lead to new hypotheses or beliefs of practitioners. The data analyst not only works with the data directly but also helps his clients both refine their questions and beliefs prior to analysis and effectively communicates conclusions after analysis to an audience that likely does not fully understand the methodology he used, but needs the insight he found.

Why do you want to be a data analyst?

When I was in high school I was on the debate team and the literary magazine. I took math classes but didn’t care for them. I think high-school me would be shocked that these are the types of jobs I’m applying for. But as I went through college I discovered I not only had mathematical talent but I was often just as interested in how new discoveries were made as much as I was interested in the end result. I enjoy the process of statistical analysis. I feel satisfaction when statistical software produces the final product I was seeking. Nothing has engaged me as much as statistics, so I cannot imagine myself not being a data analyst.

Please talk about a time when you could not meet a deadline.

Sometimes statistical analysis is straight forward but for the most part statistics is a part of research in one form or another, and research, in my experience, is hard to fix to a timeline. By its nature, it forges into uncharted territory, and thus delays can easily happen. When I was an intern with Voices for Utah Children, we would often make projections for when we thought our projects would be done, only to find those projections too optimistic. This was generally happening when we thought we were nearly done, only to find that we were not. We simply were honest about the state of the project, gave reasons for the delay, and did the best we could, but delayed.

For one of my recent papers I was finding myself being pressed against the submission deadline. The paper was simulation intensive and the simulations took days to complete even running in parallel on a supercomputer, and the simulations could not be sped up. I initially tried to scale back my co-authors’ expectations in what could be done within a reasonable time frame, as I was the only person working with code. When we found that some code was incorrect and some simulations were thus corrupt and unusable, we abandoned those simulation contexts to meet the deadline, deciding that they were slightly redundant anyway. And I burned the midnight oil to do what I could to get work done in a timely manner. In the end, we had a product that could be submitted and did address referee concerns, and we just barely made the deadline.

Try to be conservative with time estimates, scale back expectations on what can be done within a time frame, burn the midnight oil when there are things you can do to meet the deadline, and you will be pleasantly surprised when you’re ahead of schedule.

Which area would you prefer to work in and why?

I studied statistical methods that are generally described as econometrics during my career as a PhD student. This area aligned best with my affinity for the social sciences, economics in particular (I have a bachelor’s degree in economics). I also studied time series analysis and models and have been curious about financial data sets for years; my most popular blog posts are about stock data, after all. I am interested in money, like most people, and data dealing directly with money and its behavior intrigues me, hence my interest in time series. But I can see myself working in many areas, as each present their own challenges. For instance, I care about helping the planet and have started exploring climate data, using functional models (models that deal with data that is understood to be a function over a dense, continuous domain) to explore patterns there.

Which data analysis software are you well-versed in?

I understand R well. I have one R package on CRAN. I have read Hadley Wickham’s books and understand how R views the world, being comfortable with R’s functional aspects and unique approach to OOP. I’m also comfortable using Python and have no problem using Python packages like NumPy, Pandas, and Scikit-Learn.

While I don’t have a computer science background I care greatly about the coding process and writing robust code. This is a skill I had to teach myself through the years as a graduate student. I have personally adopted the organizational paradigm of viewing research projects as package development, since packages provide key tools, such as unit testing protocols and documentation standards, necessary to keeping projects well organized and robust to inevitable changes. Packages also encourage good division of code into digestible functions that help keep the development process flexible.

Because I greatly value flexibility, I generally dislike tools such as R markdown and Jupyter notebooks. I have no problem using these tools to produce what amounts to some final product, like a paper, but they should not be used in intermediate steps or even beginning steps as they are too rigid. Instead I like to write executable scripts with narrow and specific jobs that can then be fit together. I try to emulate the Unix philosophy of keeping programs small and well designed for one job so that they can be recombined as needed as so that changing the pipeline is not too difficult. Not surprisingly I drifted from using Windows to doing my serious work in (Arch) Linux because that environment is ideal for software development. Tools such as make and git are instrumental in my process, helping to keep things flexible. I also like to do a lot of work with shell scripts, as they’re easy to write and are key to automation. (Read my thoughts on organizing research projects here.)

I prefer to use R and Python for high-level work. When performance bottlenecks emerged in my work, I have used C++, via Rcpp, to clear the performance blockage, as the C++ data structures and general approach (such as the use of pointers to help keep down unnecessary copying of data) produce much faster software. I am not a C++ expert but I can use it well enough to write performance-critical code and thus make the higher-level R code fast.

In the past I have used Stata for econometric work that the R packages did not do well. I also learned MATLAB so that I could review a co-author’s code and identify why it was not working correctly. I don’t find either of these languages particularly difficult, at least at a surface level.

I used JavaScript to create D3 visualizations for a school project once. That was years ago, but I did enjoy it. I got a 30 minute crash course in Java and JSP for the sake of a databases class and while I was able to complete the project the code was hideous and I’m not proud of it. SQL doesn’t seem hard but I have not needed to use it since 2017. I’m confident I can get up to speed with it again; at the end of the day, it’s for working with relational databases, which I conceptually understand.

After one encounters many programming languages they start to look the same. All told, I’m confident in my practical coding abilities for the purposes of data analysis.

What was your most difficult data analysis project?

My undergraduate thesis on decomposing the gender gap in wages may be the most difficult data analysis project if only because building the wage and selection models was difficult; deciding which variables belong in which equation can drastically affect the end result. My papers studying change point detection procedures were also quite difficult because the algorithms needed to be fine-tuned in order to produce answers fast. That project forced me to learn C++ and think carefully about algorithm design in order to get a product that ran fast enough to be used in millions of simulations.

Take a few minutes to explain how you would estimate how many shoes could potentially be sold in New York City each June.

Let’s start simple. Let’s assume that everyone in New York City owns one pair of shoes at a time and they buy a new pair of shoes once a year. They are equally likely to buy a pair of shoes in each month of the year. So take NYC’s total population and divide it by 12. I would call this a Fermi solution, as it’s using Fermi’s famous ad hoc method to get a quick estimate.

Clearly this approach is crude. Let’s examine why. First, we’re assuming people only want one pair of shoes at a time and only buy one pair of shoes a year. This is wrong because some people wear no shoes (but they’re rare), some people want multiple shoes, and some people need to buy multiple shoes a year even if they only wanted one pair of shoes. Children need to buy many pairs of shoes as they grow out of shoes fast. People often have separate dress shoes and general shoes. Some people like having lots of choice of shoe within a given class. Some occupations are more intensive than others and thus put the shoes through more wear and tear. Women and men may have different shoe preferences and want more or fewer shoes. So we should break down our analysis based on gender, age, and occupation at least. We could also allow for ethnic preferences too. Income needs to be incorporated, as low-income individuals are less likely to buy shoes they don’t truly need.

Also, assuming shoe needs are uniform across the year is unrealistic. Shoe sales are likely seasonal and correlated with age. People are not born uniformly through the year; August is a more common birth month than February. We could learn about at what ages in months people buy new shoes for their children to see when the kids will be getting new shoes. June is the first month of summer; do people start buying more shoes for summer? (For that matter, what counts as a shoe? Are $2 flip-flops shoes? This conversation should be had.) What are the migration patterns for the month of June? How many tourists are buying shoes in June?

We can use questions like these to start building a sophisticated model for shoe sales. We can use time series dynamics to examine the relationship between June and January shoe sales to make forecasts for shoe sales, perhaps using seasonal ARIMA models regressing on the demographic variables mentioned above. Surveys on shoe buying patterns could be conducted to understand consumer preferences and join those preferences with our demographic projections. This could perhaps produce a sophisticated shoe sale forecast.

Or we could just survey people and ask how many shoes they bought in June. This would be much simpler, though give us a rather dull fact.

Take a few minutes to explain how you would estimate how many tourists visit Paris every May.

This question is not a good one. It has a problematic framing. Because of global population growth and migration, the number of people visiting Paris every May is not constant. We could talk about an average number, but I don’t think a historical average of the number of people visiting Paris every May is all that meaningful, unless the study is essentially a historical study. Instead we should investigate the trend of the number of May tourists.

We also need to define what it means for someone to be a tourist in Paris. We must have excluded business travel. But does someone being a tourist mean that they are in France to see sites and visit locations for pure entertainment? I like playing competitive card games; if I go to Paris for a tournament, am I a tourist? If a business traveler visits the Eiffel Tower, is she a tourist? These questions need to be answered.

I have never left American borders so I cannot say what tourism in Paris is like. Perhaps we could say “any person who visits selected cultural sites in Paris is a tourist”, which would also be calling Parisians tourists, but maybe we’re okay with that. Perhaps there are turn-styles at some locations counting visitors; if we have that data we should incorporate it. Other locations may not have turn-styles and we may need to literally count each person entering some perimeter as a tourist; modern technology could perhaps provide that count. We could then collect this data every May and aggregate it to produce some tourism metrics.

What is your process when you start a new project?

I first perform a literature review to see what about the topic at hand is already known and what methods have been done before. “Literature review” could include talking to experienced individuals who have knowledge about the topic I don’t. After a literature review I next get the data and examine it. I look at its structure and see what variables are present. I will look at interesting variables and see their basic properties; what is its type, what data is missing and how do I know its missing, and what is the data’s range? Is there evidence of outliers or erroneously entered values? I try to load in the data and make some basic visualizations, but before then I’m likely writing some scripts that transform the raw data into something ready for the tools I’m using. The earliest scripts are already being written to stand alone in a pipeline, and I’m already assuming that my code could become a package meant for working with this data set and performing a particular kind of analysis, so I’m making sure that even early code is well-documented and well compartmentalized into functions. I proceed from more basic visualizations and analyses to more complex ones, learning more about the data set as I go. If this project is for prediction, I may early on divide the data into training and testing sets, along with a final test set, so my exploratory analyses are being kept away from testing data.

What are your communication strengths?

I think I’m great at presentations. I know how to make complicated ideas accessible to others, and know how to meet my audience. Just watch this video of me giving a presentation for the Utah chapter of Women in Architecture on Utah’s gender gap in wages, a packed house of over 300 looking to be entertained.

How do you handle pressure and stress?

I feel like pressure and stress are a state of mind. If there is a job that needs to be done quickly, I work to the best of my ability to get it done fast, and make it the top priority. That is all one can do. If I cannot meet the deadline, I need to let the relevant people know.

What are your long-term goals?

I want a career as a data analyst. My goals are to get a good statistical job and do good work, perhaps even encountering problems that lead to publications, but publications are not the reason I work.

Why should we hire you?

I should be hired because I am a quick and independent learner with a good work ethic and strong statistical intuition.

Hopefully these are good answers to questions I may encounter. Hopefully I won’t be looking for a job soon. But until that time, I just need to keep applying.

http://ntguardian.wordpress.com/?p=3779

Extensions

Learn Python Statistics and Machine Learning with My New Book: Training Systems using Python Statistical Modeling

ntguardian Aug 12, 2019

Packt Publishing has turned another one of my video courses, Training Your Systems with Python Statistical Modeling, into a book! This book is now available for purchase. It is a tutorial book for Python statistics and machine learning.

Show full content

Packt Publishing has turned another one of my video courses, Training Your Systems with Python Statistical Modeling, into a book! This book is now available for purchase.

Training Systems using Python Statistical Modeling is now available from Packt Publishing’s website and from Amazon. This book was created by a team at Packt Publishing who took my video course and turned it into book form. If you’re like me and love books that you can hold in your hand, touch, thumb through, etc., and you’re looking to learn about statistics and machine learning methodology as used in Python, give my book a look.

My previous book, Hands-On Data Analysis with NumPy and Pandas, covered the basics of managing data sets in Python using two common tools, NumPy and pandas, along with how to use common tools for Python data analysis such as Anaconda and Jupyter Notebooks. Training Systems using Python Statistical Modeling follows naturally from that book, going from just managing data in Python to drawing inferences and developing useful applications from that data.

In addition to NumPy and pandas, this book shows how to use Scikit-Learn, SciPy, and statsmodels for statistical inference and machine learning tasks in Python. I start the book discussing basic statistical inference, including hypothesis testing and parameter intervals. This section uses statsmodels the most. Then the book progresses to supervised learning. I start with explaining basic concepts of supervised learning and how one should choose, tune, and evaluate supervised learning algorithms. After these basics I present learning models for classification and regression using scikit-learn, including decision trees, support vector machines, logistic regression, linear regression, Ridge/LASSO regression, and neural networks. After supervised learning comes unsupervised learning. This includes clustering methods such as the k-means algorithm and dimensionality reduction techniques such as principle component analysis. And this is only a broad overview of the techniques and algorithms the book covers; there are far more than I’ve mentioned here!

While the book offers basic explanations about how these methods work, it is not a theoretical book; it primarily moves forward with real data examples. Readers not only see a description of the methods and basic parameters but how to apply them to real data sets. In many ways this is a hands-on introduction to machine learning and statistics. Thus it’s a great resource for anyone looking to start right away developing data models in Python.

I would like to add that of the books I worked on I found this book the most enjoyable to write. While Hands-On Data Analysis with NumPy and Pandas is my personal best-seller it was frankly boring to write; its content is extremely basic and, while essential, does not show the powerful things people can do with data. This book, in comparison, finally shows how data can be useful and shows how to do interesting things with it. It also covers much more material, being nearly twice as long as Hands-On Data Analysis with NumPy and Pandas.

I list the book’s chapters below:

Classical Statistical Analysis
Introduction to Supervised Learning
Binary Prediction Models
Regression Analysis and How to Use It
Neural Networks
Clustering Techniques
Dimensionality Reduction

Also, check out the book’s GitHub page to see code samples used in the book.

I would like to thank the staff at Packt Publishing for their work on this book, particularly Joseph Sunil. I was so pleased when I received my copies in the mail and I thank them for their hard work to make this possible.

The MSRP for the book is $27.99, but is currenlty on sale for $19.59 (30% off) on Packt’s website as part of their summer sale, so pick it up while it’s cheap! If you’re not interested in buying this particular book but still want access to it, perhaps consider getting a Mapt subscription. You’ll have access to thousands of books and video courses (including all of my content), and can even get one book to keep for free (without DRM) every month! Perhaps that book will be mine! It’s a great deal you should consider. Also, if you read the book, please leave a review (good or bad). These reviews help others decide whether this book is right for them. Good reviews help earn me sales. “Dissatisfied” reviews give me and Packt’s team feedback for later work (or perhaps to errata the book; there have been mistakes in it that slipped through the editing process).

smalltrainingsystemsusingpythonstatisticalmodeling

http://ntguardian.wordpress.com/?p=3750

Extensions

Grades Aren’t Normal

ntguardian Jul 30, 2019

Grades are not Normally distributed. That's not what's seen naturally in grades and the idea is not supported by statistics. You can force grades to look Normally distributed, but doing so has costs.

Show full content

Introduction

This article is also available in PDF form.

A while back someone posted on Reddit about the grading policies of their academic department. Specifically, the department chair made a statement claiming that grades should be Normally distributed with a C average. I responded, claiming that no statistician would ever take the idea that grades follow a Normal distribution seriously. Some asked for context, and I wrote a long response explaining my position. I repeat that argument here, and also give some R code demonstrations showing what curving grades does.

Grades Are Not Naturally Normal

A cheap shot would be to say that Normal random variables have no minimum or maximum so since there is a minimum or maximum grade, grades cannot be Normally distributed. This is a cheap shot because lots of phenomena that’s effectively bounded this way are fit to Normal distributions and no one bats an eye since the probability of being that far away from the mean is vanishingly small (albeit non-zero). However it could matter to grades since a larger standard deviation in grades and clumping of grades near the higher end of the distribution could mean that the probability of seeing an impossibly high grade is higher than tolerable if the grades were modeled with a Normal random variable.

Next we should agree the objective of grades is to measure students’ understanding and competency, with an “A” grade meaning “This student has mastered the material” and an “F” meaning “This student is not competent in the material”, which ranges anywhere from the student knowing something about the material but not enough to the student basically knowing nothing at all about the material.1 If the class size is somewhat small, it will be hard to see a Normal distribution naturally arise due to the natural fluctuation of students’ innate ability. It’s easier for instructors to get “smart” classes where the students overall are above-average, and also to get classes where the students are not like that. Part of this is just randomness, part is associated with semester and time of day. But there should be some natural variation due to random sampling that can make natural grades not look exactly Normal with some specified mean and variance. This could be less of a problem for “jumbo” classes, though.

Now let’s talk about grading. There’s likely some scheme that awards points to students based on their performance. These schemes are never perfect and always arbitrary but there’s generally some truth in the resulting numbers. Some people say that grading this way produces bimodal distributions, suggesting there’s clumps of students that either do or do not get the material. I often observe left-skewed distributions, where most students range between mediocre to good, some are great, and some are catastrophically bad. Neither of these are features of Normal distributions.2

Making Grades Look Normal

So to get a Normal distribution one would have to take grading based off of points then determine each student’s percentile in the class and then see what the corresponding grade would be for the respective percentiles if grades were actually Normally distributed.

Two things: first, this produces a distortion in how points work. You get non-linear benefits for points scored on anything, from homework to quizzes to tests. Specifically, it’s possible for the third point to be more valuable to your grade than the second point, and the second less valuable than the first. This is technically already true since grades are generally thresholds, but each point has an equal contribution (withing their own assignment) to reaching a particular threshold. This will not be the case if grades are curved. Students are going to struggle to understand and appreciate that.

Second: when you’re doing this, the Normal distribution actually doesn’t matter. You’re effectively assigning grades based on percentiles, not Normal percentiles specifically. You could make the grades fit any distribution you want this way, from Normal to uniform to beta to Cauchy to exponential and so on forever. You’re just saying that the lowest 20% will get Fs and the highest 20% will get A’s (or the lowest 5% will get Fs and the highest 5% will get As, if you’re actually sticking to a Normal distribution). But you’re putting artificial distance between, say, the students at the top; two top students could have mastered the material but one got an extra point because of luck, yet her grade will look much better than the other stellar student. This is because of how tails work in the Normal distribution. In fact, if students at the top figured this out, they’d start obsessing over every single point because every point would have a big impact on their grade, which would be bad for their learning and mental health.

Let’s see a data example. Here’s grades from one of the classes I taught at the University of Utah; I will not say what class and I added random noise to the grades so they’re not exactly like the class. Also, a small number of grades are fictitious.

grades <- c(10, 24, 32, 41, 49, 54, 67,  67,  70,  70, 72, 74, 76, 77, 77, 
            77, 78, 80, 80, 81, 85, 85,  86,  88,  90, 91, 91, 91, 91, 92,
            92, 92, 93, 96, 98, 98, 98, 101, 102, 102)

Figure 1: Histogram of student scores in grades.

Figure 1 presents a histogram of the data in grades. When I see that distribution, it appears to be a left-skewed distribution. Most students are in the 60-100 range, some scored more than 100, and some scored much less than 60.3 The median grade is 83, the first quartile 71.5, and the third quartile 92. Overall, not a bad distribution that emerged naturally. (No curve was applied.)

If we want to make the grades appear to follow a different distribution, we will need to do the following:

Find the percentile of each grade;
Find the corresponding distribution of the target distribution; and
Replace the original grade at a certain percentile with the corresponding grade from the target distribution.4

The following code obtains percentiles:

perc <- (1:length(grades) - 0.5)/length(grades)

Now we need to decide on the target distribution. Supposedly, according to some department chair, grades should be Normally distributed with a mean of a C, which I will take to be 75. That leaves us picking the standard deviation. We probably should pick a standard deviation such that the probability of scoring above 100% is very small; three standard deviations away from the mean should suffice. So we will say that the standard deviation is 9, so $\text{Grade}_{i} \sim \operatorname{N}(75, 81)$ .5

Let’s now transform grades; to avoid extra controversy, we will also round grades up.

curvegrades <- ceiling(qnorm(perc, mean = 75, sd = 9))

Figure 2: Histogram of curved grades.

Figure 3: Back-to-back stem plots for grades and curvegrades; produced by aplpack::stem.leaf.backback(grades, curvegrades). $\includegraphics{verbatimblock.pdf}$

These grades are displayed in Figures 2 and 3. Here are some things to notice from doing this:

The students in the tail benefited considerably from the curving, gaining considerably and going to D grades when they probably should have failed.
Most students were penalized by the curve, and in hard-to-understand, seemingly arbitrary ways.
Only one student will get an A. Another who would have gotten an A got an A-, and several other A students were pushed to the high Bs.

The curve has a very strong effect at the top of the distribution; two students with likely equivalent skill got very different grades, and the student in third place who appears to be just as skilled as the other two if it were not for luck got a B instead of an A. This appears to be very unfair.

Now we could screw around with the parameters and perhaps get a better distribution at the top of the curve, but that raises the question of why any distribution should be forced onto the data, let alone a Normal one. We could just as easily swapped qnorm() with qcauchy() and got a very different distribution for our scores. The data itself doesn’t suggest it came from a Normal distribution, so what makes the Normal distribution special, above all others?

What’s So Special about the Normal Distribution

The Normal distribution has a long history, dating back to the beginning of probability theory. It is the prominent distribution in the Central Limit Theorem and many well-known statistical tests, such at the $t$ -test and ANOVA. When people talk about “the bell curve” they are almost always referring to the Normal distribution (there is more than one “bell curve” in probability theory). The Fields Medalist Cédric Villani once said in a 2006 TED talk that if the Greeks had known of the Normal distribution they would have worshipped it like a god.

So why does the Normal distribution hold the place it does? For reference, below is the formula for the PDF of the Normal distribution with mean $\mu$ and variance $\sigma ^2$ :

$\begin{displaymath} f\left(x;\mu,\sigma^2\right) = \frac{1}{\sqrt{2\pi \sigma^2} }e^{-\frac{(x - \mu)^2}{2 \sigma^2}} .\end{displaymath}$

Figure 4: A plot of the PDF of a Normal distribution with mean $\mu$ and variance $\sigma ^2$ .

A plot of the Normal distribution is given in Figure 4. At first glance $f\left(x; \mu, \sigma^2\right)$ looks complicated, but it’s actually well-behaved and easy to work with. It’s rich in mathematical properties. While in principle any number could be produced by a Normally distributed random variable, in practice seeing anything farther than three standard deviations from the mean is unlikely. It is closed under addition; the sum of two (joinly) Normal random variables is a Normal random variable. And of course it features prominently in the Central Limit Theorem; the sum of IID random variables with finite variance starts to look Normally distributed, and this can happen even when these assumptions are relaxed. Additionally, Central Limit Theorems exists for vectors, functions, and partial sums, and in those cases the limiting distribution is some version of a Normal distribution.

Most practitioners, though, do not appreciate the mathematical “beauty” of the Normal distribution; I doubt this is why people would insist grades should be Normally distributed. Well, perhaps that’s not quite true; people may know that the Normal distribution is special even if they themselves cannot say why, and they may want to see Normal distributions appear to keep with a fad that’s been strong since eugenics. But “fad” feels like a cop-out answer, and I think there are better explanations.6

Many people get rudimentary statistical training, and the result is “cargo-cult statistics”, as described by (1);7 they practice something that on the surface looks like statistics but lacks true understanding of why the statistical methods work or why certain assumptions were made. People in statistics classes learned about the Normal distribution and their instructors (rightly) drilled its features and its importance into their heads, but the result is that they think data should be Normally distributed since it’s what they know when in reality data can follow any distribution, usually non-Normal ones.

Additionally, statistics’ most popular tests–in particular, the $t$ -test and ANOVA–calls for Normally distributed data in order to be applied. And in the defense of practitioners, there are a lot of tests calling for Normally distributed data, especially the ones they learned. But they don’t appreciate why these procedures use the Normal distribution.

The $t$ -test and ANOVA, in particular, are some of the oldest tests in existence, being developed by Fisher and Student around the turn of the century, and they prompted a revolution in science. But why did these tests use the Normal distribution? I speculate that a parametric test that worked for Normally distributed data was simply a low-hanging fruit; assuming the data was Normally distributed was the easiest way to produce a meaningful, useful product. Many tests with the same objectives as the $t$ -test and ANOVA have been developed that don’t require Normality, but these tests came later and they’re harder to do. (That said, it’s just as easy to do the $t$ -test as it is to do an equivalent non-parametric test these days with software, but software is new and also it’s harder to explain what the non-parametric test does to novices.) Additionally, results such as the Central Limit Theorem cause tests requiring Normality to work anyway in large data sets.

Good products often come for Normal things first; generalizations are more difficult and may take more time to be produced and be used. That said, statisticians appreciate the fact that most phenomena is not Normally distributed and that tweaks will need to be made when working with real data. Most people practicing statistics, though, are not statisticians; cargo-cult statistics flourishes.

Conclusion

Since statistics became prominent in science statisticians have struggled with how to handle their own success and most statistics being done by non-statisticians. Statistical education is a big topic since statistics is a hard topic to teach well. Also, failure to understand statistics produces real-world problems, from junk statistics to junk science and policy motivated by it. Assuming grades are Normally distributed is but one aspect of this phenomenon, and one that some students unfortunately feel personally.

Perhaps the first step to dealing with such problems is reading an article like these and appreciating the message. Perhaps it will change an administrator’s mind (but I’m a pessimist). But perhaps the student herself reading this will see the injustice she suffers from such a policy and appreciate why the statisticians are on her side, then commit to never being so irresponsible herself.

Bibliography 1

Philip B. Stark and Andrea Saltelli.
Cargo-cult statistics and scientific crisis.
Significance, 15 (4): 40-43, 2018.
doi: rm10.1111/j.1740-9713.2018.01174.x.
URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1740-9713.2018.01174.x.

About this document …

Grades Aren’t Normal

This document was generated using the LaTeX2HTML translator Version 2019 (Released January 1, 2019)

The command line arguments were:
latex2html -split 0 -nonavigation -lcase_tags -image_type gif simpledoc.tex

The translation was initiated on 2019-07-29

Footnotes … material.1 This is not the only possible objective of grading. Other grading objectives could be to rate students growth (how much a student improved since the beginning of the class) or to simply say which students were best and which were worst in the class. I take issue personally with either of these alternative “objectives” of grading; the first can be arbitrary and not produce useful information on the student since they could get an “A” for going from “terrible” to “mediocre”, while the second not only suffers from a similar problem (perhaps the “best” student knows little about the material) but also feel elitist. … distributions.2 Here is an argument for why grades might appear Normally distributed; since a final grade is the sum of grades from assignments, quizzes, tests, and so on, and grades generally emerge from summation of points, we could apply a version of the Central Limit Theorem to argue that the end result should be grades that appear Normally distributed. But there are still assumptions on how strong a dependence there is in points and assignments, and in fact while an individual student’s grades might start to look Normally distributed if such a process were to continue for a long time, this says nothing about the student body since one could say each student is her own data-generating process with her own parameters and those parameters do not follow some larger distribution. … 60.3 Scoring above 100 is possible in many of the classes I teach due to extra credit opportunities. The low grades are often those belonging to students who have effectively given up on the class, or at least are only moderately connected to it. … distribution.4 In the data set grades, there are ties. This data was rounded; real data would not have such an issue, and presumably an instructor would have access to the original data that wasn’t rounded. ….5 I use the variance notation for the Normal distribution; that is, $X \sim \operatorname{N}(\mu,\sigma^2)$ for a Normal random variable

with mean $\mu$ and variance $\sigma ^2$ . … explanations.6 Tests such as IQ tests and SAT/ACT tests often seem to produce scores following a Normal distribution, but this seems to follow more from construction than from nature. One can see why an educator would start to think that student grades, which also assess “intelligence,” should also be Normally distributed, but this appears to just be accepting a deception. …starksaltelli18;7 The term “cargo-cult” describes a phenomenon observed after the American Pacific campaign of World War II at the islands that once were American bases; island natives would build replicas of the bases and imitate their operations after the military left. They saw and imitated the activities without understanding why they worked. Since then the term has been use to suggest “imitation without understanding;” for example, Nobel physicist Richard Feynman coined the term “cargo-cult science” to mean activity that looks like science but does not honestly apply the scientific method.

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

$\text{Grade}_{i} \sim \operatorname{N}(75, 81)$

$\includegraphics{verbatimblock.pdf}$

$\mu$

$\sigma ^2$

$\begin{displaymath} f\left(x;\mu,\sigma^2\right) = \frac{1}{\sqrt{2\pi \sigma^2} }e^{-\frac{(x - \mu)^2}{2 \sigma^2}} .\end{displaymath}$

http://ntguardian.wordpress.com/?p=3722

Extensions

CPAT and the Rényi-Type Statistic; End-of-Sample Change Point Detection in R

ntguardian Jul 24, 2019

My second package implements change point testing procedures, especially those for end-of-sample change points. I demonstrate on an example computing a stock's alpha and beta.

Show full content

This article is also available in PDF form.

Introduction

I started my first research project as a graduate student when I was only in the MSTAT program at the University of Utah, at the very end of 2015 (or very beginning of 2016; not sure exactly when) with my current advisor, Lajos Horváth. While I am disappointed it took this long, I am glad to say that the project is finished and I am finally published.

Our article, entitled “A new class of change point test statstics of Rényi type”, is available online, published by the Journal of Business and Economic Statistics (or JBES)1 (7), but is also available on arXiv for free. In the article we present a class of statistics we call Rényi-type statistics, since they are inspired by (16). These test statistics are tailored for detecting changes that occur near the ends of the sample rather than the middle of the sample. We show that these statistics do well when structural changes happen recently (or early) in the sample, better than other popular methods.

My advisor, Lajos Horváth, and his former graduate student/assistant professor at the University of Waterloo Greg Rice developed the theory for this statistic and most of the paper had already been written when Prof. Horváth invited me to the project. I was responsible for all coding-related matters of the project, including implementing the statistics, developing and performing simulations and finding and applying the statistic to a data example. I learned a lot in the process. I was forced to learn how to organize code-intensive research projects, thus prompting (9) and (13). I learned how to write R packages and I continue to do so to organize my research projects.

This project produced CPAT, a package for change point analysis. This package exists specifically for implementing test statistics of research interest to this project but should also be useful in general. It includes not only the Rényi-type statistic but also the CUSUM statistic, the Darling-Erdös statistic, instances of the (6) statistic, and instances of the (2) statistic. To my knowledge, this is the only implementation of our new statistic and the Andrews procedure in R.

In this article I will be discussing change point testing (focusing on time series), the Rényi-type statistics, CPAT, and some examples. I will not go into full depth, including the theory of the statistic and the full simulation results; for that, see the paper we wrote. This topic was also the subject of my first Ph.D. oral defense (10), and the slides can be viewed online.

Change Point Hypothesis Testing

Suppose we have a series of data $X_1,X_2,\ldots,X_T$ . In change point hypothesis testing, some aspect of the distribution changes at some unknown point in time $t^* \in [T]$ (where $T = \left\{ 1, 2, \ldots, t \right\}$ ). Often we ask whether the mean is constant over the sample or not. Let $\E{X_i} =\mu_{i}$ . We wish to test

$\begin{displaymath} \Hypoth{0} : \mu_1=\mu_2=\cdots=\mu_{T}=\mu \end{displaymath}$

against

$\begin{displaymath} \Hypoth{A} : \mu_1=\cdots=\mu_{t^*-1}\neq \mu_{t^*}=\cdots=\mu_{T},~t^* \in [T] .\end{displaymath}$

Critically, $t^*$ is unknown; if it were known, then we could simply use classic $t$ -tests known since the turn of the century. Here, though, the location of $t^*$ is unknown. Thus, part of the challenge is determining which of $\Hypoth{0}$ and $\Hypoth{A}$ are true, and part of the challenge is estimating $t^*$ .

To some, checking for a change in $\mu_{t}$ is interesting. However, change point testing extends beyond checking for changes in the mean. In fact, change point testing can attempt to detect just about any structural change in a data set. For instance, we can check for changes in the variance of a data set, or changes in a linear regression model or a time series model. There are tests for changes in functional data and even for checking for changes in distribution.

Why do people care about these things? I can easily see why an online test–where the test is attempting to detect changes as data flows in–would be interesting; I recently read of an online algorithm for detecting a change in distribution that sought to detect changes in radiation levels (a life-or-death matter in some cases) (15). However, that’s not the context I’m discussing here; I’m interested in historical changes. Why is this of interest?

First, there’s the classical example: change point testing was developed for quality control purposes. For example, a batch of widgets would be produced by a machine, and a change point test could determine if the machine ever became miscalibrated during production of the widgets. Beyond that, though, detecting a change point in a sample can be the first step to further research; the researcher would figure out why the change occurred after discovering it. There’s a third reason that every forecaster and user of time series should be aware of, though: most statistical methods using time series data require that the data be stationary, meaning that the properties of the data remain unchanging in the sample. One can hardly justify fitting a model to a data set when not all the data was produced by the same process. So change point testing checks for a particular type of pathology in data sets that everyone using time series data should watch out for.

In this article I’m not interested in estimating $t^*$ but only in deciding between $\Hypoth{0}$ and $\Hypoth{A}$ . Change point inference is a rich area of research and several statistics are already well known and actively used. One such statistic is the CUSUM statistic:

$\begin{align} A_T= \frac{1}{\sqrt{T} }\max_{1\le t\le T} \frac{1}{\hat{\sigma}_... ...rt \sum_{s=1}^{t} x_s - \frac{t}{T}\sum_{s=1}^{T} x_s \right\vert .\end{align}$

( $\hat{\sigma}^2_{t,T}$ is an estimator of the long-run variance of $X_{t}$ ; I’ll discuss this more later.) In short, the CUSUM statistic compares subsample means of a growing window of data to the overall mean of the sample. If a subsample of the data has a mean very different from the overall mean, the null hypothesis should be rejected.

Let $W(t)$ be a Wiener process and $B(t) = W(t) - tW(1)$ be a Brownian bridge. If $\Hypoth{0}$ is true, then the limiting distribution of (1) is the Kolmogorov distribution as $T\to \infty$ :

$\begin{align} A_{T} \dto \sup_{0\le t\le 1}\left\vert B(t) \right\vert .\end{align}$

This is the distribution we use for hypothesis testing. Additionally, we have some power results: if $\Hypoth{A}$ is true and $t^* \approx \theta T$ for some $\theta\in (0,1)$ , then $A_{T} \pto \infty$ . Not only that, $A_{T}$ may have the best power among all statistic with such $t^*$ .

In our paper we call such a $t^*$ “mid-sample” (even though in principle $\theta$ could be small). We were interested more in the case where, as $T\to \infty$ , $t^*\to x\in \left\{ 0, 1 \right\}$ . We call such change points “early” or “late”. The CUSUM statistic is not as effective in those situations, but the subject of our paper was a statistic that is quite effective when the change is early or late:

$\begin{align} D_{T}(t_{T}) = t_{T}^{1 / 2} \max_{t_T\le t\le T-t_{T}} \frac{1}{... ...um_{s=1}^{t} x_s - \frac{1}{T - t}\sum_{s=t+1}^{T} x_s \right\vert .\end{align}$

Here, $t_{T}$ is a trimming parameter such that $t_{T}\to \infty$ but $\frac{t_T}{T}\to 0$ . What the Rényi-type statistic does is compare the mean of the first part of the data to the mean of the rest of the data, in the latter part. If the difference between the mean of the former part of the data and the mean of the latter part becomes large in magnitude, the null hypothesis of no change should be rejected.

In our paper we showed that, whem $\Hypoth{0}$ is true, then as $T\to \infty$ , $D_{T}(t_{T}) \dto \max(\xi_1,\xi_2)$ , where $\xi_1$ and $\xi_2$ are IID copies of $\xi= \sup_{0\le t\le 1}\left\vert W(t) \right\vert$ . This establishes the distribution for computing $p$ -values. As for $\Hypoth{A}$ , we gave conditions on $t^*$ such that when $\Hypoth{A}$ is true, then $D_{T}(t_{T}) \pto \infty$ . These conditions not only include most of those for which $A_{T} \pto \infty$ , (such as when $t^* \approx \theta T$ ), but also conditions not covered by the CUSUM statistic in which the change tends to be early or late in the sample. For example, it’s possible to show that, as $T\to \infty$ , the CUSUM statistic does not have power asymptotically when $t^* =O(\log(T))$ or when $t^{*}=O(T^{1 / k})$ for $k > 1$ , but $D_{T}(t_{T})$ does when $t_{T}$ is chosen appropriately (letting $t_T = \log(T)$ or $t_T = \sqrt{T}$ often works).

As mentioned above, while testing for changes in the mean is nice, we want to check for structural changes not just in the mean but in for many statistics and models. And in fact we can. For example, let’s suppose we have a regression model:

$\begin{align} y_{t}= \bm{\beta}_{t}^\top x_{t} + \epsilon_{t},~x_{t}, \bm{\beta}_{t}\in \R^{d},~\epsilon_{t}\in \R .\end{align}$

Here, $\{\epsilon_{t}\}_{t\in \Z}$ is a stationary white noise process ( $\E{\epsilon_{t}} =0$ and $\epsilon_{t}$ has more than two moments) and $\{x_{t}\}_{t\in \Z}$ is a sequence of stationary random vectors. We wish to test

$\begin{displaymath} \Hypoth{0} : \bm{\beta}_{1} = \cdots = \bm{\beta}_{T} = \bm{\beta} \end{displaymath}$

against

$\begin{displaymath} \Hypoth{A} : \bm{\beta}_{1} = \cdots \bm{\beta}_{t^* - 1} \neq \bm{\beta}_{t^*} = \cdots = \bm{\beta}_{T},~t^* \in [T] .\end{displaymath}$

In fact, to do so, all you need to do is:

Assume $\Hypoth{0}$ is true and compute an estimate $\hat{\bm{\beta}}$ of $\bm{\beta}$ ;
Compute the residuals of the model, $\hat{\epsilon}_{t}=y_{t} - \hat{\bm{\beta}}_{t}^\top x_{t}$ ; and
Compute $D_{T}(t_{T})$ using $\left\{ \hat{\epsilon}_{t} \right\}_{t \in [T]}$ as your data.

This procedure does work; the statistic has the same limiting distribution under the null hypothesis as before and the test has power when the null hypothesis is false, if $\E{x_{t}}$ and $\bm{\beta}_{T}-\bm{\beta}_{1}$ also have an appropriate relationship (you can construct situations where the test does not have power, such as when $\E{x_t}$ and $\bm{\beta}_{T}-\bm{\beta}_{1}$ are orthogonal). The same procedure can be used with the CUSUM statistic, too, and both procedures have similar relationships to $t^*$ : the CUSUM statistic has better power when $t^*$ is mid-sample and the Rényi-type statistic has better power early/late sample.

This idea–use the residuals of the estimated model you’re testing for a change in–works not just for regression models but for time series models, generalized method of moments models, and I’m sure others that we haven’t considered. Later I will demonstrate checking for changes in regression models.

However, there is an issue I have ignored until now: $\hat{\sigma}_{t,T}^2$ . This is an estimator for the long-run variance of the data. If the null hypothesis were true and there was no serial correlation in the data, one could simply use the sample variance computed on the entire sample. However, even in the uncorrelated case we use an estimator that is a consistent estimator even when the break point occurs at $t \in [T]$ for every $t$ . However, when working with time series data, we recommend using a kernel density estimator for the long-run variance.

I will not repeat the estimator here since the formula is quite involved; you can look at the original paper if you’re interested. These estimators are generally robust to heteroskedasticity (such as GARCH behavior) and autocorrelation, though they are far from perfect. In our simulations, we observed significant size inflation when there was noticeable serial correlation in the data since these estimators tend to underestimate the long-run variance. This is not a problem of the Rényi-type statistic but any statistic relying on these estimators for long-run variance estimation. Additionally, practitioners should play around with the kernel and bandwidth parameters to get the best results, since some kernels/bandwidths work better in some situations than others.

CPAT

My contribution to this project was primarily coding, from developing the implementations of these statistics to performing simulations. The result is the package CPAT (11), which is available both on CRAN and on GitHub.

While above I mentioned only the CUSUM test and the Rényi-type test, we investigated other tests as well. For instance, we considered weighted/trimmed variants of the CUSUM test, the Darling-Erdös test, the Hidalgo-Seo test, and Andrews’ test. Many of these tests are available in some form in CPAT, and have similar interfaces when sensible.

library(CPAT)

      ________ _________ _________ __________
     /       //  ___   //  ___   //         /
    /   ____//  /  /  //  /  /  //___   ___/
   /   /    /  /__/  //  /__/  /    /  /
  /   /___ /  ______//  ___   /    /  /
 /       //  /      /  /  /  /    /  /
/_______//__/      /__/  /__/    /__/        v. 0.1.0
Type citation("CPAT") for citing this R package in publications

Call me a dork, but I saw a message like this used by the package mclust (17) and Stata and I wanted to do so myself. I think it’s cool and I plan on making messages like these for my packages for no good reason (MCHT (12) has a similar on-load message).

Moving on, let’s create an artificial data set for playing with.

vec1 <- c(rnorm(100, 0), rnorm(100, 1))

Right in the middle of the sample, the mean of the data switches from 0 to 1. Thus there is a change point in the mean.

Let’s first apply the CUSUM test using the function CUSUM.test().

CUSUM.test(vec1)

        CUSUM Test for Change in Mean

data:  vec1
A = 4.0675, p-value = 8.66e-15
sample estimates:
t* 
97

All of the test functions, including CUSUM.test(), produce S1-class objects of class htest, the same objects that stats statistical testing functions produce.

In this situation, the CUSUM test worked well; it rejected the null hypothesis of no change, as it should. It even made a good guess as to the location where the mean occured, estimating that $t^* = 97$ , which is close to the true location of 100. Okay, let’s try another data set.

vec2 <- as.numeric(arima.sim(200, 
                             model = list(
                               order = c(1, 0, 0),
                               ar = 0.4)))
CUSUM.test(vec2)

        CUSUM Test for Change in Mean

data:  vec2
A = 1.393, p-value = 0.04127
sample estimates:
 t* 
166

This data set does not have a change in mean yet the CUSUM test rejected the null hypothesis. Why is that so? This is because the default procedure for estimating the long-run variance will not work if the data is serially correlated, and in this case $X_t \sim \operatorname{AR}(1)$ with autocorrelation parameter $\rho=0.4$ . We need to switch to kernel-based long-run variance estimation.

To turn on kernel-based methods we only need to set use_kernel_var = TRUE, but there are additional choices to make. Specifically, we need to choose the kernel and the bandwidth. There are defaults but in general CPAT outsources kernel and bandwidth selection to functions from cointReg (3). Thus they use a similar interface.

The default is to use the Bartlett kernel and select the bandwidth using the method recommended by (1). We can switch to the quadratic-spectral kernel and the (14) method like so:

CUSUM.test(vec2, use_kernel_var = TRUE, kernel = "qs", bandwidth = "nw")

        CUSUM Test for Change in Mean

data:  vec2
A = 0.86106, p-value = 0.4487
sample estimates:
 t* 
166

Now we get a much more reasonable $p$ -value.

Let’s work an example for detecting structural changes in linear regression models. Another package for change point detection is the package strucchange (18), and it provides a package tracking US income and expenditures. We can test for structural changes in expenditures like so:

library(strucchange)
library(dynlm)
data(USIncExp)
incexpres <- residuals(
               dynlm(d(log(expenditure)) ~ d(log(income)),
                     data = USIncExp))
CUSUM.test(incexpres, use_kernel_var = TRUE)

        CUSUM Test for Change in Mean

data:  incexpres
A = 1.5783, p-value = 0.01372
sample estimates:
 t* 
320

Notice I set use_kernel_var = TRUE; there is some evidence of autocorrelation in the data, though this is a good option to turn on in general since we never know whether data is auto-correlated or not.

Okay, enough about the CUSUM statistic: let’s discuss the Rényi-type statistic. Using the Rényi-type statistic works just like using the CUSUM statistic:

HR.test(vec1)

        Horvath-Rice Test for Change in Mean

data:  vec1
D = 2.655, log(T) = 5.2983, p-value = 0.03148
sample estimates:
t* 
97

The Rényi-type statistic was also able to detect the change in mean in the first sample.2 It wasn’t as authoritative as the CUSUM test but it was not incorrect. How about the second sample when we turn on the kernel-based long-run variance estimation?

HR.test(vec2, use_kernel_var = TRUE, kernel = "qs", bandwidth = "and")

        Horvath-Rice Test for Change in Mean

data:  vec2
D = 1.0009, log(T) = 5.2983, p-value = 0.8619
sample estimates:
 t* 
166

The Rényi-type statistic, though, has an additional parameter to set: the trimming parameter, which above was $t_T$ . Here the parameter is called kn, and HR.test() expects a function be supplied to it.

HR.test(vec2, use_kernel_var = TRUE, kn = sqrt)

        Horvath-Rice Test for Change in Mean

data:  vec2
D = 1.7249, sqrt(T) = 14.142, p-value = 0.3096
sample estimates:
 t* 
166

We mentioned that the Rényi-type test shines when the change point occurs very early or late in the sample. Here is a demonstration:

vec3 <- c(rnorm(5, mean = 0), rnorm(195, mean = 1))
CUSUM.test(vec3)
HR.test(vec3)

        CUSUM Test for Change in Mean

data:  vec3
A = 0.803, p-value = 0.5393
sample estimates:
t* 
27 


        Horvath-Rice Test for Change in Mean

data:  vec3
D = 2.6722, log(T) = 5.2983, p-value = 0.02991
sample estimates:
t* 
 5

Here the CUSUM statistic fails to detect the early structural change but the Rényi-type statistic does not.3In general, the CUSUM statistic will have better power when the change occurs near the middle of the sample, while the Rényi-type statistic will have better power when the change happens early or late.

Let’s wrap up our introduction to HR.test() with a demonstration of testing for changes in regression models:

HR.test(incexpres, use_kernel_var = TRUE)

        Horvath-Rice Test for Change in Mean

data:  incexpres
D = 1.4198, log(T) = 6.2246, p-value = 0.5257
sample estimates:
t* 
24

Finally, let’s consider some other statistics available in CPAT. First, there’s the Darling-Erdös statistic. Let

$\begin{displaymath} \overline{A}_T(\tau, t_T) = \frac{1}{\sqrt{T} }\max_{t_T\le... ...m_{s=1}^{t} x_s - \frac{t}{T}\sum_{s=1}^{T} x_s \right\vert .\end{displaymath}$

When $t_T=1$ , this is known as the weighted CUSUM statistic, and in general, for $\tau>\frac{1}{2}$ we know this as the weighted and trimmed CUSUM statistic.4 This version of the CUSUM statistic should handle early/late change points better, though our (unreported) simulations suggest that the Rényi-type statistic has better power in the extreme cases we considered. The Darling-Erdös statistic is

$\begin{displaymath} E_{T} = \left( 2\log\left( \log\left( \frac{T}{(\log(T))^{3... ...ght) \right) \right)^{1 / 2} \overline{A}_{T}(1 / 2) - M_{T} ,\end{displaymath}$

where $M_{T}=2\log\left( \log\left( \frac{T}{(\log(R))^{3 / 2}} \right) \right) - (1 /... ...g\left( \frac{T}{(\log(T))^{3 / 2}} \right) \right) \right) + (1 / 2)\log(\pi)$ . It exists in CPAT as DE.test().

DE.test(vec1)
DE.test(vec2, use_kernel_var = TRUE, kernel = "qs", bandwidth = "nw")
DE.test(vec3)

        Darling-Erdos Test for Change in Mean

data:  vec1
A = 6.9468, a(log(T)) = 1.8261, b(log(T)) = 3.0180, p-value = 0.001922
sample estimates:
t* 
97 


        Darling-Erdos Test for Change in Mean

data:  vec2
A = 1.1004, a(log(T)) = 1.8261, b(log(T)) = 3.0180, p-value = 0.486
sample estimates:
 t* 
166


        Darling-Erdos Test for Change in Mean

data:  vec3
A = 1.4832, a(log(T)) = 1.8261, b(log(T)) = 3.0180, p-value = 0.3648
sample estimates:
t*
 5

DE.test(incexpres, use_kernel_var = TRUE)

        Darling-Erdos Test for Change in Mean

data:  incexpres
A = 1.8017, a(log(T)) = 1.9123, b(log(T)) = 3.3864, p-value = 0.2811
sample estimates:
 t* 
320

Then there’s the Hidalgo-Seo statistic, existing in CPAT as HS.test(). This function was designed for working with univariate data; one could try to use HS.test() for testing for changes in regression models as we did above, but that would not be using the same statistic that Hidalgo and Seo presented.5 Additionally, this function does not have the same interface as the others in CPAT; it does not use kernel-based methods for long-run variance estimation but uses the procedure presented in the paper where the residuals are allowed to follow an $\operatorname{AR}(1)$ process. Thus, there is a parameter, corr, for controlling whether this approach is used or not (if not, the residuals are treated as if they are IID); corr = FALSE by default.

HS.test(vec1)
HS.test(vec2, corr = TRUE)
HS.test(vec3)

        Hidalgo-Seo Test for Change in Mean

data:  vec1
A = 10.638, Correlated Residuals = TRUE, p-value = 0.009749
sample estimates:
t* 
97 


        Hidalgo-Seo Test for Change in Mean

data:  vec2
A = 2.0908, Correlated Residuals = TRUE, p-value = 0.505
sample estimates:
 t* 
166 


        Hidalgo-Seo Test for Change in Mean

data:  vec3
A = 5.7549, Correlated Residuals = TRUE, p-value = 0.1065
sample estimates:
t* 
 5

Finally, there’s Andrews’ test. This is the most distinct test of the ones presented here. The version implemented here is designed for detecting late changes in the sample only; there is not a version for early or mid-sample changes. Apparently at least one person uses this test, which was surprising to me; I thought it would be the least used. But since someone is using the test, I may need to implement these other versions of Andrews’ test; I give the people what they want!

Andrews’ test does not test the same set of hypotheses we considered above; instead, the version for late changes claims, under the null hypothesis, that the change happens after some point $1<m\le T$ in the sample, and that $m$ is known (but $t^*$ supposedly is not); in short, under $\Hypoth{A}$ , $t^* \in \left\{ m, m+1,\ldots,T \right\}$ . We consider this an odd setup; if this were true and we knew where the change might happen, we could simply use a two-sample $t$ -test (or the equivalent for, say, regression models) like students learn in statistics 101 classes and the resulting test would be, provably, the most powerful test.

Nevertheless, CPAT supports this test via the Andrews.test() function. Different versions of the test will be used depending on the input. If the data is univariate, the version run will be specifically made for univariate data. Andrews.test() requires that the parameter $M$ be set, since this is a necessary input for the alternative hypothesis that cannot be guessed; this is the first point in the sample where a change could occur.

Andrews.test(vec1, 100)

        Andrews' Test for Structural Change

data:  vec1
S = 96.49, m = 100, p-value = 1

Andrews’ procedure does not worry about long-run variance estimation like the other statistics; this is because of the data-dependent subsampling procedure Andrews’ test uses for computing $p$ -values. (The setup of the test may be silly, but the end result is fascinating!)

Andrews.test(vec2, 100)

        Andrews' Test for Structural Change

data:  vec2
S = 100.92, m = 100, p-value = 1

Andrews’ test was designed for detecting changes near the ends of the sample. In this last example, the change occurs in the last 5 observations, and I tell the test that the change will happen after observation 190.

Andrews.test(rev(vec3), 190)

        Andrews' Test for Structural Change

data:  rev(vec3)
S = 13.253, m = 10, p-value = 0.6188

Again, the Rényi-type statistic comes out on top.

I mentioned that different versions of Andrews’ test will be used depending on the input. If the input to the test is a data.frame, then Andrews.test() will expect the user to supply a formula; it’s expecting to test for a structural change in a regression model.

mod <- dynlm(d(log(expenditure)) ~ d(log(income)), data = USIncExp)
X <- as.data.frame(model.frame(mod))
names(X) <- c("exp", "inc")
Andrews.test(exp ~ inc, x = X, M = 300)

        Andrews' Test for Structural Change

data:  X
S = 1.7764, m = 205, p-value = 0.9792

Change Point Detection Example: Beta

Now that I’ve shown how to use the main functions of CPAT, I want to demonstrate detecting structural change in a real-world context. There is an example in the paper that demonstrates that the Rényi-type statistic could have detected changes in the (5) five-factor model when estimated for a portfolio of banking stocks around the time of the 2008 financial crisis. Additionally, the functions presented here would have detected a change in the relationship between US worker productivity and compensation, which I wrote about before (8).

Let’s look at a different example; beta. Consider a stock; let $R_{t}$ be the return of this stock at time $t$ . Let $r$ be the risk-free rate of return (such as the return from US Treasury notes) and let $M_{t}$ be the market return (perhaps the return from investing in an S&P 500 index fund) at time $t$ . If you’re familiar with finance, you may have heard of quantities such as alpha and beta. Alpha is the excess return of the stock over the market; that is, how much of a profit one gets for investing in the stock rather than the market (less the risk-free rate of return). Beta, on the other hand, measures how much the stock moves like the market. If a stock’s beta is zero, it has no relation with the market; if its beta is positive, the stock does well when the market does well on average; and if beta is negative, the stock does well when the market does poorly, on average. If the beta is above one, the stock is more sensitive to the market (so a beta of two suggests that if the market had a return 1% in excess of average, the stock should be 2% in excess of its own average), while a beta less than one suggests a stock less sensitive to the movements of the market (an analogous statement can be made for betas above or below negative one).

I remember learning about beta in my mathematical finance class and a student asking “Why should beta be constant over time?” That’s a great observation, and one that I’m here to address. See, alpha and beta are computed by estimating parameter $\alpha$ and $\beta$ (for alpha and beta, respectively) in the regression model

$\begin{align} R_{t} - r = \alpha + \beta (M_{t} - r) + \epsilon_t .\end{align}$

Thus, checking that alpha and beta are constant over time equates to performing a change point hypothesis test. When estimating $\beta$ , it makes sense for people practicing quantitative finance to check for change points. If there is a change point (or perhaps more than one), then using more data won’t make the estimate of $\beta$ better, since some of the data was data produced by a different $\beta$ than the most recent one.

CPAT comes with a subset of the Fama-French factors needed for estimating Fama-French five-factor models. We will use only some of those factors here to compute the $\beta$ that most investors are familiar with. Since the data set also comes with estimates of the risk-free rate of return, we will use those as well.

Now for the stock. Let’s study Valeant Pharmaceuticals (now known as Bausch Health, in an effort to escape the reputation they rightfully earned), the subject of an episode of Dirty Money (4). Valeant had ticker symbol VRX (now BHC) and after 2008 it had a meteoric rise until Valeant’s business practices of buying drugs and hiking their prices to absurd, unaffordable levels–along with outright fraudulent behavior–attracted unwanted attention, preceding the collapse of the stock.

Figure 1: The rise and fall of Valeant.

We can download the data for VRX from Quandl.

library(Quandl)
library(xts)

VRX <- Quandl("WIKI/VRX", type = "xts")
VRX <- VRX$`Adj. Close`
names(VRX) <- "VRX"

The Fama-French factors are stored in the data.frame ff. Let’s next create an object containing both VRX’s log returns and the Fama-French factors.

data(ff)
ff <- as.xts(ff, order.by = as.Date(rownames(ff), "%Y%m%d"))
VRX <- diff(log(VRX)) * 100

VRXff <- merge(VRX, ff, all = FALSE)[-1,]

Let’s now compute the alpha and beta of VRX and test for structural change. Note that RF is the risk-free rate of return and Mkt.RF is the market return less the risk-free rate.

mod <- dynlm(I(VRX - RF) ~ Mkt.RF, data = as.zoo(VRXff["2009/"]))
CUSUM.test(residuals(mod), use_kernel_var = TRUE)

        CUSUM Test for Change in Mean

data:  residuals(mod)
A = 1.9245, p-value = 0.001214
sample estimates:
  t* 
1659

The CUSUM test does detect a change. Additionally, it believes the change happened in August 2015. This date is a good guess; the first bad news for Valeant occurred in October 2015, when a short seller named Andrew Left claimed that Valeant was engaging in perhaps illegal relations with a pharmacist named Philidor Rx Services.

But that conclusion comes after having 2016 and 2017 data, when Valeant had clearly fallen from their “prime.” What if it was the end of 2015 and you were recalibrating your models then? Would you have detected a change from the CUSUM test? Or even the Darling-Erdös test?

Let’s find out.

mod2 <- dynlm(I(VRX - RF) ~ Mkt.RF, data = as.zoo(VRXff["2009/2015"]))
CUSUM.test(residuals(mod2), use_kernel_var = TRUE)
DE.test(residuals(mod2), use_kernel_var = TRUE)

        CUSUM Test for Change in Mean

data:  residuals(mod2)
A = 0.96233, p-value = 0.3126
sample estimates:
  t* 
1652 


        Darling-Erdos Test for Change in Mean

data:  residuals(mod2)
A = 2.8559, a(log(T)) = 2.0057, b(log(T)) = 3.8000, p-value = 0.1086
sample estimates:
  t* 
1690

If you were just using the CUSUM test: no, you would not have noticed a change. The Darling-Erdös test is closer, but not convincing.

How about the Rényi-type test?

HR.test(residuals(mod2), use_kernel_var = TRUE, kn = sqrt)

        Horvath-Rice Test for Change in Mean

data:  residuals(mod2)
D = 3.6195, sqrt(T) = 41.976, p-value = 0.00118
sample estimates:
  t* 
1690

Yes, you would have detected a change if you were using the Rényi-type statistic.

Conclusion

First: how would I expect people to use the Rényi-type statistic in practice? Well, I’m not expecting the statistic to replace other statistics, such as the CUSUM statistic. The Rényi-type statistic has strengths not well emulated by other test statistics, but it also has relative weaknesses, too. My thought is that the test should be used along with those other tests. If both tests agree, cool; if they disagree, check where the tests suggest the change occurred and decide if you believe the test that rejected. And you should usually use kernel-based long-run variance estimation methods unless you have good reason not to do so.

We are continuing to study the Rényi-type statistic and hope to get more publications from it. One issue I think we did not adequately address is how to select the trimming parameter. This is an important decision that can change the results of tests, and we don’t have any rigorous suggestions on how to pick the parameter other than $t_T = \log(T)$ and $t_T = \sqrt{T}$ seem to work well, though one sometimes works better than the other.

We have a lot of material for those who want to learn more about this subject. There’s the aforementioned paper that goes into more depth than this brief article, and gives more convincing evidence that our new statistic works well in early/late change contexts. Additionally, the CPAT archive contains a version of the package that includes not only the version of the software used for the paper but also the code that allows others to recreate and “play” with our simulations. (In other words I attempted to make our work reproducible.)

If you’re using CPAT, I want to hear your feedback! I got feedback from one user and not only did it make my day it also gave me some idea of what I should be working on to release in future versions. And perhaps user questions will turn into future research.

Finally, if you’re working with time series data and are unaware of or not performing change point analysis, I hope that this article was a good introduction to the subject and convinced you to start using these methods. As my example demonstrated, good change point detection can matter even in a monetary way. Using more data won’t help if the old data is no longer relevant.

Bibliography 1

Donald W. K. Andrews.
Heteroskedasticity and autocorrelation consistent covariance matrix estimation.
Econometrica, 59 (3): 817-858, May 1991.

Donald W. K. Andrews.
End-of-sample instability tests.
Econometrica, 71 (6): 1661-1694, 2003.
ISSN 00129682, 14680262.
URL https://www.jstor.org/stable/1555535.

Philipp Aschersleben and Martin Wagner.
cointReg: Parameter Estimation and Inference in a Cointegrating Regression, 2016.
URL https://CRAN.R-project.org/package=cointReg.
R package version 0.2.0.

Erin Lee Carr.
Dirty money; drug short.
Film, January 2018.

Eugene F. Fama and Kenneth R. French.
A five-factor asset pricing model.
Journal of Financial Economics, 116 (1): 1 – 22, 2015.
ISSN 0304-405X.
doi: rm10.1016/j.jfineco.2014.10.010.
URL http://www.sciencedirect.com/science/article/pii/S0304405X14002323.

Javier Hidalgo and Myung Hwan Seo.
Testing for structural stability in the whole sample.
Journal of Econometrics, 175 (2): 84 – 93, 2013.
ISSN 0304-4076.
doi: rm10.1016/j.jeconom.2013.02.008.
URL http://www.sciencedirect.com/science/article/pii/S0304407613000626.

Lajos Horváth, Curtis Miller, and Gregory Rice.
A new class of change point test statistics of Rényi type.
Journal of Business & Economic Statistics, 0 (0): 1-19, 2019.
doi: rm10.1080/07350015.2018.1537923.
URL https://doi.org/10.1080/07350015.2018.1537923.

Curtis Miller.
Did wages detach from productivity in 1973? an investigation.
Website, September 2016.
URL https://ntguardian.wordpress.com/2016/09/12/wages-detach-productivity-1973/.
Blog.

Curtis Miller.
How should i organize my R research projects?
Website, August 2018a.
URL https://ntguardian.wordpress.com/2018/08/02/how-should-i-organize-my-r-research-projects/.
Blog.

Curtis Miller.
Change point statistics of Rényi type.
Online, November 2018b.
URL http://rgdoi.net/10.13140/RG.2.2.36035.45604.
Presentation.

Curtis Miller.
CPAT: Change Point Analysis Tests, 2018c.
URL https://cran.r-project.org/package=CPAT.
R package version 0.1.0.

Curtis Miller.
MCHT: Bootstrap and Monte Carlo Hypothesis Testing, 2018d.
R package version 0.1.0.

Curtis Miller.
Organizing R research projects: CPAT, a case study.
Website, February 2019.
URL https://ntguardian.wordpress.com/2019/02/04/organizing-r-research-projects-cpat-case-study/.
Blog.

Whitney K. Newey and Kenneth D. West.
Automatic lag selection in covariance matrix estimation.
Review of Economic Studies, 61 (4): 631-653, 1994.

Oscar Hernan Madrid Padilla, Alex Athey, Alex Reinhart, and James G. Scott.
Sequential nonparametric tests for a change in distribution: An application to detecting radiological anomalies.
Journal of the American Statistical Association, 114 (526): 514-528, 2019.
doi: rm10.1080/01621459.2018.1476245.
URL https://doi.org/10.1080/01621459.2018.1476245.

Alfréd Rényi.
On the theory of order statistics.
Acta Mathematica Academiae Scientiarum Hungarica, 4 (3): 191-231, Sep 1953.
ISSN 1588-2632.
doi: rm10.1007/BF02127580.
URL https://doi.org/10.1007/BF02127580.

Luca Scrucca, Michael Fop, Thomas Brendan Murphy, and Adrian E. Raftery.
mclust 5: clustering, classification and density estimation using Gaussian finite mixture models.
The R Journal, 8 (1): 205-233, 2016.
URL https://journal.r-project.org/archive/2016-1/scrucca-fop-murphy-etal.pdf.

Achim Zeileis, Christian Kleiber, Walter Krämer, and Kurt Hornik.
Testing and dating of structural changes in practice.
Computational Statistics & Data Analysis, 44: 109-123, 2003.

About this document …

CPAT and the Rényi-type Statistic: End-of-Sample Change Point Detection in R

This document was generated using the LaTeX2HTML translator Version 2019 (Released January 1, 2019)

The command line arguments were:
latex2html -split 0 -nonavigation -lcase_tags -image_type gif simpledoc.tex

The translation was initiated on 2019-07-23

Footnotes …JBES)1 It has yet to appear in the print edition. … sample.2 HR.test() also supplies an estimate for when the change occurred, but unlike for the CUSUM statistic there isn’t theory to back it up and we don’t recommend using it over the CUSUM estimate. It’s the location where the difference in the sub-sample means is greatest. In later versions we may delete this feature. … not.3 Astute readers may notice that <!– MATH $t^* $t^* < t_{T}=\log(T)$ . We don’t need $t^*\ge t_T$ or $t^* \le t - t_T$ in order for a change to be detected; in fact, in our simulations this was often not the case, yet the Rényi-type statistic could still detect such early changes. … statistic.4 There isn’t a publicly available implementation of the weighted-trimmed CUSUM statistic in CPAT, though we may add this in the future, especially if users request it. … presented.5 A statistic designed specifically for linear regression models may be coming in the future.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

$X_1,X_2,\ldots,X_T$

$t^* \in [T]$

$T = \left\{ 1, 2, \ldots, t \right\}$

$\E{X_i} =\mu_{i}$

$\begin{displaymath} \Hypoth{0} : \mu_1=\mu_2=\cdots=\mu_{T}=\mu \end{displaymath}$

$\begin{displaymath} \Hypoth{A} : \mu_1=\cdots=\mu_{t^*-1}\neq \mu_{t^*}=\cdots=\mu_{T},~t^* \in [T] .\end{displaymath}$

http://ntguardian.wordpress.com/?p=3635

Extensions

What is the probability that in a box of a dozen donuts picked from 14 flavors there’s no more than 3 flavors in the box?

ntguardian May 20, 2019

Dave's Donuts offers 14 flavors of donuts (consider the supply of each flavor as being unlimited). The “grab bag” box consists of flavors randomly selected to be in the box, each flavor equally likely for each one of the dozen donuts. What is the probability that at most three flavors are in the grab bag box of a dozen?

Show full content

Problem

Dave’s Donuts offers 14 flavors of donuts (consider the supply of each flavor as being unlimited). The “grab bag” box consists of flavors randomly selected to be in the box, each flavor equally likely for each one of the dozen donuts. What is the probability that at most three flavors are in the grab bag box of a dozen?

Solution

For this we will need the multinomial distribution, which is a discrete probability distribution. In a string of characters there are $k$ characters possible to fill one position of the string, which is $n$ characters long. Therandom variable $X_1$ counts the number of occurrences of character 1 in the string, $X_2$ the number of occurrences of character 2, and so on until $X_{k}$ . Let $p_1,\ldots,p_{k}$ be the individual probability each of the characters could appear in a position of the string; each position is filled independently of the characters in other positions. Let $x_1,\ldots,x_{k} \in [n]\cup \left\{0\right\}$ such that $x_1+\ldots+x_{k}=n$ . Then

$\displaystyle \Prob{X_1=x_1,\ldots,X_{k}=x_{k}} =\frac{n!}{x_1!\ldots x_{k}!}p_1^{x_1}\ldots p_{k}^{x_{k}}.$

Here, $k=14$ , $p_1=\ldots=p_{14}=\frac{1}{14}$ , and $n=12$ . So we can say

$\displaystyle \Prob{X_1=x_1,\ldots,X_{14}=x_{14}}=\frac{12!}{x_1!\ldots x_{14}!}\left(\frac{1}{14} \right)^{12}.$ (1)

We will say

$\displaystyle \Prob{\text{At most three flavors}}=$ $\displaystyle \Prob{\text{Exactly one flavor}}$ $\displaystyle +\Prob{\text{Exactly two flavors}}$ $\displaystyle +\Prob{\text{Exactly three flavors}}.$

Compute each of those probabilities separately.

If $X_1=12$ and $X_2=\ldots=X_{14}=0$ , there is exactly one flavor in the box.(1) shows the probability this happens is $\frac{1}{14^{12}}$ . Since we could pick an $X_{i}=12$ and there were 14 ways to make this decision, we can say

$\displaystyle \Prob{\text{Exactly one flavor}}=\frac{1}{14^{11}}.$ (1)

Let’s now compute $\Prob{\text{Exactly two flavors}}$ . We start by fixing $X_3=\ldots=X_{14}=0$ . We get

$\displaystyle \Prob{X_3=\ldots=X_{14}=0}=\sum_{i=0}^{12} \binom{n}{i} \left( \frac{1}{14}\right)^{12} = \left( \frac{2}{14} \right)^{12}.$ (2)

Unfortunately (2) includes cases where there’s actually only one flavor present in the box, so compute

$\displaystyle \Prob{X_1\ge 1,X_2\ge 1,X_3=0,\ldots,X_{14}=0}$ $\displaystyle = \sum_{i=1}^{11}\binom{n}{i} \left( \frac{1}{14} \right)^{12}$ (3) $\displaystyle = \left( \frac{2}{14}\right)^{12}-2\left(\frac{1}{14} \right)^{12}.$ (4)

Of course we could have picked different variables to fix at zero, andthere were $\binom{14}{2}$ ways to pick the variables to fix at zero (or equivalently, pick the variables to not fix at zero), finally yielding

$\displaystyle \Prob{\text{Exactly two flavors}}=\binom{14}{2} \left( \left( \frac{2}{14}\right)^{12}-2\left(\frac{1}{14} \right)^{12} \right).$ (5)

Now to compute $\Prob{\text{Exactly three flavors}}$ . Again we start by fixing $X_4=\ldots=X_{14}=0$ and compute

$\displaystyle \Prob{X_1\ge 1,X_2\ge 1,X_3\ge 1,X_4=\ldots=X_{14}=0}=\sum_{i=1}^......\left( \frac{12!}{i!j!(12 - i - j)!} \right) \left(\frac{1}{14} \right)^{12}.$ (6)

We could try and use tricks to compute (6) or we can acknowledge that we’re busy people and ask SymPy to do it. Check that the following Python code is correct:

from sympy import init_session, binomial
init_session()

def multinomial(params):
    if len(params) == 1:
        return 1
    return binomial(sum(params), params[-1]) * \
                    multinomial(params[:-1])

l1 = list()
for i in range(1, 10 + 1):
    v = sum([multinomial([i, j, (12 - i - j)]) for j in range(1,
                                                    11 - i + 1)])
    l1.append(v)

sum(l1)/14**12    # Solution

The resulting probability is $\frac{129789}{14173478093824}$ . We could have picked different flavors to fix, and there were $\binom{14}{3}$ ways to pick the flavors to fix, so we get

$\displaystyle \Prob{\text{Exactly three flavors}} = \binom{14}{3}\left( \frac{129789}{14173478093824} \right) = \frac{1687257}{506195646208}.$ (7)

We can write (1) and (5) as $\frac{1}{4049565169664}$ and $\frac{26611}{4049565169664}$ , respectively. Summing these probabilities yields

$\displaystyle \Prob{\text{At most three flavors}}=\frac{3381167}{1012391292416}\approx3.34\times 10^{-6}.$ (8) Related Question

This is the proper way to obtain the probability that there are at most three flavors in the “grab bag” box, but how many boxes exist in which there are at most three flavors when we discount the number of ways there are to arrange the donuts in a box?

If there’s exactly one flavor, then we pick it and fill the box with that flavor; there’s 14 ways to pick one flavor. If there’s exactly two flavors in the box, we’ll call them Flavor 1 and Flavor 2. There is at least one donut of Flavor 1 and one of Flavor 2. Now pick the rest of the donuts’ flavors, order doesn’t matter, there is replacement; there are $\binom{10 + 2 - 1}{10} =\binom{11}{10} = 11$ ways to do that. Then pick the two flavors: there’s $\binom{14}{2}$ ways to do that, and thus $11\binom{14}{2}$ boxes with exactly twoflavors. Similarly, for exactly three flavors, there are $\binom{14}{3}\binom{9 + 3 - 1}{9} = \binom{14}{3} \binom{11}{9}$ ways for there to be exactly three flavors. Sum these numbers. (See https://math.stackexchange.com/q/3230011.) There are 21,035 such boxes.

Special thanks to Math Stack Exchange user wavex for his help with this problem! He provided the following R script for simulating it:

total = 0
for (y in 1:10000000){
  x = rmultinom(1,12,c(1/14,1/14,1/14,1/14,1/14,
                       1/14,1/14,1/14,1/14,1/14,1/14,1/14,1/14,1/14))
  x <- c(x)

  count = 14
  for(i in x){
    if(i==0){
     count = count -1
   }   

  }
 if(count <= 3){total = total + 1}
}
sprintf("%.20f", total / 10000000)

In his run of the code this event occured only 27 out of 10,000,000 times. A rare event indeed!

About this document …

This document was generated using the LaTeX2HTML translator Version 2019 (Released January 1, 2019)

The command line arguments were:
latex2html -split 0 -nonavigation -lcase_tags -image_type gif Example6Solution.tex

The translation was initiated on 2019-05-20

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

$X_{k}$

$p_1,\ldots,p_{k}$

$x_1,\ldots,x_{k} \in [n]\cup \left\{0\right\}$

$x_1+\ldots+x_{k}=n$

http://ntguardian.wordpress.com/?p=3579

Extensions

Introducing Rank Data Analysis with Arkham Horror Data

ntguardian Feb 25, 2019

Introduction Last week I analyzed player rankings of the Arkham Horror LCG classes. This week I explain what I did in the data analysis. As I mentioned, this is the first time that I attempted inference with rank data, and I discovered how rich the subject is. A lot of the tools for the analysis…Read more Introducing Rank Data Analysis with Arkham Horror Data

Show full content

Introduction

Last week I analyzed player rankings of the Arkham Horror LCG classes. This week I explain what I did in the data analysis. As I mentioned, this is the first time that I attempted inference with rank data, and I discovered how rich the subject is. A lot of the tools for the analysis I had to write myself, so you now have the code I didn’t have access to when I started.

This post will not discuss rank data modelling. Instead, it will cover what one may consider basic statistics and inference. The primary reference for what I did here is Analyzing and Modeling Rank Data, by John Marden. So far I’ve enjoyed his book and I may even buy a personal copy.

What is Rank Data?

Suppose we have $m$ objects we ask our study participants (also known as “judges”) to rank. For example, suppose we asked people to rank apples, oranges, and bananas. What we then get is a prioritization of these objects according to our judges. This could come in the form

$(3, 1, 2)$

and we interpret the number in the $k^{th}$ position as the ranking of the $k^{th}$ item. In this case, if the tuple is in the order of apples, oranges, and bananas, then oranges recieved the highest ranking, bananas the second-highest, and apples the last position.

An alternative view of this data may be

$(\text{oranges}, \text{bananas}, \text{apples})$

where the items are arranged in order of preference. This form of describing a ranking has its uses, but we will consider only the first form in this introduction.

Ranking data has the following distinguishing characteristics from other data: first, the data is ordinal. All that matters is the order in which items were placed, not necessarily the numbers themselves. We could insist on writing rank data as $(100, 1, 50)$ and the information content would not have changed. (But of course we would never do this.) Second, every item gets a ranking. This excludes “Choose your top 3 out of 50”-type questions, since not every item would receive a ranking (this is called an incomplete ranking and requires special care; I won’t discuss this type of data in this article). Finally, every item’s ranking is distinct; no ties are allowed.

Thus ranking data is distinct even from just ordinal data since data comes from judges in the form of a tuple, not just a single ordinal value. (Thus we would not consider, say, Likert scale responses as automatically being an instance of rank data.) An ideal method for rank data would account for this unique nature and exploit its features.

Basic Descriptive Statistics

From this point on I will be working with the Arkham Horror player class ranking data. I made the Timestamp column nonsense to anonymize the data. You can download a CSV file of the data from here, then convert it to a .Rda file with the script below (which is intended to be run as an executable):

#!/usr/bin/Rscript
################################################################################
# ArkhamHorrorClassPreferenceSurveyDataCleaner.R
################################################################################
# 2019-02-10
# Curtis Miller
################################################################################
# This file takes a CSV file read in and cleans it for later analysis, saving
# the resulting data in a .Rda file.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(input, output = "out.Rda", help = FALSE) {
  input_file <- read.csv(input)
  
  input_columns <- names(input_file)
  arkham_classes <- c("Survivor", "Guardian", "Rogue", "Seeker", "Mystic")
  for (cl in arkham_classes) {
    names(input_file)[grepl(cl, input_columns)] <- cl
  }
  names(input_file)[grepl("Reason", input_columns)] <- "Reason"

  input_file$Reason <- as.character(input_file$Reason)
  input_file$Timestamp <- as.POSIXct(input_file$Timestamp,
                                     format = "%m/%d/%Y %H:%M:%S", tz = "MST")
  for (cl in arkham_classes) {
    input_file[[cl]] <- substr(as.character(input_file[[cl]]), 1, 1)
    input_file[[cl]] <- as.numeric(input_file[[cl]])
  }

  survey_data <- input_file
  save(survey_data, file = output)
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
        description = paste("Converts a CSV file with survey data ranking",
                            "Arkham Horror classes into a .Rda file with a",
                            "well-formated data.frame"),
        option_list = list(
          make_option(c("--input", "-i"), type = "character",
                      help = "Name of input file"),
          make_option(c("--output", "-o"), type = "character",
                      default = "out.Rda",
                      help = "Name of output file to create")
        )
      ))

  do.call(main, cl_args)
}

(The script with all the code for the actual analysis appears at the end of this article.)

The first statistic we will compute for this data is the marginals matrix. This matrix simply records the proportion of times an item received a particular ranking in the sample. If we want to get mathematical, if $\omega$ is a ranking tuple and $\omega_i$ is the ranking of the $i^{th}$ option and the sample is $\omega^{(1)}, \ldots, \omega^{(n)}$ , then the $ij^{th}$ entry of the marginal’s matrix is

$\hat{M}_{ij} = \frac{1}{n} \sum_{l = 1}^n I_{\{\omega^{(l)}_i = j\}}$

where the function $\latex I_{{A}}$ is 1 if $A$ is true and 0 otherwise. (Thus the sum above simply counts how many times $\omega^{(l)}_i$ was equal to $j$ .)

The marginals matrix for the Arkham Horror data is given below

MARGINALS
---------
             1     2     3     4     5
Guardian 18.29 20.43 26.84 19.71 14.73
Mystic   19.71 18.29 17.81 20.90 23.28
Rogue    19.24 14.73 20.67 21.38 23.99
Seeker   28.03 25.18 17.10 18.53 11.16
Survivor 14.73 21.38 17.58 19.48 26.84

Below is a visual representation of the marginals matrix.

Marginals

From the marginals matrix you could compute the vector representing the “mean” ranking of the data. For instance, the mean ranking of the Guardian class is the sum of the ranking numbers (column headers) times their respective proportions (in the Guardian row); here, that’s about 2.9 for Guardians. Repeat this process for every other group to get the mean ranking vector; here, the mean rank vector is $(2.92, 3.10, 3.16, 2.60, 3.22)$ (keeping the ordering of the classes suggested by the rows above, which is alphabetical order; this will always be the ordering I use unless otherwise stated.) Of couse this is not a ranking vectors; rankings are integers. The corresponding ranking vector would be to rank the means themselves; this gives a ranking vector of $(2, 3, 4, 1, 5)$ .

I don’t like inference using the mean ranking vector. As mentioned above, this data is ordinal; that means the magnitude of the numbers themselves should not matter. We could replace 1, 2, 3, 4, 5 with 1, 10, 100, 1000, 10000 and the data would mean the same thing. That is not the case if you’re using the mean rank unless you first apply a transformation to the rankings. In short, I don’t think that the mean ranking vector appreciates the nature of the data well. And since the marginals matrix is closely tied to this notion of “mean”, I don’t think the matrix is fully informative.

Another matrix providing descriptive statistics is the pairs matrix. The matrix records the proportion of respondents who preferred one option to the other (specifically, the row option to the column option). Mathematically, the $ij^{th}$ entry of the pairs matrix is

The pairs matrix for the Arkham Horror data is below:

PAIRS
-----
         Guardian Mystic Rogue Seeker Survivor
Guardian     0.00  54.16 55.34  42.52    55.82
Mystic      45.84   0.00 51.07  39.90    53.44
Rogue       44.66  48.93  0.00  38.72    51.54
Seeker      57.48  60.10 61.28   0.00    61.52
Survivor    44.18  46.56 48.46  38.48     0.00

First, notice that the diagonal entries are all zero; this will always be the case. Second, the pairs matrix is essentially completely determined by the entries above the diagonal of the matrix. Other forms of interence use these upper-diagonal entries and don’t use the lower-diagonal entries since they give no new information. The number of upper-diagonal entries is $\bar{m} = m(m - 1)/2$ , which is the number of ways to pick pairs of $m$ classes.

The pairs matrix for the Arkham Horror data is visualized below.

Pairs

With the pairs matrix, crossing above or below 50% of the sample being in the bin is a significant event; it indicates which classes are preferred to the other. In fact, by counting how many times this threshold was crossed, we can estimate that the overall favorite class is the Seeker class, followed by Guardians, then Mystics, then Rogues, and finally Survivors. This is another estimate of the “central”, “modal”, or “consensus” ranking. (This agrees with the “mean” ranking, but that’s not always going to be the case; the metrics can disagree with each other.)

While I did not like the marginals matrix I do like the pairs matrix; I feel as if it accounts for the features of rank data I want any measures or inference to take account of. It turns out that the pairs matrix is also related to my favorite distance metric for analyzing rank data.

Distance Metrics for Rank Data

A distance metric is a generalized notion of distance, or “how far away” two objects $x$ and $y$ are. In order for a function $d(x, y)$ to be a metric, it must have the following properties:

$d(x, y) \geq 0$ for all $x$ and $y$ .
$d(x, y) = 0$ if and only if $x = y$ .
$d(x, y) = d(y, x)$ for all $x$ and $y$ .
$d(x, y) \leq d(x, z) + d(y, z)$ for all $x, y, z$ (the “triangle
inequality”)

The notion of distance you use in every-day life, the one taught in middle-school geometry and computed whenever you use a ruler, is known as Euclidean distance. It’s not the only notion of distance, though, and may not be the only distance function you use in real-life. For instance, Manhattan or taxi cab distance is the distance from one point to another when you can only make 90-degree turns and is the distance that makes the most sense when travelling in the city.

There are many distance metrics we could consider when working with rank data. The Spearman distance is the square of the Euclidean distance, while the footrule distance corresponds to the Manhattan distance. It turns out that the mean rank vector above minimizes the sum of Spearman distances. The distance metric I based my analysis on, though, was the Kendall distance. I like this distance metric since it is not connected to the mean and considers the distance between the rankings $(1, 2, 3, 4, 5)$ and $(5, 2, 3, 4, 1)$ to be greater than the distance between $(1, 2, 3, 4, 5)$ and $(2, 1, 3, 4, 5)$ (unlike, say, the Hamming distance, which gives the same distance in either case).

Kendall’s distance even has an interpretation. Suppose that two ranking tuples are seen as the ordering of books on a bookshelf. We want to go from one ordering of books to another ordering of books. The Kendall distance is how many times we would need to switch adjacent pairs of books (chosen well, so as not to waste time and energy) to go from one ordering to the other. Thus the Kendall distance between $(1, 2, 3, 4, 5)$ and $(2, 1, 3, 4, 5)$ is one; we only need to make one swap. The distance between $(1, 2, 3, 4, 5)$ and $(5, 2, 3, 4, 1)$ , in comparison, is seven, since we need to make seven swaps.

It also turns out that the Kendall distance is related to the pairs matrix. The average Kendall distance of the judges from any chosen ranking $\omega$ is

$\sum_{i, j: \omega_i > \omega_j} \hat{K}_{i,j}$

(There is a similar expression relating the Spearman distance to the marginal matrix.)

Central Ranking Estimator

Once we have a distance metric, we can define what the “best” estimate for the most central ranking is. The central ranking is the $\omega$ that minimizes

$\frac{1}{n}\sum_{l = 1}^{n} d(\omega^{(l)}, \omega)$

In other words, the most central ranking minimized the sum of distances of all the rankings in the data to that ranking.

Sometimes this ranking has already been determined. For instance, when using the Spearman distance, the central ranking emerges from the “mean” rankings. Otherwise, though, we may need to apply some search procedure to find this optimal ranking.

Since we’re working with rank data, though, it’s very tempting to not use any fancy optimization algorithms and simply compute the sum of distances for every possible ranking. This isn’t a bad idea at all if the number of items being ranked is relatively small. Here, since there are five items being ranked, the number of possible rankings is $5! = 120$ , which is not too big for a modern computer to handle. It may take some time for the exhaustive search approach to yield and answer, but the answer produced by exhaustive search comes with the reassurance that it does, in fact, minimize the sum of distances.

This is in fact what I did for estimating the central ranking when minimizing the sum of Kendall distances from said ranking. The resulting ranking, again, was Seeker/Guardian/Mystic/Rogue/Survivor (which agrees with what we determined just by looking at the pairs matrix; this likely is not a coincidence).

Statistical Inference

All of the above I consider falling into the category of descriptive statistics. It describes aspects of the sample without attempting to extrapolate to the rest of the population. With statistical inference we want to see what we can say about the population as a whole.

I should start by saying that the usual assumptions made in statistical inference are likely not satisfied by my sample. It was an opt-in sample; people chose to participate. That alone makes it a non-random sample. Additionally, only participants active on Facebook, Reddit, Twitter, Board Game Geek, and the Fantasy Flight forums were targeted by my advertising of the poll. Thus the Arkham players were likely those active on the Internet, likely at a particular time of day and day of the week (given how these websites try to push older content off the main page). They were likely young, male, and engaged enough in the game to be in the community (and unlikely to be a “casual” player). Thus the participants are likely to be more homogenous than the population of Arkham Horror players overall.

Just as a thought experiment, what would be a better study, one where we could feel confident in the inferential ability of our sample? Well, we would grab randomly selected people from the population (perhaps from pulling random names from the phone book), have them join our study, teach them how to play the game, make them play the game for many hours until they could form an educated opinion of the game (probably at least 100 hours), then ask them to rate the classes. This would be high-quality data and we could believe the data is reliable, but damn would it be expensive! No one at FFG would consider data of that quality worth the price, and frankly neither would I.

Having said that, while the sample I have is certainly flawed in how it was collected, I actually believe we can get good results from it. The opinions of the participants are likely educated ones, so we probably still have a good idea how the Arkham Horror classes compare to one another.

In rank data analysis there is a probability model called the uniform distribution that serves as a starting point for inference. Under the uniform distribution, every ranking vector is equally likely to be observed; in short, there is no preference among the judges among the choices. The marginals matrix should have all entries be $\frac{1}{m}$ , all off-diagonal entries of the pairs matrix should be $\frac{1}{2}$ , and any “central” ranking is meaningless since every ranking is equally likely to be seen. According to the uniform distribution, $P(\{\omega\}) = \frac{1}{m!}$ . If we cannot distinguish our data from data drawn from the uniform distribution, our work is done; we basically say there is no “common” ranking scheme and go about our day.

There are many tests for checking for the uniform distribution, and they are often based on the statistics we’ve already seen, such as the mean rank vector, the marginals matrix, and the pairs matrix. If $m$ is small enough relative to the sample size, we could even just base a test off of how frequently each particular ranking was seen. A test based off the latter could detect any form of non-uniformity in the data, while tests based off the marginals or pairs matrices or the mean vector cannot detect all forms of non-uniformity; that said, they often require much less data to be performed.

As mentioned, I like working with the pairs matrix/Kendall distance. The statistical test, though, involves a vector $\hat{\kappa}$ , which is the aforementioned upper triangle of the pairs matrix (excluding the diagonal entries which are always zero). (More specifically, $\hat{\kappa})$ is a vector containing the upper-diagonal entries of the pairs matrix laid out in row-major form.)

The test decides between

$H_0: \text{Data was drawn from uniform distribution}$

$H_A: H_0 \text{ is false}$

The test statistic is

$12n(\|\hat{\kappa} - \frac{1}{2} 1_{\bar{m}}\|^2 - \|\bar{y} - \frac{m + 1}{2} 1_m\|^2 / (m + 1))$

If the null hypothesis is true, then the test statistic, for large $n$ , a $\chi^2$ distribution with $\bar{m}$ degrees of freedom. (For the Arkham Horror classes case, $\bar{m} = 10$ .) Large test statistics are evidence against the null hypothesis, so $p$ -values are the area underneath the $\chi^2_{\bar{m}}$ curve to the right of the test statistic.

For our data set, the reported test statistic was 2309938376; not shockingly, the corresponding $p$ -value is near zero. So the data was not drawn from the uniform distribution. Arkham Horror players do have class preferences.

But what are plausible preferences players could have? We can answer this using a confidence interval. Specifically, we want to know what rankings are plausible, and thus what we want is a confidence set of rankings.

Finding a formula for a confidence set of the central ranking is extremely hard to do, but it’s not as hard to form one for one of the statistics we can compute from the rankings, then use the possible values of that statistic to find corresponding plausible central rankings. For example, once could find a confidence set for the mean ranking vector, then translate those mean rankings into ranking vectors (this is what Marden did in his book).

As I said before, I like the pairs matrix/Kendall distance in the rank data context, so I want to form a confidence set for $\kappa$ , the population equivalent of $\hat{\kappa}$ , the key entries of the pairs matrix. To do this, we cannot view the rank data the same way we did before; instead of seeing the $m$ -dimensional vector $(2, 1, 4, 3, 5)$ , we need to see the equivalent $\bar{m}$ -dimensional vector $(0, 1, 1, 1, 1, 1, 1, 0, 1, 1)$ that consists only of ones and zeros and records the pair-wise relationships among the ranks, rather than the ranks themselves (the latter vector literally says that item one is not ranked higher than item two, item one is ranked higher than item three, same for four, same for five, then that item two is ranked higher than item three, same for four, same for five, and so on, finally saying in its last entry that item four is ranked higher than item five).

We first compute $\hat{\kappa}$ by taking the means of these vectors. Then we compute the sample covariance matrix of the vectors; call it $\hat{\Sigma}$ . Then a $100(1 - \alpha)$ % confidence set for the true $\kappa$ , appropriate for large sample sizes, is:

$\{\kappa: n(\hat{\kappa} - \kappa)^{T} \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa) \leq \chi^2_{\bar{m}, 1 - \alpha}\}$

where $\chi^2_{k, 1 - \alpha}$ is the $(1 - \alpha)^{th}$ percentile of the $\chi^2$ distribution with $k$ degrees of freedom.

The region I’ve just described is a $\bar{m}$ -dimensional ellipsoid, a football-like shape that lives in a space with (probably) more than three dimensions. It sounds daunting, but one can still figure out what rankings are plausible once this region is computed. The trick is to work with each of the coordinates of the vector $\kappa$ and determine whether there is a $\kappa$ in the ellipsoid where that coordinate is 1/2. If the answer is no, then the value of that coordinate, for all $\kappa$ in the ellipsoid, is either always above or always below 1/2. You can then look to $\hat{\kappa}$ (which is in the dead center of the ellipsoid) to determine which is the case.

What’s the significance of this? Let’s say that you listed all possible rankings in a table. Let’s suppose you did this procedure for the coordinate of $\kappa$ corresponding to the Seeker/Rogue pair. If you determine that this coordinate is not 1/2 and that all $\kappa$ in the ellipsoid ranks Seekers above Rogues, then you would take your list of rankings and remove all rankings that Rogues before Seekers, since these rankings are not in the confidence set.

If you do find a $\latex \kappa$ in the ellipsoid where the selected coordinate is 1/2, then you would not eliminate any rows in your list of rankings since you know that your confidence set must include some rankings that rank the two items one way and some rankings where the items are ranked the opposite way.

Repeat this procedure with every coordinate of $\kappa$ —that is, every possible pairing of choices—and you then have a confidence set for central rankings.

Determining whether there is a $\kappa$ vector in the ellipsoid with a select coordinate valued at 1/2 can be done via optimization. That is, find a $\latex \kappa$ that minimizes $n(\hat{\kappa} - \kappa)^{T} \hat{ \Sigma}^{-1} (\hat{\kappa} - \kappa)$ subject to the constraint that $\kappa_j = 1/2$ . You don’t even need fancy minimization algorithms for doing this; the minimum can, in principle, be computed analytically with multivariate calculus. After you found a minimizing $\kappa$ , determine what the value of $n(\hat{\kappa} - \kappa)^{T} \hat{\Sigma}^{-1} (\hat{\kappa} - \kappa)$ is at that $\kappa$ . If it is less than $\chi^2_{\bar{m}, 1 - \alpha}$ , then you found a $\kappa$ in the ellipsoid; otherwise, you know there is no such $\kappa$ .

This was the procedure I used on the Arkham Horror class ranking data. The 95% confidence interval so computed determined that Seekers were ranked higher than Rogues and Survivors. That means that Seekers cannot have a ranking worse than 3 and Rogues and Survivors could not have rankings better than 2. Any ranking consistent with these constraints, though, is a plausible population central ranking. In fact, this procedure suggested that all the rankings below are plausible central population rankings:

   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         1      4     3      2        5
6         1      4     5      2        3
7         1      5     3      2        4
8         1      5     4      2        3
9         2      1     4      3        5
10        2      1     5      3        4
11        2      3     4      1        5
12        2      3     5      1        4
13        2      4     3      1        5
14        2      4     5      1        3
15        2      5     3      1        4
16        2      5     4      1        3
17        3      1     4      2        5
18        3      1     5      2        4
19        3      2     4      1        5
20        3      2     5      1        4
21        3      4     2      1        5
22        3      4     5      1        2
23        3      5     2      1        4
24        3      5     4      1        2
25        4      1     3      2        5
26        4      1     5      2        3
27        4      2     3      1        5
28        4      2     5      1        3
29        4      3     2      1        5
30        4      3     5      1        2
31        4      5     2      1        3
32        4      5     3      1        2
33        5      1     3      2        4
34        5      1     4      2        3
35        5      2     3      1        4
36        5      2     4      1        3
37        5      3     2      1        4
38        5      3     4      1        2
39        5      4     2      1        3
40        5      4     3      1        2

The confidence interval, by design, is much less bold than just an estimate of the most central ranking. Our interval suggests that there’s a lot we don’t know about what the central ranking is; we only know that whatever it is, it ranks Seekers above Rogues and Survivors.

The confidence set here is at least conservative in that it could perhaps contain too many candidate central rankings. I don’t know for sure whether we could improve on the set and eliminate more ranks from the plausible set by querying more from the confidence set for $\kappa$ . Perhaps there are certain combinations that cannot exist, like excluding rankings that give both Seekers and Guardians a high ranking at the same time. If I were a betting man, though, I’d bet that the confidence set found with this procedure could be improved, in that not every vector in the resulting set corresponds with a $\kappa$ in the original ellipsoidal confidence set. Improving this set, though, would take a lot of work as one would have to consider multiple coordinates of potential $\kappa$ simultaneously, then find a rule for eliminating ranking vectors based on the results.

Clustering

Matt Newman, the lead designer of Arkham Horror: The Card Game, does not believe all players are the same. Specifically, he believes that there are player types that determine how they like to play. In statistics we might say that Matt Newman believes that there are clusters of players within any sufficiently large and well-selected sample of players. This suggests we may want to perform cluster analysis to find these sub-populations.

If you haven’t heard the term before, clustering is the practice of finding “similar” data points, grouping them together, and identifying them as belonging to some sub-population for which no label was directly observed. It’s not unreasonable to believe that these sub-populations exist and so I sought to do clustering myself.

There are many ways to cluster. Prof. Malden said that a clustering of rank data into $L$ clusters should minimize the sum of the distances of each observation from their assigned cluster’s centers. However, he did not suggest a good algorithm for finding these clusters. He did suggest that for small samples, small $m$ and for a small number of clusters, we could exhaustively search for optimal clusters, an impractical idea.

I initially attempted a k-means-type algorithm for finding good clusters, one that used the Kendall distance rather than the Euclidean distance, but unfortunately I could not get the algorithm to give good results. I don’t know whether I have errors in my code (listed below) or whether the algorithm just doesn’t work for Kendall distances, but it didn’t work; in fact, it would take a good clustering and make it worse! I eventually abandoned my home-brewed k-centers algorithm (and the hours of work that went into it) and just used spectral clustering.

Spectral clustering isn’t easily described, but the idea of spectral clustering is to find groups of data that a random walker, walking from point to point along a weighted graph, would spend a long time in before moving to another group. (That’s the best simplification I can make; the rest is linear algebra.) In order to do spectral clustering, one must have a notion of “similarity” of data points. “Similarity” roughly means the opposite of “distance”; in fact, if you have a distance metric (and we do here), you can find a similarity measure by subtracting all distances from the maximum distance between any two objects. Similarity measures are not as strictly defined as distance metrics; any function that gives two “similar” items a high score and two “dissimilar” items a low score could be considered a similarity function.

Spectral clustering takes a matrix of similarity measures, computed for each pair of observations, and spits out cluster assignments. But in addition to the similarity measure, we need to decide how many clusters to find.

I find determining the “best” number of clusters to find the hardest part of clustering. We could have only one cluster, containing all our data; this is what we start with. We could also assign each data point to its own cluster; our aforementioned measure of cluster quality would then be zero, which would be great if it weren’t for the fact that our clusters mean nothing!

One approach people use for determining how many clusters to pick is the so-called elbow method. You take a plot of, say, Malden’s metric, compared against the number of clusters, and see if you can spot the “elbow” in the plot. The elbow corresponds to the “best” number of clusters.

Here’s the corresponding plot for the dataset here:

Cluster scores

If you’re unsure where the “elbow” of the plot is, that’s okay; I’m not sure either. My best guess is that it’s at five clusters; hence my choice of five clusters.

Another plot that people use is the silhouette plot, explained quite well by the scikit-learn documentation. The silhouette plot for the clustering found by spectral clustering is shown below:

Silhouette plot

Is this a good silhouette plot? I’m not sure. It’s not the worst silhouette plot I saw for this data set but it’s not as good as examples shown in the scikit-learn documentation. There are observations that appear to be in the wrong cluster according to the silhouette analysis. So… inconclusive?

I also computed the Dunn index of the clusters. I never got a value greater than 0.125. All together, these methods lead me to suspect that there are no meaningful clusters in this data set, at least none that can be found with this approach.

But people like cluster analysis, so if you’re one of those folks, I have results for you.

CLUSTERING
----------
Counts: Cluster
  1   2   3   4   5 
130  83  80  66  62 

Centers:
  Guardian Mystic Rogue Seeker Survivor
1        3      2     4      1        5
2        3      5     4      1        2
3        3      4     1      2        5
4        1      5     3      4        2
5        5      1     4      3        2

Score: 881 

CLUSTER CONFIDENCE INTERVALS
----------------------------

Cluster 1:

With 95% confidence: 
Guardian is better than Rogue
Guardian is better than Survivor
Mystic is better than Rogue
Mystic is better than Survivor
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         2      1     4      3        5
6         2      1     5      3        4
7         2      3     4      1        5
8         2      3     5      1        4
9         3      1     4      2        5
10        3      1     5      2        4
11        3      2     4      1        5
12        3      2     5      1        4

Cluster 2:

With 95% confidence: 
Guardian is better than Mystic
Guardian is better than Rogue
Seeker is better than Guardian
Seeker is better than Mystic
Survivor is better than Mystic
Seeker is better than Rogue
Survivor is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
  Guardian Mystic Rogue Seeker Survivor
1        2      4     5      1        3
2        2      5     4      1        3
3        3      4     5      1        2
4        3      5     4      1        2

Cluster 3:

With 95% confidence: 
Rogue is better than Guardian
Rogue is better than Mystic
Rogue is better than Seeker
Rogue is better than Survivor
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         2      3     1      4        5
2         2      4     1      3        5
3         2      5     1      3        4
4         3      2     1      4        5
5         3      4     1      2        5
6         3      5     1      2        4
7         4      2     1      3        5
8         4      3     1      2        5
9         4      5     1      2        3
10        5      2     1      3        4
11        5      3     1      2        4
12        5      4     1      2        3

Cluster 4:

With 95% confidence: 
Guardian is better than Mystic
Guardian is better than Seeker
Rogue is better than Mystic
Survivor is better than Mystic
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      4     2      5        3
2         1      4     3      5        2
3         1      5     2      4        3
4         1      5     3      4        2
5         1      5     4      3        2
6         2      4     1      5        3
7         2      4     3      5        1
8         2      5     1      4        3
9         2      5     3      4        1
10        2      5     4      3        1
11        3      4     1      5        2
12        3      4     2      5        1
13        3      5     1      4        2
14        3      5     2      4        1

Cluster 5:

With 95% confidence: 
Mystic is better than Guardian
Survivor is better than Guardian
Mystic is better than Rogue
Mystic is better than Seeker
Survivor is better than Rogue
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         3      1     4      5        2
2         3      1     5      4        2
3         3      2     4      5        1
4         3      2     5      4        1
5         4      1     3      5        2
6         4      1     5      3        2
7         4      2     3      5        1
8         4      2     5      3        1
9         5      1     3      4        2
10        5      1     4      3        2
11        5      2     3      4        1
12        5      2     4      3        1

When computing confidence sets for clusters I ran into an interesting problem: what if, say, you never see Seekers ranked below Guardians? This will cause one of the entries of $\hat{\kappa}$ to be either 0 or 1, and there is no “variance” in its value; it’s always the same. This will cause the covariance matrix to be non-invertible since it has rows/columns that are zero. The solution to this is to eliminate those rows and work only with the non-constant entries of $\hat{\kappa}$ . That said, I still treat the entries removed as if they were “statisticall significant” results and remove rankings from our confidence set that are inconsistent with what we saw in the data. In short, if Seekers are never ranked below Guardians, remove all rankings in the confidence set that rank Seekers below Guardians.

One usually isn’t satisfied with just a clustering; it would be nice to determine what a clustering signifies about those who are in the cluster. For instance, what type of player gets assigned to Cluster 1? I feel that inspecting the data in a more thoughtful and manual way can give a sense to what characteristic individuals assigned to a cluster share. For instance, I read the comments submitted by poll participants to hypothesize what types of players were being assigned to particular clusters. You can read these comments at the bottom of this article, after the code section.

Code

All source code used to do the rank analysis done here is listed below, in a .R file intended to be run as an executable from a command line. (I created and ran it on a Linux system.)

Several packages had useful functions specific for this type of analysis, such as pmr (meant for modelling rand data) and rankdist (which had a lot of tools for working with the Kendall distance). The confidence interval, central ranking estimator, and hypothesis testing tools, though, I wrote myself, and they may not exist elsewhere.

I at least feel that the script itself is well-documented and I no longer need to explain it. But I will warn others that it was tailored to my problem, and the methods employed may not work well with larger sample sizes or when more items need to be ranked.

Conclusion

This is only the tip of the iceberg for rank data analysis. We have not even touched on modelling for rank data, which can provide even richer inference. If you’re interested, I’ll refer you again to Malden’s book.

I enjoyed this analysis so much I asked a Reddit question about where else I could conduct surveys (while at the same time still being statistically sound) because I’d love to do it again. I feel like there’s much to learn from rank data; it has great potential. Hopefully this article sparked your interest too.

R Script for Analysis

#!/usr/bin/Rscript
################################################################################
# ArkhamHorrorClassPreferenceAnalysis.R
################################################################################
# 2019-02-10
# Curtis Miller
################################################################################
# Analyze Arkham Horror LCG class preference survey data.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# CONSTANTS
################################################################################

CLASS_COUNT <- 5
CLASSES <- c("Guardian", "Mystic", "Rogue", "Seeker", "Survivor")
CLASS_COLORS <- c("Guardian" = "#00628C",
                  "Mystic" = "#44397D",
                  "Rogue" = "#17623B",
                  "Seeker" = "#B87D37",
                  "Survivor" = "#AA242D")

################################################################################
# FUNCTIONS
################################################################################

`%s%` <- function(x, y) {paste(x, y)}
`%s0%` <- function(x, y) {paste0(x, y)}

#' Sum of Kendall Distances
#'
#' Given a ranking vector and a matrix of rankings, compute the sum of Kendall
#' distances.
#'
#' @param r The ranking vector
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param weight Optional vector weighting each row of \code{mat} in the sum,
#'               perhaps representing how many times that ranking is repeated
#' @return The (weighted) sum of the Kendall distances
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' skd(c(2, 1, 3), mat)
skd <- function(r, mat, weight = 1) {
  dr <- partial(DistancePair, r2 = r)

  sum(apply(mat, 1, dr) * weight)
}

#' Least Sum of Kendall Distances Estimator
#'
#' Estimates the "central" ranking by minimizing the sum of Kendall distances,
#' via exhaustive search.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param weight Optional vector weighting each row of \code{mat} in the sum,
#'                perhaps representing how many times that ranking is repeated
#' @return Ranking vector that minimizes the (weighted) sum of rankings
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' lskd_estimator(mat)
lskd_estimator <- function(mat, weight = NULL) {
  if (is.null(weight)) {
    reduced <- rank_vec_count(mat)
    mat <- reduced$mat
    weight <- reduced$count
  }

  skdm <- partial(skd, mat = mat, weight = weight)
  m <- max(mat)
  permutation_mat <- permutations(m, m)

  sums <- apply(permutation_mat, 1, skdm)
  permutation_mat[which.min(sums),]
}

#' Identify Ranking With Center
#'
#' Find the index of the center closest to a ranking vector.
#'
#' @param r The ranking vector
#' @param mat The matrix of rankings, with each row having its own ranking
#' @return Index of row that is closest to \code{r}
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' close_center(c(2, 1, 3), mat)
close_center <- function(r, mat) {
  dr <- partial(DistancePair, r2 = r)

  which.min(apply(mat, 1, dr))
}

#' Simplify Rank Matrix To Unique Rows
#'
#' Given a matrix with rows representing rankings, this function reduced the
#' matrix to rows of only unique rankings and also counts how many times a
#' ranking appeared.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @return A list with entries \code{"mat"} and \code{"count"}, with
#'         \code{"mat"} being a matrix now with unique rankings and
#'         \code{"count"} being a vector of times each row in new matrix
#'         appeared in the old matrix
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' rank_vec_count(mat)
rank_vec_count <- function(mat) {
  old_col_names <- colnames(mat)
  old_row_names <- rownames(mat)

  res_df <- aggregate(list(numdup = rep(1, times = nrow(mat))),
                      as.data.frame(mat), length)

  count <- res_df$numdup
  new_mat <- res_df[1:ncol(mat)]
  colnames(new_mat) <- old_col_names
  rownames(new_mat) <- old_row_names

  list("mat" = as.matrix(new_mat), "count" = count)
}

#' Find \eqn{k} Ranking Clusters
#'
#' Estimate \eqn{k} clusters of rankings.
#'
#' The algorithm to find the ranking clusters resembles the \eqn{k}-means++
#' algorithm except that the distance metric is the Kendall distance.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param k The number of clusters to find
#' @param max_iter The maximum number of iterations for algorithm
#' @param tol The numerical tolerance at which to end the algorithm if met
#' @return A list containing the central rankings of each cluster (in
#'         \code{"centers"}) and a vector with integers representing cluster
#'         assignments
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster(mat, 2)
rank_cluster <- function(mat, k, init_type = c("spectral", "kmeans++"),
                         max_iter = 100, tol = 1e-4) {
  simplified_mat <- rank_vec_count(mat)
  mat <- simplified_mat$mat
  count <- simplified_mat$count
  init_type <- init_type[1]

  if (init_type == "kmeans++") {
    centers <- rank_cluster_center_init(mat, k)
  } else if (init_type == "spectral") {
    centers <- rank_cluster_spectral(mat, k)$centers
  } else {
    stop("Don't know init_type" %s% init_type)
  }
  old_centers <- centers

  cc_centers <- partial(close_center, mat = centers)
  clusters <- apply(mat, 1, cc_centers)

  for (iter in 1:max_iter) {
    centers <- find_cluster_centers(mat, clusters, count)

    stopifnot(all(dim(centers) == dim(old_centers)))
    cc_centers <- partial(close_center, mat = centers)
    clusters <- apply(mat, 1, cc_centers)
    if (center_distance_change(centers, old_centers) < tol) {
      break
    } else {
      old_centers <- centers
    }
  }

  if (iter == max_iter) {warning("Maximum iterations reached")}

  colnames(centers) <- colnames(mat)

  list("centers" = centers, "clusters" = rep(clusters, times = count))
}

#' Find the Distance Between Two Ranking Matrices
#'
#' Find the distance between two ranking matrices by summing the distance
#' between each row of the respective matrices.
#'
#' @param mat1 First matrix of ranks
#' @param mat2 Second matrix of ranks
#' @return The sum of distances between rows of \code{mat1} and \code{mat2}
#' @examples
#' mat <- rbind(1:3,
#'              3:1)
#' center_distance_change(mat, mat)
center_distance_change <- function(mat1, mat2) {
  if (any(dim(mat1) != dim(mat2))) {stop("Dimensions of matrices don't match")}

  sum(sapply(1:nrow(mat1), function(i) {DistancePair(mat1[i, ], mat2[i, ])}))
}

#' Initialize Cluster Centers
#'
#' Find initial cluster centers as prescribed by the \eqn{k}-means++ algorithm.
#'
#' @param mat The matrix of rankings, with each row having its own ranking
#' @param k The number of clusters to find
#' @return A matrix containing cluster centers.
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster_center_init(mat, 2)
rank_cluster_center_init <- function(mat, k) {
  n <- nrow(mat)
  center <- mat[sample(1:n, 1), ]
  centers_mat <- rbind(center)

  for (i in 2:k) {
    min_distances <- sapply(1:n, function(l) {
      min(sapply(1:(i - 1), function(j) {
        DistancePair(mat[l, ], centers_mat[j, ])
      }))
    })
    center <- mat[sample(1:n, 1, prob = min_distances/sum(min_distances)), ]
    centers_mat <- rbind(centers_mat, center)
  }

  rownames(centers_mat) <- NULL
  colnames(centers_mat) <- colnames(mat)
  centers_mat
}

#' Evaluation Metric for Clustering Quality
#'
#' Evaluates a clustering's quality by summing the distance of each observation
#' to its assigned cluster center.
#'
#' @param mat Matrix of rankings (in the rows); the data
#' @param centers Matrix of rankings (in the rows) representing the centers of
#'                the clusters
#' @param clusters Vector of indices corresponding to cluster assignments (the
#'                 rows of the \code{clusters} matrix)
#' @return Score of the clustering
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' centers <- rbind(1:3, 3:1)
#' clusters <- c(1, 1, 2, 2)
#' clustering_score(mat, centers, clusters)
clustering_score <- function(mat, centers, clusters) {
  sum(sapply(1:nrow(centers), function(i) {
               center <- centers[i, ]
               submat <- mat[which(clusters == i), ]
               skd(center, submat)
             }))
}

#' Clustering with Restarts
#'
#' Clusters multiple times and returns the clustering with the lowest clustering
#' score
#'
#' @param ... Parameters to pass to \code{\link{rank_cluster}}
#' @param restarts Number of restarts
#' @return A list containing the central rankings of each cluster (in
#'         \code{"centers"}) and a vector with integers representing cluster
#'         assignments
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster_restarts(mat, 2, 5)
rank_cluster_restarts <- function(mat, ..., restarts = 10) {
  best_score <- Inf
  rank_cluster_args <- list(...)
  rank_cluster_args$mat <- mat
  for (i in 1:restarts) {
    new_cluster_scheme <- do.call(rank_cluster, rank_cluster_args)
    score <- clustering_score(mat, new_cluster_scheme$centers,
                              new_cluster_scheme$clusters)
    if (score < best_score) {
      best_score <- score
      best_scheme <- new_cluster_scheme
    }
  }

  return(best_scheme)
}

#' Given Clusters, Find Centers
#'
#' Given a collection of clusters, find centers for the clusters.
#'
#' @param mat Matrix of rankings (in rows)
#' @param clusters Vector containing integers identifying cluster assignments,
#'                 where the integers range from one to the number of clusters
#' @param weight Optional vector weighting each row of \code{mat} in the sum,
#'                perhaps representing how many times that ranking is repeated
#' @return Ranking vector that minimizes the (weighted) sum of rankings
#' @return A matrix of ranks representing cluster centers
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' find_cluster_centers(mat, c(1, 1, 2, 2))
find_cluster_centers <- function(mat, clusters, weight = NULL) {
  if (is.null(weight)) {
    weight <- rep(1, times = nrow(mat))
  }
  centers <- t(sapply(unique(clusters), function(i) {
                          submat <- mat[which(clusters == i), ]
                          subweight <- weight[which(clusters == i)]
                          lskd_estimator(submat, subweight)
                        }))
  colnames(centers) <- colnames(mat)
  centers
}

#' Cluster Rankings Via Spectral Clustering
#'
#' Obtain a clustering of rank data via spectral clustering.
#'
#' @param mat Matrix containing rank data
#' @param k Number of clusters to find
#' @return A list with entries: \code{"centers"}, the centers of the clusters;
#'         and \code{"clusters"}, a vector assigning rows to clusters.
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' rank_cluster_spectral(mat, 2)
rank_cluster_spectral <- function(mat, k = 2) {
  dist_mat <- DistanceMatrix(mat)
  sim_mat <- max(dist_mat) - dist_mat
  clusters <- spectralClustering(sim_mat, k)
  centers <- find_cluster_centers(mat, clusters)
  list("centers" = centers, "clusters" = clusters)
}

#' Compute the Test Statistic for Uniformity Based on the Pairs Matrix
#'
#' Compute a test for uniformity based on the estimated pairs matrix.
#'
#' Let \eqn{m} be the number of items ranked and \eqn{n} the size of the data
#' set. Let \eqn{\bar{k} = k(k - 1)/2} and \eqn{\bar{y}} the mean rank vector.
#' Let \eqn{\hat{K}^*} be the upper-triangular part of the estimated pairs
#' matrix (excluding the diagonal), laid out as a vector in row-major order.
#' Finally, let \eqn{1_k} be a vector of \eqn{k} ones. Then the test statistic
#' is
#'
#' \deqn{12n(\|\hat{K}^* - \frac{1}{2} 1_{\bar{m}}\|^2 - \|\bar{y} - \frac{m +
#' 1}{2} 1_m\|^2 / (m + 1))}
#'
#' Under the null hypothesis this statistic asympotically follow a \eqn{\chi^2}
#' distribution with \eqn{\bar{m}} degrees of freedom.
#'
#' @param mat The data matrix, with rankings in rows
#' @return The value of the test statistic
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' pairs_uniform_test_stat(mat)
pairs_uniform_test_stat <- function(mat) {
  desc_stat <- suppressMessages(destat(mat))
  mean_rank <- desc_stat$mean.rank
  pair <- desc_stat$pair

  m <- ncol(mat) - 1
  n <- nrow(mat)
  mbar <- choose(m, 2)
  K <- pair[upper.tri(pair, diag = FALSE)]
  meanK <- rep(1/2, times = mbar)
  cm <- rep((m + 1)/2, times = m)

  12 * n * (sum((K - meanK)^2) - sum((mean_rank - cm)^2)/(m + 1))
}

#' Compute Covariance Matrix of Pairs Matrix Upper Triangle
#'
#' Compute the covariance matrix of the pairs matrix estimator.
#'
#' @param mat Data matrix, with each ranking having its own row
#' @return The \eqn{m(m - 1)/2}-square matrix representing the covariance matrix
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' pairs_mat_cov(mat)
pairs_mat_cov <- function(mat) {
  n <- nrow(mat)
  m <- ncol(mat)
  pair <- kappa_est(mat)
  pair <- as.matrix(pair)
  
  # Transform data into a dataset of pair-wise rank comparisons
  if (m == 1) {
    return(0)
  }

  kappa_data <- sapply(2:m, function(j) {mat[, j] > mat[, 1]})
  for (i in 2:(m - 1)) {
    kappa_data <- cbind(kappa_data, sapply((i + 1):m, function(j) {
                          mat[, j] > mat[, i]
                        }))
  }

  kappa_data <- kappa_data + 0  # Converts to integers

  cov(kappa_data)
}

#' Estimate \eqn{\kappa} Vector
#'
#' Estimate the \eqn{\kappa} vector, which fully defines the pairs matrix.
#'
#' @param mat Data matrix, with each ranking having its own row
#' @return The \eqn{m(m - 1)/2}-dimensional vector
#' @examples
#' mat <- rbind(1:3,
#'              3:1,
#'              c(2, 1, 3),
#'              c(3, 1, 2))
#' kappa_est(mat)
kappa_est <- function(mat) {
  n <- nrow(mat)
  df <- as.data.frame(mat)
  df$n <- 1
  pair <- suppressMessages(destat(df))
  pair <- t(pair$pair)
  pair <- pair[lower.tri(pair, diag = FALSE)]/n
  pair
}

#' Get Plausible Rankings For Central Ranking Based on Kendall Distance
#'
#' Determine a set of plausible central rankings based on the Kendall distance.
#'
#' Let \eqn{\alpha} be one minus the confidence level, \eqn{m} the number of
#' options, \eqn{\bar{m} = m(m - 1)/2}, \eqn{\kappa} the vectorized
#' upper-triangle of the pairs matrix of the population, \eqn{\hat{\kappa}} the
#' sample estimate of \eqn{\kappa}, and \eqn{\hat{\Sigma}} the estimated
#' covariance matrix of \eqn{\hat{kappa}}. Then the approximate \eqn{100(1 -
#' \alpha)}% confidence interval for \eqn{\kappa} is
#'
#' \deqn{\kappa: (\hat{\kappa} - \kappa)^T \hat{\Sigma}^{-1} (\hat{kappa} -
#' \kappa) < \chi^2_{\bar{m}}}
#'
#' One we have such an interval the next task is to determine which ranking
#' vectors are consistent with plausible \eqn{\kappa}. To do this, the function
#' determines which choices could plausibly be tied according to the confidence
#' interval; that is, which entries of \eqn{\kappa} could plausibly be
#' \eqn{1/2}. Whenever this is rejected, there is a statistically significant
#' difference in the preference of the two choices; looking at \hat{\kappa} can
#' determine which of the two choices is favored. All ranking vectors that would
#' agree that disagree with that preference are eliminated from the space of
#' plausible central ranking vectors. The ranking vectors surviving at the end
#' of this process constitute the confidence interval.
#'
#' @param mat Matrix of rank data, each observation having its own row
#' @param conf_level Desired confidence level
#' @return A list with entries \code{"ranks"} holding the matrix of plausible
#'         rankings in the confidence interval and \code{"preference_string"}, a
#'         string enumerating which options are, with statistical significance,
#'         preferred over others
#' @examples
#' mat <- t(replicate(100, {sample(1:3)}))
#' kendall_rank_conf_interval(mat)
kendall_rank_conf_interval <- function(mat, conf_level = 0.95) {
  n <- nrow(mat)
  m <- max(mat)
  mbar <- choose(m, 2)
  kap <- kappa_est(mat)
  Sigma <- pairs_mat_cov(mat)
  crit_value <- qchisq(1 - conf_level, df = mbar, lower.tail = FALSE)

  # Find bad rows of Sigma, where the covariance is zero; that variable must be
  # constant
  const_vars <- which(colSums(Sigma^2) == 0)
  safe_vars <- which(colSums(Sigma^2) > 0)
  safe_kap <- kap[safe_vars]
  safe_Sigma <- Sigma[safe_vars, safe_vars]

  # Determine if hyperplanes where one coordinate is 1/2 intersect confidence
  # set
  b <- as.matrix(solve(safe_Sigma, safe_kap))
  a <- t(safe_kap) %*% b
  a <- a[1, 1]
  check_half <- partial(hei_check, x = 1/2, A = safe_Sigma, b = -2 * b,
                        d = crit_value/n - a, invert_A = TRUE)
  sig_diff_safe_vars <- !sapply(1:length(safe_vars), check_half)

  if (length(const_vars) > 0) {
    sig_diff <- rep(NA, times = mbar)
    sig_diff[safe_vars] <- sig_diff_safe_vars
    sig_diff[const_vars] <- TRUE
  } else {
    sig_diff <- sig_diff_safe_vars
  }

  idx_matrix <- matrix(0, nrow = m, ncol = m)
  idx_matrix[lower.tri(idx_matrix, diag = FALSE)] <- 1:mbar
  idx_matrix <- t(idx_matrix)
  rownames(idx_matrix) <- colnames(mat)
  colnames(idx_matrix) <- colnames(mat)

  # Remove rows of potential centers matrix to reflect confidence interval
  # results; also, record which groups seem to have significant difference in
  # ranking
  rank_string <- ""
  permutation_mat <- permutations(m, m)
  for (i in 1:(m - 1)) {
    for (j in (i + 1):m) {
      sig_diff_index <- idx_matrix[i, j]
      if (sig_diff[sig_diff_index]) {
        direction <- sign(kap[sig_diff_index] - 1/2)
        if (direction > 0) {
          # Row option (i) is preferred to column option (j)
          permutation_mat <- permutation_mat[permutation_mat[, i] <
            permutation_mat[, j], ]
          rank_string <- rank_string %s0% colnames(mat)[i] %s%
            "is better than" %s% colnames(mat)[j] %s0% '\n'
        } else if (direction < 0) {
          # Row option (i) is inferior to column option (j)
          permutation_mat <- permutation_mat[permutation_mat[, i] >
            permutation_mat[, j], ]
          rank_string <- rank_string %s0% colnames(mat)[j] %s%
            "is better than" %s% colnames(mat)[i] %s0% '\n'
        }
      }
    }
  }
  colnames(permutation_mat) <- colnames(mat)

  return(list("ranks" = permutation_mat, "preference_string" = rank_string))
}

#' Straight Hyperplane and Ellipse Intersection Test
#'
#' Test whether a hyperplane parallel to an axis intersects an ellipse.
#'
#' The ellipse is fully determined by the parameters \code{A}, \code{b}, and
#' \code{d}; in fact, the ellipse consists of all \eqn{x} such that
#'
#' \deqn{x^T A x + b^T x \leq d}
#'
#' \code{x} is the intercept of the hyperplane and \code{k} is the coordinate
#' that is fixed to the value \code{x} and thus determine along which axis the
#' hyperplane is parallel. A value of \code{TRUE} means that there is an
#' intersection, while \code{FALSE} means there is no intersection.
#'
#' @param x The fixed value of the hyperplane
#' @param k The coordinate fixed to \code{x}
#' @param A A \eqn{n \times n} matrix
#' @param b An \eqn{n}-dimensional vector
#' @param d A scalar representing the upper bound of the ellipse
#' @return \code{TRUE} or \code{FALSE} depending on whether the hyperplane
#'         intersects the ellipse or not
#' @examples
#' hei_check(1, 2, diag(3), rep(0, times = 3), 10)
hei_check <- function(x, k, A, b, d, invert_A = FALSE) {
  b <- as.matrix(b)
  n <- nrow(b)
  stopifnot(k >= 1 & k <= n)
  stopifnot(nrow(A) == ncol(A) & nrow(A) == n)
  stopifnot(all(eigen(A)$values > 0))

  all_but_k <- (1:n)[which(1:n != k)]

  s <- rep(0, times = n)
  s[k] <- x
  s <- as.matrix(s)
  if (invert_A) {
    tb <- as.matrix(solve(A, s))
  } else {
    tb <- A %*% s
  }
  td <- t(s) %*% tb + t(b) %*% s
  if (invert_A) {
    # XXX: curtis: NUMERICALLY BAD; FIX THIS -- Thu 14 Feb 2019 07:50:19 PM MST
    A <- solve(A)
  }
  tA <- A[all_but_k, all_but_k]

  tx <- -solve(tA, (b/2 + tb)[all_but_k, ])
  tx <- as.matrix(tx)

  val <- t(tx)%*% tA %*% tx + t((b + 2 * tb)[all_but_k]) %*% tx + td - d
  val <- val[1, 1]

  val <= 0
}

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(input, prefix = "", width = 6, height = 4, clusters = 5,
                 conflevel = 95, comments = "AHLCGClusterComments.txt",
                 detailed = FALSE, help = FALSE) {
  suppressPackageStartupMessages(library(pmr))
  suppressPackageStartupMessages(library(ggplot2))
  suppressPackageStartupMessages(library(reshape2))
  suppressPackageStartupMessages(library(dplyr))
  suppressPackageStartupMessages(library(rankdist))
  suppressPackageStartupMessages(library(gtools))
  suppressPackageStartupMessages(library(purrr))
  suppressPackageStartupMessages(library(anocva))

  load(input)
  n <- nrow(survey_data)

  rank_data <- survey_data[CLASSES]
  rank_data$n <- 1
  rank_mat <- as.matrix(survey_data[CLASSES])

  # Get basic descriptive statistics: mean ranks, marginals, pairs
  desc_stat <- suppressMessages(destat(rank_data))
  mean_rank <- desc_stat$mean.rank
  marginal <- desc_stat$mar
  pair <- desc_stat$pair

  names(mean_rank) <- CLASSES
  rownames(marginal) <- CLASSES
  colnames(marginal) <- 1:CLASS_COUNT
  rownames(pair) <- CLASSES
  colnames(pair) <- CLASSES

  # Compute "typical" distance based on least sum of Kendall distances
  best_rank <- lskd_estimator(rank_mat)
  names(best_rank) <- CLASSES

  # Hypothesis Testing for Uniformity
  statistic <- pairs_uniform_test_stat(rank_data)

  # Confidence Interval
  ci <- kendall_rank_conf_interval(rank_mat, conf_level = conflevel / 100)

  # Cluster data
  rank_clustering <- rank_cluster_spectral(rank_mat, k = clusters)
  centers <- rank_clustering$centers
  Cluster <- rank_clustering$clusters  # Naming convention broke for printing
  rownames(centers) <- 1:nrow(centers)

  # Plotting
  marginal_plot <- ggplot(
      melt(100 * marginal / n, varnames = c("Class", "Rank"),
           value.name = "Percent"),
      aes(fill = Class, x = Class, y = Percent, group = Rank)) +
    geom_bar(position = "dodge", stat = "identity") +
    scale_fill_manual(values = CLASS_COLORS) +
    labs(title = "Class Rankings") +
    theme_bw()
  ggsave(prefix %s0% "marginal_plot.png", plot = marginal_plot,
         width = width, height = height, units = "in", dpi = 300)

  pair_plot <- ggplot(
      melt(100 * pair / n, varnames = c("Class", "Opposite"),
           value.name = "Percent") %>% filter(Percent > 0),
      aes(fill = Opposite, x = Class, y = Percent)) +
    geom_bar(position = "dodge", stat = "identity") +
    geom_hline(yintercept = 50, linetype = 2, color = "red") +
    scale_fill_manual(values = CLASS_COLORS) +
    labs(title = "Class Ranking Comparison") +
    theme_bw()
  ggsave(prefix %s0% "pair_plot.png", plot = pair_plot, width = width,
         height = height, units = "in", dpi = 300)

  # Place cluster comments in file
  comment_string <- ""
  for (i in 1:clusters) {
    comment_string <- comment_string %s0%
      "\n\nCLUSTER" %s% i %s0% "\n------------\n\n" %s0%
      paste(survey_data$Reason[survey_data$Reason != "" & Cluster == i],
            collapse = "\n\n-*-\n\n")
  }
  cat(comment_string, file = comments)

  # Printing
  cat("\nMEAN RANK\n---------\n")
  print(round(mean_rank, digits = 2))
  cat("\nMARGINALS\n---------\n")
  print(round(100 * marginal / n, digits = 2))
  cat("\nPAIRS\n-----\n")
  print(round(100 * pair / n, digits = 2))
  cat("\nUNIFORMITY TEST\n---------------\n")
  cat("Test Statistic:", statistic, "\n")
  cat("P-value:", pchisq(statistic, df = choose(CLASS_COUNT, 2),
                         lower.tail = FALSE), "\n")
  cat("\nOPTIMAL RANK ESTIMATE\n---------------------\n")
  print(sort(best_rank))
  cat("\nWith", conflevel %s0% '%', "confidence:",
      '\n' %s0% ci$preference_string)
  if (detailed) {
    cat("\nPlausible Modal Rankings:\n")
    print(as.data.frame(ci$ranks))
  }
  cat("\nCLUSTERING\n----------\nCounts: ")
  print(table(Cluster))
  cat("\nCenters:\n")
  print(centers)
  cat("\nScore:", clustering_score(rank_mat, centers, Cluster), "\n")
  if (detailed) {
    cat("\nCLUSTER CONFIDENCE INTERVALS\n----------------------------\n")
    for (i in 1:clusters) {
      cat("\nCluster", i %s0% ':\n')
      ci_cluster <- kendall_rank_conf_interval(rank_mat[Cluster == i, ])
      cat("\nWith", conflevel %s0% '%', "confidence:",
          '\n' %s0% ci_cluster$preference_string)
      cat("\nPlausible Modal Rankings:\n")
      print(as.data.frame(ci_cluster$ranks))
    }
  }
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
        description = paste("Analyze Arkham Horror LCG class preference survey",
                            "data and print results."),
        option_list = list(
          make_option(c("--input", "-i"), type = "character",
                      help = paste("Input file containing survey data")),
          make_option(c("--prefix", "-p"), type = "character", default = "",
                      help = "Another command-line argument"),
          make_option(c("--width", "-w"), type = "double", default = 6,
                      help = "Width of plots"),
          make_option(c("--height", "-H"), type = "double", default = 4,
                      help = "Height of plots"),
          make_option(c("--clusters", "-k"), type = "integer", default = 5,
                      help = "Number of clusters in spectral clustering"),
          make_option(c("--comments", "-c"), type = "character",
                      default = "AHLCGClusterComments.txt",
                      help = "File to store participant comments organized" %s%
                             "by cluster"),
          make_option(c("--conflevel", "-a"), type = "double", default = 95,
                      help = "Confidence level of confidence set"),
          make_option(c("--detailed", "-d"), action = "store_true",
                      default = FALSE, help = "More detail in report")
        )
      ))

  do.call(main, cl_args)
}

Report

$ ./ArkhamHorrorClassPreferenceAnalysis.R -i AHLCGClassPreferenceSurveys.Rda --detailed

MEAN RANK
---------
Guardian   Mystic    Rogue   Seeker Survivor 
    2.92     3.10     3.16     2.60     3.22 

MARGINALS
---------
             1     2     3     4     5
Guardian 18.29 20.43 26.84 19.71 14.73
Mystic   19.71 18.29 17.81 20.90 23.28
Rogue    19.24 14.73 20.67 21.38 23.99
Seeker   28.03 25.18 17.10 18.53 11.16
Survivor 14.73 21.38 17.58 19.48 26.84

PAIRS
-----
         Guardian Mystic Rogue Seeker Survivor
Guardian     0.00  54.16 55.34  42.52    55.82
Mystic      45.84   0.00 51.07  39.90    53.44
Rogue       44.66  48.93  0.00  38.72    51.54
Seeker      57.48  60.10 61.28   0.00    61.52
Survivor    44.18  46.56 48.46  38.48     0.00

UNIFORMITY TEST
---------------
Test Statistic: 2309938376 
P-value: 0 

OPTIMAL RANK ESTIMATE
---------------------
  Seeker Guardian   Mystic    Rogue Survivor 
       1        2        3        4        5 

With 95% confidence: 
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         1      4     3      2        5
6         1      4     5      2        3
7         1      5     3      2        4
8         1      5     4      2        3
9         2      1     4      3        5
10        2      1     5      3        4
11        2      3     4      1        5
12        2      3     5      1        4
13        2      4     3      1        5
14        2      4     5      1        3
15        2      5     3      1        4
16        2      5     4      1        3
17        3      1     4      2        5
18        3      1     5      2        4
19        3      2     4      1        5
20        3      2     5      1        4
21        3      4     2      1        5
22        3      4     5      1        2
23        3      5     2      1        4
24        3      5     4      1        2
25        4      1     3      2        5
26        4      1     5      2        3
27        4      2     3      1        5
28        4      2     5      1        3
29        4      3     2      1        5
30        4      3     5      1        2
31        4      5     2      1        3
32        4      5     3      1        2
33        5      1     3      2        4
34        5      1     4      2        3
35        5      2     3      1        4
36        5      2     4      1        3
37        5      3     2      1        4
38        5      3     4      1        2
39        5      4     2      1        3
40        5      4     3      1        2

CLUSTERING
----------
Counts: Cluster
  1   2   3   4   5 
130  83  80  66  62 

Centers:
  Guardian Mystic Rogue Seeker Survivor
1        3      2     4      1        5
2        3      5     4      1        2
3        3      4     1      2        5
4        1      5     3      4        2
5        5      1     4      3        2

Score: 881 

CLUSTER CONFIDENCE INTERVALS
----------------------------

Cluster 1:

With 95% confidence: 
Guardian is better than Rogue
Guardian is better than Survivor
Mystic is better than Rogue
Mystic is better than Survivor
Seeker is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      2     4      3        5
2         1      2     5      3        4
3         1      3     4      2        5
4         1      3     5      2        4
5         2      1     4      3        5
6         2      1     5      3        4
7         2      3     4      1        5
8         2      3     5      1        4
9         3      1     4      2        5
10        3      1     5      2        4
11        3      2     4      1        5
12        3      2     5      1        4

Cluster 2:

With 95% confidence: 
Guardian is better than Mystic
Guardian is better than Rogue
Seeker is better than Guardian
Seeker is better than Mystic
Survivor is better than Mystic
Seeker is better than Rogue
Survivor is better than Rogue
Seeker is better than Survivor

Plausible Modal Rankings:
  Guardian Mystic Rogue Seeker Survivor
1        2      4     5      1        3
2        2      5     4      1        3
3        3      4     5      1        2
4        3      5     4      1        2

Cluster 3:

With 95% confidence: 
Rogue is better than Guardian
Rogue is better than Mystic
Rogue is better than Seeker
Rogue is better than Survivor
Seeker is better than Survivor

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         2      3     1      4        5
2         2      4     1      3        5
3         2      5     1      3        4
4         3      2     1      4        5
5         3      4     1      2        5
6         3      5     1      2        4
7         4      2     1      3        5
8         4      3     1      2        5
9         4      5     1      2        3
10        5      2     1      3        4
11        5      3     1      2        4
12        5      4     1      2        3

Cluster 4:

With 95% confidence: 
Guardian is better than Mystic
Guardian is better than Seeker
Rogue is better than Mystic
Survivor is better than Mystic
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         1      4     2      5        3
2         1      4     3      5        2
3         1      5     2      4        3
4         1      5     3      4        2
5         1      5     4      3        2
6         2      4     1      5        3
7         2      4     3      5        1
8         2      5     1      4        3
9         2      5     3      4        1
10        2      5     4      3        1
11        3      4     1      5        2
12        3      4     2      5        1
13        3      5     1      4        2
14        3      5     2      4        1

Cluster 5:

With 95% confidence: 
Mystic is better than Guardian
Survivor is better than Guardian
Mystic is better than Rogue
Mystic is better than Seeker
Survivor is better than Rogue
Survivor is better than Seeker

Plausible Modal Rankings:
   Guardian Mystic Rogue Seeker Survivor
1         3      1     4      5        2
2         3      1     5      4        2
3         3      2     4      5        1
4         3      2     5      4        1
5         4      1     3      5        2
6         4      1     5      3        2
7         4      2     3      5        1
8         4      2     5      3        1
9         5      1     3      4        2
10        5      1     4      3        2
11        5      2     3      4        1
12        5      2     4      3        1

Respondent Comments Groups By Cluster

CLUSTER 1
------------

Guardians have serious bling and they're awesome at what they do, so they're number 1. Seekers also have great cards that guzzle clues and generally provide solid deck building, so they're #2. Rogues have cards that look like a lot of fun (there's bling there too) and they are often good at both clue gathering and fighting, depending on which is needed. Mystic decks feel like they're all the same, so building decks with them is not as much fun. Survivor cards are extremely limited so they're my least favorite.

-*-

I love the Mystic spells, especially the versatility. Hated Rogues since Skids days, although Jenny is great and Preston is very good fun.
Guardians and Seeker fall very easy into the usable archetypes of Attack and Investigate.

-*-

I love supporting guardians and seekers. Control focused mistics are also fun.

-*-

Purple is top just because of Recall the Future and Premonition. Yellow for being weird, Green for extra-actions and Finn. Red for cool, weird interactions at a bargain price. Blue is boring.

-*-

I don't like playing Rogues, alright? Please don't crucify me! Oh, this is anonymous? Excellent.

-*-

Simplicity of play and planning.

-*-

I love spells and magic items

-*-

Guardian are probably te most rounded IMO. Seekers next, but great at clue gathering.

-*-

Seeker pool has best card draw & selection; guardian has stick to the plan + stand together + weapons; survivor pool is good but good xp options are less varied (will to survive/true survivor or bust); mystics delve too deep + bag control + David Renfield are good. Rogue pool is harder to build a full game plan around—-its best cards enable great turns (pocket watch, double or nothing, etc) and are valuable to have in the party, but they have a harder time spending actions 2-3 as usefully since some of their best things exhaust (lockpicks).

-*-

Mystic and Rogue tied for first. Mystic is my prefered and I like how I can stack my deck to be heavy in either investigating and/or combat. Rogue because most get a lot of recources where you can purchase more expensive cards.

-*-

I feel as though Mystic have the broadest tool kit and be specialise in almost any direction. However my experience is solely limited to two player with my wife and she plays a cloover, so we need someone with bashing power.

-*-

Matt's response

-*-

I primarily play a seeker (Daisy)

-*-

Yellow fits with my playstyle the best

-*-

I really like all of them, so there's not a ton of distance between them.

-*-

gameplay style, clear focus on purposes

-*-

Guardian and Seeker are very straightforward, and I like that. They have a clear objective, and they do it well.

-*-

While I feel that most classes have merit, the rogue is generally the worst at the core aspects of the game: fighting and clue finding. Evading does not have the punch that killing the enemy foes.

-*-

I prefer a support / team role, and play for consistency over tricks.

-*-

Most useful for the group

-*-

I just looked at options. Mystics have a lot of options in every way, shape or form, and so do Guardians. I just prefer the mystic combos better, since Guardians are pretty bland in that regard. I feel you really can now make different mystic decks, from support to tank and combat master, to main seeking investigator etc. They have everything and even playing one deck a few times is till fun because of so many exp. options. And while their decks are pretty deep, the premise is simple - boost willpower. That leaves them with a nice weakness you have to cover. Guardians have better weapons (more fun) than mystics have combat spells, although Shattered Aeons really gave Mystics a fun new icy option. And maybe I'd like to see a Mystic that wouldn't be pure Mystic if you get me. Some hybrid guy or girl, that's not just using spells and artifacts from the same class over and over again. That's really what they're missing. Guardians are just so great, because they are sooo well balanced imo. It's quite relaxing looking at their options. You have everything from amazing gear, weapons, allies, events that cover literally everything + your friends' asses, awesome skillcards that can also combo, fun and engaging exp. options etc. But they lack different kinds of investigators. They have options, just some other classes have more. Maybe my least favorite on investigator side. Mystics again are so simple to make in that regard. I gave Seekers 3. because they just have some 0 exp. cards that are just too strong for any class, not just for them. Otherwise I really like Seeker cards theme, maybe even more than Guardian, maybe even my favorite I'd say, but again, Seekers just have so much random stuff and OP stuff (you know what they are). I don't care for balance in a co-op game, OP cards can be really fun, but this stuff really limits their options and sometimes even other classes' options, because not including them just hinders your deck and you know it (example is Shortcut). And that's not good. They have really fun and diverse roster of investigators though. And their experience options are quite game breaking, but in a good way imo. There's seeking, combat, running and evading so much support and combos, really fun and diverse. Rogues have maybe some of my least favorite cards, but they have A LOT of options. They have quite a few very awesome weapons, but they also have SO MUCH cards that are meant for combos and while combo decks are fun, they, in my opinion, are niche, or at least not used in every game. Sometimes you just want a simple deck and Rouges have a limited card pool when you look at it that way (example: no useful combat ally or even asset - there is a new Guardian tarrot card for Jenny and Skids, but they need more imo). They got their quite fresh Lockpicks and the seeker gator and that was an amazing get. But more, Leo breaks their ally pool, because he's just too strong. They also have no pure combat investigators, but otherwise their investigators are really really fun and diverse. They have AMAZING experience options. Maybe the best in the game. And btw, they were my favorite to play before the last few expansions. I love Preston, but again the new cards are very niche. The new seeker agent Joe with 4 combat elevates seekers above Rogues for me in the options in card pool department though. They now have an optional pure combat investigator, while Rogues still don't. Survivors have AWESOME cards, especially investigators are just so fun and weird, but they just lack options in the card pool. You have so many "survive" cards, but they lack anything else strong. Their weapons are quite fun, but there are no heavy hitting options. That for me may be their biggest minus. Lack of experience pure combat options. They have quite a few very strong investigate cards though like Look What I Found and Newspaper 2 exp. And their allies, while strong, are still nicely balanced and quite diverse. They have a million evade options, maybe even too much. It would sometimes be nice to get something else rather than just another evade. These new Track Shoes are pretty cool though. Their skill cards are pretty awesome imo. But still, I feel like they have so much niche cards that only allow some very specific combos, like Rogues, and lack anything else meaningful. They are extremely fun to play though, with all their Survivor specializations like Seeker Urchin, combat Gravekeeper, being dead crazy guy, new athlete runner and evader etc. They may even be my favorite class, but they still lack options in a big way. And they even lack one investigator only available for 15 bucks along a cheaply written book.

CLUSTER 2
------------

survivors da best

-*-

Guardian just have so many cards that, when looking at them, seem useful. Mystic is my actual favourite class, but it has soo many cards where they went too far with the punishing effects that almost made them useless. Survivor on the other hand has too many events that end up feeling almost the same. Seekers I dont really know, Ive never played them, but everytime I see them looks like they can do many things. And rogue, while it has improved a bit more, I still miss a useful level 1 weapon

-*-

Difficulty wrapping my head around some classes

-*-

Mystics are incredibly dependent on their cards.

-*-

Seekers usually win the game, because the snitch is 150 points

-*-

Always cards in these classes that I have a hard time cutting. Which means they have the deepest pools marking them the most fun to me

-*-

I love deck manipulation for seekers, and the flexibility of survivors. I just can't get my head wrapped around mystics.

-*-

Guardians have a lot of great tools for not just fighting but getting clues. Seeker has the best support so splashing it is great. Rogue and survivor are ties for good splash but survivors card pool is mediocre to me. Mystic aren't bad but I haven't seen it great with others very well. Mystics are good as themselves but really expensive and not great for splash IMO.

-*-

Survivor have many nice tricks to survive and gather clues.
Guardians because they have the best weapons (flamethrower) and protective tools.
seeker for their upgradable cards and higher ed.
mystic for canceling cards but dont like their only good stat is willpower... rogues seems interesting but never played one.

-*-

Seekers have action economy (shortcut, pathfinder), card economy, resource economy (Dr Milan + cheap cards) and they advance the game quickly (i.e. discover clues).
Specialist decks are better than generalist decks (in multiplayer, which I play) as they accomplish their goals more consistently, and this favours seekers and guardians.
Stick To The Plan with Ever Vigilant is the most powerful deck element I am aware of.

-*-

I tend to play builds focused around consistency of succeeding at tests and action efficiency and my rankings reflect the build consistencies in order except rogue who are consistent but just not interesting.

-*-

Love survivors

-*-

Seeker is m'y main class

-*-

Firstly let me preface this with I only own 2 cores and the Dunwich cycle and have yet to play through Dunwich.

Survivor offers the most versatility and always seems to be one of the key factors when beating the odds in most cases as well as enhancing evasion and action economy (survival instinct etc).
Seeker cards are my second favourite due to the amount of utility included within them (i.e. Shortcut, Barricade, Medical Texts, Old Book of Lore etc) as well as allowing you what you need to catapult out in front of the agenda deck with cluevering abilities.
Guardian and Mystic operate on a similar field marginally behind Seeker to me though mystic finds itself slightly higher because of the unique interactions with the encounter deck and rule bending. though in my limited experience they both seem to be the more combat based of the card pools so operate in that same niche for me.
Rogue is unfortunately last but honestly that's just because I haven't had many interactions with them, most of their effects seem too situational to be able to use consistently.

-*-

I don't like taking the obvious solutions to a problem. I.E: Gun to the face, or Spells for everything.

-*-

Efficiency at what is needed to complete scenarios - mostly clue getting and combat.

-*-

Rogue and survivor seem to have the most cards that support each other to suggest a new way of playing. Recursion survivor is fun and different from no money survivor (though you can do both). Rogue has succeed by 2 and rich as options. Seeker has less of that but has the power of untranslated etc cards. Guardians are okay but kind of blah. I can’t see any fun decks to do with mystic. Like, messing with the bag is a cool thing to do in any deck, it isn’t a deck. Playing with doom is a couple cards that need each other but it isn’t a plan for how the game will play out.

-*-

Definitely hard to rank them, but ranked in order of which I'd most like to have as an off-class

-*-

I like the consistency in the Survivor card pool and how much individual support there is for the variety of Survivor investigators. Although I like the power level of Mystic cards, it always sucks to have your Rite of Seeking or Shriveling 15 cards down after a full mulligan for them.

-*-

More scenarios need cloovers and fighters, so all classes outside of Seeker and Guardian are more tricksy and less focused on the goal. This is a hard-enough game as it is!

-*-

Seeker cards are way too powerful. Rogues are the most fun to play. Survivor cards are super efficient at what they do. Guardian pool is decent but overpriced. Mystics have a few amazing cards, but the rest is pretty meh.

CLUSTER 3
------------

Vaguely from ‘most interactive’ to ‘most straightforward’ with a special mention for the Survivor card pool which has been largely static since roughly core with a few major exceptions.

-*-

Rogue cards are the most fun for me. More money, more actions, more fun.

-*-

I seem to like the classes that are less straight-forward than Guardian and Seeker tend to be. (In the sense that they are the archetypical fighters and cluevers.)

-*-

I like cards that cheat the system and don't depend on leveraging board state

-*-

Green and purple cards have fun and flashy effects. Blue and yellow cards have more standard effects and narrower deck building options.

-*-

I didn't play mystics a lot yet

-*-

The numbers are different depending whether we’re talking theory or practice. In theory the Mystic cards are my favorite, both for flavor and interesting mechanics. In practice I struggle with them and they’re usually the first cut.

-*-

Combos!

-*-

I like moneeeey

-*-

seekers have literally everything, and their cards usually aren't too expensive. rogues have adaptable, streetwise, and really good allies, but they're a bit low in damage output. guardians have really good cards but are limited by how expensive they are. mystic events are amazing, but they are 4th place because level 0 spells kinda suck and are expensive as hell. mystic cards are much better with exp. survivor cards are almost decent. it really sucks that many of their leveled up cards are exile cards, but survivors don't get any extra exp. but in general i find their cards to be lacking in clue-gathering capability and damage dealing. they can turn failures into successes, but that's about it.

-*-

Guardian is solid and predictable, Rogue is fun. Mystic is challenging, Seeker and Survivor are necessary.

-*-

THE JANK

-*-

I really dislike survivors as I simply dont understand how to properly build them (appart maybe Wendy). Even if I have rated mystics 4, I enjoy playing Mystic nearly as much as seeker (which I rated 1) rather than Survivor.

-*-

I think the rogue theme is portayed very well in their card pool

-*-

corelation between mechanisms and theme

-*-

I like big, flashy, ridiculous turns and risky plays, so rogue and mystic are the most fun for me. Guardian and seeker are fine and all, just a bit dry. I don’t understand survivor at all, but I’m happy other people have a thing they like.

-*-

Rogue and survivor give you interesting and powerful but situational tools that you have to figure out how to apply to the scenario. Mystic and guardian are more about powerful assets that you either draw early and use a bunch or wish you’d drawn earlier but can’t afford now and just commit for icons. Seeker pool makes me sleepy every time I look at it; the only mechanic there I really enjoy is the tome synergies and that’s only with Daisy (Rex, of course, is only played one way).

-*-

Role-play Value

-*-

I went for those that get me excited to play or provide thrills or cool combinations as I play (rather than, say, the power of the cards)

CLUSTER 4
------------

Lol moments. We’d all be survivor if we were in this game!

-*-

The top two were tricky to place; Rogues have fantastically fun combo plays available to them, while I love the 'feel' of many Survivor cards, fighting against fate as hard as they damn well can.
Overall, I find the Survivor pool *just* wins out, especially with the excellent Will To Survive and semi-immortal Pete Sylvestre.

Guardians and Seekers are two sides of the same coin; I'd say Guardians edge out, because while a Guardian has a few tools (including the infamous Flashlight) to find clues, Seekers have very few options to take care of enemies.
As with Survivors and Rogues, though, this is close.

Mystics... weeeeell. .. I acknowledge they are arguably the best class, once set up, and while their charges last on their spells. The ability to do everything while testing just one stat can make them very efficient. But... this is the class I enjoy the least, in part due to their over-reliance on their spells. Their solutions never feel more than stopgaps for me, so I find Mystics a hard class to play.

(That won't stop me taking Mystics for a spin though, especially for goodies like Delve Too Deep )

-*-

Ability to bounce off Investigators with good resource and action economy, other card pools (including Neutral), as well as capability to start off with no experience — all the way to full campaign with as much power-card investment as possible. Seeker may have 2 of the best cards in the game (Higher Education and Dr. Milan Christopher), but the Seeker card pool as a whole does not stand up. It is both narrow and shallow. Mystic is the most detailed and the most broad, but suffers from undue delay leading to deterioration. Guardian definitely needs to be more broad as well. Both Rogue and Survivor blend well, and provide the necessary breadth to take on challenges while melding with the high-economy Investigators. Rogue has a few 3, 4, and 5 xp cards that push it to the top spot. Even for Lola these statements hold up.

-*-

On a scale of most interesting vs. most boring. Options for rogues and survivors feel fresh and like there are multiple deck archetypes that are valid. Less so for the seeker and mystic card pools, where I feel like there are more "must include" cards which makes deck building less exciting and more rote.

-*-

survivor da bass

-*-

The card pool allows rogue/survivor decks to make specialists. Seekers are all just different flavours of clueverer

-*-

Personally, I like healing and support best, which guardian does quite well. Survivor has my second favorite card pool, though, for tricks and card recursion.

-*-

Not much between them but I like guns & ammo, survivor class is cool because it is normies vs horror

-*-

I really like the guardian cards as i enjoy fighting the monsters that appear in scenarios. Unfortunately my least favorite is mystic. Although they have powerful cards, they often take time to set up and I think that the draw backs on some of their cards are too harsh for what they do.

-*-

Just what I gravitate towards

-*-

I like killing monsters

-*-

Mystics have so much of everything with cool effects added on. Guardian cards are efficient at what they do, but really boring.

-*-

Survivors feel more unique, guardians kill stuff, seekers feel like you can't win without them (though you really can). Rogues and mystics by default. I like rogues better because of Finn and Sefina being really fun to play.

-*-

Almost always let my partner(s) play the seekers as I find the rogue and survivor cardpools allow you to fly by the seat of your pants, which I find even more exciting than just being the clue gatherer. Mystic card pool can sometimes take too long to develop. Also many marquis mystic cards flirt around with the doom mechanic which always bites me in the arse. Thirdly, mystic pool doesn't have a strong ally base. What's funny about that is I always play spellcasters in D n D.
Guardian pool is pretty straightforward, one I look at as more of a necessity within the context of the game,but doesn't tug at my heartstrings .
Apologize for the sloppy phrasing in my opine due to a long day. Rankings based on personal preferences only. No meta analysis

-*-

Agnes

-*-

Just prefer the in your face monster destruction that Guardian is themed with. Really enjoy that class.

-*-

Flexibility

-*-

I love killing things and then getting clues!

-*-

I like all of them but play seekers least, I also like that guardians can just take the mythos phase to the face

-*-

I like to be the tank, and with some of the new additions guardians have gotten with getting clues they just shine even more. Mystic I never really play but has so many cards I want if I am playing a dunwich survivor or anyone who can get them, same goes for survivor, very few cards from rogue or seeker makes it into my decks unless I am playing chars like Wendy or Leo who kinda needs them to make them work

-*-

Number of fun upgrade options in green, base card pool for red, upgrade options in blue, useful upgrades in seeker, purple sucks to play.

-*-

I like support / killing role, Carolyn FTW

CLUSTER 5
------------

Weapons are fun.

-*-

Leo is da alpha male.

-*-

Red's best

-*-

There’s more variety of cards that let me build decks I enjoy. As you go down the ranking, there’s less variety or fewer viable sub themes to build around.

-*-

Seeker is powerful but boring, while mystic getting to punch pack at the game is great, with good support to boot.

-*-

I enjoy the cardplay in survivor, and the mystic side of things. Seeker cards are generally very powerful. I don’t enjoy playing rogue but there is some good cardplay. Guardian I find less interesting overall as a class

-*-

Wendy is literally the best investigator in the game

-*-

I enjoy support cards and interesting, unique effects.

-*-

I tend to go for lazy game play, and usually guardians allow to smash enemies without worrying too much about strategy. Seekers I love them thematically. Mystics, I never understood how to play them efficiently

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

http://ntguardian.wordpress.com/?p=3565

Extensions

Comparing the Classes of Arkham Horror; Why Survivors Need Work

ntguardian Feb 18, 2019

Introduction This blog post was prompted by this meme posted in the Arkham Horror: The Card Game Facebook group: This meme got 144 comments and 94 reactions, and I’d say that most of them were in effect agreeing with the point the meme makes; the Survivor class feels like a black sheep among the Arkham…Read more Comparing the Classes of Arkham Horror; Why Survivors Need Work

Show full content

Introduction

This blog post was prompted by this meme posted in the Arkham Horror: The Card Game Facebook group:

Arkham Horror meme

This meme got 144 comments and 94 reactions, and I’d say that most of them were in effect agreeing with the point the meme makes; the Survivor class feels like a black sheep among the Arkham Horror LCG classes (not that there aren’t people who like Survivors).

One of the comments on the thread was made by me:

Seriously, we’re three cycles in, starting the fourth, and there’s no Level 4-5 Survivor cards? It makes you wonder if there will EVER be high-level Survivor cards!

I’m sure you can make one that does [sic]. Here’s one: King of the Hobos! When you have no resources, take two extra actions.

I dunno, just give us something. Late in the campaign it becomes a pain to upgrade your deck because all cards are low XP. It also makes seeing “Survivor cards level 0-5” a joke and Lola basically another survivor.

This comment, on its own, got 58 replies. So let’s just say that how the Arkham Horror designers are handling the Survivor class is a hot-button topic. It’s pretty clear what I think about how Survivors are handled. But I also wanted to see to what extent the community agreed or disagreed with me.

So I created a poll (now closed) asking people to rank the Arkham Horror LCG classes from best (1) to worst (5). I’ll first present this and other data, then present my own opinion.

Data

The first data I saw related to this was a response to my comment, showing the popularity of investigator decks on ArkhamDB. Below is a screenshot of the interactive tool (that you should check out):
Tableu deck popularity, Survivor emphasis

(This data set does not include the investigators released with The Circle Undone, but considering how new the cycle is that may be for the best.)

Survivors supposedly occupy the low tier of ArkhamDB decks, and thus are less popular. I don’t agree with this conclusion from this data set; as others pointed out by others, lots of people may be playing survivors without posting their decks, and the amount of new decks being made doesn’t necessarily correlate well with how people feel the class performs. For instance, supposedly there are not many Mystic decks because all Mystic decks include the same key cards and thus look largely the same. (And that’s not a good thing, by the way.)

I also want to add that the fundamental question is how good a card pool is rather than how good Survivor investigators are. I don’t think that there’s anything wrong with the Survivor investigators in a vacuum; they’re all great investigators. (That includes Calvin, too; he’s not just a “challenge” investigator, he can pull his weight and more in a game when played right.) That said, the identification of an investigator as a “Guardian” or “Seeker” or “Survivor” serves no mechanical purpose. Nothin in the game pings off the class of the investigator; all investigators could be in the “Neutral” class (like Lola) and the game would be the same because of the deck building requirements. All that the class of an investigator does is indicate what the deck building requirements will be. And some investigators, such as Carolyn and Norman, really drive this point home with their deck-building requirements.

Calvin Wright

So to answer this question, I created a poll, asking people to rate the card pools of the classes. My question was simple: “Rank which class card pool you prefer, from most favorite (1) to least favorite (5).” Not everyone agreed that this was the best question to ask; there are different aspects in which classes may be “better” or “worse”. My primary Arkham Horror LCG partner and friendly local game store owner Matt Freed (who owns Mind Games, LLC, in West Valley, Utah) refused to answer the question as phrased because his rankings would differ completely depending on whether the class in question was the primary class or an off class of an investigator (I told him that I cared most about the card pool class as a primary class of an investigator). That said, the simplicity of the question (plus attempting to publicize the poll on the Internet as well as I could) managed to get me 421 responses, a decent sample size.

Lola Hayes

When looking at this data one must remember that this is not a random sample as statisticians prefer. The people who participated chose to do so and they’re all from the Internet, which means that one can raise questions about the external validity of the data. That said, I think that we can still learn a lot from the data, even if it’s not perfect.

Carolyn Fern

By the way, analyzing the data I got was my first forray into analyzing rank data. I had to learn new statistical methods to meaningfully pry into the data to see what it said. I loved the methodology and the details of what I did will be presented in a later post, along with the scripts I wrote for doing this analaysis. For now, I’ll just mention that my primary reference for learning how to analyze the data was Analyzing and Modeling Rank Data, by John Marden.

Norman Withers

Let’s start with some basic charts. Below is the “marginals” plot; simply put, it’s how frequently a class was assigned a particular ranking.

Marginals

One reading of this plot is that Seekers are rated high, Survivors are rated low, Guardians are rated third, and it’s hard to tell how Mystics and Rogues rank, though it appears that Rogues are better liked than Mystics.

I however don’t like this plot since it doesn’t take advantage of the fact that the data is rank data. A plot that better accounts for the nature of the data is the “pairs” plot, which shows how many people prefer the $x$ -axis class to the class represented by the bar. If the bar is above 50%, then the $x$ -axis class is preferred to the bar class, while if the bar is below 50%, the bar’s class is preferred. The pairs plot (with a line marking the important 50% cutoff) is shown below:

Pairs

It’s much clearer from this plot which classes people prefer. Seekers are preferred to all other classes. Guardians are preferred to all classes but Seekers. Mystics are peferred to Rogues and Survivors, and Rogues are preferred only to Survivors. The Survivors are handily in last place, not being preferred overall to any other class.

This suggests that the ranking of the classes are, from best to worst: Seeker, Guardian, Mystic, Rogue, and finally Survivor. This ranking was the ranking obtained by one estimator of the central (or “closest to consensus”) ranking obtained by finding a ranking that minimizes the sum of Kendall distances (which is closely tied to the pairs plot). Also, a statistical test confirmed that the respondents were not equally likely to give any ranking and thus have preferences. A 95% confidence interval could only conclude, though, that Seekers were preferred to Rogues and Survivors; any other ranking is plausible under that confidence interval. That means that Seekers overall rank in the community is at least three, while Rogues and Survivors cannot be ranked first in the consensus ranking. (All tests and confidence intervals were based on the pairs matrix/Kendall distance.)

These are the statistics for the community considered as a whole, but it’s possible that players fall into different “archetypes” and thus may have different class preferences. Matt Newman, the lead designer of Arkham Horror, seems to believe so according to this article he wrote. I never asked players what “type” they were, but I attempted to determine types via cluster analysis, based on spectral clustering.

Let me start by saying that I’m not convinced there are meaningful clusters of players in this data. All the metrics for finding clusters were bad. But if you insist that there must be clusters of player types in the data, read on.

If there are clusters, my best guess is that there are five clusters of players. Based on reading the (optional) responses of players in each cluster, I’ve labeled (in a very subjective manner) cluster 1 as the “power” players (this is the cluster of Matt and I), cluster 2 as the “versatilityi/efficiency” cluster, cluster 3 as the “bling” cluster, cluster 4 as the “kill monsters” cluster, and cluster 5 as the cluster that likes “theme” in the game (although honestly this cluster was the hardest to define and often looked like the cluster of the most confused participants; the comment rate from this cluster was the lowest.)

31% of players were in the “power” cluster, 20% in the “versatility/efficiency” cluster, 19% in the “bling” cluster, 16% in the “kill monsters” cluster, and 14% in the “theme” cluster. The class preference in each cluster was, in order:

Seeker, Mystic, Guardian, Rogue, Survivor
Seeker, Survivor, Guardian, Mystic, Rogue
Rogue, Seeker, Guardian, Mystic, Survivor
Guardian, Survivor, Rogue, Seeker, Mystic
Mystic, Survivor, Seeker, Rogue, Guardian

The confidence intervals (which are starting to lose their validity in the cluster analysis due to sample size and some preference pairings never being seen in the cluster) suggest that the “power” players dislike Rogues and Survivors; any ranking that puts Rogues and Survivors in the two worst rankings could be this group’s “central” ranking. The second group is more difficult to infer; Guardians and Survivors are supposedly better than Mystics and Rogues, while Seekers are better than all other classes. The “bling” cluster loves Rogues more than any other class and prefers Seekers to Survivors. The “kill monsters” cluster prefers Guardians and Survivors to Mystics and Seekers and Rogues more than Mystics. Finally, the “theme” cluster prefers Mystics and Survivors to all other classes.

What’s Wrong with Survivors?

I would say that, taken together, this data suggests that the Survivor class is, indeed, problematic, and players on the whole are not a fan of how it’s handled. (That said, one could make a case that this is true for the Rogue class too.) Here’s my argument regarding what’s wrong with the Survivor card pool.

Let me start by saying there are great Survivor cards. Lucky! may be the best card in the game. Peter Sylvestre is a great ally (at both levels), and I even really like Level 3 Rabbit’s Foot. Survivor cards are among the top 20 most popular (faction) cards in ArkhamDB decks.

Top 20 cards

While no one can credibly argue that Survivors have bad cards, I think we can credibly argue that the Survivor card pool is weak and eventually becomes a drag to play in a campaign.

Lucky!

The two most recent Survivors I played were “Ashcan” Pete and William Yorick. My “Ashcan” deck was a deck that heavily utilized Yaotl, Cornered, and the desperate cards. I was playing this deck in The Forgotten Age cycle. At first I really enjoyed the deck and how it worked. I don’t think it was the most powerful deck but it was fun to play. Eventually, though, I fell out of love with the deck and I don’t think I will ever try it again. Eventually I could not upgrade the deck without replacing what I saw as “core” cards. This problem occured mid-campaign, too.

Ashcan Pete

The same problem happened with my William Yorick deck (again in Forgotten Age). Granted, Yorick has access to the Guardian card pool, which provided another outlet for spending my experience without removing core cards. The Guardian cards were an important experience point outlet. Eventually, though, I ran into the same problem: lots of experience points and nowhere to spend them (but at least it was late in the campaign when I hit this problem).

William Yorick

Thus I have basically one complaint with the Survivor card pool: there are no high level Survivor cards. It turns out this is by design; Matt Newman told the hosts of the Drawn to the Flame podcast in episode 22 that he liked keeping the level of Survivor cards capped at 3 since it fit with the Survivor theme of “not being ready.” Thus we are starting our fourth Arkham Horror cycle and there are no high-level Survivor cards.

Peter Sylvestre

What’s the consequences of this? Well, in my FLGS, “Survivor cards level 0-5” is a long-running joke since there are no high-level Survivor cards. Now Matt (the FLGS owner) said that, as an off-class, Survivor cards are one of the best pools to pull from, and I agree with him. Consider the table below:

Class Total Cards High-Level Cards (at least 3 XP) Seeker 81 18 Guardian 79 14 Mystic 80 17 Rogue 79 12 Survivor 80 11

69 Survivor cards are accessible to off-class Survivors, more than any other class. Furthermore, these are just about all the best Survivor cards.

Rabbit's Foot

Now let’s take a step back. Remember when Joe Diamond was announced to be a Seeker? That was a shock to many in the community, who were pretty sure that Joe Diamond would be a Guardian. Furthermore, making Joe Diamond a Seeker was a big deal that had major implications for his deck building. A primary-class Seeker/off-class Guardian will look very different from a primary-class Guardian/off-cass Seeker, both in deck and style of play.

Joe Diamond

Let’s take William Yorick. What would change if William Yorick went from primary-class Survivor/off-class Guardian to primary-class Guardian/off-class Survivor? Well, Yorick would lose access to the 11 Survivor cards that are level 3 and gain access to the 14 Guardian cards that are levels 3, 4, and 5. And to be completely honest, I wouldn’t miss any of the Survivor cards I lost; I didn’t use them in the Yorick deck I built and honestly none of these cards feel like great cards we’d enthusiastically put in decks. I’d say that the level 3 Survivor cards are generally cards that get placed in decks that have experience points to burn; that XP has to be spent on something. So by making this switch, my Yorick deck would, almost unambiguously, become better.

Yaotl

Thus my first point: high-level cards help distinguish class capabilities. It is because these powerful cards are not accessible to other investigators that these classes are distinctive. Seekers, Guardians, Rogues, and Mystics all have cards that make that class memorable and help separate that class from the rest. The Survivor card pool does not have this since off-class Survivors are basically just as good as primary-class Survivors. Heck, even Lola has full access to the Survivor pool! She may as well be one!

Cornered

I think that if Matt Newman were to read this his response would be “Survivors lack of high-level cards is what makes them distinctive. It’s their lack of preparedness that makes them thematically work.” First, I think we’ve seen that the lack of high-level cards makes the class worse from a gameplay perspective. Second, think about who we would call “survivalists” today, such as tribal people, prehistoric humans, or guys that go off into the woods and cut themselves off from civilization. There are actually many aspects of these people that appear almost superhuman. They’re impoverished but often skilled in everything necessary to survival. Most people from regular society would die if immediately forced into the dire situations survivalists deal with on a regular basis. Survivalism is not about being unprepared or low-skilled. It’s about being well-rounded and internalizing all strengths, thus not dependent on the tools available to you. Survivalists are actually well-prepared! And high-level cards can be designed to fit this ethos.

And thus my second criticism of the low-XP policy: it restricts player growth. Upgrading your deck not only is a way to get new toys; it shows how the encounters with the mythos caused the investigator to grow and improve. By preventing access to high-level cards, the investigator’s growth is restricted. There is access to a lot of low-level cards, but putting these cards into a deck pushes out other cards to such an extent that the deck at the beginning and end of the campaign look extremely different. This was the case with my aforementioned Ashcan deck; I could not upgrade it without drastically altering it. I could not keep the same deck archetype while at the same time staying a Survivor. The deck would have to transform in character in order to upgrade; it wasn’t really possible for the deck to just get better at what it already does.

Now there are the exile that can help players burn XP. But I hate those cards! Not only are they very narrow cards, I don’t like burning XP when I play a card. (That said, I like the upcoming Survivor ally Guiding Spirit and wouldn’t have a problem putting it into a deck even early in the campaign.) I think most players don’t want to use their experience points on Exile cards either, so I wish there were other places I could spend my experience points in the Survivor class than the exile cards.

Guiding Spirit

Closing Thoughts

I’ve spent this article picking on Survivor cards but while there is a lot of evidence suggesting this class is in need of the most work there’s also evidence that people don’t like the overall design of the Mystic and Rogue classes, either. I think people’s main complaint with the Mystic class is that deck building with Mystics feels stale; there are some key cards that every Mystic deck includes and thus they all start to look the same. I think this is a valid point, and one good way to fix this would be to make a Mystic permanent granting another Arcane slot. That would make more Mystic players willing to look beyond the Shrivelling/Rite of Seeking/Mists of R’lyeh staples.

Shrivelling

As for Rogues, I don’t see why Rogues get hate. The cluster analysis suggests there’s a class of player that loves Rogues. I think that Rogue-hate stems from a belief that Rogues are not good enough at investigating/fighting, or a general lack of interest in evading enemies (which Rogues should do well at).

Rite of Seeking

Seekers and Guardians are great classes and don’t need much work. If anything, those classes are too good. But no complaints from me.

Mists of R'lyeh

But I stand by my conclusion that Survivors need work, and that what they need are high-level cards. I think it is possible to give Survivors high-level cards while keeping with the ethos of the class. In keeping with the “Survivors’ strengths are innate and well-rounded,” I think high-level Survivor cards could consist of events, skill cards, and non-item and non-ally assets. For instance, perhaps a Level 5 permanent called “Survivalism” that allows a survivor to spend two resources to boost any skill, or a high-level Dark Horse that gives Survivors +2 to all their skills when they have no resources or assets. (I’m spitballing here, guys; I’m sure these card ideas suck.)

I bet there are probably die-hard Survivor fans who will bristle at my criticisms of the class. To them I have to ask: do you like the fact that you don’t have the option to buy high-level Survivor cards? I certainly don’t, and I hope that “Survivor cards level 0-5” will no longer be the joke it currently is.

Tableu deck popularity, Survivor emphasis

http://ntguardian.wordpress.com/?p=3561

Extensions

Organizing R Research Projects: CPAT, A Case Study

ntguardian Feb 4, 2019

Introduction Months ago, I asked a question to the community: how should I organize my R research projects? After writing that post, doing some reading, then putting a plan in practice, I now have my own answer. First, some background. In the early months of 2016 I began a research project with my current Ph.D.…Read more Organizing R Research Projects: CPAT, A Case Study

Show full content

Introduction

Months ago, I asked a question to the community: how should I organize my R research projects? After writing that post, doing some reading, then putting a plan in practice, I now have my own answer.

First, some background. In the early months of 2016 I began a research project with my current Ph.D. advisor that involved extensive coding and spanned over at least two years. My code was poorly organized and thus problematic, as managing the chaos and extending the code became difficult. Meanwhile, I was reading articles by programmers and researchers about ways to organize R code so that research results are reproducible, distributable, and extensible. I identified two different approaches to organizing a project to meet these goals: one centered around makefiles, and another around package development. Given these competing approaches and their differing advantages, I was unsure what to do.

Since writing that post, I did more reading. First, I read two of Hadley Wickham’s (excellent) books: R Packages and Advanced R. (I loved R Packages so much I bought a physical copy.) I also read a book I picked up in a Humble Bundle book sale called Code Craft; The Practice of Writing Excellent Code by Pete Goodliffe for learning about good coding practices. Finally, I read a good portion of the GNU make manual.

I also spent months restructuring the project to comply with what I learned. Many, many hours were spent just fixing the mess I had made by not doing things right in the first place.

The result is CPAT, an R package implementing some change point analysis statistical tests. What CPAT does will be the subject of a future post (it will be published when the accompanying paper is made available online); what I want to focus on in this article is how I learned to organize an R research project, and how that culminated in CPAT.

Executable vs. Package: Why Not Both?

In the earlier article I presented two approaches that I suggested were “competing” approaches to organizing a research project: the project as executable approach of Jon Zelner and the project as package approach of Csefalvay and Flight. Both approaches, in my view, possessed unique advantages, but seemed to be at odds.

They are not at odds. CPAT demonstrates that it is possible to view an R project as both an executable and as a package. That said, the package development approach becomes dominant; making the package executable (from the command line) is an additional feature that makes the project even more portable and extensible.

If one is going to adopt the package development approach, one must use the hierarchy R packages needs. So that means:

R code that defines the package (which are mostly just functions) is placed in the R/ directory.
Documentation is placed in the man/ directory (if you’re using roxygen2 and devtools like a sane human being, though, this is something you won’t do yourself, though).
Project data goes in the data/ directory.
Compiled code from other languages (such as C++ when using Rcpp) goes in the src/ directory.
Code tests—which are not optional and must be written!—go in the tests/ directory (but if you’re using testthat for your testing then the tests you actually wrote go in tests/testthat/).
Long-form documentation goes in vignettes/. This could be the paper itself, if written in the form of a vignette.
Other important files should be placed in a reasonably-organized inst/ directory, to be installed with the package, along with other files that should be installed into the base directory (such as Makefile). For example, I put all my plots in inst/plots/, and this would also be a good directory to put the paper that accompanies the project.
Put executable scripts, including R scripts, in exec/.

The approach championed by Zelner doesn’t require a particular organizational style but simply that there be a coherent organization to the project. R package development not only has a coherent structure but even enforces it. If that structure doesn’t quite work, then one can add other files and directories as needed and note them in the .Rbuildignore file, so they’re ignored when the package is built.

When writing an R package, the relevant R tools basically enforce some essential points of style such as documenting objects. Also, the developer-researcher starts to think of important functionality of the project in terms of reusable functions that should be added to the package to be called by the scripts that actually execute the analysis—with documentation and everything else. Having well-documented functions, even if they serve a minor purpose, helps greatly in making the project more easily understood and written not only by others but by the original author as well. In my case, since I wrote CPAT almost exclusively with vim, I wrote a UltiSnips snippet creating a function skeleton that not only defines the function but automatically adds the framework of the documentation, as seen below.

UltiSnipsDemo

While package development does place (helpful) constraints, it does not specify everything. In other words, there is room for style. I essentially define style to mean any aspect of programming in which a choice is made that was not determined by the programming language or software. Examples of style include naming conventions, indentation, etc. Consistent style makes for understandable code; having consistent style is arguably more important than the stylistic decisions made. So I decided to codify my own stylistic preferences in a style guide, and when I did my code rewrite I made the new code comply with my style guide, even if that aded more time to editing. Whenever I encountered a new “decision point” (such as, say, dataset naming conventions), I committed my decision to the style guide.

As I mentioned above, the package development approach turns out not to be mutually exclusive with the project-as-executable approach. While it seems like documentation on R package development (including Dr. Wickham’s book) mentions the exec/ directory of a package only in passing, I found it to be a good place to place executable R scripts. Similarly, make can still be used to automate analysis tasks; R packages allow for including make files.

So in addition to the files that essentially defined the package, I also wrote stand-alone, command line executable R scripts and placed them in the exec/ directory (which causes them to be flagged as “executable” when the package is installed). I wrote a Vim template file for R scripts that provides a skeleton for making the package executable from the command line. That template is listed below:

#!/usr/bin/Rscript
################################################################################
# MyFile.R
################################################################################
# 2018-12-31 (last modified date)
# John Doe (author)
################################################################################
# This is a one-line description of the file.
################################################################################

# optparse: A package for handling command line arguments
if (!suppressPackageStartupMessages(require("optparse"))) {
  install.packages("optparse")
  require("optparse")
}

################################################################################
# MAIN FUNCTION DEFINITION
################################################################################

main <- function(foo, bar, help = FALSE) {
  # This function will be executed when the script is called from the command
  # line; the help parameter does nothing, but is needed for do.call() to work

  quit()
}

################################################################################
# INTERFACE SETUP
################################################################################

if (sys.nframe() == 0) {
  cl_args <- parse_args(OptionParser(
        description = "This is a template for executable R scripts.",
        option_list = list(
          make_option(c("--foo", "-f"), type = "integer", default = 0,
                      help = "A command-line argument"),
          make_option(c("--bar", "-b"), type = "character",
                      help = "Another command-line argument")
        )
      ))

  do.call(main, cl_args)
}

Converting my scripts into modularized, executable programs was, not surprisingly, very time consuming, and the transition was not perfect; some scripts just could not be modularized well. Nevertheless, the end result was likely worth it, and I could then write a Makefile defining how the pieces fit together. This tamed the complexity of the project and made it more reproducible; someone looking to repeat my analysis should only have to type make in a Linux terminal1 to see the results themselves.

While I did make my project modular and executable, though, I did not try to make it stand-alone with, say, packrat or Docker. I did try to use packrat, even setting it up to work with my package. But I ran into severe problems when I tried to work with my package at the University of Utah Mathematics Department, since the computer system’s R installation is almost four years old as of this writing and highly tempermental due to how the system administrator set it up. packrat made complications working with the department computers even worse, and I disabled it in a huff one day and never looked back. As for Docker or GitLab, I did not want my project tied up with proprietary or web-based services, and I felt that the end result Zelner was seeking when using these services is overkill; when you’ve added packrat (which I didn’t because of complications, but still) and defined how the project pieces fit together with make, you’ve mostly conquered the reproducibility problem, in my view. So I never missed these services.

The end result of this work can be seen in the paper branch of CPAT, also permanently available in this tarball. The directory tree is also informative.

In Practice

In some sense the end goal is to have an R package that could be distributed to others via, say, CRAN, so they can use the methods you employed and developed, not just reproduce your research; at least, that’s the case for me, a mathematical statistician interested in analyzing and developing statistical tests and procedures. When a package is written to contain research and not just for software distribution, it comes with a lot of files that aren’t needed for the package to function; just look at the dirctory tree!

The solution is to just delete the files that can be recreated—perhaps with make clean if you set it up right—and consider adding other files to .Rbuildignore when you want to distribute the package for others to use. So this isn’t actually a big problem.

Another issue that I encountered and am still unsure about are functions that are useful to the project but not useful outside of it. If you look through the paper branch manual or even the public version manual you will find functions that were useful only for the project, perhaps for converting data structures created by scripts or making particular plots that make sense only for the paper. They’re all private functions that need to be accessed via the ::: operator, yet they’re still in the manual.

I’m undecided whether this is good style. On the one hand, it’s nice that when others read your code there’s manual entries even for functions that are local to the project to further document what was done and how the code works. Even when distributing the software, having every function documented, even ones that are “private” to the package, seems to be in concordance with the spirit of open source software, making the source code easier understood by users who need and want to know how your software works. It also could serve as a good way to modularize documentation; a statistical formula is kept with the function that computes it rather than the interface to that function (which likely links to that underlying function). Having examples for those internal functions also should provide an additional layer of testing and helps when others want to extend the package.

On the other hand… most of the pages of the manual are devoted to functions the user isn’t supposed to be calling directly in their work. Of all those functions, maybe five are functions the user is expected to use. Should all that documentation space be devoted to something the user doesn’t use?

While I’m not set in my opinion, I lean to having more documentation rather than less, even if most of it is for private functions. After all, it’s useful to me when I’m developing the project and package.

Conclusion

I feel like spending those months to make my project logical and reporducible was time well spent. Not only did I learn a lot in the process, I had a useful end product that is now available on CRAN. Additionally, this project is not over; my advisor and I are continuing to work on extending the results that lead to the creation of this package in the first place, which will call for more simulation experiments. Now that I’ve organized my work I now have a good base for continuing that work.

I hope that this article inspired others on how to organize their R research projects. Gauging from reactions to my previous article, I think this is an underappreciated topic, unfortunately. Having a plan for managing package complexity and organization goes a long way to keeping your work under control and helps others appreciate what you’ve done. It also can lead to your work having a greater impact since others can use it as well.

I got a lot of good feedback from my previous article. I look forward to hearing what the community has to say now. I’m always open to suggestion.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

Sadly my project is tied closely to the Unix/Linux setup; I have no idea how well it would work on Windows and I’m not very interested in making the project easy for Windows use (despite having Windows 10 installed on my primary laptop). What this means is that the goal of full reproducibility isn’t met for most Windows users, a large market of users. That said… if you’re a Windows user, just download VirtualBox for free, download and install Ubuntu or some other Linux distribution you like, install R and the needed packages, and now you can reproduce my work. You may even discover why I profer to work in a Linux environment yourself. ↩

http://ntguardian.wordpress.com/?p=3527

Extensions

Problems in Estimating GARCH Parameters in R (Part 2; rugarch)

ntguardian Jan 28, 2019

Introduction Now here is a blog post that has been sitting on the shelf far longer than it should have. Over a year ago I wrote an article about problems I was having when estimating the parameters of a GARCH(1,1) model in R. I documented the behavior of parameter estimates (with a focus on )…Read more Problems in Estimating GARCH Parameters in R (Part 2; rugarch)

Show full content

Introduction

Now here is a blog post that has been sitting on the shelf far longer than it should have. Over a year ago I wrote an article about problems I was having when estimating the parameters of a GARCH(1,1) model in R. I documented the behavior of parameter estimates (with a focus on $\beta$ ) and perceived pathological behavior when those estimates are computed using fGarch. I called for help from the R community, including sending out the blog post over the R Finance mailing list.

I was not disappointed in the feedback. You can see some mailing list feedback, and there were some comments on Reddit that were helpful, but I think the best feedback I got was through my own e-mail.

Dr. Brian G. Peterson, a member of the R finance community, sent some thought provoking e-mails. The first informed me that fGarch is no longer the go-to package for working with GARCH models. The RMetrics suite of packages (which include fGarch) was maintained by Prof. Diethelm Würtz at ETH Zürich. He was killed in a car accident in 2016.

Dr. Peterson recommended I look into two more modern packages for GARCH modelling, rugarch (for univariate GARCH models) and rmgarch (for multivariate GARCH models). I had not heard of these packages before (the reason I was aware of fGarch was because it was referred to in the time series textbook Time Series Analysis and Its Applications with R Examples by Shumway and Stoffer), so I’m very thankful for the suggestion. Since I’m interested in univariate time series for now, I only looked at rugarch. The package appears to have more features and power than fGarch, which may explain why it seems more difficult to use. However the package’s vignette is helpful and worth printing out.

Dr. Peterson also had interesting comments about my proposed applications. He argued that intraday data should be preferred to daily data and that simulated data (including simulated GARCH processes) has idiosyncracies not seen in real data. The ease of getting daily data (particularly for USD/JPY around the time of Asian financial crises, which was an intended application of a test statistic I’m studying) motivated my interest in daily data. His comments, though, may lead me to reconsider this application.1 (I might try to detect the 2010 eurozone financial crises via EUR/USD instead. I can get free intraday data from HistData.com for this.) However, if standard error estimates cannot be trusted for small sample sizes, our test statistic would still be in trouble since it involves estimating parameters even for small sample sizes.

He also warned that simulated data exhibits behaviors not seen in real data. That may be true, but simulated data is important since it can be considered a statistician’s best-case scenario. Additionally, the properties of the process that generated simulated data are known a priori, including the values of the generating parameters and whether certain hypotheses (such as whether there is a structural change in the series) are true. This allows for sanity checks of estimators and tests. This is impossible for real-world since we don’t have the a priori knowledge needed.

Prof. André Portela Santos asked that I repeat the simulations but with $\alpha 0.6$ since these values are supposedly more common than my choice of $\alpha = \beta = 0.2$ . It’s a good suggestion and I will consider parameters in this range in addition to $\alpha = \beta = 0.2$ in this post. However, my simulations seemed to suggest that when $\alpha = \beta = 0.2$ , the estimation procedures nevertheless seem to want to be near the range of large $\beta$ . I’m also surprised since my advisor gave me the impression that GARCH processes with either $\alpha$ or $\beta$ large are more difficult to work with. Finally, if the estimators are strongly biased, we might expect to see most estimated parameters to lie in that range, though that does not mean the “correct” values lie in that range. My simulations suggest fGarch struggles to discover $\alpha = \beta = 0.2$ even when those parameters are “true.”” Prof. Santos’ comment leads me to desire a metastudy about what common estimates of GARCH parameters are on real world. (There may or may not be one; I haven’t checked. If anyone knows of one, please share.)

My advisor contacted another expert on GARCH models and got some feedback. Supposedly the standard error for $\beta$ is large, so there should be great variation in parameter estimates. Some of my simulations agreed with this behavior even for small sample sizes, but at the same time showed an uncomfortable bias towards $\beta = 0$ and $\beta = 1$ . This might be a consequence of the optimization procedures, as I hypothesized.

So given this feedback, I will be conducting more simulation experiments. I won’t be looking at fGarch or tseries anymore; I will be working exclusively with rugarch. I will explore different optimization procedures supported by the package. I won’t be creating plots like I did in my first post; those plots were meant only to show the existence of a problem and its severity. Instead I will be looking at properties of the resulting estimators produced by different optimization procedures.

Introduction to rugarch

As mentioned above, rugarch is a package for working with GARCH models; a major use case is estimating their parameters, obviously. Here I will demonstrate how to specify a GARCH model, simulate data from the model, and estimate parameters. After this we can dive into simulation studies.

library(rugarch)

## Loading required package: parallel
## 
## Attaching package: 'rugarch'
## The following object is masked from 'package:stats':
## 
##     sigma

Specifying a $\text{GARCH}(1, 1)$ model

To work with a GARCH model we need to specify it. The function for doing this is ugarchspec(). I think the parameters variance.model and mean.model are the most important parameters.

variance.model is a list with named entries, perhaps the two most interesting being model and garchOrder. model is a string specify which type of GARCH model is being fitted. Many major classes of GARCH models (such as EGARCH, IGARCH, etc.) are supported; for the “vanilla” GARCH model, set this to "sGARCH" (or just omit it; the standard model is the default). garchOrder is a vector for the order of the ARCH and GARCH components of the model.

mean.model allows for fitting ARMA-GARCH models, and functions like variance.model in that it accepts a list of named entries, the most interesting being armaOrder and include.mean. armaOrder is like garchOrder; it’s a vector specifying the order of the ARMA model. include.mean is a boolean that, if true, allows for the ARMA part of the model to have non-zero mean.

When simulating a process, we need to set the values of our parameters. This is done via the fixed.pars parameter, which accepts a list of named elements, the elements of the list being numeric. They need to fit the conventions the function uses for parameters; for example, if we want to set the parameters of a $\text{GARCH}(1,1)$ model, the names of our list elements should be "alpha1" and "beta1". If the plan is to simulate a model, every parameter in the model should be set this way.

There are other parameters interesting in their own right but I focus on these since the default specification is an ARMA-GARCH model with ARMA order of $(1,1)$ with non-zero mean and a GARCH model of order $(1,1)$ . This is not a vanilla $\text{GARCH}(1,1)$ model as I desire, so I almost always change this.

spec1 <- ugarchspec(mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
                    fixed.pars = list("omega" = 0.2, "alpha1" = 0.2,
                                      "beta1" = 0.2))
spec2 <- ugarchspec(mean.model = list(armaOrder = c(0,0), include.mean = FALSE),
                    fixed.pars = list("omega" = 0.2, "alpha1" = 0.1,
                                      "beta1" = 0.7))

show(spec1)

## 
## *---------------------------------*
## *       GARCH Model Spec          *
## *---------------------------------*
## 
## Conditional Variance Dynamics 	
## ------------------------------------
## GARCH Model		: sGARCH(1,1)
## Variance Targeting	: FALSE 
## 
## Conditional Mean Dynamics
## ------------------------------------
## Mean Model		: ARFIMA(0,0,0)
## Include Mean		: FALSE 
## GARCH-in-Mean		: FALSE 
## 
## Conditional Distribution
## ------------------------------------
## Distribution	:  norm 
## Includes Skew	:  FALSE 
## Includes Shape	:  FALSE 
## Includes Lambda	:  FALSE

show(spec2)

## 
## *---------------------------------*
## *       GARCH Model Spec          *
## *---------------------------------*
## 
## Conditional Variance Dynamics 	
## ------------------------------------
## GARCH Model		: sGARCH(1,1)
## Variance Targeting	: FALSE 
## 
## Conditional Mean Dynamics
## ------------------------------------
## Mean Model		: ARFIMA(0,0,0)
## Include Mean		: FALSE 
## GARCH-in-Mean		: FALSE 
## 
## Conditional Distribution
## ------------------------------------
## Distribution	:  norm 
## Includes Skew	:  FALSE 
## Includes Shape	:  FALSE 
## Includes Lambda	:  FALSE

Simulating a GARCH process

The function ugarchpath() simulates GARCH models specified via ugarchspec(). The function needs a specification objectect created by ugarchspec() first. The parameters n.sim and n.start specify the size of the process and the length of the burn-in period, respectively (with defaults 1000 and 0, respectively; I strongly recommend setting the burn-in period to at least 500, but I go for 1000). The function creates an object that contains not only the simulated series but also residuals and $\sigma_t$ .

The rseed parameter controls the random seed the function uses for generating data. Be warned that set.seed() is effectively ignored by this function, so if you want consistent results, you will need to set this parameter.

The plot() method accompanying these objects is not completely transparent; there are a few plots it could create and when calling plot() on a uGARCHpath object in the command line users are prompted to input a number corresponding to the desired plot. This is a pain sometimes so don’t forget to pass the desired plot’s number to the which parameter to avoid the prompt; setting which = 2 will give the plot of the series proper.

old_par <- par()
par(mfrow = c(2, 2))

x_obj <- ugarchpath(spec1, n.sim = 1000, n.start = 1000, rseed = 111217)
show(x_obj)

## 
## *------------------------------------*
## *     GARCH Model Path Simulation    *
## *------------------------------------*
## Model: sGARCH
## Horizon:  1000
## Simulations: 1
##                 Seed Sigma2.Mean Sigma2.Min Sigma2.Max Series.Mean
## sim1          111217       0.332      0.251      0.915    0.000165
## Mean(All)          0       0.332      0.251      0.915    0.000165
## Unconditional     NA       0.333         NA         NA    0.000000
##               Series.Min Series.Max
## sim1               -1.76       1.62
## Mean(All)          -1.76       1.62
## Unconditional         NA         NA

for (i in 1:4) {
  plot(x_obj, which = i)
}

par(old_par)

## Warning in par(old_par): graphical parameter "cin" cannot be set
## Warning in par(old_par): graphical parameter "cra" cannot be set
## Warning in par(old_par): graphical parameter "csi" cannot be set
## Warning in par(old_par): graphical parameter "cxy" cannot be set
## Warning in par(old_par): graphical parameter "din" cannot be set
## Warning in par(old_par): graphical parameter "page" cannot be set

# The actual series
x1 <- x_obj@path$seriesSim
plot.ts(x1)

Fitting a $\text{GARCH}(1,1)$ model

The ugarchfit() function fits GARCH models. The function needs a specification and a dataset. The solver parameter accepts a string stating which numerical optimizer to use to find the parameter estimates. Most of the parameters of the function manage interfacing with the numerical optimizer. In particular, solver.control can be given a list of arguments to pass to the optimizer. We will be looking at this in more detail later.

The specification used for generating the simulated data won’t be appropriate for ugarchfit(), since it contains fixed values for its parameters. In my case I will need to create a second specification object.

spec <- ugarchspec(mean.model = list(armaOrder = c(0, 0), include.mean = FALSE))
fit <- ugarchfit(spec, data = x1)
show(fit)

## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics 	
## -----------------------------------
## GARCH Model	: sGARCH(1,1)
## Mean Model	: ARFIMA(0,0,0)
## Distribution	: norm 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error    t value Pr(>|t|)
## omega   0.000713    0.001258    0.56696  0.57074
## alpha1  0.002905    0.003714    0.78206  0.43418
## beta1   0.994744    0.000357 2786.08631  0.00000
## 
## Robust Standard Errors:
##         Estimate  Std. Error    t value Pr(>|t|)
## omega   0.000713    0.001217    0.58597  0.55789
## alpha1  0.002905    0.003661    0.79330  0.42760
## beta1   0.994744    0.000137 7250.45186  0.00000
## 
## LogLikelihood : -860.486 
## 
## Information Criteria
## ------------------------------------
##                    
## Akaike       1.7270
## Bayes        1.7417
## Shibata      1.7270
## Hannan-Quinn 1.7326
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      3.998 0.04555
## Lag[2*(p+q)+(p+q)-1][2]     4.507 0.05511
## Lag[4*(p+q)+(p+q)-1][5]     9.108 0.01555
## d.o.f=0
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic   p-value
## Lag[1]                      29.12 6.786e-08
## Lag[2*(p+q)+(p+q)-1][5]     31.03 1.621e-08
## Lag[4*(p+q)+(p+q)-1][9]     32.26 1.044e-07
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]     1.422 0.500 2.000  0.2331
## ARCH Lag[5]     2.407 1.440 1.667  0.3882
## ARCH Lag[7]     2.627 2.315 1.543  0.5865
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  0.9518
## Individual Statistics:             
## omega  0.3296
## alpha1 0.2880
## beta1  0.3195
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:     	 0.846 1.01 1.35
## Individual Statistic:	 0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value      prob sig
## Sign Bias           0.3946 6.933e-01    
## Negative Sign Bias  3.2332 1.264e-03 ***
## Positive Sign Bias  4.2142 2.734e-05 ***
## Joint Effect       28.2986 3.144e-06 ***
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     20.28       0.3779
## 2    30     26.54       0.5965
## 3    40     36.56       0.5817
## 4    50     47.10       0.5505
## 
## 
## Elapsed time : 2.60606

par(mfrow = c(3, 4))
for (i in 1:12) {
  plot(fit, which = i)
}

## 
## please wait...calculating quantiles...

par(old_par)

## Warning in par(old_par): graphical parameter "cin" cannot be set
## Warning in par(old_par): graphical parameter "cra" cannot be set
## Warning in par(old_par): graphical parameter "csi" cannot be set
## Warning in par(old_par): graphical parameter "cxy" cannot be set
## Warning in par(old_par): graphical parameter "din" cannot be set
## Warning in par(old_par): graphical parameter "page" cannot be set

Notice the estimated parameters and standard errors? The estimates are nowhere near the “correct” numbers even for a sample size of 1000, and there is no way a reasonable confidence interval based on the estimated standard errors would contain the correct values. It looks like the problems I documented in my last post have not gone away.

Out of curiosity, what would happen with the other specification, one in the range Prof. Santos suggested?

x_obj <- ugarchpath(spec2, n.start = 1000, rseed = 111317)
x2 <- x_obj@path$seriesSim
fit <- ugarchfit(spec, x2)
show(fit)

## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics 	
## -----------------------------------
## GARCH Model	: sGARCH(1,1)
## Mean Model	: ARFIMA(0,0,0)
## Distribution	: norm 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error    t value Pr(>|t|)
## omega   0.001076    0.002501    0.43025  0.66701
## alpha1  0.001992    0.002948    0.67573  0.49921
## beta1   0.997008    0.000472 2112.23364  0.00000
## 
## Robust Standard Errors:
##         Estimate  Std. Error    t value Pr(>|t|)
## omega   0.001076    0.002957    0.36389  0.71594
## alpha1  0.001992    0.003510    0.56767  0.57026
## beta1   0.997008    0.000359 2777.24390  0.00000
## 
## LogLikelihood : -1375.951 
## 
## Information Criteria
## ------------------------------------
##                    
## Akaike       2.7579
## Bayes        2.7726
## Shibata      2.7579
## Hannan-Quinn 2.7635
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                     0.9901  0.3197
## Lag[2*(p+q)+(p+q)-1][2]    1.0274  0.4894
## Lag[4*(p+q)+(p+q)-1][5]    3.4159  0.3363
## d.o.f=0
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      3.768 0.05226
## Lag[2*(p+q)+(p+q)-1][5]     4.986 0.15424
## Lag[4*(p+q)+(p+q)-1][9]     7.473 0.16272
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]    0.2232 0.500 2.000  0.6366
## ARCH Lag[5]    0.4793 1.440 1.667  0.8897
## ARCH Lag[7]    2.2303 2.315 1.543  0.6686
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  0.3868
## Individual Statistics:             
## omega  0.2682
## alpha1 0.2683
## beta1  0.2669
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:     	 0.846 1.01 1.35
## Individual Statistic:	 0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value   prob sig
## Sign Bias           0.5793 0.5625    
## Negative Sign Bias  1.3358 0.1819    
## Positive Sign Bias  1.5552 0.1202    
## Joint Effect        5.3837 0.1458    
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     24.24       0.1871
## 2    30     30.50       0.3894
## 3    40     38.88       0.4753
## 4    50     48.40       0.4974
## 
## 
## Elapsed time : 2.841597

That’s no better Now let’s see what happens when we use different optimization routines.

Optimization and Parameter Estimation in rugarch Choice of optimizer

ugarchfit()‘s default parameters did a good job of finding appropriate parameters for what I will refer to as model 2 (where $\alpha = 0.1$ and $\beta = 0.7$ ) but not for model 1 ( $\alpha = \beta = 0.2$ ). What I want to know is when one solver seems to beat another.

As pointed out by Vivek Rao 2 on the R-SIG-Finance mailing list, the “best” estimate is the estimate that maximizes the likelihood function (or, equivalently, the log-likelihood function), and I omitted inspecting the log likelihood function’s values in my last post. Here I will see which optimization procedures lead to the maximum log-likelihood.

Below is a helper function that simplifies the process of fitting a GARCH model’s parameters and extracting the log-likelihood, parameter values, and standard errors while allowing for different values to be passed to solver and solver.control.

evalSolverFit <- function(spec, data, solver = "solnp",
                          solver.control = list()) {
  # Calls ugarchfit(spec, data, solver, solver.control), and returns a vector
  # containing the log likelihood, parameters, and parameter standard errors.
  # Parameters are equivalent to those seen in ugarchfit(). If the solver fails
  # to converge, NA will be returned
  
  vec <- NA
  tryCatch({
      fit <- ugarchfit(spec = spec, data = data, solver = solver,
                       solver.control = solver.control)
      
      coef_se_names <- paste("se", names(fit@fit$coef), sep = ".")
      se <- fit@fit$se.coef
      names(se) <- coef_se_names
      
      robust_coef_se_names <- paste("robust.se", names(fit@fit$coef), sep = ".")
      robust.se <- fit@fit$robust.se.coef
      names(robust.se) <- robust_coef_se_names
      
      vec <- c(fit@fit$coef, se, robust.se)
      vec["LLH"] <- fit@fit$LLH
    }, error = function(w) {
      NA
  })
  
  return(vec)
}

Below I list out all optimization schemes I will consider. I only fiddle with solver.control, but there may be other parameters that could help the numerical optimization routines, namely numderiv.control, which are control arguments passed to the numerical routines responsible for standard error computation. This utilizes the package numDeriv which performs numerical differentiation.

solvers <- list(  # A list of lists where each sublist contains parameters to
                  # pass to a solver
  list("solver" = "nlminb", "solver.control" = list()),
  list("solver" = "solnp", "solver.control" = list()),
  list("solver" = "lbfgs", "solver.control" = list()),
  list("solver" = "gosolnp", "solver.control" = list(
    "n.restarts" = 100, "n.sim" = 100
  )),
  list("solver" = "hybrid", "solver.control" = list()),
  list("solver" = "nloptr", "solver.control" = list("solver" = 1)),  # COBYLA
  list("solver" = "nloptr", "solver.control" = list("solver" = 2)),  # BOBYQA
  list("solver" = "nloptr", "solver.control" = list("solver" = 3)),  # PRAXIS
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 4)),  # NELDERMEAD
  list("solver" = "nloptr", "solver.control" = list("solver" = 5)),  # SBPLX
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 6)),  # AUGLAG+COBYLA
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 7)),  # AUGLAG+BOBYQA
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 8)),  # AUGLAG+PRAXIS
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 9)),  # AUGLAG+NELDERMEAD
  list("solver" = "nloptr",
       "solver.control" = list("solver" = 10))  # AUGLAG+SBPLX
)

tags <- c(  # Names for the above list
  "nlminb",
  "solnp",
  "lbfgs",
  "gosolnp",
  "hybrid",
  "nloptr+COBYLA",
  "nloptr+BOBYQA",
  "nloptr+PRAXIS",
  "nloptr+NELDERMEAD",
  "nloptr+SBPLX",
  "nloptr+AUGLAG+COBYLA",
  "nloptr+AUGLAG+BOBYQA",
  "nloptr+AUGLAG+PRAXIS",
  "nloptr+AUGLAG+NELDERMEAD",
  "nloptr+AUGLAG+SBPLX"
)

names(solvers) <- tags

Now let’s run the gauntlet of optimization choices and see which produces the estimates with the largest log likelihood for data generated by model 1. The lbfgs method (low-storage version of the Broyden-Fletcher-Goldfarb-Shanno method, provided in nloptr) unfortunately does not converge for this series, so I omit it.

optMethodCompare <- function(data, spec, solvers) {
  # Runs all solvers in a list for a dataset
  #
  # Args:
  #   data: An object to pass to ugarchfit's data parameter containing the data
  #         to fit
  #   spec: A specification created by ugarchspec to pass to ugarchfit
  #   solvers: A list of lists containing strings of solvers and a list for
  #            solver.control
  #
  # Return:
  #   A matrix containing the result of the solvers (including parameters, se's,
  #   and LLH)
  
  model_solutions <- lapply(solvers, function(s) {
    args <- s
    args[["spec"]] <- spec
    args[["data"]] <- data
    res <- do.call(evalSolverFit, args = args)
    return(res)
  })
  model_solutions <- do.call(rbind, model_solutions)
  
  return(model_solutions)
}

round(optMethodCompare(x1, spec, solvers[c(1:2, 4:15)]), digits = 4)

##                              omega   alpha1    beta1   se.omega   se.alpha1   se.beta1   robust.se.omega   robust.se.alpha1   robust.se.beta1         LLH
## -------------------------  -------  -------  -------  ---------  ----------  ---------  ----------------  -----------------  ----------------  ----------
## nlminb                      0.2689   0.1774   0.0000     0.0787      0.0472     0.2447            0.0890             0.0352            0.2830   -849.6927
## solnp                       0.0007   0.0029   0.9947     0.0013      0.0037     0.0004            0.0012             0.0037            0.0001   -860.4860
## gosolnp                     0.2689   0.1774   0.0000     0.0787      0.0472     0.2446            0.0890             0.0352            0.2828   -849.6927
## hybrid                      0.0007   0.0029   0.9947     0.0013      0.0037     0.0004            0.0012             0.0037            0.0001   -860.4860
## nloptr+COBYLA               0.0006   0.0899   0.9101     0.0039      0.0306     0.0370            0.0052             0.0527            0.0677   -871.5006
## nloptr+BOBYQA               0.0003   0.0907   0.9093     0.0040      0.0298     0.0375            0.0057             0.0532            0.0718   -872.3436
## nloptr+PRAXIS               0.2689   0.1774   0.0000     0.0786      0.0472     0.2444            0.0888             0.0352            0.2823   -849.6927
## nloptr+NELDERMEAD           0.0010   0.0033   0.9935     0.0013      0.0039     0.0004            0.0013             0.0038            0.0001   -860.4845
## nloptr+SBPLX                0.0010   0.1000   0.9000     0.0042      0.0324     0.0386            0.0055             0.0536            0.0680   -872.2736
## nloptr+AUGLAG+COBYLA        0.0006   0.0899   0.9101     0.0039      0.0306     0.0370            0.0052             0.0527            0.0677   -871.5006
## nloptr+AUGLAG+BOBYQA        0.0003   0.0907   0.9093     0.0040      0.0298     0.0375            0.0057             0.0532            0.0718   -872.3412
## nloptr+AUGLAG+PRAXIS        0.1246   0.1232   0.4948     0.0620      0.0475     0.2225            0.0701             0.0439            0.2508   -851.0547
## nloptr+AUGLAG+NELDERMEAD    0.2689   0.1774   0.0000     0.0786      0.0472     0.2445            0.0889             0.0352            0.2826   -849.6927
## nloptr+AUGLAG+SBPLX         0.0010   0.1000   0.9000     0.0042      0.0324     0.0386            0.0055             0.0536            0.0680   -872.2736

According the the maximum likelihood criterion, the “best” result is achieved by gosolnp. The result has the unfortunate property that $\beta \approx 0$ , which is certainly not true, but at least the standard error for $\beta$ would create a confidence interval that contains $\beta$ ‘s true value. Of these, my preferred estimates are produced by AUGLAG+PRAXIS, as $\beta$ seems reasonable and in fact the estimates are all close to the truth, (at least in the sense that the confidence intervals contain the true values), but unfortunately the estimates do not maximize the log likelihood, even though they are the most reasonable.

If we looked at model 2, what do we see? Again, lbfgs does not converge so I omit it. Unfortunately, nlminb does not converge either, so it too must be omitted.

round(optMethodCompare(x2, spec, solvers[c(2, 4:15)]), digits = 4)

##                              omega   alpha1    beta1   se.omega   se.alpha1   se.beta1   robust.se.omega   robust.se.alpha1   robust.se.beta1         LLH
## -------------------------  -------  -------  -------  ---------  ----------  ---------  ----------------  -----------------  ----------------  ----------
## solnp                       0.0011   0.0020   0.9970     0.0025      0.0029     0.0005            0.0030             0.0035            0.0004   -1375.951
## gosolnp                     0.0011   0.0020   0.9970     0.0025      0.0029     0.0005            0.0030             0.0035            0.0004   -1375.951
## hybrid                      0.0011   0.0020   0.9970     0.0025      0.0029     0.0005            0.0030             0.0035            0.0004   -1375.951
## nloptr+COBYLA               0.0016   0.0888   0.9112     0.0175      0.0619     0.0790            0.0540             0.2167            0.2834   -1394.529
## nloptr+BOBYQA               0.0010   0.0892   0.9108     0.0194      0.0659     0.0874            0.0710             0.2631            0.3572   -1395.310
## nloptr+PRAXIS               0.5018   0.0739   0.3803     0.3178      0.0401     0.3637            0.2777             0.0341            0.3225   -1373.632
## nloptr+NELDERMEAD           0.0028   0.0026   0.9944     0.0028      0.0031     0.0004            0.0031             0.0035            0.0001   -1375.976
## nloptr+SBPLX                0.0029   0.1000   0.9000     0.0146      0.0475     0.0577            0.0275             0.1108            0.1408   -1395.807
## nloptr+AUGLAG+COBYLA        0.0016   0.0888   0.9112     0.0175      0.0619     0.0790            0.0540             0.2167            0.2834   -1394.529
## nloptr+AUGLAG+BOBYQA        0.0010   0.0892   0.9108     0.0194      0.0659     0.0874            0.0710             0.2631            0.3572   -1395.310
## nloptr+AUGLAG+PRAXIS        0.5018   0.0739   0.3803     0.3178      0.0401     0.3637            0.2777             0.0341            0.3225   -1373.632
## nloptr+AUGLAG+NELDERMEAD    0.0001   0.0000   1.0000     0.0003      0.0003     0.0000            0.0004             0.0004            0.0000   -1375.885
## nloptr+AUGLAG+SBPLX         0.0029   0.1000   0.9000     0.0146      0.0475     0.0577            0.0275             0.1108            0.1408   -1395.807

Here it was PRAXIS and AUGLAG+PRAXIS that gave the “optimal” result, and it was only those two methods that did. Other optimizers gave visibly bad results. That said, the “optimal” solution is the preferred on with the parameters being nonzero and their confidence intervals containing the correct values.

What happens if we restrict the sample to size 100? (lbfgs still does not work.)

round(optMethodCompare(x1[1:100], spec, solvers[c(1:2, 4:15)]), digits = 4)

##                              omega   alpha1    beta1   se.omega   se.alpha1   se.beta1   robust.se.omega   robust.se.alpha1   robust.se.beta1        LLH
## -------------------------  -------  -------  -------  ---------  ----------  ---------  ----------------  -----------------  ----------------  ---------
## nlminb                      0.0451   0.2742   0.5921     0.0280      0.1229     0.1296            0.0191             0.0905            0.0667   -80.6587
## solnp                       0.0451   0.2742   0.5921     0.0280      0.1229     0.1296            0.0191             0.0905            0.0667   -80.6587
## gosolnp                     0.0451   0.2742   0.5921     0.0280      0.1229     0.1296            0.0191             0.0905            0.0667   -80.6587
## hybrid                      0.0451   0.2742   0.5921     0.0280      0.1229     0.1296            0.0191             0.0905            0.0667   -80.6587
## nloptr+COBYLA               0.0007   0.1202   0.8798     0.0085      0.0999     0.0983            0.0081             0.1875            0.1778   -85.3121
## nloptr+BOBYQA               0.0005   0.1190   0.8810     0.0085      0.0994     0.0992            0.0084             0.1892            0.1831   -85.3717
## nloptr+PRAXIS               0.0451   0.2742   0.5921     0.0280      0.1229     0.1296            0.0191             0.0905            0.0667   -80.6587
## nloptr+NELDERMEAD           0.0451   0.2742   0.5920     0.0281      0.1230     0.1297            0.0191             0.0906            0.0667   -80.6587
## nloptr+SBPLX                0.0433   0.2740   0.5998     0.0269      0.1237     0.1268            0.0182             0.0916            0.0648   -80.6616
## nloptr+AUGLAG+COBYLA        0.0007   0.1202   0.8798     0.0085      0.0999     0.0983            0.0081             0.1875            0.1778   -85.3121
## nloptr+AUGLAG+BOBYQA        0.0005   0.1190   0.8810     0.0085      0.0994     0.0992            0.0084             0.1892            0.1831   -85.3717
## nloptr+AUGLAG+PRAXIS        0.0451   0.2742   0.5921     0.0280      0.1229     0.1296            0.0191             0.0905            0.0667   -80.6587
## nloptr+AUGLAG+NELDERMEAD    0.0451   0.2742   0.5921     0.0280      0.1229     0.1296            0.0191             0.0905            0.0667   -80.6587
## nloptr+AUGLAG+SBPLX         0.0450   0.2742   0.5924     0.0280      0.1230     0.1295            0.0191             0.0906            0.0666   -80.6587

round(optMethodCompare(x2[1:100], spec, solvers[c(1:2, 4:15)]), digits = 4)

##                              omega   alpha1    beta1   se.omega   se.alpha1   se.beta1   robust.se.omega   robust.se.alpha1   robust.se.beta1         LLH
## -------------------------  -------  -------  -------  ---------  ----------  ---------  ----------------  -----------------  ----------------  ----------
## nlminb                      0.7592   0.0850   0.0000     2.1366      0.4813     3.0945            7.5439             1.7763           11.0570   -132.4614
## solnp                       0.0008   0.0000   0.9990     0.0291      0.0417     0.0066            0.0232             0.0328            0.0034   -132.9182
## gosolnp                     0.0537   0.0000   0.9369     0.0521      0.0087     0.0713            0.0430             0.0012            0.0529   -132.9124
## hybrid                      0.0008   0.0000   0.9990     0.0291      0.0417     0.0066            0.0232             0.0328            0.0034   -132.9182
## nloptr+COBYLA               0.0014   0.0899   0.9101     0.0259      0.0330     0.1192            0.0709             0.0943            0.1344   -135.7495
## nloptr+BOBYQA               0.0008   0.0905   0.9095     0.0220      0.0051     0.1145            0.0687             0.0907            0.1261   -135.8228
## nloptr+PRAXIS               0.0602   0.0000   0.9293     0.0522      0.0088     0.0773            0.0462             0.0015            0.0565   -132.9125
## nloptr+NELDERMEAD           0.0024   0.0000   0.9971     0.0473      0.0629     0.0116            0.0499             0.0680            0.0066   -132.9186
## nloptr+SBPLX                0.0027   0.1000   0.9000     0.0238      0.0493     0.1308            0.0769             0.1049            0.1535   -135.9175
## nloptr+AUGLAG+COBYLA        0.0014   0.0899   0.9101     0.0259      0.0330     0.1192            0.0709             0.0943            0.1344   -135.7495
## nloptr+AUGLAG+BOBYQA        0.0008   0.0905   0.9095     0.0221      0.0053     0.1145            0.0687             0.0907            0.1262   -135.8226
## nloptr+AUGLAG+PRAXIS        0.0602   0.0000   0.9294     0.0523      0.0090     0.0771            0.0462             0.0014            0.0565   -132.9125
## nloptr+AUGLAG+NELDERMEAD    0.0000   0.0000   0.9999     0.0027      0.0006     0.0005            0.0013             0.0004            0.0003   -132.9180
## nloptr+AUGLAG+SBPLX         0.0027   0.1000   0.9000     0.0238      0.0493     0.1308            0.0769             0.1049            0.1535   -135.9175

The results are not thrilling. The “best” result for the series generated by model 1 was attained by multiple solvers, and the 95% confidence interval (CI) for $\omega$ would not contain $\omega$ ‘s true value, though the CIs for the other parameters would contain their true values. For the series generated by model 2 the best result was attained by the nlminb solver; the parameter values are not plausible and the standard errors are huge. At least the CI would contain the correct value.

From here we should no longer stick to two series but see the performance of these methods on many simulated series generated by both models. Simulations in this post will be too computationally intensive for my laptop so I will use my department’s supercomputer to perform them, taking advantage of its many cores for parallelization.

library(foreach)
library(doParallel)

logfile <- ""

# logfile <- "outfile.log"
# if (!file.exists(logfile)) {
#   file.create(logfile)
# }

cl <- makeCluster(detectCores() - 1, outfile = logfile)
registerDoParallel(cl)

optMethodSims <- function(gen_spec, n.sim = 1000, m.sim = 1000,
                          fit_spec = ugarchspec(mean.model = list(
                            armaOrder = c(0,0), include.mean = FALSE)),
                          solvers = list("solnp" = list(
                            "solver" = "solnp", "solver.control" = list())),
                          rseed = NA, verbose = FALSE) {
  # Performs simulations in parallel of GARCH processes via rugarch and returns
  # a list with the results of different optimization routines
  #
  # Args:
  #   gen_spec: The specification for generating a GARCH sequence, produced by
  #             ugarchspec
  #   n.sim: An integer denoting the length of the simulated series
  #   m.sim: An integer for the number of simulated sequences to generate
  #   fit_spec: A ugarchspec specification for the model to fit
  #   solvers: A list of lists containing strings of solvers and a list for
  #            solver.control
  #   rseed: Optional seeding value(s) for the random number generator. For
  #          m.sim>1, it is possible to provide either a single seed to
  #          initialize all values, or one seed per separate simulation (i.e.
  #          m.sim seeds). However, in the latter case this may result in some
  #          slight overhead depending on how large m.sim is
  #   verbose: Boolean for whether to write data tracking the progress of the 
  #            loop into an output file
  #   outfile: A string for the file to store verbose output to (relevant only
  #            if verbose is TRUE)
  #
  # Return:
  #   A list containing the result of calling optMethodCompare on each generated
  #   sequence
  
  fits <- foreach(i = 1:m.sim,
                  .packages = c("rugarch"),
                  .export = c("optMethodCompare", "evalSolverFit")) %dopar% {
                    if (is.na(rseed)) {
                      newseed <- NA
                    } else if (is.vector(rseed)) {
                      newseed <- rseed[i]
                    } else {
                      newseed <- rseed + i - 1
                    }
                    
                    if (verbose) {
                      cat(as.character(Sys.time()), ": Now on simulation ", i, "\n")
                    }
                    
                    sim <- ugarchpath(gen_spec, n.sim = n.sim, n.start = 1000,
                                      m.sim = 1, rseed = newseed)
                    x <- sim@path$seriesSim
                    optMethodCompare(x, spec = fit_spec, solvers = solvers)
  }
  
  return(fits)
}

# Specification 1 first
spec1_n100 <- optMethodSims(spec1, n.sim = 100, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec1_n500 <- optMethodSims(spec1, n.sim = 500, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec1_n1000 <- optMethodSims(spec1, n.sim = 1000, m.sim = 1000,
                             solvers = solvers, verbose = TRUE)

# Specification 2 next
spec2_n100 <- optMethodSims(spec2, n.sim = 100, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec2_n500 <- optMethodSims(spec2, n.sim = 500, m.sim = 1000,
                            solvers = solvers, verbose = TRUE)
spec2_n1000 <- optMethodSims(spec2, n.sim = 1000, m.sim = 1000,
                             solvers = solvers, verbose = TRUE)

Below is a set of helper functions I will use for the analytics I want.

optMethodSims_getAllVals <- function(param, solver, reslist) {
  # Get all values for a parameter obtained by a certain solver after getting a
  # list of results via optMethodSims
  #
  # Args:
  #   param: A string for the parameter to get (such as "beta1")
  #   solver: A string for the solver for which to get the parameter (such as
  #           "nlminb")
  #   reslist: A list created by optMethodSims
  #
  # Return:
  #   A vector of values of the parameter for each simulation
  
  res <- sapply(reslist, function(l) {
    return(l[solver, param])
  })
  
  return(res)
}

optMethodSims_getBestVals <- function(reslist, opt_vec = TRUE,
                                      reslike = FALSE) {
  # A function that gets the optimizer that maximized the likelihood function
  # for each entry in reslist
  #
  # Args:
  #   reslist: A list created by optMethodSims
  #   opt_vec: A boolean indicating whether to return a vector with the name of
  #            the optimizers that maximized the log likelihood
  #   reslike: A bookean indicating whether the resulting list should consist of
  #            matrices of only one row labeled "best" with a structure like
  #            reslist
  #
  # Return:
  #   If opt_vec is TRUE, a list of lists, where each sublist contains a vector
  #   of strings naming the opimizers that maximized the likelihood function and
  #   a matrix of the parameters found. Otherwise, just the matrix (resembles
  #   the list generated by optMethodSims)
  
  res <- lapply(reslist, function(l) {
    max_llh <- max(l[, "LLH"], na.rm = TRUE)
    best_idx <- (l[, "LLH"] == max_llh) & (!is.na(l[, "LLH"]))
    best_mat <- l[best_idx, , drop = FALSE]
    
    if (opt_vec) {
      return(list("solvers" = rownames(best_mat), "params" = best_mat))
    } else {
      return(best_mat)
    }
  })
  
  if (reslike) {
    res <- lapply(res, function(l) {
      mat <- l$params[1, , drop = FALSE]
      rownames(mat) <- "best"
      return(mat)
    })
  }
  
  return(res)
}

optMethodSims_getCaptureRate <- function(param, solver, reslist, multiplier = 2,
                                         spec, use_robust = TRUE) {
  # Gets the rate a confidence interval for a parameter captures the true value
  #
  # Args:
  #   param: A string for the parameter being worked with
  #   solver: A string for the solver used to estimate the parameter
  #   reslist: A list created by optMethodSims
  #   multiplier: A floating-point number for the multiplier to the standard
  #               error, appropriate for the desired confidence level
  #   spec: A ugarchspec specification with the fixed parameters containing the
  #         true parameter value
  #   use_robust: Use robust standard errors for computing CIs
  #
  # Return:
  #   A float for the proportion of times the confidence interval captured the
  #   true parameter value
  
  se_string <- ifelse(use_robust, "robust.se.", "se.")
  
  est <- optMethodSims_getAllVals(param, solver, reslist)
  moe_est <- multiplier * optMethodSims_getAllVals(
    paste0(se_string, param), solver, reslist)
  param_val <- spec@model$fixed.pars[[param]]
  contained <- (param_val <= est + moe_est) & (param_val >= est - moe_est)
  return(mean(contained, na.rm = TRUE))
}

optMethodSims_getMaxRate <- function(solver, maxlist) {
  # Gets how frequently a solver found a maximal log likelihood
  #
  # Args:
  #   solver: A string for the solver
  #   maxlist A list created by optMethodSims_getBestVals with entries
  #           containing vectors naming the solvers that maximized the log
  #           likelihood
  #
  # Return:
  #   The proportion of times the solver maximized the log likelihood
  
  maxed <- sapply(maxlist, function(l) {
    solver %in% l$solvers
  })
  
  return(mean(maxed))
}

optMethodSims_getFailureRate <- function(solver, reslist) {
  # Computes the proportion of times a solver failed to converge.
  #
  # Args:
  #   solver: A string for the solver
  #   reslist: A list created by optMethodSims
  #
  # Return:
  #   Numeric proportion of times a solver failed to converge
  
  failed <- sapply(reslist, function(l) {
    is.na(l[solver, "LLH"])
  })
  
  return(mean(failed))
}

# Vectorization
optMethodSims_getCaptureRate <- Vectorize(optMethodSims_getCaptureRate,
                                          vectorize.args = "solver")
optMethodSims_getMaxRate <- Vectorize(optMethodSims_getMaxRate,
                                      vectorize.args = "solver")
optMethodSims_getFailureRate <- Vectorize(optMethodSims_getFailureRate,
                                      vectorize.args = "solver")

I first create tables containing, for a fixed sample size and model:

The rate at which a solver attains the highest log likelihood among all solvers for a series
The rate at which a solver failed to converge
The rate at which a roughly 95% confidence interval based on the solver’s solution managed to contain the true parameter value for each parameter (referred to as the “capture rate”, and using the robust standard errors)

solver_table <- function(reslist, tags, spec) {
  # Creates a table describing important solver statistics
  #
  # Args:
  #   reslist: A list created by optMethodSims
  #   tags: A vector with strings naming all solvers to include in the table
  #   spec: A ugarchspec specification with the fixed parameters containing the
  #         true parameter value
  #
  # Return:
  #   A matrix containing metrics describing the performance of the solvers
  
  params <- names(spec1@model$fixed.pars)
  
  max_rate <- optMethodSims_getMaxRate(tags, optMethodSims_getBestVals(reslist))
  failure_rate <- optMethodSims_getFailureRate(tags, reslist)
  capture_rate <- lapply(params, function(p) {
    optMethodSims_getCaptureRate(p, tags, reslist, spec = spec)
  })
  
  return_mat <- cbind("Maximization Rate" = max_rate,
                      "Failure Rate" = failure_rate)
  
  capture_mat <- do.call(cbind, capture_rate)
  colnames(capture_mat) <- paste(params, "95% CI Capture Rate")
  
  return_mat <- cbind(return_mat, capture_mat)
  
  return(return_mat)
}

Model 1, $n = 100$

as.data.frame(round(solver_table(spec1_n100, tags, spec1) * 100, digits = 1))

##                             Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -------------------------  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## nlminb                                   16.2           20.0                        21.8                         29.2                        24.0
## solnp                                     0.1            0.0                        13.7                         24.0                        15.4
## lbfgs                                    15.1           35.2                        56.6                         67.9                        58.0
## gosolnp                                  20.3            0.0                        20.3                         32.6                        21.9
## hybrid                                    0.1            0.0                        13.7                         24.0                        15.4
## nloptr+COBYLA                             0.0            0.0                         6.3                         82.6                        19.8
## nloptr+BOBYQA                             0.0            0.0                         5.4                         82.1                        18.5
## nloptr+PRAXIS                            15.8            0.0                        42.1                         54.5                        44.1
## nloptr+NELDERMEAD                         0.4            0.0                         5.7                         19.3                         8.1
## nloptr+SBPLX                              0.1            0.0                         7.7                         85.7                        24.1
## nloptr+AUGLAG+COBYLA                      0.0            0.0                         6.1                         84.5                        19.9
## nloptr+AUGLAG+BOBYQA                      0.1            0.0                         6.5                         83.2                        19.4
## nloptr+AUGLAG+PRAXIS                     22.6            0.0                        41.2                         54.6                        44.1
## nloptr+AUGLAG+NELDERMEAD                 11.1            0.0                         7.5                         18.8                         9.7
## nloptr+AUGLAG+SBPLX                       0.6            0.0                         7.9                         86.5                        23.0

Model 1, $n = 500$

as.data.frame(round(solver_table(spec1_n500, tags, spec1) * 100, digits = 1))

##                             Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -------------------------  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## nlminb                                   21.2            0.4                        63.3                         67.2                        63.8
## solnp                                     0.1            0.2                        32.2                         35.6                        32.7
## lbfgs                                     4.5           41.3                        85.0                         87.6                        85.7
## gosolnp                                  35.1            0.0                        69.0                         73.2                        69.5
## hybrid                                    0.1            0.0                        32.3                         35.7                        32.8
## nloptr+COBYLA                             0.0            0.0                         3.2                         83.3                        17.8
## nloptr+BOBYQA                             0.0            0.0                         3.5                         81.5                        18.1
## nloptr+PRAXIS                            18.0            0.0                        83.9                         87.0                        84.2
## nloptr+NELDERMEAD                         0.0            0.0                        16.4                         20.7                        16.7
## nloptr+SBPLX                              0.1            0.0                         3.7                         91.4                        15.7
## nloptr+AUGLAG+COBYLA                      0.0            0.0                         3.2                         83.3                        17.8
## nloptr+AUGLAG+BOBYQA                      0.0            0.0                         3.5                         81.5                        18.1
## nloptr+AUGLAG+PRAXIS                     21.9            0.0                        80.2                         87.4                        83.4
## nloptr+AUGLAG+NELDERMEAD                  0.6            0.0                        20.0                         24.0                        20.5
## nloptr+AUGLAG+SBPLX                       0.0            0.0                         3.7                         91.4                        15.7

Model 1, $n = 1000$

as.data.frame(round(solver_table(spec1_n1000, tags, spec1) * 100, digits = 1))

##                             Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -------------------------  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## nlminb                                   21.5            0.1                        88.2                         86.1                        87.8
## solnp                                     0.4            0.2                        54.9                         53.6                        54.6
## lbfgs                                     1.1           44.8                        91.5                         88.0                        91.8
## gosolnp                                  46.8            0.0                        87.2                         85.1                        87.0
## hybrid                                    0.5            0.0                        55.0                         53.6                        54.7
## nloptr+COBYLA                             0.0            0.0                         4.1                         74.5                        15.0
## nloptr+BOBYQA                             0.0            0.0                         3.6                         74.3                        15.9
## nloptr+PRAXIS                            17.7            0.0                        92.6                         90.2                        92.2
## nloptr+NELDERMEAD                         0.0            0.0                        30.5                         29.6                        30.9
## nloptr+SBPLX                              0.0            0.0                         3.0                         82.3                        11.6
## nloptr+AUGLAG+COBYLA                      0.0            0.0                         4.1                         74.5                        15.0
## nloptr+AUGLAG+BOBYQA                      0.0            0.0                         3.6                         74.3                        15.9
## nloptr+AUGLAG+PRAXIS                     13.0            0.0                        83.4                         93.9                        86.7
## nloptr+AUGLAG+NELDERMEAD                  0.0            0.0                        34.6                         33.8                        35.0
## nloptr+AUGLAG+SBPLX                       0.0            0.0                         3.0                         82.3                        11.6

Model 2, $n = 100$

as.data.frame(round(solver_table(spec2_n100, tags, spec2) * 100, digits = 1))

##                             Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -------------------------  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## nlminb                                    8.2           24.2                        22.3                         34.7                        23.9
## solnp                                     0.3            0.0                        21.1                         32.6                        21.3
## lbfgs                                    11.6           29.5                        74.9                         73.2                        70.4
## gosolnp                                  19.0            0.0                        31.9                         41.2                        30.8
## hybrid                                    0.3            0.0                        21.1                         32.6                        21.3
## nloptr+COBYLA                             0.0            0.0                        20.5                         94.7                        61.7
## nloptr+BOBYQA                             0.2            0.0                        19.3                         95.8                        62.2
## nloptr+PRAXIS                            16.0            0.0                        70.2                         57.2                        52.8
## nloptr+NELDERMEAD                         0.2            0.0                         7.8                         27.8                        14.1
## nloptr+SBPLX                              0.1            0.0                        24.9                         91.0                        65.0
## nloptr+AUGLAG+COBYLA                      0.0            0.0                        21.2                         95.1                        62.5
## nloptr+AUGLAG+BOBYQA                      0.9            0.0                        20.1                         96.2                        62.5
## nloptr+AUGLAG+PRAXIS                     38.8            0.0                        70.4                         57.2                        52.7
## nloptr+AUGLAG+NELDERMEAD                 14.4            0.0                        10.7                         26.0                        16.1
## nloptr+AUGLAG+SBPLX                       0.1            0.0                        25.8                         91.9                        65.5

Model 2, $n = 500$

as.data.frame(round(solver_table(spec2_n500, tags, spec2) * 100, digits = 1))

##                             Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -------------------------  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## nlminb                                    1.7            1.6                        35.0                         37.2                        34.2
## solnp                                     0.1            0.2                        46.2                         48.6                        45.3
## lbfgs                                     2.2           38.4                        85.2                         88.1                        82.3
## gosolnp                                   5.2            0.0                        74.9                         77.8                        72.7
## hybrid                                    0.1            0.0                        46.1                         48.5                        45.2
## nloptr+COBYLA                             0.0            0.0                         8.2                        100.0                        40.5
## nloptr+BOBYQA                             0.0            0.0                         9.5                        100.0                        41.0
## nloptr+PRAXIS                            17.0            0.0                        83.8                         85.1                        81.0
## nloptr+NELDERMEAD                         0.0            0.0                        26.9                         38.2                        27.0
## nloptr+SBPLX                              0.0            0.0                         8.2                        100.0                        40.2
## nloptr+AUGLAG+COBYLA                      0.0            0.0                         8.2                        100.0                        40.5
## nloptr+AUGLAG+BOBYQA                      0.0            0.0                         9.5                        100.0                        41.0
## nloptr+AUGLAG+PRAXIS                     77.8            0.0                        84.4                         85.4                        81.3
## nloptr+AUGLAG+NELDERMEAD                  1.1            0.0                        32.5                         40.3                        32.3
## nloptr+AUGLAG+SBPLX                       0.0            0.0                         8.2                        100.0                        40.2

Model 2, $n = 1000$

as.data.frame(round(solver_table(spec2_n1000, tags, spec2) * 100, digits = 1))

##                             Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -------------------------  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## nlminb                                    2.7            0.7                        64.1                         68.0                        63.8
## solnp                                     0.0            0.0                        70.1                         73.8                        69.8
## lbfgs                                     0.0           43.4                        90.6                         91.5                        89.9
## gosolnp                                   3.2            0.0                        87.5                         90.3                        86.9
## hybrid                                    0.0            0.0                        70.1                         73.8                        69.8
## nloptr+COBYLA                             0.0            0.0                         2.3                        100.0                        20.6
## nloptr+BOBYQA                             0.0            0.0                         2.5                        100.0                        22.6
## nloptr+PRAXIS                            14.1            0.0                        89.1                         91.3                        88.5
## nloptr+NELDERMEAD                         0.0            0.0                        46.3                         55.6                        45.4
## nloptr+SBPLX                              0.0            0.0                         2.2                        100.0                        19.5
## nloptr+AUGLAG+COBYLA                      0.0            0.0                         2.3                        100.0                        20.6
## nloptr+AUGLAG+BOBYQA                      0.0            0.0                         2.5                        100.0                        22.6
## nloptr+AUGLAG+PRAXIS                     85.5            0.0                        89.1                         91.3                        88.5
## nloptr+AUGLAG+NELDERMEAD                  0.3            0.0                        51.9                         58.2                        51.3
## nloptr+AUGLAG+SBPLX                       0.0            0.0                         2.2                        100.0                        19.5

These tables already reveal a lot of information. In general it seems that the AUGLAG-PRAXIS method (the augmented Lagrangian method using the principal axis solver) provided in NLOpt does best for model 2 especially for large sample sizes, while for model 1 the gosolnp method, which uses the solnp solver by Yinyu Ye but with random initializations and restarts, seems to win out for larger sample sizes.

The bigger story, though, is the failure of any method to be the “best”, especially in the case of smaller sample sizes. While there are some optimizers that consistently fail to attain the maximum log-likelihood, no optimizer can claim to consistently obtain the best result. Additionally, different optimizers seem to perform better with different models. The implication for real-world data–where the true model parameters are never known–is to try every optimizer (or at least those that have a chance of maximizing the log-likelihood) and pick the results that yield the largest log-likelihood. No algorithm is trustworthy enough to be the go-to algorithm.

Let’s now look at plots of the estimated distribution of the parameters. First comes a helper function.

library(ggplot2)

solver_density_plot <- function(param, tags, list_reslist, sample_sizes, spec) {
  # Given a parameter, creates a density plot for each solver's distribution
  # at different sample sizes
  #
  # Args:
  #   param: A string for the parameter to plot
  #   tags: A character vector containing the solver names
  #   list_reslist: A list of lists created by optMethodSimsf, one for each
  #                 sample size
  #   sample_sizes: A numeric vector identifying the sample size corresponding
  #                 to each object in the above list
  #   spec: A ugarchspec object containing the specification that generated the
  #         datasets
  #
  # Returns:
  #   A ggplot object containing the plot generated
  
  p <- spec@model$fixed.pars[[param]]
  nlist <- lapply(list_reslist, function(l) {
    optlist <- lapply(tags, function(t) {
      return(na.omit(optMethodSims_getAllVals(param, t, l)))
    })
    names(optlist) <- tags
    df <- stack(optlist)
    names(df) <- c("param", "optimizer")
    
    return(df)
  })
  ndf <- do.call(rbind, nlist)
  ndf$n <- rep(sample_sizes, times = sapply(nlist, nrow))
  
  ggplot(ndf, aes(x = param)) + geom_density(fill = "black", alpha = 0.5) +
    geom_vline(xintercept = p, color = "blue") +
    facet_grid(optimizer ~ n, scales = "free_y")
}

Now for plots.

Estimated $\omega$ , model 1

solver_density_plot("omega", tags, list(spec1_n100, spec1_n500, spec1_n1000),
                    c(100, 500, 1000), spec1)

Estimated $\alpha$ , model 1

solver_density_plot("alpha1", tags, list(spec1_n100, spec1_n500, spec1_n1000),
                    c(100, 500, 1000), spec1)

Estimated $\beta$ , model 1

solver_density_plot("beta1", tags, list(spec1_n100, spec1_n500, spec1_n1000),
                    c(100, 500, 1000), spec1)

Bear in mind that there are only 1,000 simulated series and the optimizers produce solutions for each series, so in principle optimizer results should not be independent, yet the only time these density plots look the same is when the optimizer performs terribly. But even when an optimizer isn’t performing terribly (as is the case for the gosolnp, PRAXIS, and AUGLAG-PRAXIS methods) there’s evidence of artifacts around 0 for the estimates of $\omega$ and $\alpha$ and 1 for $\beta$ . These artifacts are more pronounced for smaller sample sizes. That said, for the better optimizers the estimators look almost unbiased, especially for $\omega$ and $\alpha$ , but their spread is large even for large sample sizes, especially for $\beta$ ‘s estimator. That’s not the case for the AUGLAG-PRAXIS optimizer, though; it appears to produce biased estimates.

Let’s look at plots for model 2.

Estimated $\omega$ , model 2

solver_density_plot("omega", tags, list(spec2_n100, spec2_n500, spec2_n1000),
                    c(100, 500, 1000), spec2)

Estimated $\alpha$ , model 2

solver_density_plot("alpha1", tags, list(spec2_n100, spec2_n500, spec2_n1000),
                    c(100, 500, 1000), spec2)

Estimated $\beta$ , model 2

solver_density_plot("beta1", tags, list(spec2_n100, spec2_n500, spec2_n1000),
                    c(100, 500, 1000), spec2)

The estimators don’t struggle as much for model 2, but the picture is still hardly rosy. The PRAXIS and AUGLAG-PRAXIS methods seem to perform well, but far from spectacularly for small sample sizes.

So far, my experiments suggest practitioners should not rely on any one optimizer but instead to try different ones and choose the results that have the largest log-likelihood. Suppose we call this optimization routine the “best” optimizer. how does this optimizer perform?

Let’s find out.

Model 1, $n = 100$

as.data.frame(round(solver_table(
  optMethodSims_getBestVals(spec1_n100, reslike = TRUE),
  "best", spec1) * 100,
  digits = 1))

##         Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -----  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## best                  100              0                        49.5                         63.3                        52.2

Model 1, $n = 500$

as.data.frame(round(solver_table(
  optMethodSims_getBestVals(spec1_n500, reslike = TRUE),
  "best", spec1) * 100,
  digits = 1))

##         Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -----  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## best                  100              0                          86                         88.8                        86.2

Model 1, $n = 1000$

as.data.frame(round(solver_table(
  optMethodSims_getBestVals(spec1_n1000, reslike = TRUE),
  "best", spec1) * 100,
  digits = 1))

##         Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -----  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## best                  100              0                        92.8                         90.3                        92.4

Model 2, $n = 100$

as.data.frame(round(solver_table(
  optMethodSims_getBestVals(spec2_n100, reslike = TRUE),
  "best", spec2) * 100,
  digits = 1))

##         Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -----  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## best                  100              0                        55.2                         63.2                        52.2

Model 2, $n = 500$

as.data.frame(round(solver_table(
  optMethodSims_getBestVals(spec2_n500, reslike = TRUE),
  "best", spec2) * 100,
  digits = 1))

##         Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -----  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## best                  100              0                          83                         86.3                        80.5

Model 2, $n = 1000$

as.data.frame(round(solver_table(
  optMethodSims_getBestVals(spec2_n1000, reslike = TRUE),
  "best", spec2) * 100,
  digits = 1))

##         Maximization Rate   Failure Rate   omega 95% CI Capture Rate   alpha1 95% CI Capture Rate   beta1 95% CI Capture Rate
## -----  ------------------  -------------  --------------------------  ---------------------------  --------------------------
## best                  100              0                        88.7                         91.4                        88.1

Bear in mind that we evaluate the performance of the “best” optimizer by the CI capture rate, which should be around 95%. The “best” optimizer obviously has good performance but does not outperform all optimizers. This is disappointing; I had hoped that the “best” optimizer would have the highly desirable property of a 95% capture rate. Performance is nowhere near that except for larger sample sizes. Either the standard errors are being underestimated or for small sample sizes the Normal distribution poorly describes the actual distribution of the estimators (which means multiplying by two does not lead to intervals with the desired confidence level).

Interestingly, there is no noticeable difference in performance between the two models for this “best” estimator. This suggests to me that the seemingly better results for models often seen in actual data might be exploiting the bias of the optimizers.

Let’s look at the distribution of the estimated parameters.

Estimated $\omega$ , model 1

solver_density_plot("omega", "best",
                    lapply(list(spec1_n100, spec1_n500, spec1_n1000),
                           function(l) {optMethodSims_getBestVals(l,
                                                              reslike = TRUE)}),
                    c(100, 500, 1000), spec1)

Estimated $\alpha$ , model 1

solver_density_plot("alpha1", "best",
                    lapply(list(spec1_n100, spec1_n500, spec1_n1000),
                           function(l) {optMethodSims_getBestVals(l,
                                                              reslike = TRUE)}),
                    c(100, 500, 1000), spec1)

Estimated $\beta$ , model 1

solver_density_plot("beta1", "best",
                    lapply(list(spec1_n100, spec1_n500, spec1_n1000),
                           function(l) {optMethodSims_getBestVals(l,
                                                              reslike = TRUE)}),
                    c(100, 500, 1000), spec1)

Estimated $\omega$ , model 2

solver_density_plot("omega", "best",
                    lapply(list(spec2_n100, spec2_n500, spec2_n1000),
                           function(l) {optMethodSims_getBestVals(l,
                                                              reslike = TRUE)}),
                    c(100, 500, 1000), spec2)

Estimated $\alpha$ , model 2

solver_density_plot("alpha1", "best",
                    lapply(list(spec2_n100, spec2_n500, spec2_n1000),
                           function(l) {optMethodSims_getBestVals(l,
                                                              reslike = TRUE)}),
                    c(100, 500, 1000), spec2)

Estimated $\beta$ , model 2

solver_density_plot("beta1", "best",
                    lapply(list(spec2_n100, spec2_n500, spec2_n1000),
                           function(l) {optMethodSims_getBestVals(l,
                                                              reslike = TRUE)}),
                    c(100, 500, 1000), spec2)

The plots suggest that the “best” estimator still shows some pathologies even though it behaves less poorly than the other estimators. I don’t see evidence for bias in parameter estimates regardless of choice of model but I’m not convinced the “best” estimator truly maximizes the log-likelihood, especially for smaller sample sizes. the estimates for $\beta$ are especially bad. Even if the standard error for $\beta$ should be large I don’t think it should show the propensity for being zero or one that these plots reveal.

Conclusion

I initially wrote this article over a year ago and didn’t publish it until now. The reason for the hang up was because I wanted a literature review of alternative ways to estimate the parameters of a GARCH model. Unfortunately I never completed such a review, and I’ve decided to release this article regardless.

That said, I’ll share what I was reading. One article by Gilles Zumbach tried to explain why estimating GARCH parameters is hard. He noted that the quasi-likelihood equation that solvers try to maximize has bad properties, such as being non-concave and having “flat” regions that algorithms can become stuck in. He suggested an alternative procedure to finding the parameters of GARCH models, where one finds the best fit in an alternative parameter space (which supposedly has better properties than working with the original parameter space of GARCH models) and estimating one of the parameters using, say, the method of moments, without any optimization algorithm. Another article, by Fiorentini, Calzolari, and Panattoni, showed that analytic gradients for GARCH models could be computed explicitly, so gradient-free methods like those used by the optimization algorithms seen here are not actually necessary. Since numerical differentiation is generally a difficult problem, this could help ensure that no additional numerical error is being introduced that causes these algorithms to fail to converge. I also wanted to explore other estimation methods to see if they somehow can avoid numerical techniques altogether or have better numerical properties, such as estimation via method of moments. I wanted to read an article by Andersen, Chung, and Sørensen to learn more about this approach to estimation.

Life happens, though, and I didn’t complete this review. The project moved on and the problem of estimating GARCH model parameters well was essentially avoided. That said, I want to revisit this point, perhaps exploring how techniques such as simulated annealing do for estimating GARCH model parameters.

So for now, if you’re a practitioner, what should you do when estimating a GARCH model? I would say don’t take for granted that the default estimation procedure your package uses will work. You should explore different procedure and different parameter choices and go with the results that lead to the largest log-likelihood value. I showed how this could be done in an automated fashion but you should be prepared to manually pick the model with the best fit (as determined by the log-likelihood). If you don’t do this the model you estimated may not actually be the one for which theory works.

I will say it again, one last time, in the last sentence of this article for extra emphasis: don’t take numerical techniques and results for granted!

sessionInfo()

## R version 3.4.2 (2017-09-28)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 rugarch_1.3-8 printr_0.1   
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.16               htmltools_0.3.6            
##  [3] SkewHyperbolic_0.3-2        expm_0.999-2               
##  [5] scales_0.5.0                DistributionUtils_0.5-1    
##  [7] Rsolnp_1.16                 rprojroot_1.2              
##  [9] grid_3.4.2                  stringr_1.3.1              
## [11] knitr_1.17                  numDeriv_2016.8-1          
## [13] GeneralizedHyperbolic_0.8-1 munsell_0.4.3              
## [15] pillar_1.3.0                tibble_1.4.2               
## [17] compiler_3.4.2              highr_0.6                  
## [19] lattice_0.20-35             labeling_0.3               
## [21] Matrix_1.2-8                KernSmooth_2.23-15         
## [23] plyr_1.8.4                  xts_0.10-0                 
## [25] spd_2.0-1                   zoo_1.8-0                  
## [27] stringi_1.2.4               magrittr_1.5               
## [29] reshape2_1.4.2              rlang_0.2.2                
## [31] rmarkdown_1.7               evaluate_0.10.1            
## [33] gtable_0.2.0                colorspace_1.3-2           
## [35] yaml_2.1.14                 tools_3.4.2                
## [37] mclust_5.4.1                mvtnorm_1.0-6              
## [39] truncnorm_1.0-7             ks_1.11.3                  
## [41] nloptr_1.0.4                lazyeval_0.2.1             
## [43] crayon_1.3.4                backports_1.1.1            
## [45] Rcpp_1.0.0

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!

When I wrote this article initially, my advisor and a former student of his developed a test statistic that should detect early or late change points in a time series, including a change in the parameters of a GARCH model. My contribution to the paper we were writing included demonstrating that the test statistic detects structural change sooner than other test statistics when applied to real-world data. To be convincing to reviewers, our test statistic should detect a change that another statistic won’t detect until getting more data. This means that the change should be present but not so strong that both statistics immediately detect the change with miniscule $p$ -values. ↩
The profile on LinkedIn I linked to may or may not be the correct person; I’m guessing it is based on the listed occupations and history. If I got the wrong person, I’m sorry. ↩

http://ntguardian.wordpress.com/?p=3504

Extensions