FZ Blogs

Part of wordpress.com

AACTAAAGGAACTTT… + some stochastic processes so far…

Software Performance: Data-type profiling for perf and a few more tools
Tags: Books, Linux, Programming, python, eBPF, java, linux, Performance

2024 will be an interesting year for software performance, based on what I read a few days ago in “Data-type profiling for perf”:

  • “Tooling for profiling the effects of memory usage and layout has always lagged behind that for profiling processor activity, so Namhyung Kim’s patch set for data-type profiling in perf is a welcome addition. It provides aggregated breakdowns of memory accesses by data type that can inform structure layout and access pattern changes. Existing tools have either, like heaptrack, focused on profiling allocations, or, like perf mem, on accounting memory accesses only at the address level. This new work builds on the latter, using DWARF debugging information to correlate memory operations with their source-level types.”

There’s also a presentation by the author from November 2023:

  • “Memory accesses can suffer from problems like poor spacial and temporal locality, as well as false sharing of cache lines. Existing presentations of profile data, such data from the perspective of code, can make it difficult to reason as to what the problems are and to work out what the fixes should be. A typical fix may be to reorder variables within a data structure. In this work Namhyung Kim will present ongoing work combining perf event and DWARF debug information, in order to correlate samples and present data type of the variables accessed within a program. However, DWARF debug information is not reliable in enabling a good understanding of variables accessed. The presentation will discuss the state of data type profiling and its addition to the Linux perf tool, how toolchain limitations are worked around by the tool, and how toolchains can be improved for data type profiling in the future.”
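To make the “reorder variables within a data structure” fix concrete, here is a minimal sketch that uses Python’s standard ctypes module to mimic C struct layout. The struct and field names are made up for illustration, and the exact sizes depend on the platform ABI. This is not how perf’s data-type profiling works internally; it only illustrates why field order matters.

```python
import ctypes


class Original(ctypes.Structure):
    # Hypothetical layout: a wide field sandwiched between two narrow ones,
    # which forces padding to be inserted around the wide field.
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("counter", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Same fields, widest first, so the narrow fields pack together.
    _fields_ = [
        ("counter", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


def describe(struct_type):
    """Return (total size, [(field name, offset, size), ...]) for a ctypes struct."""
    fields = []
    for name, _ctype in struct_type._fields_:
        descriptor = getattr(struct_type, name)
        fields.append((name, descriptor.offset, descriptor.size))
    return ctypes.sizeof(struct_type), fields


if __name__ == "__main__":
    for struct_type in (Original, Reordered):
        size, fields = describe(struct_type)
        print(struct_type.__name__, "size:", size)
        for name, offset, field_size in fields:
            print(f"  {name}: offset={offset} size={field_size}")
```

On common 64-bit platforms the reordered struct should come out smaller, because the two one-byte flags share the space that padding previously wasted.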

A few more relevant tools for people doing performance work:

  • FTrace: Ftrace (Function Tracer) is an internal tracer designed to help developers and designers of systems find out what is going on inside the kernel. It can be used for debugging or analyzing latencies and performance issues that take place outside of user-space. Although ftrace is typically considered the function tracer, it is really a framework of several assorted tracing utilities. There’s latency tracing to examine what occurs between interrupts being disabled and enabled, as well as for preemption, and from the time a task is woken to the time it is actually scheduled in. One of the most common uses of ftrace is event tracing. Throughout the kernel are hundreds of static event points that can be enabled via the tracefs file system to see what is going on in certain parts of the kernel.
  • Coz: Finding Code that Counts with Causal Profiling. Coz is a profiler for native code (C/C++/Rust) that unlocks optimization opportunities missed by traditional profilers. Coz employs a novel technique called causal profiling that measures optimization potential. It predicts what the impact of optimizing code will have on overall throughput or latency. Profiles generated by Coz show the “bang for buck” of optimizing a line of code in an application. In the below profile, almost every effort to optimize the performance of this line of code directly leads to an increase in overall performance, making it an excellent candidate for optimization efforts.
  • eBPF: short for extended Berkeley Packet Filter, a technology that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring changes to kernel source code or loading kernel modules.
  • perf c2c: Use for “Shared Data Cache-to-Cache (C2C)” analysis. See also “Chapter 26. Detecting false sharing” in Red Hat documentation:
    • False sharing occurs when a processor core on a Symmetric Multi Processing (SMP) system modifies data items on the same cache line that is in use by other processors to access other data items that are not being shared between the processors.
    • This initial modification requires that the other processors using the cache line invalidate their copy and request an updated one despite the processors not needing, or even necessarily having access to, an updated version of the modified data item.
    • You can use the perf c2c command to detect false sharing.
  • Perfetto: System profiling, app tracing and trace analysis
    • Linux kernel tracing: Capture high frequency ftrace data: scheduling activity, task switching latency, CPU frequency and much more.
    • Userspace profilers and extra probes: Native heap profiling, Java heap profiling, pollers for /proc stat files.
  • Intel® VTune™ Profiler: optimizes application performance, system performance, and system configuration for HPC, cloud, IoT, media, storage, and more. Multilingual: C, C++, C#, Fortran, Python, Go, Java*, .NET, Assembly, or any combination of languages.
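The false sharing described under perf c2c above comes down to which cache line each field lands on. The following is a rough sketch, again using ctypes to mimic a C layout. The 64-byte cache line size, the struct names, and the field names are assumptions for illustration, and the calculation assumes the struct itself starts on a cache line boundary.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumption: a common cache line size on x86-64 CPUs


class SharedCounters(ctypes.Structure):
    # Two counters, each meant to be updated by a different thread.
    _fields_ = [
        ("hits_thread_a", ctypes.c_uint64),
        ("hits_thread_b", ctypes.c_uint64),
    ]


class PaddedCounters(ctypes.Structure):
    # Padding pushes the second counter onto the next cache line.
    _fields_ = [
        ("hits_thread_a", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHE_LINE_BYTES - ctypes.sizeof(ctypes.c_uint64))),
        ("hits_thread_b", ctypes.c_uint64),
    ]


def cache_line_of(struct_type, field_name):
    """Cache line index of a field, assuming the struct starts on a line boundary."""
    return getattr(struct_type, field_name).offset // CACHE_LINE_BYTES


def share_a_line(struct_type, first, second):
    """True if both fields fall on the same cache line."""
    return cache_line_of(struct_type, first) == cache_line_of(struct_type, second)


if __name__ == "__main__":
    for struct_type in (SharedCounters, PaddedCounters):
        shared = share_a_line(struct_type, "hits_thread_a", "hits_thread_b")
        print(struct_type.__name__, "counters share a cache line:", shared)
```

Padding trades memory for isolation: the padded struct is larger, but writes to one counter no longer invalidate the line that holds the other.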

See also:

http://ileriseviye.wordpress.com/?p=6001
Computing History meets Personal History: Ellis D. Kropotechev and ZEUS, A Marvelous Time-Sharing System
Tags: CogSci, Lisp, Programming, psychology, History, Common Lisp, History of Computing, Stanford University

Recently I’ve come across the following message from the Twitter account of SDF Public Access UNIX System, and I realized that I have some personal connection to the whole thing, albeit weakly. “How so?” you might ask, well, keep on reading…

Prof Dr Hamit Fisek stars as Ellis in Ellis D. Kropotechev and ZEUS, his marvelous time sharing system.

Possibly the only footage of ZEUS! on the PDP-1 at Stanford. https://t.co/ru8qEnVfw8 #timesharing #vintage #retro #stanford #computer #history

— SDF (@sdf_pubnix) January 13, 2022

It points to a short movie from 1967, “Ellis D. Kropotechev and Zeus, A Marvelous Time-Sharing Device”: “Set in the Stanford University computer center and cafeteria, the film gives the viewer a feel for the process of computer programming in the 1960s. It illustrates the transition from punched card batch processing computers (using teletypes) to time-sharing computing (using video terminals). Additional technologies employed throughout the film include the IBM 7090, the IBM 26 Printing Card Punch, the Zeus time-sharing program and the Algol/Gogol computer languages. The film’s soundtrack includes “Cool, Calm and Collected” by The Rolling Stones (1967).”

Ellis D. Kropotechev and Zeus, This marvelous time-sharing system. 1967 (VPRI 215)

If you can’t watch it on YouTube, or want a better version, visit the following web page: https://toobnix.org/w/2hcCuwyFXx85hKaE2CYtgp

What is my connection to all of this? You can see a young Hamit Fişek in the movie, who, after graduating from Stanford, became a professor at the Department of Psychology at Boğaziçi University. I also spent time at Boğaziçi University in the 2000s, studying for my MA in the cognitive science program. This academic program also included courses from the psychology department, but I didn’t have the opportunity to take any classes from Prof. Hamit Fişek. Unfortunately, we lost Prof. Fişek back in 2020.

If I had known that I was physically and organizationally very close to a professor who had spent his time as a Ph.D. student at Stanford University in the 1960s, and who had been featured in a movie produced by none other than John McCarthy himself, I would probably have asked a lot of questions! That’s because in the 2000s I was not only studying at the university where Prof. Fişek worked, but had also started to learn about Lisp, and I wouldn’t have wanted to miss the opportunity to talk to a person who had been a part of that history as it was being created.

Now that I have a user account on http://sdf.org, I will always remember this, cherish the memory of the professor that I unfortunately never talked to, and will try to avoid a similar mistake in the future.

Who knows, maybe I’ll meet some new people during SDF Plan9 Boot Camp Summer 2022, and get to hear some interesting stories about the history of computing, as well as gain perspectives on its future:

Registration opens today for the SDF Plan9 Boot Camp Summer 2022! Learn about the Plan9 operating system in a collaborative and friendly community based learning environment. https://t.co/rRztEGrW5K

Join us! #community #plan9 #9front #experimental #unix #bootcamp #summer pic.twitter.com/iJzS140Eq5

— SDF (@sdf_pubnix) June 1, 2022
http://ileriseviye.wordpress.com/?p=5905
AI & Math: Hey, Alexa, what is 200 factorial?
Tags: CogSci, Math, Programming, AI, Amazon Alexa, Artificial Intelligence, Google Calculator

Today I decided to ask Amazon Alexa some difficult questions about very big numbers:

At least, it is better than the current Google Calculator, I mean, Amazon Alexa tries at least, before throwing in the towel 😉

It all started with the following observation:

https://twitter.com/pwr2dppl/status/1495877075047264261
And it went like this:

I've just tested it with Amazon Alexa, and it worked until 170!, and when I asked "hey, Alexa, what is 171 factorial?" and it said "Here's something I found on the Internet, 70 factorial equals…". Same thing happened for 172!

— Emre Sevinç (@EmreSevinc) February 22, 2022
Can you guess what’s going on?

Well, dealing with such large numbers isn’t easy, especially if it’s about JavaScript and big numbers:

JavaScript hates numbers because the original author was being fast and sloppy and it caught on. Google should do this right, but satisfying math folks via the calculator in search is not considered a valuable use of time.

— Jeremy Kun (j2kun@mathstodon.xyz) (@jeremyjkun) February 22, 2022
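A plausible explanation for the cut-off at 170!, offered here as an assumption rather than a confirmed fact about Alexa’s internals, is that the calculation uses 64-bit IEEE 754 floating-point numbers (the same number type JavaScript uses by default). The short sketch below checks where factorials stop fitting into such a double; the function names are my own.

```python
import math
import sys


def factorial_as_double(n):
    """Return n! as a float, or None if it does not fit in a 64-bit double."""
    exact = math.factorial(n)  # Python integers have arbitrary precision
    try:
        return float(exact)
    except OverflowError:
        # The exact value is larger than the biggest finite double.
        return None


def largest_factorial_in_double():
    """Find the largest n such that n! is still representable as a finite double."""
    n = 0
    while factorial_as_double(n + 1) is not None:
        n += 1
    return n


if __name__ == "__main__":
    print("largest n:", largest_factorial_in_double())
    print("largest finite double:", sys.float_info.max)
```

The largest finite double is roughly 1.8 × 10^308, while 171! is already above 10^309, so any system that computes factorials in double precision runs out of range right after 170!.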

All in all, this is in the category of “cute, misleading, and not suitable for some mathematics questions”, and not lethal, unlike the case reported by the BBC in “Alexa tells 10-year-old girl to touch live plug with penny”. 😱

Having one of the best speech recognition technologies (for English) does not necessarily make human level understanding (semantics and pragmatics) easy, but nevertheless, I can only congratulate the researchers and software engineers building the speech recognition part 😉

Future we were promised versus… well, Alexa ¯_(ツ)_/¯ pic.twitter.com/pbzHt9trxH

— Emre Sevinç (@EmreSevinc) August 8, 2021

About the author: Emre is the co-founder & CTO of TM Data ICT Solutions in Belgium. You can read more about him on the About page of this blog.

http://ileriseviye.wordpress.com/?p=5893
Diversity in Belgium: facts and maps
Tags: General, belgium, Data visualization, diversity, Information visualization, visualization

My wife has recently drawn my attention to a very interesting report about diversity in Belgium: In a piece dated 8th January, 2022, written by Tobias Santens in one of the mainstream media outlets, you can find interactive information visualizations about some diversity figures in Belgium.

The piece starts with the following (automatic translation): “Unknown often remains unloved: discover more about diversity and integration in your municipality. Almost 1 in 3 Flemish people thinks that there are too many people of different origin living in their municipality. 7 out of 10 Flemish people almost never say they have a chat. The differences between municipalities are large. VRT NWS delved into the figures that the Agency for Home Affairs recently bundled in the Local Integration Scan. Search for your municipality below to view the situation in your area.”

Following the introduction and catchy phrases, there are some interactive maps where you can enter where you live in Belgium, and read about people’s perceptions with regard to people with an immigration background:

When I selected where I live, I was presented with the following information (translation from the original Dutch is done by Google Translate):

If you think Antwerp isn’t positive enough, I suggest you check some other parts of Flanders! 😉 As someone who has worked in very different parts of Belgium, it was interesting for me to discover those places and the perception of their residents with respect to diversity.

Thanks to the following interactive map, I learned about different aspects of different regions in Belgium:

For example, I learned that according to the official statistics, “In Antwerp, 54% of the inhabitants have a foreign origin. Those are people who themselves have a different nationality, or where at least one of the parents has a foreign nationality. 22% of the inhabitants have a different nationality on their papers. This may also concern people with roots in neighboring countries, such as the Netherlands or France. 38% of the inhabitants have a background outside the EU. At least one of their parents, or themselves, therefore have a country outside the EU (and the United Kingdom) on their identity card.”

The final part of the piece finishes with the following information based on statistics, and as I read it, I wonder how things will evolve in the next 10-15 years. What do you think?

“Over the past ten years, the share of people of foreign origin in Flanders has increased from 17 to 24 percent. In Brussels, that share is of course much higher, rising from 66 to 75 percent between 2011 and 2021. In addition, the border areas with the Netherlands, France and Luxembourg are also darker, and cities such as Antwerp, Genk, Mons, Charleroi and Liège are a lot more diverse.

Note that a broad definition of “foreign origin” is used. You will be counted in this group if at least one of your parents, or you yourself have another nationality as the first registered nationality. So it is possible that you end up in this group, although you do not recognize yourself in the name. A subdivision with more details can be found via the search bar above the map.

Where does this data come from?

The first map was made with data from the Flemish Agency for the Interior. Their survey took place in September and October 2020. In each municipality, a group between 17 and 85 years old was surveyed, which is representative of the rest of the municipality in terms of age and gender. It is the intention in the future to also bundle data on integration for Brussels.

Results from a survey are always an approximation. The actual views of a municipality may therefore differ a few percentage points from the results above. It is therefore not justified to compare municipalities whose results are very close. More about that margin of error can be found here. Do you want even more figures about your municipality? You will find it here.

The second map was made with data from the Belgian statistical office Statbel, which works with information from the National Register. You can find more about this here.”

http://ileriseviye.wordpress.com/?p=5861
How to activate hotplugged / newly added RAM in Linux?
Tags: Debian, Linux, sysadmin, cpu, hotplug, kernel, linux, nutanix, Operating system, ram, udev

These days I’m busy helping one of our clients build a data platform for their renewable energy project in their own data center using Nutanix. I requested from their tech support a RAM and CPU cores upgrade for one of the virtual machines that was already running Debian GNU/Linux.

Should I buy this htop t-shirt, or go on a vacation? 😉

When they informed me that they had increased the number of CPU cores and the amount of RAM from the Nutanix side, I proceeded to reboot the server. To my surprise, even though I was able to see the correct number of CPU cores in htop, it seemed like the amount of RAM stayed the same! Where was the missing RAM? The Nutanix management system showed that it had allocated the requested amount of RAM to the server, but unlike the newly added CPU cores, we simply couldn’t see the expected amount of RAM from within the virtual machine running Debian GNU/Linux.

After a brief investigation, we discovered that this has to do with the memory hotplug mechanism of the Linux kernel: lsmem showed the ranges of available memory, with the ones corresponding to the missing amount marked as offline.

I found out that it was possible to bring the offline memory ranges online (and vice versa) using the chmem utility, e.g.:

  • This command requests 1024 MB of memory to be set online.
    • # chmem --enable 1024
  • This command requests 2 GB of memory to be set online.
    • # chmem --enable 2g
  • This command requests the memory range that starts with 0x00000000e4000000 and ends with 0x00000000f3ffffff to be set offline.
    • # chmem --disable 0x00000000e4000000-0x00000000f3ffffff
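The online/offline state that lsmem reports is also exposed per memory block through sysfs, in the state file of each /sys/devices/system/memory/memoryN directory. As a minimal sketch (the function names are my own, and the path can be overridden for testing), the offline blocks can be listed like this:

```python
from pathlib import Path

# Standard sysfs location for memory blocks on Linux.
DEFAULT_SYSFS_MEMORY = Path("/sys/devices/system/memory")


def memory_block_states(sysfs_root=DEFAULT_SYSFS_MEMORY):
    """Return {block name: state} for every memoryN directory under sysfs_root."""
    states = {}
    for block in sorted(Path(sysfs_root).glob("memory[0-9]*")):
        state_file = block / "state"
        if state_file.is_file():
            states[block.name] = state_file.read_text().strip()
    return states


def offline_blocks(sysfs_root=DEFAULT_SYSFS_MEMORY):
    """Return the names of memory blocks whose state is "offline"."""
    states = memory_block_states(sysfs_root)
    return [name for name, state in states.items() if state == "offline"]


if __name__ == "__main__":
    print("offline memory blocks:", offline_blocks())
```

Reading these files does not require root privileges; only changing the state does.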

It was all good, but then I realized that bringing the newly added RAM online and making it fully available to the virtual machine running Debian didn’t survive a reboot.

A bit of research led to the following article: “How to enable memory/CPU hot plugging on my Linux server – Guide (instructions) on how to enable memory/CPU hot plugging on Linux server”.

Based on that, I added the following udev rules by creating a new file named 40-cpu-mem-hotplug.rules:

$ cat /etc/udev/rules.d/40-cpu-mem-hotplug.rules
# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"

# Memory hotadd request
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

After rebooting the system, I was glad that it now automatically detected all of the available RAM.
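In effect, the memory rule writes “online” into the state attribute of every memory block that shows up as offline. The following sketch does the same thing once, by hand; the function name is hypothetical, it needs root privileges on a real system, and unlike the udev rule it does nothing for blocks that appear later or after a reboot.

```python
from pathlib import Path


def bring_memory_online(sysfs_root="/sys/devices/system/memory"):
    """Write "online" to every memory block whose state is "offline".

    Returns the names of the blocks that were changed.
    """
    changed = []
    for state_file in sorted(Path(sysfs_root).glob("memory[0-9]*/state")):
        if state_file.read_text().strip() == "offline":
            # Same effect as: echo online > .../memoryN/state
            state_file.write_text("online")
            changed.append(state_file.parent.name)
    return changed


if __name__ == "__main__":
    print("brought online:", bring_memory_online())
```

The udev rule remains the better fit for the reboot problem, since it reacts to each block as the kernel adds it.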

Further Reading


http://ileriseviye.wordpress.com/?p=5815
A new data structure in town: Maple Tree
Tags: Linux, Programming, data structure, data structures, linux

Thanks to a recent post on lwn.net, I learned about a new data structure: Maple Tree. Apparently, it’s been in development for the last 1.5 years: “The Maple Tree is a new data structure for Linux that provides an efficient way to store index ranges which map to a single pointer. It is RCU-safe and optimised for modern CPUs. For this application, it outperforms both the existing rbtree and radix tree data structures. The API is inspired by the XArray, and is significantly easier to use than the rbtree. This talk will cover the details of the implementation and show examples of users.”

This is what I could find about the up-and-coming “Maple Tree” data structure for enhancing Linux performance:

The Linux Maple Tree – Matthew Wilcox, Oracle

http://ileriseviye.wordpress.com/?p=5730
Unix and Women
Tags: Programming, History, History of Computing, unix

I’ve recently come across the names of two women who were active during the birth and early days of Unix, back in the 1970s and 1980s. For future reference, I wanted to note down information about these pioneering women.

Lorinda Cherry in a historical Unix video from AT&T, Bell Labs

“For many people, writing is painful and editing one’s own prose is difficult, tedious, and error-prone. It is often hard to see which parts of a document are difficult to read or how to transform a wordy sentence into a more concise one. It is even harder to discover that one overuses a particular linguistic construct. The system of programs described here helps writers to evaluate documents and to produce better written and more readable prose. The system consists of programs to measure surface features of text that are important to good writing style as well as programs to do some of the tedious jobs of a copy editor. Some of the surface features measured are readability, sentence and word length, sentence type, word usage, and sentence openers. The copy editing programs find spelling errors, wordy phrases, bad diction, some punctuation errors, double words, and split infinitives.”

“Computer aids for writers”, Lorinda Cherry, ACM SIGPLAN Notices, April 1981

Lorinda Cherry and Nina McDonald worked on Writer’s Workbench, among other things, in the 1970s at Bell Labs. I wish the utilities that made up Writer’s Workbench were still available and actively developed as free and open source software, maybe via GitHub (all I could find was this discussion on Hacker News).

According to M. Douglas McIlroy, Lorinda Cherry also contributed to another operating system: Plan 9.

Readers curious about the history of computing can learn more about these women in the following online resources:

I think Lorinda Cherry also worked with Ken Knowlton, another important and inspiring historical figure when it comes to computing and innovation in many different fields.

http://ileriseviye.wordpress.com/?p=5707
Truth, correctness and utility: an example from Information Theory
Tags: Books, Math, Data Science, information theory, Mathematics, statistics

I’ve come across the following when doing research on “data processing inequality”:

From page 19 of “Elements of Information Theory”, Second Edition, 2006, Thomas M. Cover and Joy A. Thomas

As is also stated in Scholarpedia’s “Mutual information” article, “Kullback-Leibler divergence is not a true distance: it is not symmetric, and it does not obey the triangle inequality (Cover and Thomas, 1991). It is not hard to show that D_KL(P(z)||Q(z)) is non-negative, and zero if and only if P(z)=Q(z).”
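The asymmetry is easy to check numerically. Below is a small sketch for discrete distributions; the two example distributions are arbitrary choices of mine, and the function assumes Q is non-zero wherever P is non-zero.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # D_KL(P || Q)
backward = kl_divergence(q, p)  # D_KL(Q || P)

if __name__ == "__main__":
    print("D_KL(P || Q) =", forward)
    print("D_KL(Q || P) =", backward)
    print("D_KL(P || P) =", kl_divergence(p, p))
```

Both values are non-negative but differ from each other, and the divergence of a distribution from itself is zero, which matches the properties quoted above.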

I found this a striking example of an expression that is not strictly true, and is mathematically wrong, while the concept remains “useful”, as stated by Cover and Thomas, as long as you are experienced and well aware of what you’re doing.

Further Reading:

http://ileriseviye.wordpress.com/?p=5698
Diacritics restoration: can we do better by using neural networks and deep learning? Perspectives from a 10-year-old open source project
Tags: Linguistics, Programming, python, Science, deasciify, turkish-deasciifier

UPDATE (2025-09-17): Five years later, maybe I should update the title to “Perspectives from a 15-year-old open source project”! OK, let’s get on with an exciting update!

➡ Bülent Arman Aksoy has recently announced https://github.com/armish/nokta-ai, “A lightweight PyTorch-based neural network package for restoring diacritics in Turkish text. nokta-ai can accurately restore Turkish special characters (ç, ğ, ı, ö, ş, ü) from text where they have been removed or replaced with ASCII equivalents.” According to his experiments, a relatively small and simple model trained on an Apple M1 Pro had reasonable accuracy (> 85%). A larger and more complex model trained on an NVIDIA A100 GPU (for less than 24 hours) had a remarkable accuracy of > 99%. I find that result, as well as his write-up, “The Deceptively Complex World of Turkish Diacritics: A Neural Network Journey”, impressive!

That is an expensive way of doing Turkish diacritics restoration (deasciification), especially compared to Deniz Yüret’s original Emacs-based deasciifier, which uses very compact pattern matching (with an accuracy of almost 96%, see below). Even so, Arman’s approach is yet another strong demonstration of using “smart brute force” to solve an interesting language processing problem. Is this another “bitter lesson”? 😉

By the way, Arman also ported what we can now deservedly call the “classical” Turkish Deasciifier to Swift, so that you can run it natively on your MacBook: https://github.com/armish/TurkishDeasciifier, “A macOS menu bar application that converts ASCII Turkish text to proper Turkish characters with diacritics (ç, ğ, ı, İ, ö, ş, ü).” I’m glad to see such developments and to witness the evolution of technology that brings new approaches to an old problem after more than 15 years.

UPDATE (2023-06-14): Now that we’re living in the world of ChatGPT and Large Language Models (LLMs), a software developer, Murat Çorlu, suggested that ChatGPT is very successful at diacritics restoration (deasciification) for Turkish: https://twitter.com/muratcorlu/status/1668335101602848768 He shared his example at https://chat.openai.com/share/3bb666fd-9f35-40df-8efb-9dd0c59bb264.

In order to see if ChatGPT is really the best (see the accuracy benchmark given in “TABLE IV” below), a nice experiment would be to take a validated Turkish corpus, “asciify” it, feed the output to ChatGPT (e.g. via its API), retrieve the “deasciified” output, compare it to the original corpus, and check what percentage of the text matches the original. If the result turns out to be at least 1-2 points higher than 97.06%, we’ll have a clear winner! 😉 Of course, enough care should be taken so that the initial Turkish corpus is not only validated (all diacritics are correct), but also representative of Turkish usage in a lot of domains, including multi-lingual texts, texts with heavy foreign terminology, abbreviations, ambiguities, etc.
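The experiment outlined above could be scripted roughly as follows. This is only a sketch under stated assumptions: the deasciify argument is a placeholder for whatever system is being tested (a call to an LLM API, the Turkish Deasciifier, and so on), and the word-level accuracy used here is my own choice of metric, which may not be the one behind the figures in Table IV.

```python
# Turkish letters with diacritics and their ASCII stand-ins.
ASCII_MAP = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
})


def asciify(text):
    """Strip Turkish diacritics, producing the kind of input a deasciifier gets."""
    return text.translate(ASCII_MAP)


def word_accuracy(reference, restored):
    """Fraction of whitespace-separated words that were restored exactly."""
    ref_words = reference.split()
    out_words = restored.split()
    if len(ref_words) != len(out_words):
        raise ValueError("texts must contain the same number of words")
    if not ref_words:
        return 1.0
    matches = sum(r == o for r, o in zip(ref_words, out_words))
    return matches / len(ref_words)


def evaluate(corpus, deasciify):
    """Asciify the corpus, restore it with the given function, and score it."""
    return word_accuracy(corpus, deasciify(asciify(corpus)))


if __name__ == "__main__":
    # A do-nothing "deasciifier" as a baseline: it leaves the ASCII text as is.
    print(evaluate("Çağrı şöyle güldü", lambda text: text))
```

A baseline that returns its input unchanged gives a floor for the score, which helps put any reported accuracy into context.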

2020-10-22: People who need to write correctly in languages that have letters with various diacritics such as ‘ğ’, ‘ş’, ‘ö’, ‘ı’, etc., can be troubled with US or UK standard QWERTY keyboards because of the lack of such letters on those keyboard layouts. If you also need to switch between languages such as English and Turkish, you know what I mean.

Possible forms of diacritic restoration in Turkish for “aci”. Source: “Diacritic Restoration Using Recurrent Neural Network” by Ayşenur Genç Uzun

The process of taking a piece of writing without correct spelling (one that uses standard ASCII characters, without proper diacritics), and replacing the relevant letters with the correct ones is known as “diacritics restoration”, or “diacritics reconstruction” (or, colloquially, “deASCIIfication”). About 10 years ago, I wrote a Python program to help people with this: Turkish Deasciifier, a port of the Emacs Lisp code developed by Prof. Deniz Yüret. There’s also a web interface at http://turkceyap.appspot.com (I’m currently too busy to fix this broken web app, please use https://deasciifier.com/ instead!)
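To see why this is harder than a simple lookup, consider how many spellings a single ASCII word can stand for. The sketch below enumerates them; it is only an illustration of the ambiguity, not the algorithm used by Turkish Deasciifier (which relies on pattern matching to pick one candidate), and it ignores circumflexed vowels for simplicity.

```python
from itertools import product

# Each ASCII letter that may stand for a Turkish letter with a diacritic.
# Simplification: circumflexed vowels (â, î, û) are ignored.
ALTERNATIVES = {
    "c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü",
    "C": "CÇ", "G": "GĞ", "I": "Iİ", "O": "OÖ", "S": "SŞ", "U": "UÜ",
}


def candidate_restorations(word):
    """Return every spelling obtainable by optionally restoring diacritics."""
    choices = [ALTERNATIVES.get(ch, ch) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(candidate_restorations("aci"))
```

The number of candidates doubles with every ambiguous letter, so a restoration system has to use context to choose among them rather than list them all.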

Ten years later, I still get some feature requests on GitHub, and one of the main themes is adding the capability to update the system’s statistical model based on new language data, so that some of its wrong “deasciifications” can be fixed. Thanks to one such recent request, I’ve learned that another natural language researcher, Ayşenur Genç Uzun, worked on a similar system using recurrent neural networks. She described her work in “Diacritic Restoration Using Recurrent Neural Network” [PDF], and the source code she developed, together with data sets, can be found at https://github.com/aysnrgenc/TurkishDeasciifier. The importance of her approach lies in the fact that a recurrent neural network can be retrained on new data relatively easily, making the results more accurate and making it possible to fix mistakes by retraining the network.

Another interesting fact is the performance comparison of various Turkish diacritics restoration (deASCIIfication) systems given in Table IV at the end of Uzun’s article:

But what does this table say? For me, the key messages are clear:

  • The 10-year-old Turkish Deasciifier, with almost 96% accuracy, is good enough for a lot of practical purposes, but there’s definitely room for improvement.
  • The new, recurrent neural network based approach, without doing any feature engineering, can easily reach 86% accuracy. It can definitely be improved.

The author describes the reasons for the relatively low accuracy, and ways to improve the system, at the end of her article:

“Compared to CNN or MLP, RNN gives better result on sequence to sequence translation. Therefore in this project RNN model is used. Diacritic resolution or deasciifing problem is approached as sequence to sequence translation problem and for a given input sentence (non-normalized, asciified sentence), model translate input sentence to normalized and deasciified sentence. Because lack of more powerful computers, GPUs model is only trained with 3 epoch size on 630K sentences and model achieves 86% accuracy score. In further works, an enhanced RNN model with higher embedding layer and state layer can be trained with more epoch size on larger corpus. This project gives hope about with developments, Turkish diacritic resolution problem can be result with higher accuracy without human effort.”

I see that the author used the DyNet neural network library to train the model, and the primary roadblocks are:

  • Access to more processing power, e.g. a high-end GPU.
  • Bigger data sets.

I wonder if anybody would sponsor and support such research, to see if it would be possible to push the accuracy of such a recurrent neural network based system to more than 97.06%. It might also make sense to see if another neural network / deep learning library such as TensorFlow, PyTorch, or Knet would make training easier, and accuracy higher. If this happens, and if somebody builds an easy-to-install and easy-to-use system that lets us do diacritics restoration with maybe 98% or 99% accuracy, millions of people could use this new system. You can check out some of the advanced research in this section if you want to see the ongoing developments, and decide for yourself if you want to tackle this interesting natural language processing challenge using cutting edge deep learning techniques.

You can read more about the author on the About page of this blog.

http://ileriseviye.wordpress.com/?p=5651
What is Engineering? Perspectives from “The Sciences of the Artificial”
businessManagementProgramlamaSciencecomplexityDesignengineering

If you are an engineer, or an engineering manager responsible for designing software-intensive complex systems, you will find a lot of food for thought in the following quotes from “The Sciences of the Artificial” by Nobel laureate and Turing Award recipient Herbert A. Simon. You might notice that the term ‘software‘ never appears in the following quotations, and the word ‘program‘ is mentioned only twice. Yet the issues, concerns, methods, and line of reasoning proposed by Simon can be used to attack the core challenges facing software engineers working on different systems and in diverse domains. I believe these quotations, as well as most of the rest of the book, deserve a critical and deep reading by generations of engineers.

“There is nothing special that needs to be said here about resource conservation—cost minimization, for example, as a design criterion. Cost minimization has always been an implicit consideration in the design of engineering structures, but until a few years ago it generally was only implicit, rather than explicit. More and more cost calculations have been brought explicitly into the design procedure, and a strong case can be made today for training design engineers in that body of technique and theory that economists know as “cost-benefit analysis.””

“The notion that the costs of designing must themselves be considered in guiding the design process began to take root only as formal design procedures have developed, and it still is not universally applied. An early example, but still a very good one, of incorporating design costs in the design process is the procedure, developed by Marvin L. Manheim as a doctoral thesis at MIT, for solving highway location problems.

Manheim’s procedure incorporates two main notions: first, the idea of specifying a design progressively from the level of very general plans down to determining the actual construction; second, the idea of attaching values to plans at the higher levels as a basis for deciding which plans to pursue to levels of greater specificity.

In the case of highway design the higher-level search is directed toward discovering “bands of interest” within which the prospects of finding a good specific route are promising. Within each band of interest one or more locations is selected for closer examination. Specific designs are then developed for particular locations. The scheme is not limited of course to this specific three-level division, but it can be generalized as appropriate.

Manheim’s scheme for deciding which alternatives to pursue from one level to the next is based on assigning costs to each of the design activities and estimating highway costs for each of the higher-level plans. The highway cost associated with a plan is a prediction of what the cost would be for the actual route if that plan were particularized through subsequent design activity. In other words, it is a measure of how “promising” a plan is. Those plans are then pursued to completion that look most promising after the prospective design costs have been offset against them.

In the particular method that Manheim describes, the “promise” of a plan is represented by a probability distribution of outcomes that would ensue if it were pursued to completion. The distribution must be estimated by the engineer—a serious weakness of the method—but, once estimated, it can be used within the framework of Bayesian decision theory. The particular probability model used is not the important thing about the method; other methods of valuation without the Bayesian superstructure might be just as satisfactory.

In the highway location procedure the evaluation of higher-level plans performs two functions. First, it answers the question, “Where shall I search next?” Second, it answers the question, “When shall I stop the search and accept a solution as satisfactory?” Thus it is both a steering mechanism for the search and a satisficing criterion for terminating the search.”
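The idea of offsetting prospective design costs against the predicted cost of each higher-level plan can be illustrated with a small Python sketch. This is not Manheim's actual procedure: the `Plan` class, the plan names, and all numbers are made up for illustration, and the "promise" of a plan is reduced to the expected value of an engineer-estimated outcome distribution.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Plan:
    """A higher-level plan, e.g. a 'band of interest' (hypothetical model)."""
    name: str
    # Engineer-estimated distribution: (probability, resulting highway cost).
    outcomes: List[Tuple[float, float]]
    # Cost of the design work needed to particularize this plan.
    design_cost: float

    def expected_highway_cost(self) -> float:
        return sum(p * cost for p, cost in self.outcomes)

    def expected_total_cost(self) -> float:
        # Offset the prospective design cost against the plan's promise.
        return self.expected_highway_cost() + self.design_cost


def most_promising(plans: List[Plan]) -> Plan:
    """Pick the plan to pursue to the next level of specificity."""
    return min(plans, key=Plan.expected_total_cost)


plans = [
    Plan("north band", [(0.5, 100.0), (0.5, 140.0)], design_cost=5.0),
    Plan("south band", [(0.25, 80.0), (0.75, 150.0)], design_cost=10.0),
]
best = most_promising(plans)
print(best.name, best.expected_total_cost())  # north band 125.0
```

The south band contains the cheapest possible route (80), but once its probability and the extra design effort are taken into account, the north band is the more promising place to search next.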

“Let us generalize the notion of schemes for guiding search activity beyond Manheim’s specific application to a highway location problem and beyond his specific guidance scheme based on Bayesian decision theory. Consider the typical structure of a problem-solving program. The program begins to search along possible paths, storing in memory a “tree” of the paths it has explored. Attached to the end of each branch—each partial path—is a number that is supposed to express the “value” of that path.

But the term “value” is really a misnomer. A partial path is not a solution of the problem, and a path has a “true” value of zero unless it leads toward a solution. Hence it is more useful to think of the values as estimates of the gain to be expected from further search along the path than to think of them as “values” in any more direct sense. For example it may be desirable to attach a relatively high value to a partial exploration that may lead to a very good solution but with a low probability. If the prospect fades on further exploration, only the cost of the search has been lost. The disappointing outcome need not be accepted, but an alternative path may be taken instead. Thus the scheme for attaching values to partial paths may be quite different from the evaluation function for proposed complete solutions.

When we recognize that the purpose of assigning values to incomplete paths is to guide the choice of the next point for exploration, it is natural to generalize even further. All kinds of information gathered in the course of search may be of value in selecting the next step in search. We need not limit ourselves to valuations of partial search paths.”
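As a rough illustration of the search scheme described in this quotation, here is a minimal best-first search sketch in Python. Each partial path carries an estimate of the gain expected from further search, which is kept separate from the evaluation of complete solutions, and the search stops at the first solution that is good enough. The function name `guided_search`, the toy tree, and all numbers are assumptions made up for the example; Simon gives no code.

```python
import heapq
import itertools
from typing import Callable, Iterable, Optional, Tuple

Path = Tuple[str, ...]


def guided_search(
    start: Path,
    expand: Callable[[Path], Iterable[Path]],
    promise: Callable[[Path], float],
    solution_value: Callable[[Path], Optional[float]],
    aspiration: float,
    max_expansions: int = 100,
) -> Optional[Path]:
    """Explore the most promising partial path first; stop when satisfied."""
    tie = itertools.count()  # tie-breaker so the heap never compares paths
    # heapq is a min-heap, so promises are negated to pop the largest first.
    frontier = [(-promise(start), next(tie), start)]
    for _ in range(max_expansions):
        if not frontier:
            return None  # everything explored, nothing satisfactory
        _, _, path = heapq.heappop(frontier)
        value = solution_value(path)  # None for partial paths
        if value is not None and value >= aspiration:
            return path  # satisficing: accept the first good-enough solution
        for child in expand(path):
            heapq.heappush(frontier, (-promise(child), next(tie), child))
    return None  # search budget exhausted


# Toy search tree: two branches, each with two complete solutions.
TREE = {(): [("a",), ("b",)],
        ("a",): [("a", "x"), ("a", "y")],
        ("b",): [("b", "x"), ("b", "y")]}
# Estimated gain from searching further along each path (not its true value).
PROMISE = {(): 0.0, ("a",): 0.9, ("b",): 0.5,
           ("a", "x"): 0.2, ("a", "y"): 0.4,
           ("b", "x"): 0.8, ("b", "y"): 0.1}
# Evaluation function for complete solutions only.
SOLUTIONS = {("a", "x"): 2.0, ("a", "y"): 4.0,
             ("b", "x"): 8.0, ("b", "y"): 1.0}

result = guided_search((), lambda p: TREE.get(p, []), PROMISE.__getitem__,
                       SOLUTIONS.get, aspiration=7.0)
print(result)  # ('b', 'x')
```

Branch "a" looks most promising at first, but the prospect fades once it is expanded, so the search switches to branch "b" and only the cost of exploring "a" is lost.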

“Thus search processes may be viewed—as they have been in most discussions of problem solving—as processes for seeking a problem solution. But they can be viewed more generally as processes for gathering information about problem structure that will ultimately be valuable in discovering a problem solution. The latter viewpoint is more general than the former in a significant sense, in that it suggests that information obtained along any particular branch of a search tree may be used in many contexts besides the one in which it was generated.”

The Shape of the Design: Hierarchy

“In my first chapter I gave some reasons why complex systems might be expected to be constructed in a hierarchy of levels, or in a boxes-within-boxes form. The basic idea is that the several components in any complex system will perform particular subfunctions that contribute to the overall function. Just as the “inner environment” of the whole system may be defined by describing its functions, without detailed specification of its mechanisms, so the “inner environment” of each of the subsystems may be defined by describing the functions of that subsystem, without detailed specification of its submechanisms.

To design such a complex structure, one powerful technique is to discover viable ways of decomposing it into semi-independent components corresponding to its many functional parts. The design of each component can then be carried out with some degree of independence of the design of others, since each will affect the others largely through its function and independently of the details of the mechanisms that accomplish the function.

There is no reason to expect that the decomposition of the complex design into functional components will be unique. In important instances there may exist alternative feasible decompositions of radically different kinds. This possibility is well known to designers of administrative organizations, where work can be divided up by subfunctions, by subprocesses, by subareas, and in other ways. Much of classical organization theory in fact was concerned precisely with this issue of alternative decompositions of a collection of interrelated tasks.”

Process as a Determinant of Style

“… When we come to the design of systems as complex as cities, or buildings, or economies, we must give up the aim of creating systems that will optimize some hypothesized utility function, and we must consider whether differences in style of the sort I have just been describing do not represent highly desirable variants in the design process rather than alternatives to be evaluated as “better” or “worse.” Variety, within the limits of satisfactory constraints, may be a desirable end in itself, among other reasons, because it permits us to attach value to the search as well as its outcome—to regard the design process as itself a valued activity for those who participate in it.

We have usually thought of city planning as a means whereby the planner’s creative activity could build a system that would satisfy the needs of a populace. Perhaps we should think of city planning as a valuable creative activity in which many members of a community can have the opportunity of participating—if we have wits to organize the process that way. I shall have more to say on these topics in the next chapter.

However that may be, I hope I have illustrated sufficiently that both the shape of the design and the shape and organization of the design process are essential components of a theory of design.”

Extra Reading

If you found the previous quotations useful, and want to explore more about these topics, in addition to Simon’s books, you might also be interested in the following books and papers, written by very experienced engineers, scientists, and managers:

About the author: Emre is the co-founder & CTO of TM Data ICT Solutions in Belgium. You can read more about him on the About page of this blog.

http://ileriseviye.wordpress.com/?p=5610