Dominique Dumont's Blog

Part of wordpress.com

Thoughts about programming, sysadmin, Perl, Debian ...

Drawbacks of using Cookiecutter with Cruft
Tags: software, cookiecutter, template

Hi

Cookiecutter is a tool for building coding project templates. It’s often used to provide scaffolding for building lots of similar projects. I’ve seen it used to create Symfony projects and several cloud infrastructures deployed with Terraform. This tool was useful to accelerate the creation of new projects. 🏃

Since these templates were bound to evolve, the teams providing these templates relied on cruft to update the code provided by the template in their users’ code. In other words, they wanted their users to apply a diff of the template modifications to their code.

At the beginning, all was fine. But problems began to appear during the lifetime of these projects.

What went wrong ?

In both cases, we had the following scenario:

  • user team:
    • 🙂 creates new project with cookiecutter template
    • 😏 makes modification on their code, including on code provided by template
  • meanwhile, provider team:
    • 😏 makes modifications to cookiecutter template
    • 🙂 releases new template version
    • 🙂 asks its users to update code brought by template using cruft
  • user team then:
    • 🤨 runs cruft to update template code
    • 😵‍💫 discovers a lot of code conflicts (similar to git merge conflicts)
    • 🤮 often rolls back cruft update and gives up on template update

User teams giving up on updates is a major problem because these updates may bring security or compliance fixes. 🚨

Note that code conflicts seen with cruft are similar to git merge conflicts, but harder to resolve because, unlike with a git merge, there’s no common ancestor, so 3-way merges are not possible.

From an organisation point of view, the main problem is the ambiguous ownership of the functionalities brought by template code: who owns this code? The provider team who writes the template, or the user team who owns the repository of the code generated from the template? Conflicts are bound to happen. ⛐

Possible solutions to get out of this tar pit:

  • Assume that templates are one-shot. Template updates are not practical in the long run.
  • Make sure that templates are as thin as possible. They should contain minimal logic.
  • Move most if not all logic into separate libraries or scripts that are owned by the provider team. This way, updates coming from the provider team can be managed like external dependencies, by upgrading the version of a dependency.

Of course your users won’t be happy to be faced with a manual migration from the old big template to the new one with external dependencies. On the other hand, this may be easier to sell than updates based on cruft since the painful work will happen once. Further updates will be done by incrementing dependency versions (which can be automated with renovate).

If many projects are to be created with this template, it may be more practical to provide a CLI that will create a skeleton project. See for instance the terragrunt scaffold command.

My name is Dominique Dumont, I’m a freelance DevOps engineer. You can find the DevOps and audit services I offer on my website or reach out to me on LinkedIn.

All the best

ddumont
http://ddumont.wordpress.com/?p=1096
Azure API throttling strikes back
Tags: kubernetes, API, azure

Hi

In my last blog post, I explained how we resolved a throttling issue involving the Azure storage API. In the end, I mentioned that I was not sure of the root cause of the throttling issue.

Even though we no longer had any problem in the dev and preprod clusters, we still faced throttling issues in prod. The main difference between prod and the other environments is that we have about 80 PVs in prod versus 15 in the others. Given that we manage 1500 pods in prod, 80 PVs does not look like a lot. 🤨

To continue the investigation, I’ve modified k8s-scheduled-volume-snapshotter to limit the number of snapshots done in a single cron run (see the add maxSnapshotCount parameter pull request).

In prod, we used the modified snapshotter to trigger snapshots one by one.

Even with all previous snapshots cleaned up, we could not trigger a single new snapshot without being throttled 🕳. I guess that, in the cron job, just checking the list of PVs to snapshot was enough to exhaust our API quota. 😒

The Azure documentation mentions that a token bucket algorithm is used for throttling. A full bucket holds tokens for 250 API calls, and the bucket gets 25 new tokens per second. Looks like that’s not enough. 🐌
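To get a feel for these numbers, here is a minimal Python sketch of a bucket of tokens like the one described above. The 250-token capacity and the refill of 25 tokens per second come from the figures quoted in this post. The class name, the function name and the one-second simulation step are my own assumptions for illustration; this is not Azure's implementation nor any Azure API.

```python
class TokenBucket:
    """Toy model of the throttling bucket (assumed names, not an Azure API)."""

    def __init__(self, capacity=250, refill_per_second=25):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # the bucket starts full

    def try_call(self):
        """Consume one token; return False when the call would be throttled."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def advance(self, seconds):
        """Refill the bucket for the elapsed time, capped at its capacity."""
        self.tokens = min(self.capacity,
                          self.tokens + seconds * self.refill_per_second)


def simulate(calls_per_second, duration_seconds, bucket=None):
    """Return how many calls get throttled at a constant call rate."""
    if bucket is None:
        bucket = TokenBucket()
    throttled = 0
    for _ in range(duration_seconds):
        for _ in range(calls_per_second):
            if not bucket.try_call():
                throttled += 1
        bucket.advance(1)  # one second passes, 25 tokens come back
    return throttled


if __name__ == "__main__":
    for rate in (25, 100):
        print(f"{rate} calls/s for 10 s: {simulate(rate, 10)} throttled calls")
```

In this model, a client that stays at or below 25 calls per second is never throttled, while a burst above that rate drains the 250 tokens within a few seconds and every extra call is then rejected until the rate drops.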

I was puzzled 😵‍💫 and out of ideas 😶.

I looked for similar problems in AKS issues on GitHub, where I found this comment that recommends using the useDataPlaneAPI parameter in the CSI file driver. That was it! 😃

I was flabbergasted 🤯 by this parameter: why is the CSI file driver able to use 2 APIs? Why is one of them so limited? And more importantly, why is the limited API the default one?

Anyway, setting useDataPlaneAPI: "true" in our VolumeSnapshotClass manifest was the right solution. This indeed solved the throttling issue in our prod cluster. ⚕

But not the snapshot issue 😑. Amongst the 80 PVs, I still had 2 snapshots failing. 🦗

Fortunately, the error was mentioned in the description of the failed snapshots: we had too many (200) snapshots for these shared volumes.

What ?? 😤 All these snapshots were cleaned up last week.

I then tried to delete these snapshots through the Azure console. But the console failed to delete these snapshots due to API throttling. Looks like the Azure console is not using the right API. 🤡

Anyway, I went back to the solution explained in my previous blog post: I listed all snapshots with the az command. I indeed had a lot of snapshots, many of them dated Jan 19 and 20. There was often a new bogus snapshot created every minute.

These were created during the first attempt at fixing the throttling issue. I guess that even though the CSI file driver was throttled, a snapshot was still created in the storage account, but the CSI driver did not see it and retried a minute later 💥. What a mess.

Anyway, I’ve cleaned up these bogus snapshots again 🧨, and now, snapshot creation is working fine 🤸🏻‍♂️.

For now.

All the best.

ddumont
http://ddumont.wordpress.com/?p=1088
How we solved storage API throttling on our Azure Kubernetes clusters
Tags: kubernetes, AKS, azure, troubleshooting

Hi

This issue was quite puzzling, so I’m sharing how we investigated it. I hope it can be useful for you.

My client informed me that he was no longer able to install new instances of his application.

k9s showed that only some pods could not be created: the ones that created a persistent volume (PV). The description of these pods showed an HTTP 429 error: new PVCs could not be created because we were throttled by the Azure storage API.

This issue was confirmed by the Azure diagnostic console on Kubernetes (menu “Diagnose and solve problems” → “Cluster and Control Plane Availability and Performance” → “Azure Resource Request Throttling”).

We had a lot of throttling:

2025-01-18_11-01-k8s-throttles.png

Which were explained by the high call rate:

2025-01-18_11-01-k8s-calls.png

The first clue was found at the bottom of Azure diagnostic page:

2025-01-18_11-27-throttles-by-user-agent.png

According to this page, the throttled calls come from services whose user agent is:

Go/go1.23.1 (amd64-linux) go-autorest/v14.2.1 Azure-SDK-For-Go/v68.0.0
storage/2021-09-01microsoft.com/aks-operat azsdk-go-armcompute/v1.0.0 (go1.22.3; linux)

The main information is Azure-SDK-For-Go, which means the program making all these calls to the storage API is written in Go. All our services are written in TypeScript or Rust, so they are not suspects.

That leaves the controllers running in the kube-system namespace. I could not find anything suspicious in the logs of these services.

At that point I was convinced that a component in Kubernetes control plane was making all those calls. Unfortunately, AKS is managed by Microsoft and I don’t have access to the control plane logs.

However, we realized that we had quite a lot of volumesnapshots created in our clusters by k8s-scheduled-volume-snapshotter:

  • about 1800 on dev instead of 240
  • 1070 on preprod instead of 180
  • 6800 on prod instead of 2400

We suspected that the Kubernetes reconciliation loop was throttled when checking the status of all these snapshots. Maybe so, but we also had the same issues and throttle rates on preprod and prod, where the numbers of snapshots were quite different.

We tried to get more information using Azure console on our snapshot account, but it was also broken by the throttling issue.

We were so puzzled that we decided to try Léodagan’s advice (tout crâmer pour repartir sur des bases saines, loosely translated as “burn everything down to start from scratch”) and we destroyed 🧨 our dev cluster piece by piece while checking if the throttling stopped.

First, we removed all our applications, no change. 😐

Then, all ancillary components like rabbitmq, cert-manager were removed, no change. 😶

Then, we tried to remove the namespace containing our applications. But we faced another issue: Kubernetes was unable to remove the namespace because it could not destroy some PVCs and volumesnapshots. 🧐 That was actually good news, because it meant that we were close to the actual issue. 🤗

🪓 We managed to destroy the PVC and volumesnapshots by removing their finalizers. Finalizers are some kind of markers that tell kubernetes that something needs to be done before actually deleting a resource.

The finalizers were removed with a command like:

kubectl patch volumesnapshots ${volumesnapshot} \
  -p '{"metadata":{"finalizers":null}}'  --type merge

Then, we got the first progress 🎉: the throttling and high call rate stopped on our dev cluster.

To make sure that the snapshots were the issue, we re-installed the ancillary components and our applications. Everything was copacetic. 👌🏻

So, the problem was indeed with PVC and snapshots.

Even though we have backups outside of Azure, we weren’t really thrilled at trying Léodagan’s method 💥 on our prod cluster…

So we looked for a better fix to try on our preprod cluster. 🧐

⛏️ Poking around in PVCs and volumesnapshots, I finally found this error message in the description of a volumesnapshotcontent:

Code="ShareSnapshotCountExceeded" Message="The total number of snapshots
for the share is over the limit."

The number of snapshots found in our cluster was not that high. So I wanted to check the snapshots present in our storage account using Azure console, which was still broken. ⚰️

Fortunately, Azure CLI is able to retry HTTP calls when getting 429 errors. I managed to get a list of snapshots with

az storage share list --account-name [redacted] --include-snapshots \
    | tee preprod-list.json

There, I found a lot of snapshots dating back to 2024. These were no longer managed by Kubernetes and should have been cleaned up. That was our smoking gun.
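As an illustration, the listing saved by the az command above can be summarised per share with a few lines of Python. This is only a sketch: it assumes that each entry of the JSON list has a "name" field and a "snapshot" field that is null for the live share and holds a timestamp for a snapshot. The function names and the default limit of 200 (the count reported in the error described in my later post) are my own choices.

```python
import json
from collections import Counter


def snapshots_per_share(listing_json):
    """Count snapshot entries per share name in the saved listing."""
    counts = Counter()
    for entry in json.loads(listing_json):
        # Assumption: "snapshot" is null for the share itself and a
        # timestamp string for each of its snapshots.
        if entry.get("snapshot"):
            counts[entry["name"]] += 1
    return counts


def shares_over_limit(counts, limit=200):
    """Return the sorted names of shares holding at least `limit` snapshots."""
    return sorted(name for name, count in counts.items() if count >= limit)


if __name__ == "__main__":
    with open("preprod-list.json") as listing:
        counts = snapshots_per_share(listing.read())
    for name, count in counts.most_common(10):
        print(f"{count:5d} {name}")
```

The same count could feed an alert on the number of snapshots per volume.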

I guess that we had a chain of events like:

  • too many snapshots in some volumes
  • Kubernetes control plane tries to reconcile its internal status with Azure resources and frequently retries snapshot creation
  • API throttling kicks in
  • client not happy ☹️

To make things worse, k8s-scheduled-volume-snapshotter creates new snapshots when it cannot list the old ones. So we had 4 new snapshots per day instead of one. 🌊

Since we had the chain of events, fixing the issue was not too difficult (but quite long 😵‍💫):

  1. stop k8s-scheduled-volume-snapshotter by disabling its cron job
  2. delete all volumesnapshots and volumesnapshotcontents from k8s.
  3. since Azure API was throttled, we also had to remove their finalizers
  4. delete all snapshots from azure using az command and a Perl script (this step took several hours)
  5. re-enable k8s-scheduled-volume-snapshotter

After these steps, preprod was back to normal. 🎯 I’m now applying the same recipe on prod. 💊

We still don’t know why we had all these stale snapshots. It may have been a human error or a bug in k8s-scheduled-volume-snapshotter.

Anyway, to avoid this problem in the future, we will:

  • set up an alert on the number of snapshots per volume
  • check with the author of k8s-scheduled-volume-snapshotter how to better cope with throttling

My name is Dominique Dumont, I’m a freelance DevOps engineer. You can find the DevOps and audit services I offer on my website or reach out to me on LinkedIn.

All the best

ddumont
http://ddumont.wordpress.com/?p=1069
cme: new field in fill.copyright.blanks.yml for Debian copyright file
Tags: Config::Model, Debian, packaging, EmacsLisp

Hi

The file fill.copyright.blanks.yml is used to fill missing copyright information when running cme update dpkg-copyright. This file can contain a comment field that is used for book-keeping.

Here’s an example from libuv1:

README.md:
  comment: |-
    the license from this file is used as a main license and tends to
    apply expat or CC to all files. Which is wrong. Let's skip this file
    and let cme retrieve data from files.
  skip: true

You may ask: why not use YAML comments? The problem is that YAML comments are dropped by cme edit dpkg. So you should not use them in fill.copyright.blanks.yml.

It occurred to me that it may be interesting to copy the content of this comment into debian/copyright file entries. But not in all cases, as some comments make sense in fill.copyright.blanks.yml but not in debian/copyright.

So I’ve added a new forwarded-comment parameter in fill.copyright.blanks.yml. The content of this field is copied verbatim in debian/copyright.

This way, you can add comments for book-keeping and comments for debian/copyright entries.

For instance:

pan/gui/*:
  forwarded-comment: some comment about gui
  comment: this is an example from cme test files

yields:

Files: pan/gui/*
Copyright: 1989, 1991, Free Software Foundation, Inc.
License: GPL-2
Comment: some comment about gui

This new functionality is available in libconfig-model-dpkg-perl >= 3.008.

All the best

ddumont
http://ddumont.wordpress.com/?p=1062
New cme command to update Debian Standards-Version field
Tags: Config::Model, Debian, packaging

Hi

While updating my Debian packages, I often have to update a field in the debian/control file.

This field is named Standards-Version and it declares which version of Debian policy the package complies with. When updating this field, one must follow the upgrading checklist.

That being said, I maintain a lot of similar packages and I often have to update this Standards-Version field.

This field can be updated manually with cme fix dpkg (see Managing Debian packages with cme). But this command may make other changes and does not commit the result.

So I’ve created a new update-standards-version cme script that:

  • updates the Standards-Version field
  • commits the change

For instance:

$ cme run update-standards-version 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Connecting to api.ftp-master.debian.org to check 31 package versions. Please wait...
Got info from api.ftp-master.debian.org for 31 packages.
Warning in 'source Standards-Version': Current standards version is '4.7.0'. Please read https://www.debian.org/doc/debian-policy/upgrading-checklist.html for the changes that may be needed on your package
to upgrade it from standard version '4.6.2' to '4.7.0'.

Offending value: '4.6.2'

Changes applied to dpkg-control configuration:
- source Standards-Version: '4.6.2' -> '4.7.0'
[master 552862c1] control: declare compliance with Debian policy 4.7.0
 1 file changed, 1 insertion(+), 1 deletion(-)

Here’s the generated commit. Note that the generated log mentions the new policy version:

$ git show
commit 552862c1f24479b1c0c8c35a6289557f65e8ff3b (HEAD -> master)
Author: Dominique Dumont <dod[at]debian.org>
Date:   Sat Dec 7 19:06:14 2024 +0100

    control: declare compliance with Debian policy 4.7.0

diff --git a/debian/control b/debian/control
index cdb41dc0..e888012e 100644
--- a/debian/control
+++ b/debian/control
@@ -48,7 +48,7 @@ Build-Depends-Indep: dh-sequence-bash-completion,
                      libtext-levenshtein-damerau-perl,
                      libyaml-tiny-perl,
                      po-debconf
-Standards-Version: 4.6.2
+Standards-Version: 4.7.0
 Vcs-Browser: https://salsa.debian.org/perl-team/modules/packages/libconfig-model-perl
 Vcs-Git: https://salsa.debian.org/perl-team/modules/packages/libconfig-model-perl.git
 Homepage: https://github.com/dod38fr/config-model/wiki

Notes:

  • this script can run only if there are no pending changes. Please commit or stash any pending changes before running this script.
  • this script requires:
    • cme >= 1.041
    • libconfig-model-perl >= 2.155
    • libconfig-model-dpkg-perl >= 3.006

I hope this will be useful to all my fellow Debian developers to reduce the boring parts of packaging activities.

All the best

ddumont
http://ddumont.wordpress.com/?p=1058
How I investigated connection hogs on Kubernetes
Tags: Uncategorized, EmacsLisp

Hi

My name is Dominique Dumont, DevOps freelancer in Grenoble, France.

My goal is to share my experience regarding a production issue that occurred last week, where my client complained that the application was very slow and sometimes showed 5xx errors. The production service is hosted on a Kubernetes cluster on Azure and uses a MongoDB database hosted on ScaleGrid.

I reproduced the issue on my side and found that the API calls were randomly failing due to timeouts on server side.

The server logs were showing some MongoDB disconnections and reconnections and some timeouts on MongoDB connections, but did not give any clue as to why some connections to the MongoDB server were failing.

Since there was no clue in the cluster logs, I looked at ScaleGrid monitoring. There were about 2500 connections on MongoDB:

2022-07-19-scalegrid-connection-leak.png

That seemed quite a lot given the low traffic at that time, but not necessarily a problem.

Then, I went to the Azure console, and I got the first hint about the origin of the problem: the SNAT ports were exhausted on some nodes of the cluster.

2022-07-28_no-more-free-snat.png

SNAT ports are involved in connections from the cluster to the outside world, i.e. to our MongoDB server, and are quite limited: only 1024 SNAT ports are available per node. This was consistent with the number of used connections on MongoDB.

OK, then the number of used connections on MongoDB was a real problem.

The next question was: which pods and how many connections ?

First I had to filter out the pods that did not use MongoDB. Fortunately, all our pods have labels so I could list all pods using MongoDB:

$ kubectl -n prod get pods -l db=mongo | wc -l
236

Hmm, still quite a lot.

The next problem was to find which pods used too many MongoDB connections. Unfortunately, the logs mentioned that a connection to MongoDB was opened, but that did not give a clue as to how many were used.

Netstat is not installed on the pods, and cannot be installed since the pods are not running as root (which is a good idea for security reasons).

Then, my Debian Developer experience kicked in and I remembered that /proc file system on Linux gives a lot of information on consumed kernel resources, including resources consumed by each process.

The trick is to know the PID of the process using the connections.

In our case, the Dockerfiles are written so that the main NodeJS process of a pod has PID 1. So the command to list the connections of a pod is:

$ kubectl -n prod exec redacted-pod-name-69875496f8-8bj4f -- cat /proc/1/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
   0: AC00F00A:C9FA C2906714:6989 01 00000000:00000000 02:00000DA9 00000000  1001        0 376439162 2 0000000000000000 21 4 0 10 -1                 
   1: AC00F00A:CA00 C2906714:6989 01 00000000:00000000 02:00000E76 00000000  1001        0 376439811 2 0000000000000000 21 4 0 10 -1                 
   2: AC00F00A:8ED0 C2906714:6989 01 00000000:00000000 02:000004DA 00000000  1001        0 445806350 2 0000000000000000 21 4 30 10 -1                
   3: AC00F00A:CA02 C2906714:6989 01 00000000:00000000 02:000000DD 00000000  1001        0 376439812 2 0000000000000000 21 4 0 10 -1                 
   4: AC00F00A:C9FE C2906714:6989 01 00000000:00000000 02:00000DA9 00000000  1001        0 376439810 2 0000000000000000 21 4 0 10 -1                 
   5: AC00F00A:8760 C2906714:6989 01 00000000:00000000 02:00000810 00000000  1001        0 375803096 2 0000000000000000 21 4 0 10 -1                 
   6: AC00F00A:C9FC C2906714:6989 01 00000000:00000000 02:00000DA9 00000000  1001        0 376439809 2 0000000000000000 21 4 0 10 -1                 
   7: AC00F00A:C56C C2906714:6989 01 00000000:00000000 02:00000DA9 00000000  1001        0 376167298 2 0000000000000000 21 4 0 10 -1                 
   8: AC00F00A:883C C2906714:6989 01 00000000:00000000 02:00000734 00000000  1001        0 375823415 2 0000000000000000 21 4 30 10 -1 

OK, that’s less appealing than netstat output. The trick is that rem_address and port are expressed in hexadecimal. A quick calculation confirms that port 0x6989 is indeed port 27017, which is the listening port of the MongoDB server.

So the number of opened MongoDB connections is given by:

$ kubectl -n prod exec redacted-pod-name-69875496f8-8bj4f -- cat /proc/1/net/tcp | grep :6989 | wc -l
9
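For readers who prefer not to do the hexadecimal conversion by hand, here is a small Python sketch that decodes the address columns of /proc/net/tcp and counts the established connections to a given remote port. The function names are mine. The sketch assumes IPv4 entries dumped on a little-endian machine such as amd64, where the address bytes appear in reverse order, and it relies on state code 01 meaning ESTABLISHED.

```python
import socket
import struct


def decode_endpoint(hex_endpoint):
    """Turn 'C2906714:6989' into an (ip, port) tuple."""
    hex_ip, hex_port = hex_endpoint.split(":")
    # Assumption: little-endian host, so the 32-bit address is byte-swapped
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def count_connections(proc_net_tcp, remote_port, state="01"):
    """Count the sockets connected to remote_port (state 01 = ESTABLISHED)."""
    count = 0
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Skip the header line and blank lines: data rows start with 'N:'
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        _, port = decode_endpoint(fields[2])  # rem_address column
        if port == remote_port and fields[3] == state:
            count += 1
    return count


if __name__ == "__main__":
    with open("/proc/1/net/tcp") as table:
        print(count_connections(table.read(), 27017))
```

Unlike the grep command above, this only looks at the remote port column, so a local socket that happens to use port 0x6989 is not counted.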

What’s next ?

The ideal solution would be to fix the NodeJS code to handle correctly the termination of the connections, but that would have taken too long to develop.

So I’ve written a small Perl script to:

  • list the pods using MongoDB using kubectl -n prod get pods -l db=mongo
  • find the pods using more than 10 connections using the kubectl exec command shown above
  • compute the deployment name of these pods (which was possible given the naming convention used with our pods and deployments)
  • restart the deployment of these pods with a kubectl rollout restart deployment command
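The selection logic of the steps above can be sketched as follows. My actual script is written in Perl; this Python version is only an illustration. It assumes that pod names follow the usual `<deployment>-<replicaset hash>-<pod suffix>` pattern, and the function names and the `connections_by_pod` input (pod name mapped to its connection count) are hypothetical.

```python
def deployment_name(pod_name):
    """Strip the replicaset hash and pod suffix from a pod name."""
    # Assumption: <deployment>-<replicaset hash>-<pod suffix>
    return pod_name.rsplit("-", 2)[0]


def deployments_to_restart(connections_by_pod, max_connections=10):
    """Return the sorted, de-duplicated deployments owning greedy pods."""
    return sorted({
        deployment_name(pod)
        for pod, count in connections_by_pod.items()
        if count > max_connections
    })


def restart_commands(deployments, namespace="prod"):
    """Build one 'kubectl rollout restart' command line per deployment."""
    return [
        ["kubectl", "-n", namespace, "rollout", "restart", "deployment", name]
        for name in deployments
    ]


if __name__ == "__main__":
    counts = {"api-69875496f8-8bj4f": 9, "api-69875496f8-x2k9q": 42}
    for command in restart_commands(deployments_to_restart(counts)):
        print(" ".join(command))
```

Each deployment is listed once even when several of its pods exceed the threshold, so a deployment is restarted only once per run.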

Why restart a deployment instead of simply deleting the gluttonous pods? I wanted to avoid downtime if all pods of a deployment were to be killed. There’s no downtime when applying rollout restart command on deployments.

This script is now run regularly until the connections issue is fixed for good in NodeJS code. Thanks to this script, there’s no need to rush a code modification.

All in all, working around this connection issue was made somewhat easier thanks to:

  • the monitoring tools provided by the hosting services.
  • a good knowledge of Linux internals
  • consistent labels on our pods
  • the naming conventions used for our kubernetes artifacts
ddumont
http://ddumont.wordpress.com/?p=1046
Important bug fix for OpenSsh cme config editor
Tags: Config::Model, Perl, Uncategorized, cme, config-model, OpenSsh

The new release of Config::Model::OpenSsh fixes a bug that impacted experienced users: the order of Host or Match sections is now preserved when writing back the ~/.ssh/config file.

Why does this matter ?

Well, the beginning of the ssh_config man page mentions that “For each parameter, the first obtained value will be used.” and “Since the first obtained value for each parameter is used, more host-specific declarations should be given near the beginning of the file, and general defaults at the end.”

Looks like I missed these statements when I designed the model for OpenSsh configuration: the Host sections were written back in a neat, but wrong, alphabetical order.

This does not matter except when there is an overlap between the specifications of the Host (or Match) sections, like in the example below:

Host foo.company.com
Port 22

Host *.company.com
Port 10022

With this example, ssh connection to “foo.company.com” is done using port 22 and connection to “bar.company.com” with port 10022.

If the Host sections are written back in reverse order:

Host *.company.com
Port 10022

Host foo.company.com
Port 22

Then, ssh would be happy to use the first matching section for “foo.company.com”, i.e. “*.company.com”, and would use the wrong port (10022).
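The effect of section order can be reproduced with a few lines of Python. This is a deliberately simplified model, not the way ssh or Config::Model::OpenSsh is implemented: it handles a single parameter (Port), treats Host patterns as plain shell-style globs, and ignores Match sections and negated patterns. The `resolve_port` name and its default of 22 are my own assumptions.

```python
from fnmatch import fnmatchcase


def resolve_port(sections, hostname, default=22):
    """Return the Port of the first Host section matching hostname.

    `sections` is an ordered list of (host_pattern, port) pairs,
    in the same order as in the configuration file.
    """
    for pattern, port in sections:
        if fnmatchcase(hostname, pattern):
            return port  # first obtained value wins
    return default


correct_order = [("foo.company.com", 22), ("*.company.com", 10022)]
alphabetical_order = [("*.company.com", 10022), ("foo.company.com", 22)]

if __name__ == "__main__":
    for name, sections in (("correct", correct_order),
                           ("alphabetical", alphabetical_order)):
        port = resolve_port(sections, "foo.company.com")
        print(f"{name} order: foo.company.com uses port {port}")
```

With the first list, foo.company.com gets port 22. With the second one, the wildcard section wins and foo.company.com gets port 10022, which is the bug described above.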

This is now fixed with Config::Model::OpenSsh 2.8.4.3, which is available on CPAN and in Debian/experimental.

While I was at it, I’ve also updated the Managing OpenSsh configuration with cme wiki page.

All the best

ddumont
http://ddumont.wordpress.com/?p=1032
An improved GUI for cme and Config::Model
Tags: Uncategorized, cme, Config::Model, Perl

I’ve finally found the time to improve the GUI of my pet project: cme (aka Config::Model).

Several years ago, I stumbled on a usability problem in the GUI. Some configurations (like OpenSsh or Systemd) feature a lot of configuration parameters. This means that the GUI displays all these parameters, so finding a specific parameter might be challenging:

To work around this problem, I’ve added a Filter widget in 2018 which did more or less the job, but it suffered from several bugs which made its behavior confusing.

This is now fixed. The Filter widget is now working in a more consistent way.

In the example below, I’ve typed “IdentityFile” (1) in the Filter widget to show the identityFile used for various hosts (2):

Which is quite good, but some hosts use the default identity file so no value shows up in the GUI. You can then click on the “hide empty value” checkbox to show only the hosts that use a specific identity file:

I hope that this new behavior of the Filter box will make this project more useful.

The improved GUI was released with Config::Model::TkUI 1.374. This new version is available on CPAN and on Debian/experimental. It will be released on Debian/unstable once the next Debian version is out.

All the best

ddumont
http://ddumont.wordpress.com/?p=1019
Security gotcha with log collection on Azure Kubernetes cluster.
Tags: computer, azure, kubernetes, security

Azure Kubernetes Service provides a nice way to set up a Kubernetes
cluster in the cloud. It’s quite practical as AKS is set up by default
with a rich monitoring and reporting environment. By default, all
container logs are collected, CPU and disk data are gathered. 👍

I used AKS to set up a cluster for my first client as a
freelancer. Everything was nice until my client asked me why log
collection was as expensive as the compute resources. 💸

Ouch… 🤦

My first reflex was to reduce the amount of logs produced by all our
containers, i.e. start logging at warn level instead of info
level. This reduced the amount of logs quite a lot.

But this did not reduce the cost of collecting logs, which looks
to be a common issue.

Thanks to the documentation provided by Microsoft, I was able to find
that the ContainerInventory data table was responsible for more than 60%
of our logging costs.

What is ContainerInventory ? It’s a facility to monitor the content
of all environment variables from all containers.

Wait… What ? ⚠

Should we be worried about our database credentials which are, legacy
oblige, stored in environment variables ?

Unfortunately, the query shown below confirmed that, yes, we should:
the logs aggregated by Azure contain the database credentials of my
client.

ContainerInventory
| where TimeGenerated > ago(1h)

Having credentials collected in logs is lackluster from a security
point of view. 🙄

And we don’t need it because our environment variables do not change.

Well, it’s now time to fix these issues. 🛠

We’re going to:

  1. disable the collection of environment variables in Azure, which
    will reduce cost and plug the potential credential leak
  2. renew all DB credentials, because the previous credentials can be
    considered as compromised (The renewal of our DB passwords is quite
    easy with the script I provided to my client)
  3. pass credentials with files instead of environment variables.

In summary, the service provided by Azure is still nice, but beware of
the default configuration which may contain surprises.

I’m a freelancer, available for hire. The https://code-straight.fr site
describes how I can help your projects.

All the best

 

ddumont
http://ddumont.wordpress.com/?p=1014
How to run CEWE photo creator on Debian
Tags: computer, Debian, debian

Hi

This post describes how I debug an issue with a proprietary software. I hope this will give you some hint on how to proceed should you face a similar issue. If you’re in a hurry, you can read the TL;DR; version at the end.

After the summer vacations, I decided to offer a photo-book to my mother. I searched for an open-source solution, but the printed results were lackluster.

Unfortunately, the only possible solution was to use a professional service. Some of these services offer a web application to create photo books, but this is painful to use on a slow DSL line. Other services provide a program named CEWE. This proprietary program can be downloaded for Windows, Mac and, lo and behold: Linux !

The initial download is quite fast because the downloaded file is only a Perl script, which then performs the actual download. I would have preferred a proper Debian package, but at least Linux amd64 is supported.

Once installed, the CEWE program is available as an executable and a bunch of shared libraries.

This program works quite well to create a photo album. I won’t go into the details here.

I ran into trouble when trying to connect the application to the service site to order the photo-book: the connection failed with a cryptic message “error code 10000”.

Commercial support was not much help as they insisted that I check my proxy settings. I downloaded CEWE again from another photo service. The new CEWE installation gave me the same error. This suggested that the issue was on my side and not on the server’s side.

Given that the error occurred quite fast when trying to connect, I guessed that the connection setup was going south. Since the URL shown in the installation script began with https, I had to check for SSL issues.

I checked for certificate issues: curl had no problem connecting to the server mentioned in the Perl script. Wireshark showed that the connection was reset by the server very quickly. I wondered which version of SSL was used by CEWE and ran ldd on it. To my surprise, I found that ldd did not list libssl. Something weird was going on: SSL was required but CEWE was not linked to libssl…
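The ldd check can be scripted. The following Python sketch is only an illustration: the helper names `parse_ldd` and `links_against` are made up for this example, and the binary path in the usage comment is hypothetical. It runs ldd and tells whether a library whose name starts with a given prefix is listed among the dynamic dependencies.

```python
import os
import subprocess


def parse_ldd(output: str) -> list[str]:
    """Return the base names of the libraries listed in ldd output."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A line looks like "libfoo.so.1 => /path/libfoo.so.1 (0x...)"
        # or "/lib64/ld-linux-x86-64.so.2 (0x...)": keep the first field.
        names.append(os.path.basename(fields[0]))
    return names


def links_against(binary: str, prefix: str) -> bool:
    """Tell whether ldd lists a library whose name starts with prefix."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return any(name.startswith(prefix) for name in parse_ldd(result.stdout))


# Hypothetical usage (the path is an assumption, adapt it to your install):
#   links_against("/path/to/cewe-binary", "libssl")
```

Note that a negative answer only means the library is not linked at build time. It says nothing about libraries loaded later at run time.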

I used another trick: explore all the menus of the application. This was a good move as I found a checkbox to enable debug reports in CEWE in the “Options -> paramètres -> Service” menu (that may be “options -> parameters -> support” in English CEWE). When set, debug traces are also shown on CEWE’s standard output.

And, somewhere in the debug traces, I found:

W (2018-10-30T18:36:37.143) [ 0] ==> QSslSocket: cannot resolve SSLv3_client_method <==

So CEWE was looking for SSL symbols even though ldd did not list libssl…

I guessed that CEWE was using dlopen to open the ssl library. But which file was opened by dlopen ?

Most likely, the developers who wrote the call to dlopen did not want to handle file names with a .so version (e.g. libssl.so.1.0.2), and added code to open libssl.so directly. This file is provided by the libssl-dev package, which was already installed on my system.
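This guess can be tested without CEWE. Here is a minimal Python sketch, assuming a Linux system: `ctypes.CDLL` loads a shared library at run time (it relies on dlopen on Linux), and looking up an attribute on the returned handle resolves a symbol. The function name `probe_symbol` is made up for this example, and this is of course not the code used by CEWE.

```python
import ctypes


def probe_symbol(library: str, symbol: str) -> str:
    """Load a shared library at run time and try to resolve a symbol.

    Returns "library not found", "symbol missing" or "resolved".
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        # The dynamic loader could not find or load the file.
        return "library not found"
    # ctypes raises AttributeError when the symbol cannot be resolved,
    # which hasattr() turns into False.
    return "resolved" if hasattr(handle, symbol) else "symbol missing"


if __name__ == "__main__":
    # Unversioned name, i.e. the file shipped by a -dev package.
    print(probe_symbol("libssl.so", "SSLv3_client_method"))
```

The result depends on which libssl development package is installed, if any, so no expected output is shown here.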

But wait, CEWE was probably written for Debian stable with an older libssl. I tried libssl1.0-dev… which conflicts with libssl-dev. Oh well, I can live with that for a while…

And that was it ! With libssl1.0-dev installed, CEWE was able to connect to the photo service web site without problems.

So here’s the TL;DR; version. To run CEWE on Debian, run:

sudo apt install libssl1.0-dev

Last but not least, here are some suggestions for CEWE:

  • use libssl1.1, as libssl1.0 is deprecated and will be removed from Debian
  • place the debug checkbox in the “System” widget. This widget was the first I opened when I began troubleshooting. “Service” does not mean much to me. Having this checkbox in both “Service” and “System” widgets would do no harm

All the best

[ Edit: I first blamed CEWE for loading libssl in a non-standard way. libssl is actually loaded by QtNetwork. Depending on the way Qt is built, SSL is either disabled (-no-openssl option), loaded by dlopen (default) or loaded with dynamic linking (-openssl-linked). The way Qt is built is CEWE’s choice. Thanks to Uli Schlachter for the heads-up]

 

ddumont
http://ddumont.wordpress.com/?p=1010