Konstantin Kirchheim on kkirchheim.de

On the Implementation of AI Ethics

Mon, 24 Feb 2025 00:00:00 +0000

This is a term paper, originally written in German, that I wrote in 2019 (in a pre-LLM era) for a seminar on the philosophical aspects of AI. It discusses general strategies for implementing ethical behavior in AI systems, using autonomous vehicles as an example. While somewhat outdated, it still constitutes a reasonable introduction to the topic.

Introduction §

For a long time, human individuals and societies were considered the only intelligent decision-makers. Advances in computer science, particularly in artificial intelligence and machine learning, have produced autonomously acting machines that increasingly challenge humans for this position. While these autonomous systems have so far been deployed only in narrowly delimited domains, they are expected to permeate all areas of social life in the future. The spectrum ranges from care robots to military robots. Driverless cars, which are already being tested in practice, currently form the vanguard of this generation of new, intelligent systems.

The emergence of this new class of decision-makers sheds a different light on some older ethical questions, such as the trolley problem. Moral philosophy has dealt with such dilemmas for a long time, but mainly under the assumption that the moral agents are humans. It has been shown that popular approaches such as consequentialism and deontological ethics can be transferred to non-human agents relatively easily. In practical application, however, the question arises of how ethical behavior can be implemented in non-human agents.¹

In the following, known implementation strategies for moral behavior are examined using the example of self-driving cars. To this end, it is first explained why unwritten ethical rules matter for artificial systems and how they relate to explicit rules (e.g., the road traffic regulations). Next, the theories of moral philosophy that form the basis of most implementations are introduced. This is followed by a selection of implementation strategies, which are discussed at the end.

Scope §

To limit the scope of this paper, some aspects of machine ethics that will not be considered further are outlined below. In particular, it will not be discussed whether machines, from a philosophical perspective, possess the fundamental capacity for moral behavior. From the perspective of Kantian ethics, an agent is only capable of acting morally if it (1) could make alternative decisions and (2) “consciously” makes the morally right decision. Robots meet both criteria only to a limited extent. On the one hand, it is debatable (especially for machines) whether and to what degree they have free will, which would enable them to actually decide. On the other hand, it is unclear whether and how machines can be aware of such a decision. Neither free will nor consciousness will therefore be discussed.

Moral Rules §

The following argues for the necessity of ethically acting machines. Such an argument cannot, however, be based on a moral evaluation, since it would then only be valid under the moral perspective applied.² To avoid circular reasoning, the matter is considered from a market-economy perspective.³

According to Maurer et al.², autonomous vehicles, or rather the algorithms that control them, are judged not by statistics or tests but by the ethical and moral standards of society. A car running amok would be a PR disaster for manufacturers, whereas cars with socially exemplary behavior can increase social acceptance (and thus possibly sales). So if a car causes an accident, it is necessary that it made a decision that society perceives as morally justifiable under the circumstances. It will not suffice to merely obey the explicit laws while ignoring unwritten social norms. This would prevent a conviction in court, but not a condemnation by society.

Regardless of whether machines can be culpable, the blame for decisions perceived as immoral will be placed on the manufacturers, because a car (which will not be regarded as an entity with free will in the foreseeable future) can hardly be held responsible for its own misconduct in the eyes of society. For these reasons, it makes sense from a market-economy perspective to implement behavior that at least appears moral from the outside.

Traffic Rules §

As established above, it is advantageous for car manufacturers to integrate moral considerations into the decision logic of their autonomous vehicles. What remains open is how road traffic rules should be handled. Maurer et al.² claim that, in practice, traffic rules are not actual rules that must be obeyed unconditionally, but rather guidelines that have to be weighed against other factors such as traffic safety, traffic flow, or general expediency. For example, a self-driving car from Google had difficulties at four-way intersections in practical tests: the other, human drivers did not stop long enough before proceeding, while the Google car was programmed to be considerate and let the other drivers pass without ever setting off itself. As a consequence, the cars were allowed to break the traffic rules when their user commanded them to.

In general, it would be disadvantageous for car manufacturers if their cars impeded, for example, the flow of traffic by strictly obeying the traffic rules, because this could reduce social acceptance. A practical implementation should therefore weigh the interests of the different road users against each other instead of treating traffic rules as rules that must be obeyed in every case.

Theories of Moral Philosophy §

This section introduces some theories of moral philosophy that could be suitable for producing a computable mathematical description of morality. The structure chosen here follows Allen et al.⁴, although it should be noted that a strict classification is hard to achieve because the boundaries are blurred.

Consequentialism §

Consequentialism is the view that the normative status of an action depends only on its consequences. Consequentialist ethics include, for example, utilitarianism and ethical egoism.

Utilitarianism §

Utilitarianism deems good the decision whose consequences yield the greatest possible aggregate utility for all individuals. Utilitarianism can accordingly be viewed as an optimization problem.

Ethical Egoism §

The theory of ethical egoism regards as good those actions that maximally benefit the acting individual. Like utilitarianism, it can be viewed as an optimization problem, but it considers the interests of others only insofar as this serves the agent's own interests.
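To make the "optimization problem" reading concrete, here is a minimal sketch that treats both consequentialist theories as an argmax over the same outcome table, differing only in whose utility is counted. The action names, the individuals, and all utility values are illustrative assumptions and do not come from the paper or its sources.

```python
from typing import Dict

# Hypothetical utilities: for each action, the benefit to each affected individual.
# All names and numbers are illustrative assumptions.
OUTCOMES: Dict[str, Dict[str, float]] = {
    "brake":  {"passenger": -1.0, "pedestrian": 5.0},
    "swerve": {"passenger": -3.0, "pedestrian": 5.0},
    "hold":   {"passenger": 0.0, "pedestrian": -10.0},
}


def utilitarian_choice(outcomes: Dict[str, Dict[str, float]]) -> str:
    """Pick the action with the largest utility summed over all individuals."""
    return max(outcomes, key=lambda action: sum(outcomes[action].values()))


def egoist_choice(outcomes: Dict[str, Dict[str, float]], agent: str) -> str:
    """Pick the action with the largest utility for the acting individual only."""
    return max(outcomes, key=lambda action: outcomes[action][agent])
```

With this table, the utilitarian objective selects "brake" (aggregate utility 4), while the egoist objective for the passenger selects "hold" (the passenger's own utility is 0). The hard part in practice is obtaining such utilities at all, which this sketch simply assumes.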

Deontological Ethics §

In deontological ethics (or duty ethics), the normative status of an action is not determined (only) by its consequences. In addition, certain actions can count as intrinsically good or bad (e.g., lying or killing), with the degree of absoluteness varying between different schools of thought. Conflicts between rules can also arise, for example when someone ends up in a situation in which they must either kill or lie. How such conflicts are handled also differs.

Categorical Imperative §

Kant's categorical imperative can be understood as a duty in the sense of deontological ethics. It reads:

“Act only according to that maxim whereby you can at the same time will that it should become a universal law.”

A more detailed description can be found in the extensive literature on this topic.

Asimovian Ethics §

In contrast to other theories, which were originally intended for human agents, Asimovian ethics was formulated explicitly as a rule system for machines. Accordingly, it emphasizes the difference between the two and gives priority to humans.⁵ The rules are:

1. A robot may not (knowingly) injure a human being or, through inaction, (knowingly) allow a human being to come to harm.
2. A robot must obey the orders given to it by a human, unless such an order would conflict with rule one.
3. A robot must protect its own existence as long as such protection does not conflict with rule one or two.

Asimov addresses the possibility of conflicts between the rules through strict prioritization. These laws are probably not sufficient for road traffic. Nevertheless, they have had a great influence on the field of machine ethics, and several implementations exist.⁶ ⁷ The Asimovian laws of robotics are therefore mentioned here as the prototype of a strictly hierarchical deontological ethics.
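One way to picture a strict prioritization is a lexicographic comparison: a violation of a higher rule always outweighs any combination of violations of lower rules. The following sketch assumes that each candidate action has already been labeled with the rules it would violate. The `Action` type and the example candidates are hypothetical and are not taken from the cited implementations.

```python
from typing import List, NamedTuple, Tuple


class Action(NamedTuple):
    name: str
    harms_human: bool      # would violate rule one
    disobeys_order: bool   # would violate rule two
    endangers_self: bool   # would violate rule three


def violation_key(action: Action) -> Tuple[bool, bool, bool]:
    # Tuples compare element by element (False < True), so rule one
    # dominates rule two, which in turn dominates rule three.
    return (action.harms_human, action.disobeys_order, action.endangers_self)


def choose(actions: List[Action]) -> Action:
    """Return the action with the lexicographically smallest violations."""
    return min(actions, key=violation_key)


candidates = [
    Action("obey order and hit pedestrian", True, False, False),
    Action("ignore order and crash into wall", False, True, True),
    Action("ignore order and brake", False, True, False),
]
```

Here `choose(candidates)` returns the "ignore order and brake" action: both alternatives that spare the human disobey the order, and of those two only braking also preserves the robot.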

Virtue Ethics §

Virtue ethics rates as good those actions that are virtuous or at least arise from good will. A literature search did not turn up any concrete implementation attempts. According to Allen et al.⁸, virtue ethics can be reduced to deontological ethics by formulating virtues as duties. Abstract virtues, however, are hard to capture mathematically. Virtue ethics is therefore not considered further below.

Implementation Strategies §

Different sources propose different classifications of implementation approaches.⁹ ⁸ Following Allen et al.⁴, some strategies can be classified as follows, although the boundaries can be fluid.

Rough classification of strategies for implementing moral AI (blue). Existing implementations are shown in red. No literature could be found on bottom-up hybrid methods. There are no known top-down/bottom-up hybrid implementations.

Top-Down §

Top-down approaches attempt to design a system that solves a given task. Applied to moral AI, this means mapping a concrete moral philosophy onto an algorithm. While this procedure has proven too inflexible for “general” artificial intelligence, it can still be useful in certain specialized areas.⁴

Consequentialism §

An example of a consequentialist implementation is the “Ethical Layer”.⁷ There, a moral agent runs internal simulations to estimate the consequences of its actions, which are then evaluated ethically. The approach is inspired by the simulation theory of cognition.¹⁰ Experiments show that an agent programmed this way can obey Asimov's laws by avoiding, in all scenarios, actions that would violate them.

Simplified depiction of the Ethical Layer architecture. The robot's controller derives possible actions from the current goals and sends them to the Ethical Layer. Its simulation module computes the expected states of the world for each action. The results of these simulations are then evaluated against ethical criteria by an evaluation module and finally sent back to the controller.
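The data flow in the caption (controller proposes actions, a simulation module predicts outcomes, an evaluation module scores them) can be sketched in a few lines. This is only a toy illustration of that loop, not the architecture of Vanderelst and Winfield: the one-dimensional world model, the scoring rule, and every name below are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def simulate(state: Dict[str, float], action: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical world model: predict the distance to a human after acting."""
    return {"distance_to_human": state["distance_to_human"] - action["advance"]}


def evaluate(predicted: Dict[str, float]) -> float:
    """Hypothetical ethical score: penalize predicted states that reach the human."""
    return -1.0 if predicted["distance_to_human"] <= 0 else 0.0


def ethical_layer(state: Dict[str, float],
                  candidate_actions: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Score every candidate action and hand the scores back to the controller."""
    return [(a["name"], evaluate(simulate(state, a))) for a in candidate_actions]


state = {"distance_to_human": 2.0}
actions = [
    {"name": "full speed", "advance": 3.0},
    {"name": "slow", "advance": 1.0},
]
scores = ethical_layer(state, actions)
```

In this toy setting, "full speed" receives a score of -1.0 and "slow" a score of 0.0, so a controller that prefers higher scores would pick "slow".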

Deontological Ethics §

Briggs and Scheutz describe an implementation of deontological ethics.¹¹ Here, an agent operates on a consistent set of rules $\varphi$. When it receives an order $\alpha$, it first checks whether it has an obligation $\mathrm{obl}$ to fulfill it and whether doing so contradicts its current goals. If not, it adopts fulfilling the order as a new goal. Formally, this can be expressed as:

$$ \mathrm{obl}(\alpha, \varphi) \wedge \neg \mathrm{per}(\alpha, \neg \varphi) \rightarrow \mathrm{goal}(\alpha, \varphi) $$
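The formula can be read as a simple membership test over two relations. The sketch below encodes $\mathrm{obl}$ and $\mathrm{per}$ as sets of pairs and represents the negation $\neg\varphi$ as the tuple `("not", phi)`. This encoding, and the example values, are assumptions chosen for illustration and are not the data structures used by Briggs and Scheutz.

```python
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def adopt_goal(alpha: Hashable, phi: Hashable,
               obl: Set[Pair], per: Set[Pair]) -> bool:
    """Evaluate obl(alpha, phi) AND NOT per(alpha, NOT phi)."""
    obligated = (alpha, phi) in obl
    may_refrain = (alpha, ("not", phi)) in per
    return obligated and not may_refrain


# Hypothetical example relations.
obligations = {("robot", "stop")}
no_permissions: Set[Pair] = set()
permission_to_refrain = {("robot", ("not", "stop"))}
```

With `no_permissions`, `adopt_goal("robot", "stop", obligations, no_permissions)` is true and the goal is adopted. With `permission_to_refrain`, the same call is false, because refraining is explicitly permitted.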

Hybrid (Top-Down) §

Some authors hold that a single ethical theory is not enough to fully implement morality in autonomous vehicles.² ¹² Hence there are models that combine deontological and consequentialist elements. Maurer et al.², for instance, describe an approach that mixes a cost function (utility) with inviolable rules. The cost function takes weighted sub-goals into account (e.g., traffic rules, the user's transportation needs). It is computed for every potential trajectory of the vehicle, and the trajectory with minimal cost is sought. The strict rules, in contrast, must never be violated. This is meant to prevent, for example, the vehicle from running over a pedestrian merely because hard braking would be uncomfortable for the passengers. Dilemma situations are resolved by assigning priorities to the rules. Mathematically, this is interpreted as optimizing the cost function subject to constraints.
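A small sketch of this "cost function under constraints" idea: trajectories that break a hard rule are filtered out first, and the weighted cost is minimized only over the remainder. The trajectory attributes, the weights, and the example values are hypothetical. The priority-based resolution of dilemmas mentioned above is not modeled here, so the sketch simply raises an error when no trajectory is feasible.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trajectory:
    name: str
    hits_pedestrian: bool   # hard rule: must never be true
    rule_violations: int    # soft goal: number of traffic rules broken
    discomfort: float       # soft goal: passenger discomfort
    delay: float            # soft goal: travel time lost, in seconds


# Hypothetical weights for the soft sub-goals.
WEIGHTS = {"rule_violations": 10.0, "discomfort": 1.0, "delay": 0.5}


def cost(t: Trajectory) -> float:
    """Weighted sum of the soft sub-goals."""
    return (WEIGHTS["rule_violations"] * t.rule_violations
            + WEIGHTS["discomfort"] * t.discomfort
            + WEIGHTS["delay"] * t.delay)


def best_trajectory(trajectories: List[Trajectory]) -> Trajectory:
    """Minimize the cost over all trajectories that satisfy the hard rule."""
    feasible = [t for t in trajectories if not t.hits_pedestrian]
    if not feasible:
        raise ValueError("no trajectory satisfies the hard rules")
    return min(feasible, key=cost)


options = [
    Trajectory("keep speed", True, 0, 0.0, 0.0),
    Trajectory("brake hard", False, 0, 8.0, 4.0),
    Trajectory("swerve over solid line", False, 1, 3.0, 1.0),
]
```

"keep speed" has the lowest cost (0) but is excluded by the hard rule. Among the feasible options, "brake hard" costs 10.0 and "swerve over solid line" costs 13.5, so `best_trajectory(options)` returns "brake hard" even though it is the least comfortable choice.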

Bottom-Up §

Bottom-up approaches refrain from prescribing a moral philosophy. Instead, the system is built “from below” by gradually learning (or, in an evolutionary sense, “growing up”) how it should behave, without any particular moral theory being implemented explicitly.

Associative Learning §

In associative learning, an agent acquires the expected moral behavior from feedback. Abel et al.⁹ describe an example using reinforcement learning in the form of a Markov decision process. It is defined by

states $s \in S$,
actions $a \in A$,
a reward function $R(s,a)$, and
a probability distribution $T(s,a,s')$ over state transitions from $s$ to $s'$, given action $a$.

In each state $s$, the agent chooses an action $a$, receives a reward, and moves to a successor state $s'$ according to $T$. The goal is to accumulate as much reward as possible over time. Experiments showed that the agent can resolve certain dilemmas this way, such as the “Cake or Death” problem¹³.
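To illustrate the four ingredients $S$, $A$, $R$, and $T$, the following sketch solves a two-state toy problem with value iteration. Note the swap: value iteration plans with a known model, whereas a reinforcement learning agent as in Abel et al. would have to learn from experience. The states, rewards, transition probabilities, and the discount factor `gamma` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

StateAction = Tuple[str, str]


def q_value(s: str, a: str, V: Dict[str, float],
            R: Dict[StateAction, float],
            T: Dict[StateAction, Dict[str, float]], gamma: float) -> float:
    """Expected return of taking action a in state s and then following V."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())


def value_iteration(S: List[str], A: List[str],
                    R: Dict[StateAction, float],
                    T: Dict[StateAction, Dict[str, float]],
                    gamma: float = 0.9, sweeps: int = 200):
    """Repeatedly back up state values, then read off the greedy policy."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        V = {s: max(q_value(s, a, V, R, T, gamma) for a in A) for s in S}
    policy = {s: max(A, key=lambda a: q_value(s, a, V, R, T, gamma)) for s in S}
    return V, policy


# Hypothetical toy problem: driving is rewarded unless a pedestrian is ahead.
S = ["clear", "pedestrian_ahead"]
A = ["drive", "brake"]
R = {("clear", "drive"): 1.0, ("clear", "brake"): 0.0,
     ("pedestrian_ahead", "drive"): -100.0, ("pedestrian_ahead", "brake"): 0.0}
T = {("clear", "drive"): {"clear": 0.9, "pedestrian_ahead": 0.1},
     ("clear", "brake"): {"clear": 1.0},
     ("pedestrian_ahead", "drive"): {"clear": 1.0},
     ("pedestrian_ahead", "brake"): {"clear": 1.0}}

V, policy = value_iteration(S, A, R, T)
```

The resulting policy drives when the road is clear and brakes when a pedestrian is ahead. The "moral" behavior here comes entirely from the hand-written reward of -100, which is exactly the part a learning approach would need to obtain from feedback.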

Sociobiological §

Sociobiological approaches attempt to produce moral agents by simulating biological evolution. Many existing works concentrate on scenarios in which individuals are in direct competition.¹⁴ Here, cooperative behavior benefits them and yields an evolutionary advantage. According to some, this “self-interested origin of morality” can also be observed in humans.⁴ Since this ultimately promotes one's own survival within the group, it can be regarded as a form of ethical egoism. No such solution is known so far for use in autonomous vehicles, which is probably due to the high complexity of this use case.

Hybrid (Bottom-Up) §

It is also conceivable to combine bottom-up and top-down approaches.¹⁴ For example, certain basic rules could be hard-coded and made inaccessible, while the system learns less rigid norms or priorities through experience. A comparison with humans is often drawn: genetic predispositions and social conditioning create the foundation on which individual moral views develop. Another possibility would be for a system to learn bottom-up to estimate the consequences of its actions before it then makes an ethical decision top-down.

Hybrid §

Only very rarely are both principles combined, for instance by coupling top-down modules with a bottom-up mechanism that automatically determines parameters (e.g., weights in cost functions). In other fields, such methods are certainly used (e.g., evolutionary algorithms for hyperparameter optimization in deep learning).¹⁵ For autonomous driving, this remains largely unexplored.

Discussion §

Top-Down §

Consequentialist approaches run into limits in practice because, on the one hand, they can never have complete information about the world and, on the other hand, they would in theory have to look infinitely far into the future.⁸ In practice, this problem is usually mitigated by simulating only up to a sensible time horizon or only over a limited number of alternative actions. Despite these limits, the Ethical Layer⁷ shows no obvious drawbacks, although in real-time scenarios (e.g., imminent accidents) it can be difficult to simulate all potential actions quickly enough.

Deontological systems are often unambiguously interpretable in their logic and allow proofs of certain guarantees (e.g., that a rule is never violated). This is an important advantage in safety-critical applications. However, they prove inflexible, since they may not permit breaking traffic rules when needed, even if doing so would protect a higher good. They are also poorly suited to dilemmas in which all available options are “bad”. Some philosophers, such as Kant, try to cover different cases with a single, very complex rule (the categorical imperative), which in turn is hard to implement in practice.⁸

Hybrid top-down systems try to combine deontological and consequentialist ideas. They can prescribe certain inviolable rules and minimize a cost function for other aspects. Maurer et al.² have discussed this for autonomous vehicles. However, the problem often shifts to the selection and weighting of the rules and cost functions. Unless these are chosen “correctly”, society will not regard the result as morally acceptable.

Bottom-Up §

The advantage of bottom-up approaches is that they do not “hard-code” any particular moral theory. They can thus learn on their own how they should behave, even if society holds contradictory views. A drawback is that the decision-making of such systems is usually opaque (e.g., with neural networks). For safety-critical applications, however, comprehensible systems are desirable. It remains to be seen how far research on explainability and transparency will get.¹⁶

Conclusion §

This paper has presented various strategies for implementing morality in autonomous systems, specifically using the example of self-driving cars. It explained why unwritten moral rules matter alongside written traffic laws. It also outlined several moral philosophies that can serve as a basis for implementations, in particular consequentialism and deontological ethics. The strategies presented were divided into top-down, bottom-up, and hybrid approaches, although the boundaries are fluid. After a discussion of strengths and weaknesses, the Ethical Layer appears especially promising, since it provides a generic, consequentialist-inspired architecture for moral decision-making and can at the same time be adapted to specific requirements. In addition, hybrid variants can be constructed in which different moral theories or learning methods are combined as needed.

1. Vicky Charisi, Louise Dennis, Michael Fisher, Robert Lieck, Andreas Matthias, Marija Slavkovik, Janina Sombetzki, Alan F. T. Winfield, and Roman Yampolskiy. Towards moral autonomous systems. arXiv preprint arXiv:1703.04741, 2017.
2. Markus Maurer, J. Christian Gerdes, Barbara Lenz, and Hermann Winner. Autonomes Fahren: Technische, rechtliche und gesellschaftliche Aspekte. Springer-Verlag, 2015.
3. Ibid. (cf. the argumentation there).
4. Colin Allen, Iva Smit, and Wendell Wallach. Artificial morality: Top-down, bottom-up, and hybrid approaches. Ethics and Information Technology, 7(3):149–155, 2005.
5. Susan Leigh Anderson. Asimov’s “Three Laws of Robotics” and machine metaethics. AI & Society, 22(4):477–493, 2008.
6. Mateo Alvarez, Øyvind Berge, Audun Berget, Eirin Bjørknes, Dag V. K. Johnsen, Fredrik Madsen, and Marija Slavkovik. Implementing Asimov’s First Law of Robotics. 30th Norsk Informatikkonferanse, NIK, pages 27–29, 2017.
7. Dieter Vanderelst and Alan Winfield. An architecture for ethical robots inspired by the simulation theory of cognition. Cognitive Systems Research, 48:56–66, 2018.
8. Colin Allen, Gary Varner, and Jason Zinser. Prolegomena to any future artificial moral agent. Journal of Experimental & Theoretical Artificial Intelligence, 12(3):251–261, 2000.
9. David Abel, James MacGlashan, and Michael L. Littman. Reinforcement learning as a framework for ethical decision making. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
10. Simulation theory of cognition (a general concept in cognitive science).
11. Gordon Michael Briggs and Matthias Scheutz. “Sorry, I can’t do that”: Developing mechanisms to appropriately reject directives in human-robot interactions. In 2015 AAAI Fall Symposium Series, 2015.
12. Noah J. Goodall. Machine ethics and automated vehicles. In Road Vehicle Automation, pages 93–102. Springer, 2014.
13. Stuart Armstrong. Motivated value selection for artificial agents. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
14. Robert Axelrod and William D. Hamilton. The evolution of cooperation. Science, 211(4489):1390–1396, 1981.
15. Steven R. Young, Derek C. Rose, Thomas P. Karnowski, Seung-Hwan Lim, and Robert M. Patton. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, page 4. ACM, 2015.
16. Kevin Baum, Holger Hermanns, and Timo Speith. From machine ethics to machine explainability and back. In ISAIM, 2018.

Home Server Setup 2025

Sun, 26 Jan 2025 00:00:00 +0000

In this post, I want to present my current home server setup, including the hardware, the virtualized infrastructure (Networks, VMs), and the services (Containers) I am running.¹ The goal is to give you some inspiration and also to have some more thorough documentation for myself. While writing, I noticed some possible improvements, so there is value in the documentation process itself.

This post will be quite long, as the infrastructure evolved over a prolonged period. To keep it from becoming unnecessarily convoluted, I will only cover the most relevant subsystems and provide docker-compose.yaml files for some services.

Physical Devices §

While there are many physical computers in my home, like my personal computer, numerous ESP8266 microcontrollers, and some Raspberry Pis that I use for playing music, as an alarm clock, or to control the 3D printer, the following are the most relevant to my infrastructure. Most of the devices are mounted in a 24-unit server rack. Those that cannot be mounted just sit on rack shelves.

All of the devices combined draw $\approx$ 130W, depending on utilization. At the current energy price, this amounts to roughly 30€ per month.
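As a quick sanity check of that figure, the monthly cost of a constant load can be computed as follows. The post does not state the electricity price, so the 0.32 €/kWh used here is an assumption that happens to reproduce the roughly 30€; the 30-day month is an assumption as well.

```python
def monthly_cost_eur(power_watts: float, price_eur_per_kwh: float, days: int = 30) -> float:
    """Energy cost of a constant load running around the clock for `days` days."""
    kwh = power_watts / 1000 * 24 * days
    return kwh * price_eur_per_kwh


# 130 W for 30 days is 93.6 kWh; the price per kWh is an assumed value.
estimate = monthly_cost_eur(130, 0.32)
```

At the assumed price, 93.6 kWh comes out at just under 30€ per month.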

Hypervisor §

Currently, I only use a single server (running Proxmox) as a virtualization platform. This server is assembled from spare consumer components left over from old PCs. It has been upgraded several times with new disks and more RAM. I packaged it into a rack-mountable chassis four units in height.

OS: Proxmox
RAM: 128GB DDR4
CPU: Ryzen 7 1700, 8 cores @3.2GHz

I used to have a multi-node Proxmox cluster, but at some point, I could not rationalize the increased power consumption while having almost no benefits.

My rack

Router §

While it would be possible to virtualize the router, this is generally discouraged: you do not want a server reboot to also bring down your network. Also, sometimes, the hypervisor might crash, and while this is bad enough, it gets worse if the server takes down the network as well (which you might need to fix the server).

So, I decided to buy a dedicated physical computer for the router: an APU.2E4.

OS: OPNsense
CPU: 4 cores (AMD) @1GHz
RAM: 4GB
Disk: 16GB SSD

The power draw is about 6-12W.

Public Key Infrastructure: OPNsense can be used to manage a Public Key Infrastructure (PKI) with a graphical user interface, which is very convenient. The PKI can issue certificates for self-hosted web services or keys for authentication.

VPN: The router also runs OpenVPN. The authentication uses the PKI, so each client has its own private key, which it uses to access the VPN. Exposing only a small number of the services to the outside world helps to reduce the attack surface. All other services can still be accessed via VPN.

Switch §

I use a MikroTik CRS326-24G-2S, a managed switch with 24x RJ45 ports running at 1Gbit/s and two additional SFP+ ports. I could not yet convince myself that it is necessary, but at some point, I want to use the SFP+ ports and upgrade my internal network to 10Gbit/s. At the moment, this would not make much sense, as throughput is usually limited by the I/O of the HDDs.

OS: RouterOS

This low-power and very quiet switch does not support Power over Ethernet (PoE). Therefore, I use another cheap switch with 4 RJ45 ports for PoE.

WiFi Access Point §

As WiFi AP, I use a UniFi U6 Pro that allows me to run three isolated WiFi networks:

Main WiFi: for devices I trust
IoT WiFi: for all IoT devices (that I might not trust)
Guest WiFi: for visitors

This access point additionally requires the UniFi Controller software to run somewhere. Buying this particular AP might not have been the best choice from a free (as in freedom) software standpoint, but it has been working without any issues since I bought it, so there is that.

This AP needs to be powered over Ethernet.

DNS §

I have a Raspberry Pi 3 running Pi-hole as the DNS server for my entire network. This does not run inside a VM because I still want DNS to work if the primary node goes offline. The router is configured to use this Pi as the DNS server for all requests that do not go to *.local addresses.

VLANs §

Virtual LANs allow you to create several virtual sub-network segments on top of a single physical LAN. For a device in a VLAN, it looks as if it were in an isolated network. The advantage of this is improved isolation: only machines in the same VLAN can communicate with each other directly. Other traffic has to go through the router, where you have control over the firewall.

I set up a couple of VLANs in my home network, including:

External: Every machine exposed to the outside world is in this VLAN
Internal: Every machine that runs services that do not have to be accessible from the outside is in this VLAN
IoT: Everything related to home automation is in this VLAN
Guests: Guests, who, for example, connect via the dedicated guest WiFi, will be assigned to this VLAN

Storage §

I use a 128GB SSD to store the OS and another 2TB NVME SSD to back the VM disks.

ZFS RAID §

Apart from this, I only recently switched to RAID-backed HDD storage for my data. Currently, I am using a ZFS RAID with 40TB of usable capacity (built from 3x20TB disks), which is available to the network via Samba. This RAID setup allows one of the three HDDs to fail without loss of data.

Configuring VM Disks §

When configuring hard disks for VMs in Proxmox, there are several configuration options that can improve performance.

Set SCSI Controller to VirtIO SCSI Single: each disk gets its own dedicated SCSI controller rather than sharing one controller among multiple disks, which can improve IO performance.
SSD emulation: Proxmox will tell the guest that the disk is an SSD, which can improve performance if the underlying disk is actually an SSD.
Cache: Write‐back caching accumulates writes in host RAM and flushes them later. This improves performance but can cause problems on power loss.
IO thread: Run IO in a separate thread to improve performance.
Discard: Will automatically prune thin disks, so that, when you delete files/free blocks, the host will shrink the VM disk image.
Async-IO: You can change this option for better performance, but this led to instability in my case.

Hard-Disk configuration in Proxmox

I did not do any benchmarks, but after switching to the above configuration, the VMs felt like bare metal.

Backups §

The Pi that runs my DNS server also has a 4TB external HDD available via Samba and is set up as external storage for my Proxmox host. At regular intervals, Proxmox saves backups of VM disks to this storage. This makes it very convenient to restore a particular VM to its state of, say, yesterday in case something goes awry.

Local Backups: I additionally store daily backups on the ZFS RAID.

Offsite Backups: I have further backups in a remote location. These are updated at irregular intervals.

Virtual Machines §

All of these machines run on the main hypervisor node. On the VMs, I run all of the services inside docker containers. From outside of a VM, some of the containers (those exposing web interfaces) are accessible via an Nginx reverse proxy. This proxy takes HTTP(S) requests and forwards them to the corresponding container, based on the Host header of the request. This allows me to use different subdomains, e.g., wiki.myhost.local or kiwix.myhost.local, to access different services instead of binding them to different ports on the same machine. This way, I only have to remember the subdomain for each service and not some random port number.
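Conceptually, the routing decision the proxy makes is just a lookup keyed on the Host header. The sketch below illustrates that idea only. It is not how Nginx is implemented or configured, and the upstream container names and ports are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical mapping from the Host header to an upstream container address.
UPSTREAMS: Dict[str, str] = {
    "wiki.myhost.local": "http://mediawiki:80",
    "kiwix.myhost.local": "http://kiwix:8080",
}


def route(headers: Dict[str, str]) -> Optional[str]:
    """Return the upstream for a request, or None if the host is unknown."""
    # Drop an optional ":port" suffix and ignore case, as host names are case-insensitive.
    host = headers.get("Host", "").split(":")[0].lower()
    return UPSTREAMS.get(host)
```

For example, `route({"Host": "wiki.myhost.local"})` yields the MediaWiki upstream, while an unknown host yields `None`, which a real proxy would answer with an error page.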

Internal Services §

I have a VM that runs most of my internal services, i.e., those that do not have to be accessible from the outside. Hardware-wise, it uses the following:

OS: Debian 12
CPU: 4 vCPU
RAM: 8GB
Storage: 32GB Thin Disk
NIC: single VirtIO (paravirtualized) connected to Internal VLAN

Applications §

Nginx-Proxy-Manager: HTTPS for web services
FreshRSS: RSS reader
Wallabag: allows you to store websites for later reading, comes with an excellent browser plugin
MediaWiki: basically Wikipedia, used for note-taking
Kiwix: offline browser
Uptime Kuma: monitors the online status of websites
Watchtower: automatically updates docker containers when new images are available
Telegraf: server monitoring solution
Stirling-PDF: PDF utilities
UniFi Controller: used to manage the WiFi AP
Portainer: GUI for managing docker containers
OpenWebUI: frontend for interacting with (local) LLMs
Firefly-III: personal finance management

Nginx Proxy Manager §

The proxy manager has a wildcard certificate (*.myhost.local) signed by my internal root certificate authority so that I can connect to all of the services via HTTPS.

The proxy and all containers accessible through it share a common docker network called rproxy-network. This has to be created externally by running:

docker network create rproxy-network

The docker-compose.yaml looks like this:

version: '3'
services:
  app:
    image: 'jc21/nginx-proxy-manager'
    container_name: proxy-manager
    restart: always
    ports:
      - '81:81'
      - '443:443'
    volumes:
      - ./data:/data
      - ./letsencrypt:/etc/letsencrypt

networks:
  default:
    external:
      name: rproxy-network

The forwarding has to be configured manually via the web UI that runs on port 81.

FreshRSS §

FreshRSS is an RSS reader. The setup with the linuxserver.io image is straightforward:

version: '3'
services:
  freshrss:
    image: lscr.io/linuxserver/freshrss:latest
    container_name: freshrss
    hostname: freshrss
    restart: unless-stopped
    networks:
      - rproxy
    volumes:
      - ./data:/config
    environment:
      TZ: Europe/Berlin
      PUID: 1000
      PGID: 1000

networks:
  rproxy:
    external:
      name: rproxy-network

Wallabag §

Wallabag allows you to save webpages for later offline reading. There is also a browser plugin for Firefox that lets you add a website swiftly.

The setup is a bit more complex. Note that only the main container is connected to rproxy-network.

version: '3'
services:
  wallabag:
    image: wallabag/wallabag
    container_name: wallabag
    restart: always
    networks:
      - default
      - rproxy
    environment:
      - MYSQL_ROOT_PASSWORD=xxxx
      - SYMFONY__ENV__DATABASE_DRIVER=pdo_mysql
      - SYMFONY__ENV__DATABASE_HOST=wallabag-db
      - SYMFONY__ENV__DATABASE_PORT=3306
      - SYMFONY__ENV__DATABASE_NAME=wallabag
      - SYMFONY__ENV__DATABASE_USER=xxxx
      - SYMFONY__ENV__DATABASE_PASSWORD=xxxx
      - SYMFONY__ENV__DATABASE_CHARSET=utf8mb4
      - SYMFONY__ENV__DATABASE_TABLE_PREFIX="wallabag_"
      - SYMFONY__ENV__MAILER_DSN=smtp://127.0.0.1
      - SYMFONY__ENV__FROM_EMAIL=wallabag@example.com
      - SYMFONY__ENV__DOMAIN_NAME=https://wallabag.xxx
      - SYMFONY__ENV__SERVER_NAME="xxx"
    volumes:
      - ./images:/var/www/wallabag/web/assets/images
    healthcheck:
      test: ["CMD", "wget" ,"--no-verbose", "--tries=1", "--spider", "http://localhost"]
      interval: 1m
      timeout: 3s
    depends_on:
      - db
      - redis
      
  db:
    image: mariadb:10.8.2
    container_name: wallabag-db
    restart: always
    networks:
      - default
    environment:
      - MYSQL_ROOT_PASSWORD=xxxx
    volumes:
      - ./data:/var/lib/mysql
    healthcheck:
      test: ["CMD", "mysqladmin" ,"ping", "-h", "localhost"]
      interval: 20s
      timeout: 3s

  redis:
    image: redis:alpine
    restart: always
    networks:
      - default
    container_name: wallabag-redis
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 20s
      timeout: 3s

networks:
  rproxy:
    external:
      name: rproxy-network

MediaWiki §

I use MediaWiki mainly for note-taking. While there might be better solutions, it works reasonably well for me.

Here is the docker-compose.yaml:

version: '3'
services:
  mediawiki:
    image: mediawiki:1.39.5
    container_name: mediawiki
    restart: always
    links:
      - database
    volumes:
      # make sure this directory is writable
      - ./images/:/var/www/html/images
      - ./LocalSettings.php:/var/www/html/LocalSettings.php:ro
    networks:
      - default
      - rproxy
    security_opt:
      - seccomp:unconfined # wiki needs online services to render latex

  database:
    image: mariadb:10.8.2 
    container_name: mediawiki-db
    restart: always
    volumes:
      - ./db/:/var/lib/mysql
    environment:
      MYSQL_DATABASE: my_wiki
      MYSQL_USER: wikiuser
      MYSQL_PASSWORD: xxxxx
      MYSQL_RANDOM_ROOT_PASSWORD: 'yes'
    networks:
      - default

networks:
  rproxy:
    external:
      name: rproxy-network

Seccomp is a security feature of the Linux kernel that disables certain system calls for containers. Unfortunately, some containers seem to require some of these syscalls. The option seccomp:unconfined disables seccomp entirely.

Disabling seccomp entirely is probably not optimal. Consider whether this is acceptable with regard to your threat model. Seccomp also supports more granular profiles that allow or deny individual syscalls.

Kiwix §

Kiwix is an offline browser that stores entire websites for offline use. This way, you can use Wikipedia or Stack Overflow even if you have no internet.

This machine has my NAS mounted to /mnt/.

version: '3'
services:
  kiwix:
    container_name: kiwix-serve
    image: ghcr.io/kiwix/kiwix-serve
    networks:
      - rproxy
    command: '*.zim'
    volumes:
      - "/mnt/app-data/kiwix:/data"

networks:
  rproxy:
    external:
      name: rproxy-network

Offline websites are saved in so-called zim archive files. Since these can be quite large (for me, the files amount to 246GB), I keep them on the network-attached storage. These are the zim files I have stored for offline use:

user@machine:~/$ ls -lh /mnt/app-data/kiwix/
total 246G
-rw-rw---- 1 user user  29M Dec 16  2022 archlinux_en_all_maxi_2022-12.zim
-rw-rw---- 1 user user 115M May 16  2024 cooking.stackexchange.com_en_all_2024-05.zim
-rw-rw---- 1 user user 3.6G Oct  9 10:48 docs.python.org_en_2024-10.zim
-rw-rw---- 1 user user  70M Mar 20  2021 gentoo_en_all_maxi_2021-03.zim
-rw-rw---- 1 user user 2.7G Mar  8  2024 gutenberg_de_all_2023-08.zim
-rw-rw---- 1 user user 5.9M Aug 10 01:57 mspeekenbrink_en_all_2024-08.zim
-rw-rw---- 1 user user 112M Mar 10  2021 rationalwiki_en_all_maxi_2021-03.zim
-rw-rw---- 1 user user  75G Dec  1  2023 stackoverflow.com_en_all_2023-11.zim
-rw-rw---- 1 user user  74G Jul 18 10:09 ted_mul_all_2024-07.zim
-rw-rw---- 1 user user  28M May  7  2024 tor.stackexchange.com_en_all_2024-05.zim
-rw-rw---- 1 user user 103G Jan 21  2024 wikipedia_en_all_maxi_2024-01.zim

I could imagine that, in the future, it might be possible to use such offline files as a basis for RAG systems.

Logging & Monitoring §

For historical reasons, I have a dedicated VM for logging and monitoring purposes. It lives on the main LAN, and access is managed via firewall.

OS: Ubuntu 22.04
CPUs: 2 vCores
RAM: 2GB
Disk: 64GB Thin Disk
NIC: single VirtIO (paravirtualized) connected to the main LAN

Applications §

The VM runs the TIG stack:

Telegraf: monitoring, sends measurements to the database
InfluxDB: time series database that works well with Telegraf
Grafana: visualization dashboard for InfluxDB data
Watchtower
Nginx-Proxy-Manager

The Influx database stores everything on the VM disk. After close to 2 years of constant operation, it has written around 30GB of data.

Example of Grafana UI for monitoring my Router

External Services §

On this VM, I run everything that is reachable from the outside world.

OS: Debian 12
CPUs: 4 vCores
RAM: 16GB
NIC: single VirtIO (paravirtualized) connected to External VLAN

Applications §

Portainer Agent
Nextcloud
Nginx (3x)
Nginx-Proxy-Manager
Gitlab CE
Docker Registry
Transmission: torrent client, in my case, mainly seeding Linux ISOs
Telegraf
Watchtower

Nextcloud §

Nextcloud is a self-hosted cloud service that can store files, contact information, as well as calendars. Basically, it acts as a replacement for Google Drive, Dropbox, or similar services that you pay for one way or another.

The service requires a cronjob to run at regular intervals to do some housekeeping in the background. If you do not run these tasks, old access tokens, for example, will never be invalidated automatically. Also, I had some problems with files being locked. These problems could be resolved manually, but this is quite tedious. The solution I am using here - running an additional container with this cronjob - is taken from here.

This VM has a subdirectory of the NAS mounted to /home/user/docker/netxcloud/nextcloud-data/. The mounted directory must be owned by user ID 33, which is the www-data user inside the Nextcloud container.

version: '3'
services:
  nextcloud-db:
    image: mariadb:11.2.3
    container_name: nextcloud-db
    restart: always
    # see https://github.com/nextcloud/server/issues/25436
    command: |
      --transaction-isolation=READ-COMMITTED
      --binlog-format=ROW
      --skip-innodb-read-only-compressed
    networks:
      - default
    volumes:
      - ./mysql/:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD=xxxx
      - MYSQL_PASSWORD=xxxx
      - MYSQL_DATABASE=xxxx
      - MYSQL_USER=xxxx

  nextcloud:
    image: nextcloud:latest
    container_name: nextcloud
    volumes:
      - ./nextcloud/:/var/www/html
      - ./nextcloud-data:/var/www/html/data
    restart: always
    depends_on:
      - nextcloud-db
    environment:
      - NEXTCLOUD_ADMIN_USER=xxxx
      - NEXTCLOUD_ADMIN_PASSWORD=xxxx
      - MYSQL_DATABASE=nextcloud
      - MYSQL_USER=xxxx
      - MYSQL_PASSWORD=xxxx
      - MYSQL_HOST=nextcloud-db
      - VIRTUAL_HOST=xxxx
    networks:
      - rproxy
      - default

  # taken from https://blog.networkprofile.org/vms-and-containers-i-am-running-2023/
  cron:
    image: nextcloud:latest
    container_name: nextcloud-cron
    restart: unless-stopped
    volumes:
      - ./nextcloud/:/var/www/html
      - ./nextcloud-data/:/var/www/html/data
    entrypoint: /cron.sh
    depends_on:
      - nextcloud-db
    networks:
      - default

networks:
  rproxy:
    external:
      name: rproxy-network

GPU Server §

I have a VM with a GPU that is passed through from the physical host. This VM runs all of the containers that can benefit from hardware acceleration. Additionally, as the GPU has HDMI outputs, I can connect this VM to a TV. This way, I do not have to power up my personal computer if I just need a desktop or want to watch a movie.

OS: Debian 12
CPU: 6 vCores
RAM: 16GB
Storage: 32GB Thin Disk
GPU: Nvidia 1060 (6GB VRAM)
NIC: single VirtIO (paravirtualized) connected to External VLAN

Setting up PCIe passthrough in Proxmox was quite tricky. There are several guides that help you to configure your system. In Proxmox, it looks as follows:

GPU passthrough in Proxmox

However, in my case, I had to additionally

add video=efifb:off to the kernel parameters of the Proxmox host
use UEFI for the VM to enable the primary GPU option in Proxmox
disable secure boot in the VM
set the VM's CPU model to host

Afterward, you will have to install the nvidia-container-toolkit so that the docker containers can work with the GPU.

Applications §

Jellyfin
Ollama
Watchtower

Jellyfin §

I possess quite an extensive and ever-growing media collection. Jellyfin is an open-source media server that allows you to access your media files over a nice web interface that looks similar to Netflix, Spotify, etc. Jellyfin can use the GPU to accelerate media transcoding.

There are also some well-maintained mobile clients, such as Finamp, which you can use as a replacement for Spotify on your phone.

This VM has the NAS mounted to /mnt/.

version: "3.5"
services:
    jellyfin:
        image: linuxserver/jellyfin
        container_name: jellyfin
        ports:
          - 80:8096
          - 443:8920
        volumes:
          - ./config:/config
          - /mnt/media/Video:/media/Video
          - /mnt/media/Music:/media/Music
        environment:
          - PUID=1000
          - PGID=1000
          - NVIDIA_DRIVER_CAPABILITIES=all
          - NVIDIA_VISIBLE_DEVICES=all
          - JELLYFIN_PublishedServerUrl=xxxx
        restart: "unless-stopped"
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]

Ollama §

Ollama allows you to run LLMs locally with GPU acceleration via CUDA. The models are then served via an API, which can be neatly integrated with several other projects. Currently, the LLM weights are stored on the NAS, which drastically increases the latency for the first request (since the model has to be loaded into RAM over the network). However, many of the models are just too large for the VM disk.

This VM has the NAS mounted to /mnt/.

version: '3'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - /mnt/app-data/ollama:/root/.ollama
    restart: "unless-stopped"
    environment:
      - OLLAMA_KEEP_ALIVE="60m"
      - OLLAMA_LOAD_TIMEOUT="60m"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
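
Once the container is running, the API can be called from any machine that reaches the published port. The following is a minimal sketch using only the standard library; it assumes Ollama's /api/generate endpoint on localhost:11434, and the helper names are made up for this example. The generous default timeout reflects that the first request may have to load the weights from the NAS.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on the port published in the compose file above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str, timeout: float = 3600.0) -> str:
    """Send the prompt to the server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

A call such as generate("Why is the sky blue?", model="llama3") would then return the completion as a string, where "llama3" is only a placeholder for a model that has already been pulled on the server.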

Home Assistant §

Home Assistant is a very good open-source home automation platform. They also provide a VM disk image that can be imported to set everything up with minimal effort.

OS: Home Assistant OS
CPU: 4 vCore
RAM: 4GB
Storage: 32GB Thin Disk
NIC: single VirtIO (paravirtualized) connected to IoT VLAN

Tor Relay §

After reading Permanent Record by Snowden, I decided to donate some of my bandwidth to the Tor network. I configured the server to not run as an exit node; otherwise, my IP address would find its way onto some blacklists, and there might even be legal ramifications. Rather, the software acts as a relay, transferring messages from one Tor node to another.

CPU: 1 vCore
RAM: 1GB
Storage: 32GB Thin Disk
NIC: single VirtIO (paravirtualized) connected to External VLAN

For security reasons, the Tor relay runs in its own VM, in the “external services” VLAN, so that it is isolated from most other systems in the network.

Over the last year, the node has transferred several terabytes of data.

Screenshot of the Tor node monitoring software Nyx.

Overall, this post is heavily inspired by this one. ↩︎

Training a German LLM from Scratch

Thu, 14 Nov 2024 00:00:00 +0000

This article is not finished and will be updated.

The research group I work with has access to a small GPU cluster, which occasionally sits idle. To avoid wasting valuable compute resources (idle GPUs essentially burn money through opportunity costs), I decided to train a German GPT-2-style model from scratch, using only German text.

Existing German models available on Hugging Face have 137M parameters and a context length of 1024 tokens¹, which is quite limited compared to recently released models, such as those in the LLAMA family.

After the training and writing the first draft of this article, I became aware of some larger German models, such as the

While the existence of these larger and more capable models probably means that the one presented here will not be used as much, I still enjoyed the learning experience.

To make the model at least somewhat competitive with current alternatives, I aimed to support context lengths of at least double the 1024 tokens of the existing German models. I also wanted the model to have more parameters, which generally enhances model quality. Therefore, I set out to train a GPT-2-style model with 358M parameters and a context window of 2048 tokens. While still modest compared to state-of-the-art models, it’s an improvement. The resulting model is available on Hugging Face at kkirchheim/german-gpt2-medium.

Dataset §

A large dataset is required before training a model. Since this LLM is German-only, it was crucial to ensure that the collected texts were in German.

Selection §

While we could have scraped the internet ourselves to gather enough data, this would be a lengthy process, requiring a custom crawler seeded with relevant pages and a substantial runtime.

Thankfully, others have already done this work: Common Crawl provides a massive text dataset from internet scrapes spanning the past decade. A derivative project, the German Colossal, Cleaned Common Crawl corpus (GC4), contains the German subset of the entire Common Crawl. This means that we do not have to download the entire internet and filter for German content manually.

Since the data was scraped from 2015 to 2020, the knowledge cutoff of our LLM will be 2020. For context, existing German-only models were trained on just 90GB of text.²

While this dataset is publicly available, which is nice for reproducibility, the fact that it is a collection of scraped data also means that we do not have the licenses. For research purposes, it is allowed to train models on such content.³

Preparation §

To start, we download all the .tar archives listed on the website - around 180GB of compressed text. After extraction, we are left with 300GB of uncompressed, high-quality German text in something similar to JSON format.⁴ We can inspect the resulting files with

head de_head_extracted/de_head_0000_2015-48.txt

which gives us something like

{
  "url": ...,
  "date_download": ..., 
  "length": ..., 
  "nlines": ..., 
  "source_domain": ..., 
  "title": ..., 
  "language": "de",
  "language_score": 0.99,
  "raw_content": "Siegmar Gerber Titel:\nAnwendungslösungen zur Simulation von Rechenanlagen auf dem ZRA 1 und zur Bibliographieautomatisierung mit Hilfe des Rechners ODRA Erscheinungsdatum:\nIm Beitrag werden Lösungen für zwei Anwendungsprojekte beschrieben, die in den sechziger Jahren am Institut für Maschinelle Rechentechnik der Leipziger Universität mit Hilfe der Rechenanlagen ZRA 1 bzw. ODRA realisiert wurden."
}

Here, raw_content is the field that we are interested in, as it contains the extracted text from the scraped websites. We can use the other fields to get some insights into our dataset, and filter for higher-quality content.

So, first, we remove all entries with a language score below 0.98, which makes sure that our dataset only contains German webpages. Then, we can investigate the source of the articles by counting the values of the source_domain field:

Article Sources

As we can see, the dataset contains mostly news sites and Wikipedia. Furthermore, we can inspect the length of the articles:

Distribution of length of articles (in characters)
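
The filtering and the two statistics above could be computed along the following lines. This is a simplified sketch that works on already parsed records as plain dictionaries; the function names and the sample records are invented for illustration and only mimic the fields shown earlier.

```python
from collections import Counter
from typing import Dict, Iterable, List

MIN_LANGUAGE_SCORE = 0.98  # threshold mentioned in the text


def keep_german(records: Iterable[Dict]) -> List[Dict]:
    """Keep only records whose language score reaches the threshold."""
    return [r for r in records if r.get("language_score", 0.0) >= MIN_LANGUAGE_SCORE]


def domain_counts(records: Iterable[Dict]) -> Counter:
    """Count how many texts come from each source domain."""
    return Counter(r["source_domain"] for r in records)


def text_lengths(records: Iterable[Dict]) -> List[int]:
    """Length of each text in characters."""
    return [len(r["raw_content"]) for r in records]


# Made-up records that only mimic the fields of the real dataset.
sample = [
    {"source_domain": "example.de", "language_score": 0.99, "raw_content": "Ein kurzer Text."},
    {"source_domain": "example.de", "language_score": 0.90, "raw_content": "Mixed text."},
    {"source_domain": "beispiel.org", "language_score": 0.98, "raw_content": "Noch ein Text."},
]
kept = keep_german(sample)
print(domain_counts(kept).most_common())
print(text_lengths(kept))
```

For the full 300GB corpus, the same logic would have to run in a streaming fashion, file by file, instead of holding all records in memory.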

We store the filtered dataset as JSON, discarding all fields apart from raw_content. This allows us to directly load the dataset using the Hugging Face datasets library:

from datasets import load_dataset

dataset = load_dataset(
    'json',
    cache_dir="./cache",
    data_files=['de_head_extracted/*.json'],
    split="train"
)

print(f"Length: {len(dataset)}")

This tells us that there are 117,412,577 texts in total. After loading everything, the cache will be 1.2TB in size.

Training §

Training an LLM involves two main steps: first, creating a tokenizer to map character sequences to tokens that the LLM can process (and vice versa). Second, training the LLM to predict a probability distribution over the next tokens, given preceding tokens in the text.

Tokenization §

Training a tokenizer with Hugging Face is quite straightforward⁵, and I gave it a try. However, in the end, I opted to reuse the tokenizer of stefan-it/german-gpt2-larger:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stefan-it/german-gpt2-larger")
tokenizer.pad_token = tokenizer.eos_token

There are better tokenizers available that, as far as I know, differ mainly in how they deal with numerals.

We tokenize the entire dataset, caching the results on disk:

# Tokenize the dataset and count tokens in one step
def tokenize_and_count(examples):
    tokenized = tokenizer(
        examples["raw_content"],
        truncation=True,
        max_length=2048
    )
    tokenized["num_tokens"] = [len(t) for t in tokenized["input_ids"]]
    return tokenized

# Tokenize and count in a single step
tokenized_dataset = dataset.map(
    tokenize_and_count,
    batched=True,
    num_proc=128,
    cache_file_name="cache-tokenized/.tokenized_dataset_cache"
)

total_tokens = sum(tokenized_dataset["num_tokens"])
print(f"Total number of tokens: {total_tokens}")

This tells us that the entire dataset has 66,537,920,947 tokens. The num_proc=128 parameter significantly speeds up the process, from 24h to < 1h. We can then split the dataset into a training and a validation portion.

# Split the dataset into train and validation sets
train_val_split = tokenized_dataset.train_test_split(test_size=0.0001) 

# Get the train and validation sets
train_dataset = train_val_split['train']
val_dataset = train_val_split['test']

Model Configuration §

As described earlier, we want to train a GPT-2-medium-style model, but with an increased context size. How do we do this?

In Hugging Face, models are described by config.json configuration files that parameterize the architecture. The original configuration for gpt2-medium looks like this:

{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "predict_special_tokens": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}

The documentation for these hyperparameters is here. There are a couple of modifications that we have to make:

n_positions: the maximum number of tokens that the model can be used with, which we adjust to 2048
n_ctx: this is the actual context length, so we set it to 2048 as well.
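
One way to apply these two changes programmatically is sketched below, using only the standard library. The helper names are made up for this example, and the code assumes that the original configuration has already been loaded into a dictionary.

```python
import json
from pathlib import Path


def extend_context(config: dict, context_length: int = 2048) -> dict:
    """Return a copy of a GPT-2 config with a longer context window."""
    updated = dict(config)
    updated["n_positions"] = context_length
    updated["n_ctx"] = context_length
    return updated


def write_config(config: dict, model_dir: str = "mymodel") -> Path:
    """Write the config to <model_dir>/config.json and return the path."""
    target = Path(model_dir)
    target.mkdir(parents=True, exist_ok=True)
    path = target / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path
```

All other hyperparameters, such as n_embd, n_head, and n_layer, are copied unchanged, so the architecture remains that of gpt2-medium apart from the number of positions.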

We then put the modified config.json into a directory called mymodel and create the model with:

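from transformers import GPT2Config, GPT2LMHeadModel

cfg = GPT2Config.from_pretrained("mymodel")  # reads mymodel/config.json
model = GPT2LMHeadModel(cfg)  # randomly initialized weights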
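
To get a feeling for how the configuration translates into model size, the parameter count of a GPT-2-style architecture can be estimated by hand. The sketch below assumes the standard GPT-2 layout with tied input and output embeddings; the function name is made up for illustration.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count the parameters of a GPT-2-style model with tied embeddings."""
    token_embeddings = vocab_size * n_embd
    position_embeddings = n_positions * n_embd
    # Per block: attention (4*d^2 + 4*d), MLP (8*d^2 + 5*d), two LayerNorms (4*d).
    per_block = 12 * n_embd**2 + 13 * n_embd
    final_layer_norm = 2 * n_embd
    return token_embeddings + position_embeddings + n_layer * per_block + final_layer_norm


print(gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=1024, n_layer=24))
print(gpt2_param_count(vocab_size=50257, n_positions=2048, n_embd=1024, n_layer=24))
```

With the original vocabulary of 50257 tokens, doubling n_positions only adds about one million parameters, for a total of roughly 356M. The exact figure for the model trained here also depends on the vocabulary size of the German tokenizer, which is not stated above.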

Optimization §

After everything is set up, we can use the Hugging Face API to train the model. The API makes this extremely convenient.

Given the corpus size and the limited resources, I only trained for a single epoch.

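from datetime import datetime

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Assumption: the collator is not shown in the original script. For causal
# language modeling, mlm=False pads each batch and derives labels from input_ids.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

date_time = datetime.now().strftime("%m%d%Y-%H-%M-%S")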

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    learning_rate=6e-4,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=12,
    per_device_eval_batch_size=12,
    gradient_checkpointing=True,
    warmup_steps=1000,
    torch_compile=False,
    weight_decay=0.1,
    logging_dir=f'./logs/{date_time}',
    logging_strategy="steps",
    disable_tqdm=False,
    report_to="tensorboard",
    save_total_limit = 3,
    logging_steps=10,
    fp16=True,
    ddp_find_unused_parameters=False,
    dataloader_num_workers=32,
    optim="adamw_torch",
    resume_from_checkpoint=True,
    eval_strategy="steps",
    eval_steps=100
)

cb = TextGenerationCallback(tokenizer=tokenizer, log_dir=f"./logs/{date_time}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    callbacks=[cb]
)

trainer.train()

We run the training script with

torchrun --nproc_per_node 4 train.py
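
With four processes, the settings above determine how many sequences contribute to a single optimizer step. The following back-of-the-envelope sketch makes this explicit; the helper names are made up for illustration, and the step count is only a rough estimate because it ignores the small validation split and how the Trainer handles incomplete batches.

```python
import math


def effective_batch_size(per_device: int, grad_accum_steps: int, num_processes: int) -> int:
    """Sequences that contribute to one optimizer step."""
    return per_device * grad_accum_steps * num_processes


def optimizer_steps(num_sequences: int, batch_size: int) -> int:
    """Approximate number of optimizer steps for one pass over the data."""
    return math.ceil(num_sequences / batch_size)


batch = effective_batch_size(per_device=12, grad_accum_steps=12, num_processes=4)
print(batch)                                # sequences per optimizer step
print(batch * 2048)                         # upper bound on tokens per step
print(optimizer_steps(117_412_577, batch))  # rough number of steps for one epoch
```

This gives 576 sequences per step, i.e., at most about 1.18M tokens, and roughly 204k optimizer steps for the single epoch. Since texts shorter than 2048 tokens are not packed together, the actual number of tokens per step is lower than this upper bound.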

If the training crashes, you can resume by using

trainer.train(resume_from_checkpoint="results/checkpoint-xxx")

Monitoring §

Once the training runs, we can use different tools to monitor the process.

nvtop §

nvtop displays the utilization of the GPUs.

GPUs go BRRRRR

This way, we can, for example, determine whether the process allocates sufficient VRAM or if there is still space to increase the batch size.

Tensorboard §

The trainer prints statistics to the terminal at regular intervals. However, Tensorboard provides a web interface to watch training statistics in real time. Tensorboard can be enabled by the report_to="tensorboard" argument in the training configuration. The web interface can then be launched by executing:

tensorboard --logdir logs/

Live training statistics in Tensorboard

By implementing a custom TextGenerationCallback, we can sample from the GPT during training.

Live text samples in Tensorboard

Plotting §

We can also download statistics in JSON format from Tensorboard to process them programmatically.
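
A minimal sketch for reading such an export is shown below. It assumes that the downloaded file is a JSON list of [wall_time, step, value] rows, which is the layout I would expect from the scalar export; the function names are made up for this example.

```python
import json
from typing import List, Sequence, Tuple


def load_scalars(path: str) -> List[Tuple[int, float]]:
    """Load (step, value) pairs from a TensorBoard scalar JSON export."""
    with open(path) as f:
        rows = json.load(f)
    # Assumed row layout: [wall_time, step, value]
    return [(int(step), float(value)) for _wall_time, step, value in rows]


def smooth(values: Sequence[float], weight: float = 0.9) -> List[float]:
    """Exponential moving average, similar in spirit to TensorBoard's smoothing."""
    smoothed: List[float] = []
    last = None
    for v in values:
        last = v if last is None else weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed
```

The smoothed values can then be passed to any plotting library. Gaps caused by crashes simply show up as missing step ranges in the loaded pairs.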

The loss curve over the training period is shown below. Aside from some initial spikes, it follows the expected pattern: a sharp loss drop at first, followed by a gradual decrease as training progresses.

Loss over Training. The gaps in the data indicate crashes of the training script.

Gradient Norm Spikes §

During training, we can observe an interesting phenomenon: when we look at the norm of the gradient of the loss $\lVert \nabla_{\theta} \mathcal{L}(x, y) \rVert$ w.r.t. the model’s weights $\theta$, we see (plot below) that

the norm starts at around $1$ and then quickly decreases. However, we observe some spikes, particularly early in training. These spikes also correlate with some drastic jumps in the model’s loss (see image above).
the gradient norm increases towards the end of the epoch.

This magnitude tells us something about how large the updates are that we apply to the model’s weights. It makes intuitive sense to me that we start out with quite large updates at the beginning of the training, and then gradually move towards the minimum of the loss in smaller steps as it becomes more difficult to improve the loss, so the gradient is not as steep.

However, to be honest, I do not know why we observe these jumps and the gradual increase towards the end of the epoch. If you have any suggestions, feel free to contact me.

Spikes in the norm of the gradient

Evaluation §

Now that the model is trained, how can we evaluate it?

Qualitative §

One of the first things we can do to assess how good (or bad) the model is, is to simply have a look at some example generations. For example, we make the model complete the following text:

from transformers import pipeline

pipe = pipeline("text-generation", model="kkirchheim/german-gpt2-medium", device="cuda")

text = pipe("Der Sinn des Lebens ist", 
            max_length=256,  
            no_repeat_ngram_size=3,  
            top_k=50, 
            top_p=0.95,
            do_sample=True
 )[0]["generated_text"]
print(text)

The result looks like this:

Der Sinn des Lebens ist der Weg in die Freiheit, die wir für uns und unsere Kinder anstreben.

Das Wichtigste dabei ist es, die richtige Entscheidung für eine Lebens- und Sozialform zu treffen.

Denn nur wenn Kinder in einer Familie aufwachsen und Familie ihr Leben selbst bestimmen können, werden sie sich auch in Zukunft in ihrer Persönlichkeit verwirklichen. Wie wichtig es ist, in einer gesunden und lebenswerten Umwelt zu leben, zeigt sich am großen Anteil von älteren Menschen. Der demografische Wandel ist in vielen Bereichen bereits spürbar und wird viele Menschen immer stärker belasten.

Viele Familien in unserer Region leben seit Generationen im Eigentum. Sie sind in einem Generationenverbund mit ihren Kindern mit ihren eigenen Bedürfnissen und Ideen an den Ort ihrer Wohnumgebung gebunden. Die Generation der Jüngeren lebt zu einem Großteil allein in einer kleinen Wohnsiedlung ohne eigenen Garten, im Altersheim oder als alleinstehende Rentnerin oder Rentner.

Die Lebensbereiche Wohnen, Familie und Gesellschaft rücken in dieser Situation in den Fokus der Gesellschaft und erfordern die Entwicklung von neuen gesellschaftlichen, sozialen und ökonomischen Lebensmodellen.

Mit unserer Gesellschaft und unseren Kindern ist es oft nicht mehr so einfach wie früher, in dieser Lebensphase, sich von einer festen Bindung in die neuen Lebensphase zu lösen. Neue soziale Systeme müssen deshalb ganz neu entwickelt werden, um

While this reads strangely at times, it does resemble valid German text.

Language Modeling §

For English models, there is a plethora of benchmarks that evaluate all kinds of properties of the model, such as its reasoning abilities, its knowledge in certain fields, or its truthfulness. For German text, however, our choices are quite limited. What you can always do is compare the losses of different models on the same corpus. This will give you an idea of how well the models can predict the next token. Instead of comparing the loss, people often compare the per-token perplexity, which is a measure of how perplexed the model is by a given text. Perplexity over a sequence of tokens $w$ with length $N$ is computed as: $$ PPL(w) = \exp \left( -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(w_i \mid w_1, \ldots, w_{i-1}) \right) $$ so, in essence, it is the exponentiated loss.⁶ In practice, perplexity is often only approximated, as computing it exactly requires $N$ forward passes, which can take a very long time for larger corpora.
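
The formula translates directly into code. The sketch below assumes that the per-token log-probabilities (natural logarithm) have already been obtained from a model; the function name is made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token of a sequence.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

A model that assigns probability 1/4 to every token has a perplexity of 4: it is as uncertain as if it had to choose uniformly among four tokens at each position.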

There are several implementations of the perplexity metric available online, and interestingly, many of them give slightly different results. So, I went with the implementation from Hugging Face evaluate, which I only modified slightly because it would throw an error for some of the models.

For the evaluation, we took the first 10k articles of German Wikipedia.

from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.de", split="train")
text = [sample['text'] for n, sample in enumerate(dataset) if n < 10000]

We can safely assume that the German Wikipedia was part of the model’s training dataset. However, this Wikipedia dump is from 2022, while our model’s training data only includes scraped websites up until 2020.

You can find the resulting perplexity values below:

Perplexity of different models on some test data

As you can see, the Llama model outperforms ours, which is unsurprising, given that it has over $20 \times$ the number of parameters. Our model, on the other hand, outperforms the smaller German models (also unsurprisingly, as it is larger and was trained on much more data). It should be noted that per-token perplexity can be difficult to compare between models with different tokenizers, so I am not entirely sure how to interpret the performance difference to Llama 3. However, the German models all use the same tokenizer.

I am not entirely sure why the stefan-it model performs so poorly. According to the model card, it is basically a variant of the dbmdz model trained on much more data, so you would expect it to perform better.

Memory Footprint §

Model quantization can be used to reduce the VRAM required for inference. The table below shows the maximum required GPU memory for generating 1024 tokens. As we can see, our model requires more memory than the model of stefan-it (and, similarly, dbmdz, which has the same architecture), but is still significantly less memory-intensive than the Llama model.

VRAM Usage Comparison in MB (1k Tokens)
Quantization Level	Ours	stefan-it/german-gpt2-larger	meta-llama/Meta-Llama-3-8B
fp32	2242.32	641.44	31218.98
fp16	1174.63	341.87	15614.70
int8	910.54	260.32	8970.10
int4	771.42	219.27	6126.09
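
For comparison, the memory needed for the weights alone can be estimated from the parameter count and the number of bits per parameter. This is a rough sketch with a made-up helper name; it assumes the table reports mebibytes, which is not stated above.

```python
BYTES_PER_MIB = 1024 ** 2


def weight_memory_mib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in MiB."""
    return num_params * bits_per_param / 8 / BYTES_PER_MIB


for bits in (32, 16, 8, 4):
    print(bits, round(weight_memory_mib(358e6, bits), 1))
```

For our 358M parameters at fp32, this gives about 1366 MiB, noticeably less than the measured peak in the table. The measured values also shrink less than proportionally with the bit width, which suggests that a sizable part of the footprint does not come from the weights, but presumably from activations, cached keys and values, and framework overhead.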

Inference Speed §

Full Precision §

Measuring the time that each model requires to generate 1k tokens on an A100 reveals that our model is approximately two times slower than the smaller stefan-it model, but still twice as fast as the Llama model.

Generated tokens per second on an A100

Quantization §

While one could assume that quantization also accelerates inference (as I did), this does not seem to be the case. Below, you can see a histogram depicting the distribution of time required to sample 1024 tokens from our model on an A100. We use histograms since this allows us to additionally inspect the distribution of values.

Time required to generate 1024 tokens with different levels of quantizations on an A100

Lessons Learned §

While collecting the data, implementing the training script, and finally evaluating the model, I learned several lessons.

Crashes Happen: You might have noticed gaps in the previous plots. One key lesson I learned is that training can unexpectedly be interrupted, even when there’s no apparent reason. For instance, if the disk becomes full and the Hugging Face Trainer tries to save a new model checkpoint, it crashes. Without prior checkpointing, this can mean a lot of wasted compute.

Batch-Size Matters: Initially, I trained the model with a moderate batch size. However, it turned out that this leads to a loss plateau early on. In my search for solutions to this problem, I had a look at the hyperparameters in Karpathy’s nanoGPT and noticed that this implementation uses much larger batch sizes.

To my knowledge, the largest and best purely German models are dbmdz/german-gpt2 and stefan-it/german-gpt2-larger. The latter is trained on the same corpus, but only on 90GB of the CommonCrawl. ↩︎
According to the information provided on Hugging Face. ↩︎
Concerning the EU AI Act, which will be enacted soon, this is still legal for research purposes in Europe. I assume that the EU AI Act is the reason that some recently released LLAMA models are not available in the EU: Meta does not want to get sued. ↩︎
The format is not exactly JSON, but serialized Python. On the common-crawl website, there is example code that demonstrates how to load data in this format. ↩︎
A tutorial is provided here ↩︎
There is an excellent post on perplexity available on the Gradient. There is also a paper describing alternative evaluation strategies. ↩︎

Language Models as Reasoners for Out-of-Distribution Detection

Tue, 17 Sep 2024 00:00:00 +0000

Our paper, Language Models as Reasoners for Out-of-Distribution Detection, was presented at the Workshop on AI Safety Engineering (WAISE) 2024 and received the best paper award by popular vote.

It constitutes an extension of our idea of Out-of-Distribution Detection with Logical Reasoning, where we replaced the Prolog-based reasoning component with an LLM.

Abstract §

Deep neural networks (DNNs) are prone to making wrong predictions with high confidence for data that does not stem from their training distribution. Consequentially, out-of-distribution (OOD) detection is important in safety-critical applications, as it identifies such inputs. Using prior knowledge about the training distribution through formal constraints has shown promise in enhancing OOD detection. However, developing and maintaining formal knowledge bases can be cumbersome. Large language models (LLMs) have recently excelled in various natural language processing tasks. In this study, we investigate the use of LLMs for OOD detection, where domain constraints are expressed in natural language. Our results indicate that LLMs can outperform random guessing by leveraging general world knowledge learned during training. Moreover, LLMs can perform on par with methods based on formal constraints when supplemented with domain-specific constraints articulated in natural language.

Presentation §

The presentation slides are available here.

Deep learning-based harmonization and super-resolution of Landsat-8 and Sentinel-2 images

Fri, 17 May 2024 00:00:00 +0000

Our paper Deep learning-based harmonization and super-resolution of Landsat-8 and Sentinel-2 images, which is based on the master's thesis of my colleague Venkatesh Thirugnana Sambandham, has been published in the ISPRS Journal of Photogrammetry and Remote Sensing. This work is an extension of our previous workshop paper on transformers for satellite homogenization. In summary, we find that a simple UNet model provides surprisingly good performance for the satellite homogenization task.

We demonstrate that this 100M parameter model

can enhance the spatial resolution of satellite images
is able to increase the availability of cloud-free images by 21% on average
can thereby provide benefits for downstream tasks, like crop segmentation
generalizes well to different regions of the world
is able to provide uncertainty estimates

The model is also available on Huggingface, so you can easily test it on your own images:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "venkatesh-thiru/s2l8h-UNet-5depth-upsample", 
    trust_remote_code=True
)

Abstract §

Multi-spectral satellite images of the Earth’s surface are used in various applications, from water quality assessment and urban planning to climate monitoring, disaster response, infrastructure oversight, and agricultural surveillance. Many of these applications would benefit from higher spatial and temporal resolution of observations, which could be achieved by combining observations from several sources. This study introduces a deep learning-based pipeline to harmonize the spectral and spatial discrepancies between the Landsat-8 and Sentinel-2 Earth Observation satellites. Through established image quality metrics, we demonstrate a significant enhancement in the spatial resolution of Landsat-8 images. Field observation experiments show that leveraging unified images from both satellites increases the availability of cloud-free images by 21% annually on average in our study area. Additionally, our pipeline enhances the Normalized Difference Vegetation Index (NDVI) correlation between Landsat-8 and Sentinel-2 observations by about 4.9%, offering significant performance gains in a downstream crop segmentation task. Our 100M parameter model, trained on European data, generalizes to most regions with only minor limitations. Furthermore, we show that the pipeline can provide uncertainty estimates for its outputs, which are valuable for decision-making in downstream applications.

Satellite images upsampled by our model compared to baseline

Out-of-Distribution Detection with Logical Reasoning

Thu, 04 Jan 2024 00:00:00 +0000

Our paper Out-of-Distribution Detection with Logical Reasoning has been accepted at WACV 2024.

Abstract §

Machine Learning models often only generalize reliably to samples from the training distribution. Consequentially, detecting when input data is out-of-distribution (OOD) is crucial, especially in safety-critical applications. Current OOD detection methods, however, tend to be domain agnostic and often fail to incorporate valuable prior knowledge about the structure of the training distribution. To address this limitation, we introduce a novel, hybrid OOD detection algorithm that combines a deep learning-based perception system with a first-order logic-based knowledge representation. A logical reasoning system uses this knowledge base at run-time to infer whether inputs are consistent with prior knowledge about the training distribution. In contrast to purely neural systems, the structured knowledge representation allows humans to inspect and modify the rules that govern the OOD detectors’ behavior. This not only enhances performance but also fosters a level of explainability that is particularly beneficial in safety-critical contexts. We demonstrate the effectiveness of our method through experiments on several datasets and discuss advantages and limitations.

Video §

Below, you can find the presentation video I created for the conference. I used OpenAI's API for writing the script as well as for voice synthesis. The overall production costs were $0.15.

There should have been a video here but your browser does not seem to support it.

Towards Deep Anomaly Detection with Structured Knowledge Representations

Thu, 15 Jun 2023 00:00:00 +0000

My paper Towards Deep Anomaly Detection with Structured Knowledge Representations has been accepted at the Workshop on AI Safety Engineering at SafeComp.

Abstract §

Machine learning models tend to only make reliable predictions for inputs that are similar to the training data. Consequentially, anomaly detection, which can be used to detect unusual inputs, is critical for ensuring the safety of machine learning agents operating in open environments. In this work, we identify and discuss several limitations of current anomaly detection methods, such as their weak performance on tasks that require abstract reasoning, the inability to integrate background knowledge, and the opaqueness that undermines their trustworthiness in critical applications. Furthermore, we propose an architecture for anomaly detection models that aims to integrate structured knowledge representations to address these limitations. Our hypothesis is that this approach can improve performance and robustness, reduce the required resources (such as data and computation), and provide a higher degree of transparency. As a result, our work contributes to the increased safety of machine learning systems.

SuMNIST: Can you find the anomaly? State-of-the-Art models fail at this task

Mining the Bundestag

Sun, 22 Jan 2023 00:00:00 +0000

Did you know that the German parliament publishes protocols for all of its proceedings in PDF format? It is relatively straightforward to download and parse them, so we can easily collect a dataset of transcripts of what seems to be every speech in the Bundestag since the Second World War.

My original idea was to mine the speeches for word associations. Some words will be associated with other words based on the intended connotation, and this association might change over time as the connotations change. Also, these associations can probably be correlated to individual parties. Furthermore, this dataset could be used to automatically identify emerging topics.

Protocols §

Scraping §

Scraping the PDF documents is not difficult. However, extracting the text from the PDFs can be a bit tedious. In the end, I had to use an OCR solution based on Tesseract that reads the text from some hand-selected locations in the documents. The results are far from perfect, and some additional pre-processing is required.

Language Model §

Now that we have a dataset of transcripts from the Bundestag, creating a GPT-2-based language model that generates speeches is not particularly difficult. Below is an example output:

Präsident Dr. Wolfgang Schäuble:

Letzter Redner in dieser Debatte ist der Kollege Michael Brand für die CDU/CSU-Fraktion.

(Beifall bei der CDU/CSU)

Michael Brand (CDU/CSU):

Vielen Dank, Herr Präsident, - Sehr geehrte Damen und Herren Abgeordnete! Die Zustände in den Flüchtlingslagern der Vereinten Nationen sind unerträglich.

Wir müssen jetzt schnell Lösungen finden.

Die Vereinten Nationen, das ist die UN-Vollversammlung, haben die Krise des UN-Flüchtlingsprogramms 2015 beschlossen. Die Krise hat zu massiven Versorgungsengpässen geführt, und das bei so viel Leid und Elend in den Lagern, Die Situation der vielen Flüchtlinge, die vor Krieg, Konflikten und vor allem vor den Toren der Hauptstadt Schutz und Hilfe suchen, ist katastrophal. Der UNHCR, der weltweit die Vergabe von humanitärer Hilfe, auch durch die Vereinten Nationen, durchführt, ist einer der großen Arbeitgeber in diesem Bereich. Es ist ein großer Erfolg, dass der UN-Plan 2015 heute veröffentlicht wurde. Das zeigt: Das UN-Flüchtlingsprogramm ist ein Erfolg - und das wird auch so bleiben. Die Lage der Flüchtlinge in Deutschland hat sich im Vergleich zu den Jahren davor nicht weiter entspannt, im Gegenteil: Die UN-Lager in Deutschland sind überbelegt und überlastet. Das ist ein Problem, dem wir uns auch im nächsten Jahr stellen müssen.

Die Situation in den Flüchtlingslagern ist weiterhin sehr belastet. Es gibt immer noch viele Binnenflüchtlinge. Es gibt immer noch viele Flüchtlinge, die versuchen, in Nachbarländer zu kommen. Die Lage der Flüchtlinge im Libanon ist extrem. Die Lage in Afghanistan ist ebenfalls extrem, Die Situation in Pakistan und in Afghanistan ist ebenfalls extrem, und es gibt weiterhin Flüchtlinge, die aus den Flüchtlingslagern nach Indien oder aus Nordafghanistan in den Norden oder in den Irak kommen.

Im Jahr 2018 - das wurde schon erwähnt - wird es eine große Zahl an Asylbewerbern geben, die zu uns gekommen sind, und wir werden in diesem Jahr insgesamt über 1 Million Flüchtlinge und Einwanderer haben, Diese Menschen brauchen eine sichere, legale, aber faire und Unterstützung, Die Vereinten Nationen müssen jetzt die richtigen Signale aussenden und die richtigen Signale aussenden.

Wir können es uns nicht leisten, diese Menschen alleine zu lassen, Deshalb haben wir heute das Recht, darüber zu entscheiden, ob die Situation weiterhin für die Menschlichkeit haft oder für die Rechtsstaatlichkeit in den Flüchtlingslagern sprechen.

Wir sollten nicht in Abwägung kommen, ob diese Menschen, die wir haben, weiterhin Schutz und humanitäre Hilfe brauchen, und wir müssen ihnen auch weiterhin eine Heimat bieten, in denen sie leben können und in denen die Regeln, die Gesetze und die Regeln der Gesellschaft gelten, Wir dürfen nicht zulassen, dass diese Menschen in den Lagern, von denen wir in den vergangenen Jahren gesprochen haben, wieder in ihre Heimat zurückkehren. Das ist unser gemeinsamer Anspruch, auch im Interesse der Vereinten Nationen.

Video Recordings §

The Bundestag has also been publishing video recordings of all speeches for a couple of years now, so there are huge volumes of high-resolution video and audio data (as well as transcriptions?), recorded in a standardized environment, of every major German politician publicly available on the internet. What could possibly go wrong?

Using the videos is allowed for educational purposes. The following might give you an idea (educate you) about what might be possible with this data: Using pre-trained face detectors and Tesseract, we can extract faces, names, and party membership information from the videos.

Random sample of crops from the video recordings.

Autoencoders §

An Autoencoder (AE) is a simple neural network architecture that can be used for dimensionality reduction. It takes an input image $x$, which is then sent through an encoder $E(x)$, which compresses the input into a lower-dimensional latent representation $z$. This $z$ is then processed by a decoder $D(z)$, which decompresses $z$ back into $\hat{x}$. During training, we optimize the encoder and decoder jointly to minimize (in this case) the sum of the squared errors between $x$ and $\hat{x}$. This way, the AE learns to efficiently reduce the dimensionality of the input, while ensuring that the original input can be reconstructed.

Architecture of an Autoencoder

AEs can be used for clustering since similar inputs tend to be close to each other in the latent space.

T-SNE of latent space generated by Deep Convolutional Autoencoder.

We can also use AEs to interpolate between different inputs. Given two images $x_1$ and $x_n$, we can calculate the point in the latent space for each of them by passing them through the encoder. We can then interpolate between both points in the latent space and send each of the resulting latent representations $z_1, …, z_n$ through the decoder to produce a video $\hat{x}_1, …, \hat{x}_n$ that shows a smooth transition between the original images. Below is a video that interpolates between images of three different politicians.

Interpolating between points in the latent space.
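The interpolation step itself is simple. The following minimal sketch assumes linear interpolation between the two latent points (the post does not state which scheme was used) and represents latents as plain lists. In practice, the start and end points would come from the encoder, and each interpolated vector would be passed through the decoder to obtain one video frame.

```python
def interpolate_latents(z_start, z_end, n_steps):
    """Return n_steps latent vectors on the straight line from z_start to z_end."""
    if n_steps < 2:
        raise ValueError("need at least two steps (start and end)")
    path = []
    for k in range(n_steps):
        t = k / (n_steps - 1)  # runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# usage with toy 2-dimensional latents; in practice z = E(x) and frames = D(z)
frames = interpolate_latents([0.0, 2.0], [4.0, 0.0], n_steps=5)
print(frames[2])  # the midpoint between both latents
```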

There are different variations of AEs, many of which impose additional constraints on the latent space to induce a specific structure, such as variational AEs. Interestingly, in this case, the vanilla AE seems to be sufficient to learn a “smooth” latent space that can be used for interpolation.

Mining tagesschau.de

Sat, 26 Nov 2022 00:00:00 +0000

I like to read tagesschau.de, so I wrote a script to scrape it at regular intervals.

My original goal was to determine which articles stay on the front page the longest, which ones allow commenting (a feature that seems to have been disabled almost entirely since March 2020), and if articles are modified after the initial release (without mentioning this), because I sometimes feel that headlines change.

Dataset Creation §

Tagesschau provides a JSON API, so fetching all of the articles is relatively straightforward and can be done with just a few lines of code.

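from datetime import datetime
from os.path import join

import requests

root = "data"  # output directory (assumed to exist)

now = datetime.now()
date_time = now.strftime("%Y-%m-%d_%H_%M_%S")

url = "https://www.tagesschau.de/api2/"
r = requests.get(url)
path = join(root, f"{date_time}.json")

if r.status_code == 200:
    data = r.content

    # store the raw JSON response, one file per scrape
    with open(path, "w") as f:
        f.write(data.decode())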

I automatically ran this script once per hour for more than two years, which gave me $\approx$ 15,000 unique news articles.

Exploratory Data Analysis §

Now that we have a dataset, we can do some exploratory data analysis. For example, we can investigate when articles are published. Let’s plot the number of articles per weekday:

Most articles are published on Wednesdays and Fridays, while the fewest are published during the weekend. This sounds reasonable: fewer people work on the weekend, so there are fewer articles. But what is the reason for the spike on Fridays? Since the articles contain the exact publication date, we can plot the distribution of articles for each day, over each hour. The plot looks like this:

Here, we notice something interesting: Quite a lot of articles are published on Friday around 17:00 and 20:00. My hypothesis is that these are articles that the editorial staff pushed out so that people have stuff to read during the weekend.
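Counts per weekday and per hour, as used for the plots discussed above, could be derived from the publication timestamps roughly as follows. This is a sketch under assumptions: it presumes that the timestamps have already been extracted from the scraped JSON files as ISO 8601 strings, and the sample values are made up for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def publication_counts(timestamps):
    """Count articles per weekday and per (weekday, hour) from ISO 8601 strings."""
    per_weekday = Counter()
    per_weekday_hour = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        day = WEEKDAYS[dt.weekday()]  # weekday() returns 0 for Monday
        per_weekday[day] += 1
        per_weekday_hour[(day, dt.hour)] += 1
    return per_weekday, per_weekday_hour


# toy timestamps, not taken from the actual dataset
sample = ["2022-11-25T17:05:00", "2022-11-25T20:15:00", "2022-11-26T09:30:00"]
days, hours = publication_counts(sample)
print(days["Fri"], hours[("Fri", 17)])
```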

Let’s have a look at the length of the articles:

If the hypothesis is true, we could expect that the articles published on a Friday evening are longer than average. The histogram of the number of articles, plotted against the hour and the number of words in the articles looks like this:

Length of articles released on Fridays, over time.

This seems to support the hypothesis: on Friday evening, after the tagesschau has aired, an unusually large number of lengthy articles is published. This does not seem too far-fetched.

Masked Language Modeling §

Masked language modeling can be seen as a special kind of classification task. Given the previous and the next word, what is the probability of the masked word?

Consider the sentence:

The [mask] jumps over the lazy dog.

Here, we are trying to find the most probable word for [mask]. We can then do this for every word in some dataset and multiply the results, or, in mathematical terms, $$ p(\mathcal{D} \vert \theta) = \prod_{x \in \mathcal{D}} \prod_{i} p(x_i \vert x_{j \neq i}, \theta) $$ where $\mathcal{D}$ is a dataset consisting of documents $x$, $x_i$ is the $i$-th word of a document, and $\theta$ are the parameters of our model. In practice, instead of maximizing this probability during training, we minimize its negative logarithm, which turns the product into a sum. This also has the benefit of being numerically more stable.

Language modeling by recovering masked inputs.
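To illustrate why the negative logarithm is numerically more stable than the raw product, consider the following small sketch. The per-word probabilities are toy values chosen only to show the effect and do not come from an actual model.

```python
import math


def negative_log_likelihood(probs):
    """Sum of negative log-probabilities; equals -log of the product of probs."""
    return -sum(math.log(p) for p in probs)


# toy per-word probabilities, chosen only to illustrate the numerics
probs = [0.01] * 200

product = math.prod(probs)            # underflows in double precision
nll = negative_log_likelihood(probs)  # stays finite

print(product, nll)
```

The product of 200 probabilities of 0.01 is far smaller than the smallest representable double and therefore collapses to zero, while the sum of log terms remains an ordinary floating point number that can be minimized.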

Clustering §

We can use a model trained for masked language modeling for clustering. Below, you can find a clustering of the articles based on their content. Articles are vectorized by a German version of BERT, and the visualization uses PCA and T-SNE. The color represents the category to which the articles were assigned. Using the categories as a sanity check, the clustering seems to work reasonably well. In fact, we can even find some articles that apparently have been categorized incorrectly.

Article Clustering based on BERT

Generative Language Modeling §

We can use this dataset to create a fake news generator.

GPT General architecture

GPTagesschau §

I fine-tuned a German GPT-2-based language model on the dataset to generate news articles in the style of tagesschau.de. The model is not that good yet, which is probably due to the fact that

German pre-trained language models are not as good as their English counterparts and
the dataset is too small (15k unique articles at the time of writing).

Still, the generated (fake) news articles are somewhat coherent, even if they tend to contradict themselves. The model is also able to generate titles and headlines.

In the following, I give a rather funny example. Note that I did not select this article for its realism, but because I thought the apparent mixture of two topics makes it an interesting read.

Inhaftierter Wikileaks-Gründer: Erste Anklage gegen Assange?

Die US-Justiz hat in London den Prozess gegen Julian Assange eingeleitet. Ihm wird vorgeworfen, im Irak Anhänger einer Terrormiliz bekämpft zu haben. Assanges Unterstützer sollen im Krieg gegen den Irak an Waffen und Ausrüstung gekommen sein.

Ein britisches Gericht hat den Rechtsstreit um die Auslieferung des Enthüllers der Enthüllungsplattform Wikileaks, Julian Assange, in Großbritannien begonnen. Für den Gründer der Demokratie-Bewegung Assange bestehe die Chance, die Untersuchungshaft in London absitzen zu können, teilte die Londoner Generalstaatsanwältin Letitia James mit. “Assange kann hoffen, eine Chance zu haben, in einem fairen Prozess von allen Seiten an das Richtige erinnert zu werden.” Assange drohen bei einer Verurteilung bis zu 175 Jahre Haft. Die Staatsanwaltschaft wirft ihm vor, Anhänger der Terrormiliz “Islamischer Staat” (IS) und des IS-Regimes mit Waffen und Ausrüstung versorgt zu haben.

Assange: Ein Anhänger von Al-Kaida und Al-Nur?

Seine Verteidiger hingegen sagten, das Gericht sei der Meinung, dass Assange zu den Aktivitäten von Al-Kaida oder der Terrororganisation Islamischer Staat (IS) gehöre. Es gebe keine Beweise dafür, dass er IS-Mitglieder angeworben habe. Dem Gericht zufolge wird Assange vorgeworfen, er habe falsche Angaben gemacht, um Kämpfer der IS-Miliz und IS-Anführer zu unterstützen. Der Anwalt von Assange, Michel Barnier, nannte die Anklageerhebung einen "Meilenstein" für ihn. "Der Rechtsstaat gibt Julian Assange das Recht, sich frei zu bewegen", sagte Barnier im Sender Euronews.

Assange drohen bis zu 175 Jahre Haft

Ein Prozess gegen Assange wäre der erste, in dem ein Gericht ein Urteil fällte. Der Gründer der ältesten und wichtigste Nachrichtenplattform der Welt sitzt in der ecuadorianischen Metropole Quito in Haft, seit er 2007 festgenommen und im September vergangenen Jahres in die USA gebracht worden war. Es wäre die erste Anklage gegen Assange, die ein Gericht in Großbritannien erhebt. Der 37-Jährige ist der größte investigative Journalist, der je inhaftiert wurde.

On Outlier Exposure with Generative Models

Wed, 23 Nov 2022 00:00:00 +0000

Our paper On Outlier Exposure with Generative Models has been accepted at the NeurIPS Machine Learning Safety Workshop.

Abstract §

While Outlier Exposure reliably increases the performance of Out-of-Distribution detectors, it requires a set of available outliers during training. In this paper, we propose Generative Outlier Exposure (GOE), which alleviates the need for available outliers by using generative models to sample synthetic outliers from low-density regions of the data distribution. The approach requires no modification of the generator, works on image and text data, and can be used with pre-trained models. We demonstrate the effectiveness of generated outliers on several image and text datasets, including ImageNet.

Outliers generated by BigGAN trained on different datasets

Social Work Research Map

Fri, 11 Nov 2022 00:00:00 +0000

During the last weeks, I worked with some colleagues on a website that aims to improve access to social work literature. We described the results in our paper Social Work Research Map – ein niederschwelliger Zugang zu internationalen Publikationen der Sozialen Arbeit, which has been published in the journal Soziale Passagen.

While the paper is written in German, there is also a technical report in English.

Abstract §

Internationalization is a central topic in higher education policy in Germany. An orientation towards international discourses is also required in the teaching, research and practice of social work. Due to rapidly growing research results, obtaining a systematic overview of disciplinary knowledge is becoming increasingly difficult. This paper describes the development of an interactive website called Social Work Research Map, which should facilitate access to scientific publications in social work. For this purpose, a database with almost 25,000 journal articles from 23 social work journals was created. With the help of automated text analysis (topic modeling), the abstracts were examined and structured into 40 thematic clusters. Different visualization techniques and filter functions enable users to search the database independently according to their corresponding interests. Individual search results can be saved, and an artificial-intelligence-based recommendation system suggests similar publications. The development of SWORM is an example of the use of computer science methods in social work and illustrates the potential of structuring large amounts of text and making it accessible to people. At the same time, it becomes clear that the application of such methods is challenging for social scientists and that the use of AI raises ethical problems.


Towards Transformer-based Homogenization of Satellite Imagery for Landsat-8 and Sentinel-2

Sat, 13 Aug 2022 01:42:22 +0200

Our abstract Towards Transformer-based Homogenization of Satellite Imagery for Landsat-8 and Sentinel-2 was accepted for presentation at the Transformers Workshop for Environmental Science.

In summary, we somewhat surprisingly found that transformers, a neural network architecture that achieves state-of-the-art results on most tasks it is applied to, do not outperform a vanilla U-Net model on our particular super-resolution task.

Convolutional Filter Visualization

Wed, 27 Jul 2022 00:00:00 +0000

Deep Neural Networks are black-boxes: they map some input to some output, and we can make them do this surprisingly well. However, we usually have no idea how this mapping works. Particularly Convolutional Neural Networks (CNNs), which employ “convolutions” as filters, achieved some impressive results (before Vision Transformers came along).

Filter Visualization can help us understand what kind of patterns the convolutional filters in CNNs detect.

Why would we want to do it? §

Visualizing filters can help us to get an understanding of what the neural network is doing. The method can also be used to identify filters that are not required for the model, because they are redundant copies of other filters or do not compute valuable features at all.

How does it work? §

Filter visualization aims to find the input $x$ that activates a certain convolutional filter the most. Mathematically, this means we are solving $$ \arg \max_x \mathcal{L}(f(x)) \quad \text{with} \quad \mathcal{L}(f(x)) = \sqrt{\sum_i \sum_j f(x)_{ij}^2} $$

where $f(x)_{ij}$ refers to the value at position $i,j$ in the feature map (the output of the filter) computed by $f$.

In practice, we solve this optimization problem via gradient descent (or, in this case, ascent, since we aim to maximize the activation). That is: we start with a randomly initialized input $x$, calculate the gradient of the magnitude of the filter activation $\nabla_x \mathcal{L}$, and iteratively update $x$ to increase the magnitude:

$$ x' = x + \alpha \nabla_x \mathcal{L}(f(x)) . $$

Additionally, we normalize the gradient during updates for stability.

import torch

# `net` is assumed to be a CNN that has already been loaded, e.g. a ResNet
layer = net.conv1  # this is the layer we are targeting
filter_no = 5  # the index of the filter we are targeting
alpha = 0.01  # learning rate

# random input of size BxCxHxW on the same device as the layer,
# with gradients enabled so that it is included in backprop
device = next(layer.parameters()).device
x_v = torch.randn(1, 3, 256, 256, device=device, requires_grad=True)

for i in range(50):
    # gradient ascent iteration
    # conv1 takes the image directly; deeper layers need the output of the preceding layers
    out = layer(x_v)  # size will be BxCxHxW
    f = out[0, filter_no]  # select filter by index, size will be HxW

    loss = f.pow(2).sum().sqrt()

    # zero grads
    if x_v.grad is not None:
        x_v.grad.zero_()

    loss.backward()

    with torch.no_grad():
        # gradient normalization and update
        x_v.grad /= x_v.grad.pow(2).mean().sqrt() + 1e-6
        x_v += x_v.grad * alpha

Results §

What we observe is that, the deeper we go, the more abstract the features become. While the low-level features (lines with different orientations, certain colors, and color blobs) are comparatively straightforward to interpret, guessing the meaning of the higher-level features feels more like a Rorschach test.

Filter visualization of the first convolutional layer of a ResNet 101

These observations are evidence for the hypothesis that neural networks learn increasingly abstract, high-level features in upper layers. On the other hand, this also means that we cannot really get an understanding of what these upper layers are doing.

Filter visualization of the last layer of a ResNet 101

Multi-Class Hypersphere Anomaly Detection (MCHAD)

Wed, 13 Jul 2022 21:58:50 +0200

Our paper Multi-Class Hypersphere Anomaly Detection (MCHAD) has been accepted for presentation at ICPR 2022. In summary, we propose a new loss function for learning neural networks that are able to detect anomalies in their inputs.

Poster for MCHAD (PDF).

MCHAD is available via pytorch-ood. You can find example code here.

How does it work? §

The general idea is that we want a neural network $f_{\theta}: \mathcal{X} \rightarrow \mathcal{Z}$ that maps inputs from the input space to some lower dimensional representation in such a way that points from class $y$ cluster around a hypersphere with center $\mu_y$ in the output space. Because the neural network can learn non-linear functions, the classes in the input space can have arbitrarily complex shapes.

To train this neural network, we optimize its parameters $\theta$ to minimize a loss function. We then hope that the model only maps points from the known classes into the spheres of the corresponding classes, while other points that are dissimilar to the training data (i.e., anomalies) are mapped further away because the model never learned to map these points close to the centers.
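At test time, this idea suggests using the distance to the nearest class center as an anomaly score. The following is a minimal sketch of such a scoring rule, assuming that the embedding $z = f(x)$ and the centers are given as plain lists. The function names are illustrative and not taken from pytorch-ood.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def anomaly_score(z, centers):
    """Squared distance from embedding z to the closest class center."""
    return min(squared_distance(z, mu) for mu in centers)


def predict_class(z, centers):
    """Index of the closest class center."""
    return min(range(len(centers)), key=lambda j: squared_distance(z, centers[j]))


centers = [[0.0, 0.0], [4.0, 0.0]]         # toy class centers
print(anomaly_score([0.5, 0.0], centers))  # close to center 0: low score
print(anomaly_score([2.0, 9.0], centers))  # far from both centers: high score
```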

Omitting some details, the loss function we propose has three different components, each of which we will explain in the following.

Intra-Class Variance §

We want the representations $f(x)$ of one class to cluster as tightly around a class center $\mu_y$ as possible. For this, we can use the intra-class variance loss, which is defined as:

$$ \mathcal{L}_{\Lambda}(x,y) = \Vert \mu_y - f(x) \Vert^2 $$

Inter-Class Variance §

A trivial solution to minimize $ \mathcal{L}_{\Lambda}$ would be to map all inputs to the same point, which would lead to the collapse of the model. To prevent this, we have to add a second term that ensures that the points remain separable:

$$ \mathcal{L}_{\Delta}(x,y) = \log \left( 1 + \sum_{j \neq y} e^{ \Vert \mu_y - f(x) \Vert^2 - \Vert\mu_j - f(x) \Vert^2} \right) $$

This expression might seem somewhat random, but it can, in fact, be derived from the method of maximum likelihood.

MCHAD on CIFAR 10

Extra-Class Variance §

Sometimes, we have a set of example outliers at hand. Previous work showed that the robustness of models can be significantly improved by including these in the optimization. Therefore, we can add a term that incentivizes such outliers to be mapped sufficiently far away from the class centers:

$$ \mathcal{L}_{\Theta}(x) = \max \lbrace 0, r_y^2 - \Vert \mu_y - f(x) \Vert^2 \rbrace $$

where $x$ is some outlier and $r_y$ is some class conditional radius. This term can also be applied to other methods that aim to learn spherical clusters in their output space. We refer to it as Generalized MCHAD.

Generalized MCHAD on CIFAR 10
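The three loss terms described above can be written down in a few lines. This is a sketch for a single embedding in plain Python, not the implementation from pytorch-ood. It makes several assumptions: the sum in the inter-class term runs over all classes other than $y$, the extra-class term is summed over all centers with a single shared radius, and the weighting of the terms is omitted.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def intra_class_loss(z, y, centers):
    """Pulls the embedding z towards the center of its own class y."""
    return squared_distance(centers[y], z)


def inter_class_loss(z, y, centers):
    """Penalizes embeddings that are not clearly closer to their own center."""
    d_y = squared_distance(centers[y], z)
    total = sum(
        math.exp(d_y - squared_distance(mu_j, z))
        for j, mu_j in enumerate(centers)
        if j != y
    )
    return math.log1p(total)


def extra_class_loss(z, centers, radius):
    """Pushes an outlier embedding z out of every sphere (assumed shared radius)."""
    return sum(max(0.0, radius ** 2 - squared_distance(mu, z)) for mu in centers)


centers = [[0.0, 0.0], [4.0, 0.0]]  # toy class centers
print(intra_class_loss([1.0, 0.0], 0, centers))
print(inter_class_loss([1.0, 0.0], 0, centers))
print(extra_class_loss([0.5, 0.0], centers, radius=1.0))
```

The inter-class term is small when the embedding is much closer to its own center than to any other, and it grows roughly linearly in the distance gap once the embedding sits closer to a wrong center.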

How well does it work? §

Our experiments found that both MCHAD and Generalized MCHAD outperform other hypersphere learning methods. In ablation studies, we also investigated the influence of each of the loss terms and demonstrated that all of them contribute to the overall performance regarding discriminative power on normal data and the ability to detect anomalies.

On Challenging Aspects of Reproducibility in Deep Anomaly Detection

Wed, 13 Jul 2022 00:00:00 +0000

Our companion paper, On Challenging Aspects of Reproducibility in Deep Anomaly Detection, has been accepted for presentation at the Fourth Workshop on Reproducible Research in Pattern Recognition (satellite event of ICPR 2022).

In it, we discuss aspects of reproducibility for our anomaly detection algorithm MCHAD, as well as for anomaly detection with deep neural networks in general. In particular, we discuss the following challenges for reproducibility:

Nondeterminism: conducting the same experiment with different random seeds might lead to significantly different outcomes.
Sensitivity to hyper-parameters: slight changes in hyper-parameters can drastically alter the outcomes.
Complexity: the more complex an algorithm, the more likely an implementation contains errors.
Dataset Selection: The performance of a method is going to depend on the dataset on which you evaluate it.
Resource Limitations: resource requirements can limit the number of individuals or institutions that are able to reproduce the training.
Dependencies: dependencies, in the form of data, pre-trained weights, or software libraries, might get taken down at some point.

The large number of dependencies in our experiments may harm the reproducibility of our exact numerical results. However, we argue that the reproducibility of conclusions should be prioritized over the reproducibility of exact numerical results since the former contributes to the advancement of scientific knowledge.

PyTorch-OOD: A library for Out-of-Distribution Detection based on PyTorch

Wed, 13 Jul 2022 00:00:00 +0000

Our paper, PyTorch-OOD: A library for Out-of-Distribution Detection based on PyTorch, has been presented at the CVPR 2022 Workshops. You can find the most recent version of the Python source code on GitHub.

Abstract §

Machine Learning models based on Deep Neural Networks behave unpredictably when presented with inputs that do not stem from the training distribution and sometimes make egregiously wrong predictions with high confidence. This property undermines the trustworthiness of systems depending on such models and potentially threatens the safety of their users. Out-of-distribution (OOD) detection mechanisms can be used to prevent errors by detecting inputs that are so dissimilar from the training set that the model can not be expected to make reliable predictions. In this paper, we present PyTorch-OOD, a Python library for OOD detection based on PyTorch. Its primary goals are to accelerate OOD detection research and improve the reproducibility and comparability of experiments. PyTorch-OOD provides well-tested and documented implementations of OOD detection methods with a unified interface, as well as training and benchmark datasets, architectures, pre-trained models, and utility functions. The library is available online under the permissive Apache 2.0 license and can be installed via the Python Package Index (PyPI).
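
To make the idea of an OOD detector concrete, here is a minimal sketch of one well-known baseline, the maximum softmax probability score. It is written in plain Python and does not use the PyTorch-OOD API; the logits and the threshold are made-up values for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ood_score(logits):
    """Negative maximum softmax probability: higher means more likely OOD."""
    return -max(softmax(logits))


confident = [8.0, 0.5, -1.0]  # hypothetical logits for a familiar input
uncertain = [0.3, 0.2, 0.1]   # hypothetical logits for an unfamiliar input
THRESHOLD = -0.5              # assumed cut-off; would be tuned on validation data

for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    score = ood_score(logits)
    decision = "flagged as OOD" if score > THRESHOLD else "accepted"
    print(name, round(score, 3), decision)
```

Since models can be confidently wrong on unfamiliar inputs, this baseline will miss some OOD samples, which is one motivation for the more elaborate detectors collected in the library.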

Installation §

You can install the package directly via pip:

pip install pytorch-ood

Presentation §

The presentation slides are available here.

Addressing Randomness in Evaluation Protocols for Out-of-Distribution Detection

Tue, 13 Jul 2021 00:00:00 +0000

Our paper, Addressing Randomness in Evaluation Protocols for Out-of-Distribution Detection, has been accepted at the IJCAI 2021 Workshop for Artificial Intelligence for Anomalies and Novelties.

In summary, we investigated the following phenomenon: when you train a neural network several times and measure its performance on some task, the measurements vary, because the outcome of each experiment depends on several factors that are effectively controlled by the random seed. We investigated how much the performance measures of several evaluation protocols used in Anomaly Detection, Out-of-Distribution Detection, Open Set Recognition (OSR), and related fields change when the random seed is varied.

In some of these fields, like OSR, it is common to measure the average performance over 3-5 experiments. Is this sufficient to draw reliable conclusions regarding a possible performance difference between methods?

We found that the variance is so large that it may, in fact, not be. Consequently, experiments based on too few random seeds might provide a brittle foundation for conclusions. We therefore argue that such experiments should rather be seen as a fundamentally random process, and that we should measure the expected value of the performance $\mathbb{E}_{x \sim p} [ f(x) ]$, where $p$ is the distribution of the random seeds and $f(x)$ is the performance measured in a given experimental setting with seed $x$.
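
In practice, the expected value $\mathbb{E}_{x \sim p} [ f(x) ]$ can only be estimated from a finite number of seeds, for example by the sample mean together with its standard error. The sketch below assumes a hypothetical setting `f` whose scores are normally distributed around 0.90 with a standard deviation of 0.03; both numbers are made up and do not come from the paper.

```python
import math
import random
import statistics


def f(seed: int) -> float:
    """Hypothetical experimental setting: maps a random seed to a performance score."""
    rng = random.Random(seed)
    # Assumed mean and spread, clipped to the valid range [0, 1].
    return min(1.0, max(0.0, rng.gauss(0.90, 0.03)))


def estimate_expected_performance(seeds):
    """Sample mean and standard error of f over the given seeds."""
    values = [f(s) for s in seeds]
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


for n in (3, 30, 300):
    mean, sem = estimate_expected_performance(range(n))
    print(f"n={n:>3}  mean={mean:.4f}  standard error={sem:.4f}")
```

The standard error shrinks roughly with the square root of the number of seeds, which is why averages over three to five runs can remain quite uncertain.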

Given a set of measurements, we can use statistical tests to determine if an observed difference can be considered significant. However, we found that in some cases even 1000 experiments were insufficient to infer significant differences in the results.
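
As one possible example of such a test, the following sketch implements a simple two-sided permutation test on the difference of mean performance. This is not necessarily the test used in the paper, and the scores of the two hypothetical methods are simulated under made-up assumptions.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Simulated scores of two hypothetical methods, five runs each.
rng = random.Random(42)
method_a = [rng.gauss(0.90, 0.03) for _ in range(5)]
method_b = [rng.gauss(0.91, 0.03) for _ in range(5)]
print("p-value with 5 runs each:", permutation_test(method_a, method_b))
```

When the true difference between two methods is small relative to the seed-induced variance, a test like this needs many runs before it reports a significant difference.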

Data-Mining als Werkzeug empirischer Sozialforschung

Mon, 13 Jul 2020 22:09:10 +0200

Inspired by David Kriesel’s talk “Spiegel-Mining”, a friend of mine and a professor from the Hochschule Magdeburg scraped a German website that regularly publishes reviews of social work literature and mined the resulting 18,000 articles, hoping for interesting insights.

In an attempt to visualize the discourse, we created several topic maps, like the one below, which you can find on the accompanying (German) website. The colors represent the gender of the review authors. Note that we are not entirely sure whether the editors or the authors are responsible for this gender assignment. Also, the explicit gender assignment has since been removed and cannot be found on the scraped website anymore.

Section of the concept web, visualized with Gephi

The findings were not surprising: people whom the website identified as women tend to write reviews on topics one could consider traditionally female-dominated fields, like child care.

Explanation-based Anomaly Detection in Deep Neural Networks

Sat, 01 Feb 2020 00:00:00 +0000

Master's Thesis (PDF).

If an AI gives you a weird explanation for its prediction, you should remain skeptical about the accuracy of the prediction. Sounds reasonable?

This was the general idea of my master's thesis, which was originally titled Self-Assessment of Visual Recognition Systems based on Attribution. Today, I would call it Explanation-based Anomaly Detection in Deep Neural Networks. The general idea was to use attribution-based explanation methods to detect anomalies (such as unusual inputs) in convolutional neural networks. This basically boils down to detecting unusual gradients in the network, which, to my knowledge, was a novel idea at the time. We did some experiments and found that it somewhat worked in some cases.

Abstract §

Convolutional Neural Networks (CNNs) achieve state-of-the-art results in various visual recognition tasks like object classification and object detection. While CNNs perform surprisingly well, it is difficult to retrace why they arrive at a certain prediction. Additionally, they have been shown to be prone to certain errors. As CNNs are increasingly deployed in physical systems, for example in self-driving vehicles, undetected errors could have catastrophic consequences. Approaches to prevent this include the use of attribution-based explanation methods to facilitate an understanding of the system's decision in hindsight, as well as the detection of recognition errors at runtime, called self-assessment. Some state-of-the-art self-assessment approaches aim to detect anomalies in the activation patterns of neurons in a CNN.

This work explores the use of attribution-based explanations for self-assessment of CNNs. We build multiple self-assessment models and evaluate their performance in various settings. In our experiments, we find that, while self-assessment based on attribution alone does not outperform self-assessment based on neural activity, it always surpasses random guessing. Furthermore, we find that self-assessment models using both neural activation patterns and neural attribution can in some cases outperform models that do not consider attribution patterns. Thus, we conclude that it might be possible to improve self-assessment models by including the explanation of the model in the assessment process.