Friday, 5 January 2007

pppd persist not so persist with udev

OK so this is my first official rant.

Background: I have a Dell server which (amongst other things) shares my Internet connection here at home. Essentially, it shares a persistent PPPoE connection with the rest of my network. A great, simple use of Linux, you agree? Well, in the pre-udev world it was!

Up until a couple of weeks ago this machine happily ran Debian; however, I was getting a little tired of Stable's slow release schedule and Testing breaking things with every update. So I decided to switch to Ubuntu 6.10, which I've been using on my test machines at uni and was quite happy with.

Problem: After setting everything up I noticed that whenever my ppp link got disconnected it didn't automatically reconnect. After a quick examination I discovered that the pppd daemon was being killed once it disconnected:
pppd[15890]: No response to 4 echo-requests
pppd[15890]: Serial link appears to be disconnected.
pppd[15890]: Connect time 2436.6 minutes.
pppd[15890]: Sent 410764979 bytes, received 645991881 bytes.
pppd[15890]: Connection terminated.
pppd[15890]: Modem hangup
pppd[15890]: Terminating on signal 15
pppd[15890]: Exit.
Solution: After a LOT of googling and stumbling on other users with similar problems, I found a SUSE Bugzilla report suggesting that the problem was caused by udev. The Novell bug report is here: https://bugzilla.novell.com/show_bug.cgi?id=211936

and a copy (Novell seem to remove public access to bugs once they are solved... "thanks for fixing this bug in our open source software, now we won't tell anyone about the fix") here:
http://lists.opensuse.org/opensuse-bugs/2006-10/msg02573.html

OK, now after looking into this further I found reports that under Debian-based distros one of the udev scripts was broken, specifically /etc/udev/rules.d/85-ifupdown.rules.

A note to users of other distros: this may be a different udev script for you. Perhaps the best way to find it is to:
grep "ifup" /etc/udev/rules.d/*
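
If you want something slightly more thorough than the one-liner, here is a rough sketch of a shell helper that lists every rules file mentioning ifup or ifdown. The function name list_ifupdown_rules and the default directory are just assumptions for illustration; your distro may keep its rules somewhere else.

```shell
#!/bin/sh
# list_ifupdown_rules [DIR]
# Hypothetical helper: print the names of udev rules files that mention
# ifup or ifdown. DIR defaults to /etc/udev/rules.d (an assumption).
list_ifupdown_rules() {
    dir="${1:-/etc/udev/rules.d}"
    # -l prints only matching file names; -E allows the "a|b" alternation
    grep -lE 'ifup|ifdown' "$dir"/* 2>/dev/null
}

list_ifupdown_rules "$@" || echo "no rules mentioning ifup/ifdown found"
```

Pass a directory as the first argument if your rules live elsewhere.
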

My original version of 85-ifupdown.rules contained the following:
# This file causes network devices to be brought up or down as a result
# of hardware being added or removed, including that which isn't ordinarily
# removable.
# See udev(8) for syntax.

SUBSYSTEM!="net", GOTO="net_end"

# Bring devices up and down only if they're marked auto.
# Use start-stop-daemon so we don't wait on dhcp
ACTION=="add", RUN+="/sbin/start-stop-daemon --start --background --pidfile /var/run/network/bogus --startas /sbin/ifup -- --allow auto $env{INTERFACE}"
ACTION=="remove", RUN+="/sbin/start-stop-daemon --start --background --pidfile /var/run/network/bogus --startas /sbin/ifdown -- --allow auto $env{INTERFACE}"

LABEL="net_end"

The first fix (which I found by googling) was to remove the "--" that follows the ifup and ifdown commands. A little explanation of what's going on here: the "--allow auto" option is meant to ensure that the interface named by $env{INTERFACE} only goes up or down if it is marked auto in /etc/network/interfaces, like this: auto eth0, or this: allow-auto eth0. The problem is that the extra "--" breaks this.

Now this is just the start of the problem. The real issue here is that udev is essentially running "ifdown ppp0" whenever my PPPoE connection gets disconnected. As you may know, running ifdown on a pppd-based interface sends a TERM signal to pppd (and so pppd exits without being able to reconnect). The SUSE fix for this is to put a line something like the following in the udev script so that special interfaces (such as ppp) are not brought up or down by udev:
SUBSYSTEM=="net", ENV{INTERFACE}=="ppp*|ippp*|isdn*|plip*|lo*|irda*|dummy*|ipsec*|tun*|tap*|bond*|vlan*|modem*|dsl*", GOTO="net_end"
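
To get a feel for which interface names that rule would skip, the same glob list can be tried in a shell case statement. This is only a rough sketch: shell globs approximate udev's matching rather than reproduce it, and the function name is_skipped_iface is made up for the illustration.

```shell
#!/bin/sh
# is_skipped_iface NAME
# Hypothetical helper: succeeds if NAME matches one of the globs from the
# SUSE rule, i.e. udev would jump straight to net_end for that interface.
is_skipped_iface() {
    case "$1" in
        ppp*|ippp*|isdn*|plip*|lo*|irda*|dummy*|ipsec*|tun*|tap*|bond*|vlan*|modem*|dsl*)
            return 0 ;;   # matched: the ifup/ifdown rules are skipped
        *)
            return 1 ;;   # no match: the ifup/ifdown rules still apply
    esac
}

# Try a handful of example names
for iface in ppp0 eth0 lo wlan0 tun3; do
    if is_skipped_iface "$iface"; then
        echo "$iface: skipped"
    else
        echo "$iface: handled by ifup/ifdown"
    fi
done
```

Note that the patterns are prefix matches, so "lo*" catches anything starting with "lo", not just the loopback device.
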
The SUSE rule makes udev skip the add and remove actions in the script shown earlier for any matching interface. I, however, disagree with this fix and propose what is IMHO a cleaner and more cautious one. Change the two ACTION lines as follows:
ACTION=="add", RUN+="/sbin/start-stop-daemon --start --background --pidfile /var/run/network/bogus --startas /sbin/ifup --allow=hotplug $env{INTERFACE}"
ACTION=="remove", RUN+="/sbin/start-stop-daemon --start --background --pidfile /var/run/network/bogus --startas /sbin/ifdown --allow=hotplug $env{INTERFACE}"

Then put the following line in /etc/network/interfaces for each and every interface that you do want udev to ifup/ifdown:
allow-hotplug eth0
What my solution does is bring back the hotplug class (which I think came from the hotplug days before udev) so that udev will only ifup/ifdown interfaces that are marked as hotplug(able). To me this seems to be the original intent of the --allow=auto option. Why don't we just make ppp0 not auto? Because it is a persistent connection that needs to come up on boot.
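
To make the effect of --allow=hotplug concrete, here is a small sketch of the class check as I understand it: an interface is only acted on if it appears on an allow-hotplug line. The in_class helper and the sample interfaces file are hypothetical, and real ifupdown parsing handles more cases than this.

```shell
#!/bin/sh
# in_class FILE CLASS IFACE
# Hypothetical helper: succeeds if IFACE is listed on an "allow-CLASS" line
# in FILE (an /etc/network/interfaces style file).
in_class() {
    awk -v cls="$2" -v ifc="$3" '
        # "auto" is shorthand for "allow-auto"
        $1 == "auto" { $1 = "allow-auto" }
        $1 == ("allow-" cls) {
            for (i = 2; i <= NF; i++) if ($i == ifc) found = 1
        }
        END { exit !found }
    ' "$1"
}

# Made-up interfaces file: ppp0 comes up at boot, eth0 is left to udev
demo=$(mktemp)
cat > "$demo" <<'EOF'
auto lo ppp0
allow-hotplug eth0
iface eth0 inet dhcp
EOF

for iface in eth0 ppp0; do
    if in_class "$demo" hotplug "$iface"; then
        echo "$iface: ifup/ifdown --allow=hotplug would act on it"
    else
        echo "$iface: ifup/ifdown --allow=hotplug would leave it alone"
    fi
done
rm -f "$demo"
```

With a layout like this, ppp0 stays auto so it still comes up at boot, but the udev-triggered ifdown never touches it.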

Rant: Now I'm very pleased that this is all fixed. BUT I am very pissed off that a crappy bug like this can make its way into a release that has been out for so long! Also, who the hell wrote the udev script and OBVIOUSLY didn't test it?! The thing that gets me is that I'm not doing some really obscure task here; I'm just trying to create a persistent ppp connection! Isn't this one of Linux's niche markets (cheap, simple Internet gateway)?! In Windows I tick a box in my dial-up connection properties and IT JUST WORKS.

If people want Linux to be taken seriously, a bit more professionalism over issues like this is needed. Why the hell an update hasn't been released to fix this simple yet very silly bug is beyond me. Also, when udev replaced hotplug, who on earth tested it? One of Linux's biggest problems is that its testing and quality assurance seems to be "leave it in beta until people stop posting bug reports, then presume it's been tested enough". Even Microsoft has worked out that this just doesn't work. Some sort of system needs to be in place which documents how much (if any) testing has actually been performed on each section of code (and I mean testing where you sit down and test for the sake of testing, not "I'll just use it on one of my production machines for a week and see if it falls over").

OK I feel better now, and I sincerely hope this helps those of you I've seen posting problems related to this. I feel for you!

Update (4/4/2007): Good news! After 4 months of the bug sitting in the Ubuntu bug tracker I have just received word that it has been fixed. Not sure what package and version the patch will be in though...

Fixed rules to only take affect for devices with drivers

** Changed in: ifupdown (Ubuntu)
Sourcepackagename: udev => ifupdown
Status: Unconfirmed => Fix Released


Update (18/11/2009): Just to clarify current behaviour of udev and pppd...
The current behaviour with udev is that interfaces marked auto in /etc/network/interfaces are told to ifdown when udev detects that the connection has terminated. With pppd this causes it to be sent a TERM signal causing pppd to end and breaking the persistent properties of the connection.

Due to this it is best not to mark ppp interfaces auto.

Let the rants begin

I always thought myself too lazy to blog but I've realised there are a few things in life that give me enough motivation. So, here in this blog you'll find my rants and raves about things that piss me off enough for me to come on here and let you know. Considering I'm a graduate Computer Engineer/Scientist (and I'm currently doing a PhD) (un)fortunately you'll probably find most of these rants focused on technology.

I aim for this blog to be not only a place for me to vent frustration but also a place to share the solutions that I hopefully find in the end. Feel free to reply with a comment to any of my posts.

Thanks and good luck everyone!

Alan