If someone sends you an “HTML” mail from Outlook, even Tidy will run away screaming unless you strip out some of the gunk manually before trying to fix it.
If it’s Quoted-Printable, you have a bit more work to do first [maybe this (web service) or this (sed script)], and probably even more if the original document used a non-Western encoding. The one-liner below is not tested.
sed -e 's/<o:p>/<p>/g' | sed -e 's/<\/o:p>/<\/p>/g' | /usr/local/bin/tidy -c
Broken into two sed invocations for readability’s (hah!) sake…
Of course, it’s all very brute-force, but usually good enough for government work.
A biologist, a statistician, a mathematician and a computer scientist are on
a photo-safari in Africa. As they’re driving along the savannah in their
jeep, they stop and scout the horizon with their binoculars.
The biologist: “Look! A herd of zebras! And there’s a white zebra!
Fantastic! We’ll be famous!”
The statistician: “Hey, calm down, it’s not significant. We only know
there’s one white zebra.”
The mathematician: “Actually, we only know there exists a zebra, which is
white on one side.”
The computer scientist: “Oh, no! A special case!”