bourdain-lists

Finding Anthony Bourdain’s Lost Li.st’s in Common Crawl

Code on GitHub: https://github.com/mirandrom/bourdain-lists/

Update: another HN user, thecsw, ended up doing the same thing; check out their website for much more polished results:
https://sandyuraz.com/blogs/bourdain/

Background

On November 26, HN user gregsadetsky posted a compilation of “Anthony Bourdain’s Lost Li.st’s”, including a table of those that are still lost. A lot of these seemed fun and interesting, so I decided to try to find them. Just because they aren’t on the Internet Archive doesn’t mean they haven’t been saved somewhere else, and it turns out that somewhere else is Common Crawl. I’ve downloaded and parsed most of these and included links in the table below. Unfortunately, I wasn’t able to find a way to also scrape the embedded images, which seem to point to dead CloudFront URLs not indexed by Common Crawl.

| Title | Date |
| --- | --- |
| Things I No Longer Have Time or Patience For | 4/28/2016 |
| Nice Views | 3/4/2016 |
| If I Were Trapped on a Desert Island With Only Three TV Series | 3/2/2016 |
| The Film Nobody Ever Made | 2/25/2016 |
| I Want Them Back | 1/23/2016 |
| Objects of Desire | 1/21/2016 |
| David Bowie Related | 1/14/2016 |
| Four Spy Novels by Real Spies and One Not by a Spy | 11/6/2015 |
| Hotel Slut (That’s Me) | 11/7/2015 |
| Steaming Hot Porn | 10/18/2015 |
| 5 Photos on My Phone, Chosen at Random | 10/16/2015 |
| People I’d Like to Be for a Day | 10/15/2015 |
| I’m Hungry and Would Be Very Happy to Eat Any of This Right Now | 10/2/2015 |
| Observations From a Beach | 9/27/2015 |
| Guilty Pleasures | 9/23/2015 |
| Some New York Sandwiches | 9/5/2015 |
| Great Dead Bars of New York | 8/19/2015 |

Finding the lost lists on Common Crawl

Searching Common Crawl through their web UI is not a great experience.

Instead I used cdx_toolkit, which lets us search a range of crawls at once with --crawl CC-MAIN-2015,CC-MAIN-2016,CC-MAIN-2017 and see which specific crawls contain which URLs.

cdxt -v --cc --crawl CC-MAIN-2015,CC-MAIN-2016,CC-MAIN-2017 iter "https://li.st/Bourdain/*"

...

INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-51-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-47-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-43-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-39-index
status 200, timestamp 20170923054957, url https://li.st/Bourdain/5-photos-on-my-phone-chosen-at-random-0nsXCpUt69UbZvcMsOh4P3
status 200, timestamp 20170923092731, url https://li.st/Bourdain/caption-the-donald-4ooyAsCr5FSapB1kvlUlpy
status 200, timestamp 20170919134839, url https://li.st/Bourdain/observations-from-a-beach-0X3d0NujKKomnM2HIukguv
...
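To keep track of which captures exist, the `status ..., timestamp ..., url ...` lines that `cdxt iter` prints can be parsed into structured records. This is a small stdlib sketch; the regex below simply matches the log format shown above:

```python
import re

# Matches lines like:
# status 200, timestamp 20170923054957, url https://li.st/Bourdain/...
CAPTURE_RE = re.compile(
    r"status (?P<status>\d+), timestamp (?P<timestamp>\d{14}), url (?P<url>\S+)"
)

def parse_captures(lines):
    """Extract (status, timestamp, url) tuples from cdxt iter output,
    skipping the INFO logging lines."""
    captures = []
    for line in lines:
        m = CAPTURE_RE.search(line)
        if m:
            captures.append((int(m["status"]), m["timestamp"], m["url"]))
    return captures
```

From here you can filter to status 200 and feed the URLs into the download step below.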

Downloading and parsing the lost lists from Common Crawl

We can also use cdx_toolkit to download from Common Crawl, similarly to how we searched it. This will create a warc.gz file that we can process with warcio.

cdxt --cc --crawl CC-MAIN-2017 warc "https://li.st/Bourdain/*" 
python warc_to_html.py bourdain-000000.extracted.warc.gz parsed_html
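For reference, here is a rough stdlib-only sketch of what a script like warc_to_html.py has to do; the actual script uses warcio, which handles the format far more robustly. This simplified version splits records on the WARC version line and ignores Content-Length and transfer encodings:

```python
import gzip

def load_warc(path):
    """Read a .warc.gz file into memory. Real WARCs can be large, so a
    streaming parser like warcio's ArchiveIterator is preferable in practice."""
    with gzip.open(path, "rb") as f:
        return f.read()

def extract_html_records(warc_bytes):
    """Return (target_uri, html_body) pairs for 'response' records.
    Simplified for illustration: ignores Content-Length and chunked encoding."""
    results = []
    for chunk in warc_bytes.split(b"WARC/1.0\r\n")[1:]:
        head, _, payload = chunk.partition(b"\r\n\r\n")
        headers = dict(
            line.split(b": ", 1)
            for line in head.split(b"\r\n")
            if b": " in line
        )
        if headers.get(b"WARC-Type") != b"response":
            continue
        uri = headers.get(b"WARC-Target-URI", b"").decode()
        # The payload is a full HTTP response; drop its header block too.
        _, _, body = payload.partition(b"\r\n\r\n")
        results.append((uri, body.decode("utf-8", "replace").strip()))
    return results
```

Each extracted body can then be written out as one file per list under parsed_html.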

There are some duplicate captures of the same URL, but there are also lists under different URLs because Anthony edited them. In “Crimes Against Food” he seems to have had a change of heart between revisions, removing the Corn Dog and adding a different entry in its place.
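One way to handle the exact-URL duplicates is to keep only the most recent capture per URL; the 14-digit CDX timestamps sort lexicographically, so a simple per-URL maximum works. A sketch:

```python
def latest_per_url(captures):
    """captures: iterable of (timestamp, url) pairs, with timestamps
    like '20170923054957'. Returns {url: latest_timestamp}."""
    latest = {}
    for ts, url in captures:
        # 14-digit YYYYMMDDhhmmss timestamps compare correctly as strings.
        if url not in latest or ts > latest[url]:
            latest[url] = ts
    return latest
```

Edited lists with different URL slugs will still come through as separate entries, which is what we want here.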

These aren’t all the lost lists (I only downloaded CC-MAIN-2017), but hopefully the remaining ones are in CC-MAIN-2015 and/or CC-MAIN-2016. Just be careful not to get your requests blocked by Common Crawl like I did; the default request rate in cdx_toolkit seems to be too aggressive.
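If you do get blocked, one option is to wrap your own index queries in an exponential backoff instead of relying on the library defaults. This is a generic sketch, not part of cdx_toolkit; the base and cap values are arbitrary:

```python
import time

def backoff_delays(attempts, base=2.0, cap=60.0):
    """Exponential backoff schedule: 2s, 4s, 8s, ... capped at 60s."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def polite_call(fn, attempts=5):
    """Call fn(), sleeping between retries on failure.
    Hypothetical helper: fn would be whatever makes one index request."""
    for delay in backoff_delays(attempts):
        try:
            return fn()
        except Exception:
            time.sleep(delay)
    raise RuntimeError("all retries failed")
```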