Link to GitHub for code: https://github.com/mirandrom/bourdain-lists/
Update: another HN user, thecsw, ended up doing the same thing; check out their website for much more polished results:
https://sandyuraz.com/blogs/bourdain/
On November 26, HN user gregsadetsky posted a compilation of “Anthony Bourdain’s Lost Li.st’s”, including a table of those that are still lost. A lot of these seem fun and interesting, so I decided to try and find them. Just because they aren’t on the Internet Archive doesn’t mean they haven’t been saved somewhere else. It turns out that somewhere else is Common Crawl. I’ve downloaded and parsed most of these and included links in the table below. Unfortunately, I wasn’t able to find a way to also scrape the embedded images, which seem to point to dead CloudFront URLs not indexed by Common Crawl.
| Title | Date |
|---|---|
| Things I No Longer Have Time or Patience For | 4/28/2016 |
| Nice Views | 3/4/2016 |
| If I Were Trapped on a Desert Island With Only Three TV Series | 3/2/2016 |
| The Film Nobody Ever Made | 2/25/2016 |
| I Want Them Back | 1/23/2016 |
| Objects of Desire | 1/21/2016 |
| David Bowie Related | 1/14/2016 |
| Four Spy Novels by Real Spies and One Not by a Spy | 11/6/2015 |
| Hotel Slut (That’s Me) | 11/7/2015 |
| Steaming Hot Porn | 10/18/2015 |
| 5 Photos on My Phone, Chosen at Random | 10/16/2015 |
| People I’d Like to Be for a Day | 10/15/2015 |
| I’m Hungry and Would Be Very Happy to Eat Any of This Right Now | 10/2/2015 |
| Observations From a Beach | 9/27/2015 |
| Guilty Pleasures | 9/23/2015 |
| Some New York Sandwiches | 9/5/2015 |
| Great Dead Bars of New York | 8/19/2015 |
Searching Common Crawl through their web UI is not a great experience.
Instead, I used cdx_toolkit, which makes it easy to search a range of crawls with `--crawl CC-MAIN-2015,CC-MAIN-2016,CC-MAIN-2017` and see which specific crawls contain which URLs.
```
cdxt -v --cc --crawl CC-MAIN-2015,CC-MAIN-2016,CC-MAIN-2017 iter "https://li.st/Bourdain/*"
...
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-51-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-47-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-43-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2017-39-index
status 200, timestamp 20170923054957, url https://li.st/Bourdain/5-photos-on-my-phone-chosen-at-random-0nsXCpUt69UbZvcMsOh4P3
status 200, timestamp 20170923092731, url https://li.st/Bourdain/caption-the-donald-4ooyAsCr5FSapB1kvlUlpy
status 200, timestamp 20170919134839, url https://li.st/Bourdain/observations-from-a-beach-0X3d0NujKKomnM2HIukguv
...
```
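cdx_toolkit also exposes a Python API, which is handy for scripting this search. The sketch below assumes its `CDXFetcher` interface (with `from_ts`/`to` as described in its docs); the `latest_ok_captures` helper is my own and simply dedupes captures, keeping the newest 200-status snapshot of each URL:

```python
def latest_ok_captures(captures):
    """Keep the newest status-200 capture per URL.

    `captures` is an iterable of dict-like objects with 'url', 'status',
    and 'timestamp' fields, the same fields cdxt prints for each hit.
    Returns {url: newest_timestamp}. CDX timestamps are fixed-width
    YYYYMMDDHHMMSS strings, so plain string comparison sorts them.
    """
    newest = {}
    for cap in captures:
        if cap["status"] != "200":
            continue
        url = cap["url"]
        if url not in newest or cap["timestamp"] > newest[url]:
            newest[url] = cap["timestamp"]
    return newest

if __name__ == "__main__":
    # Assumed cdx_toolkit usage; needs the package installed and network access.
    import cdx_toolkit
    cdx = cdx_toolkit.CDXFetcher(source="cc")
    caps = cdx.iter("li.st/Bourdain/*", from_ts="2017", to="2018")
    for url, ts in sorted(latest_ok_captures(caps).items()):
        print(ts, url)
```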
We can also use cdx_toolkit to download from Common Crawl, much as we searched it.
The `warc` subcommand writes a warc.gz file that we can then process with warcio.
```
cdxt --cc --crawl CC-MAIN-2017 warc "https://li.st/Bourdain/*"
python warc_to_html.py bourdain-000000.extracted.warc.gz parsed_html
```
Some results are duplicates of the same URL, but there are also lists under different URLs that Anthony edited. In “Crimes Against Food” he seems to have had a change of heart, removing the Corn Dog and adding something else in its place.
These aren’t all the lost lists (I only downloaded CC-MAIN-2017), but hopefully the remaining ones are in CC-MAIN-2015 and/or CC-MAIN-2016.
Just be careful not to get your requests blocked by Common Crawl like I did; the defaults in cdx_toolkit seem to be too aggressive.