[JS] Leo.org Scraping Notes

17 Mar 2017

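I've been wanting to build a small project that looks up the English meaning of German words, ideally with pronunciation too. So I set my sights on the Leo.org API.

Naturally they don't allow CORS, so the front end can't request the data directly; it has to be fetched through a server.

After poking around a bit, I found that the mobile version of the site gets its data by calling an API. Bingo!

Taking a plural word as an example, Briefe, here is the XML: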

http://pda.leo.org/dictQuery/m-vocab/ende/query.xml?tolerMode=nof&lp=ende&lang=de&resultOrder=basic&multiwordShowSingle=on&sectLenMax=16&search=Briefe

The response is an XML file, but to make it easier to work with I used nashwaan/xml-js to convert it into a JS object:

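const request = require('request');
const { xml2js } = require('xml-js');

// Fetch the Leo.org mobile XML for `word` and resolve with it as a JS object.
const enLeoJson = word => (
  new Promise((resolve, reject) => {
    request({
      // Encode the word so umlauts and spaces survive in the query string.
      url: `http://pda.leo.org/dictQuery/m-vocab/ende/query.xml?tolerMode=nof&lp=ende&lang=de&resultOrder=basic&multiwordShowSingle=on&sectLenMax=16&search=${encodeURIComponent(word)}`,
      method: 'GET',
    }, (e, r, b) => {
      // Network errors arrive in `e`; without this check `b` would be undefined.
      if (e) {
        reject(e);
        return;
      }
      try {
        // compact: true gives the data.xml.search._attributes shape used below.
        resolve(xml2js(b, { compact: true, spaces: 2 }));
      } catch (err) {
        // Parsing can throw on a malformed body; turn that into a rejection.
        reject(err);
      }
    });
  })
);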

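Checking whether the word was found

Follow the explanations below alongside the sample XML. In it, the search element at the very top carries a hitcount attribute; if it is zero, nothing was found.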

data.xml.search._attributes.hitcount

The English and German counts are hitWordCntLeft and hitWordCntRight, respectively.
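As a rough sketch of that check: with the compact conversion, attribute values come back as strings, so I would convert hitcount to a number before comparing. The helper name hasHits and the two sample objects are my own stand-ins, not real Leo.org output.

```javascript
// Hypothetical helper: true when the parsed response reports at least one hit.
// Assumes the compact xml-js shape used in this post, where attributes are strings.
function hasHits(data) {
  const attrs = data && data.xml && data.xml.search && data.xml.search._attributes;
  if (!attrs) return false;
  return Number(attrs.hitcount) > 0;
}

// Hand-written stand-ins for parsed responses, not real Leo.org output.
const found = { xml: { search: { _attributes: { hitcount: '3' } } } };
const missing = { xml: { search: { _attributes: { hitcount: '0' } } } };
console.log(hasHits(found), hasHits(missing));
```

Guarding each level also means an unexpected response shape counts as "not found" instead of throwing.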

Section & Entry

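Next, the response contains multiple section and entry elements. A section represents a word class and an entry represents one meaning. I only take the most common usage, which is section[0].entry[0]. Watch out, though: when there is only one section or one entry, it is not an array, so the path becomes section.entry[0], section[0].entry, and so on.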
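One way to sidestep the single-versus-array problem is to normalize everything into an array first. This is only a sketch: toArray is a name I made up, and the sample object is hand-written in the compact shape.

```javascript
// Hypothetical helper: always return an array, whether xml-js produced
// a single object, an array of objects, or nothing at all.
function toArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Hand-written sample in the compact shape: one section holding one entry.
const sectionlist = {
  section: {
    _attributes: { sctTitle: 'Substantive' },
    entry: { _attributes: { langlvl: 'A' } },
  },
};

const sections = toArray(sectionlist.section);
const entries = toArray(sections[0].entry);
console.log(sections.length, entries.length);
```

After normalizing, sections[0] and entries[0] work the same way regardless of how many elements the XML contained.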

Word class (section title):

data.xml.sectionlist.section[0]._attributes.sctTitle

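Noun: Substantive

Each entry has a langlvl attribute with the values A and B. Judging from the words, my guess is b = basic and a = advanced. The part of speech is available on every entry.

Part of speech: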

data.xml.sectionlist.section[0].entry[0].info.category._attributes.type

Noun: noun

Side

Every entry always has two side elements: the first is English and the second is German.

Base form of the word:

data.xml.sectionlist.section[0].entry[0].side[1].words.word._text

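Nouns are automatically converted back to the singular with the article added: Briefe => der Brief

Verbs are automatically converted back to the infinitive: schlaft => schlafen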

Audio

English file name:

data.xml.sectionlist.section[0].entry[0].side[0].ibox.pron._attributes.url

German file name:

data.xml.sectionlist.section[0].entry[0].side[1].ibox.pron._attributes.url

Audio link:

http://pda.leo.org/media/audio/${url}.ogg

Example: das Auto
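Building the link from the url attribute could look like the sketch below. audioLink is a hypothetical name, and 'abc123' is a made-up placeholder, not a real Leo.org file name.

```javascript
// Hypothetical helper: build the .ogg link from a pron url attribute.
// Returns an empty string when the entry has no pronunciation.
function audioLink(url) {
  return url ? `http://pda.leo.org/media/audio/${url}.ogg` : '';
}

// 'abc123' is a placeholder id used only for illustration.
console.log(audioLink('abc123'));
```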

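Putting it together

This is what it ends up looking like. It's ugly and I'll rewrite it in a cleaner way later, but it will do for now: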

// Attribute values are strings, so convert hitcount before comparing.
if (data && Number(data.xml.search._attributes.hitcount) > 0) {
  // A lone section, entry, or word is a plain object rather than an array.
  const first = x => (Array.isArray(x) ? x[0] : x);
  const section = first(data.xml.sectionlist.section);
  const entry = first(section.entry);

  // side[0] is English, side[1] is German.
  const de = first(entry.side[1].words.word)._text;
  const en = first(entry.side[0].words.word)._text;
  // Not every entry has a pronunciation, so fall back to an empty string.
  const deAudio = entry.side[1].ibox.pron ? entry.side[1].ibox.pron._attributes.url : '';
  const enAudio = entry.side[0].ibox.pron ? entry.side[0].ibox.pron._attributes.url : '';
  const type = entry.info.category._attributes.type;
}
