Data

The project can download evaluation datasets, thumbnails, and YouTube videos, and it can scrape images from Google.

Datasets

src.data.datasets.download_imdb_faces_dataset(path: str = 'data/datasets/imdb-faces')[source]

Downloads the IMDb-Faces dataset and parses an information.csv. Details about the dataset can be found here: https://github.com/fwang91/IMDb-Face.

!!! Many links are outdated. Only half of the dataset can still be downloaded. !!!

Parameters

path (str) – Path where the videos and information.csv should be saved.

src.data.datasets.download_imdb_wiki_dataset(path: str = 'data/datasets/imdb-wiki')[source]

Downloads the IMDb-Wiki dataset and parses an information.csv. Details about the dataset can be found here: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/.

Parameters

path (str) – Path where the videos and information.csv should be saved

src.data.datasets.download_seqamlab_dataset(path: str = 'data/datasets/ytcelebrity')[source]

Downloads the YouTube Celebrities Face Tracking and Recognition Dataset and parses an information.csv. Details about the dataset can be found here: http://seqamlab.com/youtube-celebrities-face-tracking-and-recognition-dataset/.

Parameters

path (str) – Path where the videos and information.csv should be saved.

src.data.datasets.download_youtube_faces_db(path: str = 'data/datasets/youtube-faces-db', download: bool = False)[source]

Downloads the YouTube Faces Database and parses an information.csv. Details about the dataset can be found here: https://www.cs.tau.ac.il/~wolf/ytfaces/.

Parameters
  • path (str) – Path where the videos are located and where the information.csv should be saved.

  • download (bool) – Whether the dataset should be downloaded automatically or only parsed. The download can take a long time.

Thumbnails

src.data.knowledge_graphs.download_dbpedia_thumbnails(path: str = 'data/thumbnails/dbpedia_thumbnails', query_links: bool = True, download: bool = True)[source]

Queries the thumbnail links from DBpedia and saves them in the file path/Thumbnails_links.csv.

Downloads the DBpedia thumbnails and stores them in the following structure:

<Entity1>
    <Thumbnail1>
<Entity2>
    <Thumbnail1>

Saves a summary of the results in path/download_results.csv and the images in path/thumbnails.

Parameters
  • path (str) – Path where the thumbnails should be saved.

  • query_links (bool) – Whether to query the thumbnail links.

  • download (bool) – Whether to download the thumbnails.

src.data.knowledge_graphs.download_entity_list(path: str = 'data/thumbnails', entity_list: Optional[list] = None)[source]

Downloads the thumbnails for a specific list of entities from Wikidata.

Parameters
  • path (str) – Path where the thumbnails are stored.

  • entity_list (list) – List of entities whose thumbnails should be downloaded.

Returns

A list of the entities that are still missing.

Return type

sm (list)

src.data.knowledge_graphs.download_images(path, method='wikidata')[source]

Downloads entity thumbnails

Parameters
  • path (str) – Path where the thumbnails should be stored and the Thumbnails_links.csv is located.

  • method (str) – Source knowledge graph.

src.data.knowledge_graphs.download_missing_thumbnails(path: str = './videos/ytcelebrity', path_thumbnails: str = 'data/thumbnails')[source]

Compares a list of entities with the ones in a dataset and downloads missing ones.

Parameters
  • path (str) – Path where the information.csv of the dataset is saved.

  • path_thumbnails (str) – Path where the Thumbnails_links.csv is saved.

Returns

List of missing entities that have been found.

Return type

missing_entities (list)
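
The comparison step described above can be pictured as a set difference. The following is a minimal sketch, not the project's implementation: `find_missing_entities` is a hypothetical helper, and the `name` column of information.csv is an assumption made for illustration.

```python
import csv
import io


def find_missing_entities(information_csv, known_entities):
    """Return dataset entities without a known thumbnail, in file order.

    `information_csv` is any text file object. The `name` column is an
    assumption of this sketch, not a documented part of information.csv.
    """
    # Compare case-insensitively so "Alan Turing" matches "alan turing".
    known = {entity.strip().lower() for entity in known_entities}
    missing = []
    seen = set()
    for row in csv.DictReader(information_csv):
        entity = row["name"].strip()
        key = entity.lower()
        # Report each missing entity once, even if it appears in many rows.
        if key not in known and key not in seen:
            seen.add(key)
            missing.append(entity)
    return missing


sample = io.StringIO(
    "name,file\nAda Lovelace,a.mp4\nAlan Turing,b.mp4\nAda Lovelace,c.mp4\n"
)
print(find_missing_entities(sample, ["alan turing"]))
```

Keeping the result in file order, rather than returning a raw set, makes repeated runs easier to compare.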

src.data.knowledge_graphs.download_thumbnail(index: int, i_thumbnail_url: str, i_path: str, i_file_name: str)[source]

Downloads a single thumbnail

Parameters
  • index (int) – The index of the downloaded thumbnail taken from the thumbnail urls dataframe

  • i_thumbnail_url (str) – The url of the downloaded thumbnail

  • i_path (str) – The download path

  • i_file_name (str) – The file name

Returns

A list containing the index, the thumbnail url and the result outcome (success, HTTPError or UnicodeEncodeError)

Return type

output (list)
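
To illustrate the documented return value, here is a hedged sketch of a function that downloads one file and reports the index, the URL and the outcome. It is not the code in src.data.knowledge_graphs: the name `download_thumbnail_sketch`, the injectable `fetch` argument and the two fake fetchers are assumptions introduced so the example runs without network access.

```python
import os
import tempfile
import urllib.error
import urllib.request


def download_thumbnail_sketch(index, thumbnail_url, path, file_name,
                              fetch=urllib.request.urlretrieve):
    """Download one file and return [index, url, outcome]."""
    target = os.path.join(path, file_name)
    try:
        fetch(thumbnail_url, target)
        outcome = "success"
    except urllib.error.HTTPError:
        outcome = "HTTPError"
    except UnicodeEncodeError:
        outcome = "UnicodeEncodeError"
    return [index, thumbnail_url, outcome]


def fake_ok(url, target):
    # Stand-in for a successful download: writes placeholder bytes.
    with open(target, "wb") as handle:
        handle.write(b"fake image bytes")


def fake_404(url, target):
    # Stand-in for a dead link.
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)


with tempfile.TemporaryDirectory() as tmp:
    print(download_thumbnail_sketch(0, "http://example.org/a.jpg", tmp, "a.jpg", fetch=fake_ok))
    print(download_thumbnail_sketch(1, "http://example.org/b.jpg", tmp, "b.jpg", fetch=fake_404))
```

Returning the outcome as a value instead of raising lets a caller collect one row per thumbnail, for example for a summary file such as download_results.csv.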

src.data.knowledge_graphs.download_wikidata_thumbnails(path: str = 'data/thumbnails/wikidata_thumbnails', query_links: bool = True, download: bool = True)[source]

Queries the thumbnail links from Wikidata and saves them in the file path/Thumbnails_links.csv.

Downloads the Wikidata thumbnails and stores them in the following structure:

<Entity1>
    <Thumbnail1>
<Entity2>
    <Thumbnail1>

Saves a summary of the results in path/download_results.csv and the images in path/thumbnails.

Parameters
  • path (str) – Path where the thumbnails should be saved.

  • query_links (bool) – Whether to query the thumbnail links.

  • download (bool) – Whether to download the thumbnails.

Gets the corresponding Wikidata/DBpedia uri for a DBpedia/Wikidata uri.

Parameters

uri (str) – A DBpedia- or Wikidata-URI.

Returns

The uri of the other knowledge graph.

Return type

corresponding_uri (str)

src.data.knowledge_graphs.get_uri_from_csv(name: str, data: pandas.core.frame.DataFrame)[source]

Gets the DBpedia and Wikidata URIs of an entity from a Thumbnails_links.csv loaded as a DataFrame.

Parameters
  • name (str) – Name of the entity.

  • data (DataFrame) – DataFrame holding the contents of the Thumbnails_links.csv.

Returns

The URI of the entity in DBpedia (dbpedia_uri) and the URI of the entity in Wikidata (wikidata_uri).

Return type

dbpedia_uri (str), wikidata_uri (str)

src.data.knowledge_graphs.get_uri_from_label(label: str) → tuple[source]

Gets the corresponding Wikidata and DBpedia URIs for a label.

Parameters

label (str) – A label.

Returns

The URI of the entity in DBpedia (dbpedia_uri) and the URI of the entity in Wikidata (wikidata_uri).

Return type

tuple of dbpedia_uri (str) and wikidata_uri (str)

Scraping from Google Images

src.data.enrich_with_photos.compare_install_face(img, img_dir, downloaded, encode=None)[source]

Downloads an image only if the face detected in it is sufficiently similar to the faces in the other images.
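
One way to picture such a similarity check is a distance threshold on face embeddings. The sketch below is an assumption about the general technique, not the project's code: `is_similar_face`, the comparison against the mean encoding, the tolerance of 0.6 and the choice to accept an image when no reference exists are all hypothetical.

```python
import math


def is_similar_face(candidate, reference_encodings, tolerance=0.6):
    """Return True if `candidate` is within `tolerance` of the mean reference encoding.

    Encodings are plain lists of floats of equal length. The tolerance
    value is an assumption chosen for this sketch.
    """
    if not reference_encodings:
        # Nothing to compare against yet, so accept the image (sketch assumption).
        return True
    count = len(reference_encodings)
    mean = [
        sum(encoding[i] for encoding in reference_encodings) / count
        for i in range(len(candidate))
    ]
    # Euclidean distance between the candidate and the mean encoding.
    return math.dist(candidate, mean) <= tolerance


references = [[0.1, 0.0], [-0.1, 0.0]]
print(is_similar_face([0.0, 0.2], references))
print(is_similar_face([1.0, 1.0], references))
```

Comparing against the mean keeps the check cheap as more images are accepted. Comparing against every reference individually would be stricter.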

src.data.enrich_with_photos.download_images(path, main_keyword, supplemented_keywords, download_dir, num_images, encode=None)[source]

Downloads images using one main keyword and multiple supplementary keywords.

Parameters
  • main_keyword (str) – Main keyword.

  • supplemented_keywords (list[str]) – List of supplementary keywords.

Returns

None

src.data.enrich_with_photos.download_page(url: str)[source]

Downloads the raw content of the page.

Parameters

url (str) – URL of the page

Returns

Raw content of the page

Return type

content (str)

src.data.enrich_with_photos.download_thumbnails_entity_list(download_dir, entity_list, num_images, enrich=True)[source]

src.data.enrich_with_photos.encode_downloaded_img(img)[source]

Creates the embedding for an image

Parameters

img – The image to encode

Returns

embedding

src.data.enrich_with_photos.enrich_with_google_photos(thumbnails_path, num_images, enrich=True)[source]

src.data.enrich_with_photos.fetch_image(link)[source]

src.data.enrich_with_photos.get_face_encoding(entity)[source]

src.data.enrich_with_photos.parse_page(url: str)[source]

Parses the page and collects the links of all images. At most 100 links are returned because of a limit imposed by Google.

Parameters

url (str) – url of the page

Returns

A set containing the urls of images

Return type

urls (set)
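
Collecting image links into a capped set can be sketched with the standard library HTML parser. This is an illustration only: `ImageLinkParser` and `extract_image_urls` are hypothetical names, the sketch works on an HTML string rather than fetching a URL, and real Google result pages may embed image links differently from plain `img` tags.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Collect absolute image URLs from <img src=...> tags, up to a limit."""

    def __init__(self, limit=100):
        super().__init__()
        self.limit = limit
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != "img" or len(self.urls) >= self.limit:
            return
        src = dict(attrs).get("src")
        # Skip inline data URIs and relative paths; keep http(s) links only.
        if src and src.startswith(("http://", "https://")):
            self.urls.add(src)


def extract_image_urls(html, limit=100):
    """Return a set with at most `limit` image URLs found in `html`."""
    parser = ImageLinkParser(limit)
    parser.feed(html)
    return parser.urls


page = (
    '<img src="https://example.org/a.jpg">'
    '<img src="https://example.org/a.jpg">'
    '<img src="data:image/png;base64,AAAA">'
    '<img src="https://example.org/b.jpg">'
)
print(sorted(extract_image_urls(page)))
```

Using a set removes duplicate links automatically, which matches the documented return type.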

YouTube

src.data.youtube.download_youtube_video(url: str, path: str = 'data/datasets/youtube') → str[source]

Downloads a single video from YouTube.

Parameters
  • url (str) – YouTube-link of the video to download.

  • path (str) – Path where the videos should be saved.

Returns

Path to the downloaded file.

Return type

path (str)

src.data.youtube.download_youtube_videos(txt_path: Optional[str] = None, path: str = 'data/datasets/youtube') → list[source]

Downloads videos from YouTube.

Parameters
  • txt_path (str) – Location of a text file containing one YouTube video URL per line.

  • path (str) – Path where the videos should be saved.

Returns

List of paths to the downloaded videos.

Return type

video_paths (list)
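
The expected input file is a plain text file with one URL per line. As a rough sketch of how such a file could be read before the downloads start, consider the hypothetical helper `read_video_urls` below. Skipping blank lines is an assumption of this sketch, and the URLs in the sample are placeholders.

```python
import os
import tempfile


def read_video_urls(txt_path):
    """Return the non-empty, whitespace-stripped lines of a text file."""
    with open(txt_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Write a small sample file, read it back, then clean up.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as handle:
    handle.write("https://www.youtube.com/watch?v=VIDEO_ID_1\n\n")
    handle.write("https://www.youtube.com/watch?v=VIDEO_ID_2\n")
print(read_video_urls(sample_path))
os.remove(sample_path)
```

Each returned URL could then be passed to download_youtube_video, collecting the returned paths into a list.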