Data

Our project can be used to download evaluation datasets, thumbnails, and YouTube videos, and to scrape images from Google.

Datasets

src.data.datasets.download_imdb_faces_dataset(path: str = 'data/datasets/imdb-faces')[source]

Downloads the IMDb-Faces dataset and parses an information.csv. Details about the dataset can be found here: https://github.com/fwang91/IMDb-Face.

Warning: Many links are outdated, so only about half of the dataset can still be downloaded.

Parameters: path (str) – Path where the videos and information.csv should be saved.

src.data.datasets.download_imdb_wiki_dataset(path: str = 'data/datasets/imdb-wiki')[source]

Downloads the IMDb-Wiki dataset and parses an information.csv. Details about the dataset can be found here: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/.

Parameters: path (str) – Path where the videos and information.csv should be saved.

src.data.datasets.download_seqamlab_dataset(path: str = 'data/datasets/ytcelebrity')[source]

Downloads the YouTube Celebrities Face Tracking and Recognition Dataset and parses an information.csv. Details about the dataset can be found here: http://seqamlab.com/youtube-celebrities-face-tracking-and-recognition-dataset/.

Parameters: path (str) – Path where the videos and information.csv should be saved.

src.data.datasets.download_youtube_faces_db(path: str = 'data/datasets/youtube-faces-db', download: bool = False)[source]

Downloads the YouTube Faces Database and parses an information.csv. Details about the dataset can be found here: https://www.cs.tau.ac.il/~wolf/ytfaces/.

Parameters

path (str) – Path where the videos are located and where the information.csv should be saved.
download (bool) – Whether the dataset should be downloaded automatically or only parsed. The download can take a long time.

Thumbnails

src.data.knowledge_graphs.download_dbpedia_thumbnails(path: str = 'data/thumbnails/dbpedia_thumbnails', query_links: bool = True, download: bool = True)[source]

Queries the thumbnail links from DBpedia and saves them in the file path/Thumbnails_links.csv. Downloads the DBpedia thumbnails and arranges them in the following structure:

    <Entity1>
        <Thumbnail1>
    <Entity2>
        <Thumbnail1>

Saves a summary of the results in path/download_results.csv and the images in path/thumbnails.

Parameters

path (str) – Path where the thumbnails should be saved.
query_links (bool) – Whether to query the thumbnail links.
download (bool) – Whether to download the thumbnails.

src.data.knowledge_graphs.download_entity_list(path: str = 'data/thumbnails', entity_list: Optional[list] = None)[source]

Downloads a specific list of entity thumbnails from wikidata.

Parameters

path (str) – Path where the thumbnails are stored.
entity_list (list) – List of entities whose thumbnails should be downloaded.

Returns

A list of the entities that are still missing.

Return type

sm (list)
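
The idea of returning the "still missing" entities can be illustrated with a small check on the file system. The following is only a minimal sketch and not the project's implementation: it assumes that each entity has its own folder below the thumbnails path, and the helper name find_missing_entities is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_entities(thumbnails_dir: str, entity_list: Iterable[str]) -> List[str]:
    """Return the entities that have no thumbnail file below thumbnails_dir.

    Assumption: thumbnails are stored as <thumbnails_dir>/<entity>/<file>.
    """
    root = Path(thumbnails_dir)
    missing = []
    for entity in entity_list:
        entity_dir = root / entity
        # An entity counts as present only if its folder holds at least one file.
        has_thumbnail = entity_dir.is_dir() and any(
            p.is_file() for p in entity_dir.iterdir()
        )
        if not has_thumbnail:
            missing.append(entity)
    return missing
```

The order of the input list is preserved, so the result can be passed to another download attempt unchanged.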

src.data.knowledge_graphs.download_images(path, method='wikidata')[source]

Downloads entity thumbnails from the selected knowledge graph.

Parameters

path (str) – Path where the thumbnails should be stored and the Thumbnails_links.csv is located.
method (str) – Source knowledge graph.

src.data.knowledge_graphs.download_missing_thumbnails(path: str = './videos/ytcelebrity', path_thumbnails: str = 'data/thumbnails')[source]

Compares a list of entities with the ones in a dataset and downloads missing ones.

Parameters

path (str) – Path where the information.csv of the dataset is saved.
path_thumbnails (str) – Path where the Thumbnails_links.csv is saved.

Returns

List of missing entities that have been found.

Return type

missing_entities (list)
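
The comparison described above amounts to a set difference between the entities of a dataset and the entities for which thumbnail links are known. The sketch below shows that step on CSV text. It is an illustration under assumptions: the column name "entity" and both helper names are hypothetical, and the real files may use a different layout.

```python
import csv
import io
from typing import List, Set


def read_column(csv_text: str, column: str) -> Set[str]:
    """Collect the distinct, non-empty values of one CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}


def entities_without_thumbnail(information_csv: str, links_csv: str) -> List[str]:
    """Return dataset entities that do not appear in the thumbnail links."""
    # "entity" is an assumed column name, used here for both files.
    dataset_entities = read_column(information_csv, "entity")
    known_entities = read_column(links_csv, "entity")
    return sorted(dataset_entities - known_entities)
```

Sorting the result keeps the output deterministic, which makes repeated runs easier to compare.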

src.data.knowledge_graphs.download_thumbnail(index: int, i_thumbnail_url: str, i_path: str, i_file_name: str)[source]

Downloads a single thumbnail.

Parameters

index (int) – Index of the thumbnail in the dataframe of thumbnail URLs.
i_thumbnail_url (str) – URL of the thumbnail to download.
i_path (str) – The download path.
i_file_name (str) – The file name.

Returns

A list containing the index, the thumbnail URL, and the outcome (success, HTTPError or UnicodeEncodeError).

Return type

output (list)
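
A function with this kind of return value could look roughly like the sketch below. It uses only the standard library and is not the project's code: the names download_one and fetch_bytes are hypothetical, and the fetch parameter exists only so that the example can be tried without network access.

```python
import os
import urllib.error
import urllib.request
from typing import Callable, List


def fetch_bytes(url: str) -> bytes:
    """Fetch the raw bytes behind a URL."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_one(index: int, url: str, path: str, file_name: str,
                 fetch: Callable[[str], bytes] = fetch_bytes) -> List:
    """Download one file and report [index, url, outcome]."""
    try:
        data = fetch(url)
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, file_name), "wb") as handle:
            handle.write(data)
        outcome = "success"
    except urllib.error.HTTPError:
        # The server answered with an error status such as 404.
        outcome = "HTTPError"
    except UnicodeEncodeError:
        # The URL or file name contains characters that cannot be encoded.
        outcome = "UnicodeEncodeError"
    return [index, url, outcome]
```

Returning the outcome instead of raising keeps a batch download running when individual links are broken, and the collected rows can be written to a summary file afterwards.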

src.data.knowledge_graphs.download_wikidata_thumbnails(path: str = 'data/thumbnails/wikidata_thumbnails', query_links: bool = True, download: bool = True)[source]

Queries the thumbnail links from Wikidata and saves them in the file path/Thumbnails_links.csv. Downloads the Wikidata thumbnails and arranges them in the following structure:

    <Entity1>
        <Thumbnail1>
    <Entity2>
        <Thumbnail1>

Saves a summary of the results in path/download_results.csv and the images in path/thumbnails.

Parameters

path (str) – Path where the thumbnails should be saved.
query_links (bool) – Whether to query the thumbnail links.
download (bool) – Whether to download the thumbnails.

src.data.knowledge_graphs.get_same_as_link(uri: str) → str[source]

Gets the corresponding Wikidata URI for a DBpedia URI, or the corresponding DBpedia URI for a Wikidata URI.

Parameters: uri (str) – A DBpedia or Wikidata URI.
Returns: The URI of the entity in the other knowledge graph.
Return type: corresponding_uri (str)
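
One possible way to look up such a counterpart is an owl:sameAs query. The sketch below only builds the SPARQL query text and does not send it anywhere. It assumes that both directions are answered by a DBpedia SPARQL endpoint that holds owl:sameAs links to Wikidata; the function name build_same_as_query is hypothetical, and the project may resolve the links differently.

```python
OWL_PREFIX = "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def build_same_as_query(uri: str) -> str:
    """Build a SPARQL query that asks for the counterpart of a URI."""
    if uri.startswith(DBPEDIA_RESOURCE):
        # DBpedia -> Wikidata: follow owl:sameAs from the DBpedia resource.
        pattern = f"<{uri}> owl:sameAs ?other ."
        target = WIKIDATA_ENTITY
    elif uri.startswith(WIKIDATA_ENTITY):
        # Wikidata -> DBpedia: find the resource that points at the entity.
        pattern = f"?other owl:sameAs <{uri}> ."
        target = DBPEDIA_RESOURCE
    else:
        raise ValueError(f"Neither a DBpedia nor a Wikidata URI: {uri}")
    return (
        OWL_PREFIX
        + "SELECT ?other WHERE {\n"
        + f"  {pattern}\n"
        + f'  FILTER(STRSTARTS(STR(?other), "{target}"))\n'
        + "}"
    )
```

The FILTER restricts the result to the other knowledge graph, because a resource can carry owl:sameAs links to many further datasets.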

src.data.knowledge_graphs.get_uri_from_csv(name: str, data: pandas.core.frame.DataFrame)[source]

Gets the DBpedia and Wikidata URIs of an entity from a Thumbnails_links.csv that has been loaded as a dataframe.

Parameters

name (str) – Name of the entity.
data (DataFrame) – Dataframe of the Thumbnails_links.csv.

Returns

dbpedia_uri (str) – The URI of the entity in DBpedia.
wikidata_uri (str) – The URI of the entity in Wikidata.

src.data.knowledge_graphs.get_uri_from_label(label: str) → tuple[source]

Gets the corresponding Wikidata and DBpedia URIs for a label.

Parameters: label (str) – A label.
Returns: dbpedia_uri (str) – The URI of the entity in DBpedia; wikidata_uri (str) – The URI of the entity in Wikidata.
Return type: tuple

Scraping from Google Images

src.data.enrich_with_photos.compare_install_face(img, img_dir, downloaded, encode=None)[source]

Downloads an image only if the detected face is sufficiently similar to the faces in the other images.
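
A similarity check of this kind is commonly done on face embeddings. The following sketch is an assumption about how such a filter might work, not a description of compare_install_face itself: it compares plain lists of floats with the Euclidean distance, and the function names and the threshold of 0.6 are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_similar_face(candidate: Sequence[float],
                    references: List[Sequence[float]],
                    threshold: float = 0.6) -> bool:
    """Accept a candidate whose mean distance to the references is small."""
    if not references:
        # Nothing to compare against yet, so the first image is accepted.
        return True
    mean_distance = sum(
        euclidean_distance(candidate, ref) for ref in references
    ) / len(references)
    return mean_distance <= threshold
```

Using the mean distance rather than the minimum makes the filter less likely to accept an image that matches only one outlier among the reference images.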

src.data.enrich_with_photos.create_image_links(main_keyword, supplemented_keywords)[source]

src.data.enrich_with_photos.download_images(path, main_keyword, supplemented_keywords, download_dir, num_images, encode=None)[source]

Downloads images for one main keyword combined with multiple supplemented keywords.

Parameters

main_keyword (str) – Main keyword.
supplemented_keywords (list[str]) – List of supplemented keywords.

Returns

None

src.data.enrich_with_photos.download_page(url: str)[source]

Downloads the raw content of the page.

Parameters: url (str) – URL of the page
Returns: Raw content of the page
Return type: content (str)

src.data.enrich_with_photos.download_thumbnails_entity_list(download_dir, entity_list, num_images, enrich=True)[source]

src.data.enrich_with_photos.encode_downloaded_img(img)[source]

Creates the embedding for an image

Parameters: img – The image to encode.
Returns: The embedding of the image.

src.data.enrich_with_photos.enrich_with_google_photos(thumbnails_path, num_images, enrich=True)[source]

src.data.enrich_with_photos.fetch_image(link)[source]

src.data.enrich_with_photos.get_face_encoding(entity)[source]

src.data.enrich_with_photos.parse_page(url: str)[source]

Parses the page and collects the links of all images. At most 100 links are returned because of a limit imposed by Google.

Parameters: url (str) – URL of the page.
Returns: A set containing the urls of images
Return type: urls (set)
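
Collecting image links into a capped set can be sketched with the standard library HTML parser. This is a simplified illustration and not the project's parser: it assumes that the links appear as src attributes of img tags, which may not match the markup Google actually serves, and the names ImageLinkCollector and extract_image_urls are hypothetical.

```python
from html.parser import HTMLParser
from typing import Set

MAX_LINKS = 100  # upper bound mentioned in the description above


class ImageLinkCollector(HTMLParser):
    """Collect absolute image URLs from <img src="..."> tags."""

    def __init__(self, limit: int = MAX_LINKS) -> None:
        super().__init__()
        self.limit = limit
        self.urls: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "img" or len(self.urls) >= self.limit:
            return
        src = dict(attrs).get("src")
        # Skip inline data URIs and relative paths.
        if src and src.startswith(("http://", "https://")):
            self.urls.add(src)


def extract_image_urls(html: str, limit: int = MAX_LINKS) -> Set[str]:
    """Return up to `limit` distinct image URLs found in the HTML text."""
    collector = ImageLinkCollector(limit)
    collector.feed(html)
    collector.close()
    return collector.urls
```

A set removes duplicate links automatically, which matters because the same image often appears several times on a result page.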

YouTube

src.data.youtube.download_youtube_video(url: str, path: str = 'data/datasets/youtube') → str[source]

Downloads a single video from YouTube.

Parameters

url (str) – YouTube link of the video to download.
path (str) – Path where the videos should be saved.

Returns

Path to the downloaded file.

Return type

path (str)

src.data.youtube.download_youtube_videos(txt_path: Optional[str] = None, path: str = 'data/datasets/youtube') → list[source]

Downloads videos from YouTube.

Parameters

txt_path (str) – Location of a text file that contains one YouTube video URL per line.
path (str) – Path where the videos should be saved.

Returns

List of paths to the downloaded videos.

Return type

video_paths (list)
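
Reading the line-wise URL file can be sketched as follows. The helper read_video_urls is hypothetical and only covers the parsing step, not the download. Skipping blank lines and lines that start with "#" is an assumption made for this example.

```python
from pathlib import Path
from typing import List


def read_video_urls(txt_path: str) -> List[str]:
    """Read one URL per line from a text file."""
    urls = []
    for line in Path(txt_path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Assumption: blank lines and "#" comment lines carry no URL.
        if url and not url.startswith("#"):
            urls.append(url)
    return urls
```

Each returned URL could then be handed to a single-video download function, and the resulting file paths collected in a list.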