NLP Regional Translator
NLP, Keras/Tensorflow, LSTM, Translation, Twitter, Python
December 18, 2018
In this project, I...
- Collect, clean, parse, geotag and label a dataset of ~2m tweets using Python/Pandas.
- Use scikit-learn and Keras to build and evaluate a set of NLP classifiers via LSTM, Naive Bayes, logistic regression, etc.
- Invent a new method to geographically 'translate' tweets using an LSTM encoder/decoder and Gradient Ascent.
This NLP project had two aims:
- Identify the geographic origin of a tweet solely by its content.
- "Translate" a given tweet from one region to another (e.g., San Francisco to New York).
The first part was fairly straightforward: we trained a number of classifiers using various techniques. Accuracy was never much better than 20% (Twitter is noisy, and most tweets don't contain much geographic signal), but the more advanced neural net models were able to outperform the baseline classifiers.
The second part presented a significant challenge, as there is no reference set of geographically-tagged sentences to train a translator as you would do with English-French (for example). I wound up devising an encoder-decoder model that leveraged LSTMs and, inspired by Google DeepDream, utilized gradient ascent in latent space to alter a sentence's signal from one region to another. To my knowledge, nobody has done this with language processing before.
This was a group project, and each team member attempted the translation piece with a different technique. In the interest of brevity, here is an edited version of our final paper that includes only the portions of the project that I worked on. You can read the full paper here.
Intralingual Translation without Reference
via Gradient Ascent and Adversarial Training
With the advent of big data, we have achieved impressive feats in interlingual translation from one language to another. In this paper, we wanted to explore the possibility of intralingual ‘translation’ of sentences between topics in the same language, using data collected from the social media platform Twitter. Specifically, we will automatically identify and rewrite sentences from one geographical region of North America to another. We will do so without using reference data, and generate the translator model in three different ways: enhancing a particular regional pattern in an LSTM encoder-decoder model, doing the same in a CNN encoder-decoder model, and training an LSTM encoder-decoder model adversarially. Translations will be evaluated both by regional accuracy and by retranslation.
Automatically rewriting a piece of text to reflect a different topic, style or sentiment while still maintaining the other characteristics of the original presents an interesting challenge, and one we were unable to find addressed in current research. While there is ample work in machine translation that has achieved amazing results (Wu et al. 2016), and in machine paraphrasing (Barzilay and Lee 2003) or changing writing styles (Xu et al. 2012), nearly all of these solutions involve leveraging parallel corpora or, if monolingual, limit themselves to the word or phrase level (Conneau et al. 2017). Others have used variational autoencoders to generate text that interpolates between the latent space of two previously known sentences (Kingma and Welling 2013). Our problem differs because rather than trying to say the same thing in a different way or language, we’re trying to say a different thing in the same language.
Two works serve as inspiration for our approach to this challenge. The first is Google Deep Dream, which was designed as a way to use gradient ascent to generate images out of random noise to gain a better understanding of what exactly the model was learning (“DeepDream - a Code Example for Visualizing Neural Networks” n.d.). The second is Facebook’s work on monolingual unsupervised machine translation (Lample et al. 2017). In both cases, the authors leverage their model’s latent space to extract meaning.
As our vehicle for exploring this problem, we chose geographically-coded Twitter data, with the aim of relocating tweets that contain geographic information to a different region in North America. While highly-regional words such as sports team names are an obvious target, the model should also pick up more subtle nuances of vocabulary, writing style and regional dialects. As a framework, the approach should work on any corpus of differentially-labelled data: for example, rewriting restaurant reviews from sushi to pizza, or product reviews from negative to positive. To determine the feasibility of this task, we will use gradient ascent to enhance the regional aspects of latent states in an LSTM encoder-decoder model such as those developed for language translation (Sutskever, Vinyals, and Le 2014). Since multiple characteristics can define the regional information within a tweet, the encoder is used to compress the high-dimensional input so that the hidden states contain enough information to represent the tweet in a lower-dimensional format. The decoder then steps in to reconstruct the compressed input in a hopefully meaningful way.
Although our goal is to translate tweets from one region to another, to accomplish this the model first needs to learn, during training, the origin of a particular tweet. Prior research on identifying geolocation and regional dialects from Twitter data was based on probabilistic models (Eisenstein et al. 2010) and content-based approaches (Cheng, Caverlee, and Lee 2010). However, we chose to employ deep neural networks to solve this particular problem. The model predicts the regional origin of the tweet, and if the prediction is not accurate, the error computed by the cost function is propagated backwards to update the weights. Our approach takes the output of the encoder and, utilizing the gradients of the trained classifier, emphasizes the features that fall in line with the targeted region more and more until the tweet appears to originate from that region.
The data consists of ~3 million geocoded original tweets (no quote-tweets or retweets) collected from Twitter’s Real-Time API during November 2018. Every tweet was assigned to one of 23 regions in North America according to proximity to a particular city. The dataset was then balanced across the 13 most populated regions, resulting in a final dataset of ~1.7m tweets, with 135k tweets per region. Only the full text of the tweet and its corresponding label were used as features in the models. URLs and user tags were removed during preprocessing. However, since the language used on Twitter is often symbolic or strange, we kept everything else within the 10k most frequent tokens, including emojis and unusual spellings that might convey regional signal.
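The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the regular expressions, the function names (`clean_and_tokenize`, `build_vocab`, `encode`) and the decision to map out-of-vocabulary tokens to an `<unk>` id are all assumptions, since the paper only states that URLs and user tags were removed and the 10k most frequent tokens were kept.

```python
import re
from collections import Counter

# Assumed patterns: the paper does not give the exact rules it used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A token is a word (optionally a hashtag, optionally with apostrophes)
# or any single non-space symbol, which keeps emojis and punctuation.
TOKEN_RE = re.compile(r"#?\w+(?:'\w+)*|[^\w\s]")


def clean_and_tokenize(text):
    """Strip URLs and user tags, lowercase, and split into tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


def build_vocab(token_lists, max_size=10_000):
    """Map the max_size most frequent tokens to integer ids (0 = pad, 1 = unknown)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, _ in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its id; rare tokens fall back to <unk> (an assumption)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
```

For example, `clean_and_tokenize("Go Packers!! 🏈 http://t.co/abc @bob #GoPackGo")` drops the link and the user tag but keeps the emoji and the hashtag as tokens. Emojis built from several code points would be split into pieces by this simple pattern.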
Models were evaluated on a standardized test bed of 130 tweets, 10 from each region, at 10 levels of predictive probability according to Naive Bayes. They were then scored on two main criteria: BLEU scores on sentences that have been translated into a target category and then back into the source category (Rapp 2009), and the percentage of times the model’s encoder successfully predicts the translation as belonging to the target region. Combined, these measure the model’s ability to accurately target the regional identifiers in a given sentence and adjust them with minimal warping of the sentence’s coherence. We also measure ‘Fidelity,’ which is simply the model’s ability to reconstruct an input sentence without any alteration of the latent states.
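As a rough illustration of the two scoring criteria, the sketch below computes a simplified sentence-level BLEU between a source tweet and its retranslation, and the fraction of translations classified as the target region. It is not the evaluation code used for the paper: the add-one smoothing, the single-reference setup and the function names are assumptions chosen to keep the example short and dependency-free.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Simplified single-reference BLEU with add-one smoothed n-gram precisions."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        # Add-one smoothing (an assumption) keeps short tweets from scoring exactly zero.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision)


def translation_accuracy(predicted_regions, target_regions):
    """Fraction of translations whose predicted region equals the requested target."""
    hits = sum(p == t for p, t in zip(predicted_regions, target_regions))
    return hits / len(target_regions)
```

For the retranslation score, `reference` would be the tokenized source tweet and `candidate` the tweet after translating to the target region and back. A perfect round trip scores 1.0.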
As regional classification is a key mechanism in our model, we set out to determine the best classifier for identifying the source of a tweet’s content. The most successful classifier was a two-layer word-level LSTM, consisting of 200-dimensional GloVe embeddings and a single bidirectional layer followed by a regular (unidirectional) layer. The CNN classifier that gave results comparable to the LSTM was a 2-layer CNN with max pooling and 2 dense layers, using 200-dimensional Twitter GloVe word embeddings. After hyperparameter tuning, we found that larger word embeddings and more hidden features in each layer improved classification.
|Classifier||Accuracy||Classifier||Accuracy|
|Random choice||7.8%||Neural Bag-of-Words||9.5%|
|Human Guess||13.5%||LSTM (Character Level)||10.03%|
|Unigram Naive Bayes||18.51%||LSTM (Word Level)||21.3%|
|Bigram Naive Bayes||13.05%||CNN (Character Level)||19.4%|
|Logistic Regression||18.62%||CNN (Word Level)||21.2%|
Our data has no reference sentences to train the decoder, but as we are staying within the same language, we can leverage the intelligence of the model to alter the sentence. The encoder/decoder is trained to simply reconstruct its own input, so the decoder learns how to build an English sentence from latent space vectors. Rather than being discarded as usual, the output of the encoder is passed through a softmax layer to a classifier, thereby training the encoder to differentiate sentence content based on labels.
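One way to write down the joint training objective this paragraph describes is given below. It is a sketch under assumptions: the symbols are introduced here for illustration, and the weighting factor $\lambda$ is hypothetical, since the paper does not state how the reconstruction and classification losses were combined.

```latex
h = \mathrm{Enc}(x), \qquad
\mathcal{L}(x, y) =
\underbrace{-\sum_{t=1}^{T} \log p_{\mathrm{dec}}\left(x_t \mid x_{<t},\, h\right)}_{\text{reconstruction}}
\;+\; \lambda\,
\underbrace{\left(-\log p_{\mathrm{cls}}\left(y \mid h\right)\right)}_{\text{region classification}}
```

Here $x = (x_1, \dots, x_T)$ is the tokenized tweet, $y$ its region label, $h$ the encoder output, $p_{\mathrm{dec}}$ the decoder's next-token distribution and $p_{\mathrm{cls}}$ the softmax classifier over regions. The first term teaches the decoder to rebuild a sentence from $h$. The second pushes the encoder to make $h$ separable by region.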
During translation, a sentence is passed through the encoder to the classifier, and the gradients of the target-region loss are calculated with respect to the encoder's output, the hidden states that feed the classifier. Those gradients are iteratively added to the hidden states and the loss recalculated, until the probability that the hidden states match the target region is maximized. The altered states are then fed into the decoder, which makes its best effort at reconstructing an English sentence from them.
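The latent-space update can be illustrated with a toy example. The sketch below is not the project's Keras model: it assumes a plain linear softmax classifier over a small latent vector so that the gradient has a closed form, and the names (`region_probs`, `grad_log_prob`, `ascend`), the step size and the stopping probability are all hypothetical. In the real model the gradient would come from backpropagation through the trained classifier head.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def region_probs(h, W, b):
    """Region probabilities for latent vector h under a linear softmax classifier."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


def grad_log_prob(h, W, b, target):
    """Gradient of log p(target | h) with respect to h: W[target] - sum_k p_k * W[k]."""
    p = region_probs(h, W, b)
    return [W[target][j] - sum(p[k] * W[k][j] for k in range(len(W)))
            for j in range(len(h))]


def ascend(h, W, b, target, step=0.5, max_steps=50, stop_prob=0.9):
    """Repeatedly add the gradient to h until the target probability reaches stop_prob."""
    h = list(h)
    history = [region_probs(h, W, b)[target]]
    for _ in range(max_steps):
        if history[-1] >= stop_prob:
            break
        g = grad_log_prob(h, W, b, target)
        h = [h_j + step * g_j for h_j, g_j in zip(h, g)]
        history.append(region_probs(h, W, b)[target])
    return h, history
```

Calling `ascend([2.0, 0.0], W, b, target=1)` with a three-region weight matrix `W` returns the altered latent vector and the target probability after each step. In the full model, that altered vector is what would be handed to the decoder.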
Although language translators of this sort are often done at the character level, our preliminary results indicated a word-level approach might be more productive.
In all, seven LSTM models were trained at varying levels of complexity, dataset size, regularization, and training length. The most useful results came from two models: a 300k tweet set with a relatively simple bidirectional encoder, a single LSTM decoder and no regularization, and a more complex 500k tweet set.
Like Deep Dream, the results that come out of the model are often somewhat hallucinatory, but there are signs that it is following some sort of internal logic. There is a lot of noise in tweets, and as a result the models can have a difficult time accurately targeting the translation--or at least sufficiently altering the text to classify as originating from the correct region. Despite this, regional patterns are clearly reflected in the results, as is apparent in the tweet ‘nobody has better food than waffle house’ translated into every region:
|Target Region||Translated Tweet||Predicted Region||Probability|
|Chicago||nobody has better meat food house||Nashville||12.81%|
|Cincinnati||nobody has better food than waffle house||Charlotte||23.44%|
|Houston||nobody has every better waffle juice house bro||Charlotte||24.9%|
|Los Angeles||nobody has better food than james house||Charlotte||19.3%|
|Nashville||nobody has good food than waffle house||Charlotte||20.98%|
|New York||nobody has rich food for illegal||Charlotte||12.38%|
|Oklahoma City||nobody has better food than pizza house||Oklahoma City||14.37%|
|San Francisco||nobody has better taste two bucks house||Cincinnati||29.38%|
|Seattle||nobody has better food than james house||Charlotte||25.06%|
|Tampa||nobody has better food than waffle house bill||Charlotte||25.06%|
|Toronto||nobody has food better meat at against||Cincinnati||13.45%|
|Washington||nobody has better food for diet||Houston||12.35%|
Although the model failed to meaningfully relocate the tweet, it did recognize that, by and large, Waffle House is a southeastern US phenomenon and is not reflected in northern or western regions (the further you get from Waffle House territory, the more likely the translator will have excised the words 'waffle' and/or 'house'). There is a chain called ‘Pizza House’ in Oklahoma City. New York is curious.
Because Gradient Ascent is iterative, and we compute the loss at each step, we can set the threshold for how far to translate a given sentence into its target category, and choose how much regional ‘flavor’ to imbue the result. Here are the interim steps of translating a job search from Indiana to New York:
|Step||Decoded Latent State|
|0||(source) interested in a #job in #indianapolis, in? this could be a great fit: #construction #careerarc|
|1||interested in a #job in #winchester, va? this could be a great fit: #construction #hiring #careerarc|
|2||interested in a #job in #winchester, va? this could be a great fit: #physician #nyc #careerarc|
|3||interested in a #job in ct? this could be a great film. @ new #facilitiesmgmt #hiring #careerarc|
|4||interested in a #job in #winchester, ct? this could be a great fit: nj #hiring #careerarc|
Iterations after 100% loss quickly lose any relation to the source sentence or the target region, but there is a nebulous ‘sweet spot’ between a loss of 85% and a couple of steps past 100% where coherence and translation are maximized. Small modifications to the normalization that is applied to the gradients at each step can have a profound effect on the results. We got the best results from simple x/max(x), although this could be scaled.
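A minimal sketch of the normalization and of picking a stopping step follows. Both helpers are hypothetical. The text gives the normalization only as x/max(x), so dividing by the largest absolute value (which preserves the gradient's direction when every entry is negative) is an assumption. `steps_to_keep` likewise assumes that the percentages quoted above are a per-step score recorded during ascent.

```python
def normalize_by_max(grad, scale=1.0, eps=1e-12):
    """Scale a gradient vector so its largest-magnitude entry becomes +/- scale."""
    peak = max(abs(g) for g in grad)
    if peak < eps:
        # A vanishing gradient carries no direction, so take no step.
        return [0.0] * len(grad)
    return [scale * g / peak for g in grad]


def steps_to_keep(scores, threshold, extra_steps=0):
    """Index of the first step whose score reaches threshold, plus optional extra steps.

    Falls back to the last recorded step if the threshold is never reached.
    """
    last = len(scores) - 1
    for i, score in enumerate(scores):
        if score >= threshold:
            return min(i + extra_steps, last)
    return last
```

With per-step scores of `[0.10, 0.50, 0.86, 0.99, 1.0, 1.0]`, `steps_to_keep(scores, 0.85)` selects step 2, and `extra_steps=2` moves the choice a couple of steps further along, which is one way to probe the sweet spot described above.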
Results and Discussion
|Model||Type||Words||Epochs||Fidelity||Trans Accuracy||Retrans Accuracy||BLEU for translation||BLEU for retranslation|
Surprisingly, we found that in some ways, for this particular problem, less is more. Generally speaking, the more we trained the models (whether with more complex architectures, on larger datasets, or for more epochs), the more they “hardened” and became reluctant to reconstruct the altered hidden states into something else. On the other hand, simpler models may be more willing to change the text, but the outputs are generally more chaotic, so there is a trade-off to consider. According to the metrics we’ve chosen, the ideal model would have high translation accuracy (i.e., it actually shifts the text to the target region), a lower BLEU score for the translation and a higher BLEU score for the retranslation, indicating a return of some of the original text (the perfect model would recreate the source sentence completely).
Part of the highly-trained models’ rigidity is in their deep uncertainty over what differentiates the classes. These histograms represent probability levels for predicted tweets, and it is clear that the 300k model is much more similar to Naive Bayes in its level of confidence than the 1m-tweet model. Although their respective accuracies are not far off (~20%), when you ask the highly-trained LSTM to alter and amplify the regional elements of a tweet, it is unsure what to do. The result is a potpourri of words in the higher levels of probability that might have topical relevance, but are literally all over the map regionally (e.g., attempting to translate a tweet containing the word Packers will retrieve Jets, Giants, Bears, Packers, Steelers, and even Yankees and Warriors as candidates), and the model isn’t sure which of these to promote to the top. Because of this, the more foolhardy 1Bix1L 300k model often provides the most interesting results, even if its BLEU scores are not as high.
If the chaotic nature of gradient ascent could be tamed, the LSTM model shows the most promise as a means of algorithmically rewriting text to related topics, styles, or sentiment within a labelled corpus. Further research is necessary to determine a way to balance the differing losses between the encoder and decoder during training, and to boost the ability of the model to differentiate regional characteristics within a piece of text. Facebook AI’s technique of training the model to decode noisy versions of its input could be beneficial, as could trying it on a more easily differentiable corpus (e.g., food reviews). Attention could also be a better way of surfacing the elements that carry the highest categorical signal for change.
- Barzilay, Regina, and Lillian Lee. 2003. “Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment.” In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics. http://aclweb.org/anthology/N/N03/N03-1003.
- Cheng, Zhiyuan, James Caverlee, and Kyumin Lee. 2010. “You Are Where You Tweet: A Content-Based Approach to Geo-Locating Twitter Users.” In Proceedings of the 19th ACM International Conference on Information and Knowledge Management - CIKM ’10, 759. Toronto, ON, Canada: ACM Press. https://doi.org/10.1145/1871437.1871535.
- Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. “Word Translation Without Parallel Data.” ArXiv:1710.04087 [Cs], October. http://arxiv.org/abs/1710.04087.
- “DeepDream - a Code Example for Visualizing Neural Networks.” n.d. Google AI Blog (blog). Accessed December 6, 2018. http://ai.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html.
- Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” ArXiv:1312.6114 [Cs, Stat], December. http://arxiv.org/abs/1312.6114.
- Lample, Guillaume, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. “Unsupervised Machine Translation Using Monolingual Corpora Only.” ArXiv:1711.00043 [Cs], October. http://arxiv.org/abs/1711.00043.
- Rapp, Reinhard. 2009. “The Backtranslation Score: Automatic MT Evaluation at the Sentence Level without Reference Translations.” In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, 133–136. Suntec, Singapore: Association for Computational Linguistics. http://www.aclweb.org/anthology/P/P09/P09-2034.
- Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 3104–3112. Curran Associates, Inc. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
- Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” ArXiv:1609.08144 [Cs], September. http://arxiv.org/abs/1609.08144.
- Xu, Wei, Alan Ritter, Bill Dolan, Ralph Grishman, and Colin Cherry. 2012. “Paraphrasing for Style.” In Proceedings of COLING 2012, 2899–2914. Mumbai, India: The COLING 2012 Organizing Committee. http://www.aclweb.org/anthology/C12-1177.
- Eisenstein, Jacob, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. “A Latent Variable Model for Geographic Lexical Variation.” In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 1277–1287. MIT, Massachusetts, USA. http://aclweb.org/anthology/D10-1124.